# Data management with *yeoda* (v0.3.0)
[yeoda](https://pypi.org/project/yeoda/) (**y**our **e**arth **o**bservation **d**ata **a**ccess) is a geospatial data management library developed at the GEO Department at TU Wien for handling earth observation (EO) data. It reads data stored as files on disk (NetCDF, GeoTIFF) and makes it available in datacubes. Datacubes provide standard operations, e.g. filtering, sorting and selecting, and make the data easy to process with standard Python libraries such as *numpy* or *xarray*.

This notebook explains the general handling of *yeoda* datacubes.

## Setting up a datacube

A *yeoda* datacube augments existing data by integrating it into a datacube architecture similar to, for example, the [Open Data Cube](https://www.opendatacube.org/about). To set up such a datacube, *yeoda* works in concert with [geopathfinder](https://github.com/TUW-GEO/geopathfinder) (file naming), [veranda](https://github.com/TUW-GEO/veranda) (IO classes) and [pytileproj/Equi7Grid](https://github.com/TUW-GEO/Equi7Grid) (geo-referencing).

First, collect the files you want to put into your datacube. You can use [geopathfinder](https://github.com/TUW-GEO/geopathfinder) to conveniently gather files matching a certain file naming convention:

In [None]:
import os
from geopathfinder.folder_naming import build_smarttree

USER = os.getcwd().split('/')[2]
root_path = f'/home/{USER}/shared/datasets/fe/data/sentinel2/L2A'
folder_hierarchy = ["sub_grid", "tile_name", "var_name"]

# regular expressions are supported to register only files matching a certain pattern
# (here: file names that do not start with "Q" and end with ".tif")
tree = build_smarttree(root_path, folder_hierarchy, register_file_pattern="^[^Q].*.tif$")
filepaths = tree.file_register

print(f"{len(filepaths)} files registered:")
print("\n".join(filepaths[:4]))
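
To see what the file pattern selects, it can be tried on a few names with Python's `re` module. This is a minimal sketch that is independent of *geopathfinder*: the file names are made up, and matching against the bare file name with `re.match` is an assumption about how the pattern is applied. It also shows that the unescaped `.` in front of `tif` matches any character, so escaping it gives a stricter pattern.

```python
import re

# Same pattern as passed to `register_file_pattern` above.
pattern = re.compile(r"^[^Q].*.tif$")
# Stricter variant with an escaped dot (a suggestion, not taken from the notebook).
strict_pattern = re.compile(r"^[^Q].*\.tif$")

# Hypothetical file names, made up for this illustration.
candidates = [
    "B04_20190101.tif",    # does not start with Q, ends with .tif
    "QFLAG_20190101.tif",  # starts with Q
    "B04_20190101.nc",     # wrong extension
    "B04_20190101_tif",    # no real extension, but "." matches the "_"
]

matched = [name for name in candidates if pattern.match(name)]
strict_matched = [name for name in candidates if strict_pattern.match(name)]

print(matched)         # ['B04_20190101.tif', 'B04_20190101_tif']
print(strict_matched)  # ['B04_20190101.tif']
```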

You can use [pytileproj/Equi7Grid](https://github.com/TUW-GEO/Equi7Grid) to define the grid used by the datacube.

In [None]:
from equi7grid.equi7grid import Equi7Grid

subgrid = Equi7Grid(10).EU

The file registry and grid can now be used directly as input to *yeoda's* `EODataCube` constructor, to wrap a datacube structure around our files.

In [None]:
from geopathfinder.naming_conventions.acube_naming import ACubeFilename
from yeoda.datacube import EODataCube

dimensions = ["var_name", "dtime_1", "dtime_2", "tile_name"]
s2_cube = EODataCube(filepaths=filepaths, dimensions=dimensions, filename_class=ACubeFilename, grid=subgrid,
                     sdim_name="tile_name", tdim_name="dtime_1")
s2_cube.inventory

Now you're all set up and can perform operations on your freshly minted datacube. Internally, *yeoda* uses a [GeoPandas](https://geopandas.org) dataframe to store the filename and geometry information. On top of that, the datacube provides functions to filter, split, sort, align, etc. the data. Most of these functions accept an `inplace` keyword argument, as do many [GeoPandas](https://geopandas.org) functions. The next sections show some example usages of these functions.

This example showcases the most generic flavour of a datacube. There are also more specialized datacube classes, which are tailored to the products operated by the Remote Sensing research group of the GEO Department at TU Wien (TUWGEO). See the next section.

### Setting up product specific datacubes

To work with preprocessed data you can use the classes `SIG0DataCube` for sigma nought and `GMRDataCube` for radiometric terrain-flattened gamma nought data. On the value-added side, `SSMDataCube` gives access to the TUWGEO SSM data and `SCATSARSWIDataCube` to the SWI data.

In [None]:
from geopathfinder.naming_conventions.sgrt_naming import SgrtFilename
from yeoda.products.base import ProductDataCube

root_path = f"/home/{USER}/shared/datasets/fe/data/sentinel1/preprocessed/EU500M"
folder_hierarchy = ["tile_name", "var_name"]

tree = build_smarttree(root_path, folder_hierarchy, register_file_pattern="^[^Q].*.tif$")
dimensions = ["time", "var_name", "tile_name", "pol"]
scale_factor = 100 # with yeoda v0.3.0, the scale factor still needs to be defined by the user
sig0_cube = ProductDataCube(filepaths=tree.file_register, dimensions=dimensions, filename_class=SgrtFilename, 
                            grid=Equi7Grid(500).EU, scale_factor=scale_factor)
sig0_cube.inventory

Note that *yeoda* is not limited to GeoTIFF files; it also supports NetCDF files.

## Dimension operations

The following sections show how you can manipulate the dimensions of the datacube itself before doing any further operations based on them.

### Renaming dimensions

If you have to work with a pre-defined naming convention in *geopathfinder* (e.g. the *yeoda* naming convention) and if you do not agree with the naming of the filename parts/dimensions, you can still rename dimensions afterwards:

In [None]:
sig0_cube.rename_dimensions({'tile_name': 'tile'}, inplace=True)
sig0_cube.inventory

### Adding dimensions

You can add new filepath-dependent values (e.g. file size, cloud coverage, …) along a new dimension with a single call, passing one value per file. The example below adds a dimension named `"ones"`:

In [None]:
extended = sig0_cube.add_dimension("ones", [1] * len(sig0_cube))
extended.inventory

## Sorting

One of the most common operations is to sort the inventory according to some metadata, e.g. the timestamp:

In [None]:
sorted_descending = sig0_cube.sort_by_dimension('time', ascending=False)
sorted_descending.inventory

## Filtering

Once you have your datacube structure set up, you can also filter it before doing any processing, for instance if you want to run some runtime-intensive processing on only a small portion of the data. The following sections give a few examples of the available filtering methods. Again, most methods provide an `inplace` flag, similar to [GeoPandas](https://geopandas.org).

### Filter by geometry

You can filter by an arbitrary geometry or by a list of bounding box coordinates. The filtered cube will only contain files within the specified geometry.

In [None]:
from osgeo import osr  # a bare "import osr" no longer works with recent GDAL versions

sref = osr.SpatialReference()
sref.ImportFromEPSG(4326)  # LonLat spatial reference system

bbox_inside = [(12.628, 46.385), (15.768, 48.431)]  # [(x_min, y_min), (x_max, y_max)]
filtered_by_bbox = sig0_cube.filter_spatially_by_geom(bbox_inside, sref=sref)
print(f"Number of filtered files with a bbox located inside the data tiles: {len(filtered_by_bbox)}")

bbox_outside = [(4.404, 44.443), (8.826, 47.811)]
filtered_by_bbox = sig0_cube.filter_spatially_by_geom(bbox_outside, sref=sref)
print(f"Number of filtered files with a bbox located outside the data tiles: {len(filtered_by_bbox)}")
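
The idea behind the bounding box filter can be sketched in plain Python. The following is an illustration only, not *yeoda*'s implementation: it assumes that a file is kept when its tile footprint overlaps the query box, that both boxes share one coordinate system, and it uses a hypothetical tile extent.

```python
def bboxes_intersect(a, b):
    """Return True if two boxes given as [(x_min, y_min), (x_max, y_max)] overlap."""
    (a_xmin, a_ymin), (a_xmax, a_ymax) = a
    (b_xmin, b_ymin), (b_xmax, b_ymax) = b
    return (a_xmin <= b_xmax and b_xmin <= a_xmax and
            a_ymin <= b_ymax and b_ymin <= a_ymax)


# Hypothetical tile footprint in lon/lat, chosen only for this illustration.
tile_extent = [(11.0, 45.0), (17.0, 49.0)]

bbox_inside = [(12.628, 46.385), (15.768, 48.431)]
bbox_outside = [(4.404, 44.443), (8.826, 47.811)]

keeps_inside = bboxes_intersect(tile_extent, bbox_inside)
keeps_outside = bboxes_intersect(tile_extent, bbox_outside)

print(keeps_inside)   # True: the query box overlaps the tile
print(keeps_outside)  # False: the query box lies west of the tile
```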

### Filter by dimension

A very important function is `filter_by_dimension`, which accepts a list of values and a list of expressions to filter the data along a dimension. The `expressions` list has the same length as the values list and contains comparison operators, e.g. `"=="`, `"<="`, `">="`, `"<"`, `">"` (`"=="` is the default). Some examples are:

In [None]:
# only consider VV polarisation
only_vv = sig0_cube.filter_by_dimension(['VV'], name="pol")
only_vv.inventory

In [None]:
from datetime import datetime

# only consider data between 2019-02-01 and 2019-03-01
time_span = [(datetime(2019, 2, 1), datetime(2019, 3, 1))]
time_span_only = sig0_cube.filter_by_dimension(time_span, [('>=', '<')], name='time')
time_span_only.inventory
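
The pairing of values and expressions can be pictured with a small plain-Python sketch. It is not *yeoda*'s implementation; the helper `matches` and the list of timestamps are hypothetical and only mirror the calling convention used above.

```python
import operator
from datetime import datetime

# Map the comparison strings to Python's comparison functions.
OPERATORS = {
    "==": operator.eq,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}


def matches(entry, value, expression="=="):
    """Check one dimension entry against a value or a tuple of values."""
    if isinstance(value, tuple):
        # A tuple of values is paired element-wise with a tuple of operators,
        # and all comparisons have to hold.
        return all(OPERATORS[op](entry, val) for val, op in zip(value, expression))
    return OPERATORS[expression](entry, value)


# Made-up timestamps standing in for the "time" dimension.
timestamps = [datetime(2019, 1, 15), datetime(2019, 2, 1),
              datetime(2019, 2, 20), datetime(2019, 3, 1)]
time_span = (datetime(2019, 2, 1), datetime(2019, 3, 1))

# Lower bound inclusive, upper bound exclusive.
selected = [t for t in timestamps if matches(t, time_span, (">=", "<"))]
print(selected)
```

With `('>=', '<')` the start of the range is included and the end is excluded, so an entry stamped exactly 2019-03-01 is dropped.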

### Filter by file pattern

You can also directly filter on the filename using a regex pattern:

In [None]:
filtered_by_pattern = sig0_cube.filter_files_with_pattern(".*_066_.*")
filtered_by_pattern.inventory

## Splitting

You can use split operations to segregate your datacube into chunks, which can then be used for processing. For instance, you could split data into months and calculate monthly means.

### Split by dimension

Split a datacube based on dimension values. The splitting conditions are expressed the same way as in `filter_by_dimension`.

In [None]:
values = ['VV', 'VH']
vv_cube, vh_cube = sig0_cube.split_by_dimension(values, name="pol")
print(f"Parent datacube of length {len(sig0_cube)}, split into two datacubes of length {len(vv_cube)} and {len(vh_cube)}.")

### Split monthly
If you want to analyse your data month by month, you can split the original datacube into smaller monthly datacubes (provided the data covers more than one month):

In [None]:
months = sig0_cube.split_monthly()
print(f"Parent datacube has been split into {len(months)} monthly datacubes.")

Note that *yeoda* also provides convenience functions for yearly splits.
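
Conceptually, a monthly split groups the entries by the year and month of their timestamp. A minimal sketch with the standard library, using made-up timestamps and plain lists instead of datacubes:

```python
from collections import defaultdict
from datetime import datetime

# Made-up acquisition times, standing in for the "time" dimension.
timestamps = [
    datetime(2015, 2, 1, 4, 47, 30),
    datetime(2015, 1, 4, 5, 17, 16),
    datetime(2015, 3, 10, 5, 0, 0),
    datetime(2015, 1, 28, 16, 50, 2),
]

# Group by (year, month) so that equal months of different years stay apart.
groups = defaultdict(list)
for ts in sorted(timestamps):
    groups[(ts.year, ts.month)].append(ts)

# One list per month, in chronological order.
monthly = [groups[key] for key in sorted(groups)]

print(f"Split into {len(monthly)} monthly groups.")
print([len(group) for group in monthly])
```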

## Joining

If you have multiple datacubes, or have split them up to perform some processing, you can concatenate them using join operations. The following section will look closer at a few of them.

### Intersection

You can use this operation to get only those fields of multiple datacubes with matching dimensions or with a specific matching dimension:

In [None]:
only_jan_remains = sig0_cube.intersect(months[0], on_dimension='time')
only_jan_remains.inventory

### Union
If you have two data cubes and you want to unite their information, you can simply do:

In [None]:
jan_and_feb = months[0].unite(months[1])
jan_and_feb.inventory

### Alignment

The `align_dimension` method aligns a datacube with respect to a second datacube along a dimension (`name`). In other words, the order and the length of the dimension will then be the same. This also means that datacube entries are duplicated if they appear more often in the second datacube.

In [None]:
# create a small test datacube
small = sig0_cube.filter_by_dimension(datetime(2015, 2, 1, 4, 47, 30), name='time')
print(f"Small datacube of length {len(small)}")
# align the 'pol' dimension with the large datacube
aligned_with_duplicates = small.align_dimension(sig0_cube, 'pol')
aligned_with_duplicates.inventory
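
The duplication behaviour can be illustrated with plain lists. This sketch is not *yeoda* code: inventories are reduced to hypothetical `(pol, filename)` pairs, and it assumes that every value of the reference dimension occurs exactly once in the small cube.

```python
# Toy stand-in for the small datacube: one (pol, filename) pair per entry.
# The file names are hypothetical.
small = [("VV", "small_vv.tif"), ("VH", "small_vh.tif")]

# The 'pol' dimension of the larger reference datacube.
reference_pols = ["VV", "VH", "VV", "VV", "VH"]

# Look up the matching small-cube entry for every reference value.
lookup = {pol: filename for pol, filename in small}
aligned = [(pol, lookup[pol]) for pol in reference_pols]

print(len(small), "->", len(aligned))
for entry in aligned:
    print(entry)
```

After the alignment the toy inventory has the same length and order along `pol` as the reference, and `small_vv.tif` appears three times.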

## Loading Data

This section demonstrates how to load data. All loading functions share a common set of keyword arguments, the most important of which are discussed here:

 - `band`: This argument specifies the band name as a string.
 - `dtype`: There are many types of *Python* data structures to store array-like data, and their selection mainly depends on what you want to do with the loaded data later on. These are offered by *yeoda*:
   - `xarray.Dataset` (`"xarray"`)
   - `numpy.ndarray` (`"numpy"`)
   - `pandas.DataFrame` (`"dataframe"`)
 - `origin`: Depending on the chosen return data type, this parameter defines the origin of the pixel coordinates in the world system. The origin can be one of the following:
   - upper right (`"ur"`, default)
   - upper left (`"ul"`)
   - lower right (`"lr"`)
   - lower left (`"ll"`)
   - center (`"c"`)
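
How the `origin` keyword shifts the returned coordinates can be sketched for a north-up grid with square pixels. The function `pixel_to_world`, the offsets table and the grid values below are hypothetical and serve only as an illustration; they are not taken from *yeoda*.

```python
# Offsets (in pixels) from the upper-left corner of a pixel,
# given as (x offset, y offset) for each origin keyword.
ORIGIN_OFFSETS = {
    "ul": (0.0, 0.0),
    "ur": (1.0, 0.0),
    "lr": (1.0, 1.0),
    "ll": (0.0, 1.0),
    "c": (0.5, 0.5),
}


def pixel_to_world(row, col, x_ul, y_ul, pixel_size, origin="ur"):
    """World coordinates of a pixel in a north-up grid with square pixels."""
    dx, dy = ORIGIN_OFFSETS[origin]
    x = x_ul + (col + dx) * pixel_size  # x grows with the column index
    y = y_ul - (row + dy) * pixel_size  # y shrinks with the row index
    return x, y


# Hypothetical grid: upper-left corner at (4800000, 1800000), 500 m pixels.
for origin in ORIGIN_OFFSETS:
    print(origin, pixel_to_world(0, 0, 4800000.0, 1800000.0, 500.0, origin=origin))
```

For the first pixel of this grid, `"ul"` returns the grid corner itself, while `"c"` is shifted by half a pixel in both directions.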

### Load by geometry

You can load data for a region defined by an arbitrary geometry, similar to how you can filter by a geometry. Geometries do not need to be axis-parallel, but the data is always loaded for the axis-aligned bounding box spanning the geometry, so that it fits into an array data structure. You can mask any data outside the specified geometry by setting the `apply_mask` parameter to `True`.

In [None]:
import warnings

import matplotlib.pyplot as plt
import numpy as np

# polygon vertices as (lon, lat) pairs
polygon = [(10.6419, 46.7977), (10.4689, 47.2261), (11.3516, 47.3510),
           (11.3689, 46.9161), (10.9426, 46.9979)]

# keep only the files of the first month with VV polarisation that cover the polygon
months = sig0_cube.split_monthly()
jan_vv = months[0].filter_spatially_by_geom(polygon, sref=sref)\
                  .filter_by_dimension(['VV'], name="pol")

# suppress warnings raised while loading
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    masked_array = jan_vv.load_by_geom(polygon, sref=sref, apply_mask=True, dtype="numpy")

# average over time in the linear domain, then convert back to dB
plt.figure(figsize=(15, 5))
plt.imshow(10 * np.log10(np.nanmean(10**(masked_array/10), axis=0)), cmap=plt.cm.Greys_r)
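
The expression passed to `imshow` averages the time stack in the linear domain rather than directly in dB. A small standard-library sketch with made-up values shows the difference, assuming the loaded values are in dB:

```python
import math


def mean_db(values_db):
    """Average dB values in the linear domain, skipping NaN entries."""
    linear = [10 ** (v / 10) for v in values_db if not math.isnan(v)]
    return 10 * math.log10(sum(linear) / len(linear))


# Made-up backscatter samples in dB, including one missing value.
samples_db = [-10.0, -20.0, float("nan")]

valid = [v for v in samples_db if not math.isnan(v)]
naive_mean_db = sum(valid) / len(valid)  # arithmetic mean of the dB values
linear_mean_db = mean_db(samples_db)     # mean of the linear values, back in dB

print(f"mean of dB values:     {naive_mean_db:.2f} dB")   # -15.00 dB
print(f"mean in linear domain: {linear_mean_db:.2f} dB")  # -12.60 dB
```

The linear-domain mean is dominated by the stronger return, which is why it lies above the plain average of the dB values.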

### Load by coordinates

`load_by_coords` accepts a list of X and a list of Y (world system) coordinates as input. If the spatial reference of the coordinates differs from that of the data, you need to specify it with the keyword argument `sref`.

In [None]:
from osgeo import ogr  # a bare "import ogr" no longer works with recent GDAL versions

# defining a point to sample a time series from
point = ogr.Geometry(ogr.wkbPoint)
point.AddPoint(16.210,47.242)
point.AssignSpatialReference(sref)


# load data by coordinates
point_data = sig0_cube.filter_by_dimension(['VV'], name='pol')
point_data.filter_spatially_by_geom(point, sref=sref, inplace=True)
time_series = point_data.load_by_coords(point.GetX(), point.GetY(), band=1, sref=sref, dtype='numpy')

# prepare data for graph
x_vals = point_data["time"].values
y_vals = time_series.flatten()
mask = np.isfinite(y_vals) # only use valid values

# create a nice plot
plt.figure(figsize=(15, 5))
plt.title('Backscatter timeseries')
plt.plot(x_vals[mask], y_vals[mask], alpha=0.5)
plt.scatter(x_vals[mask], y_vals[mask], s=15, color="red", label="measurements")
plt.xlabel('Date')
plt.ylabel('dB')
plt.grid()
plt.legend()
plt.show()

By setting the `dtype` parameter of the loading function to `"numpy"`, we request a plain *NumPy* array instead of the default *xarray*.

### Load by pixels

`load_by_pixels` expects pixel coordinates given by a list of row and column indexes. The keyword arguments `row_size` and `col_size` allow you to define a window, where the specified ranges count from left to right (columns) and from top to bottom (rows) starting at the given row and column coordinates.

In [None]:
# filter the datacube down to a single acquisition
single_day = sig0_cube.filter_spatially_by_geom(point, sref=sref)\
                      .filter_by_dimension(datetime(2015, 1, 4, 5, 17, 16), name='time')

pixels = single_day.load_by_pixels(0, 0, row_size=1200, col_size=1200, dtype="numpy")

# plot the data
plt.figure(figsize=(20, 20))
plt.title('Backscatter observed on 4.1.2015')
img_h = plt.imshow(pixels[0, ...], cmap=plt.get_cmap("Greys_r"))
cb = plt.colorbar(img_h, shrink=0.6)
cb.set_label("Sigma nought backscatter [dB]")
plt.show()
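
The window convention can be illustrated on a nested list. `read_window` is a hypothetical helper, not part of *yeoda*; it only mimics the meaning of `row_size` and `col_size` described above.

```python
# A small 4 x 5 "image" whose values encode their position as row * 10 + col.
image = [[row * 10 + col for col in range(5)] for row in range(4)]


def read_window(data, row, col, row_size=1, col_size=1):
    """Return the block that starts at (row, col) and extends down and to the right."""
    return [line[col:col + col_size] for line in data[row:row + row_size]]


# Start at row 1, column 2 and read 2 rows and 3 columns.
window = read_window(image, 1, 2, row_size=2, col_size=3)
print(window)  # [[12, 13, 14], [22, 23, 24]]
```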