# GEEO Tutorial 0 - Introduction

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/leonsnill/geeo/blob/master/docs/tutorial_0_introducing-geeo.ipynb)

This introductory tutorial provides an overview of the basic structure and usage of **geeo**.

**geeo** is a processing pipeline and collection of algorithms for obtaining Analysis-Ready-Data (ARD) and higher-level image products from multispectral image archives, including the Landsat and Sentinel-2 archives, using the Google Earth Engine Python API.
The processing modules are organized along different hierarchical levels:

- LEVEL-2 (`geeo/level2`): Preprocessed, harmonized, spatio-temporally subsetted surface reflectance time series stacks (TSS) and mosaics (TSM)   
- LEVEL-3 (`geeo/level3`): Advanced spectral features including Spectral-Temporal-Metrics (STM) and Pixel-Based Composites (PBC), as well as time series interpolation (TSI)
- LEVEL-4 (`geeo/level4`): Semantic features, i.e. information derived from spectral signals. Land Surface Phenology (LSP) metrics are currently implemented in the workflow. An auxiliary machine learning module allows uploading compatible local scikit-learn models to an ee.Classifier.
- EXPORT (`geeo/export`): Export module handling metadata and projection settings, and constructing export tasks.

Typically, the user does not interact with the submodules directly, but provides processing instructions via either a text file (.yml) or a Python dictionary.

The typical, though not the only, way to use **geeo** is to:

1. Create a .yml parameter file from an existing blueprint using the `create_parameter_file()` function
2. Adjust the parameter settings in the .yml file to your requirements
3. Run the adjusted parameter file using `run_param()` to trigger the processing
4. Either export the data to Drive/Asset (if specified in parameter file) or use the resulting ee objects in your interactive python session

<br>

<div>
<img src="https://raw.githubusercontent.com/leonsnill/geeo/master/geeo/data/fig/geeo_workflow_manuscript.svg" width="50%" style="display:block; margin: 0 auto;"/>
</div>

## Import (and installation)
Prior to importing **geeo**, import the Earth Engine module. Authenticate **ee** using your GEE-eligible Google account and initialize your Google Cloud project with the Earth Engine API enabled. For details on how to access Google Earth Engine, [click here](https://developers.google.com/earth-engine/guides/access).

**Make sure to set your Google Cloud project name to initialize Earth Engine!**

In [None]:
my_project_name = ''  # replace with your Google Cloud project name

# imports
import ee
ee.Authenticate()
ee.Initialize(project=my_project_name)
try:
    import geeo
except ImportError:
    # geeo is not installed yet: install it from GitHub (notebook shell command)
    !pip install --quiet git+https://github.com/leonsnill/geeo.git
    import geeo

## The parameter file
**The parameter file is a .yml file that contains all available settings for level-2 to level-4 core processing routines as well as for exporting.** 

### Creating a parameter file
A new parameter file can be created using the `create_parameter_file` function:

In [None]:
# create new parameter .yml file from blueprint into current working directory (or specified path)
geeo.create_parameter_file('introduction', overwrite=True)

Inspect the newly created parameter file. With the above string, it is created in your current working directory; if you are on Colab, check the file tab on the left to find it. **The sections of the parameter file follow the structure of the main modules (LEVEL-2, LEVEL-3, LEVEL-4, EXPORT) and are further subdivided into different categories.** The LEVEL-2 section, for example, starts with basic settings regarding the spatial and temporal extent (SPACE AND TIME), the desired sensors and associated quality mask settings (SENSOR AND DATA QUALITY SETTINGS), as well as which features/bands to include or calculate (BANDS | INDICES | FEATURES).

For now, let us only take a look at the *SPACE AND TIME* subcategory of the LEVEL-2 instruction block, which contains the following variables and default values:

YEAR_MIN: 2023  
YEAR_MAX: 2023  
MONTH_MIN: 1  
MONTH_MAX: 12  
DOY_MIN: 1  
DOY_MAX: 366  
DATE_MIN: null  
DATE_MAX: null  
ROI: [12.9, 52.2, 13.9, 52.7]  
ROI_SIMPLIFY_GEOM_TO_BBOX: true 

The *SPACE AND TIME* block specifies the spatial extent and temporal window for which to process the data. Edited values must match the expected format of each variable. For example, YEAR_MIN expects an integer specifying the starting year. Most variable names and values are self-explanatory (e.g. the integer format for the YEAR, MONTH, and DOY variables), but some variables accept multiple formats (e.g. the `ROI` variable accepts a *list* of numeric coordinates, a *string* path to a file, a *GeoPackage* object, an *ee.Geometry*, or an *ee.FeatureCollection*).

In general, the comments in the parameter file provide a brief explanation and show the allowed formats and settings. You can find a more detailed description of all parameter settings in the [documentation](documentation.md).
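
To make the idea of "matching the expected format" concrete, the following is a minimal sketch of a sanity check for the *SPACE AND TIME* values listed above. The function name `check_space_time` is hypothetical and not part of geeo, and the sketch assumes that a four-number `ROI` list is ordered as `[xmin, ymin, xmax, ymax]`, which is inferred from the Berlin example rather than stated by the library. Other `ROI` formats are simply skipped.

```python
def check_space_time(prm):
    """Return a list of problems found in a SPACE AND TIME dictionary.

    Hypothetical helper for illustration only; geeo performs its own checks.
    """
    problems = []
    # valid integer ranges, taken from the default values shown above
    limits = {'MONTH': (1, 12), 'DOY': (1, 366)}
    for name, (low, high) in limits.items():
        for key in (f'{name}_MIN', f'{name}_MAX'):
            value = prm[key]
            if not isinstance(value, int) or not low <= value <= high:
                problems.append(f'{key}={value!r} must be an integer in [{low}, {high}]')
    if prm['YEAR_MIN'] > prm['YEAR_MAX']:
        problems.append('YEAR_MIN must not exceed YEAR_MAX')
    roi = prm['ROI']
    # only the four-number bounding box form is checked here
    # (assumed order: [xmin, ymin, xmax, ymax])
    if isinstance(roi, list) and len(roi) == 4:
        if roi[0] >= roi[2] or roi[1] >= roi[3]:
            problems.append('ROI minimum coordinates must be smaller than the maximum')
    return problems


defaults = {
    'YEAR_MIN': 2023, 'YEAR_MAX': 2023,
    'MONTH_MIN': 1, 'MONTH_MAX': 12,
    'DOY_MIN': 1, 'DOY_MAX': 366,
    'ROI': [12.9, 52.2, 13.9, 52.7],
}
print(check_space_time(defaults))
```

Running such a check before submitting a request can catch simple typos (for example a swapped bounding box) locally, before any Earth Engine computation is triggered.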

### Loading parameters (without directly running the instructions)
If we only want to load parameter settings from an existing .yml file into Python, we can use the `load_parameters` function, which converts the .yml file into a Python dictionary containing all defined variables:

In [None]:
# load the newly created parameter file into a python dictionary
prm_dict = geeo.load_parameters('introduction.yml')
prm_dict

As you can see, printing the resulting dictionary shows all parameters (keys) and associated values found in the .yml file.

## The parameter dictionary

Under the hood, geeo loaded the .yml file and converted it into a Python dictionary. **The dictionary is the central data structure used to store input and output variables in geeo**. All core processing routines (from level-2 to export) rely on the dictionary structure and also return a dictionary if run individually (`run_level2()` -> `run_level3()` -> `run_level4()` -> `run_export()`).

As such, **geeo also accepts processing instructions given directly as a Python dictionary** (instead of an explicit .yml file). Using a dictionary as direct input can be seen as an 'interactive mode' that makes it easy to include geeo in an existing or extended Earth Engine workflow. The keys of the dictionary must have the same names as the variables in the parameter file. Keys not defined by the user simply receive the default values from the blueprint. Knowing the correct names therefore requires some familiarity with the settings, gained by inspecting the parameter file and/or the documentation.

As mentioned, direct interaction via a dictionary is not required to use geeo, but the added flexibility can prove very useful. For example, in a recurring processing chain where the only changing variable is the ROI, we can simply loop over the different geometries, add each one to the dictionary in turn, and run the processing.
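
The ROI loop described above could look roughly like the following sketch. The region names, the second bounding box, and the helper `build_param_dicts` are hypothetical and only serve as illustration. The actual `geeo.run_param` call is left as a comment so that the snippet also runs without an initialized Earth Engine session.

```python
import copy

# settings shared by all runs
base_prm = {'YEAR_MIN': 2020, 'YEAR_MAX': 2022}

# hypothetical regions of interest given as bounding box lists
rois = {
    'berlin': [12.9, 52.2, 13.9, 52.7],
    'second_area': [11.3, 48.0, 11.8, 48.3],
}


def build_param_dicts(base, rois):
    """Return one parameter dictionary per ROI without modifying `base`."""
    out = {}
    for name, roi in rois.items():
        prm = copy.deepcopy(base)  # independent copy for every region
        prm['ROI'] = roi
        out[name] = prm
    return out


prm_per_roi = build_param_dicts(base_prm, rois)

# With geeo imported and Earth Engine initialized, each dictionary could
# then be processed in turn, for example:
# results = {name: geeo.run_param(prm) for name, prm in prm_per_roi.items()}
```

Copying the base dictionary for every region keeps the runs independent, so results added to one dictionary do not leak into the next.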

---

## Running a parameter file

Once we have adjusted the settings to our needs, **all we need to do to execute the instructions is to call the `run_param()` function on the .yml file or dictionary.**

In [None]:
prm_run = geeo.run_param('introduction.yml')
prm_run

`run_param` is a wrapper function that executes the chain `run_level2()` -> `run_level3()` -> `run_level4()` -> `run_export()`, where the output of each module is fed into the subsequent one. Each module expects the dictionary as input and returns it with added variables (e.g. processed ee.ImageCollections). If we only want a subset of the processing, we could also run the specific module (plus its preceding modules). By default, all level-3 and level-4 processing is disabled.
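
The "dictionary in, dictionary out" pattern can be sketched with plain Python stand-ins. The `stub_*` functions and `run_chain` below are hypothetical placeholders and do not reproduce what the geeo modules compute; they only show how each step receives the dictionary and returns it with additional keys.

```python
from functools import reduce


def stub_level2(prm):
    # stand-in: a real level-2 step would add an ee.ImageCollection here
    return {**prm, 'TSS': 'placeholder for the time series stack'}


def stub_level3(prm):
    # stand-in for level-3 processing; only flags that the step ran
    return {**prm, 'LEVEL3_DONE': True}


def stub_export(prm):
    # stand-in for the export step
    return {**prm, 'EXPORT_DONE': True}


def run_chain(prm, steps):
    """Feed the output dictionary of each step into the next one."""
    return reduce(lambda state, step: step(state), steps, prm)


result = run_chain({'YEAR_MIN': 2020}, [stub_level2, stub_level3, stub_export])
print(sorted(result))
```

Because every step returns the full dictionary, running only the first part of the chain still yields a usable intermediate result.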

To run the default settings with only a few adjustments, we can simply create a dictionary ...

In [None]:
# create dictionary whose key names match the variable names in the parameter file
prm_dict = {
    'YEAR_MIN': 2020,
    'YEAR_MAX': 2022
}

... and only adjust the global year range by setting YEAR_MIN and YEAR_MAX, and feed the dict into the `run_param` function:

In [None]:
run_prm = geeo.run_param(prm_dict)
run_prm

As mentioned above, all non-specified parameters (keys in dict) will be set to the default values of the [parameter blueprint file](../geeo/config/parameter_blueprint.yml) used when calling `create_parameter_file()`. 
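
Conceptually, this behaves like merging the user dictionary over the blueprint defaults. The following is a minimal sketch of that idea, not geeo's actual implementation. `fill_defaults` is a hypothetical helper, `blueprint_defaults` contains only a few of the values listed earlier, and rejecting unknown keys is a design choice of this sketch (how geeo treats misspelled keys is not covered here).

```python
# a few blueprint defaults from the SPACE AND TIME block shown earlier
blueprint_defaults = {
    'YEAR_MIN': 2023,
    'YEAR_MAX': 2023,
    'MONTH_MIN': 1,
    'MONTH_MAX': 12,
    'ROI': [12.9, 52.2, 13.9, 52.7],
}


def fill_defaults(user_prm, defaults):
    """Return the defaults overridden by the user-supplied values."""
    unknown = set(user_prm) - set(defaults)
    if unknown:
        # this sketch treats misspelled keys as an error
        raise KeyError(f'unknown parameter(s): {sorted(unknown)}')
    # later entries win, so user values replace the defaults
    return {**defaults, **user_prm}


merged = fill_defaults({'YEAR_MIN': 2020, 'YEAR_MAX': 2022}, blueprint_defaults)
print(merged['YEAR_MIN'], merged['YEAR_MAX'], merged['MONTH_MAX'])
```

This also illustrates why key names must match the parameter file exactly: a misspelled key would not override the intended default.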

### Inspecting the output
Let us now inspect the processing output in more detail.

By default, only level-2 processing is enabled. As such, the only sections that affect our current output are:

- SPACE AND TIME
- TIME SERIES STACK (TSS) / SENSOR AND DATA QUALITY SETTINGS
- BANDS | INDICES | FEATURES

Inspect these sections in your `introduction.yml` file to understand the current settings. You can also take a look at the [documentation](documentation.md) for a more detailed description of each parameter.

In summary, we are requesting all available Landsat-4, -5, -7, -8, and -9 scenes from 2020-2022 for the bounding box [12.9, 52.2, 13.9, 52.7] (i.e., Berlin). We restrict the valid scenes to a maximum cloud cover of 75% and mask (dilated) clouds, cloud shadows, snow/ice, and fill values with medium cloud detection confidence (conservative masking). The masks are not further eroded or dilated. The following bands/features are requested: blue (BLU), green (GRN), red (RED), near-infrared (NIR), shortwave-infrared 1 (SW1), shortwave-infrared 2 (SW2), as well as the Normalized Difference Vegetation Index (NDVI). No user-defined functions are applied to retrieve additional features, and no unmixing is conducted. No custom ee.ImageCollection is provided, and the TSS is not converted into a Time Series Mosaic (TSM), i.e. an ee.ImageCollection in which ee.Images of the same date are mosaicked to remove duplicate observations (mostly resulting from the product tiling schemes of NASA and ESA). None of the level-3 and level-4 products are calculated, and no export is requested.

Our first variable of interest for now is the Time-Series-Stack `TSS` variable, the underlying ee.ImageCollection for subsequent level-3 and level-4 processing (unless a custom collection is provided).
We can retrieve the TSS ee.ImageCollection from the dictionary now as follows:

In [None]:
TSS = run_prm.get('TSS')
TSS

Note: With the Python Earth Engine API, we can get an interactive rendering of ee objects, similar to the web-based JavaScript version, by using the [eerepr](https://github.com/aazuspan/eerepr) Python package. For the rendered version of this tutorial notebook, we commented this part out. Take a look for yourself, though: this is a highly convenient functionality for interactive sessions like this one.

In [None]:
import eerepr
eerepr.initialize()
TSS  # This will print the ee.ImageCollection as interactive object where you can inspect the images and their metadata

In [None]:
print(TSS.size().getInfo())

As you can see, our TSS variable is an `ee.ImageCollection` containing 424 `ee.Image` objects that satisfied our filter criteria above. Each image contains the eight specified bands plus the mask as a separate band (internally required for some higher-level processing later on).

In essence, geeo always returns `ee.Image` or `ee.ImageCollection` objects for the main processing products.

## Data product overview

| **Module**      | **Product / Method**                                                                 | **Description**                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|-----------------|-------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Level-2         | TSS: Time Series Stack                                                              | The (pre-)processed image collection that has been spatially and temporally filtered, has undergone quality masking, and contains the specified bands and features in a standardized naming format. With the exception of specifying a CIC, the TSS is the starting point for any subsequently derived image product (NVO, TSI, STM, PBC, LSP). Data availability, cloud cover, and spatial and temporal filtering all determine the (ir)regularity of the gappy time series. Currently, the Landsat legacy and Sentinel-2, as well as their harmonized counterparts are available.                 |
|                 | CIC: Custom Image Collection                                                        | The user can specify a custom collection to be used for subsequent level-3 and level-4 analysis. Accordingly, level-2 processing routines applied to the TSS are not compatible with a CIC (e.g. cloud masking) and must be applied a priori by the user if necessary. As such, CIC primarily makes sense in an 'interactive mode' where the processed ee.ImageCollection can be fed into a parameter dictionary.                                                                                                                     |
|                 | TSM: Time Series Mosaic                                                             | Satellite data products such as Landsat or Sentinel-2 data are stored in a specific tiling scheme that overlap, i.e. the same measurement appears in two distinct images of the same date. TSM creates a harmonized spatial mosaic of imagery recorded on the same day. This is suitable for exporting the time series and can also impact subsequent statistical calculations when data density varies seasonally and some pixel statistics are derived from double counts of the same measurement. It is important to note that despite the conceptual advantage of the Time Series Mosaic over the Time Series Stack, actual differences are often small or negligible depending on data availability, while processing requirements for converting the Time Series Stack to the Time Series Mosaic can be substantial over large areas and long time periods.                     |
| Level-3         | NVO: Number of Valid Observations                                                   | Calculates the number of valid (unmasked) observations per pixel for a given temporal window (either the global time window or subwindows, i.e. folds). As such, it serves as auxiliary information to assess the reliability and robustness of derived data products. For example, certain STMs may not be useful or robust if they are derived from only a few observations. |
|                 | TSI: Time Series Interpolation                                                      | TSI uses interpolation methods to create a (theoretically) gap-free, equidistant time series from the TSS, TSM, or CIC. The user specifies a desired interval (e.g. 16 days) and interpolation method with associated parameters. Many applications require spatio-temporal continuity (gap-free data) and/or equidistance. For example, many classifiers cannot handle nodata values, resulting in no prediction where data are missing, and STM and LSP calculations can be strongly biased by irregular, gappy input time series. |
|                 | STM: Spectral Temporal Metrics                                                      | STMs are a commonly applied form of dimensionality reduction in remote sensing, in which pixel-wise statistics across individual bands/features are calculated for a given set of imagery over time. STMs are fairly robust and spatially continuous features required for gap-free image analyses such as habitat or land cover mapping. |
|                 | PBC: Pixel-Based Compositing                                                        | PBC creates cloud-free, radiometrically and phenologically consistent image composites that are contiguous over large areas. Different parameters can serve as criteria to rank pixels according to their suitability and mosaic the data accordingly. For example, a common PBC is the maximum NDVI composite, which creates an image mosaic of the spectral bands from the point in time of peak photosynthetic activity, as proxied by the NDVI. |
| Level-4         | LSP: Land Surface Phenology                                                         | LSP metrics are designed to capture seasonal events in the life cycle of vegetation. Satellites capture the entire land surface and pick up the spatio-temporal patterns of these rhythmic events. Commonly, a vegetation index such as the NDVI is used to identify phenological stages such as the start-of-season or peak-season by associating the development of the spectral curve over time with these stages, often using 'simple' thresholding. LSP metrics are known to vary tremendously with the method chosen and data input, and their universal robustness across biomes and ecological interpretability is limited. However, when used with care and ideally related to field observations, LSP metrics provide semantic features useful for studying phenology across scales. |
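
To illustrate what NVO and STM mean for a single pixel, here is a small pure-Python sketch. The NDVI values are made up, `None` stands for a masked observation, and the statistics are computed with the standard library rather than with Earth Engine reducers, so this only mirrors the concept and not geeo's implementation.

```python
import statistics

# hypothetical NDVI time series of one pixel; None marks a masked observation
ndvi_series = [0.21, None, 0.35, 0.62, None, 0.71, 0.58, 0.30]

# keep only valid (unmasked) observations
valid = [value for value in ndvi_series if value is not None]

# NVO: number of valid observations in the time window
nvo = len(valid)

# STM: statistics across time that reduce the series to a few values
stm = {
    'min': min(valid),
    'p50': statistics.median(valid),
    'max': max(valid),
    'stdDev': statistics.pstdev(valid),  # population standard deviation
}

print(nvo, stm)
```

With only a handful of valid observations, metrics such as the standard deviation become unstable, which is why the NVO layer is a useful companion to any STM.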


In the following, we provide an example of each of the above products in geeo.

In [None]:
prm_all_products = {
    # space and time
    "YEAR_MIN": 2023,
    "YEAR_MAX": 2024,
    "ROI": [12.9, 52.2, 13.9, 52.7],  # Berlin
    # sensors
    "SENSORS": ['S2'],
    "FEATURES": ['NDVI', 'NBR'],
    # products
    "TSM": True,
    "NVO": True,
    "TSI": '1RBF',
    "STM": ['p10', 'p50', 'p90', 'stdDev'],
    "PBC": 'NDVI',
    "LSP": 'POLAR'
}

run_all = geeo.run_param(prm_all_products)
run_all

#### Time Series Stack (TSS)

In [None]:
tss = run_all.get('TSS')
tss

#### Time Series Mosaic (TSM)

In [None]:
tsm = run_all.get('TSM')
tsm

#### Number of Valid Observations (NVO)

In [None]:
nvo = run_all.get('NVO')
nvo

#### Time Series Interpolation (TSI)

In [None]:
tsi = run_all.get('TSI')
tsi
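
As a conceptual aside, the step from an irregular, gappy series to an equidistant one can be sketched for a single pixel in plain Python. Linear interpolation is swapped in here for simplicity; it is not the RBF method selected with `'1RBF'` above, and the function name, day values, and NDVI values are hypothetical.

```python
def interpolate_equidistant(days, values, step):
    """Linearly interpolate observations onto a regular grid of days.

    Masked observations (None) are skipped; if a day occurs more than
    once, the last value is kept.
    """
    obs = sorted({d: v for d, v in zip(days, values) if v is not None}.items())
    if len(obs) < 2:
        raise ValueError('at least two valid observations are required')
    result = []
    i = 0
    for day in range(obs[0][0], obs[-1][0] + 1, step):
        # move to the pair of observations that brackets this grid day
        while obs[i + 1][0] < day:
            i += 1
        (d0, v0), (d1, v1) = obs[i], obs[i + 1]
        weight = (day - d0) / (d1 - d0)
        result.append((day, v0 + weight * (v1 - v0)))
    return result


# irregular observations (day of year, NDVI), one of them masked
days = [1, 9, 33, 49]
ndvi = [0.2, 0.4, None, 0.8]
series_16d = interpolate_equidistant(days, ndvi, step=16)
print(series_16d)
```

The output contains one value every 16 days between the first and last valid observation, including day 33, where the original observation was masked.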

#### Spectral-Temporal-Metrics (STM)

In [None]:
stm = run_all.get('STM')
stm

#### Pixel-Based Composite (PBC)

In [None]:
pbc = run_all.get('PBC')
pbc

#### Land Surface Phenology (LSP)

In [None]:
lsp = run_all.get('LSP')
lsp

### Visualizing the products
We will use [**geemap**](https://geemap.org/) to visualize some of the results.

In [None]:
import geemap

To visualize some of the products we first have to extract/convert some of the ones stored as ee.ImageCollection to an ee.Image. We simply use the first entry for this illustration:

In [None]:
img_tss = tss.first()  # first image
img_tsm = tsm.first()
img_nvo = nvo  # already an image
img_tsi = ee.Image(tsi.toList(tsi.size()).get(10))  # get nth image from collection
img_stm = stm
img_pbc = pbc.first()
img_lsp = lsp.first()

Then we add them to a map view:

In [None]:
import geemap.colormaps as cm

M = geemap.Map(center=[52.45, 13.4], zoom=10)  # centered on the Berlin ROI
M.add_basemap('HYBRID')
M.addLayer(img_tss, {'bands': ['NDVI'], 'min': 0.1, 'max': 0.9, 'palette': cm.palettes.viridis}, 'TSS')
M.addLayer(img_tsm, {'bands': ['NDVI'], 'min': 0.1, 'max': 0.9, 'palette': cm.palettes.viridis}, 'TSM')
M.addLayer(img_nvo, {'bands': ['NVO'], 'min': 0, 'max': 30, 'palette': cm.palettes.magma}, 'NVO')
M.addLayer(img_tsi, {'bands': ['NDVI'], 'min': 0.1, 'max': 0.9, 'palette': cm.palettes.viridis}, 'TSI')
M.addLayer(img_stm, {'bands': ['NDVI_p50'], 'min': 0.1, 'max': 0.9, 'palette': cm.palettes.viridis}, 'STM p50')
M.addLayer(img_pbc, {'bands': ['NDVI'], 'min': 0.1, 'max': 0.9, 'palette': cm.palettes.viridis}, 'PBC')
M


----