<img src="https://avatars.githubusercontent.com/u/74911464?s=200&v=4"
     alt="OpenEO Platform logo"
     style="float: left; margin-right: 10px;" />
# OpenEO Platform - UC9
Dynamic large area land cover mapping

In [1]:
import openeo
import geopandas as gpd
import pandas as pd
from openeo_classification.landuse_classification import *
from datetime import date
import ipywidgets as widgets
from sklearn.model_selection import train_test_split
import datetime
import json
from pathlib import Path

## Objectives and approach

In this notebook we study land cover (LC) mapping. Land cover has been mapped since the onset of remote sensing, and LC products have been identified as a fundamental variable for studying the functional and morphological changes occurring in the Earth's ecosystems and environment. They therefore play an important role in studying climate change and carbon circulation (Congalton et al., 2014; Feddema et al., 2005; Sellers et al., 1997). In addition, they provide valuable information for policy development and for a wide range of applications within the natural and life sciences, making land cover mapping one of the most widely studied applications within remote sensing (Yu et al., 2014; Tucker et al., 1985; Running, 2008; Yang et al., 2013).

With this variety in application fields comes a variety of user needs. Depending on the use case, there may be large differences in the desired target labels, the requested target year(s), the required output resolution, the feature set used, the stratification strategy employed, and more. The goal of this use case is to show that OpenEO as a platform can deal with this variability. We do so by creating a user-friendly interface in which the user can set a variety of parameters that tailor the pipeline (from reference set and L2A + GRD input data, to model, to inference) to the user's needs.

## Methodology

#### Reference data 
The reference dataset used in this project is the Land Use/Cover Area frame Survey (LUCAS) Copernicus dataset. LUCAS is an evenly spaced in-situ land use and land cover ground survey that covers the entire European Union. The Copernicus module extends up to 51 m in the four cardinal directions from each survey point to delineate a polygon around it. The final product contains about 60,000 polygons, from which points can subsequently be sampled (d'Andrimont et al., 2021). The user can specify how many points to sample from these polygons to train their model. In addition, the user can upload extra target data to improve performance.

#### Input data
The service runs on features constructed from GRD sigma0 and L2A data. These data are accessed through OpenEO Platform from Terrascope and SentinelHub. The extracted data span 01-01-2018 to 31-12-2018, corresponding to the year in which the LUCAS dataset, on which the model was trained, was collected (2018). Data from other years can be extracted for prediction, provided that the user uploads their own reference set.

- From S2: 7 indices are calculated (NDVI, NDMI, NDGI, ANIR, NDRE1, NDRE2, NDRE5) and 2 bands are kept (B06, B12).
- From S1: VV, VH and VV/VH.
- For each of these, 10 features: p25, p50, p75, sd and 6 t-steps, with flexible range.
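
As an illustration of the index calculation, the sketch below computes NDVI as a normalized difference of the Sentinel-2 near-infrared (B08) and red (B04) bands. The `rescale` helper is an assumption: it maps the index linearly from [-1, 1] to the 0 to 30000 range mentioned under Feature engineering, which may differ from the scaling the actual pipeline applies. The reflectance values are made up.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or None when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else None


def rescale(value, in_min=-1.0, in_max=1.0, out_max=30000):
    """Linearly map value from [in_min, in_max] to an integer in [0, out_max].

    Assumed scaling, for illustration only.
    """
    clipped = min(max(value, in_min), in_max)
    return round((clipped - in_min) / (in_max - in_min) * out_max)


# Made-up surface reflectances for one vegetated pixel.
b08, b04 = 0.45, 0.15
ndvi = normalized_difference(b08, b04)  # close to 0.5
ndvi_scaled = rescale(ndvi)
```

Storing the index as a scaled integer rather than a float is one common way to reduce data volume, which fits the computational-efficiency motivation given in the text.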

#### Preprocessing
The L2A data are masked using the Sen2Cor scene classification (SCL), with a buffering approach developed at VITO and made available as a process called mask_scl_dilation.
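
To give an idea of what buffering a scene-classification mask means, here is a minimal sketch that flags every pixel within a given radius of a cloud or cloud-shadow pixel. It is not the mask_scl_dilation implementation: the set of SCL classes treated as invalid, the square neighbourhood and the radius are all assumptions chosen for illustration.

```python
# SCL classes treated as invalid in this sketch (assumption):
# 3 = cloud shadow, 8 and 9 = cloud medium/high probability, 10 = thin cirrus.
INVALID_SCL = {3, 8, 9, 10}


def dilated_mask(scl, radius=1):
    """Return a boolean grid that is True wherever a pixel lies within
    `radius` pixels (square neighbourhood) of an invalid SCL pixel."""
    rows, cols = len(scl), len(scl[0])
    mask = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if scl[r][c] in INVALID_SCL:
                for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                        mask[rr][cc] = True
    return mask


# Tiny made-up SCL tile: one high-probability cloud pixel (9) among
# vegetation (4) and water (6).
scl = [
    [4, 4, 4, 4],
    [4, 9, 4, 4],
    [4, 4, 4, 4],
    [4, 4, 4, 6],
]
mask = dilated_mask(scl, radius=1)
```

Masking a buffer around detected clouds removes cloud edges and haze that the per-pixel classification tends to miss.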

#### Feature engineering
From the L2A data, 7 indices are calculated: NDVI, NDMI, NDGI, NDRE1, NDRE2, NDRE5 and ANIR. After calculating the indices, all bands are dropped except for B06 and B12. The outputs are rescaled to the range 0 to 30000 for the sake of computational efficiency. The indices are aggregated temporally with a step size of 10 days with an overlap of 10 days by taking the median. The output is then interpolated linearly to obtain a gap-free time series. The linear interpolation calculates a value for every NA value, except for leading and trailing NA values.

From the Sentinel-1 GRD collection, backscatter is calculated. The ratio of VV to VH backscatter is calculated and rescaled to 0-30000. The time series is then aggregated temporally with a step size of 12 and an overlap of 6 by taking the mean. The interpolation step is repeated, even though there should not be any missing values in this set. Next, the S1 cube is resampled spatially.

Finally, the two datacubes are merged and 10 features are calculated for each band. These 10 features are the standard deviation, the 25th, 50th and 75th percentiles, and 6 equidistant t-steps. Through this procedure, we end up with a total of 120 features (12 bands x 10 features).
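
The interpolation and the per-band feature calculation can be sketched in plain Python as follows. This is an illustration under assumptions, not the code that runs on the backend: the percentile interpolation rule, the use of the population standard deviation, the way the 6 t-steps are picked and the example series are all choices made for this sketch.

```python
import statistics


def interpolate_interior(series):
    """Linearly fill None gaps that have valid values on both sides.

    Leading and trailing None values are left untouched.
    """
    filled = list(series)
    valid = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(valid, valid[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            filled[i] = series[left] + frac * (series[right] - series[left])
    return filled


def percentile(ordered, q):
    """Percentile q (0-100) of a sorted list, interpolating between ranks."""
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def band_features(series, n_tsteps=6):
    """Compute the 10 features for one band: p25, p50, p75, sd, 6 t-steps."""
    vals = [v for v in series if v is not None]
    ordered = sorted(vals)
    feats = {
        "p25": percentile(ordered, 25),
        "p50": percentile(ordered, 50),
        "p75": percentile(ordered, 75),
        "sd": statistics.pstdev(vals),  # population sd (assumption)
    }
    # Equidistant positions along the time axis, first and last included.
    for k in range(n_tsteps):
        idx = round(k * (len(vals) - 1) / (n_tsteps - 1))
        feats[f"t{k + 1}"] = vals[idx]
    return feats


# Made-up composite series for one band with interior and edge gaps.
raw = [None, 10, None, 30, 40, 50, None, 70, 80, 90, 100, 110, None]
filled = interpolate_interior(raw)
features_one_band = band_features(filled)
n_total_features = 12 * len(features_one_band)  # 12 bands x 10 features
```

Applying `band_features` to each of the 12 bands yields the 120 features mentioned above.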

#### Model
Where models previously had to be trained outside of openEO, we can now train Random Forest models in openEO itself. Hyperparameter tuning can be performed using a custom hyperparameter set. Models can be constructed using either feature fusion (combine all S1 and S2 features and train one model) or decision fusion (train one model on the S1 features and one on the S2 features, and combine the models later). Random Forest was chosen because it is a fairly basic algorithm that trains rather quickly. Models can be trained using a stratification layer uploaded by the user, resulting in one model per stratification class. After training, the models are validated and then used for prediction.
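
The difference between the two fusion strategies can be illustrated with a small sketch. The averaging rule used for decision fusion here is only one possible way to combine two models and is an assumption, as are the function names and the example numbers; the notebook itself uses feature fusion.

```python
def feature_fusion(s1_features, s2_features):
    """Feature fusion: concatenate both feature vectors so that a single
    model can be trained on all features at once."""
    return list(s1_features) + list(s2_features)


def decision_fusion(proba_s1, proba_s2):
    """Decision fusion: combine the class probabilities of two separately
    trained models (here by averaging) and return the winning class index."""
    averaged = [(a + b) / 2 for a, b in zip(proba_s1, proba_s2)]
    return max(range(len(averaged)), key=averaged.__getitem__)


# Feature fusion: 3 made-up S1 features and 4 made-up S2 features.
fused = feature_fusion([0.1, 0.2, 0.3], [10, 20, 30, 40])

# Decision fusion: made-up class probabilities from an S1 and an S2 model.
winner = decision_fusion([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
```

In this example the S1 model alone would pick class 0, while the combined decision is class 1.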

## Implementation
First, create an area of interest (AOI) for which you want to do this classification and, if you would like to make use of stratification, a stratification layer. Tune the other parameters to your preference.

An important point to take into account is that, for now, the area of interest needs to lie within a single UTM zone.
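
A quick way to check the UTM constraint before uploading an AOI is to compare the zone numbers of its western and eastern edges. The sketch below uses the standard 6-degree zone formula and ignores the special zones around Norway and Svalbard; the function names and the example longitudes are assumptions made for illustration.

```python
import math


def utm_zone(lon):
    """Standard UTM zone number (1-60) for a longitude in degrees."""
    return int(math.floor((lon + 180) / 6)) % 60 + 1


def in_single_utm_zone(min_lon, max_lon):
    """True if the western and eastern edges of a bounding box share a zone.

    Assumes the box does not cross the antimeridian.
    """
    return utm_zone(min_lon) == utm_zone(max_lon)


ok = in_single_utm_zone(2.5, 5.9)        # both edges in zone 31
too_wide = in_single_utm_zone(5.0, 6.5)  # edges in zones 31 and 32
```

With geopandas, the two longitudes could be taken from `total_bounds` of the AOI in EPSG:4326.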

In [2]:
split, algorithm, nrtrees, fusion_technique, aoi_sampling, aoi_inference, start_date, end_date, nr_targets, nr_samples_per_polygon = getStartingWidgets()

Box(children=(Label(value='Train / test split:'), FloatSlider(value=0.75, max=1.0, step=0.05)))

Dropdown(description='Model:', disabled=True, options=('Random Forest',), value='Random Forest')

Box(children=(Label(value='Hyperparameters RF model:'), IntText(value=200, description='Nr trees:')))

Box(children=(Label(value='S1 / S2 fusion:'), RadioButtons(disabled=True, options=('Feature fusion', 'Decision…

FileUpload(value={}, accept='.geojson,.shp', description='Upload AOI sampling', layout=Layout(width='20em'))

FileUpload(value={}, accept='.geojson,.shp', description='Upload AOI inference', layout=Layout(width='20em'))

DatePicker(value=datetime.date(2018, 3, 1), description='Start date')

DatePicker(value=datetime.date(2018, 10, 31), description='End date')

Box(children=(Label(value='Select the amount of target classes:'), IntSlider(value=8, max=37, min=2)))

Box(children=(Label(value='Select the amount of times you want to point sample each reference polygon:'), IntS…

Select your final target classes

In [3]:
target_classes = getTargetClasses(nr_targets)

SelectMultiple(description='Target class', options=('A00: Artificial land', 'A10: Roofed built-up areas', 'A20…

SelectMultiple(description='Target class', options=('A00: Artificial land', 'A10: Roofed built-up areas', 'A20…

SelectMultiple(description='Target class', options=('A00: Artificial land', 'A10: Roofed built-up areas', 'A20…

SelectMultiple(description='Target class', options=('A00: Artificial land', 'A10: Roofed built-up areas', 'A20…

SelectMultiple(description='Target class', options=('A00: Artificial land', 'A10: Roofed built-up areas', 'A20…

SelectMultiple(description='Target class', options=('A00: Artificial land', 'A10: Roofed built-up areas', 'A20…

SelectMultiple(description='Target class', options=('A00: Artificial land', 'A10: Roofed built-up areas', 'A20…

SelectMultiple(description='Target class', options=('A00: Artificial land', 'A10: Roofed built-up areas', 'A20…

Transform the AOIs into geodataframes containing the right columns. If your AOIs are not valid, you will receive an error message. We also load the LUCAS Copernicus dataset for your specific target area. This method already converts the original labels to integers corresponding to the target classes you selected in the menu above.

In [4]:
stratification_column_label="groupID"
strata_sampling, strata_inference = getStrata(aoi_sampling, aoi_inference, strat_col_label=stratification_column_label)
y = getReferenceSet(aoi_sampling, nr_samples_per_polygon, target_classes)

Loading in the LUCAS Copernicus dataset...
Finished loading data.
Extracting points and converting target labels...
Finished extracting points and converting target labels
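
The label conversion mentioned above can be pictured as a lookup from LUCAS label strings to class indices. The sketch below is not the code inside getReferenceSet: the grouping of labels, the resulting integers and the handling of unselected labels are hypothetical.

```python
# Hypothetical selection: each target class groups one or more LUCAS labels.
selected_classes = [
    ["A00: Artificial land", "A10: Roofed built-up areas"],  # class 0
    ["G00: Water areas"],                                    # class 1
]

# Map every selected label to the index of the class it belongs to.
label_to_int = {
    label: idx
    for idx, group in enumerate(selected_classes)
    for label in group
}


def encode(labels):
    """Map label strings to integers; labels that were not selected become None."""
    return [label_to_int.get(label) for label in labels]


# "B00: Cropland" stands in for a label that was not selected.
encoded = encode(["G00: Water areas", "A10: Roofed built-up areas", "B00: Cropland"])
```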


In [None]:
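additional_y = gpd.read_file("resources/extra_data_spain.geojson").set_crs('epsg:4326')
# Map the label strings of the extra samples to the integer classes used in y.
# Using .loc avoids chained assignment, which pandas may apply to a temporary copy.
additional_y.loc[additional_y["target"] == "A00: Artificial land", "target"] = 0
additional_y.loc[additional_y["target"] == "G00: Water areas", "target"] = 6
y = pd.concat([y, additional_y[["geometry","target"]]], ignore_index=True)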

We can have a look at the number of samples per target class in this final reference set.

In [8]:
y.groupby("target").count()

| target | geometry |
|---|---|
| 0 | 204 |
| 1 | 2916 |
| 2 | 1836 |
| 3 | 1350 |
| 4 | 1194 |
| 5 | 507 |
| 6 | 82 |
| 7 | 3 |


The LUCAS dataset is imbalanced. That is why it can be useful for some use cases to gather some extra reference data, and append that to the LUCAS reference data.
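
To put a number on the imbalance, the sketch below takes the per-class counts shown above and computes each class's share of the reference set and the ratio between the largest and smallest class. The 2% threshold for flagging a class as rare is an arbitrary choice for illustration.

```python
# Per-class sample counts as reported by y.groupby("target").count() above.
counts = {0: 204, 1: 2916, 2: 1836, 3: 1350, 4: 1194, 5: 507, 6: 82, 7: 3}

total = sum(counts.values())
shares = {cls: n / total for cls, n in counts.items()}

# Ratio between the most and least frequent class.
imbalance_ratio = max(counts.values()) / min(counts.values())

# Classes below an (arbitrary) 2% share are candidates for extra reference data.
rare_classes = [cls for cls, share in shares.items() if share < 0.02]

for cls, share in shares.items():
    print(f"class {cls}: {counts[cls]:5d} samples ({share:.2%})")
```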

We then split our reference data into a training and a test set. If you wish, you can include a validation set as well.

In [9]:
y_train, y_test = train_test_split(y, test_size=(1-split.value), random_state=333)
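
If you want a validation set as well, one option is a three-way split. The sketch below shows the idea in plain Python on a list of sample indices; the function name and the 60/20/20 proportions are assumptions. In the notebook itself, calling train_test_split a second time on y_train would achieve a comparable result.

```python
import random


def three_way_split(items, train=0.6, val=0.2, seed=333):
    """Shuffle items reproducibly and split them into train, validation
    and test subsets; the test set receives whatever remains."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Split the row indices of a hypothetical 100-sample reference set.
train_idx, val_idx, test_idx = three_way_split(range(100))
```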

We will now load in the features. The feature set is made up of 7 indices calculated from S2, two S2 bands, two S1 bands and one index calculated from S1. More information on the exact feature calculation can be found in the use case demonstration report.

In [20]:
job_opts = {
    "indexreduction": 0,
    "temporalresolution": "None",
    "tile_size": 128,
}

features, feature_list = load_lc_features("sentinelhub", "both", start_date.value, end_date.value, processing_opts=job_opts)

# Note: if you have a large number of samples (>3000), you cannot pass your y data as in-memory JSON.
# Write the JSON to a file, store it somewhere accessible (for example, in the Public folder of your
# Terrascope machine) and pass the file path instead.
def fitRandomForestModel(features, y):
    X = features.aggregate_spatial(json.loads(y.to_json()), reducer="mean")
    ml_model = X.fit_class_random_forest(target=json.loads(y.to_json()), training=split.value, num_trees=nrtrees.value)
    model = ml_model.save_ml_model()
    training_job = model.create_job()
    training_job.start_and_wait()
    return training_job.job_id

Authenticated using refresh token.
Authenticated using refresh token.


  complain("Band name mismatch: {a} != {b}".format(a=cube_dimension_band_names, b=eo_band_names))
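
For the large-sample case mentioned in the note above, the reference points have to be written to a file first. The sketch below writes a GeoJSON FeatureCollection using only the standard library; the helper name, the sample tuples and the output location are hypothetical. With a geodataframe, `y.to_file(path, driver="GeoJSON")`, as used later in this notebook, does the same job.

```python
import json
import tempfile
from pathlib import Path


def write_samples_geojson(samples, out_path):
    """Write (lon, lat, target) tuples as a GeoJSON FeatureCollection and
    return the path of the written file."""
    collection = {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [lon, lat]},
                "properties": {"target": target},
            }
            for lon, lat, target in samples
        ],
    }
    out_path = Path(out_path)
    out_path.write_text(json.dumps(collection))
    return out_path


# Hypothetical samples: (longitude, latitude, integer target class).
samples = [(-3.70, 40.42, 0), (-3.68, 40.45, 6)]
geojson_path = write_samples_geojson(
    samples, Path(tempfile.mkdtemp()) / "y_train.geojson"
)
```

The returned path is what would then be passed on in place of the in-memory JSON, provided the backend can read that location.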


We will now start fitting the model. As we are using feature fusion, the S1 and S2 features are merged and a single model is trained on all features combined. However, if you are using stratification, a separate model is constructed for every stratum within your AOI (i.e., your AOI contains a field, named "groupID" in this notebook via stratification_column_label, whose rows hold different values).

In [None]:
jobids = {}

for index, stratum in strata_sampling.iterrows():
    y_train_clipped = gpd.clip(y_train, stratum["geometry"])
    jobid = fitRandomForestModel(features, y_train_clipped)
    jobids[stratum[stratification_column_label]] = { "fit_id": jobid }

If you don't want to run this step and prefer to start from a pretrained model, you can take one of the following models and pass its job ID to the model parameter of predict_random_forest:
- a 1-stratum model for Belgium: 8f18a3f8-458f-4e78-80d8-88341ee3af52
- a 2-stratum model for Spain:
    - stratum 1 (43000.0): 93ad21c3-dada-46a0-873d-13dd8d30f486
    - stratum 2 (46000.0): 1ddf23e5-212a-4883-b4a0-29a85ef9b363

If you haven't trained a model using the previous cell, you can load these stored models directly into their respective strata in the following way:

In [12]:
jobids = {
    43000.0: {
        "fit_id": '93ad21c3-dada-46a0-873d-13dd8d30f486'
    },
    46000.0: {
        "fit_id": "1ddf23e5-212a-4883-b4a0-29a85ef9b363"
    }
}

After training the respective models, we can do inference. In general, you will first want to do inference on a test set, so that you can calculate a number of accuracy metrics, such as the overall accuracy, the F-score and a confusion matrix.

In [None]:
base_path = "results"
country = "spain"
fname_geojson = 'y_test.geojson'

for index, stratum in strata_sampling.iterrows():
    validation_path = Path.cwd() / base_path / country / "validation" / str(stratum[stratification_column_label])
    validation_path.mkdir(parents=True,exist_ok=True)
    
    y_final_test = gpd.clip(y_test, stratum["geometry"])
    y_final_test.to_file(filename=str(validation_path / fname_geojson),driver="GeoJSON")
    cube = features.filter_spatial(json.loads(buf(y_final_test["geometry"]).to_json()))
    predicted = cube.predict_random_forest(model=jobids[stratum[stratification_column_label]]["fit_id"], dimension="bands").linear_scale_range(0,255,0,255)
    test_job = predicted.execute_batch(
        title="Validation for strata in Spain",
        out_format="netCDF"
    )
    jobids[stratum[stratification_column_label]]["validate_id"] = test_job.job_id
    # Download into the stratum's validation folder (path_tiff was never defined).
    test_job.get_results().download_files(str(validation_path))

Some finished validation jobs:
- a finished job from Belgium, 1 stratum: 6e1196b8-3a74-4572-bef1-690d62c92cde
- a finished job from Spain, 2 strata: 
    - stratum 1: b6d5a1ed-0bf5-4ad6-971f-ecd8c76d5db4

We can then calculate a number of validation metrics, such as the accuracy and the F-score from the test set.

In [None]:
# jobids is keyed by stratum identifier, so each key names a validation folder.
for stratum_id in jobids:
    stratum_path = Path.cwd() / base_path / country / "validation" / str(stratum_id)
    gdf, final_res = calculate_validation_metrics(
        str(stratum_path / fname_geojson),
        str(stratum_path / "openEO.nc"))
    print(final_res)
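
For reference, the metrics mentioned above can be computed from true and predicted class labels as in the sketch below. It is a plain-Python illustration and not the implementation of calculate_validation_metrics; the function names and the example labels are made up.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def overall_accuracy(matrix):
    """Share of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def f1_per_class(matrix):
    """F1 = 2*TP / (2*TP + FP + FN) for every class; 0.0 if undefined."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(len(matrix))) - tp
        fn = sum(matrix[i]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Made-up test labels and predictions for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = overall_accuracy(cm)
f1_scores = f1_per_class(cm)
```

With class counts as uneven as in the reference set above, per-class F-scores are usually more informative than the overall accuracy alone.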

Finally, if you are satisfied with the scores you obtained, you can run inference over the inference AOI you specified.

In [None]:
features, feature_list = load_lc_features("sentinelhub", "both", start_date.value, end_date.value, processing_opts=dict(tile_size=256))

for index, stratum in strata_inference.iterrows():
    cube = features.filter_spatial(stratum["geometry"])
    predicted = cube.predict_random_forest(
        model=jobids[stratum[stratification_column_label]]["fit_id"],
        dimension="bands"
    ).linear_scale_range(0,255,0,255)
    inf_job = predicted.execute_batch(out_format="netCDF")
    jobids[stratum[stratification_column_label]]["inference_id"] = inf_job.job_id

## References
d'Andrimont, R., et al. (2021). LUCAS Copernicus 2018: Earth-observation-relevant in situ data on land cover and use throughout the European Union. Earth System Science Data, 13(3), 1119-1133.

Congalton, R. G., Gu, J., Yadav, K., Thenkabail, P., & Ozdogan, M. (2014). Global land cover mapping: A review and uncertainty analysis. Remote Sensing, 6(12), 12070-12093.

Feddema, J. J., Oleson, K. W., Bonan, G. B., Mearns, L. O., Buja, L. E., Meehl, G. A., & Washington, W. M. (2005). The importance of land-cover change in simulating future climates. Science, 310(5754), 1674-1678.

Running, S. W. (2008). Ecosystem disturbance, carbon, and climate. Science, 321(5889), 652-653.

Sellers, P. J., Dickinson, R. E., Randall, D. A., Betts, A. K., Hall, F. G., Berry, J. A., ... & Henderson-Sellers, A. (1997). Modeling the exchanges of energy, water, and carbon between continents and the atmosphere. Science, 275(5299), 502-509.

Tucker, C. J., Townshend, J. R., & Goff, T. E. (1985). African land-cover classification using satellite data. Science, 227(4685), 369-375.

Yang, J., Gong, P., Fu, R., Zhang, M., Chen, J., Liang, S., ... & Dickinson, R. (2013). The role of satellite remote sensing in climate change studies. Nature climate change, 3(10), 875-883.

Yu, L., Liang, L., Wang, J., Zhao, Y., Cheng, Q., Hu, L., ... & Gong, P. (2014). Meta-discoveries from a synthesis of satellite-based land-cover mapping research. International Journal of Remote Sensing, 35(13), 4573-4588.