<img src="https://avatars.githubusercontent.com/u/74911464?s=200&v=4"
     alt="OpenEO Platform logo"
     style="float: left; margin-right: 10px;" />
# OpenEO Platform - UC9
Dynamic large area land cover mapping

In [1]:
import openeo
import geopandas as gpd
import pandas as pd
from openeo_classification.landuse_classification import *
from datetime import date
import ipywidgets as widgets
from sklearn.model_selection import train_test_split
import datetime
import json
from pathlib import Path

## Objectives and approach

In this notebook we study land cover (LC) mapping. Land cover mapping has been done since the onset of remote sensing, and LC products have been identified as a fundamental variable for studying the functional and morphological changes occurring in the Earth's ecosystems and environment. They therefore play an important role in studying climate change and carbon circulation (Congalton et al., 2014; Feddema et al., 2005; Sellers et al., 1997). In addition, land cover mapping provides valuable information for policy development and for a wide range of applications within the natural and life sciences, making it one of the most widely studied applications within remote sensing (Yu et al., 2014; Yang et al., 2013; Running, 2008; Tucker et al., 1985).

With this variety in application fields comes a variety of user needs. Depending on the use case, there may be large differences in the target labels desired, the target year(s) requested, the output resolution needed, the feature set used, the stratification strategy employed, and more. The goal of this use case is to show that openEO as a platform can deal with this variability, and we will do so through creating a user-friendly interface in which the user can set a variety of parameters that will tailor the land use classification pipeline to the user’s needs.

## Methodology

#### Reference data 
The reference dataset used in this project is the Land Use/Cover Area frame Survey (LUCAS) Copernicus dataset. LUCAS is an evenly spaced in-situ land use and land cover ground survey that covers the entire European Union. The Copernicus module extends up to 51 m in the four cardinal directions from each survey point to delineate a polygon around it. The final product contains about 60,000 polygons, from which points can subsequently be sampled (d'Andrimont et al., 2021). The user can specify how many points to sample from these polygons to train their model. In addition, the user can upload extra target data to improve performance.

#### Input data
The service runs on features constructed from Sentinel-1 GRD sigma0 and Sentinel-2 L2A data. These data are accessed through OpenEO Platform from Terrascope and SentinelHub. The extracted data cover 01-01-2018 to 31-12-2018, the year in which the LUCAS dataset used to train the model was collected. Data from other years can be extracted for prediction, provided that the user uploads their own reference set.

#### Preprocessing
The L2A data are masked using the Sen2Cor scene classification, with a buffering approach developed at VITO and made available as a process called mask_scl_dilation. From the Sentinel-1 GRD collection, backscatter is calculated.

#### Feature engineering
We select and calculate the following products from our input collections:
- 7 indices (NDVI, NDMI, NDGI, ANIR, NDRE1, NDRE2, NDRE5) and 2 bands (B06, B12) from the L2A collection
- VV, VH and VV/VH (ratio) from the GRD sigma0 collection
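
As a rough illustration of the per-pixel arithmetic behind these products, the sketch below computes two of the listed indices and the S1 ratio. The NDVI and NDMI band combinations are the commonly used ones (B08/B04 and B08/B11). The function names and sample values are made up for this example, and this is not the code that openeo_classification runs. The ratio assumes backscatter in linear units; in dB it would be a difference instead.

```python
def normalized_difference(a, b):
    """Generic normalized difference (a - b) / (a + b)."""
    return (a - b) / (a + b)


def ndvi(b08, b04):
    """NDVI from near-infrared (B08) and red (B04) reflectance."""
    return normalized_difference(b08, b04)


def ndmi(b08, b11):
    """NDMI from near-infrared (B08) and short-wave infrared (B11) reflectance."""
    return normalized_difference(b08, b11)


def vv_vh_ratio(vv, vh):
    """VV/VH ratio, assuming linear (not dB) backscatter values."""
    return vv / vh


# Hypothetical reflectance and backscatter values for a single vegetated pixel.
b04, b08, b11 = 0.05, 0.45, 0.20
vv, vh = 0.08, 0.02

print(round(ndvi(b08, b04), 3))
print(round(ndmi(b08, b11), 3))
print(round(vv_vh_ratio(vv, vh), 3))
```

The same functions work unchanged on NumPy arrays, which is closer to how a whole tile would be processed.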

As a user, you are however free to select other S2 indices. This can be done by specifying a dictionary with the indices and their value ranges, as we will see below.

All layers are rescaled to the range 0 to 30000 for computational efficiency. The indices and bands are then aggregated temporally: Sentinel-2 data over 10-day windows using the median, and Sentinel-1 data over 12-day windows using the mean. The median is used for the S2 collection instead of the mean to limit possible artifacts caused by cloud shadows. The output is then interpolated linearly, and the S1 cube is resampled spatially to a 10 m resolution. Finally, 10 features are calculated for each band: the standard deviation, the 25th, 50th and 75th percentiles, and 6 equidistant t-steps. This procedure yields a total of 120 features (12 bands x 10 features).
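
To make the feature count concrete, here is a minimal NumPy sketch of the last step, run on synthetic data. The helper name `temporal_features`, the random input and the choice to include the first and last observation among the 6 t-steps are assumptions for illustration; the actual openEO implementation may select the t-steps differently.

```python
import numpy as np


def temporal_features(series, n_tsteps=6):
    """Reduce one band's time series to 10 features:
    standard deviation, 25th/50th/75th percentile and 6 equidistant t-steps."""
    series = np.asarray(series, dtype=float)
    stats = [np.std(series), *np.percentile(series, [25, 50, 75])]
    # Equidistant positions along the time axis, endpoints included (an assumption).
    idx = np.linspace(0, len(series) - 1, n_tsteps).round().astype(int)
    return np.concatenate([stats, series[idx]])


rng = np.random.default_rng(0)
# Synthetic stand-in for one pixel: 12 bands x 25 time steps, values in the rescaled range.
pixel_cube = rng.uniform(0, 30000, size=(12, 25))

feature_vector = np.concatenate([temporal_features(band) for band in pixel_cube])
print(feature_vector.shape)  # (120,)
```

Each of the 12 bands contributes 10 values, which gives the 120 features per pixel that the model is trained on.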

#### Model
Where previously models had to be trained outside of openEO, we can now train Random Forest models in openEO itself. Hyperparameter tuning can be performed using a custom hyperparameter set. After training, the model is validated and used for prediction.

## Implementation
First, create an area of interest for which you want to do this classification, and optionally a stratification layer if you would like to make use of stratification. Tune the other parameters to your preference.

In [2]:
split, algorithm, nrtrees, fusion_technique, aoi_sampling, aoi_inference, start_date, end_date, nr_targets, nr_samples_per_polygon = get_starting_widgets()

Box(children=(Label(value='Train / test split:'), FloatSlider(value=0.75, max=1.0, step=0.05)))

Dropdown(description='Model:', disabled=True, options=('Random Forest',), value='Random Forest')

Box(children=(Label(value='Hyperparameters RF model:'), IntText(value=200, description='Nr trees:')))

Box(children=(Label(value='S1 / S2 fusion:'), RadioButtons(disabled=True, options=('Feature fusion', 'Decision…

FileUpload(value={}, accept='.geojson,.shp', description='Upload AOI sampling', layout=Layout(width='20em'))

FileUpload(value={}, accept='.geojson,.shp', description='Upload AOI inference', layout=Layout(width='20em'))

DatePicker(value=datetime.date(2018, 3, 1), description='Start date')

DatePicker(value=datetime.date(2018, 10, 31), description='End date')

Box(children=(Label(value='Select the amount of target classes:'), IntSlider(value=8, max=37, min=2)))

Box(children=(Label(value='Select the amount of times you want to point sample each reference polygon:'), IntS…

Select your final target classes

In [3]:
target_classes = get_target_classes(nr_targets)

SelectMultiple(description='Target class', options=('A00: Artificial land', 'A10: Roofed built-up areas', 'A20…

Text(value='Label 1', description='Label 1:', placeholder='Give your subselection a label')

SelectMultiple(description='Target class', options=('A00: Artificial land', 'A10: Roofed built-up areas', 'A20…

Text(value='Label 2', description='Label 2:', placeholder='Give your subselection a label')

SelectMultiple(description='Target class', options=('A00: Artificial land', 'A10: Roofed built-up areas', 'A20…

Text(value='Label 3', description='Label 3:', placeholder='Give your subselection a label')

SelectMultiple(description='Target class', options=('A00: Artificial land', 'A10: Roofed built-up areas', 'A20…

Text(value='Label 4', description='Label 4:', placeholder='Give your subselection a label')

Transform the AOIs into GeoDataFrames containing the right columns. If your AOIs are not valid, you will receive an error message. We will also load the LUCAS Copernicus dataset for your specific target area. This method also converts the original labels to integers corresponding to the target classes you selected in the menu above.

In [4]:
stratification_column_label="groupID"
strata_sampling, strata_inference = get_strata(aoi_sampling, aoi_inference, strat_col_label=stratification_column_label)

### TRAINING

We will first load in the reference set

In [5]:
y = get_reference_set(aoi_sampling, nr_samples_per_polygon, target_classes)
additional_y = gpd.read_file("resources/extra_data_belgium.geojson").set_crs('epsg:4326')
# Use .loc for the assignment to avoid pandas' SettingWithCopyWarning
additional_y.loc[additional_y["target"] == "A00: Artificial land", "target"] = 0
additional_y.loc[additional_y["target"] == "G00: Water areas", "target"] = 6
y = pd.concat([y, additional_y[["geometry","target"]]], ignore_index=True)
y = y.drop(["LC1"],axis=1)

Loading in the LUCAS Copernicus dataset...
Finished loading data.
Extracting points and converting target labels...
We transformed your labels into numerical, as the Random Forest model expects numerical values. The mapping is:
                      Target value
Label                             
Built-up                         0
Crops                            1
Woodland & Shrubland             2
Grassland                        3
Finished extracting points and converting target labels

We can have a look at the number of LUCAS samples present within this reference set.

In [6]:
y.groupby("target").count()

| target | LC1 | geometry |
|--------|-----|----------|
| 0      | 16  | 214      |
| 1      | 563 | 563      |
| 2      | 320 | 320      |
| 3      | 223 | 223      |
| 6      | 0   | 82       |


The LUCAS dataset is imbalanced. That is why it can be useful for some use cases to gather some extra reference data, and append that to the LUCAS reference data.

We then split our reference data into a training and a test set. If you wish, you can include a validation set as well.

In [7]:
y_train, y_test = train_test_split(y, test_size=(1-split.value), random_state=333)
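
If you do want a validation set, one common pattern is to call `train_test_split` twice: first to hold out the test set, then to carve a validation set out of the remainder. The sketch below uses a plain list as a stand-in for `y`; the split fractions and variable names are illustrative choices, not values prescribed by this use case.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the rows of the reference set y.
samples = list(range(100))

# Step 1: hold out 25% of all samples as the test set.
remainder, test = train_test_split(samples, test_size=0.25, random_state=333)

# Step 2: split the remaining samples into training and validation sets.
train, val = train_test_split(remainder, test_size=0.2, random_state=333)

print(len(train), len(val), len(test))
```

Because the class distribution is imbalanced, it may also be worth passing the `stratify` argument of `train_test_split` so that each subset keeps roughly the same class proportions.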

We will now load in the features. The feature set is made up of 7 indices calculated from S2, two S2 bands, two S1 bands and one index calculated from S1. More information on the exact feature calculation can be found in the use case demonstration report.

You can however decide to pass your own features. You can set the indices used for features with a parameter on load_lc_features called index_dict.
index_dict has the following structure:

index_dict = {
    'NDVI': [-1, 1],
    'NDMI': [-1, 1],
    'NDGI': [-1, 1],
    'NDRE1': [-1, 1],
    'NDRE2': [-1, 1],
    'NDRE5': [-1, 1],
    'ANIR': [0, 1]
}
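
The value range given for each index is what allows it to be rescaled to the common 0 to 30000 range mentioned in the methodology. A minimal sketch of such a linear rescaling is shown below; the function name `rescale` is hypothetical, and clipping out-of-range values before scaling is an assumption of this example rather than a documented property of load_lc_features.

```python
def rescale(value, input_range, output_range=(0, 30000)):
    """Linearly map value from input_range to output_range.

    Values outside input_range are clipped first (an assumption of this sketch).
    """
    in_min, in_max = input_range
    out_min, out_max = output_range
    value = min(max(value, in_min), in_max)
    return out_min + (value - in_min) * (out_max - out_min) / (in_max - in_min)


index_dict = {"NDVI": [-1, 1], "ANIR": [0, 1]}

print(rescale(0.0, index_dict["NDVI"]))  # 15000.0
print(rescale(0.5, index_dict["ANIR"]))  # 15000.0
print(rescale(1.2, index_dict["NDVI"]))  # 30000.0
```

Declaring a range that is too narrow for an index would therefore saturate part of its values, so the ranges should reflect the values the index can actually take.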

In [5]:
job_opts = {
    "indexreduction": 0,
    "temporalresolution": "None",
    "tile_size": 128,
}

features, feature_list = load_lc_features("sentinelhub", "both", start_date.value, end_date.value, processing_opts=job_opts)

# Note: if you have a large number of samples (>3000), you cannot pass your y data as in-memory JSON.
# Instead, write the JSON to a file, store it somewhere accessible (for example, the Public folder
# of your Terrascope machine) and pass the file path.
def fitRandomForestModel(features, y):
    X = features.aggregate_spatial(json.loads(y.to_json()), reducer="mean")
    ml_model = X.fit_class_random_forest(target=json.loads(y.to_json()), training=split.value, num_trees=nrtrees.value)
    model = ml_model.save_ml_model()
    training_job = model.create_job()
    training_job.start_and_wait()
    return training_job.job_id

{'collection': {'input_range': [0, 8000], 'output_range': [0, 30000]}, 'indices': {'NDVI': {'input_range': [-1, 1], 'output_range': [0, 30000]}, 'NDMI': {'input_range': [-1, 1], 'output_range': [0, 30000]}, 'NDGI': {'input_range': [-1, 1], 'output_range': [0, 30000]}, 'NDRE1': {'input_range': [-1, 1], 'output_range': [0, 30000]}, 'NDRE2': {'input_range': [-1, 1], 'output_range': [0, 30000]}, 'NDRE5': {'input_range': [-1, 1], 'output_range': [0, 30000]}, 'ANIR': {'input_range': [0, 1], 'output_range': [0, 30000]}}}
Authenticated using refresh token.
Authenticated using refresh token.


  complain("Band name mismatch: {a} != {b}".format(a=cube_dimension_band_names, b=eo_band_names))
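
For the large-sample case mentioned in the note in the cell above, the reference samples first have to be written to disk. The sketch below shows only that step, using the standard library and a hand-made GeoJSON FeatureCollection; the helper name `dump_samples`, the file name and the temporary directory are placeholders, and how the resulting path is then handed to openEO is not covered here.

```python
import json
import tempfile
from pathlib import Path


def dump_samples(feature_collection, directory, fname="y_train.geojson"):
    """Write a GeoJSON FeatureCollection (as a dict) to disk and return its path."""
    path = Path(directory) / fname
    path.write_text(json.dumps(feature_collection))
    return path


# Tiny hand-made stand-in for json.loads(y.to_json()).
samples = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"target": 1},
            "geometry": {"type": "Point", "coordinates": [4.35, 50.85]},
        }
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = dump_samples(samples, tmp)
    reloaded = json.loads(path.read_text())

print(path.name, len(reloaded["features"]))
```

In the notebook itself, `y.to_json()` already returns a string, so it could be written to the file directly without the round trip through `json.loads`.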


We will now start fitting the model. As we are using feature fusion, the S1 and S2 features will be merged, and only one model will be trained over all features combined. However, if you are using stratification, you will be constructing a separate model for every stratum within your AOI (i.e., your AOI contains a stratification field, here "groupID" as set in stratification_column_label, whose rows hold different values).

In [None]:
jobids = {}

for index, stratum in strata_sampling.iterrows():
    y_train_clipped = gpd.clip(y_train, stratum["geometry"])
    jobid = fitRandomForestModel(features, y_train_clipped)
    jobids[stratum[stratification_column_label]] = { "fit_id": jobid }

If you don't want to run this step and start with a pretrained model, you can take one of the following models and pass the job ID to the model parameter of predict_random_forest:
- a 1-stratum model for Belgium: 8f18a3f8-458f-4e78-80d8-88341ee3af52
- a 1-stratum model for Italy: 870fd274-b206-4ddd-b5c8-a1a459bc6696
- a 2-stratum model for Spain:
    - stratum 1 (43000.0): 93ad21c3-dada-46a0-873d-13dd8d30f486
    - stratum 2 (46000.0): 1ddf23e5-212a-4883-b4a0-29a85ef9b363
    
If you haven't trained a model using the previous cell, you can load some of these stored models directly into their respective strata in the following way:

In [5]:
jobids = {
    43000.0: {
        "fit_id": '93ad21c3-dada-46a0-873d-13dd8d30f486'
    },
    46000.0: {
        "fit_id": "1ddf23e5-212a-4883-b4a0-29a85ef9b363"
    }
}
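
Since the later cells look up the model with `jobids[...]["fit_id"]`, a stratum without a registered model only surfaces as a bare KeyError. A small helper such as the hypothetical `model_id_for` below (not part of openeo_classification) can make that failure easier to diagnose.

```python
def model_id_for(jobids, stratum_value):
    """Return the stored fit_id for a stratum, with a readable error if none exists."""
    try:
        return jobids[stratum_value]["fit_id"]
    except KeyError:
        raise KeyError(
            f"No trained model registered for stratum {stratum_value!r}; "
            f"known strata: {sorted(jobids)}"
        ) from None


jobids = {
    43000.0: {"fit_id": "93ad21c3-dada-46a0-873d-13dd8d30f486"},
    46000.0: {"fit_id": "1ddf23e5-212a-4883-b4a0-29a85ef9b363"},
}

print(model_id_for(jobids, 46000.0))
```

Note that the dictionary keys are floats; in Python an integer such as `43000` hashes and compares equal to `43000.0`, but a string such as `"43000.0"` would not match.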

After training the respective models, we can do inference. In general, you first want to do inference on a test set, so that you can calculate accuracy metrics such as the overall accuracy and the F-score, and/or create a confusion matrix.

In [None]:
base_path = "results"
country = "spain"
fname_geojson = 'y_test.geojson'

for index, stratum in strata_sampling.iterrows():
    if stratum[stratification_column_label] == 43000.0:
        print("Skipping this stratum")
        continue
    validation_path = Path.cwd() / base_path / country / "validation" / str(stratum[stratification_column_label])
    validation_path.mkdir(parents=True,exist_ok=True)
    
    y_final_test = gpd.clip(y_test, stratum["geometry"])
    y_final_test.to_file(filename=str(validation_path / fname_geojson),driver="GeoJSON")
    cube = features.filter_spatial(json.loads(buf(y_final_test["geometry"]).to_json()))
    predicted = cube.predict_random_forest(model=jobids[stratum[stratification_column_label]]["fit_id"], dimension="bands").linear_scale_range(0,255,0,255)
    test_job = predicted.execute_batch(
        title="Validation for strata in Spain",
        out_format="netCDF"
    )
    jobids[stratum[stratification_column_label]]["validate_id"] = test_job.job_id
    test_job.get_results().download_files(str(validation_path))

We can then calculate a number of validation metrics, such as the accuracy and the F-score from the test set.

In [36]:
for stratum in jobids:
    gdf, final_res = calculate_validation_metrics(
        str(Path.cwd() / base_path / country / "validation" / str(stratum) / fname_geojson), 
        str(Path.cwd() / base_path / country / "validation" / str(stratum) / "openEO.nc"))
    print(final_res)

The total amount of test samples you supplied is 1022. Of these, 27 could not be matched to the coordinates of your y samples. If this is more than a few samples, please check if your CRS is aligned.
Accuracy on test set: 0.719
{'accuracy': 0.7195979899497488, 'precision': [0.0, 0.7922374429223744, 0.7931034482758621, 0.5596707818930041, 0.5555555555555556, 0.6395348837209303, 1.0], 'recall': [0.0, 0.9302949061662198, 0.7630331753554502, 0.768361581920904, 0.06944444444444445, 0.6962025316455697, 1.0], 'fscore': [0.0, 0.8557336621454993, 0.7777777777777778, 0.6476190476190475, 0.12345679012345681, 0.6666666666666667, 1.0], 'support': [4, 373, 211, 177, 144, 79, 7]}


  _warn_prf(average, modifier, msg_start, len(result))


The total amount of test samples you supplied is 1001. Of these, 31 could not be matched to the coordinates of your y samples. If this is more than a few samples, please check if your CRS is aligned.
Accuracy on test set: 0.719
{'accuracy': 0.7195876288659794, 'precision': [1.0, 0.8442211055276382, 0.8840579710144928, 0.42105263157894735, 0.6575342465753424, 0.4423076923076923, 1.0], 'recall': [0.15789473684210525, 0.9130434782608695, 0.75, 0.75, 0.3096774193548387, 0.5111111111111111, 0.8181818181818182], 'fscore': [0.2727272727272727, 0.8772845953002611, 0.8115299334811531, 0.5393258426966292, 0.4210526315789474, 0.4742268041237113, 0.9], 'support': [19, 368, 244, 128, 155, 45, 11]}
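
To help read the printed dictionaries, the sketch below computes the same kinds of quantities (overall accuracy and per-class precision, recall, F-score and support) from scratch on a toy example. It is a plain-Python illustration with made-up labels and a hypothetical function name, not the implementation of calculate_validation_metrics.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return overall accuracy and a per-class dict of precision, recall, F-score and support."""
    labels = sorted(set(y_true) | set(y_pred))
    confusion = Counter(zip(y_true, y_pred))  # (true label, predicted label) -> count
    report = {}
    for label in labels:
        tp = confusion[(label, label)]
        predicted = sum(n for (t, p), n in confusion.items() if p == label)
        actual = sum(n for (t, p), n in confusion.items() if t == label)
        # Undefined ratios (no predictions or no samples for a class) are reported as 0.0.
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        fscore = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "fscore": fscore,
            "support": actual,
        }
    accuracy = sum(confusion[(label, label)] for label in labels) / len(y_true)
    return accuracy, report


# Toy labels: class 0 is never predicted, so its precision falls back to 0.0.
y_true = [0, 1, 1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 1, 2, 2, 2, 3, 1]

accuracy, report = per_class_metrics(y_true, y_pred)
print(accuracy)  # 0.625
print(report[0])
```

A class that the model never predicts ends up with precision 0.0 here, which is one plausible reading of the 0.0 entries for the first class in the first result above; classes with very small support should be interpreted with care in any case.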


### PREDICTION

Finally, if you are satisfied with the scores you obtained, we can do inference over the inference AOI you specified.

In [None]:
features, feature_list = load_lc_features("sentinelhub", "both", start_date.value, end_date.value, processing_opts=dict(tile_size=256))

for index, stratum in strata_inference.iterrows():
    cube = features.filter_spatial(stratum["geometry"])
    predicted = cube.predict_random_forest(
        model=jobids[stratum[stratification_column_label]]["fit_id"],
        dimension="bands"
    ).linear_scale_range(0,255,0,255)
    inf_job = predicted.execute_batch(out_format="GTiff")
    jobids[stratum[stratification_column_label]]["inference_id"] = inf_job.job_id

To authenticate: visit https://aai.egi.eu/oidc/device and enter the user code 'NPoZGY'.
Authorized successfully.
Authenticated using device code flow.
Authenticated using refresh token.


  complain("Band name mismatch: {a} != {b}".format(a=cube_dimension_band_names, b=eo_band_names))


0:00:00 Job '078c7422-e641-4cb3-9125-75146dd29b59': send 'start'
0:00:32 Job '078c7422-e641-4cb3-9125-75146dd29b59': queued (progress N/A)
0:00:37 Job '078c7422-e641-4cb3-9125-75146dd29b59': queued (progress N/A)
0:00:44 Job '078c7422-e641-4cb3-9125-75146dd29b59': queued (progress N/A)
0:00:53 Job '078c7422-e641-4cb3-9125-75146dd29b59': queued (progress N/A)
0:01:04 Job '078c7422-e641-4cb3-9125-75146dd29b59': queued (progress N/A)
0:01:17 Job '078c7422-e641-4cb3-9125-75146dd29b59': queued (progress N/A)
0:01:32 Job '078c7422-e641-4cb3-9125-75146dd29b59': queued (progress N/A)
0:01:52 Job '078c7422-e641-4cb3-9125-75146dd29b59': queued (progress N/A)
0:02:17 Job '078c7422-e641-4cb3-9125-75146dd29b59': queued (progress N/A)
0:02:47 Job '078c7422-e641-4cb3-9125-75146dd29b59': queued (progress N/A)
0:03:25 Job '078c7422-e641-4cb3-9125-75146dd29b59': queued (progress N/A)
0:04:12 Job '078c7422-e641-4cb3-9125-75146dd29b59': queued (progress N/A)
0:05:11 Job '078c7422-e641-4cb3-9125-75146dd29b

## References
d'Andrimont, R., et al. (2021). LUCAS Copernicus 2018: Earth-observation-relevant in situ data on land cover and use throughout the European Union. Earth System Science Data, 13(3), 1119-1133.

Congalton, R. G., Gu, J., Yadav, K., Thenkabail, P., & Ozdogan, M. (2014). Global land cover mapping: A review and uncertainty analysis. Remote Sensing, 6(12), 12070-12093.

Feddema, J. J., Oleson, K. W., Bonan, G. B., Mearns, L. O., Buja, L. E., Meehl, G. A., & Washington, W. M. (2005). The importance of land-cover change in simulating future climates. Science, 310(5754), 1674-1678.

Running, S. W. (2008). Ecosystem disturbance, carbon, and climate. Science, 321(5889), 652-653.

Sellers, P. J., Dickinson, R. E., Randall, D. A., Betts, A. K., Hall, F. G., Berry, J. A., ... & Henderson-Sellers, A. (1997). Modeling the exchanges of energy, water, and carbon between continents and the atmosphere. Science, 275(5299), 502-509.

Tucker, C. J., Townshend, J. R., & Goff, T. E. (1985). African land-cover classification using satellite data. Science, 227(4685), 369-375.

Yang, J., Gong, P., Fu, R., Zhang, M., Chen, J., Liang, S., ... & Dickinson, R. (2013). The role of satellite remote sensing in climate change studies. Nature climate change, 3(10), 875-883.

Yu, L., Liang, L., Wang, J., Zhao, Y., Cheng, Q., Hu, L., ... & Gong, P. (2014). Meta-discoveries from a synthesis of satellite-based land-cover mapping research. International Journal of Remote Sensing, 35(13), 4573-4588.