<img src="https://avatars.githubusercontent.com/u/74911464?s=200&v=4"
     alt="OpenEO Platform logo"
     style="float: left; margin-right: 10px;" />
# OpenEO Platform - UC3
Crop type feature engineering

In [1]:
import openeo
from openeo.processes import (
    ProcessBuilder, array_modify, normalized_difference, drop_dimension, quantiles,
    sd, mean, array_apply, array_concat, array_create, if_,
)
from openeo.rest.conversions import timeseries_json_to_pandas
from eo_utils import *
from matplotlib import pyplot as plt
from shapely.geometry import box
import pandas as pd
from helper import *
import hvplot.xarray
import xarray as xr  # used further down to open the downloaded netCDF file



In this notebook, we look at crop type classification with a random forest, starting with feature engineering. We first calculate a number of vegetation indices that have been shown to perform well in predicting wheat specifically. From the time series of these indices we then derive statistics, such as percentiles and the standard deviation, which serve as input features for our ML model.

Step 1: Select a polygon on the map below. For the sake of the analysis below, make sure your polygon covers just one land use type (for example, a forest, a field or a small group of fields).

In [2]:
center = [50.8, 4.75]
zoom = 13

eoMap = openeoMap(center,zoom)
addS2Tiles(eoMap)
eoMap.map

Map(center=[50.8, 4.75], controls=(ZoomControl(options=['position', 'zoom_in_text', 'zoom_in_title', 'zoom_out…

In [3]:
bbox = eoMap.getBbox()

Fill in the land cover type that you think the polygon you selected contains.

In [4]:
connection = openeo.connect("https://openeo-dev.vito.be")
# connection.authenticate_oidc()
connection.authenticate_basic("bart","bart123")

<Connection to 'https://openeo-dev.vito.be/openeo/1.0/' with BearerAuth>

## Load the dataset

Step 2: *Connect* to the VITO backend and load all the bands needed to calculate the vegetation indices, plus the SCL (scene classification) band for cloud masking. The calculations are done for the polygon you selected in the previous step. The temporal extent runs from 1 November 2018 to 1 February 2020, so it covers all of 2019 plus a few months on either side.
After that, use the scene classification band to perform *cloud masking*.
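To make the cloud masking idea concrete, here is a minimal local sketch in NumPy. It is not the backend's `mask_scl_dilation` implementation: the choice of SCL classes to mask, the square dilation window and the helper names `dilate` and `mask_clouds` are all assumptions made for illustration.

```python
import numpy as np

# SCL classes treated as contaminated in this sketch (an assumption):
# 3 = cloud shadow, 8 = cloud medium probability,
# 9 = cloud high probability, 10 = thin cirrus
CLOUD_CLASSES = (3, 8, 9, 10)


def dilate(mask: np.ndarray, radius: int) -> np.ndarray:
    """Grow a boolean mask by `radius` pixels using a square window."""
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    size = 2 * radius + 1
    for dy in range(size):
        for dx in range(size):
            out |= padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out


def mask_clouds(band: np.ndarray, scl: np.ndarray, radius: int = 1) -> np.ndarray:
    """Return `band` as float, with cloudy pixels (plus a buffer) set to NaN."""
    cloudy = dilate(np.isin(scl, CLOUD_CLASSES), radius)
    return np.where(cloudy, np.nan, band.astype(float))


# Toy 4x4 scene with a single high-probability cloud pixel at row 1, column 1
scl = np.array([[4, 4, 4, 4],
                [4, 9, 4, 4],
                [4, 4, 4, 4],
                [4, 4, 4, 5]])
band = np.full(scl.shape, 1000)
masked = mask_clouds(band, scl, radius=1)
```

The dilation step removes pixels next to detected clouds as well, since cloud edges and thin haze are often missed by the classification itself.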

In [11]:
year = 2019
temp_ext = [str(year-1)+"-11-01", str(year+1)+"-02-01"]

temporal_partition_options = {
        "indexreduction": 0,
        "temporalresolution": "None",
        "tilesize": 256
    }
default_partition_options = {
    "tilesize": 256
}

s2 = connection.load_collection("TERRASCOPE_S2_TOC_V2",                                 
                                temporal_extent=temp_ext,
                                spatial_extent={'west': bbox[0], 'east': bbox[2], 
                                                'south': bbox[1], 'north': bbox[3],
                                                'crs': 'EPSG:4326'},
                                bands=["B03","B04","B05","B06","B07","B08","B11","SCL"])
s2._pg.arguments['featureflags'] = temporal_partition_options
s2_masked = s2.process("mask_scl_dilation", data=s2, scl_band_name="SCL")
s2_masked_res = s2_masked
# s2_masked_res = s2_masked.apply_dimension(dimension="bands",process=lambda x: lin_scale_range(x, 0,6000,0,250))

Step 3: The next step is to create 10-day composites by reducing the time series with the median, and to linearly interpolate the 10-day periods for which no data was available.
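The compositing and gap-filling logic can be illustrated locally on a single pixel's time series. The sketch below is an assumption-laden stand-in, not the notebook's openEO code: it uses fixed 10-day bins counted from a start date, whereas the openEO `dekad` period of `aggregate_temporal_period` splits each month into three parts. The function names are hypothetical.

```python
import numpy as np
from datetime import date


def ten_day_composites(dates, values, start, n_periods):
    """Median per fixed 10-day bin counted from `start`; empty bins become NaN."""
    bins = np.array([(d - start).days // 10 for d in dates])
    values = np.asarray(values, dtype=float)
    composites = np.full(n_periods, np.nan)
    for i in range(n_periods):
        in_bin = values[bins == i]
        in_bin = in_bin[~np.isnan(in_bin)]  # ignore masked (cloudy) observations
        if in_bin.size:
            composites[i] = np.median(in_bin)
    return composites


def interpolate_linear(series):
    """Fill NaN gaps between valid samples linearly; leading/trailing NaNs stay."""
    series = np.asarray(series, dtype=float)
    valid = ~np.isnan(series)
    if not valid.any():
        return series.copy()
    idx = np.arange(series.size)
    first, last = idx[valid][0], idx[valid][-1]
    gaps = ~valid & (idx > first) & (idx < last)
    filled = series.copy()
    filled[gaps] = np.interp(idx[gaps], idx[valid], series[valid])
    return filled


# Four observations, one of them masked out by clouds (NaN)
dates = [date(2019, 1, 2), date(2019, 1, 8), date(2019, 1, 25), date(2019, 2, 5)]
values = [0.2, 0.4, np.nan, 0.6]
composites = ten_day_composites(dates, values, start=date(2019, 1, 1), n_periods=4)
filled = interpolate_linear(composites)
```

Here the second and third periods have no valid observation, so they are filled from the neighbouring composites.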

In [6]:
s1 = connection.load_collection("S1_GRD_SIGMA0_ASCENDING",                                 
                                temporal_extent=[str(year)+"-01-01", str(year+1)+"-01-01"],
                                spatial_extent={'west': bbox[0], 'east': bbox[2], 
                                                'south': bbox[1], 'north': bbox[3],
                                                'crs': 'EPSG:4326'},
                                bands=["VH","VV"]
                               )
s1._pg.arguments['featureflags'] = temporal_partition_options

composite_s1 = s1.apply_dimension(dimension="bands",
                                    process=lambda x: array_modify(data=x, values=x.array_element(0)/x.array_element(1), index=0))
# composite_s1 = composite_s1.apply_dimension(dimension="bands",process=lambda x: lin_scale_range(x,0,1,0,250))
composite_s1 = composite_s1.resample_cube_spatial(s2_masked_res)

In [7]:
s1_s2_merged = s2_masked_res.merge_cubes(composite_s1).rename_labels("bands",s2_masked_res.metadata.band_names + ["ratio","VV","VH"])

## Feature selection for crop type mapping

For the final use case, the following crop types need to be mapped: 

* Summer cereal (= spring wheat, spring barley, spring rye, oats) 
* Winter cereal (= winter wheat, winter barley, winter rye) 
* Maize 
* Potato 
* Sugar beet 
* Rapeseed (= winter + spring rapeseed) 

A separability analysis was conducted based on the reference data currently available in the WorldCereal Common Input Baseline. We only focused on European reference data (i.e. LPIS and SIGPAC data). All classification features currently supported by the WorldCereal system were tested. Separability was first computed for each crop type separately in a binary one-vs-all case. Afterwards, an average separability of each feature was computed and features were ranked accordingly. Based on the ranking of features for individual crop types, the overall ranking for all types and ease of implementation, a proposal is made of features to be included for the openEO CCN.  

Attention! In WorldCereal, all features are normalized using the growing degree days approach. This has been proven to significantly increase the robustness of classification features across space and time. For openEO CCN, this normalization step is probably less crucial, as the study is “limited” to Europe only. Just bear in mind that feature separability was computed on normalized features, so the results might not be entirely applicable. Still, the main results and conclusions are expected to be the same for the normalized and non-normalized cases. 

**Sentinel-2 features**
For the final feature selection, the following timeseries should be computed/derived (see satio.rsindices for definitions of the band combinations mentioned below) from Sentinel-2:
* B06 
* B12 
* NDVI 
* NDMI 
* NDGI 
* ANIR 
* NDRE1 
* NDRE2 
* NDRE5 
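Several of the indices in this list are normalized differences of two bands. As a minimal sketch, the snippet below computes NDVI and NDMI with NumPy using their commonly used Sentinel-2 band pairs (B08/B04 and B08/B11). These pairs are an assumption here; the definitions in satio.rsindices are the reference for this use case and may differ, and the other indices (for example ANIR) are not covered.

```python
import numpy as np


def normalized_difference(a, b):
    """Compute (a - b) / (a + b), returning NaN where the denominator is zero."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    total = a + b
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(total == 0, np.nan, (a - b) / total)


# Toy reflectance values for three pixels (the last one is empty)
b04 = np.array([500.0, 800.0, 0.0])     # red
b08 = np.array([3500.0, 1200.0, 0.0])   # near infrared
b11 = np.array([1500.0, 1800.0, 0.0])   # short-wave infrared

ndvi = normalized_difference(b08, b04)
ndmi = normalized_difference(b08, b11)
```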

Based on all these timeseries, the following features should be computed: 
* Percentiles (p10, p50, p90) 
* Standard deviation 
* Tsteps: the value of the timeseries at 6 selected time steps, which results in a set of 6 features (ts0, ts1, ts2, ts3, ts4 and ts5). Of these, ts0, ts1 and ts5 are the most important ones. The others can be dropped to limit the total number of features to be used. 

**Sentinel-1 features**
Timeseries to be derived from sentinel-1 data: 
* VV 
* VH 
* VH/VV ratio 

Features to be computed on these timeseries (ordered from high to lower importance): 
* Percentiles (p10, p50, p90) 
* Tsteps
* Standard deviation 
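For the VH/VV ratio, a small NumPy sketch is given below. It assumes the backscatter values are in linear units (not dB), which is an assumption about the collection and not something stated above; the helper names are hypothetical.

```python
import numpy as np


def vh_vv_ratio(vh, vv):
    """VH/VV ratio in linear units; NaN where VV is not strictly positive."""
    vh = np.asarray(vh, dtype=float)
    vv = np.asarray(vv, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(vv > 0, vh / vv, np.nan)


def to_db(x):
    """Convert linear backscatter to decibels."""
    return 10.0 * np.log10(x)


vh = np.array([0.01, 0.02, 0.005])
vv = np.array([0.1, 0.05, 0.0])
ratio = vh_vv_ratio(vh, vv)
```

Note that a ratio in linear units corresponds to a difference in dB, so `to_db(vh / vv)` equals `to_db(vh) - to_db(vv)` wherever both are defined.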

**Other features**
* Altitude derived from Copernicus 30m DEM
* Derived from AgERA5 meteo data:
  * Sum of precipitation over growing season
  * P10, p50 and p90 of mean temperature 


## Feature computation

We will already calculate some of these features here and plot them.

Features are generally computed over the temporal dimension, and result in multiple bands being added. In the openEO API, the 'apply_dimension' process 
allows us to work on the temporal dimension, and store the result in the bands dimension.

So first, we calculate the bands we will use.

In [25]:
## this still fails with a recursion depth error if you try to compute all bands
## the combination NDVI NDMI NDGI NDRE1 NDRE2 NDRE5 runs fine
## ANIR alone, the combination NDVI, ANIR and the combination NDVI NDMI ANIR run fine
## adding one more index to that last combination makes it fail
indices = compute_indices(s1_s2_merged, ["NDVI","NDMI","NDGI","ANIR","NDRE1","NDRE2","NDRE5"], 250)

0:00:00 Job '2fc7cb97-e713-4ca0-8d17-dae6200d4ab9': send 'start'
0:00:34 Job '2fc7cb97-e713-4ca0-8d17-dae6200d4ab9': queued (progress N/A)
0:00:39 Job '2fc7cb97-e713-4ca0-8d17-dae6200d4ab9': queued (progress N/A)
0:00:46 Job '2fc7cb97-e713-4ca0-8d17-dae6200d4ab9': queued (progress N/A)
0:00:53 Job '2fc7cb97-e713-4ca0-8d17-dae6200d4ab9': queued (progress N/A)
0:01:03 Job '2fc7cb97-e713-4ca0-8d17-dae6200d4ab9': queued (progress N/A)
0:01:16 Job '2fc7cb97-e713-4ca0-8d17-dae6200d4ab9': queued (progress N/A)
0:01:31 Job '2fc7cb97-e713-4ca0-8d17-dae6200d4ab9': running (progress N/A)
0:01:50 Job '2fc7cb97-e713-4ca0-8d17-dae6200d4ab9': running (progress N/A)
0:02:14 Job '2fc7cb97-e713-4ca0-8d17-dae6200d4ab9': running (progress N/A)
0:02:44 Job '2fc7cb97-e713-4ca0-8d17-dae6200d4ab9': running (progress N/A)
0:03:21 Job '2fc7cb97-e713-4ca0-8d17-dae6200d4ab9': running (progress N/A)
0:04:08 Job '2fc7cb97-e713-4ca0-8d17-dae6200d4ab9': running (progress N/A)
0:05:06 Job '2fc7cb97-e713-4ca0-8d17-dae6

JobFailedException: Batch job 2fc7cb97-e713-4ca0-8d17-dae6200d4ab9 didn't finish properly. Status: error (after 0:06:06).

Then we calculate the features that we need: the quantiles, the mean, the standard deviation and the t-steps of all indices. All of these can be computed in a single `apply_dimension` call.

In [13]:
def computeStats(input_timeseries: ProcessBuilder):
    # Sample the time series at every 6th time step: t0, t6, ..., t30
    tsteps = [input_timeseries.array_element(6 * index) for index in range(0, 6)]
    # p10/p50/p90, followed by mean and standard deviation
    stats = array_concat(
        input_timeseries.quantiles(probabilities=[0.1, 0.5, 0.9]),
        [input_timeseries.mean(), input_timeseries.sd()],
    )
    return array_concat(stats, tsteps)  # .linear_scale_range(0,2000,0,250)


features = indices.apply_dimension(dimension='t',target_dimension='bands', process=computeStats)

tstep_labels = [ "t"+ str(6*index) for index in range(0,6) ]
features = features.rename_labels('bands',[band + "_" + stat for band in indices.metadata.band_names for stat in ["p10","p50","p90","mean","sd"] + tstep_labels ])

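# Run as a batch job and download the result
job = features.execute_batch(out_format="netCDF")
results = job.get_results()
results.download_file("./data/indices.nc")
# Synchronous alternative (writes the same file, so use one or the other):
# features.download("./data/indices.nc", format="netCDF")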
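To check what the feature vector of one pixel and one index looks like, the same statistics can be reproduced locally with NumPy. This is a sketch under assumptions: the time series is taken to have at least 31 steps (for example 36 ten-day composites), `ddof=1` is used because openEO's `sd` is defined as the sample standard deviation, and NumPy's default quantile method is assumed to match the backend. The names `compute_stats` and `feature_labels` are hypothetical.

```python
import numpy as np

STATS = ["p10", "p50", "p90", "mean", "sd"]
TSTEP_INDICES = [6 * i for i in range(6)]            # 0, 6, 12, 18, 24, 30
TSTEP_LABELS = ["t" + str(i) for i in TSTEP_INDICES]


def compute_stats(series):
    """Return [p10, p50, p90, mean, sd, t0, t6, ..., t30] for one time series."""
    series = np.asarray(series, dtype=float)
    percentiles = np.nanquantile(series, [0.1, 0.5, 0.9])
    summary = [np.nanmean(series), np.nanstd(series, ddof=1)]
    tsteps = series[TSTEP_INDICES]
    return np.concatenate([percentiles, summary, tsteps])


def feature_labels(band_names):
    """Build labels in the same order as the rename_labels call: band first, then stat."""
    return [band + "_" + stat for band in band_names for stat in STATS + TSTEP_LABELS]


series = np.arange(36, dtype=float)   # stand-in for 36 ten-day composites
features_1d = compute_stats(series)
labels = feature_labels(["NDVI", "NDMI"])
```

Each input band therefore yields 11 output bands, so the two example bands above give 22 labels such as `NDVI_p10` and `NDMI_t30`.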

Subsequently, we can plot a few of these features to see what they look like.

In [9]:
features_ds = xr.open_dataset("./data/indices_allesbehalveanir.nc",engine="h5netcdf")
features_ds.to_array().sel(variable=['B04_p50','B08_p50','NDVI_p50']).hvplot(x='x',y='y',by='variable',subplots=True,width = 300,dynamic=False,colorbar=False,cmap="YlGn")

## Dataset sampling

Once we have our features defined, we will want to generate a set of input features for model calibration. Instead of calculating features
for a very large area, openEO can sample the datacube at selected locations.

Sampling a datacube in openEO currently requires polygons as sampling features. Other types of geometry, like lines
and points, should be converted into polygons first by applying a buffering operation. Using the spatial resolution
of the datacube as buffer size can be a way to approximate sampling at a point.
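As an illustration of that buffering step, the sketch below turns points into small square GeoJSON polygons using only the standard library. It assumes the coordinates are in a projected CRS with metre units, so that a buffer of 10 corresponds to a 10 m pixel; the function names and the `id` property are hypothetical.

```python
def point_to_square(x, y, buffer):
    """GeoJSON Polygon: axis-aligned square extending `buffer` units around (x, y)."""
    ring = [
        [x - buffer, y - buffer],
        [x + buffer, y - buffer],
        [x + buffer, y + buffer],
        [x - buffer, y + buffer],
        [x - buffer, y - buffer],  # repeat the first vertex to close the ring
    ]
    return {"type": "Polygon", "coordinates": [ring]}


def points_to_feature_collection(points, buffer):
    """Wrap one square polygon per point in a GeoJSON FeatureCollection."""
    return {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "properties": {"id": i},
                "geometry": point_to_square(x, y, buffer),
            }
            for i, (x, y) in enumerate(points)
        ],
    }


# Two sample points in metre-based coordinates, buffered by one 10 m pixel
points = [(500000.0, 5600000.0), (500100.0, 5600200.0)]
collection = points_to_feature_collection(points, buffer=10.0)
```

A dictionary like this can be serialized with `json.dump` and used as the polygon input for sampling.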

To indicate to openEO that we only want to compute the datacube for certain polygon features, we use the
`DataCube.filter_spatial` method.

Next to that, we also indicate that we want to write multiple output files, so that we get one or more raster
outputs per sampling feature, which is convenient for further processing. To do this, we set
the 'sample_by_feature' output format property, which is available for the netCDF and GTiff output formats.

Combining all of this results in the following sample code:

In [15]:
sampled_features = features.filter_spatial("https://artifactory.vgt.vito.be/testdata-public/parcels/test_10.geojson")
# job = sampled_features.send_job(title="Sentinel2", description="Sentinel-2 features",out_format="netCDF",sample_by_feature=True)
job2 = sampled_features.execute_batch(title="Sentinel2", description="Sentinel-2 features",out_format="netCDF",sample_by_feature=True)
results2 = job2.get_results()
results2.download_file("./data/feats.nc")

0:00:00 Job '8d09b102-d187-4584-bc4c-13f203c3594b': send 'start'
0:00:29 Job '8d09b102-d187-4584-bc4c-13f203c3594b': queued (progress N/A)
0:00:34 Job '8d09b102-d187-4584-bc4c-13f203c3594b': queued (progress N/A)
0:00:40 Job '8d09b102-d187-4584-bc4c-13f203c3594b': queued (progress N/A)
0:00:48 Job '8d09b102-d187-4584-bc4c-13f203c3594b': queued (progress N/A)
0:00:58 Job '8d09b102-d187-4584-bc4c-13f203c3594b': queued (progress N/A)
0:01:10 Job '8d09b102-d187-4584-bc4c-13f203c3594b': queued (progress N/A)
0:01:26 Job '8d09b102-d187-4584-bc4c-13f203c3594b': queued (progress N/A)
0:01:45 Job '8d09b102-d187-4584-bc4c-13f203c3594b': queued (progress N/A)
0:02:09 Job '8d09b102-d187-4584-bc4c-13f203c3594b': running (progress N/A)
0:02:39 Job '8d09b102-d187-4584-bc4c-13f203c3594b': running (progress N/A)
0:03:16 Job '8d09b102-d187-4584-bc4c-13f203c3594b': running (progress N/A)
0:04:03 Job '8d09b102-d187-4584-bc4c-13f203c3594b': running (progress N/A)
0:05:01 Job '8d09b102-d187-4584-bc4c-13f203

JobFailedException: Batch job 8d09b102-d187-4584-bc4c-13f203c3594b didn't finish properly. Status: error (after 0:40:04).

Sampling only works for batch jobs, because it results in multiple output files, which cannot be conveniently transferred
in a synchronous call.


After sampling the data, we end up with a lot of pixels from various fields.
To get a more balanced representation, it is better to select only a fixed number of pixels from each sampled field; otherwise large fields would have a disproportionate influence on the model.
So first, we select 10 pixels from every field that was sampled.
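A minimal sketch of this per-field subsampling is shown below. It assumes the sampled pixels have been flattened into arrays with one field identifier per pixel; the function name, the fixed random seed and the toy field sizes are illustrative assumptions.

```python
import numpy as np


def sample_pixels_per_field(field_ids, n_per_field=10, seed=0):
    """Return sorted indices of at most `n_per_field` random pixels per field."""
    field_ids = np.asarray(field_ids)
    rng = np.random.default_rng(seed)
    selected = []
    for field in np.unique(field_ids):
        candidates = np.flatnonzero(field_ids == field)
        n = min(n_per_field, candidates.size)  # small fields keep all their pixels
        selected.append(rng.choice(candidates, size=n, replace=False))
    return np.sort(np.concatenate(selected))


# One large field, one medium field and one field with fewer than 10 pixels
field_ids = np.array([1] * 500 + [2] * 25 + [3] * 4)
idx = sample_pixels_per_field(field_ids)
counts = {int(f): int((field_ids[idx] == f).sum()) for f in np.unique(field_ids)}
```

The returned indices can be used to select the matching rows of the feature array and of the label array, so that both stay aligned.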

Then, we will start building a random forest model. We do this by defining a UDF (user-defined function).