# Create the target variable for the Random Forest regression

This notebook creates the target variable for the Random Forest regression. The approach uses the Very-High Resolution (VHR) data in the private PLANETSCOPE collection. The 150 study sites, each covering 16 ha, were sampled locally before the VHR data was acquired.

## Requirements

In [343]:
import openeo
from openeo.processes import *
import xarray as xr
import numpy as np
import pandas as pd
import geopandas as gpd
import csv
import time
import os
from os import path
import netCDF4
import rasterio
import rioxarray
import matplotlib.pyplot as plt
from glob import glob

## Connection

Connect to the openEO back-end using the openEO Python client.

In [344]:
connection = openeo.connect("openeo.cloud")

Authenticate via EGI Check-in (OpenID Connect)

In [345]:
connection.authenticate_oidc("egi")

Authenticated using refresh token.


<Connection to 'https://openeocloud.vito.be/openeo/1.0.0/' with BearerAuth>

The VHR data from `PLANETSCOPE` is stored in the back-end

In [346]:
connection.describe_collection("PLANETSCOPE")

OpenEoApiError: [500] Internal: No backend with id 'vito'

The `PLANETSCOPE` collection is commercial data acquired specifically for this use case through the ESA Network of Resources (NoR); it is not openly accessible to all users. Access requires a specific feature flag called **BYOC** (Bring Your Own Collection). The BYOC ID is read from a text file that cannot be shared publicly.  
This means that **the target variable generation for the FCC use case is not reproducible without the BYOC code**.

In [281]:
byoc_id_file = "extdata/byoc.txt"
# Read the BYOC IDs line by line; the context manager closes the file afterwards.
with open(byoc_id_file, "r") as f:
    byoc_id_raw = f.read().splitlines()

## External Data

To process all test sites, a shapefile containing the boundaries of the smaller test areas is read. A buffer of 0.0001° is applied so that all pixels within the test sites are captured.

In [308]:
import geopandas
import matplotlib

shp_path = "extdata/SuitableSitesVHR_selected_country.shp"
gdf      = gpd.read_file(shp_path)
gdf      = gdf.buffer(0.0001)


  


In [309]:
bboxs = gdf.bounds
bbox2 = bboxs.to_numpy()
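Each row of `bbox2` holds one site's bounds in the order `(minx, miny, maxx, maxy)`, which is later mapped to the `west`/`south`/`east`/`north` keys of the spatial extent. A minimal sketch of that mapping is shown below; the helper name `bounds_to_extent` and the coordinates are hypothetical and only serve as illustration.

```python
def bounds_to_extent(bounds_row):
    """Map a (minx, miny, maxx, maxy) row to an openEO-style extent dict."""
    minx, miny, maxx, maxy = (float(v) for v in bounds_row)
    if minx >= maxx or miny >= maxy:
        raise ValueError("degenerate bounding box")
    return {"west": minx, "south": miny, "east": maxx, "north": maxy}


# Made-up coordinates for illustration only (not a real study site).
example_row = (10.0, 50.0, 10.004, 50.004)
extent = bounds_to_extent(example_row)
print(extent)
```

Validating the row once, before the request is built, makes an empty or flipped geometry fail locally rather than on the back-end.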

## Connect to VHR data

In this example notebook, the target variable is calculated for the first test site in Germany for the year 2018.

In [310]:
i = bbox2[0]
byoc_id = byoc_id_raw[0]
year = 2018

It can be difficult to use NDVI to properly distinguish between forests and other land cover classes, especially those that also represent vegetation, such as crops or grasslands. *Yang et al. (2019)* presented standardized forest NDVI signatures that show the seasonal trend within a year for different forest types. Several time intervals are of interest when generating a forest mask. Because the `PLANETSCOPE` satellites revisit frequently, each interval is covered by several images. The time intervals of interest are:
- **summer_time**: Period of maximum vegetation productivity, from mid-April to mid-September
- **winter_time**: Period of minimum vegetation productivity, from mid-November to mid-February
- **annual_time**: The whole calendar year
- **total_time**: Covers **annual_time** plus the winter shoulder period (until mid-February of the following year)

In [311]:
summer_time =  [np.datetime64(str(year)+"-04-15"),np.datetime64(str(year)+"-09-15")]
winter_time =  [np.datetime64(str(year)+"-11-15"),np.datetime64(str(year+1)+"-02-15")]
annual_time =  [np.datetime64(str(year)+"-01-01"),np.datetime64(str(year)+"-12-31")]
total_time  =  [np.datetime64(str(year)+"-01-01"),np.datetime64(str(year+1)+"-02-15")]
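The four intervals can also be generated for any year by a small helper, which makes it easy to check that **total_time** encloses the other three. The sketch below uses only the standard library; the function names `season_intervals` and `as_extent` are hypothetical and not part of the original workflow.

```python
from datetime import date


def season_intervals(year):
    """Return the four [start, end] date pairs used for the seasonal masks."""
    return {
        "summer_time": [date(year, 4, 15), date(year, 9, 15)],
        "winter_time": [date(year, 11, 15), date(year + 1, 2, 15)],
        "annual_time": [date(year, 1, 1), date(year, 12, 31)],
        "total_time": [date(year, 1, 1), date(year + 1, 2, 15)],
    }


def as_extent(interval):
    """Format an interval as ISO date strings, as passed to a temporal extent."""
    return [d.isoformat() for d in interval]


intervals = season_intervals(2018)

# total_time must enclose every other interval; otherwise a later
# temporal filter would ask for data that was never loaded.
total_start, total_end = intervals["total_time"]
for name, (start, end) in intervals.items():
    assert total_start <= start and end <= total_end, name

print(as_extent(intervals["winter_time"]))  # ['2018-11-15', '2019-02-15']
```

Note that openEO temporal extents are left-closed and right-open, so the end date itself is not part of the interval.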

The `PLANETSCOPE` data is first loaded for **total_time**, and the NDVI is calculated. A clear-sky mask is then applied to exclude all pixels affected by clouds, haze, and other atmospheric effects.

In [312]:
plnt = connection.load_collection(
    collection_id  = "PLANETSCOPE",
    spatial_extent = {"west": i[0], "south": i[1], "east": i[2], "north": i[3]},
    temporal_extent= [str(total_time[0]), str(total_time[1])]
    )
plnt._pg.arguments['featureflags'] = {'byoc_collection_id': byoc_id}

plnt_ndvi = plnt.ndvi(nir="B4",red="B3")
mask      = plnt.band("UDM2_Clear").apply(lambda x: x.neq(1))
plnt_ndvi_msk = plnt_ndvi.mask(mask=mask)

# plnt_ndvi_msk_save = plnt_ndvi_msk.save_result(format="NetCDF")
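As a rough local illustration of what these two steps compute per pixel, the NumPy sketch below applies NDVI = (NIR - Red) / (NIR + Red) to a tiny made-up patch and then blanks out every pixel whose clear-sky flag is not 1. The arrays and values are assumptions for illustration only; the actual computation runs on the back-end.

```python
import numpy as np

# Hypothetical reflectance values for a 2x2 patch (not real PlanetScope data).
red = np.array([[0.10, 0.20], [0.05, 0.30]])
nir = np.array([[0.50, 0.20], [0.45, 0.30]])
clear = np.array([[1, 1], [0, 1]])  # 1 = clear sky, anything else = unusable

# Normalised difference of the near-infrared and red bands.
ndvi = (nir - red) / (nir + red)

# Mirror the masking step: where the mask (clear != 1) is True,
# the pixel is replaced by no-data.
ndvi_masked = np.where(clear != 1, np.nan, ndvi)
print(ndvi_masked)
```

The mask is built with "not equal to 1" because the masking step removes pixels where the mask is true, so the mask has to flag the unusable pixels rather than the clear ones.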

## Create Seasonal masks

### Summer

In this step the **summer_time** is applied to the data using the `filter_temporal` process. For analysing the whole summer period based on `median` and `sd` metrics, the `reduce_dimension` process is used on the *t* dimension.

We calculate two indicators important for the mask generation:
- **s_msk_med_hig**: Median summer NDVI above 0.6
- **s_msk_sd_low**: Summer NDVI standard deviation below 0.1

In [313]:
summer_time =  [np.datetime64(str(year)+"-04-15"),np.datetime64(str(year)+"-09-15")]
plnt_summer =  plnt_ndvi_msk.filter_temporal(extent= [str(summer_time[0]), str(summer_time[1])])

s_msk_med     = plnt_summer.reduce_dimension(dimension='t',reducer=median)
s_msk_med_hig = s_msk_med > 0.6

s_msk_sd      = plnt_summer.reduce_dimension(dimension='t',reducer=sd)
s_msk_sd_low  = s_msk_sd < 0.1
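The effect of the two summer indicators can be reproduced locally on a small made-up NDVI stack. The sketch below reduces along the time axis with NumPy, ignoring masked (NaN) observations; using `ddof=1` assumes that the back-end's `sd` is the sample standard deviation. All values are hypothetical.

```python
import numpy as np

# Hypothetical NDVI stack with shape (t, y, x); NaN marks masked observations.
ndvi_t = np.array([
    [[0.70, 0.30]],
    [[0.75, 0.90]],
    [[np.nan, 0.65]],
    [[0.72, 0.85]],
])

# Reduce the time axis, skipping no-data values.
med = np.nanmedian(ndvi_t, axis=0)
sd = np.nanstd(ndvi_t, axis=0, ddof=1)  # sample standard deviation (assumption)

med_high = med > 0.6   # consistently green in summer
sd_low = sd < 0.1      # little variation within the summer period

print(med_high, sd_low)
```

The first pixel is stable and green, as expected for forest. The second pixel also has a high median but varies strongly, which is the pattern that the standard deviation criterion is meant to catch.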

### Winter

In this step the **winter_time** is applied to the data using the `filter_temporal` process. For analysing the whole winter period based on the `median` metric, the `reduce_dimension` process is used on the *t* dimension.

Here we calculate two indicators important for the mask generation:
- **w_msk_med_hig**: Median winter NDVI above 0.6
- **w_msk_med_low**: Median winter NDVI between 0 and 0.4

In [314]:
winter_time =  [np.datetime64(str(year)+"-11-15"),np.datetime64(str(year+1)+"-02-15")]
plnt_winter =  plnt_ndvi_msk.filter_temporal(extent= [str(winter_time[0]), str(winter_time[1])])

w_msk_med     = plnt_winter.reduce_dimension(dimension='t',reducer=median)
w_msk_med_hig = w_msk_med > 0.6

w_low_upper   = w_msk_med < 0.4
w_low_lower   = w_msk_med > 0
w_msk_med_low = w_low_upper * w_low_lower


### Year

In this step the **annual_time** is applied to the data using the `filter_temporal` process. For analysing the whole year based on the `median` metric, the `reduce_dimension` process is used on the *t* dimension.

Here we calculate one indicator important for the mask generation:
- **y_msk_med_hig**: Median yearly NDVI above 0.6

In [315]:
annual_time =  [np.datetime64(str(year)+"-01-01"),np.datetime64(str(year)+"-12-31")]
plnt_year   =  plnt_ndvi_msk.filter_temporal(extent= [str(annual_time[0]), str(annual_time[1])])

y_msk_med     = plnt_year.reduce_dimension(dimension='t',reducer=median)
y_msk_med_hig = y_msk_med > 0.6

## Create thematic masks

Based on the seasonal masks, thematic masks are created:

- **f_evergreen_mask**: High summer and high winter values, typical of evergreen (conifer) forests
- **f_deciduous_mask**: High summer and low winter values, typically hinting at deciduous forests
- **f_mixed_mask**: Mixed forests typically have high yearly and high summer values but are the most difficult to spot. This mask supports the others in excluding other vegetation types

In [316]:
f_evergreen_mask  = s_msk_med_hig * w_msk_med_hig
f_deciduous_mask  = s_msk_med_hig * w_msk_med_low
f_mixed_mask      = y_msk_med_hig * s_msk_med_hig

Finally, the masks are combined into a final forest canopy cover mask using the following criteria:
- **cmask_forest**: Flags a pixel as forest if it is evergreen, deciduous, or mixed (or several of those)
- **cmask_forest_sd**: Excludes pixels with a high standard deviation in summer, which often hints at management processes typical of crops and grassland vegetation

In [317]:
cmask_forest = f_evergreen_mask + f_deciduous_mask + f_mixed_mask
cmask_forest = cmask_forest > 0
cmask_forest_sd = cmask_forest * s_msk_sd_low

cmask_save = cmask_forest_sd.save_result(format="NetCDF")
# print(cmask_save.to_json())
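In the mask arithmetic, multiplying two boolean masks acts as a logical AND, while adding them and testing `> 0` acts as a logical OR. The NumPy sketch below shows this equivalence on four made-up pixels; the short variable names and all values are hypothetical.

```python
import numpy as np

# Hypothetical per-pixel indicators for four pixels.
s_med_hig = np.array([True, True, True, False])
w_med_hig = np.array([True, False, False, False])
w_med_low = np.array([False, True, False, True])
y_med_hig = np.array([True, False, True, False])
s_sd_low = np.array([True, True, False, True])

# Thematic masks: both conditions must hold (logical AND).
evergreen = s_med_hig & w_med_hig
deciduous = s_med_hig & w_med_low
mixed = y_med_hig & s_med_hig

# Forest if any thematic mask matches (logical OR), then drop unstable pixels.
forest = evergreen | deciduous | mixed
forest_sd = forest & s_sd_low

# The arithmetic form (sum > 0, then product) gives the same result.
total = evergreen.astype(int) + deciduous.astype(int) + mixed.astype(int)
arith = (total > 0) * s_sd_low
assert np.array_equal(arith.astype(bool), forest_sd)

print(forest_sd)
```

The third pixel matches the mixed-forest pattern but fails the standard deviation test, so it is removed from the final mask.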

## Resample

After the target variable is computed for each pixel, the result is resampled to 30 m (one 30 m cell comprises 100 Planet pixels or 9 Sentinel-2/Sentinel-1 pixels). This allows for a more stable estimation of the target variable during the Random Forest regression.

In [341]:
test_res = cmask_forest_sd.resample_spatial(resolution=30,method="average")
res_save = test_res.save_result(format="NetCDF")
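Averaging a binary mask over a coarser cell turns it into a fraction between 0 and 1, which is what makes the result usable as a continuous regression target. The sketch below reproduces this with a plain block average in NumPy, assuming that 10 x 10 fine pixels fall exactly into one coarse cell. The helper `block_average` is hypothetical, and the back-end's resampling may align cells differently.

```python
import numpy as np


def block_average(mask, factor):
    """Average non-overlapping factor x factor blocks of a 2-D array."""
    rows, cols = mask.shape
    if rows % factor or cols % factor:
        raise ValueError("array shape must be a multiple of the block size")
    blocks = mask.reshape(rows // factor, factor, cols // factor, factor)
    return blocks.mean(axis=(1, 3))


# Hypothetical 20x20 binary forest mask -> 2x2 grid of coarse cells.
mask = np.zeros((20, 20), dtype=float)
mask[:10, :10] = 1.0     # one fully forested coarse cell
mask[10:, 10:15] = 1.0   # one half-forested coarse cell

fcc = block_average(mask, 10)
print(fcc)
```

Each output value is the share of forest pixels inside the coarse cell, so 1.0 means fully covered and 0.5 means half covered.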

In [342]:
job = res_save.create_job(title = "VH0_result_Resample30_average")
job.start_job()

OpenEoApiError: [500] Internal: Failed to create job on backend 'vito': OpenEoApiError('[502] unknown: Bad Gateway')

In [340]:
jobId = job.job_id
connection.job(jobId)

OpenEoApiError: [500] unknown: [502] unknown: Bad Gateway

<RESTJob job_id='vito-99cc4292-537a-4ea5-af90-454d652e8421'>
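Because gateway errors such as the 502 responses above can be transient, it may help to poll the job status in a loop that tolerates individual failed requests. The following is a minimal sketch of such a loop, not part of the original workflow: `wait_until_finished` is a hypothetical helper, and it assumes that the status is available through a callable (for example the job object's `status` method) and that `finished`, `error` and `canceled` are the terminal states.

```python
import time


def wait_until_finished(get_status, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Poll `get_status()` until the job reaches a terminal state."""
    terminal = {"finished", "error", "canceled"}
    for _ in range(max_polls):
        try:
            status = get_status()
        except Exception as exc:  # e.g. a transient 502 from the gateway
            print(f"status request failed, retrying: {exc}")
        else:
            if status in terminal:
                return status
        sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal state in time")


# Demo with a fake status source instead of a real back-end connection.
fake_states = iter(["queued", "running", "finished"])
final = wait_until_finished(lambda: next(fake_states), poll_seconds=0, sleep=lambda s: None)
print(final)  # finished
```

Passing the status source and the sleep function as arguments keeps the loop testable without a back-end connection.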