# Introduction

This notebook describes how the sampling data were derived for two parts of the openEO Platform Use Case 8, "Fractional Canopy Cover":\
- **The local test site sampling**: the procedure to decide for which test sites within the study area the very high resolution (VHR) data will be acquired.
- **The local validation data sampling**: the data necessary for the quality assessment of the final map.

The **test site sampling** is the first step of the use case: it delineates suitable test sites that serve as training data for a Random Forest regression. Because of cost constraints and the large area covered by the use case, the test sites have to be chosen carefully. We chose to search for 150 test sites of 16 ha each, spread evenly across the whole study area. Because this step had to be carried out very early in the project, the scripts rely on local copies of the EO layers needed for the computation. The step involves both a computationally heavy evaluation of 16 ha polygons over the whole area and the implementation of a custom scoring scheme.\
The **validation data sampling** was conducted in parallel so that all locally sampled data sets were available early on, leaving the later stages free to focus on regression and prediction.

The EO-layers used in this part of the UC are:\
- The Copernicus Forest High Resolution Layer [Tree Cover Density](https://land.copernicus.eu/pan-european/high-resolution-layers/forests/tree-cover-density)  
- The Copernicus Forest High Resolution Layer [Dominant Leaf Type](https://land.copernicus.eu/pan-european/high-resolution-layers/forests/dominant-leaf-type)  
- The [2018 CORINE Land Cover -CLC- data set](https://land.copernicus.eu/pan-european/corine-land-cover/clc2018)

The following terms recur throughout this notebook:\
- *Study area*: The whole bounding box of roughly 1 million km².\
- *Tile*: A tile of the Copernicus Forest HRL tiling.\
- *Potential test sites*: All 16 ha polygons in the study area.\
- *Grid*: Gridded (spatial) representation of the potential test sites.\
- *Test sites*: The final selection of the 150 test areas for the VHR data request.

**NOTE**: This workflow has been implemented locally and is therefore only partially reproducible on other systems. The intermediate data sets, however, are included in */resources/UC8/Local/*.


In [9]:
import glob
import os

import geopandas as gpd
import numpy as np
import pandas as pd
import rasterio as rio
import rioxarray as rxr
import shapely.geometry
import xarray as xr
from osgeo import gdal
from osgeo import ogr
from shapely import geometry
from shapely.geometry import mapping

# Test site selection

## Import

First, the file with the study area extent is loaded and reprojected to ETRS89-LAEA (EPSG:3035):

In [16]:
eusalpbb = gpd.read_file("notebooks/resources/UC8/Local/alpinespace_eusalp_boundingbox.shp")
eusalpbb_laea = eusalpbb.to_crs('EPSG:3035')

In [13]:
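# Export the 75 km tile grid. `eusalpbb_sites` is created in the "Gridding" section below,
# so that cell has to be run before this one.
eusalpbb_sites.to_file('eusalpbb.geojson', driver='GeoJSON')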

Next, the data sets must be imported. This comprises the study area as well as the HRL and CLC files. Because this approach was developed before the collections were available on the back-ends, and a VHR data cube was needed quickly, the files are loaded from local storage.\
First we open the CLC layer with `rioxarray`. This only has to be done once, as the layer is not tiled or otherwise divided.

In [19]:
crlc = rxr.open_rasterio("Path_to_Corine_land_cover_layer", masked=True)  # placeholder path: point it to the local CLC raster

In [20]:
clc_classes = pd.read_csv("notebooks/resources/UC8/Local/clc_classes_final.csv")

Next we import a CSV file that lists the tiled Copernicus HRL data as stored in the local file system (tile name and paths to the TCD, DLT and grid files).

In [32]:
hrl_join = pd.read_csv("notebooks/resources/UC8/Local/hrl_table_geo.csv")

## Gridding

Each potential test site has an extent of 16 ha (400 m × 400 m), as defined for the 150 test sites. Grids of 16 ha polygons are therefore constructed throughout the study area, leading to a total of 3.56 million polygons to be analysed. For convenience, the grids are calculated per tile of the Copernicus HRL tiling. In addition, a coarser grid of 75 km × 75 km squares (`eusalpbb_sites`) is laid over the bounding box of the study area.


In [18]:
# Build a grid of 75 km x 75 km squares over the study area bounding box (EPSG:3035)
total_bounds = eusalpbb_laea.total_bounds
minX, minY, maxX, maxY = total_bounds
x, y = (minX, minY)
geom_array = []
square_size = 75000
while y < maxY:
    while x < maxX:
        geom = geometry.Polygon([(x,y), (x, y+square_size), (x+square_size, y+square_size), (x+square_size, y), (x, y)])
        geom_array.append(geom)
        x += square_size
    x = minX
    y += square_size
 
eusalpbb_sites = gpd.GeoDataFrame(geom_array, columns=['geometry']).set_crs('EPSG:3035')
eusalpbb_sites.insert(0, 'ID', range(0, 0 + len(eusalpbb_sites)))

In [28]:
# Derive the GeoJSON grid path of each tile from its raster grid path (same name, new extension)
hrl_join['geo_grid'] = [os.path.splitext(r_grid)[0] + '.geojson' for r_grid in hrl_join['Grid']]


In [34]:
for i in range(0, len(hrl_join)):
    data = hrl_join['TCD'][i]                     # Load the Tree Cover Density representing the extent
    output = hrl_join["Tile"][i]+ '_grid.geojson' # Name of the file
    ra = rio.open(data)
    bounds  = ra.bounds
    minX, minY, maxX, maxY = np.array(bounds)
    geom_array = []
    x, y = (minX, minY)
    square_size = 400                             # Make Grid across the raster with 400x400m (16ha)
    while y < maxY:
        while x < maxX:
            geom = geometry.Polygon([(x,y), (x, y+square_size), (x+square_size, y+square_size), (x+square_size, y), (x, y)])
            geom_array.append(geom)
            x += square_size
        x = minX
        y += square_size
    fishnet = gpd.GeoDataFrame(geom_array, columns=['geometry']).set_crs('EPSG:3035')  
    fishnet.to_file(output, driver='GeoJSON')      # Write the Grid
    break  # stops after the first tile; remove this line to grid all tiles

Once the process is finished and the grids have been exported, they form the basis for the computation of the metrics.
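
As a rough check of the grid geometry, the following sketch reproduces the cell layout of the loops above without any geospatial library. The helper `grid_origins` and the 100 km × 100 km extent are assumptions made for illustration; only the 400 m cell size is taken from the notebook.

```python
import math


def grid_origins(min_x, min_y, max_x, max_y, cell_size):
    """Return the lower-left corner of every square cell covering the bounding box."""
    n_cols = math.ceil((max_x - min_x) / cell_size)
    n_rows = math.ceil((max_y - min_y) / cell_size)
    return [
        (min_x + col * cell_size, min_y + row * cell_size)
        for row in range(n_rows)   # south to north, like the outer while-loop
        for col in range(n_cols)   # west to east, like the inner while-loop
    ]


CELL_SIZE_M = 400
area_ha = CELL_SIZE_M * CELL_SIZE_M / 10_000   # 1 ha = 10,000 m^2

# Hypothetical tile extent in EPSG:3035 metres (100 km x 100 km)
origins = grid_origins(4_000_000, 2_500_000, 4_100_000, 2_600_000, CELL_SIZE_M)
print(area_ha, len(origins))
```

Each origin can be turned into a polygon by adding the cell size to both coordinates, which is what the notebook does with `geometry.Polygon`.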

## Metrics

After all potential test sites have been delineated, the metrics for each of them have to be determined. This is done by calculating statistics of the HRL and CLC layers for each individual polygon. This step requires a lot of data to be loaded while the metrics are being calculated. To keep the workflow in a single function that can be run in parallel, the whole metrics generation is done by the `calcMetric` function below.


In [2]:
# Parameters of calcMetric:
# r1: Tree Cover Density raster of one tile (rioxarray DataArray)
# r2: Dominant Leaf Type raster of the same tile (rioxarray DataArray)
# r3: CORINE Land Cover raster (rioxarray DataArray)
# j:  GeoDataFrame with the 16 ha grid polygons of the tile; calcMetric iterates over its rows

In [None]:
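# Note: the cell defining `calcMetric` (further below) has to be run before this loop.
for i in range(0, len(hrl_join)):
    r1 = rxr.open_rasterio(hrl_join['TCD'][i])        # Tree Cover Density raster of the tile
    r2 = rxr.open_rasterio(hrl_join['DLT'][i])        # Dominant Leaf Type raster of the tile
    r_grid = gpd.read_file(hrl_join['geo_grid'][i])   # 16 ha grid polygons of the tile
    result = calcMetric(r1, r2, crlc, r_grid)
    result_gpd = gpd.GeoDataFrame(result, columns=['geometry', 'DomAbs', 'DomType', 'HRLPerc', 'Density', 'CLCclasses', 'CLCperc', 'CLCother']).set_crs('EPSG:3035')
    outfile = hrl_join["Tile"][i] + '_shapefile_select_noscore.geojson'
    result_gpd.to_file(outfile, driver='GeoJSON')
    break  # stops after the first tile; remove this line to process all tiles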

In [14]:
def calcMetric(r1,r2,r3,j):
    b1 = pd.DataFrame(columns = ['geometry', 'DomAbs', 'DomType', 'HRLPerc', 'Density', 'CLCclasses', 'CLCperc', 'CLCother']) 
    for index, row in j.iterrows():
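        # Clip the three layers to the current 16 ha polygon (rio.clip expects a list of geometries)
        cr_tcd = r1.rio.clip([row.geometry])
        cr_trtype = r2.rio.clip([row.geometry])
        cr_crlc = r3.rio.clip([row.geometry], from_disk=True)   # clip from disk to avoid loading the whole CLC raster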
    
        #Calculate tree density#
        cr_tcd_mn = int(np.mean(cr_tcd))
        if cr_tcd_mn > 90 or cr_tcd_mn < 10:
            continue
        # Calculate the Tree Type / Density and Dominance
        cr_trtype_b = cr_trtype[0]
        unique, counts = np.unique(cr_trtype_b, return_counts=True)
        tab = np.asarray((unique, counts)).T
        tab = np.sort(tab)[::-1]
        if len(tab) <2:
            continue 
        brd = int(sum(tab[1]))/int(sum(sum(tab)))
        conf = int(sum(tab[0]))/int(sum(sum(tab)))
        dominance = (brd-conf)/(brd+conf)
        dominant_type = "BroadLeaved" if dominance > 0 else "Coniferous"
        dominance_abs = round(abs(dominance)*100)
        forest_perc = round((brd+conf)*100)
        if forest_perc > 90 or forest_perc < 10:
            continue
        
        # Calculate the CLC Objects
        cr_crlc = cr_crlc[0]
        unique, counts = np.unique(cr_crlc, return_counts=True)
        #print(unique, counts)
        dataset = pd.DataFrame({'Raster_ID': unique, 'n': counts})
        dataset_m = dataset.merge(clc_classes, on='Raster_ID', how='left')
        dataset_m = dataset_m[['L1_class', 'Target' , 'n']]
        if dataset_m.isin(['Forests']).any().any():
            if len(dataset_m.index) == 1:
                pass
            pass
        clc_nclasses = len(dataset_m.index)
        clc_forest = round(((dataset_m.loc[dataset_m['Target'] =="Forests", 'n'])/dataset['n'].sum())*100).tolist()
        clc_other_d = dataset_m.loc[dataset_m['Target'] !="Forests"]
        clc_other_ds = dataset_m.sort_values(by=['n'])
        clc_other = dataset_m['Target'].values[0]
        #if clc_forest > 90 or clc_forest < 10:
        #    pass
        if clc_nclasses == 5 or clc_nclasses < 2:
            pass
        # Combine the results in one dataframe
        data = pd.DataFrame({'geometry': row.geometry,'DomAbs': dominance_abs, 'DomType': dominant_type, 'HRLPerc': forest_perc, 'Density': cr_tcd_mn, 'CLCclasses': clc_nclasses, 'CLCperc': clc_forest, 'CLCother': clc_other})
        b1 = pd.concat([b1, data])   # DataFrame.append was removed in pandas 2.0
    return b1
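
The dominance index used in `calcMetric`, `(brd - conf) / (brd + conf)`, is easier to follow on a small example. The sketch below is a simplified stand-in, not the notebook's implementation: it works on a flat list of pixel values and assumes that the Dominant Leaf Type layer codes broadleaved pixels as 1, coniferous pixels as 2 and non-tree pixels as 0. The function name `leaf_type_metrics` is hypothetical.

```python
from collections import Counter

BROADLEAVED, CONIFEROUS = 1, 2   # assumed DLT class codes; 0 is taken as non-tree


def leaf_type_metrics(pixels):
    """Return (forest_perc, dominance_abs, dominant_type) for a flat list of DLT values."""
    counts = Counter(pixels)
    n_brd, n_conf = counts[BROADLEAVED], counts[CONIFEROUS]
    n_tree = n_brd + n_conf
    if n_tree == 0:
        return 0, None, None     # no tree pixels: the dominance is undefined
    # Ranges from -1 (only coniferous) to +1 (only broadleaved)
    dominance = (n_brd - n_conf) / n_tree
    dominant_type = "BroadLeaved" if dominance > 0 else "Coniferous"
    forest_perc = round(n_tree / len(pixels) * 100)
    return forest_perc, round(abs(dominance) * 100), dominant_type


# 30 broadleaved, 10 coniferous and 60 non-tree pixels
print(leaf_type_metrics([1] * 30 + [2] * 10 + [0] * 60))
```

An absolute dominance close to 100 indicates a nearly pure stand, a value close to 0 a mixed forest.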
    

## Scoring

Based on the metrics, a scoring system is needed. The metrics are scored with a uniform scheme that determines which potential sites are most useful for detecting the FCC target variable. There are three scoring schemes:\
- **Tree Density**: The tree density of the 16 ha site. The score is highest at approx. 50%, where both forested and non-forested pixels are available for training.\
- **CLC Classes**: Two to three classes are preferred. One class would be forest only, and more than three might introduce noise or make the classes difficult to discriminate.\
- **Tree Dominance**: The more dominant one tree type, the better. This reduces the probability of mixed forests.
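
The three schemes can also be written as small functions, which makes the band boundaries explicit. This is a sketch under the assumption that the inputs are integer percentages (or class counts); the band scores are the ones used in the lookup tables of this notebook, while the function names are hypothetical.

```python
DENSITY_BANDS = [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]       # one score per 10 % band
DOMINANCE_BANDS = [1, 2, 3, 4, 5]                     # one score per 20 % band
CLC_CLASS_SCORES = {1: 1, 2: 5, 3: 5, 4: 3, 5: 1}     # any other class count scores 0


def band_score(value, bands):
    """Score an integer percentage in 0..100; 0 scores 0, bands are 1-10, 11-20, ..."""
    if value == 0:
        return 0
    width = 100 // len(bands)
    return bands[(value - 1) // width]


def density_score(value):
    return band_score(value, DENSITY_BANDS)


def dominance_score(value):
    return band_score(value, DOMINANCE_BANDS)


def clc_class_score(n_classes):
    return CLC_CLASS_SCORES.get(n_classes, 0)


print(density_score(50), dominance_score(85), clc_class_score(3))
```

The density scheme peaks between 41% and 60%, the dominance scheme increases monotonically, and the class scheme favours two or three CLC classes.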


In [11]:
#Tree density and forest percentage
Values = list(range(0, 101))
Scores = np.concatenate([([i]*10) for i in [1,2,3,4,5,5,4,3,2,1]], axis=0)
Scores = np.insert(Scores, 0, 0)
score_density = pd.DataFrame({'Values': Values, 'Scores': Scores})

In [12]:
# CLC Classes
Values = list(range(0, 11))
Scores = [1,5,5,3,1,0,0,0,0,0]
Scores = np.insert(Scores, 0, 0)
score_nclc = pd.DataFrame({'Values': Values, 'Scores': Scores})

In [13]:
# Tree Dominance
Values = list(range(0, 101))
Scores = np.concatenate([([i]*20) for i in [1,2,3,4,5]], axis=0)
Scores = np.insert(Scores, 0, 0)
score_dom = pd.DataFrame({'Values': Values, 'Scores': Scores})

In [14]:
data_noscore = []
for file in sorted(glob.glob('/home/btufail@eurac.edu/Documents/SRR3_notebooks/AllShapes/*.geojson')):
    data_noscore.append(file)


In total, five scores are used for the final decision. While the **CLC Classes** and the **Tree Dominance** schemes are each applied once to their respective layer, the **Tree Density** scheme is applied to three separate layers: HRL tree percentage, CLC tree percentage and HRL tree density. The five scores are summed to the overall score `ScoreAll`. An overview can be seen in the following picture:

In [15]:
all = pd.DataFrame(columns = ['geometry', 'DomAbs', 'DomType', 'HRLPerc', 'Density', 'CLCclasses', 'CLCperc', 'CLCother', 'Scoredom', 'ScorePerc1', 'ScorePerc2', 'ScoreCLC', 'ScoreDens', 'ScoreAll', 'inTile'])
for file in data_noscore:
    check = gpd.read_file(file)
    poly_2 = []
    intile=[]
    for index, row in check.iterrows():
        poly = row['geometry']
        for index1, tile in eusalpbb_sites.iterrows():
            tile_poly = tile['geometry']
            within = tile_poly.contains(poly)
            if within:
                intile.append(tile['ID'])
                poly_2.append(row)
            else:
                pass
    poly_2 = pd.DataFrame(poly_2)
    DomAbs_list = poly_2.DomAbs.tolist()
    Scoredom = [int(score_dom.loc[score_dom['Values'] == int(i), 'Scores']) for i in DomAbs_list]
    HRLPerc_list = poly_2.HRLPerc.tolist()
    ScorePerc1 = [int(score_density.loc[score_density['Values'] == int(i), 'Scores']) for i in HRLPerc_list]
    CLCperc_list = poly_2.CLCperc.tolist()
    ScorePerc2 = [int(score_density.loc[score_density['Values'] == int(i), 'Scores']) for i in CLCperc_list]
    CLCclasses_list = poly_2.CLCclasses.tolist()
    ScoreCLC = [int(score_nclc.loc[score_nclc['Values'] == int(i), 'Scores']) for i in CLCclasses_list]
    Density_list = poly_2.Density.tolist()
    ScoreDens = [int(score_density.loc[score_density['Values'] == int(i), 'Scores']) for i in Density_list]
    ScoreAll = []
    for x in range (0, len (Scoredom)):  
        ScoreAll.append(Scoredom[x] + ScorePerc1[x] + ScorePerc2[x] + ScoreCLC[x] + ScoreDens[x])  
    polyscore_2 = poly_2.assign(Scoredom=Scoredom, ScorePerc1=ScorePerc1, ScorePerc2=ScorePerc2, ScoreCLC=ScoreCLC, ScoreDens=ScoreDens, ScoreAll=ScoreAll, inTile=intile)
    
    all = pd.concat([all, polyscore_2], ignore_index=True)   # DataFrame.append was removed in pandas 2.0
comb = gpd.GeoDataFrame(all, columns=['geometry', 'DomAbs', 'DomType', 'HRLPerc', 'Density', 'CLCclasses', 'CLCperc', 'CLCother', 'Scoredom', 'ScorePerc1', 'ScorePerc2', 'ScoreCLC', 'ScoreDens', 'ScoreAll', 'inTile']).set_crs('EPSG:3035')
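
The nested loop above tests every polygon against every 75 km square. Since both grids consist of axis-aligned squares in EPSG:3035, the containing square could alternatively be computed arithmetically. The following is a minimal sketch of that idea, not the method used in this notebook. The function `tile_id_for_square` and its parameters are hypothetical, and the sketch assumes that the IDs were assigned row by row starting at the lower-left corner (as in the gridding cell) and that the square lies inside the grid extent.

```python
import math


def tile_id_for_square(sq_min_x, sq_min_y, sq_size,
                       grid_min_x, grid_min_y, grid_max_x, tile_size):
    """Return the ID of the tile that fully contains the square, or None if it straddles tiles."""
    n_cols = math.ceil((grid_max_x - grid_min_x) / tile_size)   # tiles per row
    col_lo = math.floor((sq_min_x - grid_min_x) / tile_size)
    row_lo = math.floor((sq_min_y - grid_min_y) / tile_size)
    # An upper or right edge that lies exactly on the tile border still counts as contained
    col_hi = math.ceil((sq_min_x + sq_size - grid_min_x) / tile_size) - 1
    row_hi = math.ceil((sq_min_y + sq_size - grid_min_y) / tile_size) - 1
    if col_lo != col_hi or row_lo != row_hi:
        return None
    return row_lo * n_cols + col_lo


# Hypothetical grid: origin (0, 0), 300 km wide, 75 km tiles -> 4 tiles per row
print(tile_id_for_square(76_000, 151_000, 400, 0, 0, 300_000, 75_000))
print(tile_id_for_square(74_800, 0, 400, 0, 0, 300_000, 75_000))
```

This replaces one containment test per tile with a few divisions per polygon, which matters when millions of polygons are assigned.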

## Selection

Once every polygon has been scored, the complete data set needs to be reduced to the final 150 test sites.

### By Tiles

The first step is to reduce the tiles to the 150 with the most potential test sites. The tiles are sorted by the number of scored polygons they contain and the first 150 are kept:


In [17]:
unique, counts = np.unique((comb['inTile']), return_counts=True)
comb_bytile = pd.DataFrame({'ID': unique,'Polygons': counts})

In [18]:
polybytile = eusalpbb_sites.merge(comb_bytile, on='ID', how='left')
polybytile = polybytile.dropna()
polybytile = polybytile.sort_values(by=['Polygons'], ascending=False)
selectedTiles = polybytile.head(150)
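
The ranking criterion in this step is the number of scored polygons per tile, not the score itself. A minimal sketch of the same counting logic in plain Python, with made-up tile IDs and a hypothetical helper `top_tiles`:

```python
from collections import Counter


def top_tiles(in_tile_ids, n):
    """Return the n tile IDs with the most polygons as (tile_id, count) pairs."""
    return Counter(in_tile_ids).most_common(n)


# One entry per scored polygon, holding the ID of the tile that contains it
in_tile = [3, 3, 7, 3, 7, 9, 1, 1, 1, 1]
print(top_tiles(in_tile, 2))
```

In the notebook the same result is obtained with `np.unique(..., return_counts=True)`, a sort on `Polygons` and `head(150)`.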

### By Score

Next, within each selected tile only the potential test sites with the highest overall score (`ScoreAll`) are kept.

In [93]:
all2 = pd.DataFrame(columns = ['geometry', 'DomAbs', 'DomType', 'HRLPerc', 'Density', 'CLCclasses', 'CLCperc', 'CLCother', 'Scoredom', 'ScorePerc1', 'ScorePerc2', 'ScoreCLC', 'ScoreDens', 'ScoreAll', 'inTile'])
for index, row in selectedTiles.iterrows():
    sel = comb.loc[comb['inTile'] == row['ID']]                  # candidates of this tile (matched on the tile ID, not on the row index)
    sel3 = sel.loc[sel['ScoreAll'] == sel['ScoreAll'].max()]     # keep all candidates that tie for the best score
    all2 = pd.concat([all2, sel3])                               # DataFrame.append was removed in pandas 2.0
comb2 = gpd.GeoDataFrame(all2, columns=['geometry', 'DomAbs', 'DomType', 'HRLPerc', 'Density', 'CLCclasses', 'CLCperc', 'CLCother', 'Scoredom', 'ScorePerc1', 'ScorePerc2', 'ScoreCLC', 'ScoreDens', 'ScoreAll', 'inTile']).set_crs('EPSG:3035')
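
The filter applied per tile keeps every candidate that ties for the maximum score, so a tile can still contribute several polygons at this stage. A small sketch with plain dictionaries illustrates this; the helper `best_per_tile` and the sample records are made up for the example.

```python
def best_per_tile(records, selected_tiles):
    """Keep, for every selected tile, all records that tie for the highest ScoreAll."""
    best = []
    for tile_id in selected_tiles:
        in_tile = [r for r in records if r["inTile"] == tile_id]
        if not in_tile:
            continue                      # tile without scored candidates
        top = max(r["ScoreAll"] for r in in_tile)
        best.extend(r for r in in_tile if r["ScoreAll"] == top)
    return best


records = [
    {"inTile": 1, "ScoreAll": 20},
    {"inTile": 1, "ScoreAll": 23},
    {"inTile": 1, "ScoreAll": 23},
    {"inTile": 2, "ScoreAll": 18},
    {"inTile": 3, "ScoreAll": 25},
]
print(best_per_tile(records, selected_tiles=[1, 2]))
```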

## Final test sites

Finally, the test sites must be reduced to 150 for the VHR data, as agreed in the KO meeting. This is the actual extent for which we were able to obtain data. As the sites should be spread throughout the study area, one polygon is drawn at random for each of the previously selected tiles (`selectedTiles`).

In [130]:
all3 = pd.DataFrame(columns = ['geometry', 'DomAbs', 'DomType', 'HRLPerc', 'Density', 'CLCclasses', 'CLCperc', 'CLCother', 'Scoredom', 'ScorePerc1', 'ScorePerc2', 'ScoreCLC', 'ScoreDens', 'ScoreAll', 'inTile'])
for i in comb2['inTile'].unique():
    tosample = comb2.loc[comb2['inTile'] == i]
    smp = tosample.sample(1)
    all3 = pd.concat([all3, smp])   # DataFrame.append was removed in pandas 2.0
comb3 = gpd.GeoDataFrame(all3, columns=['geometry', 'DomAbs', 'DomType', 'HRLPerc', 'Density', 'CLCclasses', 'CLCperc', 'CLCother', 'Scoredom', 'ScorePerc1', 'ScorePerc2', 'ScoreCLC', 'ScoreDens', 'ScoreAll', 'inTile']).set_crs('EPSG:3035')
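
Because `sample(1)` is called without a seed, rerunning the cell can select different polygons. If a reproducible selection is wanted, a fixed seed can be used; pandas' `DataFrame.sample` accepts a `random_state` argument for this purpose. The sketch below shows the idea in plain Python with a hypothetical helper `sample_one_per_tile` and made-up records; the seed value is arbitrary.

```python
import random


def sample_one_per_tile(records, seed):
    """Draw one record per tile, reproducibly for a given seed."""
    rng = random.Random(seed)
    by_tile = {}
    for record in records:
        by_tile.setdefault(record["inTile"], []).append(record)
    return [rng.choice(group) for group in by_tile.values()]


records = [
    {"inTile": 1, "ScoreAll": 23, "name": "a"},
    {"inTile": 1, "ScoreAll": 23, "name": "b"},
    {"inTile": 2, "ScoreAll": 18, "name": "c"},
    {"inTile": 3, "ScoreAll": 25, "name": "d"},
    {"inTile": 3, "ScoreAll": 25, "name": "e"},
]
picked = sample_one_per_tile(records, seed=42)
print([r["inTile"] for r in picked])
```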

The `comb3` data set represents the final selected polygons, one per tile, and is the basis for the PlanetScope data request. This request was performed by Sinergise; the corresponding collection is hosted on the VITO back-end and [visible in the openEO Platform Collections](https://openeo.cloud/data-collections/).

The simple features vector file for the final selection is available in the [SRR-3 GitHub repository](notebooks/resources/UC8/vector_data/SuitableSitesVHR_selected_country.nc).

# Plotting selected final sites

In [2]:
import xarray as xr
import shapely.wkt
import geopandas as gpd
import folium

### Converting from NetCDF to a GeoDataFrame for ease of handling

In [3]:
final_sites = xr.open_dataset("/home/btufail@eurac.edu/Documents/SRR3_notebooks/notebooks/resources/UC8/Local/SuitableSitesVHR_selected_country.nc")

In [6]:
poly = final_sites.variables['ogc_wkt']
geometries = []
for i in poly:
    wkt_string = i.values.item(0).decode("utf-8")   # the WKT strings are stored as bytes
    geometries.append(shapely.wkt.loads(wkt_string))
gdf = gpd.GeoDataFrame(geometry=geometries, crs='EPSG:3035')   # build the GeoDataFrame once, after the loop
gdf_84 = gdf.to_crs('EPSG:4326')                               # reproject to WGS 84 for the web map

### Plotting

In [17]:
m = folium.Map(location=[47.14, 9.52], zoom_start=5, tiles="CartoDB positron")
for _, r in gdf_84.iterrows():
    sim_geo = gpd.GeoSeries(r["geometry"]).simplify(tolerance=0.001)
    geo_j = sim_geo.to_json()
    geo_j = folium.GeoJson(data=geo_j, style_function=lambda x: {"fillColor": "orange"})
    geo_j.add_to(m)
m

### Exporting as GeoJSON

In [65]:
gdf_84.to_file("/home/btufail@eurac.edu/Documents/SRR3_notebooks/notebooks/resources/UC8/Local/SuitableSitesVHR_selected_country.geojson", driver='GeoJSON')