## Description

This notebook extracts training data from the Open Data Cube (ODC) using the geometries in a GeoJSON file. The dataset uses the NNI level labels in the 'data/nni_training_egypt.geojson' file.

In [1]:
%matplotlib inline

import os
import datacube
import numpy as np
import geopandas as gpd
from sklearn import model_selection
from datacube.utils.geometry import assign_crs

import sys
sys.path.insert(1, '../')
from tools.plotting import map_shapefile
from tools.bandindices import calculate_indices
from tools.datahandling import mostcommon_crs
from tools.classification import collect_training_data

import warnings
warnings.filterwarnings("ignore")

In [2]:
#Connect to the datacube
dc = datacube.Datacube(app='Sentinel-2')

## Analysis parameters
* path : The path to the input vector file from which we will extract training data.
* field : The name of the column in the vector file's attribute table that contains the class labels. The class labels must be integers.

In [3]:
path = 'data/nni_training_egypt.geojson' 
field = 'class'
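Because the class labels must be integers, it can help to check the attribute before running the extraction. The following is a minimal sketch using only the standard library; `check_integer_labels` is a hypothetical helper (not part of the notebook's tools) and the inline GeoJSON is made-up sample data.

```python
import json

def check_integer_labels(geojson_text, field):
    """Return the sorted class labels, raising if any label is not an integer."""
    collection = json.loads(geojson_text)
    labels = set()
    for i, feature in enumerate(collection["features"]):
        value = feature["properties"].get(field)
        # bool is a subclass of int in Python, so exclude it explicitly
        if not isinstance(value, int) or isinstance(value, bool):
            raise ValueError(f"feature {i}: {field!r} is {value!r}, expected an integer")
        labels.add(value)
    return sorted(labels)

# Made-up sample data for illustration only
sample = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"class": 3}, "geometry": None},
        {"type": "Feature", "properties": {"class": 0}, "geometry": None},
    ],
})
labels = check_integer_labels(sample, "class")
```

For the real file, the same check could be run on the text read from `path`.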

## Find the number of CPUs

In [4]:
# ncpus could be calculated with round(get_cpu_quota()); here we set it to 1
ncpus = 1
print('ncpus = '+str(ncpus))

ncpus = 1
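If we wanted to detect the CPU count rather than hard-code it, one option is the standard library. This is only a sketch: `available_cpus` is a hypothetical helper, and it does not account for container CPU quotas, which is presumably what `get_cpu_quota` handles.

```python
import os

def available_cpus():
    """Best-effort count of CPUs usable by this process (always at least 1)."""
    try:
        # Available on Linux: respects the process CPU affinity mask
        return max(1, len(os.sched_getaffinity(0)))
    except AttributeError:
        # Other platforms: fall back to the total logical CPU count
        return max(1, os.cpu_count() or 1)

detected_cpus = available_cpus()
print('detected cpus = ' + str(detected_cpus))
```

The result is stored in a separate name so that the `ncpus = 1` setting used in the rest of the notebook is left unchanged.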


# Preview input data
We can load and preview our input data using geopandas.

In [5]:
# Load the input vector data
input_data = gpd.read_file(path)

# Show the first five rows
input_data.head()

|   | class | geometry |
|---|-------|----------|
| 0 | 3 | POLYGON ((30.71267 30.50485, 30.71085 30.50450... |
| 1 | 3 | POLYGON ((31.97251 30.45663, 31.97206 30.45661... |
| 2 | 4 | POLYGON ((31.96894 30.46447, 31.96850 30.46445... |
| 3 | 1 | POLYGON ((31.92517 30.44214, 31.92481 30.44212... |
| 4 | 0 | POLYGON ((31.92315 30.45008, 31.92272 30.45006... |


In [6]:
# Plot training data in an interactive map
map_shapefile(input_data, attribute=field)

# Extracting training data
The function collect_training_data takes our GeoJSON containing class labels and extracts training data (features) from the datacube over the locations specified by the input geometries. The function also pre-processes the training data by stacking the arrays into a useful format and removing any NaN or inf values. The following variable can be set within the collect_training_data function:

* zonal_stats: An optional string giving the names of zonal statistics to calculate across each polygon (if the geometries in the vector file are polygons and not points). Default is None (all pixel values are returned). Supported values are 'mean', 'median', 'max', and 'min'.
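To illustrate what a zonal statistic does, the sketch below collapses the pixel values of one band inside one polygon into a single number. It is a simplified, standard-library illustration of the idea, not the implementation used by collect_training_data; `zonal_stat` and the pixel values are made up for the example.

```python
import math
import statistics

# The four statistics named above, mapped to standard-library functions
ZONAL_FUNCS = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "max": max,
    "min": min,
}

def zonal_stat(pixel_values, stat):
    """Collapse one band's pixel values inside a polygon to a single number."""
    # Ignore missing pixels so they do not contaminate the statistic
    valid = [v for v in pixel_values if not math.isnan(v)]
    if not valid:
        return math.nan
    return ZONAL_FUNCS[stat](valid)

pixels = [0.5, 0.7, math.nan, 0.6]  # made-up band values inside one polygon
mean_value = zonal_stat(pixels, "mean")
max_value = zonal_stat(pixels, "max")
```

With a zonal statistic set, each polygon contributes one row of features; with None, every pixel inside the polygon becomes its own row.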

In addition to the zonal_stats parameter, we also need to set up a query dictionary for the Open Data Cube, with options such as measurements (the bands to load from the satellite), resolution (the cell size), and output_crs (the output projection). This dictionary is passed to collect_training_data through the dc_query parameter, i.e. collect_training_data(dc_query=query, ...). It is also passed on to the feature layer function, which we will define and describe in a moment.

In [7]:
#set up our inputs to collect_training_data
zonal_stats = 'mean'

# Set up the inputs for the ODC query
# Create a reusable query
query = {
    'time': '2022',
    'resolution': (-20, 20),
    'measurements': ['red', 'green', 'red_edge_1', 'red_edge_2', 'red_edge_3', 'nir']
}

# Identify the most common projection system in the input query
output_crs = mostcommon_crs(dc=dc, product='gm_s2_annual', query=query)
print(output_crs)

query.update({"output_crs": output_crs})

epsg:6933
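The internals of mostcommon_crs are not shown in this notebook. Conceptually, picking the most common projection amounts to counting the CRS of each dataset that matches the query and taking the most frequent one. A minimal sketch of that idea, with a hypothetical helper name and made-up CRS values:

```python
from collections import Counter

def most_common_crs(crs_list):
    """Return the CRS string that occurs most often in crs_list."""
    if not crs_list:
        raise ValueError("no datasets matched the query")
    # most_common(1) returns a list holding one (value, count) pair
    return Counter(crs_list).most_common(1)[0][0]

# Made-up CRS values, one per matching dataset
dataset_crs = ["epsg:6933", "epsg:32636", "epsg:6933"]
chosen = most_common_crs(dataset_crs)
```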


## Defining feature layers
To create the desired feature layers, we pass instructions to collect_training_data through the feature_func parameter.

feature_func: A function for generating feature layers that is applied to the data within the bounds of the input geometry. In this notebook, the feature_func accepts the datacube connection and a dc_query dictionary, and must return a single xarray.Dataset or xarray.DataArray containing 2D coordinates (i.e. x, y, with no time dimension). For example:

    def feature_function(dc, query):
        ds = dc.load(**query)
        ds = ds.mean('time')
        return ds

In [8]:
def feature_layers(dc, query):
    # Load the Sentinel-2 annual geomedian
    ds = dc.load(product='gm_s2_annual',
                 **query)
    # Calculate band indices; drop=True removes the original bands
    ds = calculate_indices(ds,
                           index=['NDVI', 'NDCI', 'IRECI', 'MTCI', 'OTCI', 'MCARI',
                                  'CI_RedEdge', 'CI_GreenEdge', 'TCARI', 'OSAVI', 'TCARI_OSAVI'],
                           drop=True,
                           satellite_mission='s2')
    
    return ds
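As an illustration of what a band index is, the sketch below computes NDVI, the first index in the list, from its standard definition (NIR - Red) / (NIR + Red). It works on single made-up reflectance values rather than xarray data and is not the code inside calculate_indices; the function names are hypothetical.

```python
def normalised_difference(a, b):
    """Return (a - b) / (a + b), or NaN when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else float("nan")

def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red)."""
    return normalised_difference(nir, red)

# Made-up surface reflectance values for one pixel
value = ndvi(nir=0.45, red=0.15)
```

Values close to 1 indicate dense green vegetation, while values near 0 or below indicate bare soil or water.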

Run the collect_training_data function

In [9]:
column_names, model_input = collect_training_data(
    gdf=input_data,
    dc=dc,
    dc_query=query,
    field=field,
    ncpus=ncpus,
    zonal_stats=zonal_stats,
    feature_func=feature_layers
)

Taking zonal statistic: mean
Collecting training data in serial mode
Dropping bands ['red', 'green', 'red_edge_1', 'red_edge_2', 'red_edge_3', 'nir']

In [10]:
print(column_names)
print('')
print(np.array_str(model_input, precision=2, suppress_small=True))

['class', 'NDVI', 'NDCI', 'IRECI', 'MTCI', 'OTCI', 'MCARI', 'CI_RedEdge', 'CI_GreenEdge', 'TCARI', 'OSAVI', 'TCARI_OSAVI']

[[3.   0.68 0.3  0.72 2.74 5.47 0.06 1.75 3.46 0.18 0.58 0.31]
 [3.   0.51 0.19 0.44 2.39 4.78 0.05 1.03 2.24 0.15 0.44 0.34]
 [4.   0.51 0.18 0.45 2.61 5.22 0.04 1.15 2.47 0.12 0.44 0.28]
 [1.   0.33 0.13 0.27 1.55 3.1  0.05 0.51 1.4  0.14 0.3  0.47]
 [0.   0.25 0.11 0.18 1.1  2.2  0.04 0.3  1.07 0.12 0.21 0.55]
 [5.   0.36 0.14 0.25 1.71 3.42 0.04 0.59 1.47 0.12 0.31 0.4 ]
 [5.   0.3  0.11 0.2  1.68 3.35 0.03 0.47 1.29 0.09 0.25 0.35]
 [2.   0.48 0.21 0.33 1.65 3.3  0.05 0.78 2.16 0.16 0.4  0.41]
 [3.   0.6  0.27 0.43 2.02 4.03 0.06 1.15 2.8  0.17 0.48 0.35]
 [5.   0.38 0.15 0.27 1.76 3.51 0.04 0.61 1.6  0.13 0.32 0.4 ]
 [5.   0.47 0.17 0.39 2.3  4.59 0.05 0.95 2.11 0.14 0.41 0.34]
 [4.   0.47 0.18 0.41 2.21 4.42 0.05 0.89 2.   0.15 0.41 0.37]
 [4.   0.49 0.18 0.45 2.43 4.86 0.05 0.98 2.2  0.15 0.43 0.35]
 [2.   0.49 0.17 0.38 2.42 4.84 0.04 1.01 2.24 0.12 0.41 

## Create test and training datasets

In [11]:
# Split into training and testing data
model_train, model_test = model_selection.train_test_split(model_input, 
                                                           stratify=model_input[:, 0],
                                                           train_size=0.8, 
                                                           random_state=0)
print("Train shape:", model_train.shape)
print("Test shape:", model_test.shape)

Train shape: (46, 12)
Test shape: (12, 12)
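Because the split is stratified on the class column, each class should appear in similar proportions in both sets. One way to check this is to count the labels in the first column of each set. The sketch below uses a hypothetical helper and made-up rows; in the notebook it could be called on model_train and model_test.

```python
from collections import Counter

def class_counts(rows):
    """Count samples per class, reading the label from the first column."""
    counts = Counter(int(row[0]) for row in rows)
    return dict(sorted(counts.items()))

# Made-up rows: class label first, then feature values
train_rows = [[3.0, 0.68], [3.0, 0.51], [4.0, 0.51], [1.0, 0.33]]
test_rows = [[3.0, 0.60], [4.0, 0.47]]
train_counts = class_counts(train_rows)
test_counts = class_counts(test_rows)
```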


In [12]:
# Set the names and locations of the output files
output_train_file = "results/training_data.txt"
output_test_file = "results/test_data.txt"
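Note that np.savetxt does not create missing directories, so the export fails if the results folder does not exist. A small sketch of how the folder could be created first, using a hypothetical helper built on the standard library:

```python
import os

def ensure_parent_dir(file_path):
    """Create the directory that will hold file_path if it does not exist yet."""
    parent = os.path.dirname(file_path)
    if parent:  # an empty string means the current directory
        os.makedirs(parent, exist_ok=True)
    return parent
```

In this notebook it would be called as ensure_parent_dir(output_train_file) and ensure_parent_dir(output_test_file) before the export step.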

In [13]:
#grab all columns
model_col_indices = [column_names.index(var_name) for var_name in column_names]
#Export files to disk
np.savetxt(output_train_file, model_train[:, model_col_indices], header=" ".join(column_names), fmt="%4f")
np.savetxt(output_test_file, model_test[:, model_col_indices], header=" ".join(column_names), fmt="%4f")