<font face="Calibri" size="2"> <i>SBAE - Notebook Series - Part 3, version 0.2,  September 2022. Andreas Vollrath, UN-Food and Agricultural Organization, Rome</i>
</font>

![title](images/header.png)

# III A - SBAE - spatially balanced subsampling
### Extract a subset of samples from K-Means clusters 
-------

This notebook takes you through the process of creating a subsample of the time-series and change data retrieved in Notebook II. The objective is to obtain a statistically balanced subsample that can be used for training data collection and that ideally includes a higher percentage of rare classes, such as deforestation, degradation and gain, than a purely random subsample would.

In [None]:
### Load libraries

In [None]:
# data management
import numpy as np
import pandas as pd
import geopandas as gpd

# clustering
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# plotting
import seaborn as sns
import matplotlib.pyplot as plt

# sbae internal functionality  
import helpers as h

### 1 Load geopackage results file

The first step is to load the results file from Notebook II of the SBAE notebook series. This file should contain the outputs from various time-series algorithms and may additionally hold extracts from 

In [None]:
# a GeoPackage is read with geopandas, not with pd.read_pickle
df = gpd.read_file('results_geopackage_file.gpkg')

print('Available Columns')
df.columns

### 2 Select relevant columns for creating the clusters

Not all columns in the loaded data should go into the clustering process. The point_id, for example, tells us nothing about the statistical distribution with regard to change. The cell below contains a pre-selection of columns that potentially carry information on change and should therefore help to create meaningful clusters for the later subsampling.
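
Before clustering, it can help to confirm that every selected column is actually present in the loaded table and holds numeric values, since K-Means cannot work with text columns. The following is a minimal sketch on a made-up table. The helper name `check_cluster_columns` and the toy values are illustrative assumptions and are not part of the SBAE `helpers` module.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_cluster_columns(df, cols):
    """Split cols into usable, missing and non-numeric column names.

    Hypothetical helper for illustration only.
    """
    missing = [c for c in cols if c not in df.columns]
    non_numeric = [
        c for c in cols if c in df.columns and not is_numeric_dtype(df[c])
    ]
    usable = [c for c in cols if c not in missing and c not in non_numeric]
    return usable, missing, non_numeric


# toy table standing in for the results file (values are made up)
toy = pd.DataFrame({
    'point_id': [1, 2, 3],
    'elevation': [120.0, 340.5, 95.2],
    'ts_mean': [0.61, 0.48, 0.72],
    'label': ['a', 'b', 'c'],
})

usable, missing, non_numeric = check_cluster_columns(
    toy, ['elevation', 'ts_mean', 'label', 'bfast_magnitude']
)
print('usable:', usable)
print('missing:', missing)
print('non-numeric:', non_numeric)
```

Running the same check on the real table with `cols_to_cluster` would flag columns that were not produced in Notebook II before they cause an error further down.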


In [None]:
# select the columns that are used by K-Means
cols_to_cluster = [
    'mon_images',
    'elevation',
    'dw_class_mode', 'dw_tree_prob__max',
    'dw_tree_prob__min', 'dw_tree_prob__stdDev', 'dw_tree_prob_mean',
    'bfast_magnitude', 'bfast_means', 
    #'lang_tree_height', 
    'potapov_tree_height',
    'ccdc_magnitude',
    'ltr_magnitude', 'ltr_dur', 'ltr_rate', 
    'cusum_confidence', 'cusum_magnitude', 
    'ts_mean', 'ts_sd', 'ts_min', 'ts_max', 
    'bs_slope_mean', 'bs_slope_sd', 'bs_slope_min', 'bs_slope_max'
]

### 3 Check for NaNs

The clustering process does not accept NaNs in any of the fields. There are 2 strategies:

1. Remove all rows that contain any NaNs
2. Replace all NaNs with a number
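
The two strategies can be sketched on a small made-up table as follows. The toy values, the choice of 0 as fill value and the step that turns infinite values into NaNs are assumptions made for illustration, not settings prescribed by the notebook.

```python
import numpy as np
import pandas as pd

# toy table with gaps (values are made up)
toy = pd.DataFrame({
    'ts_mean': [0.6, np.nan, 0.7, 0.5],
    'bfast_magnitude': [-0.2, 0.1, np.inf, 0.0],
})
cols = ['ts_mean', 'bfast_magnitude']

# treat infinite values as missing so both strategies handle them too
clean = toy[cols].replace([np.inf, -np.inf], np.nan)

# strategy 1: drop every row that has at least one NaN
dropped = clean.dropna()

# strategy 2: replace NaNs with a number (0 here, an arbitrary choice)
filled = clean.fillna(0)

print(f'original: {len(toy)}, dropped: {len(dropped)}, filled: {len(filled)}')
```

Dropping rows shrinks the pool of points available for subsampling, while filling keeps every point at the cost of an artificial value that may influence which cluster the point ends up in.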

In [None]:
print(f' Length of original dataframe: {len(df)}')
df_1 = df.copy()
print(f' Length of nan-removed dataframe: {len(df_1[cols_to_cluster].dropna())}')

for col in cols_to_cluster:
    print(f' Column {col} contains {df_1[col].isna().sum()} NaNs')
    # print(f' Column {col} contains {df_1[col].isin([np.inf, -np.inf]).sum()} Infinites')

# 2 K-Means Clustering

In [None]:
nr_of_cluster=50

# run kmeans
kmeans = KMeans(n_clusters=nr_of_cluster, random_state=42).fit(df[cols_to_cluster])

#------------------------------------------------
# Standardize the data
#X_std = StandardScaler().fit_transform(df[cols_to_cluster])
# run kmeans with standardized data
#kmeans = KMeans(n_clusters=nr_of_cluster, random_state=42).fit(X_std)
#------------------------------------------------

# add the cluster column
df['Kmeans'] = kmeans.predict(df[cols_to_cluster])

# print number of points per clusters
clusters, counts = np.unique(df.Kmeans, return_counts=True)
print(clusters)
print(counts)

# plot data
pd.DataFrame({'counts': counts}).plot(kind='bar', title='Nr. of Points per cluster', figsize=(10,5))
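
The commented-out lines in the cell above offer standardization as an alternative. The sketch below illustrates, with NumPy only and made-up numbers, why this can matter: K-Means relies on Euclidean distances, so a column with large values (such as elevation in metres) can dominate columns with small values. The z-score computed here is the transformation `StandardScaler` applies with its default settings. The feature values and the small `dist` function are assumptions for illustration.

```python
import numpy as np

# two made-up features on very different scales
elevation = np.array([100.0, 900.0, 150.0, 850.0])  # metres
ts_sd = np.array([0.02, 0.03, 0.30, 0.31])          # index units
X = np.column_stack([elevation, ts_sd])

# z-score: subtract the column mean and divide by the column standard
# deviation (population form, ddof=0)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)


def dist(a, b):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(a - b))


# points 0 and 1 differ mostly in elevation, points 0 and 2 mostly in ts_sd
print('raw: d(0,1) =', round(dist(X[0], X[1]), 2),
      ' d(0,2) =', round(dist(X[0], X[2]), 2))
print('std: d(0,1) =', round(dist(X_std[0], X_std[1]), 2),
      ' d(0,2) =', round(dist(X_std[0], X_std[2]), 2))
```

On the raw values the difference in `ts_sd` is practically invisible next to the elevation difference, whereas after standardization both differences carry comparable weight.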

# 3 Plots

## 3.1 Plot Statistics of each cluster

In [None]:
cols_to_plot = cols_to_cluster

# alternatively, plot only a subset of the columns
#cols_to_plot = [
#    'mon_images',
#    'cusum_confidence', 'cusum_magnitude', 
#    'ts_mean', 'ts_sd', 'ts_min', 'ts_max', 
#    'bs_slope_mean', 'bs_slope_sd', 'bs_slope_min', 'bs_slope_max'
#]


fig, axs = h.plot_stats_per_class(df, 'Kmeans', cols_to_plot)

## 3.2 Highlight a specific cluster on a map

In [None]:
cluster_to_highlight = 11

fig, ax = plt.subplots(1, 1, figsize=(12, 12))
df.plot(ax=ax, column='Kmeans', legend=True, markersize=.1)
df[df['Kmeans']==cluster_to_highlight].plot(ax=ax, markersize=5, facecolor='red')
plt.tight_layout()


# 4 Select subset of samples for each cluster

In [None]:
nr_of_samples_per_cluster = 25

# draw up to nr_of_samples_per_cluster points from each cluster;
# clusters with fewer points are taken in full
subsets = []
for cluster in df.Kmeans.unique():
    cluster_df = df[df.Kmeans == cluster]
    n = min(len(cluster_df), nr_of_samples_per_cluster)
    subsets.append(cluster_df.sample(n))

# concatenate once instead of growing an empty dataframe inside the loop
subset_df = pd.concat(subsets)

print(f'{len(subset_df)} samples have been selected in total')

fig, ax = plt.subplots(1, 1, figsize=(10, 10))
subset_df = gpd.GeoDataFrame(subset_df, geometry='geometry')
subset_df.plot(column='Kmeans', ax=ax, legend=True, markersize=5)
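
To illustrate the effect of sampling per cluster on rare classes, the sketch below compares it with a purely random draw of the same size on made-up data. The function name `sample_per_cluster`, the `is_change` column, the fixed seed and the cluster sizes are assumptions for illustration. Real clusters will not separate change from stable points this cleanly.

```python
import numpy as np
import pandas as pd


def sample_per_cluster(df, cluster_col, n, seed=42):
    """Draw up to n rows from each cluster; small clusters are kept whole."""
    parts = []
    for _, group in df.groupby(cluster_col):
        parts.append(group.sample(min(len(group), n), random_state=seed))
    return pd.concat(parts)


# toy data: 950 stable points in cluster 0, 50 change points in cluster 1
toy = pd.DataFrame({
    'point_id': np.arange(1000),
    'Kmeans': [0] * 950 + [1] * 50,
    'is_change': [False] * 950 + [True] * 50,
})

balanced = sample_per_cluster(toy, 'Kmeans', n=20)
random_subset = toy.sample(len(balanced), random_state=42)

print('share of change points, per-cluster sampling:',
      balanced.is_change.mean())
print('share of change points, random sampling:',
      random_subset.is_change.mean())
```

Passing a seed makes the selection reproducible, which is useful when the same subset has to be regenerated later.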

# 5 Convert to CEO file

In [None]:
out_csv_file = 'path/to/subset_results.csv'

# subset_df is already a GeoDataFrame, so the geometry can be accessed directly
subset_df['LON'] = subset_df.geometry.x
subset_df['LAT'] = subset_df.geometry.y
subset_df['PLOTID'] = subset_df.point_id

cols = subset_df.columns.tolist()
cols = [e for e in cols if e not in ('LON', 'LAT', 'PLOTID')]
new_cols = ['LON', 'LAT', 'PLOTID'] + cols
subset_df = subset_df[new_cols]
subset_df.to_csv(out_csv_file, index=False)
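
A quick check of the exported table can catch problems before the file is uploaded. The sketch below mirrors the column reordering from the cell above on a made-up table and writes to an in-memory buffer instead of a file. The toy values are invented, and the coordinate range check assumes geographic coordinates in degrees (for example EPSG:4326), which the notebook does not state explicitly.

```python
import io

import pandas as pd

# toy subset with coordinates already extracted (values are made up)
toy = pd.DataFrame({
    'point_id': [11, 12],
    'Kmeans': [3, 7],
    'LON': [12.48, 12.51],
    'LAT': [41.89, 41.92],
})
toy['PLOTID'] = toy['point_id']

# put the three leading columns first, keep all others in their order
lead = ['LON', 'LAT', 'PLOTID']
toy = toy[lead + [c for c in toy.columns if c not in lead]]

# write to an in-memory buffer and read it back for checking
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# basic sanity checks on the exported table
ids_unique = check['PLOTID'].is_unique
coords_in_range = bool(
    check['LON'].between(-180, 180).all()
    and check['LAT'].between(-90, 90).all()
)
print(check.columns.tolist())
print('unique plot ids:', ids_unique, '| coordinates in range:', coords_in_range)
```

If the coordinate check fails on real data, the points are probably stored in a projected coordinate system and would need to be reprojected before export.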