# Example 2 - Sand labelling

<img src="images/banner3.png" width="100%" />

<font face="Calibri">
<br>
<font size="5"> <b>Sand clustering with Silhouette Analysis and KMeans notebook</b></font>

<br>
<font size="4"> <b> Nicolas Pucino; PhD Student @ Deakin University, Australia </b> <br>
<img style="padding:7px;" src="images/sandpiper_sand_retouched.png" width="170" align="right" /></font>

<font size="3">This notebook illustrates how to use Sandpyper to perform Silhouette Analysis and KMeans clustering on all previously extracted points. <br>

<b>This notebook covers the following concepts:</b>

- Silhouette Analysis.
- KMeans clustering.
</font>


</font>

In [1]:
import pandas as pd
import geopandas as gpd
import numpy as np

from sandpyper.outils import coords_to_points 
from sandpyper.labels import get_sil_location, get_opt_k, kmeans_sa

pd.options.mode.chained_assignment = None  # default='warn'

Loading the project-related lists

- loc codes
- crs dict string

In [2]:
# The location codes used throughout the analysis
loc_codes=["mar","leo"]

# The Coordinate Reference Systems used throughout this study
crs_dict_string= {
                 'mar': {'init': 'epsg:32754'},
                 'leo': {'init': 'epsg:32755'},
                 }

## Loading, merging and preparing the tables

The function __get_merged_table__ merges the rgb and z tables and formats the result so that it is ready for further analysis. In this notebook, the same steps are carried out explicitly with pandas.

In [3]:
%%time

#Loading the tables

rgb_table_path=r"C:\my_packages\doc_data\profiles\rgb.csv"
z_table_path=r"C:\my_packages\doc_data\profiles\elevation.csv"

rgb_table=gpd.read_file(rgb_table_path)
z_table=gpd.read_file(z_table_path)

# As the distance (across-transect) comes from an interpolation, it has too many digits.
# Let's round the distance column of both tables to 2 decimal places and set its data type to "float".

rgb_table["distance"]=np.round(rgb_table.loc[:,"distance"].values.astype("float"),2)
z_table["distance"]=np.round(z_table.loc[:,"distance"].values.astype("float"),2)

Wall time: 47.5 s
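
To see what the rounding step does, here is a minimal sketch with made-up distance values (they are not taken from the survey tables). It shows that `np.round(..., 2)` keeps two decimal places, whatever the magnitude of the number, and removes the floating-point noise that interpolation can leave behind.

```python
import numpy as np

# Hypothetical interpolated distances carrying floating-point noise.
distances = np.array([0.1 + 0.2, 12.345678, 123.456])

rounded = np.round(distances, 2)

# Two decimal places are kept, whatever the magnitude of the value.
print(rounded)
```

Rounding both tables in the same way keeps their distance columns comparable.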


Storing GeoDataFrames as CSV is handy, but __we lose the column data type information__.
Especially important is the __geometry column__, which we need to convert back into __Shapely Point objects__.
To do that, the function __coords_to_points__ can be applied to the 'coordinates' Series, storing the result in the 'geometry' column. It can take quite a bit of time, so, if you have a lot of points, get ready!
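
As an illustration of what this conversion involves, the sketch below parses a WKT point string, of the kind stored in the `coordinates` column, into a pair of floats using only the standard library. The helper `parse_wkt_point` is hypothetical and is not how __coords_to_points__ is implemented; with Shapely installed, `shapely.wkt.loads` does this parsing and returns a Point object.

```python
import re

# Matches strings such as "POINT (731628.61 5705614.83)".
_WKT_POINT = re.compile(r"^\s*POINT\s*\(\s*([^\s()]+)\s+([^\s()]+)\s*\)\s*$")

def parse_wkt_point(wkt_string):
    """Return the (x, y) floats encoded in a WKT point string."""
    match = _WKT_POINT.match(wkt_string)
    if match is None:
        raise ValueError(f"Not a WKT point: {wkt_string!r}")
    return float(match.group(1)), float(match.group(2))

x, y = parse_wkt_point("POINT (731628.6116079447 5705614.83224069)")
print(x, y)
```

Because every row has to be parsed from text, the cost grows with the number of points, which is why the conversion can be slow on large tables.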

In [4]:
rgb_table['geometry']=rgb_table.coordinates.apply(coords_to_points)
z_table['geometry']=z_table.coordinates.apply(coords_to_points)

z_table.head()

Unnamed: 0,distance,z,tr_id,raw_date,coordinates,location,survey_date,point_id,x,y,geometry
0,0.0,,26,20190516,POINT (731628.6116079447 5705614.83224069),mar,2019-05-16,66124091m2540700ar00,731628.6116079447,5705614.8322406905,POINT (731628.612 5705614.832)
1,0.1,,26,20190516,POINT (731628.6963366535 5705614.779127171),mar,2019-05-16,66125091m2530510ar00,731628.6963366535,5705614.779127171,POINT (731628.696 5705614.779)
2,0.2,,26,20190516,POINT (731628.7810653624 5705614.726013653),mar,2019-05-16,66126091m2520420ar00,731628.7810653624,5705614.726013653,POINT (731628.781 5705614.726)
3,0.3,,26,20190516,POINT (731628.8657940712 5705614.672900134),mar,2019-05-16,66127091m2510230ar00,731628.8657940712,5705614.672900134,POINT (731628.866 5705614.673)
4,0.4,,26,20190516,POINT (731628.9505227801 5705614.619786615),mar,2019-05-16,66128091m2500140ar00,731628.9505227801,5705614.619786615,POINT (731628.951 5705614.620)


In [5]:
# Here, we merge the two tables (storing elevation and rgb information)

data_merged = pd.merge(z_table,rgb_table[["band1","band2","band3","point_id"]],on="point_id",validate="one_to_one")

# replace empty values with np.nan (the np.NaN alias was removed in NumPy 2.0)
data_merged=data_merged.replace("", np.nan)

# and convert the z column into floats.
data_merged['z']=data_merged.z.astype("float")

data_merged.head()

Unnamed: 0,distance,z,tr_id,raw_date,coordinates,location,survey_date,point_id,x,y,geometry,band1,band2,band3
0,0.0,,26,20190516,POINT (731628.6116079447 5705614.83224069),mar,2019-05-16,66124091m2540700ar00,731628.6116079447,5705614.8322406905,POINT (731628.612 5705614.832),,,
1,0.1,,26,20190516,POINT (731628.6963366535 5705614.779127171),mar,2019-05-16,66125091m2530510ar00,731628.6963366535,5705614.779127171,POINT (731628.696 5705614.779),,,
2,0.2,,26,20190516,POINT (731628.7810653624 5705614.726013653),mar,2019-05-16,66126091m2520420ar00,731628.7810653624,5705614.726013653,POINT (731628.781 5705614.726),,,
3,0.3,,26,20190516,POINT (731628.8657940712 5705614.672900134),mar,2019-05-16,66127091m2510230ar00,731628.8657940712,5705614.672900134,POINT (731628.866 5705614.673),,,
4,0.4,,26,20190516,POINT (731628.9505227801 5705614.619786615),mar,2019-05-16,66128091m2500140ar00,731628.9505227801,5705614.619786615,POINT (731628.951 5705614.620),,,
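
The `validate="one_to_one"` argument used in the merge above is worth a closer look. The following sketch, on two tiny made-up tables (`z_small` and `rgb_small` are hypothetical), shows that the merge succeeds when every `point_id` is unique in both tables and raises `pandas.errors.MergeError` as soon as one key is duplicated.

```python
import pandas as pd

# Hypothetical miniature versions of the elevation and rgb tables.
z_small = pd.DataFrame({"point_id": ["a", "b", "c"], "z": [1.0, 1.1, 1.2]})
rgb_small = pd.DataFrame({"point_id": ["a", "b", "c"], "band1": [140, 150, 160]})

merged = pd.merge(z_small, rgb_small, on="point_id", validate="one_to_one")
print(merged)

# A duplicated key in either table makes the same call fail loudly.
rgb_dupl = pd.concat([rgb_small, rgb_small.iloc[[0]]], ignore_index=True)
try:
    pd.merge(z_small, rgb_dupl, on="point_id", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```

Failing early is preferable here, because a silently duplicated point would otherwise be counted twice during clustering.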


In [9]:
# Here, we add two features, slope and curvature, computed from the elevation series,
# in case we want to use them for KMeans clustering.
# Note that slope and curvature are wrong where one transect ends and the next one begins.
# However, we will clip those areas, as they are in the water or in the backdune.

data_merged["slope"]=np.gradient(data_merged.z)
data_merged["curve"]=np.gradient(data_merged.slope)

data_merged.head()

Unnamed: 0,distance,z,tr_id,raw_date,coordinates,location,survey_date,point_id,x,y,geometry,band1,band2,band3,slope,curve
0,0.0,,26,20190516,POINT (731628.6116079447 5705614.83224069),mar,2019-05-16,66124091m2540700ar00,731628.6116079447,5705614.8322406905,POINT (731628.612 5705614.832),,,,,
1,0.1,,26,20190516,POINT (731628.6963366535 5705614.779127171),mar,2019-05-16,66125091m2530510ar00,731628.6963366535,5705614.779127171,POINT (731628.696 5705614.779),,,,,
2,0.2,,26,20190516,POINT (731628.7810653624 5705614.726013653),mar,2019-05-16,66126091m2520420ar00,731628.7810653624,5705614.726013653,POINT (731628.781 5705614.726),,,,,
3,0.3,,26,20190516,POINT (731628.8657940712 5705614.672900134),mar,2019-05-16,66127091m2510230ar00,731628.8657940712,5705614.672900134,POINT (731628.866 5705614.673),,,,,
4,0.4,,26,20190516,POINT (731628.9505227801 5705614.619786615),mar,2019-05-16,66128091m2500140ar00,731628.9505227801,5705614.619786615,POINT (731628.951 5705614.620),,,,,
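
The comment in the cell above notes that slope and curvature are wrong at the boundary between transects. If those boundary values matter for your use case, one possible alternative is to compute the gradient within each transect. The sketch below uses a made-up two-transect table; grouping by `tr_id` alone is an assumption, and in the real table the grouping keys would probably also need to include the location and survey date.

```python
import numpy as np
import pandas as pd

# Hypothetical two-transect table: a sloping transect, then a flat one.
profiles = pd.DataFrame({
    "tr_id": [0, 0, 0, 1, 1, 1],
    "z": [0.0, 1.0, 2.0, 10.0, 10.0, 10.0],
})

# Gradient over the whole column: the jump between transects leaks into the slope.
profiles["slope_global"] = np.gradient(profiles["z"].to_numpy())

# Gradient computed separately within each transect.
profiles["slope_by_transect"] = profiles.groupby("tr_id")["z"].transform(
    lambda z: np.gradient(z.to_numpy())
)
print(profiles)
```

Note that `np.gradient` needs at least two samples per group, so very short transects would have to be filtered out first.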


In [7]:
# Our rasters have NaN values set to -32767.0. Thus, we replace them with np.nan.
# Assigning the result back avoids relying on an in-place replace through attribute access.
data_merged["z"]=data_merged["z"].replace(-32767.0,np.nan)
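
The sketch below illustrates, with made-up elevation values, how a nodata value affects features derived from the elevation, and why it is safer to mask it before computing them. Only the nodata value itself comes from the cell above; everything else is hypothetical.

```python
import numpy as np
import pandas as pd

NODATA = -32767.0  # nodata value quoted in the cell above

# Hypothetical elevation profile with one nodata sample in the middle.
z_raw = pd.Series([1.0, 1.2, NODATA, 1.6, 1.8])

# Slope computed on the raw values is dominated by the nodata sentinel.
slope_raw = np.gradient(z_raw.to_numpy())

# Masking first turns the slopes next to the gap into NaN instead of huge numbers.
z_masked = z_raw.replace(NODATA, np.nan)
slope_masked = np.gradient(z_masked.to_numpy())

print(slope_raw)
print(slope_masked)
```

With the masked series, the affected samples can later be dropped with `dropna`, whereas the huge slopes from the raw series would pass as valid numbers.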

## Iterative silhouette analysis


The __get_sil_location__ function iteratively performs KMeans clustering and Silhouette Analysis with an increasing number of clusters (k, specified in the `ks` parameter) for every survey, using the feature set specified in the `feature_set` parameter.

It returns a dataframe of average silhouette scores for each k and each survey, which we then pass to the __get_opt_k__ function to find a sub-optimal number of clusters.

Then, with the sub-optimal k, we finally run KMeans with the __kmeans_sa__ function on all the surveys to obtain clustered points, which help to visually discriminate between sand and non-sand in a QGIS environment.
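
To make the idea concrete, here is a minimal sketch of the KMeans plus silhouette loop for a single survey, written directly with scikit-learn. It is not the code inside __get_sil_location__: the data are synthetic, and the min-max scaling step and the range of k are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for one survey: three well-separated "colour" groups.
rng = np.random.default_rng(10)
centres = np.array([[40, 40, 40], [130, 130, 120], [220, 210, 190]])
points = np.vstack([c + rng.normal(scale=5, size=(50, 3)) for c in centres])

# Scale every feature to the 0-1 range before clustering.
scaled = MinMaxScaler().fit_transform(points)

# Average silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On clean synthetic data the highest score identifies the true number of groups. Real beach imagery is far less tidy, which is why the notebook looks for a sub-optimal k rather than simply taking the maximum.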

In [10]:
%%time
# Iteratively run KMeans + SA

feature_set=["band1","band2","band3"]
sil_df=get_sil_location(data_merged,
                        ks=(2,30),
                        feature_set=feature_set,
                        random_state=10)

  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/9 [00:00<?, ?it/s]

Working on : mar, 2019-05-16.

  0%|          | 0/28 [00:00<?, ?it/s]

For n_clusters = 2 The average silhouette_score is : 0.760174545594254
For n_clusters = 3 The average silhouette_score is : 0.5705147015039415
For n_clusters = 4 The average silhouette_score is : 0.5741013380184272
For n_clusters = 5 The average silhouette_score is : 0.5242957713720818

KeyboardInterrupt: 

## Sub-optimal k

Find the sub-optimal k by searching for inflexion points, where an additional cluster does not considerably degrade the overall clustering performance.

In [14]:
opt_k=get_opt_k(sil_df, sigma=0)
opt_k

{'leo_2018-06-06': 10,
 'leo_2018-07-13': 9,
 'leo_2018-09-20': 12,
 'leo_2019-02-11': 11,
 'leo_2019-03-28': 11,
 'leo_2019-07-31': 10,
 'mar_2018-06-01': 6,
 'mar_2018-06-21': 7,
 'mar_2018-07-27': 7,
 'mar_2018-09-25': 8,
 'mar_2018-11-13': 7,
 'mar_2018-12-11': 7,
 'mar_2019-02-05': 5,
 'mar_2019-03-13': 5,
 'mar_2019-05-16': 3}
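
The inflexion-point idea described above can be sketched with NumPy alone. This is not Sandpyper's implementation of __get_opt_k__: the function `inflexion_ks` is hypothetical, no smoothing is applied, and the scores are made up. It simply looks for the k values at which the curvature of the silhouette curve changes sign.

```python
import numpy as np

def inflexion_ks(ks, scores):
    """Return the k values just after a sign change of the score curvature."""
    curvature = np.gradient(np.gradient(np.asarray(scores, dtype=float)))
    # Indices i where the curvature sign differs between i and i + 1.
    changes = np.where(np.diff(np.sign(curvature)) != 0)[0]
    return [int(ks[i + 1]) for i in changes]

# Hypothetical average silhouette scores for k = 2..8.
ks = np.arange(2, 9)
scores = [0.80, 0.60, 0.50, 0.48, 0.40, 0.20, 0.10]

print(inflexion_ks(ks, scores))
```

For these made-up scores the function returns 5 and 7. The first candidate, k=5, is where the score levels off (0.50 to 0.48), so adding that cluster costs very little.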

If we are not satisfied with the sub-optimal k returned by the algorithm, we can manually specify the k of each survey by defining a dictionary.

In [12]:
# Based on our observations on a dataset comprising 87 surveys, 10 clusters (k=10) is generally a good tradeoff.

opt_k={'leo_2018-06-06': 10,
 'leo_2018-07-13': 10,
 'leo_2018-09-20': 10,
 'leo_2019-02-11': 10,
 'leo_2019-03-28': 10,
 'leo_2019-07-31': 10,
 'mar_2018-06-01': 10,
 'mar_2018-06-21': 10,
 'mar_2018-07-27': 10,
 'mar_2018-09-25': 10,
 'mar_2018-11-13': 10,
 'mar_2018-12-11': 10,
 'mar_2019-02-05': 10,
 'mar_2019-03-13': 10,
 'mar_2019-05-16': 10}
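
When the same k is used for every survey, typing the dictionary by hand is not strictly necessary. As a sketch, assuming a table with `location` and `survey_date` columns (the small `surveys` table below is hypothetical), the keys can be built from the data and given a common value with `dict.fromkeys`.

```python
import pandas as pd

# Hypothetical survey listing; real keys follow the "<location>_<survey_date>" pattern.
surveys = pd.DataFrame({
    "location": ["leo", "leo", "mar"],
    "survey_date": ["2018-06-06", "2018-07-13", "2019-05-16"],
})

survey_keys = (surveys["location"] + "_" + surveys["survey_date"]).unique()

# Same k for every survey.
uniform_k = dict.fromkeys(survey_keys, 10)
print(uniform_k)
```

This keeps the dictionary in step with the surveys actually present in the table.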

Or, we can update one value only. For instance, in the mar_2019-05-16 dataset, it is unlikely that 3 clusters are enough.<br>
So, we replace only that value with 10.


In [15]:
opt_k['mar_2019-05-16']=10
opt_k

{'leo_2018-06-06': 10,
 'leo_2018-07-13': 9,
 'leo_2018-09-20': 12,
 'leo_2019-02-11': 11,
 'leo_2019-03-28': 11,
 'leo_2019-07-31': 10,
 'mar_2018-06-01': 6,
 'mar_2018-06-21': 7,
 'mar_2018-07-27': 7,
 'mar_2018-09-25': 8,
 'mar_2018-11-13': 7,
 'mar_2018-12-11': 7,
 'mar_2019-02-05': 5,
 'mar_2019-03-13': 5,
 'mar_2019-05-16': 10}

## Optimised K-Means clustering

With the sub-optimal k dictionary and keeping the same feature set, we finally cluster the dataset.
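
Conceptually, this step runs one KMeans per survey, each with its own k. The sketch below shows that pattern with scikit-learn on synthetic data. It is not the code inside __kmeans_sa__: the helper `fake_survey`, the `k_per_survey` dictionary and the min-max scaling are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

feature_set = ["band1", "band2", "band3"]

# Hypothetical two-survey table; each survey has clearly separated colour groups.
rng = np.random.default_rng(10)

def fake_survey(location, survey_date, centres):
    values = np.vstack([c + rng.normal(scale=3, size=(30, 3)) for c in centres])
    frame = pd.DataFrame(values, columns=feature_set)
    frame["location"] = location
    frame["survey_date"] = survey_date
    return frame

data = pd.concat([
    fake_survey("leo", "2018-06-06", [[50, 50, 50], [200, 200, 180]]),
    fake_survey("mar", "2019-05-16", [[40, 40, 40], [120, 120, 110], [220, 210, 190]]),
], ignore_index=True)

k_per_survey = {"leo_2018-06-06": 2, "mar_2019-05-16": 3}

labelled = []
for (location, survey_date), survey in data.groupby(["location", "survey_date"]):
    survey = survey.copy()  # work on a copy so the assignment below is unambiguous
    scaled = MinMaxScaler().fit_transform(survey[feature_set])
    k = k_per_survey[f"{location}_{survey_date}"]
    survey["label_k"] = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(scaled)
    labelled.append(survey)

data_labelled = pd.concat(labelled)
print(data_labelled.groupby(["location", "survey_date"])["label_k"].nunique())
```

Because each survey is clustered independently, a given label number in one survey has no relation to the same number in another survey. The labels only become meaningful once they are inspected and assigned to sand or non-sand.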

In [19]:
feature_set=["band1","band2","band3"]
data_classified=kmeans_sa(data_merged,opt_k, feature_set=feature_set)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_merged.dropna(inplace=True)


  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/9 [00:00<?, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_in["label_k"] = clusterer.fit_predict(minmax_scaled_df)

  0%|          | 0/6 [00:00<?, ?it/s]


In [20]:
data_classified=pd.merge(data_classified[["point_id","label_k"]],data_merged, how="left", on="point_id", validate="one_to_one")
data_classified

Unnamed: 0,point_id,label_k,distance,z,tr_id,raw_date,coordinates,location,survey_date,x,y,geometry,band1,band2,band3,slope,curve
0,67143080l2610320eo00,3,0.2,1.105616,47,20180606,POINT (299873.4167173313 5773731.881880409),leo,2018-06-06,299873.4167173313,5773731.881880409,POINT (299873.417 5773731.882),141.0,142.0,132.0,-0.006003,0.002122
1,67142080l2670630eo00,3,0.3,1.101189,47,20180606,POINT (299873.516093276 5773731.893034852),leo,2018-06-06,299873.51609327603,5773731.893034852,POINT (299873.516 5773731.893),148.0,148.0,143.0,-0.003264,0.001769
2,67142080l2600940eo00,3,0.4,1.099089,47,20180606,POINT (299873.6154692209 5773731.904189295),leo,2018-06-06,299873.61546922085,5773731.904189295,POINT (299873.615 5773731.904),140.0,142.0,129.0,-0.002465,0.001138
3,67146080l2650750eo00,6,0.5,1.096259,47,20180606,POINT (299873.7148451657 5773731.915343738),leo,2018-06-06,299873.71484516567,5773731.915343738,POINT (299873.715 5773731.915),162.0,165.0,155.0,-0.000988,0.001301
4,67141080l2600560eo00,3,0.6,1.097113,47,20180606,POINT (299873.8142211105 5773731.92649818),leo,2018-06-06,299873.8142211105,5773731.92649818,POINT (299873.814 5773731.926),152.0,154.0,137.0,0.000136,0.001117
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
185180,60108091m2528500ar21,9,28.1,1.753726,0,20190516,POINT (731474.0709976825 5705142.514267173),mar,2019-05-16,731474.0709976825,5705142.514267173,POINT (731474.071 5705142.514),198.0,190.0,164.0,-0.007479,0.001382
185181,60103091m2518200ar22,4,28.2,1.748035,0,20190516,POINT (731474.1704055312 5705142.503400728),mar,2019-05-16,731474.1704055312,5705142.503400728,POINT (731474.170 5705142.503),196.0,187.0,161.0,-0.006537,0.000006
185182,60107091m2598900ar23,9,28.3,1.740652,0,20190516,POINT (731474.2698133799 5705142.492534284),mar,2019-05-16,731474.2698133799,5705142.492534284,POINT (731474.270 5705142.493),200.0,192.0,165.0,-0.007468,-0.000615
185183,60102091m2588500ar24,9,28.4,1.733099,0,20190516,POINT (731474.3692212285 5705142.481667838),mar,2019-05-16,731474.3692212285,5705142.481667838,POINT (731474.369 5705142.482),200.0,191.0,164.0,-0.007767,-0.000747


### GOOD!

Save the __data_classified__ dataframe as a CSV file and head to the __Example_3_Labels_correction_and_multitemporal_table notebook__.

In [21]:
data_classified.to_csv(r"C:\my_packages\doc_data\labels\data_classified.csv", index=False)

___