<font size="5"><center> <b>Sandpyper: sandy beaches SfM-UAV analysis tools</b></center></font>
<font size="4"><center> <b> Example 2 - Sand labelling </b></center> <br>

    
<center><img src="images/banner.png" width="80%"  /></center>

<font face="Calibri">
<br>
<font size="5"> <b>Sand clustering with Silhouette Analysis and KMeans notebook</b></font>

<br>
<font size="4"> <b> Nicolas Pucino; PhD Student @ Deakin University, Australia </b> <br>

<font size="3">This notebook illustrates how to use Sandpyper to perform Silhouette Analysis and KMeans on all previously extracted points. <br>

<b>This notebook covers the following concepts:</b>

- Silhouette Analysis.
- KMeans clustering.
</font>


</font>

In [1]:
import pandas as pd
import geopandas as gpd
import numpy as np

from sandpyper.outils import coords_to_points 
from sandpyper.labels import get_sil_location, get_opt_k, kmeans_sa

pd.options.mode.chained_assignment = None  # default='warn'

Loading the project-related lists

- loc codes
- crs dict string

In [2]:
# The location codes used throughout the analysis
loc_codes=["mar","leo"]

# The Coordinate Reference Systems used throughout this study
crs_dict_string= {
                 'mar': {'init': 'epsg:32754'},
                 'leo': {'init': 'epsg:32755'},
                 }

## Loading, merging and preparing the tables

The function __get_merged_table__ merges the rgb and z tables and formats the result so that it is ready for further analysis. This notebook performs those steps explicitly in the cells below.

In [3]:
%%time

#Loading the tables

rgb_table_path=r"C:\my_packages\doc_data\profiles\rgb.csv"
z_table_path=r"C:\my_packages\doc_data\profiles\elevation.csv"

rgb_table=gpd.read_file(rgb_table_path)
z_table=gpd.read_file(z_table_path)

# The across-transect distance comes from an interpolation, so it carries spurious digits.
# Round the distance column of both tables to 2 decimal places and cast it to float.

rgb_table["distance"]=np.round(rgb_table.loc[:,"distance"].values.astype("float"),2)
z_table["distance"]=np.round(z_table.loc[:,"distance"].values.astype("float"),2)


Wall time: 53.6 s


Storing GeoDataFrames as CSV is handy, but __we lose the column data type information__.
The __geometry column__ matters most: it must be converted back into __Shapely Point objects__.
The function __coords_to_points__ does this when applied to the `coordinates` Series. The conversion can take quite a bit of time, so, if you have a lot of points, get ready!
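As the table outputs in this notebook show, the `coordinates` column holds plain text such as `POINT (731646.9 5705523.4)`. The sketch below illustrates what converting that text back into numbers involves. `parse_point_wkt` is a hypothetical helper written for this illustration, not the Sandpyper implementation of `coords_to_points`, and it assumes 2D points in well-known text (WKT) form.

```python
import re

# Hypothetical helper (not part of Sandpyper): parse a 2D WKT point string
# such as "POINT (731646.9 5705523.4)" into an (x, y) tuple of floats.
_POINT_RE = re.compile(r"^\s*POINT\s*\(\s*([^\s()]+)\s+([^\s()]+)\s*\)\s*$")


def parse_point_wkt(text):
    match = _POINT_RE.match(text)
    if match is None:
        raise ValueError(f"not a 2D WKT point: {text!r}")
    return float(match.group(1)), float(match.group(2))


xy = parse_point_wkt("POINT (731646.903760184 5705523.468988597)")
print(xy)
```

With Shapely installed, the resulting pair can be passed to `shapely.geometry.Point(x, y)` to obtain a geometry object.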

In [4]:
rgb_table['geometry']=rgb_table.coordinates.apply(coords_to_points)
z_table['geometry']=z_table.coordinates.apply(coords_to_points)

z_table.head()

Unnamed: 0,distance,z,tr_id,raw_date,coordinates,location,survey_date,point_id,x,y,geometry
0,0.0,-0.0041950703598558,21,20190516,POINT (731646.903760184 5705523.468988597),mar,2019-05-16,61121091m2580400ar00,731646.903760184,5705523.468988597,POINT (731646.904 5705523.469)
1,0.1,-0.0041955984197556,21,20190516,POINT (731646.8212142694 5705523.525434784),mar,2019-05-16,61126091m2590410ar00,731646.8212142694,5705523.525434784,POINT (731646.821 5705523.525)
2,0.2,-0.0041948105208575,21,20190516,POINT (731646.7386683549 5705523.581880971),mar,2019-05-16,61125091m2540920ar00,731646.7386683549,5705523.581880971,POINT (731646.739 5705523.582)
3,0.3,-0.004195149987936,21,20190516,POINT (731646.6561224404 5705523.638327157),mar,2019-05-16,61124091m2500430ar00,731646.6561224404,5705523.638327157,POINT (731646.656 5705523.638)
4,0.4,-0.0041956659406423,21,20190516,POINT (731646.5735765258 5705523.694773344),mar,2019-05-16,61122091m2550840ar00,731646.5735765258,5705523.694773344,POINT (731646.574 5705523.695)


In [5]:
# Here, we merge the two tables (storing elevation and rgb information)

data_merged = pd.merge(z_table,rgb_table[["band1","band2","band3","point_id"]],on="point_id",validate="one_to_one")

# replace empty strings with np.nan (the np.NaN alias was removed in NumPy 2.0)
data_merged=data_merged.replace("", np.nan)

# and convert the z column into floats.
data_merged['z']=data_merged.z.astype("float")

data_merged.head()

Unnamed: 0,distance,z,tr_id,raw_date,coordinates,location,survey_date,point_id,x,y,geometry,band1,band2,band3
0,0.0,-0.004195,21,20190516,POINT (731646.903760184 5705523.468988597),mar,2019-05-16,61121091m2580400ar00,731646.903760184,5705523.468988597,POINT (731646.904 5705523.469),110.0,131.0,119.0
1,0.1,-0.004196,21,20190516,POINT (731646.8212142694 5705523.525434784),mar,2019-05-16,61126091m2590410ar00,731646.8212142694,5705523.525434784,POINT (731646.821 5705523.525),124.0,145.0,133.0
2,0.2,-0.004195,21,20190516,POINT (731646.7386683549 5705523.581880971),mar,2019-05-16,61125091m2540920ar00,731646.7386683549,5705523.581880971,POINT (731646.739 5705523.582),113.0,133.0,121.0
3,0.3,-0.004195,21,20190516,POINT (731646.6561224404 5705523.638327157),mar,2019-05-16,61124091m2500430ar00,731646.6561224404,5705523.638327157,POINT (731646.656 5705523.638),124.0,143.0,132.0
4,0.4,-0.004196,21,20190516,POINT (731646.5735765258 5705523.694773344),mar,2019-05-16,61122091m2550840ar00,731646.5735765258,5705523.694773344,POINT (731646.574 5705523.695),117.0,137.0,125.0
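The merge above relies on `validate="one_to_one"`, which makes pandas check that `point_id` is unique in both tables before joining. The toy example below, with made-up values, sketches what happens when that assumption is violated and how removing the duplicated key lets the merge succeed. Note that `pd.merge` defaults to an inner join, so points missing from either table are dropped.

```python
import pandas as pd

# Toy tables with made-up values; "b" appears twice in the rgb table.
z = pd.DataFrame({"point_id": ["a", "b", "c"], "z": [1.0, 1.1, 1.2]})
rgb = pd.DataFrame({"point_id": ["a", "b", "b"], "band1": [110, 124, 113]})

try:
    pd.merge(z, rgb, on="point_id", validate="one_to_one")
    merge_ok = True
except pd.errors.MergeError:
    # Raised because point_id is not unique in the right-hand table.
    merge_ok = False

# After dropping the duplicated key the same merge succeeds.
rgb_unique = rgb.drop_duplicates(subset="point_id")
merged = pd.merge(z, rgb_unique, on="point_id", validate="one_to_one")
print(merge_ok, len(merged))
```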


In [6]:
# Here, we add two features, slope and curvature, computed from the elevation series,
# in case we want to use them for KMeans clustering.
# Note that slope and curvature are wrong at the boundary between one transect and the next.
# However, we will clip those areas, as they lie in the water or in the backdune.

data_merged["slope"]=np.gradient(data_merged.z)
data_merged["curve"]=np.gradient(data_merged.slope)

data_merged.head()

Unnamed: 0,distance,z,tr_id,raw_date,coordinates,location,survey_date,point_id,x,y,geometry,band1,band2,band3,slope,curve
0,0.0,-0.004195,21,20190516,POINT (731646.903760184 5705523.468988597),mar,2019-05-16,61121091m2580400ar00,731646.903760184,5705523.468988597,POINT (731646.904 5705523.469),110.0,131.0,119.0,-5.280599e-07,6.579794e-07
1,0.1,-0.004196,21,20190516,POINT (731646.8212142694 5705523.525434784),mar,2019-05-16,61126091m2590410ar00,731646.8212142694,5705523.525434784,POINT (731646.821 5705523.525),124.0,145.0,133.0,1.299195e-07,3.761379e-07
2,0.2,-0.004195,21,20190516,POINT (731646.7386683549 5705523.581880971),mar,2019-05-16,61125091m2540920ar00,731646.7386683549,5705523.581880971,POINT (731646.739 5705523.582),113.0,133.0,121.0,2.242159e-07,-2.788147e-07
3,0.3,-0.004195,21,20190516,POINT (731646.6561224404 5705523.638327157),mar,2019-05-16,61124091m2500430ar00,731646.6561224404,5705523.638327157,POINT (731646.656 5705523.638),124.0,143.0,132.0,-4.277099e-07,-2.410961e-07
4,0.4,-0.004196,21,20190516,POINT (731646.5735765258 5705523.694773344),mar,2019-05-16,61122091m2550840ar00,731646.5735765258,5705523.694773344,POINT (731646.574 5705523.695),117.0,137.0,125.0,-2.579764e-07,-4.051253e-08
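The comment in the cell above warns that gradients are wrong where one transect ends and the next begins. A minimal sketch on toy data shows the effect, together with one possible way to avoid it by computing the gradient within each transect. The toy values and the single `tr_id` grouping key are assumptions for illustration. In the real table a transect would presumably be identified by location, survey date and `tr_id` together.

```python
import numpy as np
import pandas as pd

# Two toy transects of three points each, stacked in one table.
toy = pd.DataFrame({
    "tr_id": [1, 1, 1, 2, 2, 2],
    "z": [0.0, 1.0, 2.0, 10.0, 12.0, 14.0],
})
z_values = toy["z"].to_numpy()

# Gradient over the whole column: values next to the transect boundary
# mix elevations from two different transects.
naive_slope = np.gradient(z_values)

# Gradient computed separately within each transect.
# np.gradient needs at least two points per group.
per_transect = np.empty(len(toy))
for _, positions in toy.groupby("tr_id").indices.items():
    per_transect[positions] = np.gradient(z_values[positions])

print(naive_slope.tolist())
print(per_transect.tolist())
```

Called without a spacing argument, `np.gradient` assumes unit spacing between samples, so the values are elevation change per sample rather than per metre.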


In [7]:
# Our rasters store NoData as -32767.0, so we replace that value with np.nan.
# Assigning the result back avoids relying on an in-place replace through attribute access.
data_merged["z"]=data_merged["z"].replace(-32767.0,np.nan)

## Iterative silhouette analysis


The __get_sil_location__ function iteratively performs KMeans clustering and Silhouette Analysis with an increasing number of clusters (k, specified in the `ks` parameter) for every survey, using the feature set specified in the `feature_set` parameter.

It returns a dataframe of average silhouette scores for each k and each survey, which we pass to the __get_opt_k__ function to find a sub-optimal number of clusters.

Then, with the sub-optimal k, we run KMeans with the __kmeans_sa__ function on all the surveys to obtain clustered points, which help to visually discriminate between sand and non-sand in QGIS.
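To make the idea concrete, here is a minimal sketch of a silhouette sweep for a single survey, written directly with scikit-learn. It is an illustration under stated assumptions, not the internals of `get_sil_location`: the synthetic `pixels` array stands in for the `band1`, `band2` and `band3` columns, and the range of k is kept short.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for three colour bands: two well-separated colour groups.
dark = rng.normal(loc=50.0, scale=2.0, size=(60, 3))
bright = rng.normal(loc=200.0, scale=2.0, size=(60, 3))
pixels = np.vstack([dark, bright])

# Fit KMeans for each k and record the average silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(pixels)
    scores[k] = silhouette_score(pixels, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"For n_clusters = {k} the average silhouette score is {score:.3f}")
```

The average silhouette score lies between -1 and 1, and higher values indicate more compact, better separated clusters. In the run recorded in this notebook the progress bars count 28 iterations per survey, which suggests that `ks=(2,30)` covers k from 2 to 29.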

In [8]:
%%time
# Iteratively run KMeans + SA

feature_set=["band1","band2","band3"]
sil_df=get_sil_location(data_merged,
                        ks=(2,30), 
                        feature_set=feature_set,
                       random_state=10)

  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/9 [00:00<?, ?it/s]

Working on : mar, 2019-05-16.


  0%|          | 0/28 [00:00<?, ?it/s]

For n_clusters = 2 The average silhouette_score is : 0.6862075501506316
For n_clusters = 3 The average silhouette_score is : 0.513965444380363
For n_clusters = 4 The average silhouette_score is : 0.5428028774353153
For n_clusters = 5 The average silhouette_score is : 0.5083834521453034
For n_clusters = 6 The average silhouette_score is : 0.47467070127521116
For n_clusters = 7 The average silhouette_score is : 0.465572322824415
For n_clusters = 8 The average silhouette_score is : 0.4364300120091992
For n_clusters = 9 The average silhouette_score is : 0.44206467829932033
For n_clusters = 10 The average silhouette_score is : 0.41485117577947894
For n_clusters = 11 The average silhouette_score is : 0.41251112197505874
For n_clusters = 12 The average silhouette_score is : 0.3962539435788235
For n_clusters = 13 The average silhouette_score is : 0.41004443532804863
For n_clusters = 14 The average silhouette_score is : 0.4143684457737213
For n_clusters = 15 The average silhouette_score is : 0.

  0%|          | 0/28 [00:00<?, ?it/s]

For n_clusters = 2 The average silhouette_score is : 0.6170867984187076
For n_clusters = 3 The average silhouette_score is : 0.5142944562633687
For n_clusters = 4 The average silhouette_score is : 0.5195153673762395
For n_clusters = 5 The average silhouette_score is : 0.47126418883658083
For n_clusters = 6 The average silhouette_score is : 0.465070736254574
For n_clusters = 7 The average silhouette_score is : 0.43477801971798896
For n_clusters = 8 The average silhouette_score is : 0.4175784024938606
For n_clusters = 9 The average silhouette_score is : 0.40440453681363747
For n_clusters = 10 The average silhouette_score is : 0.39952388915266024
For n_clusters = 11 The average silhouette_score is : 0.39487531935869014
For n_clusters = 12 The average silhouette_score is : 0.38658389129783677
For n_clusters = 13 The average silhouette_score is : 0.38709288253414315
For n_clusters = 14 The average silhouette_score is : 0.3696316187549434
For n_clusters = 15 The average silhouette_score is :

  0%|          | 0/28 [00:00<?, ?it/s]

For n_clusters = 2 The average silhouette_score is : 0.6430665563715819
For n_clusters = 3 The average silhouette_score is : 0.540039551346565
For n_clusters = 4 The average silhouette_score is : 0.5131406007273728
For n_clusters = 5 The average silhouette_score is : 0.4823729912655783
For n_clusters = 6 The average silhouette_score is : 0.44326176339783124
For n_clusters = 7 The average silhouette_score is : 0.4430877286284131
For n_clusters = 8 The average silhouette_score is : 0.42034826101945183
For n_clusters = 9 The average silhouette_score is : 0.4080092826522824
For n_clusters = 10 The average silhouette_score is : 0.38283631423484177
For n_clusters = 11 The average silhouette_score is : 0.39722587925891
For n_clusters = 12 The average silhouette_score is : 0.38979063459562935
For n_clusters = 13 The average silhouette_score is : 0.39812841463089454
For n_clusters = 14 The average silhouette_score is : 0.38572229388337054
For n_clusters = 15 The average silhouette_score is : 0.

  0%|          | 0/28 [00:00<?, ?it/s]

For n_clusters = 2 The average silhouette_score is : 0.5630097229928264
For n_clusters = 3 The average silhouette_score is : 0.5212375993013656
For n_clusters = 4 The average silhouette_score is : 0.515813102750951
For n_clusters = 5 The average silhouette_score is : 0.48753135114687446
For n_clusters = 6 The average silhouette_score is : 0.44070920054583995
For n_clusters = 7 The average silhouette_score is : 0.4153087651471306
For n_clusters = 8 The average silhouette_score is : 0.4015114960444891
For n_clusters = 9 The average silhouette_score is : 0.38862800052782553
For n_clusters = 10 The average silhouette_score is : 0.3774256997879442
For n_clusters = 11 The average silhouette_score is : 0.3546759907264308
For n_clusters = 12 The average silhouette_score is : 0.370541646606475
For n_clusters = 13 The average silhouette_score is : 0.35309092574367906
For n_clusters = 14 The average silhouette_score is : 0.37434755814504844
For n_clusters = 15 The average silhouette_score is : 0.

  0%|          | 0/28 [00:00<?, ?it/s]

For n_clusters = 2 The average silhouette_score is : 0.5960483359838735
For n_clusters = 3 The average silhouette_score is : 0.5041751378059506
For n_clusters = 4 The average silhouette_score is : 0.5159081118620653
For n_clusters = 5 The average silhouette_score is : 0.4489234699169147
For n_clusters = 6 The average silhouette_score is : 0.4270668438861502
For n_clusters = 7 The average silhouette_score is : 0.4131069945187734
For n_clusters = 8 The average silhouette_score is : 0.41345300302916316
For n_clusters = 9 The average silhouette_score is : 0.4004290027534702
For n_clusters = 10 The average silhouette_score is : 0.39701403124493245
For n_clusters = 11 The average silhouette_score is : 0.4040245059489679
For n_clusters = 12 The average silhouette_score is : 0.39192378247864457
For n_clusters = 13 The average silhouette_score is : 0.3936816845507242
For n_clusters = 14 The average silhouette_score is : 0.3936020360388745
For n_clusters = 15 The average silhouette_score is : 0.

  0%|          | 0/28 [00:00<?, ?it/s]

For n_clusters = 2 The average silhouette_score is : 0.5850268976114777
For n_clusters = 3 The average silhouette_score is : 0.49315265038659256
For n_clusters = 4 The average silhouette_score is : 0.48929570479528917
For n_clusters = 5 The average silhouette_score is : 0.46559444988644716
For n_clusters = 6 The average silhouette_score is : 0.4393599391404145
For n_clusters = 7 The average silhouette_score is : 0.42526360173099664
For n_clusters = 8 The average silhouette_score is : 0.39568582176959
For n_clusters = 9 The average silhouette_score is : 0.37132838162298826
For n_clusters = 10 The average silhouette_score is : 0.3612246555121027
For n_clusters = 11 The average silhouette_score is : 0.34867917063091836
For n_clusters = 12 The average silhouette_score is : 0.35258864437000526
For n_clusters = 13 The average silhouette_score is : 0.36597918644405864
For n_clusters = 14 The average silhouette_score is : 0.37134537967310943
For n_clusters = 15 The average silhouette_score is 

  0%|          | 0/28 [00:00<?, ?it/s]

For n_clusters = 2 The average silhouette_score is : 0.5677145662920096
For n_clusters = 3 The average silhouette_score is : 0.48182655237293553
For n_clusters = 4 The average silhouette_score is : 0.42523725543229224
For n_clusters = 5 The average silhouette_score is : 0.3834221826431047
For n_clusters = 6 The average silhouette_score is : 0.40465753675032734
For n_clusters = 7 The average silhouette_score is : 0.37021218431083175
For n_clusters = 8 The average silhouette_score is : 0.37197001169108596
For n_clusters = 9 The average silhouette_score is : 0.3656554148141353
For n_clusters = 10 The average silhouette_score is : 0.35679455551789435
For n_clusters = 11 The average silhouette_score is : 0.3534434482651074
For n_clusters = 12 The average silhouette_score is : 0.35974696682495977
For n_clusters = 13 The average silhouette_score is : 0.3614984412200758
For n_clusters = 14 The average silhouette_score is : 0.35599229486107076
For n_clusters = 15 The average silhouette_score is

  0%|          | 0/28 [00:00<?, ?it/s]

For n_clusters = 2 The average silhouette_score is : 0.5899185922864892
For n_clusters = 3 The average silhouette_score is : 0.5030361512719422
For n_clusters = 4 The average silhouette_score is : 0.4690920099023971
For n_clusters = 5 The average silhouette_score is : 0.4295523520949522
For n_clusters = 6 The average silhouette_score is : 0.4085400116379133
For n_clusters = 7 The average silhouette_score is : 0.3863089140731913
For n_clusters = 8 The average silhouette_score is : 0.3885856913871781
For n_clusters = 9 The average silhouette_score is : 0.3847808822897427
For n_clusters = 10 The average silhouette_score is : 0.38179005393001697
For n_clusters = 11 The average silhouette_score is : 0.3726650417791217
For n_clusters = 12 The average silhouette_score is : 0.38476572079195365
For n_clusters = 13 The average silhouette_score is : 0.3910547209526152
For n_clusters = 14 The average silhouette_score is : 0.3855257121492227
For n_clusters = 15 The average silhouette_score is : 0.3

  0%|          | 0/28 [00:00<?, ?it/s]

For n_clusters = 2 The average silhouette_score is : 0.49802906322677315
For n_clusters = 3 The average silhouette_score is : 0.44517633936665313
For n_clusters = 4 The average silhouette_score is : 0.4127366590784887
For n_clusters = 5 The average silhouette_score is : 0.4026213761920998
For n_clusters = 6 The average silhouette_score is : 0.4224054523373244
For n_clusters = 7 The average silhouette_score is : 0.4091357659823786
For n_clusters = 8 The average silhouette_score is : 0.37320364236212517
For n_clusters = 9 The average silhouette_score is : 0.37357626448765896
For n_clusters = 10 The average silhouette_score is : 0.3512254783761634
For n_clusters = 11 The average silhouette_score is : 0.3537975626263718
For n_clusters = 12 The average silhouette_score is : 0.3377821186320535
For n_clusters = 13 The average silhouette_score is : 0.3445844338558182
For n_clusters = 14 The average silhouette_score is : 0.3482585781428525
For n_clusters = 15 The average silhouette_score is : 0

  0%|          | 0/6 [00:00<?, ?it/s]

Working on : leo, 2019-07-31.


  0%|          | 0/28 [00:00<?, ?it/s]

For n_clusters = 2 The average silhouette_score is : 0.5537258657124402
For n_clusters = 3 The average silhouette_score is : 0.5177407685037151
For n_clusters = 4 The average silhouette_score is : 0.5167720744184553
For n_clusters = 5 The average silhouette_score is : 0.4815125611135174
For n_clusters = 6 The average silhouette_score is : 0.4628011840773116
For n_clusters = 7 The average silhouette_score is : 0.44677692947421815
For n_clusters = 8 The average silhouette_score is : 0.4171614892200843
For n_clusters = 9 The average silhouette_score is : 0.38729088161528974
For n_clusters = 10 The average silhouette_score is : 0.37243951421691646
For n_clusters = 11 The average silhouette_score is : 0.380213143912599
For n_clusters = 12 The average silhouette_score is : 0.3670635692266302
For n_clusters = 13 The average silhouette_score is : 0.34735539138787785
For n_clusters = 14 The average silhouette_score is : 0.3380966453389417
For n_clusters = 15 The average silhouette_score is : 0.

  0%|          | 0/28 [00:00<?, ?it/s]

For n_clusters = 2 The average silhouette_score is : 0.5652039307064745
For n_clusters = 3 The average silhouette_score is : 0.526924872537257
For n_clusters = 4 The average silhouette_score is : 0.5239379279701714
For n_clusters = 5 The average silhouette_score is : 0.4961925568052785
For n_clusters = 6 The average silhouette_score is : 0.47059412380663285
For n_clusters = 7 The average silhouette_score is : 0.4517789308888665
For n_clusters = 8 The average silhouette_score is : 0.423384512994583
For n_clusters = 9 The average silhouette_score is : 0.40174825227003436
For n_clusters = 10 The average silhouette_score is : 0.3850894418732941
For n_clusters = 11 The average silhouette_score is : 0.39302650512903176
For n_clusters = 12 The average silhouette_score is : 0.377691706527494
For n_clusters = 13 The average silhouette_score is : 0.3679601889414671
For n_clusters = 14 The average silhouette_score is : 0.35692784866326766
For n_clusters = 15 The average silhouette_score is : 0.36

  0%|          | 0/28 [00:00<?, ?it/s]

For n_clusters = 2 The average silhouette_score is : 0.5303977888701271
For n_clusters = 3 The average silhouette_score is : 0.5330102976116967
For n_clusters = 4 The average silhouette_score is : 0.4783306251840546
For n_clusters = 5 The average silhouette_score is : 0.4391254436069971
For n_clusters = 6 The average silhouette_score is : 0.42575635782077553
For n_clusters = 7 The average silhouette_score is : 0.4073990648366254
For n_clusters = 8 The average silhouette_score is : 0.38499838109541684
For n_clusters = 9 The average silhouette_score is : 0.3613124544957026
For n_clusters = 10 The average silhouette_score is : 0.34451985967041115
For n_clusters = 11 The average silhouette_score is : 0.3288394734794644
For n_clusters = 12 The average silhouette_score is : 0.33739410157497585
For n_clusters = 13 The average silhouette_score is : 0.3441156474070476
For n_clusters = 14 The average silhouette_score is : 0.3313715726054466
For n_clusters = 15 The average silhouette_score is : 0

  0%|          | 0/28 [00:00<?, ?it/s]

For n_clusters = 2 The average silhouette_score is : 0.5303765573565244
For n_clusters = 3 The average silhouette_score is : 0.506647014614286
For n_clusters = 4 The average silhouette_score is : 0.489517974998391
For n_clusters = 5 The average silhouette_score is : 0.48127873193593995
For n_clusters = 6 The average silhouette_score is : 0.4551861980473739
For n_clusters = 7 The average silhouette_score is : 0.42838128524178654
For n_clusters = 8 The average silhouette_score is : 0.40240360948455717
For n_clusters = 9 The average silhouette_score is : 0.3807123138810468
For n_clusters = 10 The average silhouette_score is : 0.3604613090845088
For n_clusters = 11 The average silhouette_score is : 0.3436523534532629
For n_clusters = 12 The average silhouette_score is : 0.3320747627640658
For n_clusters = 13 The average silhouette_score is : 0.33308432363315504
For n_clusters = 14 The average silhouette_score is : 0.33970719585904346
For n_clusters = 15 The average silhouette_score is : 0.

  0%|          | 0/28 [00:00<?, ?it/s]

For n_clusters = 2 The average silhouette_score is : 0.5770775116128227
For n_clusters = 3 The average silhouette_score is : 0.5399930541104823
For n_clusters = 4 The average silhouette_score is : 0.5052394102224822
For n_clusters = 5 The average silhouette_score is : 0.48075078602488874
For n_clusters = 6 The average silhouette_score is : 0.45526850728131896
For n_clusters = 7 The average silhouette_score is : 0.43226460854840604
For n_clusters = 8 The average silhouette_score is : 0.414915883032894
For n_clusters = 9 The average silhouette_score is : 0.39669300717981393
For n_clusters = 10 The average silhouette_score is : 0.4033361407068592
For n_clusters = 11 The average silhouette_score is : 0.390160235211496
For n_clusters = 12 The average silhouette_score is : 0.40192663962505204
For n_clusters = 13 The average silhouette_score is : 0.40118846203627984
For n_clusters = 14 The average silhouette_score is : 0.3875015482465485
For n_clusters = 15 The average silhouette_score is : 0

  0%|          | 0/28 [00:00<?, ?it/s]

For n_clusters = 2 The average silhouette_score is : 0.5026712292122081
For n_clusters = 3 The average silhouette_score is : 0.5204076376663338
For n_clusters = 4 The average silhouette_score is : 0.4930681188022711
For n_clusters = 5 The average silhouette_score is : 0.47433821590270814
For n_clusters = 6 The average silhouette_score is : 0.45318465731011304
For n_clusters = 7 The average silhouette_score is : 0.4264582747356895
For n_clusters = 8 The average silhouette_score is : 0.409395488173577
For n_clusters = 9 The average silhouette_score is : 0.39060635412536227
For n_clusters = 10 The average silhouette_score is : 0.3802600529763615
For n_clusters = 11 The average silhouette_score is : 0.3864671495205191
For n_clusters = 12 The average silhouette_score is : 0.3707382391134763
For n_clusters = 13 The average silhouette_score is : 0.3553489614160111
For n_clusters = 14 The average silhouette_score is : 0.3583010107783304
For n_clusters = 15 The average silhouette_score is : 0.3

In [9]:
sil_df.head()

Unnamed: 0,location,survey_date,k,silhouette_mean
0,mar,2019-05-16,2,0.686208
1,mar,2019-05-16,3,0.513965
2,mar,2019-05-16,4,0.542803
3,mar,2019-05-16,5,0.508383
4,mar,2019-05-16,6,0.474671


##  Sub-optimal k

Find the sub-optimal k by searching for inflexion points, where an additional cluster does not considerably degrade the overall clustering performance.
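As a rough illustration of "an additional cluster does not considerably degrade performance", the sketch below lists every k whose average silhouette score drops by no more than a tolerance relative to the previous k. `non_degrading_ks` and its `tol` parameter are hypothetical and written only for this example. This is not the algorithm used by `get_opt_k`, and its results need not match that function's output.

```python
# First five average silhouette scores reported for mar, 2019-05-16.
scores = {2: 0.686208, 3: 0.513965, 4: 0.542803, 5: 0.508383, 6: 0.474671}


def non_degrading_ks(scores, tol=0.0):
    """Hypothetical helper: ks whose score drops by at most `tol` versus the previous k."""
    ks = sorted(scores)
    return [k for prev, k in zip(ks, ks[1:]) if scores[k] - scores[prev] >= -tol]


candidates = non_degrading_ks(scores)
print(candidates)
print(non_degrading_ks(scores, tol=0.04))
```

With a zero tolerance only k=4 qualifies, because it is the only step where the score does not fall. Loosening the tolerance admits more candidates, which is one reason any automatic choice of k is worth checking visually.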

In [10]:
opt_k=get_opt_k(sil_df, sigma=0)
opt_k

{'leo_2018-06-06': 10,
 'leo_2018-07-13': 9,
 'leo_2018-09-20': 12,
 'leo_2019-02-11': 11,
 'leo_2019-03-28': 10,
 'leo_2019-07-31': 10,
 'mar_2018-06-01': 5,
 'mar_2018-06-21': 7,
 'mar_2018-07-27': 5,
 'mar_2018-09-25': 11,
 'mar_2018-11-13': 3,
 'mar_2018-12-11': 11,
 'mar_2019-02-05': 10,
 'mar_2019-03-13': 3,
 'mar_2019-05-16': 3}

If we are not satisfied with the sub-optimal k returned by the algorithm, we can manually specify k for each survey
by defining a dictionary.

In [11]:
# Based on our observations on a dataset comprising 87 surveys, 10 clusters (k=10) is generally a good tradeoff.

opt_k={'leo_2018-06-06': 10,
 'leo_2018-07-13': 10,
 'leo_2018-09-20': 10,
 'leo_2019-02-11': 10,
 'leo_2019-03-28': 10,
 'leo_2019-07-31': 10,
 'mar_2018-06-01': 10,
 'mar_2018-06-21': 10,
 'mar_2018-07-27': 10,
 'mar_2018-09-25': 10,
 'mar_2018-11-13': 10,
 'mar_2018-12-11': 10,
 'mar_2019-02-05': 10,
 'mar_2019-03-13': 10,
 'mar_2019-05-16': 10}

opt_k

{'leo_2018-06-06': 10,
 'leo_2018-07-13': 10,
 'leo_2018-09-20': 10,
 'leo_2019-02-11': 10,
 'leo_2019-03-28': 10,
 'leo_2019-07-31': 10,
 'mar_2018-06-01': 10,
 'mar_2018-06-21': 10,
 'mar_2018-07-27': 10,
 'mar_2018-09-25': 10,
 'mar_2018-11-13': 10,
 'mar_2018-12-11': 10,
 'mar_2019-02-05': 10,
 'mar_2019-03-13': 10,
 'mar_2019-05-16': 10}

Alternatively, we can update one value only. For instance, in the mar_2019-05-16 dataset, it is unlikely that 3 clusters are enough.<br>
So, we replace only that value with 10.


In [12]:
opt_k['mar_2019-05-16']=10
opt_k

{'leo_2018-06-06': 10,
 'leo_2018-07-13': 10,
 'leo_2018-09-20': 10,
 'leo_2019-02-11': 10,
 'leo_2019-03-28': 10,
 'leo_2019-07-31': 10,
 'mar_2018-06-01': 10,
 'mar_2018-06-21': 10,
 'mar_2018-07-27': 10,
 'mar_2018-09-25': 10,
 'mar_2018-11-13': 10,
 'mar_2018-12-11': 10,
 'mar_2019-02-05': 10,
 'mar_2019-03-13': 10,
 'mar_2019-05-16': 10}
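A dictionary that assigns the same k to every survey does not have to be typed by hand. The sketch below builds one with a dictionary comprehension, using the `location_date` key format seen in the outputs above. The short `surveys` list is an assumption for illustration. In practice the pairs could be taken from the `location` and `survey_date` columns of `sil_df`.

```python
# Illustrative subset of (location, survey_date) pairs.
surveys = [("leo", "2018-06-06"), ("leo", "2018-07-13"), ("mar", "2019-05-16")]

# Same k for every survey, keyed as "<location>_<survey_date>".
uniform_k = {f"{location}_{survey_date}": 10 for location, survey_date in surveys}
print(uniform_k)
```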

## Optimised K-Means clustering

With the sub-optimal k dictionary and keeping the same feature set, we finally cluster the dataset.

In [13]:
feature_set=["band1","band2","band3"]
data_classified=kmeans_sa(data_merged,opt_k, feature_set=feature_set)

  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/9 [00:00<?, ?it/s]

  0%|          | 0/6 [00:00<?, ?it/s]
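For readers who want to see what clustering each survey with its own k can look like, here is a minimal sketch using scikit-learn directly. It is not the implementation of `kmeans_sa`: the toy table, the `k_by_survey` dictionary and the fixed `random_state` are assumptions made for this example. Rows with missing feature values would need to be dropped first, because KMeans does not accept NaN.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Toy table with two surveys of 40 points each and random colour values.
toy = pd.DataFrame({
    "location": ["mar"] * 40 + ["leo"] * 40,
    "survey_date": ["2019-05-16"] * 40 + ["2019-07-31"] * 40,
    "band1": rng.integers(0, 256, 80),
    "band2": rng.integers(0, 256, 80),
    "band3": rng.integers(0, 256, 80),
})
k_by_survey = {"mar_2019-05-16": 3, "leo_2019-07-31": 4}
features = ["band1", "band2", "band3"]

# Cluster each survey separately with its own number of clusters.
toy["label_k"] = -1
for (location, survey_date), group in toy.groupby(["location", "survey_date"]):
    k = k_by_survey[f"{location}_{survey_date}"]
    model = KMeans(n_clusters=k, n_init=10, random_state=10)
    toy.loc[group.index, "label_k"] = model.fit_predict(group[features])

print(toy.groupby(["location", "survey_date"])["label_k"].nunique())
```

Cluster labels are only meaningful within a survey: label 3 in one survey has no relation to label 3 in another, which is why the labels still need to be inspected survey by survey.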

In [14]:
data_classified=pd.merge(data_classified[["point_id","label_k"]],data_merged, how="left", on="point_id", validate="one_to_one")
data_classified

Unnamed: 0,point_id,label_k,distance,z,tr_id,raw_date,coordinates,location,survey_date,x,y,geometry,band1,band2,band3,slope,curve
0,67143080l2610320eo00,3,0.2,1.105616,47,20180606,POINT (299873.4167173313 5773731.881880409),leo,2018-06-06,299873.4167173313,5773731.881880409,POINT (299873.417 5773731.882),141.0,142.0,132.0,-0.006003,0.002122
1,67142080l2670630eo00,3,0.3,1.101189,47,20180606,POINT (299873.516093276 5773731.893034852),leo,2018-06-06,299873.51609327603,5773731.893034852,POINT (299873.516 5773731.893),148.0,148.0,143.0,-0.003264,0.001769
2,67142080l2600940eo00,3,0.4,1.099089,47,20180606,POINT (299873.6154692209 5773731.904189295),leo,2018-06-06,299873.61546922085,5773731.904189295,POINT (299873.615 5773731.904),140.0,142.0,129.0,-0.002465,0.001138
3,67146080l2650750eo00,6,0.5,1.096259,47,20180606,POINT (299873.7148451657 5773731.915343738),leo,2018-06-06,299873.71484516567,5773731.915343738,POINT (299873.715 5773731.915),162.0,165.0,155.0,-0.000988,0.001301
4,67141080l2600560eo00,3,0.6,1.097113,47,20180606,POINT (299873.8142211105 5773731.92649818),leo,2018-06-06,299873.8142211105,5773731.92649818,POINT (299873.814 5773731.926),152.0,154.0,137.0,0.000136,0.001117
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
234347,60100091m2529300ar75,2,79.5,8.513124,0,20190516,POINT (731433.4192254023 5705160.105556185),mar,2019-05-16,731433.4192254023,5705160.105556185,POINT (731433.419 5705160.106),40.0,59.0,28.0,-0.001998,-0.004390
234348,60109091m2599500ar76,7,79.6,8.510223,0,20190516,POINT (731433.3198014995 5705160.116274748),mar,2019-05-16,731433.3198014995,5705160.116274748,POINT (731433.320 5705160.116),70.0,98.0,59.0,-0.009601,-0.125342
234349,60109091m2569800ar77,6,79.7,8.493922,0,20190516,POINT (731433.2203775968 5705160.12699331),mar,2019-05-16,731433.2203775968,5705160.12699331,POINT (731433.220 5705160.127),58.0,89.0,49.0,-0.252682,-0.117465
234350,60109091m2549100ar78,7,79.8,8.004859,0,20190516,POINT (731433.1209536941 5705160.137711871),mar,2019-05-16,731433.1209536941,5705160.137711871,POINT (731433.121 5705160.138),75.0,105.0,66.0,-0.244532,-1.599625


### GOOD!

Save the __data_classified__ dataframe as a CSV file and head to the __Example_3_Labels_correction_and_multitemporal_table notebook__.

In [15]:
#data_classified.to_csv(r"C:\my_packages\doc_data\labels\data_classified.csv", index=False)

___