# About
Goal: implement the following clustering algorithms:
1. K-Means
1. Hierarchical clustering
1. DBSCAN

## The Team
| Name| Student ID|
|------------|---------------|
|Cynthia Cai | 5625483 |
|Pratyush Kumar | 5359252|


# Imports

Add all imports to the cell below.

In [1]:
import numpy as np 
import pandas as pd
from scipy.spatial import ConvexHull, distance_matrix
from sklearn.metrics.pairwise import euclidean_distances as eucDist
import glob
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="darkgrid")

# Reading the dataset


From the readme for the xyz files, we know that:

Ground truth labels:
|File range|Label|
|--|--|
|    000 - 099: |building|
|    100 - 199: |car|
|    200 - 299: |fence|
|    300 - 399: |pole|
|    400 - 499: |tree|
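
Because every label covers a block of exactly 100 file numbers, the table above can also be expressed as an integer division. The sketch below is only an illustration: the names `LABELS` and `label_from_file_number` are hypothetical and are not used elsewhere in this notebook.

```python
# Hypothetical helper (not part of the notebook): maps a file number to its
# ground-truth label, following the ranges in the table above.
LABELS = ['building', 'car', 'fence', 'pole', 'tree']


def label_from_file_number(file_no):
    """Return the label for file numbers 0-499, and None for anything else."""
    if 0 <= file_no < 100 * len(LABELS):
        # each label spans 100 consecutive file numbers
        return LABELS[file_no // 100]
    return None
```

For example, `label_from_file_number(123)` returns `'car'`, and a number outside 0-499 returns `None`.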


Workflow:

1. Iterate through the files and read each one into a dataframe.
1. Concatenate the per-file dataframes into a single dataframe with [`pandas.concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html#pandas.concat).
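
A minimal sketch of this workflow on in-memory data, assuming space-separated `x y z` rows as in the `.xyz` files. The file contents and the names `fake_files`, `frames` and `all_points` are made up for illustration.

```python
import io

import pandas as pd

# Stand-ins for two .xyz files, keyed by file number; the numbers are invented.
fake_files = {
    0: "0.0 0.0 1.0\n1.0 0.0 2.0\n",
    100: "5.0 5.0 0.5\n",
}

frames = []
for file_no, text in fake_files.items():
    df = pd.read_csv(io.StringIO(text), sep=' ', header=None, names=['x', 'y', 'z'])
    df['fileNo'] = file_no
    frames.append(df)

# One concat at the end avoids re-copying a growing dataframe on every iteration.
all_points = pd.concat(frames, ignore_index=True)
```

Collecting the frames in a list and concatenating once is usually preferable to concatenating inside the loop, since each `pd.concat` call copies all the data gathered so far.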

In [2]:
xyzPath = './scene_objects/data/*.xyz'

dataPathsList = glob.glob(xyzPath)

In [3]:
import os

# per-file dataframes are collected here and concatenated once after the loop
frames = []

labelToGive = None
for path in dataPathsList:
    # the first three characters of the file name are the file number
    indx = int(os.path.basename(path)[0:3])
    # if else to determine label
    if indx>=0 and indx<100:
        labelToGive = 'building' 
    elif indx>=100 and indx<200:
        labelToGive = 'car' 
    elif indx>=200 and indx<300:
        labelToGive = 'fence' 
    elif indx>=300 and indx<400:
        labelToGive = 'pole' 
    elif indx>=400 and indx<500:
        labelToGive = 'tree' 

    # print(indx, labelToGive)        

    # using pandas to read dataset and make a dataFrame
    tempDF = pd.read_csv(path, delimiter=' ', header=None, dtype=np.float64, names=['x','y','z'])
    tempDF.loc[:,'fileNo'] = indx
    tempDF.loc[:,'groundLabel'] = labelToGive

    frames.append(tempDF)

# a single concat avoids re-copying the growing dataframe on every iteration
allPointsDF = pd.concat(frames, sort=False, ignore_index=True)

# allPointsDF.head()

In [4]:
# save to pickle file
# allPointsDF.to_pickle('./scene_objects/compressedData.pkl')

## Making feature points
Features identified so far (more to be added):
* median height(z)
* convex hull

In [5]:
def label_determiner(indx):
    labelToGive=None
    if indx>=0 and indx<100:
        labelToGive = 'building' 
    elif indx>=100 and indx<200:
        labelToGive = 'car' 
    elif indx>=200 and indx<300:
        labelToGive = 'fence' 
    elif indx>=300 and indx<400:
        labelToGive = 'pole' 
    elif indx>=400 and indx<500:
        labelToGive = 'tree' 
    return labelToGive


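# select the numeric columns explicitly: var() raises on the string label column in recent pandas
featureDF = allPointsDF.groupby('fileNo')[['x','y','z']].var()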
featureDF.rename(columns={'x':'varX','y':'varY','z':'varZ'}, inplace=True)
featureDF.loc[:,'median_Z'] = allPointsDF.groupby('fileNo').z.median()
# featureDF.loc[:,'mean_Z'] = allPointsDF.groupby('fileNo').z.mean()

# range of x,y,z
featureDF.loc[:,'range_X'] = allPointsDF.groupby('fileNo').x.max() - allPointsDF.groupby('fileNo').x.min()
featureDF.loc[:,'range_Y'] = allPointsDF.groupby('fileNo').y.max() - allPointsDF.groupby('fileNo').y.min()
featureDF.loc[:,'range_Z'] = allPointsDF.groupby('fileNo').z.max() - allPointsDF.groupby('fileNo').z.min()

featureDF.loc[:,'Volume'] = allPointsDF.set_index('fileNo').loc[:,'x':'z'].groupby('fileNo').apply(ConvexHull).apply(lambda x: x.volume)

# points density
featureDF.loc[:,'footprintDensity'] =  allPointsDF.groupby('fileNo').count().x / (featureDF.range_X * featureDF.range_Y)
featureDF.loc[:,'volumeDensity'] =  allPointsDF.groupby('fileNo').count().x / featureDF.Volume

# map the fileNo index directly, so the labels do not depend on positional alignment
featureDF.loc[:,'label'] = featureDF.index.map(label_determiner)

# standardize DF
standardFeatureDF = (featureDF.iloc[:,:-1] - featureDF.iloc[:,:-1].mean() ) / featureDF.iloc[:,:-1].std()

# join labels to the feature DF
standardFeatureDF = standardFeatureDF.join(other=featureDF.label ,on='fileNo')

featureDF.to_pickle('./scene_objects/featureData.pkl')
standardFeatureDF.to_pickle('./scene_objects/standardFeatureData.pkl')

### Plotting to look for resemblances and clusters, if any
Requires seaborn.

In [6]:
# load df's
featureDF = pd.read_pickle('./scene_objects/featureData.pkl')
standardFeatureDF = pd.read_pickle('./scene_objects/standardFeatureData.pkl')

In [None]:
sns.pairplot(data=featureDF, hue="label")

Normalize the feature dataframe.

[A Stack Overflow answer](https://stackoverflow.com/questions/26414913/normalize-columns-of-pandas-data-frame) shows that plain pandas is enough for standard scaling; alternatively, the [standard scaler from sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) can be applied.

According to [this answer](https://stats.stackexchange.com/questions/417339/data-standardization-vs-normalization-for-clustering-analysis), standard scaling is the usual choice for k-means, so we go with that.
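
As a small illustration of the pandas route, the sketch below standardizes a made-up dataframe (`toy` and its values are invented). One detail worth keeping in mind: pandas `.std()` defaults to the sample standard deviation (`ddof=1`), while sklearn's `StandardScaler` divides by the population standard deviation (`ddof=0`), so the two approaches differ by a constant factor per column.

```python
import pandas as pd

# Made-up feature table; the column names only mimic the ones used in this notebook.
toy = pd.DataFrame({
    'range_X': [1.0, 2.0, 3.0, 4.0],
    'median_Z': [10.0, 20.0, 30.0, 40.0],
})

# pandas default: sample standard deviation (ddof=1)
z_sample = (toy - toy.mean()) / toy.std()

# population standard deviation (ddof=0), the convention StandardScaler uses
z_population = (toy - toy.mean()) / toy.std(ddof=0)
```

Either variant gives every column zero mean; the choice of `ddof` only rescales all columns by the same factor, so it should not change which points are closest to each other.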

In [None]:
sns.pairplot(data=standardFeatureDF, hue="label")

# Clustering Algorithms
Note: `featureDF` and `standardFeatureDF` were already loaded in the cells above.

## K-Means clustering

In [None]:


def k_means():
    """
    summary: this function is not yet ready
    """
    pass

In [None]:
k_means?

## Hierarchical clustering

This [reference](https://www.section.io/engineering-education/hierarchical-clustering-in-python/) was helpful for understanding hierarchical clustering.

Some other sources:
* [Statquest](https://www.youtube.com/watch?v=7xHsRkOdVwo&ab_channel=StatQuestwithJoshStarmer)
* Penn state [pseudo code](https://online.stat.psu.edu/stat508/lesson/12/12.7)
* pseudo code from [researchgate](https://www.researchgate.net/figure/The-hierarchical-clustering-algorithm-in-pseudocode_fig1_202144697)
* Towards Data Science article that goes [step by step](https://towardsdatascience.com/breaking-down-the-agglomerative-clustering-process-1c367f74c7c2) (this is a good one to follow)
* another one [for theory](https://towardsdatascience.com/machine-learning-algorithms-part-12-hierarchical-agglomerative-clustering-example-in-python-1e18e0075019)
* similar [theory as above](https://www.geeksforgeeks.org/ml-hierarchical-clustering-agglomerative-and-divisive-clustering/)
* a really good [step-by-step explanation](https://medium.com/@darkprogrammerpb/agglomerative-hierarchial-clustering-from-scratch-ec50e14c3826), along with the [GitHub code](https://github.com/Darkprogrammerpb/DeepLearningProjects/blob/master/Project40/agglomerative_hierarchial_clustering/Hierarchial%20Agglomerative%20clustering.ipynb)

### Things to decide for hierarchical clustering:
* Which type of hierarchical clustering to use: let's begin with agglomerative clustering.
* Within the selected type, which distance metric to use.


In [7]:

tempDF = standardFeatureDF.iloc[:,:-1].copy()

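def heirarch_clust(dataDF):
    # pairwise Euclidean distances between the rows of the dataframe passed in
    # (the label column is dropped if it is present)
    distances = eucDist(dataDF.drop(columns='label', errors='ignore'))
    # TODO: the clustering steps are not implemented yet; only the distances are returned
    return distances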


# calculate distances
# maybe change the distance computation
distMatDF = pd.DataFrame( distance_matrix(tempDF.values, tempDF.values), index = tempDF.index, columns = tempDF.index)
# distMatDF = pd.DataFrame( np.tril(distMatDF),  index = tempDF.index, columns = tempDF.index)
distMatDF = distMatDF.where(distMatDF!=0, np.nan)
distMatDF

fileNo,0,1,2,3,4,5,6,7,8,9,...,490,491,492,493,494,495,496,497,498,499
0,,2.574501,2.144393,2.209737,3.472498,1.841868,3.026131,2.644947,2.832788,2.338606,...,2.195088,4.039945,2.608767,2.877201,3.276568,3.311645,3.365487,2.687808,2.490228,2.244077
1,2.574501,,2.271657,2.559484,2.235959,1.487359,1.681056,2.973740,1.745451,2.152603,...,0.791817,4.354254,1.760594,1.540615,1.490205,2.410368,3.107862,0.888874,2.743983,0.747427
2,2.144393,2.271657,,0.504025,1.797943,1.492353,3.116801,1.654982,1.265895,0.865140,...,2.384103,3.464734,1.789230,2.202106,2.944085,3.189730,2.708992,2.220230,2.185784,1.946434
3,2.209737,2.559484,0.504025,,1.852994,1.694753,3.324740,1.590265,1.543798,1.177028,...,2.672125,3.666159,2.131088,2.626599,3.350580,3.593042,3.073367,2.588518,2.540965,2.307101
4,3.472498,2.235959,1.797943,1.852994,,1.997464,2.892962,2.064431,0.986553,1.732406,...,2.715811,4.611560,2.472422,2.585656,3.051772,3.681593,3.642196,2.429393,3.445879,2.365583
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,3.311645,2.410368,3.189730,3.593042,3.681593,3.078434,2.688887,3.983859,3.099673,3.256381,...,2.487018,3.600597,2.497476,1.707411,1.244607,,1.829312,2.255954,2.291556,2.218532
496,3.365487,3.107862,2.708992,3.073367,3.642196,3.299543,3.734616,3.839151,3.074288,2.962046,...,3.139637,1.922267,2.033782,1.884151,2.368662,1.829312,,2.629444,1.297173,2.635838
497,2.687808,0.888874,2.220230,2.588518,2.429393,1.678229,2.105424,3.135257,1.779044,2.035366,...,0.938587,3.761611,1.124038,0.966225,1.304141,2.255954,2.629444,,2.184265,0.544401
498,2.490228,2.743983,2.185784,2.540965,3.445879,2.552354,3.475241,3.263336,2.692377,2.290211,...,2.551737,1.939549,1.451968,1.795807,2.487067,2.291556,1.297173,2.184265,,2.082095



Next, build the new distance matrix after each merge and repeat the sequence.
### TODO: 
* linkage between the clusters
* updating the distance matrix

The pair of clusters to merge next is given by `vals.idxmin()` and `idVals.loc[vals.idxmin()]`.
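
The two TODO items can be sketched on a tiny example. The function below performs one merge step with single linkage on a plain numpy array. It is an illustrative sketch rather than the notebook's implementation: `merge_closest_pair`, the use of `np.inf` on the diagonal, and the three 1-D points are all assumptions made for the example.

```python
import numpy as np


def merge_closest_pair(dist, labels):
    """Merge the closest pair of clusters using single linkage.

    dist: symmetric square array with np.inf on the diagonal.
    labels: list of cluster names, one per row of dist.
    Returns the reduced distance matrix and the updated label list.
    """
    # position of the smallest off-diagonal distance
    i, j = np.unravel_index(np.argmin(dist), dist.shape)
    keep = [k for k in range(len(labels)) if k not in (i, j)]

    # single linkage: the distance to the merged cluster is the smaller
    # of the distances to its two members
    merged_row = np.minimum(dist[i, keep], dist[j, keep])

    n = len(keep)
    new_dist = np.full((n + 1, n + 1), np.inf)
    new_dist[:n, :n] = dist[np.ix_(keep, keep)]
    new_dist[n, :n] = merged_row
    new_dist[:n, n] = merged_row

    new_labels = [labels[k] for k in keep] + [(labels[i], labels[j])]
    return new_dist, new_labels


# three points on a line at 0, 1 and 5
points = np.array([[0.0], [1.0], [5.0]])
dist = np.abs(points - points.T)
np.fill_diagonal(dist, np.inf)

dist, labels = merge_closest_pair(dist, ['a', 'b', 'c'])
```

Here `'a'` and `'b'` are merged first, and the merged cluster ends up at distance 4 from `'c'`, the smaller of 5 and 4. Swapping `np.minimum` for `np.maximum` would give complete linkage instead.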

In [32]:
tempDF = standardFeatureDF.iloc[:,:-1].copy()

distMatDF = pd.DataFrame( distance_matrix(tempDF.values, tempDF.values), index = tempDF.index, columns = tempDF.index)
# distMatDF = pd.DataFrame( np.tril(distMatDF),  index = tempDF.index, columns = tempDF.index)
# replace 0 distances with np.nan
distMatDF = distMatDF.where(distMatDF!=0, np.nan)
    
clusterKeeper = {}
clusterKeeperList = []
iterationCounter=1

m=len(distMatDF)
while m>1: 

    # cluster size
    print(f"Total sample = {m}")
    # compute distances

    # get indices with min dist
    vals = distMatDF.min(skipna=True)
    idVals = distMatDF.idxmin(skipna=True)

    # print(vals.min(), vals.idxmin()) # GIVES US THE MINIMUM VALUE and the index at which this was found in the vals series
    # print(idVals.iloc[vals.idxmin()])
    
    ind_to_pop = [idVals.loc[vals.idxmin()] , vals.idxmin()]
    print(f"index {ind_to_pop}")

    # update distmatrix at some point
    # add updated new row, col to dist mat  
    # this updated row is basically the minimum of the two eliminated rows
    singleLink_minRow = distMatDF.loc[ind_to_pop].drop(ind_to_pop, axis=1).min()
    singleLink_minRow.rename(f"cluster {iterationCounter}", inplace=True)

    # pop row and col from dist mat
    distMatDF = distMatDF.drop(ind_to_pop, axis=0).drop(ind_to_pop, axis=1)
    print("row,col ",len(distMatDF),len(distMatDF.columns))

    # min distance from other points
    # distMatDF.loc[str(ind_to_pop), :] = singleLink_minRow

    # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
    distMatDF = pd.concat([distMatDF, singleLink_minRow.to_frame().T])
    distMatDF.loc[:,singleLink_minRow.name] = singleLink_minRow
    # update value of m
    m = len(distMatDF)
    # m-=1
    clusterKeeper[f"iteration {iterationCounter}"] = {'indices_popped':ind_to_pop , "df":distMatDF.copy()}
    clusterKeeperList.append( (iterationCounter, ind_to_pop) )
    iterationCounter+=1


Total sample = 500
index [151, 142]
row,col  498 498
Total sample = 499
index [198, 123]
row,col  497 497
Total sample = 498
index [160, 136]
row,col  496 496
Total sample = 497
index ['cluster 2', 144]
row,col  495 495
Total sample = 496
index [389, 370]
row,col  494 494
Total sample = 495
index [148, 135]
row,col  493 493
Total sample = 494
index ['cluster 1', 112]
row,col  492 492
Total sample = 493
index [175, 167]
row,col  491 491
Total sample = 492
index [96, 83]
row,col  490 490
Total sample = 491
index [194, 103]
row,col  489 489
Total sample = 490
index [185, 104]
row,col  488 488
Total sample = 489
index [177, 113]
row,col  487 487
Total sample = 488
index [273, 235]
row,col  486 486
Total sample = 487
index [241, 228]
row,col  485 485
Total sample = 486
index [445, 425]
row,col  484 484
Total sample = 485
index [196, 166]
row,col  483 483
Total sample = 484
index [193, 134]
row,col  482 482
Total sample = 483
index ['cluster 12', 179]
row,col  481 481
Total sample = 482
inde

In [33]:
# extract progression of clusters
# clusterKeeper.keys()
# for key,val in clusterKeeper.items():

#     print(clusterKeeper[key]["indices_popped"])

# temp = clusterKeeper['iteration 499']["indices_popped"] 
# while 'cluster' in ' '.join([str(elem) for elem in temp]) :
#     # print("clu")
    
clusterKeeperList
# print('not clu')
# for i in clusterKeeper['iteration 499']["indices_popped"]    :
#     print()



[(1, [151, 142]),
 (2, [198, 123]),
 (3, [160, 136]),
 (4, ['cluster 2', 144]),
 (5, [389, 370]),
 (6, [148, 135]),
 (7, ['cluster 1', 112]),
 (8, [175, 167]),
 (9, [96, 83]),
 (10, [194, 103]),
 (11, [185, 104]),
 (12, [177, 113]),
 (13, [273, 235]),
 (14, [241, 228]),
 (15, [445, 425]),
 (16, [196, 166]),
 (17, [193, 134]),
 (18, ['cluster 12', 179]),
 (19, [468, 424]),
 (20, [447, 438]),
 (21, [190, 130]),
 (22, [149, 109]),
 (23, [181, 169]),
 (24, ['cluster 17', 118]),
 (25, [168, 164]),
 (26, ['cluster 23', 'cluster 10']),
 (27, ['cluster 5', 375]),
 (28, ['cluster 15', 412]),
 (29, [57, 11]),
 (30, ['cluster 4', 171]),
 (31, [345, 330]),
 (32, [140, 132]),
 (33, [461, 452]),
 (34, [159, 106]),
 (35, ['cluster 20', 454]),
 (36, ['cluster 35', 462]),
 (37, [244, 209]),
 (38, ['cluster 28', 490]),
 (39, [427, 422]),
 (40, [386, 380]),
 (41, ['cluster 9', 68]),
 (42, ['cluster 38', 405]),
 (43, ['cluster 24', 'cluster 8']),
 (44, [268, 206]),
 (45, ['cluster 26', 129]),
 (46, ['clus

In [29]:
_ = [1,2,3,4,'cluster']
# str([i for i in _])
' '.join([str(elem) for elem in _])

'1 2 3 4 cluster'
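
One possible way to extract the progression of clusters from the merge log is to expand every `'cluster k'` name into the original file numbers it contains. The sketch below assumes the log has the same shape as `clusterKeeperList`, i.e. `(iteration, [a, b])` tuples where the merge in iteration `k` creates `'cluster k'`. The function name `expand_members` is hypothetical.

```python
def expand_members(merge_log):
    """Map each 'cluster k' name to the flat list of file numbers it contains.

    merge_log: list of (iteration, [a, b]) tuples, where a and b are either
    file numbers or names of the form 'cluster k' created in iteration k.
    """
    members = {}
    for iteration, pair in merge_log:
        flat = []
        for item in pair:
            if isinstance(item, str):
                # an earlier cluster: reuse its already expanded members
                flat.extend(members[item])
            else:
                flat.append(item)
        members[f"cluster {iteration}"] = flat
    return members


# first four entries of the merge log shown above
log = [(1, [151, 142]), (2, [198, 123]), (3, [160, 136]), (4, ['cluster 2', 144])]
members = expand_members(log)
```

With these four entries, `members['cluster 4']` contains the file numbers 198, 123 and 144. This works in a single pass because a cluster is always created before it appears in a later merge.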

## DBSCAN