# About
Implement clustering algorithms
1. K-Means
1. Hierarchical clustering
1. DBSCAN

## The Team
| Name| Student ID|
|------------|---------------|
|Cynthia Cai | 5625483 |
|Pratyush Kumar | 5359252|


# Imports

// add the imports to the cell below

In [1]:
import numpy as np 
import pandas as pd
from scipy.spatial import ConvexHull, distance_matrix
from sklearn.metrics.pairwise import euclidean_distances as eucDist
import glob
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="darkgrid")

# Reading the dataset


From the readme for the xyz files, we know that:

Ground truth labels:
|File range|Label|
|--|--|
|000 - 099|building|
|100 - 199|car|
|200 - 299|fence|
|300 - 399|pole|
|400 - 499|tree|
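
Because each label covers a block of 100 file numbers, the table can also be read as a lookup on integer division by 100. A minimal sketch of that idea follows; `LABELS` and `label_from_file_number` are illustrative names introduced here, not something defined by the dataset readme.

```python
# Labels in file-number order: 000-099, 100-199, ..., 400-499.
LABELS = ['building', 'car', 'fence', 'pole', 'tree']

def label_from_file_number(file_no):
    """Return the ground-truth label for a file number in 0..499."""
    if not 0 <= file_no < 100 * len(LABELS):
        raise ValueError(f"file number {file_no} is outside 000-499")
    # integer division picks the 100-wide block the file belongs to
    return LABELS[file_no // 100]

print(label_from_file_number(42), label_from_file_number(100), label_from_file_number(499))
```

Raising on out-of-range numbers makes a misnamed file visible instead of silently leaving it unlabeled.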


Workflow:

Iterate through the files, read each one into a DataFrame, and collect all of them in a single DataFrame.

Use [`pandas.concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html#pandas.concat) to concatenate the DataFrames.
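
One way to follow this workflow is to collect the per-file DataFrames in a list and call `pd.concat` once at the end. The sketch below uses two tiny in-memory strings in place of real `.xyz` files; `fake_files`, `frames` and `all_points` are names made up for the illustration.

```python
import io
import pandas as pd

# Two tiny in-memory stand-ins for .xyz files (space-separated x y z rows).
fake_files = {0: "0.0 0.0 1.0\n1.0 0.0 2.0\n", 100: "5.0 5.0 0.5\n"}

frames = []
for file_no, text in fake_files.items():
    df = pd.read_csv(io.StringIO(text), sep=' ', header=None, names=['x', 'y', 'z'])
    df['fileNo'] = file_no          # tag every point with the file it came from
    frames.append(df)

# A single concat at the end avoids re-copying a growing frame on every iteration.
all_points = pd.concat(frames, ignore_index=True)
print(all_points)
```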

In [None]:
xyzPath = './scene_objects/data/*.xyz'

dataPathsList = glob.glob(xyzPath)

In [None]:
import os

# featureDF = pd.DataFrame(columns=['Label' , 'convHull', median] )

pointFrames = []
for path in dataPathsList:
    # the first three characters of the file name are the file number
    # (os.path.basename also works with Windows path separators)
    indx = int(os.path.basename(path)[0:3])
    # determine the ground-truth label from the file-number range
    if 0 <= indx < 100:
        labelToGive = 'building'
    elif 100 <= indx < 200:
        labelToGive = 'car'
    elif 200 <= indx < 300:
        labelToGive = 'fence'
    elif 300 <= indx < 400:
        labelToGive = 'pole'
    elif 400 <= indx < 500:
        labelToGive = 'tree'
    else:
        # avoid silently reusing the label of the previous file
        raise ValueError(f"unexpected file number {indx} in {path}")

    # print(indx, labelToGive)

    # read the space-separated points into a DataFrame and tag every row
    tempDF = pd.read_csv(path, delimiter=' ', header=None, dtype=np.float64, names=['x','y','z'])
    tempDF['fileNo'] = indx
    tempDF['groundLabel'] = labelToGive
    pointFrames.append(tempDF)

# concatenate once; growing a DataFrame inside the loop copies it on every iteration
allPointsDF = pd.concat(pointFrames, sort=False, ignore_index=True)

# allPointsDF.head()

In [None]:
# save to pickle file
# allPointsDF.to_pickle('./scene_objects/compressedData.pkl')

## Making feature points
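Features computed per object so far (more can be added):
* variance of x, y and z
* median height (z)
* range of x, y and z
* convex hull volume
* point density over the footprint and over the hull volume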

In [None]:
def label_determiner(indx):
    labelToGive=None
    if indx>=0 and indx<100:
        labelToGive = 'building' 
    elif indx>=100 and indx<200:
        labelToGive = 'car' 
    elif indx>=200 and indx<300:
        labelToGive = 'fence' 
    elif indx>=300 and indx<400:
        labelToGive = 'pole' 
    elif indx>=400 and indx<500:
        labelToGive = 'tree' 
    return labelToGive


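# select the coordinate columns explicitly: var() is undefined for the string column 'groundLabel'
featureDF = allPointsDF.groupby('fileNo')[['x', 'y', 'z']].var()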
featureDF.rename(columns={'x':'varX','y':'varY','z':'varZ'}, inplace=True)
featureDF.loc[:,'median_Z'] = allPointsDF.groupby('fileNo').z.median()
# featureDF.loc[:,'mean_Z'] = allPointsDF.groupby('fileNo').z.mean()

# range of x,y,z
featureDF.loc[:,'range_X'] = allPointsDF.groupby('fileNo').x.max() - allPointsDF.groupby('fileNo').x.min()
featureDF.loc[:,'range_Y'] = allPointsDF.groupby('fileNo').y.max() - allPointsDF.groupby('fileNo').y.min()
featureDF.loc[:,'range_Z'] = allPointsDF.groupby('fileNo').z.max() - allPointsDF.groupby('fileNo').z.min()

featureDF.loc[:,'Volume'] = allPointsDF.set_index('fileNo').loc[:,'x':'z'].groupby('fileNo').apply(ConvexHull).apply(lambda x: x.volume)

# points density
featureDF.loc[:,'footprintDensity'] =  allPointsDF.groupby('fileNo').count().x / (featureDF.range_X * featureDF.range_Y)
featureDF.loc[:,'volumeDensity'] =  allPointsDF.groupby('fileNo').count().x / featureDF.Volume

# derive the label from the fileNo index itself, so the assignment does not depend on index alignment
featureDF['label'] = [label_determiner(fileNo) for fileNo in featureDF.index]

# standardize DF
standardFeatureDF = (featureDF.iloc[:,:-1] - featureDF.iloc[:,:-1].mean() ) / featureDF.iloc[:,:-1].std()

# join labels to the feature DF
standardFeatureDF = standardFeatureDF.join(other=featureDF.label ,on='fileNo')

featureDF.to_pickle('./scene_objects/featureData.pkl')
standardFeatureDF.to_pickle('./scene_objects/standardFeatureData.pkl')

### Plotting to see resemblances and clusters, if any
Requires: seaborn

In [2]:
# load df's
featureDF = pd.read_pickle('./scene_objects/featureData.pkl')
standardFeatureDF = pd.read_pickle('./scene_objects/standardFeatureData.pkl')

In [None]:
sns.pairplot(data=featureDF, hue="label")

Normalize the feature DataFrame.<br>
[This Stack Overflow thread](https://stackoverflow.com/questions/26414913/normalize-columns-of-pandas-data-frame) shows that plain pandas is enough for standard scaling; alternatively, the [`StandardScaler` from sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) can be applied.<br>

According to [this answer](https://stats.stackexchange.com/questions/417339/data-standardization-vs-normalization-for-clustering-analysis), standard scaling is what is used for k-means, so we are going with that.
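
The two options differ in one detail worth knowing: pandas' `std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below shows both variants in plain pandas on a made-up three-row frame; the numbers in `features` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up feature values, only to show the arithmetic.
features = pd.DataFrame({'range_Z': [2.0, 4.0, 6.0], 'Volume': [10.0, 20.0, 60.0]})

# z-score with the sample standard deviation (pandas default, ddof=1)
z_sample = (features - features.mean()) / features.std()

# z-score with the population standard deviation (ddof=0),
# the convention sklearn's StandardScaler follows
z_population = (features - features.mean()) / features.std(ddof=0)

print(z_sample.round(3))
print(z_population.round(3))
```

The two results differ only by a constant factor per column, so the relative distances within a column are unchanged.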

In [None]:
sns.pairplot(data=standardFeatureDF, hue="label")

# Clustering Algorithms
Note: `featureDF` and `standardFeatureDF` were already loaded in the cells above.

## K-Means clustering

In [None]:


def k_means():
    """
    summary: this function is not yet ready
    """
    pass

In [None]:
k_means?
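
As a possible starting point for `k_means`, here is a minimal sketch of Lloyd's algorithm in NumPy. It assumes Euclidean distance and random initialization from the samples; `k_means_sketch`, `n_iter` and `seed` are names chosen for this illustration and the toy array `X` is made up.

```python
import numpy as np

def k_means_sketch(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids) for an (n, d) array X."""
    rng = np.random.default_rng(seed)
    # start from k distinct samples chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assignment step: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: mean of each cluster (keep the old centroid if a cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centroids = k_means_sketch(X, k=2)
print(labels)
```

Which blob gets id 0 depends on the random initialization, so only the grouping is meaningful, not the ids themselves.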

## Hierarchical clustering

This [reference](https://www.section.io/engineering-education/hierarchical-clustering-in-python/) was helpful for understanding hierarchical clustering.

Some other sources:
* [Statquest](https://www.youtube.com/watch?v=7xHsRkOdVwo&ab_channel=StatQuestwithJoshStarmer)
* Penn state [pseudo code](https://online.stat.psu.edu/stat508/lesson/12/12.7)
* pseudo code from [researchgate](https://www.researchgate.net/figure/The-hierarchical-clustering-algorithm-in-pseudocode_fig1_202144697)
* Towards Data Science article that goes [step by step](https://towardsdatascience.com/breaking-down-the-agglomerative-clustering-process-1c367f74c7c2) (this is a good one to follow)
* another one [for theory](https://towardsdatascience.com/machine-learning-algorithms-part-12-hierarchical-agglomerative-clustering-example-in-python-1e18e0075019)
* similar [theory as above](https://www.geeksforgeeks.org/ml-hierarchical-clustering-agglomerative-and-divisive-clustering/)
* really good [step by step explanation](https://medium.com/@darkprogrammerpb/agglomerative-hierarchial-clustering-from-scratch-ec50e14c3826), also the [github code](https://github.com/Darkprogrammerpb/DeepLearningProjects/blob/master/Project40/agglomerative_hierarchial_clustering/Hierarchial%20Agglomerative%20clustering.ipynb)

### To think about in hierarchical clustering:
* Which type of hierarchical clustering are we doing? Let's begin with agglomerative clustering.
* Within the selected type, which distance metric are we using?
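
For agglomerative clustering, the second question includes how the distance between two clusters is derived from point distances. The sketch below compares three common choices (single, complete and average linkage) on Euclidean distances; `cluster_distance` and the toy arrays `A` and `B` are illustrative names and data, not part of the notebook.

```python
import numpy as np

def cluster_distance(A, B, linkage='single'):
    """Distance between two clusters given as (n, d) and (m, d) point arrays."""
    # all pairwise Euclidean distances between points of A and points of B
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    if linkage == 'single':      # closest pair
        return pairwise.min()
    if linkage == 'complete':    # farthest pair
        return pairwise.max()
    if linkage == 'average':     # mean over all pairs
        return pairwise.mean()
    raise ValueError(f"unknown linkage: {linkage}")

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])
for name in ('single', 'complete', 'average'):
    print(name, cluster_distance(A, B, name))
```

Single linkage tends to chain elongated clusters together, while complete linkage favors compact ones, so the choice can change the result noticeably.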


In [None]:

tempDF = standardFeatureDF.iloc[:,:-1].copy()

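def heirarch_clust(dataDF):
    # pairwise Euclidean distances between the rows of the frame passed in
    # (the label column, if present, is not a feature)
    distances = eucDist(dataDF.drop(columns='label', errors='ignore'))
    
    pass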


# calculate distances
# maybe change the distance computation
distMatDF = pd.DataFrame( distance_matrix(tempDF.values, tempDF.values), index = tempDF.index, columns = tempDF.index)
# distMatDF = pd.DataFrame( np.tril(distMatDF),  index = tempDF.index, columns = tempDF.index)
distMatDF = distMatDF.where(distMatDF!=0, np.nan)
distMatDF


Devise a new distance matrix after each merge, then repeat the sequence:
### TODO:
* linkage between the clusters
* updating the distance matrix

Clusters to be made: the closest pair is given by
`vals.idxmin()` and `idVals.loc[vals.idxmin()]`
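
For single and complete linkage, the distance from a merged cluster to any other item is the element-wise minimum or maximum of the two merged rows, so the matrix can be updated without going back to the points. A minimal NumPy sketch of one such merge step follows; `merge_closest_pair` and the 3x3 matrix `D` are made up for the illustration.

```python
import numpy as np

def merge_closest_pair(D, linkage='single'):
    """Merge the closest pair in a symmetric distance matrix D.

    Returns (i, j, new_D); in new_D the merged cluster is the last row/column.
    """
    masked = D.astype(float).copy()
    np.fill_diagonal(masked, np.inf)          # ignore self-distances
    i, j = np.unravel_index(masked.argmin(), masked.shape)
    combine = np.minimum if linkage == 'single' else np.maximum
    merged = combine(D[i], D[j])              # distance of the new cluster to every item
    keep = [k for k in range(len(D)) if k not in (i, j)]
    new_D = np.zeros((len(keep) + 1, len(keep) + 1))
    new_D[:-1, :-1] = D[np.ix_(keep, keep)]   # distances among untouched items
    new_D[-1, :-1] = new_D[:-1, -1] = merged[keep]
    return int(i), int(j), new_D

D = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 2.0],
              [4.0, 2.0, 0.0]])
i, j, new_D = merge_closest_pair(D, linkage='single')
print(i, j)
print(new_D)
```

Repeating this step until one row remains yields the full merge sequence.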

In [4]:
tempDF = standardFeatureDF.iloc[:,:-1].copy()

distMatDF = pd.DataFrame( distance_matrix(tempDF.values, tempDF.values), index = tempDF.index, columns = tempDF.index)
# distMatDF = pd.DataFrame( np.tril(distMatDF),  index = tempDF.index, columns = tempDF.index)
# replace 0 distances with np.nan
distMatDF = distMatDF.where(distMatDF!=0, np.nan)
    
clusterKeeper = {}
clustDict={}
clusterKeeperList = []
clustCheck = {}
# clustCHECK WILL have two nodes each
iterationCounter=0
play=[]
m=len(distMatDF)
progression = [ [i] for i in range(m) ] 

while m>1: 

    # cluster size
    # print(f"Total sample = {m}")
    # compute distances

    # get indices with min dist
    vals = distMatDF.min(skipna=True)
    idVals = distMatDF.idxmin(skipna=True)

    # print(vals.min(), vals.idxmin()) # GIVES US THE MINIMUM VALUE and the index at which this was found in the vals series
    # print(idVals.iloc[vals.idxmin()])
    
    ind_to_pop = [idVals.loc[vals.idxmin()] , vals.idxmin()]
    # print(f"index {ind_to_pop}")
    play.append(ind_to_pop)
    # build the row for the merged cluster from the two eliminated rows
    # NOTE: .max() takes the element-wise maximum, which is complete linkage
    # despite the variable name; use .min() here for single linkage
    singleLink_minRow = distMatDF.loc[ind_to_pop].drop(ind_to_pop, axis=1).max()
    singleLink_minRow.rename(f"cluster {iterationCounter}", inplace=True)

    # pop row and col from dist mat
    distMatDF = distMatDF.drop(ind_to_pop, axis=0).drop(ind_to_pop, axis=1)
    # print("row,col ",len(distMatDF),len(distMatDF.columns))

    # min distance from other points

    # DataFrame.append was removed in pandas 2.0; concat the new row instead
    distMatDF = pd.concat([distMatDF, singleLink_minRow.to_frame().T])
    distMatDF.loc[:,singleLink_minRow.name] = singleLink_minRow
    # update value of m
    m = len(distMatDF)
    # m-=1
    clusterKeeper[f"iteration {iterationCounter}"] = {'indices_popped':ind_to_pop , "df":distMatDF.copy()}
    clusterKeeperList.append( (iterationCounter, ind_to_pop) )
    clustDict[f"cluster {iterationCounter}"] = ind_to_pop
    
    indPop1, indPop2 = ind_to_pop

    clustCheck[f"cluster {iterationCounter}"] = {'node1':indPop1 , "node2":indPop2, 'fullnodes':ind_to_pop}
    print("before" , clustCheck[f'cluster {iterationCounter}'])
    
    # Resolve each merged item to its member samples: an earlier cluster
    # contributes its flat 'fullnodes' list, a single sample contributes itself.
    isClust1, isClust2 = indPop1 in clustCheck, indPop2 in clustCheck
    members1 = clustCheck[indPop1]['fullnodes'].copy() if isClust1 else [indPop1]
    members2 = clustCheck[indPop2]['fullnodes'].copy() if isClust2 else [indPop2]
    clustCheck[f"cluster {iterationCounter}"] = {
        'node1': members1 if isClust1 else indPop1,
        'node2': members2 if isClust2 else indPop2,
        # concatenate (not append) so that 'fullnodes' stays a flat list
        'fullnodes': members1 + members2,
    }

    print("after" , clustCheck[f'cluster {iterationCounter}'])

    iterationCounter+=1
distMatDF

before {'node1': 151, 'node2': 142, 'fullnodes': [151, 142]}
after {'node1': 151, 'node2': 142, 'fullnodes': [151, 142]}
before {'node1': 198, 'node2': 123, 'fullnodes': [198, 123]}
after {'node1': 198, 'node2': 123, 'fullnodes': [198, 123]}
before {'node1': 160, 'node2': 136, 'fullnodes': [160, 136]}
after {'node1': 160, 'node2': 136, 'fullnodes': [160, 136]}
before {'node1': 389, 'node2': 370, 'fullnodes': [389, 370]}
after {'node1': 389, 'node2': 370, 'fullnodes': [389, 370]}
before {'node1': 148, 'node2': 135, 'fullnodes': [148, 135]}
after {'node1': 148, 'node2': 135, 'fullnodes': [148, 135]}
before {'node1': 175, 'node2': 167, 'fullnodes': [175, 167]}
after {'node1': 175, 'node2': 167, 'fullnodes': [175, 167]}
before {'node1': 96, 'node2': 83, 'fullnodes': [96, 83]}
after {'node1': 96, 'node2': 83, 'fullnodes': [96, 83]}
before {'node1': 194, 'node2': 103, 'fullnodes': [194, 103]}
after {'node1': 194, 'node2': 103, 'fullnodes': [194, 103]}
before {'node1': 185, 'node2': 104, 'ful

|fileNo|cluster 498|
|--|--|
|cluster 498| |

## DBSCAN
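
A minimal sketch of DBSCAN, assuming Euclidean distance and a neighbourhood that includes the point itself. `dbscan_sketch`, `eps` and `min_samples` are names chosen for this illustration, the toy points are made up, and nothing here is tuned for the feature data.

```python
import numpy as np

def dbscan_sketch(X, eps, min_samples):
    """Label each row of X with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # neighbourhoods include the point itself
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbours])

    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # grow a new cluster outward from this unassigned core point
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    if is_core[q]:        # only core points keep expanding
                        frontier.append(q)
        cluster_id += 1
    return labels

# Two tight toy groups and one isolated point.
X = np.array([[0.0, 0.0], [0.0, 0.5], [0.5, 0.0],
              [5.0, 5.0], [5.0, 5.5], [5.5, 5.0],
              [10.0, 0.0]])
print(dbscan_sketch(X, eps=0.8, min_samples=3))
```

Unlike k-means, the number of clusters is not fixed in advance; it follows from `eps` and `min_samples`, and points that no core point reaches stay labeled as noise.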

# Validation

In [None]:
def validateModels(classifiedData, originalData):
    pass
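
One simple candidate for validation is cluster purity: every cluster votes for its most common ground-truth label, and the score is the fraction of samples that agree with their cluster's vote. The sketch below is only one possible metric; `purity` and the toy lists are illustrative and not tied to the `validateModels` signature.

```python
from collections import Counter, defaultdict

def purity(cluster_ids, true_labels):
    """Fraction of samples whose label equals the majority label of their cluster."""
    by_cluster = defaultdict(list)
    for cid, label in zip(cluster_ids, true_labels):
        by_cluster[cid].append(label)
    # each cluster contributes the size of its most common ground-truth label
    majority_total = sum(Counter(labels).most_common(1)[0][1]
                         for labels in by_cluster.values())
    return majority_total / len(true_labels)

clusters = [0, 0, 0, 1, 1, 1]
truth = ['car', 'car', 'tree', 'tree', 'tree', 'tree']
print(purity(clusters, truth))
```

Purity rises as the number of clusters grows (one cluster per sample scores 1.0), so it is most informative when models are compared at the same cluster count.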