# COMP 7150 Assignment 4

**Deadline**: April 1, 2024. Before midnight (12AM)


YOUR NAME: ___

---

**Reading**
* Notebooks 8, 11: standardization of data
* Notebook 9: how to use KMeans clustering
* Notebook 10: how to use PCA
* Notebook 12: Application of PCA

In this assignment, you will use the abalone dataset.

To answer questions, please substantiate your answers with reasoning, explanation, and code.  

In [40]:
import warnings
warnings.filterwarnings('ignore')

In [78]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
import seaborn

In [42]:
import pandas
abalone = pandas.read_csv('../Datasets/abalone.csv')
abalone.sample(3)

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
2348,M,0.525,0.43,0.165,0.8645,0.376,0.1945,0.2515,16
956,M,0.495,0.4,0.135,0.61,0.272,0.1435,0.144,7
2421,I,0.52,0.415,0.16,0.595,0.2105,0.142,0.26,15


---
**Problem 1**

Define a Python function named `evaluate_kmean`, which has these parameters:
* X - the features provided to KMeans
* k - the number of clusters (`n_clusters`) provided to KMeans
* random_state - the random state (`random_state`) provided to KMeans

This function should build a KMeans model on the features and return the silhouette score of the resulting clustering.  Note: use the same features to predict and score the clustering.

In [43]:

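def evaluate_kmean(X, k, random_state):
    # Fit KMeans with k clusters on the given features.
    model = KMeans(n_clusters=k, random_state=random_state)
    model.fit(X)

    # Score the clustering on the same features that were used for fitting.
    sil_score = silhouette_score(X, model.labels_, metric='euclidean')

    return sil_score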



---
**Problem 2**

Compare the performance of KMeans on the abalone dataset using these three feature sets:
* all variables in the abalone dataset.  Note: you might need to process the categorical variable(s).
* all variables in the abalone dataset, standardized.
* all variables in the abalone dataset, normalized.

The comparison should be done over values of k (the number of clusters) ranging from 2 to 30.  

Report and describe your findings.  Which of the three feature sets gives the best clustering score, and what is the best number of clusters for it?

In [44]:
abalone.sample(3)

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
2928,M,0.61,0.48,0.14,1.0625,0.516,0.225,0.2915,11
4146,M,0.695,0.53,0.21,1.51,0.664,0.4095,0.385,10
2796,M,0.63,0.525,0.195,1.3135,0.4935,0.2565,0.465,10


In [45]:
abalone = pandas.get_dummies(abalone)
#drop_first=True

In [46]:
abalone.sample(4)

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings,Sex_F,Sex_I,Sex_M
3300,0.615,0.5,0.205,1.1055,0.4445,0.227,0.39,16,1,0,0
2945,0.635,0.485,0.19,1.3765,0.634,0.2885,0.406,11,1,0,0
2353,0.72,0.56,0.175,1.7265,0.637,0.3415,0.525,17,1,0,0
583,0.46,0.355,0.14,0.4935,0.216,0.133,0.115,13,0,1,0


In [47]:
len(abalone)

4177

In [48]:
features= abalone

In [52]:
normalized_abalone = pandas.DataFrame(
    MinMaxScaler().fit_transform(features).round(4),
    columns = features.columns
)

normalized_abalone.sample(3)

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings,Sex_F,Sex_I,Sex_M
337,0.6959,0.6639,0.1637,0.4537,0.3174,0.3627,0.422,0.5357,0.0,0.0,1.0
2671,0.6959,0.6807,0.1416,0.3575,0.2986,0.3436,0.2541,0.25,1.0,0.0,0.0
383,0.5338,0.5378,0.1062,0.1964,0.1513,0.16,0.1928,0.3929,0.0,0.0,1.0


In [51]:
standardized_abalone = pandas.DataFrame(
    StandardScaler().fit_transform(features).round(4),
    columns = features.columns
)
standardized_abalone.sample(3)

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings,Sex_F,Sex_I,Sex_M
3699,1.0494,1.1803,1.2071,1.1008,1.073,1.053,1.1579,0.3308,-0.6748,-0.688,1.3167
2242,-0.408,-0.281,-0.108,-0.699,-0.8375,-1.0091,-0.3868,-0.91,1.4818,-0.688,-0.7595
830,-0.9077,-1.0368,-0.9449,-1.0274,-0.9163,-1.187,-1.0621,-1.2202,-0.6748,-0.688,1.3167


## Cluster analysis for each feature set:

In [54]:
regular_score=[]
normalized_score=[]
standardized_score=[]

for i in range(2,31):
    
    regular_ans=evaluate_kmean(abalone, i, 2024)
    regular_score.append(regular_ans)
    
    normalized_ans=evaluate_kmean(normalized_abalone, i, 2024)
    normalized_score.append(normalized_ans)
    
    standardized_ans=evaluate_kmean(standardized_abalone, i, 2024)
    standardized_score.append(standardized_ans)
    
    #print(f" for cluster {i}: Regular Score:{regular_ans} , normalized score: {normalized_ans} , standardized score: {standardized_ans}")

clusters= pandas.DataFrame({'regular_score': regular_score, 
                            'normalized_score': normalized_score, 
                            'standardized_score': standardized_score}) 

clusters.set_index(pandas.Index(range(2, 31)), inplace=True)

clusters

Unnamed: 0,regular_score,normalized_score,standardized_score
2,0.527621,0.507573,0.385307
3,0.446426,0.732103,0.284524
4,0.407519,0.627467,0.341456
5,0.378306,0.536564,0.350105
6,0.35758,0.451534,0.362148
7,0.348655,0.437008,0.375989
8,0.35364,0.419734,0.362164
9,0.34157,0.406031,0.348923
10,0.377727,0.382949,0.333848
11,0.415043,0.365964,0.346326


### Analysis:

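- For the original (unscaled) features, the silhouette score falls from 0.528 at k=2 to about 0.34 at k=9, then rises again; the overall maximum is 0.544 at k=30.
- For the normalized features, the silhouette score peaks at 0.732 at k=3 and then declines as the number of clusters increases.
- For the standardized features, the silhouette score is highest at k=2 (0.385) and stays between roughly 0.28 and 0.38 for k=3 to 11.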

In [74]:
print("Silhouette score analysis for different clusters: \n")

# For each feature set, report the best silhouette score and the k that produced it.
for label, column in [("regular", "regular_score"),
                      ("normalized", "normalized_score"),
                      ("standardized", "standardized_score")]:
    best_score = clusters[column].max()
    print(f"{label} clusters: ", best_score, clusters[clusters[column] == best_score].index.values)

Silhouette score analysis for different clusters: 

regular clusters:  0.5443162878222592 [30]
normalized clusters:  0.7321034964852111 [3]
standardized clusters:  0.3853065586100005 [2]


### Clustering for standardized data with 2 clusters.

In [75]:
stand_model= KMeans(n_clusters=2, random_state=2024)
stand_model.fit(standardized_abalone)



In [76]:
standardized_abalone['cluster']= stand_model.labels_

In [77]:
standardized_abalone.sample(3)

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings,Sex_F,Sex_I,Sex_M,cluster
1946,0.9245,1.1803,0.4898,0.747,0.8297,0.5968,0.6909,0.641,-0.6748,-0.688,1.3167,1
227,-1.3241,-1.3895,-1.3035,-1.2721,-1.2678,-1.2052,-1.213,-0.91,-0.6748,1.4535,-0.7595,0
582,1.0077,1.1803,1.2071,1.2946,1.3613,1.4864,1.4094,2.8123,1.4818,-0.688,-0.7595,1


In [None]:
#seaborn.relplot(data=standardized_abalone, x='', y='', hue='cluster', height=3)

---
**Problem 3**

Following the previous problem, select the best value of k (one that gives the highest silhouette score).

Define a Python function named `evaluate_kmean_with_pca`, which has these parameters:
* X - the features provided to KMeans
* n - the number of principal components (`n_components`) provided to PCA
* random_state - the random state provided to KMeans

The function will transform the features X, using PCA.  It will create a KMeans model, with the optimal number of clusters found in the previous problem. Then, use the model to fit the transformed features, make predictions using the transformed features, and calculate the silhouette score of the resulting clustering.


In [83]:
def evaluate_kmean_with_pca(X, n, random_state, no_cluster):
    '''
    Transform the features X with PCA (n components), fit a KMeans model with
    no_cluster clusters on the transformed data, and return the silhouette
    score of the resulting clustering.
    '''
    # Project the features onto the first n principal components.
    pca = PCA(n_components=n)
    converted_pca = pandas.DataFrame(pca.fit_transform(X))

    # Cluster in the reduced space.
    pca_model = KMeans(n_clusters=no_cluster, random_state=random_state)
    pca_model.fit(converted_pca)

    # Score the clustering on the same transformed features.
    sil_score = silhouette_score(converted_pca, pca_model.labels_, metric='euclidean')

    return sil_score

    

---
**Problem 4**

Identify the number of components (`n_components`) that gives the highest score, with each of these three features:
* all variables in the abalone dataset.  Note: you might need to process the categorical variable(s).
* all variables in the abalone dataset, standardized.
* all variables in the abalone dataset, normalized.

Report your finding.


In [84]:
regular_pca_score=[]
normalized_pca_score=[]
standardized_pca_score=[]

# The 'cluster' column added in Problem 2 is a label, not a feature, so exclude it.
standardized_features = standardized_abalone.drop(columns='cluster', errors='ignore')

for i in range(2,11):
    # Use the best k found in Problem 2 for each feature set (30, 3, and 2).
    regular_pca=evaluate_kmean_with_pca(abalone, i, 2024, 30)
    regular_pca_score.append(regular_pca)

    normalized_pca=evaluate_kmean_with_pca(normalized_abalone, i, 2024, 3)
    normalized_pca_score.append(normalized_pca)
    
    standardized_pca=evaluate_kmean_with_pca(standardized_features, i, 2024, 2)
    standardized_pca_score.append(standardized_pca)

pca_clusters= pandas.DataFrame({'regular_pca_score': regular_pca_score, 
                            'normalized_pca_score': normalized_pca_score, 
                            'standardized_pca_score': standardized_pca_score}) 

pca_clusters.set_index(pandas.Index(range(2, 11)), inplace=True)

pca_clusters

---
**Problem 5**

At this point, you have a good idea if the original features (X), or the standardized features, or the normalized features resulted in the best clustering performance.

Use these features to identify the best combination of the number of clusters (k) and the number of principal components (n) used to cluster.

Define a Python function called `evaluate`, with these parameters:
* X - the features provided to KMeans
* n - the number of principal components (`n_components`) provided to PCA
* k - the number of clusters (`n_clusters`) provided to KMeans
* random_state - the random state provided to KMeans

Use this function to identify the values of `k` and `n` that give the highest clustering (silhouette) score.
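
A minimal sketch of how such a function and search could look, assuming PCA is applied before KMeans as in Problem 3. The data here is synthetic (`make_blobs`) and only stands in for the abalone features so that the block runs on its own; the grid ranges and the names `X_demo`, `grid`, and `best` are illustrative assumptions, not part of the assignment.

```python
import pandas
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def evaluate(X, n, k, random_state):
    """Project X onto n principal components, cluster into k groups,
    and return the silhouette score of that clustering."""
    reduced = PCA(n_components=n).fit_transform(X)
    model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
    labels = model.fit_predict(reduced)
    return silhouette_score(reduced, labels)


# Assumption: synthetic stand-in data, NOT the abalone features.
X_demo, _ = make_blobs(n_samples=300, n_features=6, centers=4,
                       cluster_std=0.5, random_state=0)

# Score every (n, k) pair in a small, hypothetical grid.
rows = [
    {"n": n, "k": k, "score": evaluate(X_demo, n, k, random_state=2024)}
    for n in range(2, 5)
    for k in range(2, 7)
]
grid = pandas.DataFrame(rows)

# Row with the highest silhouette score.
best = grid.loc[grid["score"].idxmax()]
print(best)
```

For the assignment itself, the same loop would run over the chosen abalone feature set with k from 2 to 30 and n up to the number of columns.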

---
**Problem 6**

Continuing from Problem 5, where you found the values of `k` and `n` with the highest score.

Which feature(s) does the first principal component correlate with the most? Which feature(s) does the second principal component correlate with the most?
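
One possible way to approach this question is to correlate each original feature with the principal component scores. The sketch below does this on a small synthetic frame (an assumption, standing in for the abalone features); the column names `length`, `weight`, and `noise` are hypothetical.

```python
import numpy
import pandas
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data, NOT the abalone features.
rng = numpy.random.default_rng(0)
size = rng.normal(size=200)
demo = pandas.DataFrame({
    "length": size + rng.normal(scale=0.1, size=200),
    "weight": size + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
})

# Standardize, then compute the scores on the first two components.
scaled = pandas.DataFrame(StandardScaler().fit_transform(demo),
                          columns=demo.columns)
pca = PCA(n_components=2)
scores = pandas.DataFrame(pca.fit_transform(scaled), columns=["PC1", "PC2"])

# Correlation of every original feature with every principal component.
correlations = pandas.DataFrame(
    {pc: scaled.corrwith(scores[pc]) for pc in scores.columns}
)
print(correlations.round(2))

# Feature with the largest absolute correlation for each component.
strongest = correlations.abs().idxmax()
print(strongest)
```

The rows of `pca.components_` (the loadings) give a related view: each row holds the weight of every feature in one component.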

---
**Problem 7**

Continuing from Problem 5, where you found the values of `k` and `n` with the highest score, what do the clusters represent?
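
A common way to interpret clusters is to summarize the original features per cluster. The sketch below illustrates the idea on a tiny hand-made frame (an assumption, not real abalone rows): it averages the numeric features within each cluster and cross-tabulates the cluster labels against a categorical column.

```python
import pandas
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Assumption: tiny hand-made frame standing in for the abalone data.
demo = pandas.DataFrame({
    "Length": [0.20, 0.22, 0.25, 0.60, 0.62, 0.65],
    "Rings": [4, 5, 5, 12, 13, 14],
    "Sex": ["I", "I", "I", "M", "F", "M"],
})

# Cluster on the standardized numeric features only.
numeric = demo[["Length", "Rings"]]
model = KMeans(n_clusters=2, n_init=10, random_state=2024)
demo["cluster"] = model.fit_predict(StandardScaler().fit_transform(numeric))

# Average of each numeric feature per cluster, in original units.
profile = demo.groupby("cluster")[["Length", "Rings"]].mean()
print(profile)

# How the categorical variable is distributed across the clusters.
sex_by_cluster = pandas.crosstab(demo["cluster"], demo["Sex"])
print(sex_by_cluster)
```

If one cluster has clearly smaller averages and is dominated by a single category, that suggests what the cluster represents.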