<a href="https://colab.research.google.com/github/guptaankit894/AAIM/blob/main/google_colab_files/05_Clustering_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Data description**

The data set contains the expression levels of 77 proteins/protein modifications (columns) that produced detectable signals in the nuclear fraction of the cortex of Ts65Dn trisomic mice and control mice. These proteins are related to hippocampal function (learning and memory). There are 38 control mice and 34 trisomic mice (Down syndrome), 72 mice in total. Each protein was measured 15 times per mouse/sample, giving 570 measurements for control mice and 510 for trisomic mice, or 1080 measurements per protein in total, each treated as an independent sample. Some of the values are missing.
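
The measurement counts quoted above follow directly from the number of mice and the 15 repeated measurements per mouse. A minimal sketch of that arithmetic (the variable names are illustrative only):

```python
# Counts taken from the data description above.
measurements_per_mouse = 15
control_mice = 38
trisomic_mice = 34

control_measurements = control_mice * measurements_per_mouse
trisomic_measurements = trisomic_mice * measurements_per_mouse
total_measurements = control_measurements + trisomic_measurements

print(control_measurements, trisomic_measurements, total_measurements)
```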

**Type of data**
This data set is multi-level: it describes 8 classes of mice defined by genotype, behaviour and treatment. By genotype, mice are either control (c) or trisomic (t). By behaviour, some mice were stimulated to learn (CS) and others were not (SC). By treatment, some mice were injected with the drug memantine (m), to assess its effect on learning, and others with saline solution (s). The aim is to identify subsets of proteins that discriminate between the classes.

Classes:
c-CS-s: control mice, stimulated to learn, injected with saline (9 mice)
c-CS-m: control mice, stimulated to learn, injected with memantine (10 mice)
c-SC-s: control mice, not stimulated to learn, injected with saline (9 mice)
c-SC-m: control mice, not stimulated to learn, injected with memantine (10 mice)

t-CS-s: trisomic mice, stimulated to learn, injected with saline (7 mice)
t-CS-m: trisomic mice, stimulated to learn, injected with memantine (9 mice)
t-SC-s: trisomic mice, not stimulated to learn, injected with saline (9 mice)
t-SC-m: trisomic mice, not stimulated to learn, injected with memantine (9 mice)
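
Each class label encodes the three factors in a fixed order (genotype, behaviour, treatment). As a rough sketch, a label can be split back into those factors as shown here; `parse_class_label` and `mice_per_class` are hypothetical names introduced for illustration, and the counts are copied from the list above.

```python
def parse_class_label(label):
    """Split a label such as 'c-CS-s' into genotype, behaviour and treatment."""
    genotype, behaviour, treatment = label.split("-")
    return {
        "genotype": {"c": "control", "t": "trisomic"}[genotype],
        "behaviour": {"CS": "stimulated to learn",
                      "SC": "not stimulated to learn"}[behaviour],
        "treatment": {"s": "saline", "m": "memantine"}[treatment],
    }

# Number of mice per class, as listed above.
mice_per_class = {
    "c-CS-s": 9, "c-CS-m": 10, "c-SC-s": 9, "c-SC-m": 10,
    "t-CS-s": 7, "t-CS-m": 9, "t-SC-s": 9, "t-SC-m": 9,
}
total_mice = sum(mice_per_class.values())

print(parse_class_label("t-CS-m"))
print("Total mice:", total_mice)
```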

In [None]:
# Importing Libraries
import pandas as pd # data structure
import matplotlib.pyplot as plt # Plotting purpose
import numpy as np # Numerical computations
import seaborn as sns; sns.set(color_codes=True)  # for plot styling
from scipy.cluster.hierarchy import dendrogram, linkage # "linkage" builds the agglomerative merge tree, "dendrogram" plots it
from sklearn.decomposition import PCA  # Principal component analysis
from sklearn.cluster import KMeans, AgglomerativeClustering  # Clustering methods
from sklearn import preprocessing # for standardization (scaling) of features

In [None]:
!pip install xlrd # for reading and formatting data from excel (xls) files

In [None]:
# Reading data
data = pd.read_excel("https://archive.ics.uci.edu/ml/machine-learning-databases/00342/Data_Cortex_Nuclear.xls", header = 0)
data.head()

In [None]:
# Data Cleaning
data = data.dropna()  # Dropping every row that contains a missing value (NaN)
data = data.drop(['Behavior','Genotype','MouseID','Treatment'], axis=1) # Dropping columns not used for clustering (this information is already encoded in 'class')
data.head() # Printing first 5 rows to see the data in the dataframe
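
Because `dropna()` removes every row with at least one missing value, it can be worth checking how many rows are lost. The sketch below uses a small made-up frame (the column names `protein_a` and `protein_b` are hypothetical, not columns of the real data set) to show the kind of check one might run before and after cleaning.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are invented.
toy = pd.DataFrame({
    "protein_a": [0.5, np.nan, 0.7, 0.6],
    "protein_b": [1.2, 1.1, np.nan, 1.0],
    "class": ["c-CS-s", "c-CS-s", "t-CS-m", "t-CS-m"],
})

rows_before = len(toy)
missing_per_column = toy.isna().sum()   # number of NaNs in each column
toy_clean = toy.dropna()                # keeps only fully observed rows
rows_after = len(toy_clean)

print(missing_per_column)
print(f"Rows kept: {rows_after} of {rows_before}")
```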

In [None]:
sample = data.pop('class') # extracting the class column from the dataset; it is used below to colour the rows by class

#Creating a colour palette for the dendrogram
lut = dict(zip(sample.unique(),'bgrcmykw'))
row_colors = sample.map(lut)
row_colors.head()

#Creating a dendrogram with heatmap to visualise data
# Note: z_score=0 standardises each row (sample); z_score=1 would standardise each column (protein)
g = sns.clustermap(data, row_colors=row_colors, z_score=0)

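#Scaling data (zero mean, unit variance per column) before computing the correlation heat map
data_s = preprocessing.scale(data)
data_s = pd.DataFrame(data_s, columns=data.columns, index=data.index) # wrap the scaled array, keeping the original column names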

#Correlation matrix and heatmap
data_s.corr()
fig, ax = plt.subplots()
fig.set_size_inches(14, 10)
ax=sns.heatmap(data_s.corr()) # showing correlation from the heatmap
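
Pearson correlation is unchanged when each column is shifted and rescaled, so standardising the features is not expected to alter the correlation heat map itself. A minimal sketch on synthetic data (the columns `p1`, `p2`, `p3` are made up) illustrates this:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

rng = np.random.default_rng(0)
# Three synthetic features on very different scales.
raw = pd.DataFrame(
    rng.normal(loc=[1.0, 50.0, -3.0], scale=[0.1, 10.0, 2.0], size=(100, 3)),
    columns=["p1", "p2", "p3"],
)

# preprocessing.scale returns a NumPy array, so wrap it back into a DataFrame.
scaled = pd.DataFrame(preprocessing.scale(raw), columns=raw.columns)

same_correlation = np.allclose(raw.corr(), scaled.corr())
print("Correlation matrices match:", same_correlation)
```

Scaling does matter for distance-based steps such as k-means and PCA, where features with large variance would otherwise dominate.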

In [None]:
# Finding a suitable number of clusters with the elbow method
wcss = [] # within-cluster sum of squares (inertia) for each number of clusters
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++')
    #kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(data) # Fitting data for clustering
    wcss.append(kmeans.inertia_)

plt.figure()
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster sum of squares (WCSS)')
plt.show()
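
The quantity plotted by the elbow method, `inertia_`, is the sum of squared distances from each sample to the centre of its assigned cluster. The sketch below recomputes it by hand on synthetic blobs (the data and the choice of 3 clusters are assumptions made for illustration) and compares it with the value reported by scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs in 2D.
points = np.vstack([
    rng.normal(loc=centre, scale=0.3, size=(50, 2))
    for centre in [(0, 0), (5, 5), (0, 5)]
])

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42).fit(points)

# Centre assigned to each sample, then the summed squared distance to it.
assigned_centres = km.cluster_centers_[km.labels_]
wcss_manual = np.sum((points - assigned_centres) ** 2)

print("Manual WCSS:", wcss_manual)
print("KMeans inertia_:", km.inertia_)
```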

**Dimensionality Reduction using Principal Component Analysis**

In [None]:
pca=PCA(n_components=2) # just 2 components for simplicity

#Fit PCA to the dataset (only variables, excluding class)
pca.fit(data)

#Calculating rotated PCA scores
datatrans=pca.transform(data)
classes=sample
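
Keeping only 2 components discards some variance, and `explained_variance_ratio_` gives a rough idea of how much is retained. A small sketch on synthetic data (6 made-up features driven by 2 latent factors, which is an assumption of this example and not a property of the protein data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))        # two hidden factors
mixing = rng.normal(size=(2, 6))          # how the factors map to 6 features
features = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

pca_demo = PCA(n_components=2)
scores = pca_demo.fit_transform(features)  # projected coordinates (PC scores)

retained = pca_demo.explained_variance_ratio_.sum()
print("Variance ratio per component:", pca_demo.explained_variance_ratio_)
print("Total variance retained:", retained)
```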

In [None]:
# K-means clustering on the PCA scores (first two principal components)
kmeans = KMeans(n_clusters = 7, init = 'k-means++', random_state = 42)
pred_kmeans = kmeans.fit_predict(datatrans)
#pred_kmeans = pred_kmeans+1

plt.figure()
sns.scatterplot(x=datatrans[:, 0], y=datatrans[:, 1], hue=pred_kmeans,  palette="Set2", s=80)
#plt.scatter(datatrans[:,0],datatrans[:,1],c=pred_kmeans)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

In [None]:
# Ground truth
plt.figure()
class_codes, class_names = pd.factorize(classes) # converting class labels to integer codes (keeps `classes` intact so the cell can be re-run)
sns.scatterplot(x=datatrans[:, 0], y=datatrans[:, 1], hue=class_codes, palette="Set2", s=80)
#plt.scatter(datatrans[:,0],datatrans[:,1],c=class_codes)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
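
Besides comparing the two scatter plots by eye, agreement between cluster assignments and known classes can be quantified. One option, not used in this notebook, is the adjusted Rand index from scikit-learn, which ignores how the clusters happen to be numbered. A minimal sketch with made-up labels:

```python
from sklearn.metrics import adjusted_rand_score

# Made-up ground-truth classes and two hypothetical clusterings.
true_labels = ["a", "a", "a", "b", "b", "b"]
relabelled = [1, 1, 1, 0, 0, 0]   # same grouping, different cluster numbers
mixed = [0, 1, 0, 1, 0, 1]        # grouping unrelated to the classes

ari_perfect = adjusted_rand_score(true_labels, relabelled)
ari_mixed = adjusted_rand_score(true_labels, mixed)

print("ARI, identical grouping:", ari_perfect)
print("ARI, unrelated grouping:", ari_mixed)
```

A value of 1 means the grouping matches the classes exactly, while values near 0 (or below) indicate agreement no better than chance.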

**Agglomerative Clustering**

In [None]:
datatrans1=pd.DataFrame(datatrans) # To store the predictions along with the data.
datatrans1.head()
hierarchical_clustering = AgglomerativeClustering(n_clusters=7)
datatrans1["Hierarchical_Cluster"]= hierarchical_clustering.fit_predict(datatrans)

In [None]:

#plt.figure(figsize=(10, 5))
sns.scatterplot(x=datatrans[:, 0], y=datatrans[:, 1], hue=datatrans1["Hierarchical_Cluster"], palette="Set2", s=80)
plt.title("Hierarchical Clustering Visualization (PCA Reduced)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.legend(title="Cluster")
plt.show()

In [None]:
linkage_matrix_pca = linkage(datatrans, method='ward')  # Creating a linkage matrix to create a dendrogram
plt.figure(figsize=(10, 5))
dendrogram(linkage_matrix_pca, truncate_mode='level', p=5, leaf_rotation=90, leaf_font_size=10)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Sample Index")
plt.ylabel("Distance")
plt.show()
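
The linkage matrix behind a dendrogram can also be cut into a fixed number of flat clusters with `scipy.cluster.hierarchy.fcluster`. The sketch below does this on synthetic blobs; the data and the choice of 3 clusters are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs, 20 points each.
points = np.vstack([
    rng.normal(loc=centre, scale=0.3, size=(20, 2))
    for centre in [(0, 0), (5, 5), (0, 5)]
])

# Ward linkage: one row per merge, so n - 1 rows for n samples.
Z = linkage(points, method="ward")

# Cut the tree so that at most 3 flat clusters remain (labels start at 1).
flat_labels = fcluster(Z, t=3, criterion="maxclust")

print("Linkage matrix shape:", Z.shape)
print("Cluster sizes:", np.bincount(flat_labels)[1:])
```

`AgglomerativeClustering` in scikit-learn also uses Ward linkage by default, so a cut of this kind should correspond closely to its cluster assignments.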

**Performance comparison**

In [None]:
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
# Compute Metrics
# Silhouette score: ranges from -1 to 1; higher is better (points lie closer to their own cluster than to the nearest other cluster)
print("K-Means Silhouette Score:", silhouette_score(datatrans, pred_kmeans))
print("Agglomerative Silhouette Score:", silhouette_score(datatrans, datatrans1["Hierarchical_Cluster"].values))

# Davies-Bouldin index: lower is better (average similarity between each cluster and its most similar cluster)
print("K-Means Davies-Bouldin Index:", davies_bouldin_score(datatrans, pred_kmeans))
print("Agglomerative Davies-Bouldin Index:", davies_bouldin_score(datatrans, datatrans1["Hierarchical_Cluster"].values))

# Calinski-Harabasz index: higher is better (ratio of between-cluster dispersion to within-cluster dispersion)
print("K-Means Calinski-Harabasz Index:", calinski_harabasz_score(datatrans, pred_kmeans))
print("Agglomerative Calinski-Harabasz Index:", calinski_harabasz_score(datatrans, datatrans1["Hierarchical_Cluster"].values))
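
To see how the three indices behave, it can help to compare a sensible labelling with a deliberately poor one on data where the answer is known. The sketch below does this on synthetic blobs; `internal_metrics` is a hypothetical helper written for this example, not part of scikit-learn.

```python
import numpy as np
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def internal_metrics(points, labels):
    """Collect the three internal validation indices in one dictionary."""
    return {
        "silhouette": silhouette_score(points, labels),
        "davies_bouldin": davies_bouldin_score(points, labels),
        "calinski_harabasz": calinski_harabasz_score(points, labels),
    }

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs, 30 points each.
points = np.vstack([
    rng.normal(loc=centre, scale=0.3, size=(30, 2))
    for centre in [(0, 0), (5, 5), (0, 5)]
])
good_labels = np.repeat([0, 1, 2], 30)      # labels that follow the blobs
bad_labels = rng.permutation(good_labels)   # same labels, randomly shuffled

good = internal_metrics(points, good_labels)
bad = internal_metrics(points, bad_labels)
print("Blob-aligned labels:", good)
print("Shuffled labels:", bad)
```

All three indices use only the data and the labels, so they measure how compact and separated the clusters are, not whether they match the true classes.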