# AOML Assignment 2
# Team Reshiram_119_909_912_920

### Prarthana Prakash Kini : PES1UG22AM119
### Tejas V Bhat : PES1UG22AM909
### Ayush Muralidharan : PES1UG22AM912
### Atharv Revankar : PES1UG22AM920

In [None]:
%pip install umap-learn

#### Importing necessary python modules and methods

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import umap
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture
from sklearn.preprocessing import Normalizer, StandardScaler

## Data Loading and analysis

In [None]:
data = pd.read_csv("/kaggle/input/aoml-assignment-2-clustering/data.csv")

In [None]:
data.head()

In [None]:
data.describe()

In [None]:
data.isnull().sum()

##### StandardScaler is applied to standardize the features, ensuring they have a mean of 0 and a standard deviation of 1
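
As a quick illustration of what standardization does, the sketch below scales a small synthetic table (an assumption standing in for the real features) and checks the resulting column means and standard deviations. The names `X_demo` and `scaler_demo` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature table (assumption: numeric columns only).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[10.0, -3.0, 250.0], scale=[2.0, 0.5, 40.0], size=(200, 3))

scaler_demo = StandardScaler()
X_scaled = scaler_demo.fit_transform(X_demo)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for .std().
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print("means:", np.round(col_means, 6))
print("stds: ", np.round(col_stds, 6))
```

Each column ends up on the same scale, so no single feature dominates distance-based methods simply because of its units.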



In [None]:
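unique_ids = data['id']
# Drop the identifier so it is not scaled and clustered as if it were a feature
features = data.drop(columns=['id'])

scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)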

In [None]:
features

#### As we have 5 features, we perform dimensionality reduction for effective clustering and visualisation.
#### We experimented with 3 dimensionality reduction techniques, one linear (PCA) and two non-linear (t-SNE and UMAP):

### 1. Dimensionality Reduction with PCA
A PCA model is created to reduce the feature set to 2 principal components, the two orthogonal directions that capture the most variance in the scaled data.
- n_components=2: Keeps 2 principal components.
- tol=0.01: Tolerance for the computed singular values; scikit-learn uses it only when svd_solver='arpack'.
- random_state=5: Fixes the seed used by the 'arpack' and 'randomized' solvers so that results are reproducible.
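
To judge how much information a 2-component projection keeps, one can inspect `explained_variance_ratio_`. The sketch below does this on synthetic 5-feature data (an assumption, not the assignment data); `pca_demo`, `latent`, and `mixing` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 5-feature data driven by 2 latent factors plus noise (assumption).
rng = np.random.default_rng(5)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(300, 5))

X_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2, random_state=5)
X_2d = pca_demo.fit_transform(X_scaled)

# Fraction of total variance captured by each retained component.
explained = pca_demo.explained_variance_ratio_
print("Variance explained per component:", explained)
print("Total variance retained:", explained.sum())
```

A total close to 1 suggests the 2D projection loses little; a low total suggests that linear reduction alone discards a lot of structure.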

In [None]:
pca = PCA(n_components=2, 
          tol=0.01, 
          random_state=5
)
pca_features = pca.fit_transform(scaled_features)

### 2. Dimensionality Reduction with t-SNE

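- TSNE is configured with n_components=2 to reduce the data to 2 dimensions for visualization.
- perplexity=300 controls the balance between local and global data structure. It acts roughly as the number of effective neighbours per point and must be smaller than the number of samples.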
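
A minimal sketch of these settings on a small synthetic dataset (an assumption): the perplexity is capped below the sample count, and a `random_state` is fixed because t-SNE is stochastic. The fixed seed and the name `tsne_demo` are illustrative choices, not part of our pipeline.

```python
import numpy as np
from sklearn.manifold import TSNE

# Three synthetic groups in 5 dimensions (assumption, for illustration only).
rng = np.random.default_rng(0)
X_demo = np.vstack(
    [rng.normal(loc=c, scale=0.3, size=(40, 5)) for c in (0.0, 3.0, 6.0)]
)

# scikit-learn requires perplexity < n_samples, so cap it defensively.
perplexity = min(30, X_demo.shape[0] - 1)

tsne_demo = TSNE(n_components=2, perplexity=perplexity, random_state=0)
embedding = tsne_demo.fit_transform(X_demo)
print(embedding.shape)
```

Without a fixed seed, repeated runs can give different embeddings and therefore different cluster labels downstream.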

In [None]:
tsne = TSNE(n_components=2, perplexity=300)

In [None]:
tsne_features = tsne.fit_transform(scaled_features)

### 3. Dimensionality Reduction with UMAP

- n_components=2: Reduces the data to 2 dimensions for visualization and clustering.
- n_neighbors=15: Size of the local neighbourhood used to build the graph. Smaller values emphasise local structure, larger values emphasise global structure.
- min_dist=0.1: Controls how closely points are packed together in the embedding. A smaller value allows points to be closer, creating tighter clusters.
- learning_rate=0.01: Sets the initial learning rate for the embedding optimization.
- random_state=42: Ensures reproducibility of the results by using a fixed seed.

In [None]:
umap_reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, learning_rate=0.01, random_state=42)
umap_features = umap_reducer.fit_transform(scaled_features)

#### We chose t-SNE because it gave the best clustering performance in our experiments.

## Spectral Clustering with RBF Kernel

Spectral clustering is a technique that uses the leading eigenvectors of a graph Laplacian built from a similarity matrix (in this case, the RBF kernel) to embed the data in a low-dimensional space before applying a clustering algorithm, such as KMeans, to that embedding. It is particularly useful for clusters that are not linearly separable.

 Compute RBF Kernel:
   - gamma_value: The gamma parameter controls how quickly similarity decays with distance in the RBF kernel; a larger value makes each point's influence more local. It is set to 1/2 here.
   - rbf_kernel: Computes the RBF (Radial Basis Function) kernel matrix on the t-SNE-reduced features (tsne_features), capturing pairwise similarities.

 Apply Spectral Clustering:
   - n_clusters: Specifies the desired number of clusters.
   - affinity: Uses the precomputed RBF kernel matrix as the similarity measure for clustering.
   - assign_labels: Labels are assigned to clusters using the KMeans algorithm.
   - random_state: Ensures reproducibility by setting a fixed seed.
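
The two steps above can be sketched on synthetic blobs (an assumption standing in for the embedded features). The kernel entry for points i and j is exp(-gamma * ||x_i - x_j||^2), so the matrix is symmetric with ones on the diagonal. `X_demo`, `affinity_demo`, and `model_demo` are hypothetical names.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

# Synthetic 2D data with 3 groups (assumption, for illustration only).
X_demo, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)

# Pairwise similarities: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
gamma = 0.5
affinity_demo = rbf_kernel(X_demo, gamma=gamma)

# 'precomputed' tells SpectralClustering that the input is already an affinity matrix.
model_demo = SpectralClustering(
    n_clusters=3,
    affinity="precomputed",
    assign_labels="kmeans",
    random_state=42,
)
labels_demo = model_demo.fit_predict(affinity_demo)
print("Cluster sizes:", np.bincount(labels_demo))
```

Note that the affinity matrix has one row and one column per sample, so memory grows quadratically with the dataset size.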



In [None]:
gamma_value = 1/2 
affinity_matrix = rbf_kernel(tsne_features, gamma=gamma_value)

spectral_clustering = SpectralClustering(
    n_clusters=6,  
    affinity='precomputed',  
    assign_labels='kmeans', 
    random_state=42
)
spectral_labels = spectral_clustering.fit_predict(affinity_matrix)


## Evaluate Clustering Quality:
   - Silhouette Score: Measures how similar each point is to its own cluster compared to other clusters. Higher values indicate better clustering.
   - Davies-Bouldin Score: A lower value indicates better clustering. This metric evaluates both intra-cluster similarity and inter-cluster separation.
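
To see how the two metrics behave, the sketch below scores K-Means on synthetic blobs (an assumption) for several cluster counts. The range of `k` values and the name `scores` are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic data generated with 4 groups (assumption, for illustration only).
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    # Silhouette lies in [-1, 1] (higher is better); Davies-Bouldin is >= 0 (lower is better).
    scores[k] = (silhouette_score(X_demo, labels), davies_bouldin_score(X_demo, labels))

for k, (sil, dbi) in scores.items():
    print(f"k={k}: silhouette={sil:.3f}, davies_bouldin={dbi:.3f}")
```

Both metrics are computed in whatever space is passed in, so scores are only comparable between models evaluated on the same features.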



In [None]:
silhouette_spectral = silhouette_score(tsne_features, spectral_labels)
davies_bouldin_spectral = davies_bouldin_score(tsne_features, spectral_labels)
print(f"Spectral Silhouette Score: {silhouette_spectral}")
print(f"Spectral Davies-Bouldin Score: {davies_bouldin_spectral}")

In [None]:
sns.scatterplot(x=tsne_features[:, 0], y=tsne_features[:, 1], hue=spectral_labels, palette='Set2')
plt.title("TSNE Spectral Cluster Visualization")
plt.xlabel("TSNE Component 1")
plt.ylabel("TSNE Component 2")
plt.show()

## K-Means Clustering

K-Means is a popular clustering algorithm that partitions the dataset into a specified number of clusters by iteratively assigning each point to its nearest centroid and then recalculating each centroid as the mean of its assigned points, until the assignments stop changing.

 *Initialize K-Means*:
   - *n_clusters*: Specifies the number of clusters to form, set to 6 as per the requirement.
   - *init*: Uses the K-Means++ method for initialization, which improves convergence by choosing initial centroids that are distant from each other.
   - *n_init*: The algorithm runs 10 times with different centroid seeds, and the best output in terms of inertia is selected.
   - *max_iter*: Sets the maximum number of iterations for a single run, ensuring sufficient iterations for convergence.
   - *random_state*: Ensures reproducibility by setting a fixed seed.

 *Apply K-Means*:
   - *fit_predict*: Computes K-Means clustering on the t-SNE-reduced features (tsne_features) and assigns cluster labels (kmeans_labels) to each point.
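
The assign-and-update loop behind K-Means can be sketched in plain NumPy. This is a simplified illustration that uses random initial centroids rather than K-Means++, and it is not scikit-learn's implementation; `lloyd_kmeans` is a hypothetical helper.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Simplified K-Means (Lloyd's algorithm) with random initial centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Synthetic 2D data with 3 well-separated groups (assumption).
rng = np.random.default_rng(1)
X_demo = np.vstack(
    [rng.normal(loc=c, scale=0.2, size=(50, 2)) for c in ((0, 0), (5, 5), (0, 5))]
)
labels_demo, centroids_demo = lloyd_kmeans(X_demo, k=3)
print(np.bincount(labels_demo, minlength=3))
```

Because the outcome depends on the starting centroids, scikit-learn's `n_init` reruns the procedure from several starts and keeps the run with the lowest inertia.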

In [None]:
kmeans = KMeans(n_clusters=6, 
                init='k-means++',
                n_init=10, 
                max_iter=500, 
                random_state=42
)

kmeans_labels = kmeans.fit_predict(tsne_features)

In [None]:
silhouette_kmeans = silhouette_score(tsne_features, kmeans_labels)
davies_bouldin_kmeans = davies_bouldin_score(tsne_features, kmeans_labels)
print(f"K-Means Silhouette Score: {silhouette_kmeans}")
print(f"K-Means Davies-Bouldin Score: {davies_bouldin_kmeans}")

In [None]:
sns.scatterplot(x=tsne_features[:, 0], y=tsne_features[:, 1], hue=kmeans_labels, palette='Set2')
plt.title("TSNE K-Means Cluster Visualization")
plt.xlabel("TSNE Component 1")
plt.ylabel("TSNE Component 2")
plt.show()

## Bayesian Gaussian Mixture (BGM) Clustering

Bayesian Gaussian Mixture (BGM) is a probabilistic clustering method that extends the Gaussian Mixture Model (GMM) by placing prior distributions on the mixture parameters. These priors act as a regularizer and can push the weights of unneeded components towards zero, so BGM can effectively use fewer clusters than the n_components upper bound instead of requiring the exact number in advance.

Initialize Bayesian Gaussian Mixture:
   - n_components : Specifies the maximum number of clusters to be detected.
   - covariance_type : Uses a full covariance matrix for each cluster, allowing it to model elliptical clusters of any size and orientation.
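
The sketch below fits a BGM with an upper bound of 6 components to synthetic data generated from 3 groups (an assumption) and prints the fitted mixture weights, which shows how many components the model actually relies on. The fixed `random_state`, `max_iter=500`, and the name `bgm_demo` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

# Synthetic data with 3 groups (assumption, for illustration only).
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

bgm_demo = BayesianGaussianMixture(
    n_components=6,          # upper bound on the number of components
    covariance_type="full",
    max_iter=500,
    random_state=0,
)
labels_demo = bgm_demo.fit_predict(X_demo)

# Components with near-zero weight are effectively unused.
print("Mixture weights:", np.round(bgm_demo.weights_, 3))
print("Components with assigned points:", np.unique(labels_demo).size)
```

Setting `random_state` matters here as well, since the fit starts from a random initialization.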

In [None]:
bgm = BayesianGaussianMixture(
    n_components=6,                     
    covariance_type='full',               
)
bgm_labels = bgm.fit_predict(tsne_features)

silhouette_bgm = silhouette_score(tsne_features, bgm_labels)
davies_bouldin_bgm = davies_bouldin_score(tsne_features, bgm_labels)
print(f"BayesianGaussianMixture Silhouette Score: {silhouette_bgm}")
print(f"BayesianGaussianMixture Davies-Bouldin Score: {davies_bouldin_bgm}")


In [None]:
sns.scatterplot(x=tsne_features[:, 0], y=tsne_features[:, 1], hue=bgm_labels, palette='Set2')
plt.title("TSNE BGM Cluster Visualization")
plt.xlabel("TSNE Component 1")
plt.ylabel("TSNE Component 2")
plt.show()

## Gaussian Mixture Model (GMM) Clustering

Gaussian Mixture Model (GMM) is a probabilistic clustering algorithm that assumes data points are generated from a mixture of several Gaussian distributions. It models the data as a combination of multiple Gaussian components, making it effective for capturing complex cluster shapes.

Initialize Gaussian Mixture Model:
   - n_components=6: Specifies the number of Gaussian components (clusters) to fit in the data, set to 6.
   - covariance_type='full': Uses a full covariance matrix for each component, allowing for elliptical clusters of any size and orientation.
   - random_state=42: Ensures reproducibility by setting a fixed seed.
   - means_init: The component means are initialized from the K-Means cluster centers (kmeans.cluster_centers_) computed earlier.
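
A minimal sketch of this setup on synthetic blobs (an assumption): the GMM is seeded with K-Means centroids through the `means_init` constructor argument, and `predict_proba` exposes the soft assignments that distinguish a GMM from K-Means. `km_demo` and `gmm_demo` are hypothetical names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic 2D data with 3 groups (assumption, for illustration only).
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=42)

# Use K-Means centroids as starting means for the Gaussian components.
km_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)
gmm_demo = GaussianMixture(
    n_components=3,
    covariance_type="full",
    means_init=km_demo.cluster_centers_,
    random_state=42,
)
gmm_demo.fit(X_demo)

hard_labels = gmm_demo.predict(X_demo)       # most likely component per point
soft_probs = gmm_demo.predict_proba(X_demo)  # membership probability per component
print(hard_labels[:5])
print(np.round(soft_probs[:5], 3))
```

Each row of `soft_probs` sums to 1, so points near a boundary can be identified by rows without a dominant entry.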

In [None]:
# Initialize the component means from the K-Means centroids fitted above
gmm = GaussianMixture(
    n_components=6,
    covariance_type='full',
    means_init=kmeans.cluster_centers_,
    random_state=42
)
gmm.fit(tsne_features)
gmm_labels = gmm.predict(tsne_features)

In [None]:
silhouette_gmm = silhouette_score(tsne_features, gmm_labels)
davies_bouldin_gmm = davies_bouldin_score(tsne_features, gmm_labels)
print(f"GMM Silhouette Score: {silhouette_gmm}")
print(f"GMM Davies-Bouldin Score: {davies_bouldin_gmm}")

In [None]:
sns.scatterplot(x=tsne_features[:, 0], y=tsne_features[:, 1], hue=gmm_labels, palette='Set2')
plt.title("TSNE GMM Cluster Visualization")
plt.xlabel("TSNE Component 1")
plt.ylabel("TSNE Component 2")
plt.show()

We chose the Gaussian Mixture Model as our primary clustering model because it gave the highest **Silhouette score** and the lowest **Davies-Bouldin score** of the four methods.
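
A comparison like this can be made explicit by collecting both metrics for every candidate and selecting the best entry programmatically. The sketch below uses synthetic data and only two candidate models (both assumptions); `candidate_labels` and `results` are hypothetical names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the embedded features (assumption, not the assignment data).
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=7)

# Any array of cluster labels can be added to this dictionary.
candidate_labels = {
    "K-Means": KMeans(n_clusters=4, n_init=10, random_state=7).fit_predict(X_demo),
    "GMM": GaussianMixture(n_components=4, random_state=7).fit_predict(X_demo),
}

results = {
    name: {
        "silhouette": silhouette_score(X_demo, labels),
        "davies_bouldin": davies_bouldin_score(X_demo, labels),
    }
    for name, labels in candidate_labels.items()
}

# Higher silhouette is better; lower Davies-Bouldin is better.
best_by_silhouette = max(results, key=lambda name: results[name]["silhouette"])
best_by_davies_bouldin = min(results, key=lambda name: results[name]["davies_bouldin"])

for name, metrics in results.items():
    print(f"{name}: {metrics}")
print("Best by silhouette:", best_by_silhouette)
print("Best by Davies-Bouldin:", best_by_davies_bouldin)
```

When the two metrics disagree on the winner, the choice needs an additional criterion, such as visual inspection of the scatter plots.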

In [None]:
submission = pd.DataFrame({
    'id': unique_ids,
    'Cluster': gmm_labels  
})
submission.to_csv("submission.csv", index=False)
print("Submission file created successfully.")