# Preprocessing and Training any of the 4 models in the Homework

<h1> 1. Clustering Model - KMeans </h1>

In [57]:
import sys
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.model_selection import train_test_split
import streamlit as st

### Get the absolute path of the current file

In [58]:
# Get the absolute path of the current file
current_file_path = Path('./cluster_k_means.ipynb').resolve()

# Get the directory of the current file
project_dir = current_file_path.parent

# Add the project directory to sys.path
sys.path.insert(0, str(project_dir))
from data.input_data_1 import DatasetCreator

### Step 1: Create Datasets

In [59]:
dataset_creator = DatasetCreator()
blob_dataset = dataset_creator.create_blob_dataset()
points_dataset = dataset_creator.create_points_dataset()
X_blob, y_blob = blob_dataset['X'], blob_dataset['y']
X_points = points_dataset['X']
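
The `DatasetCreator` class is imported from the project's `data.input_data_1` module, which is not shown in this notebook. The sketch below is a hypothetical stand-in, assuming the blob dataset comes from scikit-learn's `make_blobs` and the points dataset is a plain array of unlabeled 2-D points. The sample counts, value ranges, and constructor argument are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np
from sklearn.datasets import make_blobs


class DatasetCreator:
    """Hypothetical stand-in for data.input_data_1.DatasetCreator."""

    def __init__(self, random_state=42):
        self.random_state = random_state

    def create_blob_dataset(self, n_samples=500, centers=3):
        # make_blobs returns the features X and the index of the
        # generating blob y, which can serve as ground-truth labels.
        X, y = make_blobs(n_samples=n_samples, centers=centers,
                          random_state=self.random_state)
        return {'X': X, 'y': y}

    def create_points_dataset(self, n_samples=500):
        # Unlabeled points: only X is returned (assumed uniform in [0, 10)).
        rng = np.random.default_rng(self.random_state)
        X = rng.uniform(0.0, 10.0, size=(n_samples, 2))
        return {'X': X}


dataset_creator = DatasetCreator()
blob_dataset = dataset_creator.create_blob_dataset()
points_dataset = dataset_creator.create_points_dataset()
X_blob, y_blob = blob_dataset['X'], blob_dataset['y']
X_points = points_dataset['X']
print(X_blob.shape, y_blob.shape, X_points.shape)
```

Returning a dictionary with an optional `'y'` key mirrors how the notebook reads `blob_dataset['y']` but only `points_dataset['X']`.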

### Preprocessing: No train/test split is needed, because clustering is unsupervised and does not require labeled data

### Step 2: Train the model using the Blob dataset

### (a): Create and Fit KMeans and MiniBatchKMeans models on the blob dataset

In [60]:
kmeans = KMeans(n_clusters=3, random_state=42, n_init='auto')
kmeans = kmeans.fit(X_blob)  # KMeans.fit ignores y, so the labels are not passed


In [61]:
minibatch_kmeans = MiniBatchKMeans(n_clusters=3, random_state=42, batch_size=100, n_init='auto')
minibatch_kmeans = minibatch_kmeans.fit(X_blob)

### (b) Make predictions with blob dataset

In [62]:
kmeans_labels = kmeans.predict(X_blob)

minibatch_kmeans_labels = minibatch_kmeans.predict(X_blob)
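
Calling `predict` on the same data the model was fitted on assigns each sample to its nearest learned center. A fitted `KMeans` also stores its training assignments in the `labels_` attribute. The sketch below compares the two on assumed synthetic data (`X_demo` is a stand-in, not the notebook's `X_blob`).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed stand-in data: three well-separated synthetic blobs.
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5,
                       random_state=42)

# An explicit n_init is used here; the notebook passes n_init='auto'.
model = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_demo)

labels_from_fit = model.labels_              # assignments stored during fit
labels_from_predict = model.predict(X_demo)  # nearest-center assignment

same_labels = np.array_equal(labels_from_fit, labels_from_predict)
print("labels_ matches predict on the training data:", same_labels)
```

`predict` becomes necessary when labeling samples that were not part of the fit.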

### Step 3: Train the model using the Points dataset

### (a): Create and Fit KMeans and MiniBatchKMeans models on points dataset

In [63]:
kmeans_points = KMeans(n_clusters=3, random_state=42, n_init='auto')
kmeans_points = kmeans_points.fit(X_points)

minibatch_kmeans_points = MiniBatchKMeans(n_clusters=3, random_state=42, batch_size=100, n_init='auto')
minibatch_kmeans_points = minibatch_kmeans_points.fit(X_points)

### (b) Make predictions with the points dataset

In [64]:
kmeans_points_labels = kmeans_points.predict(X_points)

minibatch_kmeans_points_labels = minibatch_kmeans_points.predict(X_points)

### The starter code can be extended with alternative models that vary the model parameters. Only the number of clusters (3, 4, and 5) is varied below, because scikit-learn's KMeans and MiniBatchKMeans always use Euclidean distance and expose no distance-metric parameter. Each model's cluster centers are stored so that they can be visualized.


### Step 1: Create and Fit KMeans and MiniBatchKMeans models with different parameters

In [65]:

models = {
    'KMeans_3_clusters': KMeans(n_clusters=3, random_state=42, n_init='auto'),
    'KMeans_4_clusters': KMeans(n_clusters=4, random_state=42, n_init='auto'),
    'KMeans_5_clusters': KMeans(n_clusters=5, random_state=42, n_init='auto'),
    'MiniBatchKMeans_3_clusters': MiniBatchKMeans(n_clusters=3, random_state=42, n_init='auto', batch_size=100),
    'MiniBatchKMeans_4_clusters': MiniBatchKMeans(n_clusters=4, random_state=42, n_init='auto', batch_size=100),
    'MiniBatchKMeans_5_clusters': MiniBatchKMeans(n_clusters=5, random_state=42, n_init='auto', batch_size=100),
}




### Fit models on the blob dataset

In [66]:
labels_blob = {}
centers_blob = {}
for name, model in models.items():
    model.fit(X_blob)
    labels_blob[name] = model.predict(X_blob)
    centers_blob[name] = model.cluster_centers_
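
The stored centers can be drawn on top of the clustered samples. The following is a minimal sketch for a single model, assuming 2-D data. `X_demo` is a synthetic stand-in for `X_blob`, and the colors, marker, and axis names are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed stand-in for X_blob.
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# An explicit n_init is used here; the notebook passes n_init='auto'.
model = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_demo)
labels = model.labels_
centers = model.cluster_centers_

fig, ax = plt.subplots(figsize=(6, 5))
# Samples colored by assigned cluster.
ax.scatter(X_demo[:, 0], X_demo[:, 1], c=labels, s=15, cmap="viridis")
# Cluster centers as large red crosses.
ax.scatter(centers[:, 0], centers[:, 1], c="red", marker="X", s=200,
           label="Cluster centers")
ax.set_title("KMeans_3_clusters")
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
ax.legend()
```

In the notebook, the same plotting calls could be run inside a loop over `labels_blob` and `centers_blob`, one figure or subplot per model, followed by `plt.show()`.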

### Fit models on the points dataset


In [67]:
labels_points = {}
centers_points = {}
for name, model in models.items():
    model.fit(X_points)
    labels_points[name] = model.predict(X_points)
    centers_points[name] = model.cluster_centers_
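
Both loops call `fit` on the same estimator objects in `models`, so after the second loop each object holds only its points-dataset fit. The labels and centers saved in the dictionaries are unaffected, but the blob-fitted estimators themselves are gone. If they are needed later, one option is to fit a fresh copy per dataset with `sklearn.base.clone`, sketched here on assumed synthetic data (`X_a` and `X_b` are stand-ins for the two datasets).

```python
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed stand-ins for the blob and points datasets.
X_a, _ = make_blobs(n_samples=200, centers=3, random_state=0)
X_b, _ = make_blobs(n_samples=200, centers=3, random_state=1)

# An explicit n_init is used here; the notebook passes n_init='auto'.
template = KMeans(n_clusters=3, random_state=42, n_init=10)

# clone() copies the hyperparameters into a new, unfitted estimator,
# so each dataset gets its own independent fitted model.
model_a = clone(template).fit(X_a)
model_b = clone(template).fit(X_b)

print(model_a.cluster_centers_)
print(model_b.cluster_centers_)
```

The `template` estimator is never fitted, so it can be reused for any number of datasets.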

### The clustering models are evaluated with three quality metrics: Adjusted Rand Index, Calinski-Harabasz Index, and Davies-Bouldin Index. The following code computes these metrics for each model and dataset:

In [68]:
from sklearn.metrics import adjusted_rand_score, calinski_harabasz_score, davies_bouldin_score

### Step 1: Evaluate the models

In [69]:

def evaluate_clusters(X, labels, true_labels=None):
    scores = {}
    if true_labels is not None:
        scores['Adjusted Rand Index'] = adjusted_rand_score(true_labels, labels)
    scores['Calinski-Harabasz Index'] = calinski_harabasz_score(X, labels)
    scores['Davies-Bouldin Index'] = davies_bouldin_score(X, labels)
    return scores

### Evaluate models on the blob dataset

In [70]:

print("Blob Dataset Evaluation:")
for name, labels in labels_blob.items():
    scores = evaluate_clusters(X_blob, labels, y_blob)
    print(f"{name}: {scores}")

Blob Dataset Evaluation:
KMeans_3_clusters: {'Adjusted Rand Index': 0.02305205607874582, 'Calinski-Harabasz Index': 665.3162286236743, 'Davies-Bouldin Index': 0.964932679999536}
KMeans_4_clusters: {'Adjusted Rand Index': 0.003072481868613137, 'Calinski-Harabasz Index': 687.2696369798575, 'Davies-Bouldin Index': 0.9291562581565466}
KMeans_5_clusters: {'Adjusted Rand Index': 0.015456803812327361, 'Calinski-Harabasz Index': 669.4779116506213, 'Davies-Bouldin Index': 0.9039265960732601}
MiniBatchKMeans_3_clusters: {'Adjusted Rand Index': 0.033945225501128926, 'Calinski-Harabasz Index': 656.6051685971224, 'Davies-Bouldin Index': 0.9530500320732589}
MiniBatchKMeans_4_clusters: {'Adjusted Rand Index': 0.009373541444086213, 'Calinski-Harabasz Index': 680.4089703562004, 'Davies-Bouldin Index': 0.9200916465369746}
MiniBatchKMeans_5_clusters: {'Adjusted Rand Index': 0.018140739563712465, 'Calinski-Harabasz Index': 657.0336739996833, 'Davies-Bouldin Index': 0.9303666193689694}


### We will look for models with a higher Adjusted Rand Index (ARI, agreement with the true labels), a higher Calinski-Harabasz Index (CHI), and a lower Davies-Bouldin Index (DBI)
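
To make the ARI scale concrete, the sketch below uses made-up label arrays (not the notebook's data). ARI compares how samples are grouped, not the label names, so a clustering that matches the true grouping under renamed labels still scores 1. Labels assigned at random score close to 0.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Same grouping, different label names: a perfect match for ARI.
true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
renamed = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])
ari_renamed = adjusted_rand_score(true_labels, renamed)

# Two independent random labelings: agreement only by chance.
rng = np.random.default_rng(0)
big_true = rng.integers(0, 3, size=3000)
big_random = rng.integers(0, 3, size=3000)
ari_random = adjusted_rand_score(big_true, big_random)

print(f"ARI, renamed labels: {ari_renamed:.3f}")
print(f"ARI, random labels:  {ari_random:.3f}")
```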

### Analysis: 
#### Adjusted Rand Index (ARI): 
#### Highest ARI: MiniBatchKMeans_3_clusters (0.0339)

#### Calinski-Harabasz Index (CHI):
#### Highest CHI: KMeans_4_clusters (687.27)

#### Davies-Bouldin Index (DBI):
#### Lowest DBI: KMeans_5_clusters (0.9039)

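### Based on these metrics, KMeans with 4 clusters appears to be a reasonable choice for this dataset: it has the highest CHI (687.27), and its DBI (0.9292) is within about 0.03 of the lowest. All ARI values are close to 0, meaning none of the clusterings agrees with the true labels much better than chance, so ARI does not separate the candidates here.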
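
The per-metric winners can also be picked programmatically instead of by reading the printed output. The sketch below hard-codes the blob scores printed above, rounded to four significant decimals. The short keys `'ARI'`, `'CHI'`, and `'DBI'` and the `blob_scores` name are illustrative, not names used elsewhere in the notebook.

```python
# Blob-dataset scores copied from the printed evaluation output (rounded).
blob_scores = {
    'KMeans_3_clusters': {'ARI': 0.0231, 'CHI': 665.32, 'DBI': 0.9649},
    'KMeans_4_clusters': {'ARI': 0.0031, 'CHI': 687.27, 'DBI': 0.9292},
    'KMeans_5_clusters': {'ARI': 0.0155, 'CHI': 669.48, 'DBI': 0.9039},
    'MiniBatchKMeans_3_clusters': {'ARI': 0.0339, 'CHI': 656.61, 'DBI': 0.9531},
    'MiniBatchKMeans_4_clusters': {'ARI': 0.0094, 'CHI': 680.41, 'DBI': 0.9201},
    'MiniBatchKMeans_5_clusters': {'ARI': 0.0181, 'CHI': 657.03, 'DBI': 0.9304},
}

# Higher is better for ARI and CHI; lower is better for DBI.
best_ari = max(blob_scores, key=lambda name: blob_scores[name]['ARI'])
best_chi = max(blob_scores, key=lambda name: blob_scores[name]['CHI'])
best_dbi = min(blob_scores, key=lambda name: blob_scores[name]['DBI'])

print("Highest ARI:", best_ari)
print("Highest CHI:", best_chi)
print("Lowest DBI: ", best_dbi)
```

Because the three metrics pick three different models here, the final choice remains a judgment call rather than something a single `max` or `min` can settle.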

### Evaluate models on the points dataset

In [71]:

print("\nPoints Dataset Evaluation:")
for name, labels in labels_points.items():
    scores = evaluate_clusters(X_points, labels)
    print(f"{name}: {scores}")


Points Dataset Evaluation:
KMeans_3_clusters: {'Calinski-Harabasz Index': 279877.00629097293, 'Davies-Bouldin Index': 0.7209220394693859}
KMeans_4_clusters: {'Calinski-Harabasz Index': 351468.23896293604, 'Davies-Bouldin Index': 0.5366004839496465}
KMeans_5_clusters: {'Calinski-Harabasz Index': 375168.4520598905, 'Davies-Bouldin Index': 0.6435355835548625}
MiniBatchKMeans_3_clusters: {'Calinski-Harabasz Index': 252601.97984683796, 'Davies-Bouldin Index': 0.746665000249288}
MiniBatchKMeans_4_clusters: {'Calinski-Harabasz Index': 350114.5690228873, 'Davies-Bouldin Index': 0.5280279036533555}
MiniBatchKMeans_5_clusters: {'Calinski-Harabasz Index': 459362.7263493073, 'Davies-Bouldin Index': 0.4861995167455294}


### MiniBatchKMeans with 5 clusters appears to be the best-performing model on the points dataset: it has both the highest Calinski-Harabasz Index (459362.73) and the lowest Davies-Bouldin Index (0.4862) among all configurations listed. ARI is not reported because this dataset has no true labels.