<a href="https://colab.research.google.com/github/master1223347/Assorted-ML-Projects/blob/main/Notebook6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Update on November 25th, 2025*

Due to the OpenML servers undergoing maintenance during the Thanksgiving weekend, the MNIST dataset will need to be downloaded via the Keras library for the time being. Both options are shown in the data-loading section below.

## Instructions

For this assignment, you will use the MNIST dataset to apply clustering methods to discover patterns in data, compare their performance using Silhouette scores, and investigate how dimensionality reduction affects clustering results.

**Do not delete any instructor-provided cells from this Notebook.** If you accidentally delete a cell, you can either undo the action or load a copy of the original assignment Notebook in a new browser tab and copy over the missing cells.

**You can add cells to this Notebook.** To add a markdown (text) cell, hover your cursor beneath the cell where you want to insert and click the "+Text" button. To add a Python (code) cell, click the "+Code" button.

### Steps
- Prepare the data
  - Truncate (see below)
  - Perform scaling
- Build clustering models
  - First without dimensionality reduction
  - Second with Principal Component Analysis
- Evaluate the results
  - Using the Adjusted Rand Index (see below)
  - Using Silhouette scores
  - Visualize pairs of principal components using the provided plotting code
- Report your findings

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import cluster
from sklearn import datasets
from sklearn import decomposition
from sklearn import metrics
from sklearn import model_selection
from sklearn import preprocessing

In [None]:
seed = 42

## Revisiting the MNIST Dataset

The MNIST dataset contains 70,000 grayscale images of handwritten digits (0-9). Each image is 28x28 pixels, resulting in 784 features. For computational efficiency, we will use a subset of the data.
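
The 28x28 to 784 flattening described above can be sketched on random stand-in images (the `images` array here is an assumption for illustration, not real MNIST data). Flattening is row-major, so pixel `(row, col)` lands at column `row * 28 + col`, and the operation is fully reversible.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two random stand-in "images" with MNIST's shape and dtype (not real digits).
images = rng.integers(0, 256, size=(2, 28, 28), dtype=np.uint8)

# Flatten each 28x28 image into one row of 784 features.
flat = images.reshape(images.shape[0], -1)

# Reshaping back recovers the original images exactly.
restored = flat.reshape(-1, 28, 28)

print(flat.shape)
```

Keeping the reverse reshape in mind is useful later if you want to display a cluster centroid as an image.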

### Load the Data

#### Downloading with the Keras Interface

In [None]:
from tensorflow import keras

(X_train_keras, y_train_keras), (X_test_keras, y_test_keras) = keras.datasets.mnist.load_data()

X = np.concatenate((X_train_keras, X_test_keras), axis=0)
y = np.concatenate((y_train_keras, y_test_keras), axis=0)

X = X.reshape(X.shape[0], -1)

X.shape, y.shape

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
[1m11490434/11490434[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 0us/step


((70000, 784), (70000,))

#### Downloading with the scikit-learn Interface (via OpenML)

In [None]:
bunch_obj = datasets.fetch_openml("mnist_784", as_frame=False)
X, y = bunch_obj.data, bunch_obj.target
X.shape, y.shape

((70000, 784), (70000,))

### Truncate the Data

For expediency, you do not need to use the entire dataset. You can increase the sample size (number of rows) below, but do not decrease it.

In [None]:
sample_size = 5000

In [None]:
X_trunc, y_trunc = X[:sample_size], y[:sample_size]
X_trunc.shape, y_trunc.shape

((5000, 784), (5000,))

### Preprocessing

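Standardize each feature to zero mean and unit variance. The clustering algorithms used here are distance-based, so unscaled features with large ranges would otherwise dominate the distance computations.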

In [None]:
scaler = preprocessing.StandardScaler()
X_scaled = scaler.fit_transform(X_trunc)
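
One detail worth knowing for image data: some pixels (for example, border pixels) may take the same value in every image. The sketch below uses a tiny made-up array, `toy`, to show how `StandardScaler` treats such a constant column: it leaves it at zero instead of dividing by a zero standard deviation.

```python
import numpy as np
from sklearn import preprocessing

# Toy data (hypothetical): column 0 is constant, column 1 varies.
toy = np.array([[0.0, 10.0],
                [0.0, 20.0],
                [0.0, 30.0]])

toy_scaler = preprocessing.StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy)

# The constant column stays at 0; its scale factor is set to 1 to avoid dividing by zero.
print(toy_scaled[:, 0], toy_scaler.scale_[0])

# The varying column ends up with mean 0 and standard deviation 1.
print(toy_scaled[:, 1].mean(), toy_scaled[:, 1].std())
```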

### Clustering with Original Features

Train each of the 3 models:
- K-Means Clustering
- DBSCAN
- Hierarchical Agglomerative Clustering

For models that require specifying a number of clusters, recall that there are 10 classes in this dataset.

#### Adjusted Rand Index

As an additional metric to evaluate clusters, code to compute the Adjusted Rand Index has been provided. This metric compares the true labels (which we have for MNIST) against the clusters created by each algorithm. Values range from -0.5 to 1.0: a value of 1.0 means the clusters match the true labels exactly (up to a renaming of the cluster IDs), and values near 0 indicate agreement no better than chance.
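
As a small illustration of how this metric behaves, the sketch below scores two made-up labelings against a made-up ground truth. The label lists are assumptions chosen for the example; only `metrics.adjusted_rand_score` comes from the notebook.

```python
from sklearn import metrics

true_labels = [0, 0, 1, 1, 2, 2]

# Same grouping as true_labels, but with the cluster IDs renamed.
renamed = [2, 2, 0, 0, 1, 1]

# Every point assigned the same label, which says nothing about the classes.
single_cluster = [0, 0, 0, 0, 0, 0]

ari_renamed = metrics.adjusted_rand_score(true_labels, renamed)
ari_single = metrics.adjusted_rand_score(true_labels, single_cluster)

print(ari_renamed)  # 1.0: cluster IDs are arbitrary, only the grouping matters
print(ari_single)   # 0.0: no agreement beyond chance
```

The second case is worth remembering when reading the results below: an algorithm that gives every sample the same label scores exactly 0.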

#### K-Means

In [None]:
kmeans = cluster.KMeans(n_clusters=10, random_state=seed, n_init='auto')
kmeans.fit(X_scaled)
labels_kmeans = kmeans.labels_

In [None]:
metrics.adjusted_rand_score(y_trunc, labels_kmeans)

0.28549263781143597

#### DBSCAN

In [None]:
dbscan = cluster.DBSCAN(eps=0.5, min_samples=5)  # Starting values only; eps and min_samples need tuning for 784-dimensional data
labels_dbscan = dbscan.fit_predict(X_scaled)

In [None]:
metrics.adjusted_rand_score(y_trunc, labels_dbscan)

0.0
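
A score of exactly 0.0 is what you get when every sample receives the same label, which for DBSCAN usually means everything was marked as noise (`-1`) because `eps` is too small. One common way to pick a starting `eps` is to look at the distance from each point to its `min_samples`-th nearest neighbor. The sketch below applies that idea to synthetic blobs standing in for `X_scaled`; the 90th-percentile cutoff is an arbitrary assumption for illustration, and in practice you would inspect the sorted distances for an elbow.

```python
import numpy as np
from sklearn import cluster, datasets, neighbors, preprocessing

# Synthetic stand-in for the scaled MNIST matrix (an assumption, not real data).
X_demo, _ = datasets.make_blobs(n_samples=300, centers=3, n_features=50, random_state=42)
X_demo = preprocessing.StandardScaler().fit_transform(X_demo)

min_samples = 5

# Distance from each point to its min_samples-th nearest neighbor.
# The point itself counts as the first neighbor, matching DBSCAN's definition.
nn = neighbors.NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_distances = np.sort(distances[:, -1])

# Arbitrary heuristic: choose eps so that about 90% of points qualify as core points.
eps_guess = float(np.percentile(k_distances, 90))

labels_small = cluster.DBSCAN(eps=0.5, min_samples=min_samples).fit_predict(X_demo)
labels_tuned = cluster.DBSCAN(eps=eps_guess, min_samples=min_samples).fit_predict(X_demo)

# Fraction of points labeled as noise under each setting.
noise_small = np.mean(labels_small == -1)
noise_tuned = np.mean(labels_tuned == -1)
print(f"eps=0.5: {noise_small:.0%} noise; eps={eps_guess:.2f}: {noise_tuned:.0%} noise")
```

To try this on the assignment data, replace `X_demo` with `X_scaled` and plot `k_distances` to choose the cutoff by eye.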

#### Agglomerative

In [None]:
agglomerative = cluster.AgglomerativeClustering(n_clusters=10)
labels_agglom = agglomerative.fit_predict(X_scaled)

In [None]:
metrics.adjusted_rand_score(y_trunc, labels_agglom)

0.40206863188272335

### Compare Silhouette Scores

In [None]:
print(metrics.silhouette_score(X_scaled, labels_kmeans))

if len(set(labels_dbscan)) > 1:
    non_noise = labels_dbscan != -1
    if non_noise.any() and len(set(labels_dbscan[non_noise])) > 1:
        print(metrics.silhouette_score(X_scaled[non_noise], labels_dbscan[non_noise]))
    else:
        print("DBSCAN Score: N/A")
else:
    print("DBSCAN Score: N/A")

print(metrics.silhouette_score(X_scaled, labels_agglom))

0.017585351836207713
DBSCAN Score: N/A
-0.02786196628582733


## Clustering with Principal Component Analysis

Principal Component Analysis reduces the dimensionality of the data while preserving as much variance as possible.

For this exercise, rather than fixing the number of principal components, set `n_components` to a fraction (0.95) so that PCA keeps just enough components to retain 95% of the data's variance.

In [None]:
pca = decomposition.PCA(n_components=0.95, random_state=seed)
X_pca = pca.fit_transform(X_scaled)
X_pca.shape

(5000, 264)
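
To see how a fractional `n_components` translates into a component count, the sketch below fits PCA on synthetic data (an assumption for illustration: 20 features driven by 3 strong latent directions plus a little noise) and checks the cumulative explained variance.

```python
import numpy as np
from sklearn import decomposition

rng = np.random.default_rng(42)

# Synthetic data: 3 strong latent directions mixed into 20 features, plus small noise.
latent = rng.normal(size=(500, 3)) * np.array([10.0, 5.0, 2.0])
mixing = rng.normal(size=(3, 20))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 20))

pca_demo = decomposition.PCA(n_components=0.95, random_state=42)
X_reduced = pca_demo.fit_transform(X_demo)

# Running total of the variance explained by the components that were kept.
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

print("components kept:", pca_demo.n_components_)
print("variance retained:", cumulative[-1])
```

The same two attributes, `n_components_` and `explained_variance_ratio_`, are available on the notebook's `pca` object after fitting.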

### Train Models

#### K-Means

In [None]:
kmeans_pca = cluster.KMeans(n_clusters=10, random_state=seed, n_init='auto')
kmeans_pca.fit(X_pca)
labels_kmeans_pca = kmeans_pca.labels_

In [None]:
metrics.adjusted_rand_score(y_trunc, labels_kmeans_pca)

0.2543995343611312

#### DBSCAN

In [None]:
dbscan_pca = cluster.DBSCAN(eps=0.5, min_samples=5)
labels_dbscan_pca = dbscan_pca.fit_predict(X_pca)

In [None]:
metrics.adjusted_rand_score(y_trunc, labels_dbscan_pca)

0.0

#### Agglomerative

In [None]:
agglomerative_pca = cluster.AgglomerativeClustering(n_clusters=10)
labels_agglom_pca = agglomerative_pca.fit_predict(X_pca)

In [None]:
metrics.adjusted_rand_score(y_trunc, labels_agglom_pca)

0.4221818282201933

### Compare Silhouette Scores
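
One possible way to fill in this comparison is to wrap the noise handling from the earlier Silhouette cell in a helper. The function name `silhouette_excluding_noise` is hypothetical, and the demo below runs on synthetic blobs so that it is self-contained; treat it as a sketch, not the required solution.

```python
import numpy as np
from sklearn import cluster, datasets, metrics

def silhouette_excluding_noise(features, labels):
    """Return the Silhouette score over non-noise points, or None if it is undefined."""
    labels = np.asarray(labels)
    keep = labels != -1  # DBSCAN marks noise points with -1
    n_clusters = len(set(labels[keep]))
    # silhouette_score needs at least 2 clusters and fewer clusters than samples.
    if n_clusters < 2 or n_clusters >= keep.sum():
        return None
    return metrics.silhouette_score(features[keep], labels[keep])

# Synthetic demo data (an assumption for illustration).
X_demo, _ = datasets.make_blobs(n_samples=200, centers=3, random_state=42)
labels_demo = cluster.KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_demo)
all_noise = np.full(200, -1)

score_kmeans = silhouette_excluding_noise(X_demo, labels_demo)
score_noise = silhouette_excluding_noise(X_demo, all_noise)

print("K-Means:", score_kmeans)
print("All noise:", score_noise)
```

In this notebook the helper would be called with `X_pca` and each of `labels_kmeans_pca`, `labels_dbscan_pca`, and `labels_agglom_pca`.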

## Visualize Pairs of Principal Components

Try a few different pairs of principal components and see if you can find a pair that appears to have captured the classes well.

In [None]:
# Change these variables to visualize different principal component pairs.
pc_x = 0
pc_y = 1

In [None]:
label_map = {
    "K-Means": labels_kmeans_pca,
    "DBSCAN": labels_dbscan_pca,
    "Agglomerative": labels_agglom_pca
}

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, (title, labels) in enumerate(label_map.items()):
    axes[i].scatter(
        X_pca[:, pc_x],
        X_pca[:, pc_y],
        c=labels,
        cmap="viridis",
        s=50,
        alpha=0.7
    )
    axes[i].set_title(f"{title} Clustering (PCA)")
    axes[i].set_xlabel(f"PC {pc_x}")
    axes[i].set_ylabel(f"PC {pc_y}")

## Final Reflection

- Which clustering algorithm appears to have performed the best?
  - Is there any tuning you tried that might have influenced this outcome?
- Based on the Adjusted Rand Index and Silhouette scores, did PCA appear to help here?
  - Does trying different numbers of principal components or required variance change the outcome?
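
For the last question, a simple sweep over variance thresholds can make the comparison concrete. The sketch below uses synthetic blobs in place of MNIST, and the thresholds are arbitrary assumptions; results on the real data may look quite different.

```python
from sklearn import cluster, datasets, decomposition, metrics, preprocessing

# Synthetic stand-in data with known labels (an assumption for illustration).
X_demo, y_demo = datasets.make_blobs(n_samples=300, centers=4, n_features=30, random_state=42)
X_demo = preprocessing.StandardScaler().fit_transform(X_demo)

results = []
for variance in (0.80, 0.90, 0.95, 0.99):
    # Keep just enough components to retain the requested fraction of variance.
    reduced = decomposition.PCA(n_components=variance, random_state=42).fit_transform(X_demo)
    labels = cluster.KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(reduced)
    ari = metrics.adjusted_rand_score(y_demo, labels)
    results.append((variance, reduced.shape[1], ari))

for variance, n_components, ari in results:
    print(f"variance={variance:.2f}  components={n_components:3d}  ARI={ari:.3f}")
```

Swapping in `X_scaled`, `y_trunc`, and `n_clusters=10` would run the same sweep on the assignment data.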

