# **Assignment 2: K-Means & DBSCAN Clustering**

**Step 1: Retrieve and Load the Olivetti Faces Dataset**

In [1]:
from sklearn.datasets import fetch_olivetti_faces

# Load the Olivetti faces dataset
data = fetch_olivetti_faces(shuffle=True, random_state=42)
X = data.data
y = data.target

print("Data shape:", X.shape)
print("Number of unique individuals:", len(set(y)))


downloading Olivetti faces from https://ndownloader.figshare.com/files/5976027 to /root/scikit_learn_data
Data shape: (400, 4096)
Number of unique individuals: 40


**Step 2: Split the Dataset into Training, Validation, and Test Sets**

In [2]:
from sklearn.model_selection import train_test_split

# Split the data into 60% training, 20% validation, and 20% testing
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)

print("Training set shape:", X_train.shape)
print("Validation set shape:", X_val.shape)
print("Test set shape:", X_test.shape)


Training set shape: (240, 4096)
Validation set shape: (80, 4096)
Test set shape: (80, 4096)
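
Because both splits pass `stratify`, each of the 40 individuals should end up with 6 training, 2 validation, and 2 test images. A minimal sketch of how to check this, using synthetic labels with the same class layout (40 classes, 10 samples each) in place of the real dataset; the `*_demo` names and the placeholder feature matrix are assumptions made only to keep the example self-contained:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Olivetti labels: 40 people, 10 images each.
y_demo = np.repeat(np.arange(40), 10)
X_demo = np.zeros((y_demo.size, 1))  # placeholder features, not real images

# Same 60/20/20 stratified split as in the cell above.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=42)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

# Count how many samples of each class landed in each split.
train_counts = np.bincount(y_tr, minlength=40)
val_counts = np.bincount(y_va, minlength=40)
test_counts = np.bincount(y_te, minlength=40)

print("Per-class counts (train):", set(train_counts.tolist()))
print("Per-class counts (val):  ", set(val_counts.tolist()))
print("Per-class counts (test): ", set(test_counts.tolist()))
```

Running the same `np.bincount` check on `y_train`, `y_val`, and `y_test` from the notebook would confirm the balance on the real data.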


**Step 3: Train a Classifier Using k-Fold Cross-Validation**

In [3]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Initialize an SVM classifier
svm_clf = SVC(kernel='linear', random_state=42)

# Perform 5-fold cross-validation on the training set
cv_scores = cross_val_score(svm_clf, X_train, y_train, cv=5, scoring='accuracy')

print("Cross-validation accuracy scores:", cv_scores)
print("Mean CV accuracy:", cv_scores.mean())


Cross-validation accuracy scores: [0.875      0.875      0.97916667 0.9375     0.95833333]
Mean CV accuracy: 0.925
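
For a classifier and an integer `cv`, `cross_val_score` splits the data with stratified folds and no shuffling. The sketch below makes that splitter explicit and shows that both calls give the same scores. It runs on synthetic blobs rather than the face data, and the `*_demo` names are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data: 4 classes, 200 samples, 20 features.
X_demo, y_demo = make_blobs(n_samples=200, centers=4, n_features=20,
                            random_state=42)

clf = SVC(kernel='linear')

# Implicit splitter: an integer cv with a classifier means StratifiedKFold.
scores_int = cross_val_score(clf, X_demo, y_demo, cv=5, scoring='accuracy')

# Explicit splitter with the same settings (no shuffling).
skf = StratifiedKFold(n_splits=5)
scores_skf = cross_val_score(clf, X_demo, y_demo, cv=skf, scoring='accuracy')

print("Integer cv:     ", scores_int)
print("StratifiedKFold:", scores_skf)
```

Passing `StratifiedKFold(n_splits=5, shuffle=True, random_state=42)` instead would shuffle the samples before splitting.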


**Step 4: Use K-Means for Dimensionality Reduction**

In [61]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Evaluate the silhouette score for each candidate number of clusters
k_values = range(2, 21)
silhouette_scores = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_train)
    score = silhouette_score(X_train, kmeans.labels_)
    silhouette_scores.append(score)
    print(f"Number of clusters: {k}, Silhouette score: {score}")

# Pick the k with the highest silhouette score
optimal_k = k_values[silhouette_scores.index(max(silhouette_scores))]
print("Optimal number of clusters:", optimal_k)

# Refit with the chosen k. transform() maps each image to its distances from
# the cluster centers, so the reduced data has optimal_k features.
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
X_train_reduced = kmeans.fit_transform(X_train)
X_val_reduced = kmeans.transform(X_val)
X_test_reduced = kmeans.transform(X_test)


Number of clusters: 2, Silhouette score: 0.1586581915616989
Number of clusters: 3, Silhouette score: 0.1044413223862648
Number of clusters: 4, Silhouette score: 0.10736005753278732
Number of clusters: 5, Silhouette score: 0.10601257532835007
Number of clusters: 6, Silhouette score: 0.09856193512678146
Number of clusters: 7, Silhouette score: 0.08997952193021774
Number of clusters: 8, Silhouette score: 0.08044768869876862
Number of clusters: 9, Silhouette score: 0.0800122395157814
Number of clusters: 10, Silhouette score: 0.07784409821033478
Number of clusters: 11, Silhouette score: 0.07685498893260956
Number of clusters: 12, Silhouette score: 0.07422864437103271
Number of clusters: 13, Silhouette score: 0.08227722346782684
Number of clusters: 14, Silhouette score: 0.08302313834428787
Number of clusters: 15, Silhouette score: 0.08406933397054672
Number of clusters: 16, Silhouette score: 0.08382843434810638
Number of clusters: 17, Silhouette score: 0.08284001052379608
Number of clusters:
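
`KMeans.transform` returns each sample's distance to every cluster center, so the reduced matrix has one column per cluster. A small sketch that checks this on synthetic data (an assumption, chosen so the example runs without downloading the faces):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in: 100 samples with 50 features.
X_demo, _ = make_blobs(n_samples=100, centers=3, n_features=50, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42)
X_dist = km.fit_transform(X_demo)

# One column per cluster: 50 features are reduced to 3 distances.
print(X_demo.shape, "->", X_dist.shape)

# Each entry is the Euclidean distance from a sample to a cluster center.
manual = np.linalg.norm(
    X_demo[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
print("Matches manual distances:", np.allclose(X_dist, manual, atol=1e-6))
```

In the notebook, the reduced face data therefore has `optimal_k` columns. If the silhouette criterion selects a small `k`, the classifier in the next step sees only that many features per image.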

**Step 5: Train a Classifier Using the Dimensionality-Reduced Dataset**

In [62]:
# Train the classifier on the reduced training set
svm_clf.fit(X_train_reduced, y_train)

val_accuracy = svm_clf.score(X_val_reduced, y_val)
print("Validation accuracy on the reduced dataset:", val_accuracy)


Validation accuracy on the reduced dataset: 0.275
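
The silhouette score favors the `k` with the best-separated clusters, which is not necessarily the `k` that yields the most useful features for classification. One alternative is to choose `k` by validation accuracy. This is a different selection criterion from the one used above, so the sketch below is an illustration on synthetic data and not a reproduction of the notebook's results. The candidate values of `k` and all `*_demo` names are arbitrary assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the face data: 10 classes, 30 features.
X_demo, y_demo = make_blobs(n_samples=300, centers=10, n_features=30,
                            random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42)

val_scores = {}
for k in (2, 5, 10, 20):
    # KMeans acts as a transformer: its output is the distance to k centers.
    model = make_pipeline(
        KMeans(n_clusters=k, n_init=10, random_state=42),
        SVC(kernel='linear'),
    )
    model.fit(X_tr, y_tr)
    val_scores[k] = model.score(X_va, y_va)
    print(f"k={k}: validation accuracy {val_scores[k]:.3f}")

# Keep the k whose pipeline scored highest on the validation split.
best_k = max(val_scores, key=val_scores.get)
print("Best k by validation accuracy:", best_k)
```

Wrapping both steps in a pipeline keeps the K-Means fit restricted to the training split, so the validation score is not influenced by validation samples.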


**Step 6: Apply DBSCAN for Clustering**

In [6]:
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# Project the training images onto the first 50 principal components
pca = PCA(n_components=50, random_state=42)
X_pca = pca.fit_transform(X_train)

# Cluster the projected data with DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5, metric='euclidean', n_jobs=-1)
dbscan.fit(X_pca)

# The label -1 marks noise points, so it is not counted as a cluster
labels = dbscan.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)

print(f"Number of clusters found by DBSCAN: {n_clusters}")
print(f"Number of noise points found by DBSCAN: {n_noise}")


Number of clusters found by DBSCAN: 0
Number of noise points found by DBSCAN: 240
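
With `eps=0.5`, all 240 training points are labeled noise, which means no point has 5 samples (itself included) within distance 0.5 in the 50-dimensional PCA space. A common heuristic is to look at each point's distance to its `min_samples`-th nearest neighbor and choose `eps` from that distribution. The sketch below does this on synthetic blobs rather than the PCA-projected faces. `suggest_eps` and its `quantile` parameter are hypothetical names introduced for illustration, and the 0.9 quantile is an arbitrary starting point, not a tuned value:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

def suggest_eps(X, min_samples=5, quantile=0.9):
    """Hypothetical helper: return (eps, sorted k-distances) for X."""
    # Querying the fitted data itself returns each point as its own nearest
    # neighbor (distance 0), which matches DBSCAN counting the point itself
    # toward min_samples.
    nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
    distances, _ = nn.kneighbors(X)
    k_dist = distances[:, -1]
    return float(np.quantile(k_dist, quantile)), np.sort(k_dist)

# Synthetic stand-in for X_pca: 240 samples, 50 features.
X_demo, _ = make_blobs(n_samples=240, centers=4, n_features=50,
                       random_state=42)

eps, k_dist_sorted = suggest_eps(X_demo, min_samples=5)
print(f"k-distance range: {k_dist_sorted[0]:.2f} to {k_dist_sorted[-1]:.2f}")
print(f"Suggested eps: {eps:.2f}")

db = DBSCAN(eps=eps, min_samples=5).fit(X_demo)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise = int(np.sum(db.labels_ == -1))
print(f"Clusters: {n_clusters}, noise points: {n_noise}")
```

Applied to `X_pca`, the printed k-distance range would show how far `eps=0.5` sits below the typical neighbor distance. Plotting `k_dist_sorted` and looking for a bend in the curve is another way to pick a value.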
