<a href="https://colab.research.google.com/github/luke2134/Assignment-3-Hierarchical-Clustering-Olivetti-Faces/blob/main/olivetti_faces_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Assignment 3: Hierarchical Clustering**

**1. Retrieve and Load the Olivetti Faces Dataset**

In [13]:
from sklearn.datasets import fetch_olivetti_faces
# Fetch the Olivetti Faces dataset, shuffle the data for randomness, and set a random state for reproducibility
faces_data = fetch_olivetti_faces(shuffle=True, random_state=42)
# Assign the image data to 'X' and the target labels (person IDs) to 'y'
X, y = faces_data.data, faces_data.target

# Added after the presentation: confirm that the dataset has been loaded successfully
print("Olivetti Faces dataset loaded successfully.")


Olivetti Faces dataset loaded successfully.


**2. Split the Dataset using Stratified Sampling**

In [14]:
from sklearn.model_selection import train_test_split
# Split the dataset into training (60%) and temporary (40%) sets using stratified sampling,
# so each person contributes the same share of images to both splits
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=42)
# Split the temporary set in half to get validation and test sets (20% of the full dataset each),
# again stratifying so each person is represented equally in both
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)

# Added after the presentation: confirm that the dataset split has been completed
print("Dataset split into training, validation, and test sets.")


Dataset split into training, validation, and test sets.
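
As a quick sanity check on the stratified split, the sketch below repeats the two `train_test_split` calls on stand-in labels (40 classes with 10 samples each, matching the Olivetti layout) and counts the samples per class in each split. The `*_demo` names and the placeholder feature array are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the Olivetti layout: 40 people, 10 images each (assumption)
y_demo = np.repeat(np.arange(40), 10)
# Placeholder features; only the labels matter for this check
X_demo = np.zeros((y_demo.size, 1))

# Same two-stage split as in the cell above
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=42
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

# Count how many samples of each class landed in each split
train_counts = np.bincount(y_tr, minlength=40)
val_counts = np.bincount(y_va, minlength=40)
test_counts = np.bincount(y_te, minlength=40)

print("Split sizes:", len(y_tr), len(y_va), len(y_te))
print("Per-class counts (train/val/test):",
      set(train_counts), set(val_counts), set(test_counts))
```

The same `np.bincount` calls can be run on `y_train`, `y_val`, and `y_test` to check the real split.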


**3. Train a Classifier using k-fold Cross-Validation**

In [3]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Initialize an SVC classifier with a linear kernel
clf = SVC(kernel='linear')

# Perform 5-fold cross-validation on the training set to evaluate the classifier's performance
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)

# Print the average accuracy across the 5 folds to assess how well the classifier performs on the dataset
print("CV accuracy:", cv_scores.mean())


CV accuracy: 0.925
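
The mean alone hides how much the folds disagree. Below is a minimal sketch, assuming synthetic data from `make_classification` in place of the face images, that reports the per-fold scores and their spread. Passing an explicit `StratifiedKFold` makes the fold assignment visible; an integer `cv` with a classifier already uses stratified folds, but without shuffling. The `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 200 samples, 20 features, 4 classes
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=10, n_classes=4, random_state=42
)

# Explicit stratified 5-fold splitter with shuffling for reproducible folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(SVC(kernel="linear"), X_demo, y_demo, cv=cv)

# Report each fold as well as the mean and standard deviation
print("Per-fold accuracy:", np.round(fold_scores, 3))
print(f"Mean: {fold_scores.mean():.3f}, std: {fold_scores.std():.3f}")
```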


**4. Apply Hierarchical Clustering**

In [15]:
from sklearn.cluster import AgglomerativeClustering

# Apply Agglomerative Clustering with Euclidean distance and Ward linkage
# (Ward linkage only supports Euclidean distance)
ahc_euclidean = AgglomerativeClustering(n_clusters=40, metric='euclidean', linkage='ward')
# Fit the model to the training data and obtain the cluster labels
clusters_euclidean = ahc_euclidean.fit_predict(X_train)

In [8]:
# Apply Agglomerative Clustering with Minkowski distance and complete linkage
# Note: no exponent p is passed here; Minkowski distance with p=2 is the Euclidean distance
ahc_minkowski = AgglomerativeClustering(n_clusters=40, metric='minkowski', linkage='complete')
# Fit the model to the training data and obtain the cluster labels
clusters_minkowski = ahc_minkowski.fit_predict(X_train)


In [16]:
# Apply Agglomerative Clustering with cosine distance and average linkage
ahc_cosine = AgglomerativeClustering(n_clusters=40, metric='cosine', linkage='average')
# Fit the model to the training data and obtain the cluster labels
clusters_cosine = ahc_cosine.fit_predict(X_train)

# Display a message confirming clustering has been completed with the three distance metrics
print("Hierarchical clustering applied using Euclidean, Minkowski, and Cosine metrics.")


Hierarchical clustering applied using Euclidean, Minkowski, and Cosine metrics.
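
The Minkowski run above does not set an exponent. The sketch below uses SciPy's `pdist` on random stand-in vectors (an assumption, not the face data) to show how the Minkowski distance relates to the other metrics: with SciPy's default exponent `p=2` it matches the Euclidean distance, and with `p=1` it matches the Manhattan distance. If scikit-learn forwards the metric name to SciPy without an exponent, the Euclidean and Minkowski runs would differ in their linkage method rather than in their distance.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Small random stand-in for the flattened image vectors (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 8))

# Condensed pairwise distance vectors under each metric
d_euclidean = pdist(X_demo, metric="euclidean")
d_minkowski_default = pdist(X_demo, metric="minkowski")  # SciPy default exponent p=2
d_minkowski_p1 = pdist(X_demo, metric="minkowski", p=1)  # p=1 is the Manhattan distance
d_manhattan = pdist(X_demo, metric="cityblock")

same_as_euclidean = np.allclose(d_euclidean, d_minkowski_default)
same_as_manhattan = np.allclose(d_minkowski_p1, d_manhattan)
print("Minkowski (default p) equals Euclidean:", same_as_euclidean)
print("Minkowski (p=1) equals Manhattan:", same_as_manhattan)
```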


**5. Use Silhouette Score to Choose Number of Clusters**

In [18]:
from sklearn.metrics import silhouette_score

# Calculate the silhouette score for the Euclidean (Ward) clusters at the fixed n_clusters=40
score_euclidean = silhouette_score(X_train, clusters_euclidean, metric='euclidean')

# Calculate the silhouette score for the clusters obtained using Minkowski distance
score_minkowski = silhouette_score(X_train, clusters_minkowski, metric='minkowski')

# Calculate the silhouette score for the clusters obtained using cosine distance
score_cosine = silhouette_score(X_train, clusters_cosine, metric='cosine')

# Print the silhouette scores for each clustering approach
# (each score is computed in its own distance metric, so compare them with care)
print("Silhouette Score - Euclidean:", score_euclidean)
print("Silhouette Score - Minkowski:", score_minkowski)
print("Silhouette Score - Cosine:", score_cosine)


Silhouette Score - Euclidean: 0.15845941
Silhouette Score - Minkowski: 0.1462153283677593
Silhouette Score - Cosine: 0.19488618
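
The scores above are all computed at a fixed `n_clusters=40`. To actually choose the number of clusters with the silhouette score, one common approach is to sweep over candidate values and keep the one with the highest score. Here is a minimal sketch of that sweep, assuming synthetic blobs from `make_blobs` in place of the face data; the centers, the candidate range, and the `*_demo` names are illustrative assumptions.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption): four well-separated 2D blobs
centers = [[0, 0], [10, 10], [-10, 10], [10, -10]]
X_demo, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=42)

# Fit Ward-linkage clustering for each candidate k and record its silhouette score
scores = {}
for k in range(2, 9):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels, metric="euclidean")

# Keep the candidate with the highest silhouette score
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("Best number of clusters:", best_k)
```

On the face data, `X_demo` would be replaced by `X_train` and the candidate range widened to cover values around 40.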


**6. Train a Classifier Using Reduced Dimensionality**

In [19]:
# Perform 5-fold cross-validation using each image's Euclidean cluster ID as the only feature
# reshape(-1, 1) turns the 1D label array into the 2D (n_samples, 1) array that cross_val_score expects
# Note: cluster IDs are arbitrary labels, but the linear kernel treats them as ordered numbers
cv_scores_cluster = cross_val_score(clf, clusters_euclidean.reshape(-1, 1), y_train, cv=5)

# Print the average accuracy across the 5 folds for the classifier trained on the reduced feature set (clusters)
print("CV accuracy after clustering:", cv_scores_cluster.mean())


CV accuracy after clustering: 0.3
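
A single integer column of cluster IDs gives a linear kernel little to work with, because the IDs are arbitrary labels rather than magnitudes. One possible alternative, not used in the notebook, is to one-hot encode the cluster IDs so that each cluster gets its own column. The sketch below compares both encodings on synthetic blobs. The data and the `*_demo` names are assumptions, and whether one-hot encoding helps on the face data would need to be checked.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 6 classes with 50 samples each
X_demo, y_demo = make_blobs(
    n_samples=300, centers=6, n_features=10, cluster_std=1.0, random_state=42
)

# Cluster the samples, as the notebook does with Ward linkage
n_clusters = 6
cluster_ids = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward").fit_predict(X_demo)

# Option 1: cluster ID as one integer feature (what reshape(-1, 1) produces)
X_int = cluster_ids.reshape(-1, 1)
# Option 2: one column per cluster, set to 1 where the sample belongs to that cluster
X_onehot = np.eye(n_clusters)[cluster_ids]

# Cross-validate the same linear SVC on both encodings
clf_demo = SVC(kernel="linear")
acc_int = cross_val_score(clf_demo, X_int, y_demo, cv=5).mean()
acc_onehot = cross_val_score(clf_demo, X_onehot, y_demo, cv=5).mean()
print("Integer cluster ID accuracy:", acc_int)
print("One-hot cluster ID accuracy:", acc_onehot)
```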
