# Lab 3 - Part 2: PCA and Clustering (12 marks)
### Due Date: Monday, March 13 at 12pm

Author: Michael Le

The purpose of this portion of the assignment is to practice using PCA and clustering techniques on a given dataset.

In [4]:
import numpy as np
import pandas as pd

## 0. Function definitions (2 marks)

In [14]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def cluster_fn(n_clusters, X, n_components=0):
    '''Calculate silhouette score for a given dataset, number of clusters,
       and number of principal components using KMeans clustering (random_state=0)
        
        n_clusters (int): number of clusters to use for Kmeans
        n_components (int): number of principal components (optional)
        X (numpy.array or pandas.DataFrame): unlabelled dataset
        
        returns: silhouette score
    
    '''
    # TODO: Implement function body
    if (n_components > 0):
        pca_model = PCA(n_components=n_components)
        pca_model.fit(X)
        X = pca_model.transform(X)
        
    kmeans_model = KMeans(n_clusters=n_clusters, random_state=0)
    kmeans_model.fit(X)
    cluster_labels = kmeans_model.predict(X)
    mean_silhouette_coefficient = silhouette_score(X, cluster_labels)
    
    return mean_silhouette_coefficient
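
For reference, the value returned by `silhouette_score` is the mean, over all samples, of the per-sample silhouette coefficient. For a sample $i$, let $a(i)$ be the mean distance from $i$ to the other points in its own cluster and $b(i)$ the mean distance from $i$ to the points in the nearest other cluster. The symbols $a$, $b$ and $s$ are notation introduced here for illustration and are not part of the assignment:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
$$

Scores near 1 indicate compact, well-separated clusters; scores near 0 indicate overlapping clusters.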

## 1. Load data (2 marks)

For this assignment, we will use the dataset found below:

https://archive.ics.uci.edu/ml/datasets/Chemical+Composition+of+Ceramic+Samples

In [7]:
# TODO: Import dataset
df = pd.read_csv("Chemical Composion of Ceramic.csv")

df.head()

Unnamed: 0,Ceramic Name,Part,Na2O,MgO,Al2O3,SiO2,K2O,CaO,TiO2,Fe2O3,MnO,CuO,ZnO,PbO2,Rb2O,SrO,Y2O3,ZrO2,P2O5
0,FLQ-1-b,Body,0.62,0.38,19.61,71.99,4.84,0.31,0.07,1.18,630,10,70,10,430,0,40,80,90
1,FLQ-2-b,Body,0.57,0.47,21.19,70.09,4.98,0.49,0.09,1.12,380,20,80,40,430,-10,40,100,110
2,FLQ-3-b,Body,0.49,0.19,18.6,74.7,3.47,0.43,0.06,1.07,420,20,50,50,380,40,40,80,200
3,FLQ-4-b,Body,0.89,0.3,18.01,74.19,4.01,0.27,0.09,1.23,460,20,70,60,380,10,40,70,210
4,FLQ-5-b,Body,0.03,0.36,18.41,73.99,4.33,0.65,0.05,1.19,380,40,90,40,360,10,30,80,150


Two of the columns are non-numeric. For this assignment, we will remove those two columns and focus on clustering the ceramic samples based on the numerical measurements.

In [9]:
# TODO: Remove non-numeric columns
df = df.drop(columns=["Ceramic Name", "Part"])
df.head()

Unnamed: 0,Na2O,MgO,Al2O3,SiO2,K2O,CaO,TiO2,Fe2O3,MnO,CuO,ZnO,PbO2,Rb2O,SrO,Y2O3,ZrO2,P2O5
0,0.62,0.38,19.61,71.99,4.84,0.31,0.07,1.18,630,10,70,10,430,0,40,80,90
1,0.57,0.47,21.19,70.09,4.98,0.49,0.09,1.12,380,20,80,40,430,-10,40,100,110
2,0.49,0.19,18.6,74.7,3.47,0.43,0.06,1.07,420,20,50,50,380,40,40,80,200
3,0.89,0.3,18.01,74.19,4.01,0.27,0.09,1.23,460,20,70,60,380,10,40,70,210
4,0.03,0.36,18.41,73.99,4.33,0.65,0.05,1.19,380,40,90,40,360,10,30,80,150


## 2. Implement clustering (8 marks)

### 2.1 Cluster using raw data (1 mark)

Implement K-means clustering using the raw data. Compare the silhouette scores using 2, 3, 4, 5 and 6 clusters.

In [11]:
# TODO: Implement clustering with raw data using cluster_fn above
scores = []

for n_clusters in range(2, 7):
    mean_silhouette_score = cluster_fn(n_clusters=n_clusters, X=df)
    score_dict = {'Number of Clusters': n_clusters, 'Silhouette Score': '{:.3f}'.format(mean_silhouette_score)}
    scores.append(score_dict)
    
scores_df = pd.DataFrame(scores)
scores_df

Unnamed: 0,Number of Clusters,Silhouette Score
0,2,0.584
1,3,0.562
2,4,0.543
3,5,0.508
4,6,0.51


### 2.2 Cluster using PCA-transformed data (2 marks)

Implement K-means clustering using the PCA-transformed data. Compare the silhouette scores using 2, 3, 4, 5 and 6 clusters and 2, 3, 4, 5 and 6 principal components.

In [21]:
# TODO: Implement clustering with PCA-transformed data using cluster_fn above
scores_with_components = []

for n_clusters in range(2, 7):
    for n_components in range(2, 7):
        mean_silhouette_score = cluster_fn(n_clusters=n_clusters, X=df, n_components=n_components)
        score_dict = {'Number of Clusters': n_clusters, 'Number of Components': n_components, 
                      'Silhouette Score': '{:.3f}'.format(mean_silhouette_score)}
        scores_with_components.append(score_dict)
    
scores_with_components_df = pd.DataFrame(scores_with_components)
scores_with_components_df
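
When comparing different numbers of principal components, it can help to look at how much variance each component captures. The sketch below shows one way to inspect this with `explained_variance_ratio_`. It runs on synthetic stand-in data, not the ceramic dataset, and the names `X_demo` and `cumulative` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data (NOT the ceramic dataset): 50 samples, 6 features
# whose spreads differ by orders of magnitude.
rng = np.random.default_rng(0)
feature_scales = np.array([100.0, 50.0, 10.0, 1.0, 0.5, 0.1])
X_demo = rng.normal(size=(50, 6)) * feature_scales

# Fit PCA with as many components as features so the ratios sum to 1.
pca = PCA(n_components=6).fit(X_demo)

# Fraction of total variance captured by each component, largest first.
ratios = pca.explained_variance_ratio_

# Running total: variance retained when keeping the first k components.
cumulative = np.cumsum(ratios)

for k, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"components={k}: ratio={r:.4f}, cumulative={c:.4f}")
```

If the first two components already hold most of the variance, adding more components mostly adds low-variance directions to the clustering input.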

### 2.3 Display results (2 marks)

Print the results for 2.1 and 2.2 in a table. Include column and row labels.

In [30]:
# TODO: Display results
print("2.1 Results: Silhouette Scores for KMeans Clustering of Data for Chemical Composition of Ceramic Samples")
display(scores_df)

print("\n")

print("2.2 Results: Silhouette Scores for KMeans Clustering of PCA-Transformed Data for Chemical Composition of Ceramic Samples")
display(scores_with_components_df.astype(float).pivot_table(index="Number of Clusters", columns="Number of Components", 
                                              values="Silhouette Score"))

2.1 Results: Silhouette Scores for KMeans Clustering of Data for Chemical Composition of Ceramic Samples


Unnamed: 0,Number of Clusters,Silhouette Score
0,2,0.584
1,3,0.562
2,4,0.543
3,5,0.508
4,6,0.51




2.2 Results: Silhouette Scores for KMeans Clustering of PCA-Transformed Data for Chemical Composition of Ceramic Samples


Number of Components,2.0,3.0,4.0,5.0,6.0
Number of Clusters,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2.0,0.619,0.6,0.59,0.587,0.586
3.0,0.612,0.587,0.571,0.567,0.565
4.0,0.601,0.571,0.554,0.549,0.547
5.0,0.567,0.546,0.521,0.516,0.513
6.0,0.569,0.551,0.53,0.524,0.515


**Question**: Which combination of number of clusters and number of components produced the best results? What is the silhouette score for this combination? **(3 marks)**

From the results table for question 2.2: the combination of 2 clusters with 2 principal components produced the best result, with a silhouette score of 0.619.
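
The best combination can also be located programmatically rather than by eye. The sketch below rebuilds the 2.2 results table by hand from the values printed above and looks up its maximum. The variable names (`pivot`, `best_clusters`, and so on) are hypothetical.

```python
import pandas as pd

# Silhouette scores copied from the 2.2 results table above
# (rows: number of clusters, columns: number of PCA components).
pivot = pd.DataFrame(
    [[0.619, 0.600, 0.590, 0.587, 0.586],
     [0.612, 0.587, 0.571, 0.567, 0.565],
     [0.601, 0.571, 0.554, 0.549, 0.547],
     [0.567, 0.546, 0.521, 0.516, 0.513],
     [0.569, 0.551, 0.530, 0.524, 0.515]],
    index=pd.Index([2, 3, 4, 5, 6], name="Number of Clusters"),
    columns=pd.Index([2, 3, 4, 5, 6], name="Number of Components"),
)

# Row whose best score is highest, then the best column within that row.
best_clusters = pivot.max(axis=1).idxmax()
best_components = pivot.loc[best_clusters].idxmax()
best_score = pivot.loc[best_clusters, best_components]

print(f"clusters={best_clusters}, components={best_components}, score={best_score:.3f}")
# prints: clusters=2, components=2, score=0.619
```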

## 3. Improve results (Bonus - 3 marks)

Think about how you could improve the results from the previous section. Two potential methods include preprocessing the data or selecting a different clustering algorithm. Repeat section 2 with your selected improvement method to determine what the new silhouette scores would be.
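
As an illustration of the preprocessing option, the sketch below standardizes features before clustering. In the `df.head()` output, some columns hold small decimals (for example `Na2O`) while others hold values in the hundreds (for example `MnO`), so the larger columns may dominate Euclidean distances. The code uses synthetic stand-in data rather than the ceramic dataset, the helper `kmeans_silhouette` is hypothetical, and whether scaling actually raises the silhouette score on the real data would need to be checked.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the ceramic dataset): two features on very
# different scales, loosely mimicking the small-decimal and hundreds-valued columns.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(loc=1.0, scale=0.3, size=100),
    rng.normal(loc=400.0, scale=100.0, size=100),
])

# Rescale every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_demo)

def kmeans_silhouette(X, n_clusters):
    """Hypothetical helper: fit KMeans and return the mean silhouette score."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels)

raw_score = kmeans_silhouette(X_demo, 2)
scaled_score = kmeans_silhouette(X_scaled, 2)
print(f"raw: {raw_score:.3f}, scaled: {scaled_score:.3f}")
```

Silhouette scores computed on raw and scaled data use different distance spaces, so they should be compared with some care.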

In [71]:
# TODO: Repeat steps 2.1-2.3 using a different method/preprocessing/etc.
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def cluster_fn_2(n_gm_components, X, n_pca_components=0):
    '''Calculate silhouette score for a given dataset, number of clusters,
       and number of principal components using Gaussian Mixture clustering (random_state=0)
        
        n_gm_components (int): number of clusters to use for GaussianMixture
        n_pca_components (int): number of principal components (optional)
        X (numpy.array or pandas.DataFrame): unlabelled dataset
        
        returns: silhouette score
    
    '''
    # TODO: Implement function body
    if (n_pca_components > 0):
        pca_model = PCA(n_components=n_pca_components)
        pca_model.fit(X)
        X = pca_model.transform(X)
        
    gm_model = GaussianMixture(n_components=n_gm_components, random_state=0)
    gm_model.fit(X)
    cluster_labels = gm_model.predict(X)
    mean_silhouette_coefficient = silhouette_score(X, cluster_labels)
    
    return mean_silhouette_coefficient

In [72]:
scores = []

for n_gm_components in range(2, 7):
    mean_silhouette_score = cluster_fn_2(n_gm_components=n_gm_components, X=df)
    score_dict = {'Number of GM Components': n_gm_components, 'Silhouette Score': '{:.3f}'.format(mean_silhouette_score)}
    scores.append(score_dict)
    
scores_df = pd.DataFrame(scores)
scores_df

Unnamed: 0,Number of GM Components,Silhouette Score
0,2,0.56
1,3,0.563
2,4,0.553
3,5,0.504
4,6,0.51


In [74]:
scores_with_pca_components = []

for n_gm_components in range(2, 7):
    for n_pca_components in range(2, 7):
        mean_silhouette_score = cluster_fn_2(n_gm_components=n_gm_components, X=df, n_pca_components=n_pca_components)
        score_dict = {'Number of GM Components': n_gm_components, 'Number of PCA Components': n_pca_components, 
                      'Silhouette Score': '{:.3f}'.format(mean_silhouette_score)}
        scores_with_pca_components.append(score_dict)
    
scores_with_pca_components_df = pd.DataFrame(scores_with_pca_components)
scores_with_pca_components_df

Unnamed: 0,Number of GM Components,Number of PCA Components,Silhouette Score
0,2,2,0.551
1,2,3,0.534
2,2,4,0.522
3,2,5,0.519
4,2,6,0.516
5,3,2,0.583
6,3,3,0.549
7,3,4,0.54
8,3,5,0.518
9,3,6,0.457


In [55]:
# TODO: Display results
print("3.1 Results: Silhouette Scores for Gaussian Clustering of Data for Chemical Composition of Ceramic Samples")
display(scores_df)

print("\n")

print("3.2 Results: Silhouette Scores for Gaussian Clustering of PCA-Transformed Data for Chemical Composition of Ceramic Samples")
display(scores_with_pca_components_df.astype(float).pivot_table(index="Number of GM Components", columns="Number of PCA Components", 
                                              values="Silhouette Score"))

3.1 Results: Silhouette Scores for Gaussian Clustering of Data for Chemical Composition of Ceramic Samples


Unnamed: 0,Number of GM Components,Silhouette Score
0,2,0.56
1,3,0.563
2,4,0.553
3,5,0.504
4,6,0.51




3.2 Results: Silhouette Scores for Gaussian Clustering of PCA-Transformed Data for Chemical Composition of Ceramic Samples


Number of PCA Components,2.0,3.0,4.0,5.0,6.0
Number of GM Components,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2.0,0.551,0.534,0.522,0.519,0.516
3.0,0.583,0.549,0.54,0.518,0.457
4.0,0.558,0.547,0.52,0.535,0.533
5.0,0.508,0.521,0.49,0.492,0.507
6.0,0.569,0.514,0.515,0.509,0.513


**Question**: Why did you select this improvement method? Which combination of number of clusters and number of components produced the best results? Did you improve the silhouette scores? If yes, how much of an improvement did you get over the previous results?

I selected the Gaussian mixture model (GMM) because it builds on the ideas of K-means. A weakness of K-means is that it has no measure of probability: it assigns each point to a cluster using only the Euclidean distance from the point to the nearest cluster center. While K-means finds suitable clusters for simple, well-separated data with roughly circular clusters, it has no built-in way of accounting for oblong or elliptical clusters, so cluster assignments get mixed where the circular clusters overlap. A Gaussian mixture model addresses this weakness by giving each point a probability of belonging to every cluster, rather than assigning it only to the closest one. It also models each cluster as an ellipse rather than a circle, which accounts for non-circular clusters.
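
The two properties described above (probabilistic membership and elliptical cluster shapes) can be seen in a minimal sketch. It uses synthetic elongated clusters, not the ceramic dataset, and the names `X_demo`, `soft_probs` and `hard_labels` are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data: two clusters stretched along the x-axis.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=[3.0, 0.5], size=(100, 2))
cluster_b = rng.normal(loc=[0.0, 4.0], scale=[3.0, 0.5], size=(100, 2))
X_demo = np.vstack([cluster_a, cluster_b])

# covariance_type="full" lets each component learn its own ellipse.
gm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gm.fit(X_demo)

# Hard assignment: one label per sample, as used for the silhouette score.
hard_labels = gm.predict(X_demo)

# Soft assignment: one probability per sample per component; rows sum to 1.
soft_probs = gm.predict_proba(X_demo)

print(soft_probs[:3].round(3))
print(gm.covariances_.shape)  # (n_components, n_features, n_features)
```

Because `silhouette_score` only sees the hard labels, the probabilistic information from `predict_proba` does not enter the scores reported in this section.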

The combination of 3 Gaussian mixture components and 2 PCA components gave the best silhouette score, 0.583. However, this did not improve on the best K-means result: 2 clusters with 2 PCA components scored 0.619, which is 0.036 higher. This is likely because the ceramic data is better suited to the roughly circular clusters that K-means produces.
