# Lab 3 - Part 2: PCA and Clustering (12 marks)
### Due Date: Monday, March 13 at 12pm

Author: *Hannah D'Souza*

The purpose of this portion of the assignment is to practice using PCA and clustering techniques on a given dataset

In [55]:
import numpy as np
import pandas as pd

## 0. Function definitions (2 marks)

In [56]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def cluster_fn(n_clusters, X, n_components=0):
    '''Calculate silhouette score for a given dataset, number of clusters, 
       and number of principle components using Kmeans clustering (random_state=0)
        
        n_clusters (int): number of clusters to use for Kmeans
        n_components (int): number of principle components (optional)
        X (numpy.array or pandas.DataFrame): unlabelled dataset
        
        returns: silhouette score
    
    '''
    # TODO: Implement function body
    
    #Check if the components are greater than 0, if it is then apply PCA
    if n_components > 0:
        pca = PCA(n_components=n_components)
        X = pca.fit_transform(X)
        
    kmeans = KMeans(n_clusters, random_state=50)
    labels = kmeans.fit_predict(X)
    
    s_score = silhouette_score(X,labels, random_state =50)
    
    return s_score
    
    
    
                              
    
    

## 1. Load data (2 marks)

For this assignment, we will use the dataset found below:

https://archive.ics.uci.edu/ml/datasets/Chemical+Composition+of+Ceramic+Samples

In [57]:
# TODO: Import dataset

import pandas as pd

df = pd.read_csv('Chemical Composion of Ceramic.csv')



Two of the columns are non-numeric. For this assignment, we will remove those two columns and focus on clustering the ceramic samples based on the numerical measurements

In [58]:
# TODO: Remove non-numeric columns

import pandas as pd

df = df.select_dtypes(include='number')

## 2. Implement clustering (8 marks)

### 2.1 Cluster using raw data (1 mark)

Implement Kmeans clustering using the raw data. Compare the silhouette scores using 2, 3, 4, 5 and 6 clusters

In [69]:
# TODO: Implement clustering with raw data using cluster_fn above

results = pd.DataFrame(columns=['Number of Clusters', 'Silhouette score'])

for n_clusters in range(2,7):
    s_score = cluster_fn(n_clusters, df, n_components=0)
    print(f"Number of clusters: {n_clusters} | Silhouette score: {s_score}")
    results = results.append({'Number of Clusters': n_clusters,'Silhouette score': s_score}, ignore_index=True)
    
    
    

Number of clusters: 2 | Silhouette score: 0.5840130686182087
Number of clusters: 3 | Silhouette score: 0.5616399840165864
Number of clusters: 4 | Silhouette score: 0.543410860697891


  results = results.append({'Number of Clusters': n_clusters,'Silhouette score': s_score}, ignore_index=True)
  results = results.append({'Number of Clusters': n_clusters,'Silhouette score': s_score}, ignore_index=True)
  results = results.append({'Number of Clusters': n_clusters,'Silhouette score': s_score}, ignore_index=True)


Number of clusters: 5 | Silhouette score: 0.5080642704990367
Number of clusters: 6 | Silhouette score: 0.48969498223151237


  results = results.append({'Number of Clusters': n_clusters,'Silhouette score': s_score}, ignore_index=True)
  results = results.append({'Number of Clusters': n_clusters,'Silhouette score': s_score}, ignore_index=True)


### 2.2 Cluster using PCA-transformed data (2 marks)

Implement Kmeans clustering using the PCA-transformed data. Compare the silhouette scores using 2, 3, 4, 5 and 6 clusters and 2, 3, 4, 5 and 6 principle components 

In [73]:
# TODO: Implement clustering with PCA-transformed data using cluster_fn above
results_1 = pd.DataFrame(columns=['Number of Clusters', 'Silhouette score', 'Principle component'])
for n_clusters in range(2, 7):
    for n_components in range(2, 7):
        principle_score  = cluster_fn(n_clusters, df, n_components)
        print(f"Number of clusters: {n_clusters} | Silhouette score: {s_score} | Principle components: {n_components}")
        results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)

Number of clusters: 2 | Silhouette score: 0.48969498223151237 | Principle components: 2
Number of clusters: 2 | Silhouette score: 0.48969498223151237 | Principle components: 3
Number of clusters: 2 | Silhouette score: 0.48969498223151237 | Principle components: 4
Number of clusters: 2 | Silhouette score: 0.48969498223151237 | Principle components: 5


  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)
  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)
  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)
  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)


Number of clusters: 2 | Silhouette score: 0.48969498223151237 | Principle components: 6
Number of clusters: 3 | Silhouette score: 0.48969498223151237 | Principle components: 2
Number of clusters: 3 | Silhouette score: 0.48969498223151237 | Principle components: 3


  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)
  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)
  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)


Number of clusters: 3 | Silhouette score: 0.48969498223151237 | Principle components: 4
Number of clusters: 3 | Silhouette score: 0.48969498223151237 | Principle components: 5
Number of clusters: 3 | Silhouette score: 0.48969498223151237 | Principle components: 6
Number of clusters: 4 | Silhouette score: 0.48969498223151237 | Principle components: 2


  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)
  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)
  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)
  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)


Number of clusters: 4 | Silhouette score: 0.48969498223151237 | Principle components: 3
Number of clusters: 4 | Silhouette score: 0.48969498223151237 | Principle components: 4
Number of clusters: 4 | Silhouette score: 0.48969498223151237 | Principle components: 5


  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)
  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)
  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)


Number of clusters: 4 | Silhouette score: 0.48969498223151237 | Principle components: 6
Number of clusters: 5 | Silhouette score: 0.48969498223151237 | Principle components: 2
Number of clusters: 5 | Silhouette score: 0.48969498223151237 | Principle components: 3


  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)
  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)
  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)


Number of clusters: 5 | Silhouette score: 0.48969498223151237 | Principle components: 4
Number of clusters: 5 | Silhouette score: 0.48969498223151237 | Principle components: 5
Number of clusters: 5 | Silhouette score: 0.48969498223151237 | Principle components: 6


  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)
  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)
  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)


Number of clusters: 6 | Silhouette score: 0.48969498223151237 | Principle components: 2
Number of clusters: 6 | Silhouette score: 0.48969498223151237 | Principle components: 3
Number of clusters: 6 | Silhouette score: 0.48969498223151237 | Principle components: 4


  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)
  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)
  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)


Number of clusters: 6 | Silhouette score: 0.48969498223151237 | Principle components: 5
Number of clusters: 6 | Silhouette score: 0.48969498223151237 | Principle components: 6


  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)
  results_1 = results_1.append({'Number of Clusters': n_clusters,'Silhouette score': principle_score,'Principle component': n_components}, ignore_index=True)


### 2.3 Display results (2 marks)

Print the results for 2.1 and 2.2 in a table. Include column and row labels

In [74]:
# TODO: Display results

print("Results for 2.1")
results




Results for 2.1


Unnamed: 0,Number of Clusters,Silhouette score
0,2.0,0.584013
1,3.0,0.56164
2,4.0,0.543411
3,5.0,0.508064
4,6.0,0.489695


In [75]:
print("Results for 2.2")
results_1

Results for 2.2


Unnamed: 0,Number of Clusters,Silhouette score,Principle component
0,2.0,0.619442,2.0
1,2.0,0.599961,3.0
2,2.0,0.589955,4.0
3,2.0,0.587472,5.0
4,2.0,0.585963,6.0
5,3.0,0.611625,2.0
6,3.0,0.586447,3.0
7,3.0,0.570949,4.0
8,3.0,0.566995,5.0
9,3.0,0.564725,6.0


**Question**: Which combination of number of clusters and number of components produced the best results? What is the silhouette score for this combination? **(3 marks)**
With 2 components, 2 clusters we get a silhoullete_score of 0.61




## 3. Improve results (Bonus - 3 marks)

Think about how you could improve the results from the previous section. Two potential methods include preprocessing the data or selecting a different clustering algorithm. Repeat section 2 with your selected improvement method to determine what the new silhouette scores would be

In [62]:
# TODO: Repeat steps 2.1-2.3 using a different method/preprocessing/etc.

**Question**: Why did you select this improvement method? Which combination of number of clusters and number of components produced the best results? Did you improve the silhouette scores? If yes, how much of an improvement did you get over the previous results?

*ANSWER HERE*