## Project 6: Clustering
- Name: Levi Grenier
- Date: Sep. 29, 2022

## Instructions

### Description

Practice clustering using the well-known and very popular `Iris` dataset! The Iris flower data set is fun for learning supervised classification algorithms, and is known as a difficult case for unsupervised learning. 
https://cran.r-project.org/web/packages/dendextend/vignettes/Cluster_Analysis.html
<br><br>Yes, there are many examples out there, but see if you can do it yourself :). We can easily hypothesize on how many clusters would yield the best result, so let us prove it through a simple experiment that you could repeat with additional data sets.

### Grading

For grading purposes, we will clear all outputs from all your cells and then run them all from the top.  Please test your notebook in the same fashion before turning it in.

### Submitting Your Solution

To submit your notebook, first clear the outputs of all cells (this won't matter too much this time, but for larger data sets in the future, it will make the file smaller).  Then use File->Download As->Notebook to obtain the notebook file.  Finally, submit the notebook file on Canvas.

### Setup

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA
from sklearn import datasets
import sklearn as sk
import numpy as np
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans

### Problem 1: Data Generation (5 points)
Reference for more information: Chapter 5.11 K-Means in the online course book.

1. Load the `iris` dataset and separate into `X` and `y` variables (our ground truth labels will just be used for visualization).
2. Write a hypothesis on how many clusters will yield the best labeling.

In [None]:
iris = datasets.load_iris()
X = iris.data
y = iris.target

**Hypothesis**
> I hypothesize that 6 clusters will yield the best labeling.
>
> (Because the data exploration doesn't happen until after the hypothesis, I have not used knowledge of the data set to inform my hypothesis. Just curious why this would take place before looking at the data.)

### Problem 2: Data exploration (10 points)

This is the step where you would normally conduct any needed preprocessing, data wrangling, and investigation of the data.
<br>**Note:** `print(iris.DESCR)` prints the iris dataset description, provided you loaded it into a variable named `iris`

a. Using your skills from previous projects, provide code below to produce answers to the following questions (edit this cell with your answers): 

    1. How many features are provided?

    There are four features provided (not including the target feature). 

    2. How many total observations?
    
    There are 150 total observations in this dataset.

    3. How many different labels are included, what are they called, and is it a balanced dataset with the same number of observations for each class?
    
    There are three different labels in y. They are 'setosa', 'versicolor', and 'virginica'. There are fifty observations in each class, so it is a balanced dataset with one-third of the data falling under each of the three classes.
    
        
b. Create a 2D or 3D scatter plot of two or three of the features and use the y labels for color coding. Do not reduce the data or number of features in any way (you will do this by applying PCA in problem 5).

c. Since clusters can be influenced by the magnitudes of the variables, scale the feature data and plot a histogram of the transformed feature data (think about if you should use the min-max, standard scaler, or normalizer).

I looked at each column's histogram under standard, MinMax, and normalized scaling. The normalized data produces largely coherent, well-grouped histograms, so I am concerned that we would not be able to distinguish specific clusters. MinMax and standard scaling produce similar histograms. The difference that led me to my choice is in the histogram of petal width: MinMax divides one set of observations (the second group in the histogram) into two distinct groups, whereas standard scaling does not. Thus, I have chosen to use MinMax scaling.
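
One thing worth keeping in mind when comparing the three options: `MinMaxScaler` and `StandardScaler` rescale each feature (column), while `Normalizer` rescales each observation (row) to unit length. The sketch below illustrates this on a small made-up array; `toy` and the derived names are hypothetical and not part of the assignment data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Hypothetical toy data: 3 samples (rows), 2 features (columns) on different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

toy_minmax = MinMaxScaler().fit_transform(toy)  # each column mapped onto [0, 1]
toy_std = StandardScaler().fit_transform(toy)   # each column: mean 0, std 1
toy_norm = Normalizer().fit_transform(toy)      # each row: unit L2 norm

print("min-max column ranges:", toy_minmax.min(axis=0), toy_minmax.max(axis=0))
print("standard column means:", toy_std.mean(axis=0).round(6))
print("normalizer row norms: ", np.linalg.norm(toy_norm, axis=1))
```

Because the normalizer works per row, it changes the relationships between features within each observation, which is a different operation from putting the four features on a common scale.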


In [None]:
# a
print(iris.DESCR)
print(iris.feature_names)
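
The answers to part a can also be checked directly from the arrays instead of being read off the description. This is a minimal sketch, assuming the data is loaded with `datasets.load_iris()` as in Problem 1; the names `n_observations`, `n_features`, and `class_counts` are illustrative only.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Shape of the feature matrix: (observations, features)
n_observations, n_features = X.shape
print(f"{n_observations} observations, {n_features} features")

# Count how many observations fall under each class label
labels, counts = np.unique(y, return_counts=True)
class_counts = {str(iris.target_names[l]): int(c) for l, c in zip(labels, counts)}
print(class_counts)
```

If every count in `class_counts` is the same, the dataset is balanced.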


In [None]:
# b

# 2D scatterplot for sepal length and width
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Rows 0-49, 50-99, and 100-149 hold the three classes (slice ends are exclusive)
plt.plot(X[0:50,0], X[0:50,1], 'r.', X[50:100,0], X[50:100,1], 'b.', X[100:150,0], X[100:150,1], 'y.')
plt.title("Sepal Length vs. Sepal Width")
plt.xlabel("Sepal Length (cm)")
plt.ylabel("Sepal Width (cm)")
plt.show()

# 3D scatter plot for sepal width, petal length, and petal width 
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Slice ends are exclusive, so 0:50, 50:100, and 100:150 cover all 150 rows
ax.scatter(X[0:50,1], X[0:50,2], X[0:50,3], color = 'red')
ax.scatter(X[50:100,1], X[50:100,2], X[50:100,3], color = 'blue')
ax.scatter(X[100:150,1], X[100:150,2], X[100:150,3], color = 'yellow')
ax.set_xlabel("Sepal Width")
ax.set_ylabel("Petal Length")
ax.set_zlabel("Petal Width")
plt.show()



In [None]:
#c. Scale the data (think about if you should use the min-max, standard scaler, or normalizer)

col = 3  # index of the feature to plot (3 = petal width)

# Standard Scale
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss.fit(X)
X_scaled = ss.transform(X)
plt.hist(X_scaled[:,col])  # all 150 observations, not only the first class
plt.title("X_scaled histogram")
plt.show()

# Normalized
from sklearn.preprocessing import Normalizer
normalized = Normalizer()
normalized.fit(X)
X_norm = normalized.transform(X)
plt.hist(X_norm[:,col])
plt.title("X_norm histogram")
plt.show()

# MinMax
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X)
X_minmax = scaler.transform(X)
plt.hist(X_minmax[:,col])
plt.title("X_minmax histogram")
plt.show()

### Problem 3: Unsupervised Learning - Clustering (15 points)
Conduct clustering experiments with one of the algorithms discussed in class (e.g., k-means) for number of clusters k = 2-10. Create another 2D or 3D scatter plot utilizing the <b>cluster assignments</b> for color coding (this output can be a plot for each of the values of k or just one final plot using the value of k from your best Silhouette result obtained in Problem 4 below).  

#### Steps:
Repeat for each value of k (maybe a loop here would be appropriate):
1. Create model object
2. Train or fit the model
3. Predict cluster assignments
4. Calculate Silhouette width (see Problem 4)
5. Plot points color coded by class labels predicted by the model.

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

silhouette_vals = []

for k in range(2, 11):  # k = 2..10 inclusive
    # Declaring and fitting the model (fixed random_state so reruns give the same clusters)
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(X_minmax)
    prediction = km.predict(X_minmax)

    # Plotting the first two raw features, color coded by cluster assignment
    plt.scatter(X[:,0], X[:,1], c=prediction)
    plt.title(f"k = {k}")
    plt.show()

    # Silhouette Score
    silhouette_avg = silhouette_score(X_minmax, prediction)
    print(f"For k = {k} clusters, the average silhouette_score is: {silhouette_avg}")
    silhouette_vals.append(silhouette_avg)

### Problem 4: Evaluate results (20 points)

As we have discussed, validating an unsupervised problem is difficult. There is a metric that can be used to determine the density or separation of cluster assignments, called Silhouette width. In this step, perform analysis of results using the above `k = 2-10` and compute the Silhouette width (Hint: possibly you can just add code to your loop in problem 3 and store the results in a list of values). 

Scikit Learn has a great example for Silhouette analysis [here](http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html)

1. For each k (k = 2-10), what are the Silhouette width values?
 
 See the printed statements in Problem 3 and the plot below. 
 

2. Discuss if your best number of clusters (highest Silhouette width value) matches your hypothesis from Problem 1.

 For k=2, we have the highest silhouette value at 0.63. This does not match my hypothesis that k=6 would be the ideal number of clusters. 

In [None]:
# Build the x-axis from the stored results so it always matches the loop in Problem 3
k_values = np.arange(2, 2 + len(silhouette_vals))
plt.plot(k_values, silhouette_vals, 'o-')
plt.title("Silhouette Values for k-many clusters")
plt.xlabel("k")
plt.ylabel("Silhouette Value")
plt.show()

best_k = k_values[np.argmax(silhouette_vals)]
print(f"The highest silhouette value is {max(silhouette_vals)} at k = {best_k}")
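
To make the metric itself less of a black box, here is a small sketch of how the silhouette width of a single sample can be computed by hand: `s(i) = (b - a) / max(a, b)`, where `a` is the mean distance to the other points in the sample's own cluster and `b` is the smallest mean distance to the points of any other cluster. The data, labels, and the helper `silhouette_of` are made up for illustration, and the helper assumes every cluster has at least two points. The result is compared against scikit-learn's `silhouette_samples`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Hypothetical toy data: two tight 1D clusters
points = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_of(i, points, labels):
    """Silhouette width s(i) = (b - a) / max(a, b) for one sample."""
    dists = np.linalg.norm(points - points[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the sample itself from its own cluster
    a = dists[same].mean()  # mean distance to the rest of its own cluster
    # b: smallest mean distance to the members of any other cluster
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of(i, points, labels) for i in range(len(points))])
print("by hand:     ", manual)
print("scikit-learn:", silhouette_samples(points, labels))
```

`silhouette_score`, as used in Problem 3, is the mean of these per-sample values, so a score near 1 means most points sit much closer to their own cluster than to the nearest other one.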

### Problem 5 (15 points): Principal Component Analysis (PCA)
PCA is the most popular form of dimensionality reduction. It rotates and transforms the data into a new subspace such that the resulting matrix has:
- Most relevance (variation) now associated with first feature
- Second feature gets the next most, etc.
#### Steps:
    1. Reduce the feature data (X) using PCA
    2. Repeat the same experiment from problem 3 above (remember your plots are now the 1st, 2nd, and possibly 3rd principal component vs. the raw feature data like before).
    3. Compare and contrast results to those from previous/non-PCA problems; does it perform better/worse/same? Provide discussion below (this could vary, depending on setup).
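
Before choosing how many components to keep, it can help to look at how much of the total variance each principal component carries. The following is a minimal sketch, assuming the same iris data as above; `pca_full` and `ratios` are illustrative names, and `explained_variance_ratio_` is the fitted attribute scikit-learn's `PCA` exposes for this.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data

# Keep all components so the full variance breakdown is visible
pca_full = PCA()
pca_full.fit(X)

ratios = pca_full.explained_variance_ratio_
for i, r in enumerate(ratios, start=1):
    print(f"PC {i}: {r:.3f} of total variance")
print(f"First two components together: {ratios[:2].sum():.3f}")
```

The ratios come out sorted from largest to smallest, which is the "most variation in the first feature, next most in the second" property described above.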

In [None]:
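# Clustering with PCA
from sklearn.decomposition import PCA
# Note: PCA is fit on the unscaled X here, whereas Problem 3 clustered X_minmax
pca = PCA(n_components=2)
pca.fit(X)
data_pca = pca.transform(X)

silhouette_vals_pca = []
k_values_pca = range(2, 11)  # k = 2..10 inclusive

for k in k_values_pca:
    # Declaring and fitting the model (fixed random_state so reruns give the same clusters)
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(data_pca)
    prediction = km.predict(data_pca)

    # Plotting the first two principal components, color coded by cluster assignment
    plt.scatter(data_pca[:,0], data_pca[:,1], c=prediction)
    plt.title(f"k = {k}")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()

    # Silhouette Score
    silhouette_avg = silhouette_score(data_pca, prediction)
    print(f"For k = {k} clusters, the average silhouette_score is: {silhouette_avg}")
    silhouette_vals_pca.append(silhouette_avg)

best_k_pca = k_values_pca[int(np.argmax(silhouette_vals_pca))]
print(f"The highest silhouette value is {max(silhouette_vals_pca)} at k = {best_k_pca}")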
    


    

**Discuss new results**
> We can see that the silhouette value for two clusters has improved by 0.075! That's pretty good considering that we only kept the first two principal components. Note that PCA was fit on the unscaled features while Problem 3 clustered MinMax-scaled features, so part of the difference may come from scaling rather than from PCA itself. Both approaches picked two clusters as ideal, so we can be fairly confident in that regard. 
>

## You Finished! Treat yourself by taking this questionnaire
### Questionnaire
1) How long did you spend on this assignment?
<br> Like three and a half hours.<br>
2) What did you like about it? What did you not like about it?
<br> I liked this a lot. I feel like it is super interesting (though I am concerned that it may seem unmotivated to my peers). I especially liked that we explored the silhouette analysis. I really like having something that I can use to make a model decision. <br>
3) Did you find any errors or is there anything you would like changed?
<br>Did not find anything of note.<br>