## Unsupervised Learning: Exploring Clustering and Dimensionality Reduction
<img src="./materials/logo_cross.png" style="height:100px"> 


Welcome to the Hack GT/DSGT collaboration! Our goal at Data Science at Georgia Tech is to teach you about Data Science and Machine Learning in a way that is approachable and useful. We mix theory and hands-on coding -- because it's cool when you can do stuff with your own hands. 

This notebook accompanies the Unsupervised Learning workshop. Use this notebook to follow along!  

Instructions for People New to Notebooks:
- **To run a cell, click on it and press Shift+Enter.**  

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import pandas as pd

What are these packages? 
- Pandas: Easy data manipulation. Turns data into a spreadsheet-like format (dataframes)  
- Numpy: For working with arrays. Useful for efficient mathematical operations  
- Matplotlib: To quickly create visualizations  
- Sklearn: Helps quickly instantiate and train ML models     

These are super well documented! If you ever have a question, just google!

In [None]:
# Importing fancy functions is easy!
from sklearn import datasets
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import scale, StandardScaler
from sklearn.metrics import adjusted_rand_score, silhouette_score

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

In [None]:
# Silence library warnings to keep the workshop output readable.
# (Outside a workshop, read them: they often flag deprecated or risky usage.)
import warnings
warnings.filterwarnings('ignore')

# Customer Clustering
<img src="./materials/debit-card.png" style="height:50px">  

## We will first look at kMeans and Agglomerative Clustering

We will use a shopping example. We want to cluster users for a recommendation system.   
We need to find groups that may behave similarly. Let's see what data we have  

In [None]:
data = pd.read_csv('./shopping_data.csv')
data.head()

In [None]:
X = data.iloc[:, [2, 3, 4]].values  # feature columns; the last two are plotted below as Annual Income and Spending Score
y = data.iloc[:, [1]].values
print(data.shape)

## KMeans  
KMeans is a common initial approach for clustering  
Here we see the main 3 steps  
1) Import what you need  
2) Instantiate your model  
3) Fit your model   
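
To make the "fit" step less of a black box, here is a minimal NumPy sketch of the basic k-means loop (assign each point to its nearest center, then move each center to the mean of its points). It is a simplification for intuition only: the function name `kmeans_sketch` and the toy data are made up for this example, and scikit-learn's `KMeans` adds smarter initialization and multiple restarts on top of this idea.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Simplified k-means loop: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so we are done
        centers = new_centers
    return labels, centers

# Two well-separated toy blobs (made-up data for illustration).
rng = np.random.default_rng(42)
toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
toy_labels, toy_centers = kmeans_sketch(toy, k=2)
print(toy_centers)
```

On data this cleanly separated, the two centers should land near the middle of each blob.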

In [None]:
from sklearn.cluster import KMeans

In [None]:
# Standard flow: create an instance, fit the model, use it to predict.
kmeans = KMeans(n_clusters=5)
labels = kmeans.fit_predict(X)  # fits the model and returns one cluster label per row
print('prediction and fitting done')

Note: How did we come up with the number of clusters?
    
Answer: You wouldn't really know. The main ways to guess are   
        1) subject matter expertise   
        2) knowing the number of classes upfront (we don't)  
        3) using another algorithm to inform us (we use hierarchical clustering for this next!)
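
One more data-driven way to guess, not covered in the list above, is to try several values of k and score each clustering. The sketch below uses the silhouette score (already imported at the top of this notebook) on made-up blob data; `X_demo` and the range of k values are assumptions for illustration, not the shopping data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in data with 4 well-separated groups (made up for illustration).
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.8,
    random_state=0,
)

# Fit one model per candidate k and record its silhouette score
# (closer to 1 means tighter, better separated clusters).
scores = {}
for k in range(2, 9):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels_k)

best_k = max(scores, key=scores.get)
print(scores)
print('best k by silhouette:', best_k)
```

Real data is rarely this clean, so treat the highest score as a hint rather than the answer.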

In [None]:
plt.figure(figsize=(7, 5))
plt.scatter(X[:,1],X[:,2]) # Which features are these? 
plt.xlabel('Annual Income', fontsize=15)
plt.ylabel('Spending Score', fontsize=15)

In [None]:
plt.figure(figsize=(7,5))
plt.scatter(X[:,1],X[:,2], c=kmeans.labels_, cmap='rainbow')
plt.xlabel('Annual Income', fontsize=15), plt.ylabel('Spending Score', fontsize=15)

##### We can use the model to predict the cluster of a novel input (here we simply reuse `X`)

In [None]:
y_pred = kmeans.predict(X)
print(y_pred)

But wait. What do these classes mean? Think about it...
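
One way to approach that question: a cluster label is only an index, and the cluster's centroid is what gives it meaning. The sketch below uses made-up (income, spending score) pairs rather than the shopping data, predicts the cluster of a customer the model has never seen, and then looks up that cluster's centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (income, spending score) pairs standing in for the shopping data.
rng = np.random.default_rng(0)
low_spenders = rng.normal([30, 20], 3, size=(40, 2))
high_spenders = rng.normal([80, 85], 3, size=(40, 2))
X_demo = np.vstack([low_spenders, high_spenders])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

# A genuinely new customer, not part of the training data.
new_customer = np.array([[78, 90]])
new_label = model.predict(new_customer)[0]

# The label itself is arbitrary; the matching centroid describes the group.
print('cluster id:', new_label)
print('centroid of that cluster:', model.cluster_centers_[new_label])
```

Reading the centroid ("high income, high spending") is what lets you name the cluster.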

## Agglomerative Clustering  
We can derive clusters and insight into the data through dendrograms

In [None]:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

In [None]:
# instantiate and fit the model
clustering = AgglomerativeClustering(linkage='ward', n_clusters=5) 
clustering.fit(X)

In [None]:
# Make the dendrogram. This takes a bit more code, but the picture makes the cluster structure easier to see
lnk_matrx = linkage(X, 'ward')  # other options: 'complete', 'average', 'single'
figure = plt.figure(figsize=(7.5, 5))

dendrogram(lnk_matrx, color_threshold=220,
           truncate_mode= "lastp", p =len(X), leaf_font_size=3) #make the dendrogram and fix aesthetics

plt.title('Agglomerative Clustering Dendrogram (Ward)'), plt.xlabel('Sample Index'), plt.ylabel('Distance')
plt.tight_layout()
plt.show()
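
A dendrogram is a picture; to get actual cluster labels out of it you "cut" the tree. The sketch below shows one way to do that with SciPy's `fcluster` on made-up blob data (`X_demo` is an assumption, not the shopping data) and checks how closely the result agrees with scikit-learn's `AgglomerativeClustering` using the adjusted Rand score.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three well-separated toy groups (made up for illustration).
X_demo, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 0], [5, 10]],
                       cluster_std=0.7, random_state=1)

# Build the same kind of Ward linkage matrix a dendrogram is drawn from.
Z = linkage(X_demo, 'ward')

# Cut the tree so that at most 3 flat clusters remain.
flat_labels = fcluster(Z, t=3, criterion='maxclust')

# Compare with scikit-learn's agglomerative model (1.0 = identical grouping).
sk_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X_demo)
ari = adjusted_rand_score(sk_labels, flat_labels)

print('clusters found:', np.unique(flat_labels))
print('agreement (ARI):', ari)
```

The adjusted Rand score ignores how the clusters are numbered, which matters here because SciPy labels start at 1 and scikit-learn labels start at 0.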

## Whoa! We were able to find clusters that describe distinct customer groups. 
However, consider a more complex case. What if you have tens or even hundreds of features? How do you make this data understandable to a human? How do you visualize it? How do you keep from overwhelming your model? 

For supervised tasks, often a good first approach is a correlation plot or forward feature selection (see slides for more detail here!). Here we can see which features to pluck when training our model

<img src="./materials/dim_red.png" style="height:300px"> 


## But what about more complex clustering tasks?
### Think of an image! How many features do we have?  
16x16 pixels is 256 features! And what is 256x256...??? 
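
To see where those numbers come from, here is a small sketch that flattens a placeholder image into a feature vector and counts features for a few image sizes. The blank image is made up; for reference, the images in scikit-learn's `load_digits` are 8x8, so each digit has 64 features.

```python
import numpy as np

# A blank 16x16 grayscale "image" (placeholder pixel values).
image = np.zeros((16, 16))

# Flattening turns the 2-D grid into one long feature vector.
features = image.reshape(-1)
print(features.shape)   # (256,)

# The feature count grows with the square of the side length.
for side in (8, 16, 256):
    print(side, 'x', side, '->', side * side, 'features')
```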

<img src="./materials/digits.png" style="height:100px"> 


In [None]:
digits = datasets.load_digits()

In [None]:
# Import. Instantiate. Fit
from sklearn import decomposition

X = digits.data
y = digits.target

pca = decomposition.PCA(n_components=2)
X_reduced = pca.fit_transform(X)

#### Quiz: Does this heatmap mean anything to anybody?

In [None]:
df = pd.DataFrame(digits.data)
plt.matshow(df.corr())

### We talked about PCA in the workshop slides.   
But a quick reminder -- PCA does not just select valuable features. PCA defines new features that are linear combinations of the old features. These new features form new axes that best explain the variation in your data 
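
To make "linear combinations of old features" concrete, here is a minimal NumPy sketch of PCA via the singular value decomposition on made-up data. It is meant for intuition, not as a replacement for `decomposition.PCA`; the toy data and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 points, 3 features, where feature 2 mostly repeats feature 0.
base = rng.normal(size=(200, 2))
X_demo = np.column_stack([base[:, 0], base[:, 1],
                          base[:, 0] + 0.1 * rng.normal(size=200)])

# 1) Center each feature.
X_centered = X_demo - X_demo.mean(axis=0)

# 2) SVD: the rows of Vt are the principal directions (the new axes).
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3) Each new feature is a weighted sum of the old ones.
X_proj = X_centered @ Vt[:2].T

# Share of the total variance captured by each new axis.
explained = S**2 / np.sum(S**2)
print('weights defining PC1:', Vt[0])
print('variance explained per component:', explained)
```

Because one feature is nearly redundant here, two components should capture almost all of the variance.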

In [None]:
plt.figure(figsize=(10,7))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, 
            edgecolor='none', s=50,
            cmap=plt.get_cmap('tab10', 10))
plt.colorbar()
plt.title('PCA Projection', fontsize=20);
plt.xlabel('PC1', fontsize=20), plt.ylabel('PC2', fontsize=20)

#### Discussion:
Look at this chart. Is it human interpretable?   
Is it machine interpretable, e.g., do you think a classifier might be able to use this?  
How well can we actually tell whether a model could use this?  
Think: how many features did we have? How many features are we looking at? 
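
One rough way to put numbers on these questions is sketched below: check how much variance the two components keep, then run the same classifier on all the pixels and on the 2-D projection. The choice of a k-nearest-neighbors classifier and 5-fold cross-validation is an assumption made for illustration, not something from the workshop.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

digits_demo = load_digits()
X_full, y_true = digits_demo.data, digits_demo.target

# PCA is fit on all rows here for brevity; a stricter setup
# would fit it inside each cross-validation fold.
pca_demo = PCA(n_components=2)
X_2d = pca_demo.fit_transform(X_full)

# Share of the total variance that survives in the 2-D view.
kept = pca_demo.explained_variance_ratio_.sum()

# Same classifier, once on all pixels and once on the 2 components.
knn = KNeighborsClassifier(n_neighbors=5)
acc_full = cross_val_score(knn, X_full, y_true, cv=5).mean()
acc_2d = cross_val_score(knn, X_2d, y_true, cv=5).mean()

print(X_full.shape[1], 'features ->', X_2d.shape[1])
print('variance kept:', kept)
print('accuracy, all features:', acc_full)
print('accuracy, 2 components:', acc_2d)
```

The gap between the two accuracies is a hint at how much useful signal the 2-D picture throws away.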

#### Just for an example of finding different solutions to a given task... Let's see if t-SNE works any better!  

In one high-level sentence: t-SNE is an iterative approach that finds a nonlinear mapping (transformation) from high- to low-dimensional space. t-SNE tries to keep points that are neighbors in the high-dimensional space neighbors in the low-dimensional space. It does this by computing probability distributions over neighbors in the high-dimensional space and trying to emulate them as closely as possible in the low-dimensional representation.
Great resource --> 
https://mlexplained.com/2018/09/14/paper-dissected-visualizing-data-using-t-sne-explained/
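
To give a feel for the "probability distributions over neighbors" idea, here is a simplified NumPy sketch of the high-dimensional half of t-SNE. It assumes one fixed bandwidth `sigma` for every point; the real algorithm tunes a separate bandwidth per point from the perplexity setting and then optimizes the low-dimensional layout, which this sketch does not do. The function name and the three points are made up.

```python
import numpy as np

def conditional_probs(X, sigma=1.0):
    """Row i holds p(j | i): how likely point i picks point j as a neighbor."""
    # Squared Euclidean distance between every pair of points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    # Gaussian affinity: nearby points get values close to 1, far ones near 0.
    affinities = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(affinities, 0.0)  # a point is not its own neighbor
    # Normalize each row so it sums to 1 (a probability distribution).
    return affinities / affinities.sum(axis=1, keepdims=True)

# Three made-up points: 0 and 1 are close together, 2 is far away.
pts = np.array([[0.0, 0.0], [0.5, 0.0], [3.0, 0.0]])
P = conditional_probs(pts)
print(P.round(3))
```

Point 0 assigns almost all of its probability to point 1, its close neighbor, which is exactly the relationship t-SNE then tries to preserve in 2-D.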

In [None]:
# Import, instantiate, fit
from sklearn.manifold import TSNE
tsne = TSNE(random_state=13)
X_tsne = tsne.fit_transform(X)

plt.figure(figsize=(10,7))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, 
            edgecolor='none', s=50,
            cmap=plt.get_cmap('tab10', 10))
plt.colorbar()
plt.title('t-SNE', fontsize=20);

###### Out of these two techniques, which would you prefer for this dataset? Why? 
Think in terms of interpretability and efficiency.
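
One practical difference worth weighing, sketched below on made-up random data: a fitted PCA is a fixed linear map, so it can project rows it has never seen, while scikit-learn's `TSNE` only offers `fit` and `fit_transform` and has no `transform` method for new rows.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))   # made-up training data
X_new = rng.normal(size=(5, 10))       # made-up unseen rows

# PCA learns a fixed linear map, so unseen rows can be projected later.
pca_demo = PCA(n_components=2).fit(X_train)
print(pca_demo.transform(X_new).shape)

# TSNE embeds only the data it was fit on.
print(hasattr(TSNE(), 'transform'))
```

If you need to embed new data later (for example, new customers or new images), that difference may matter more than how pretty the plot is.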

## Fun Conclusion Activity
Whoa we covered a lot. We probably didn't even get to finish. We all like feeling good about ourselves, so:  
Brainstorm a list of what you learned about. Facts? Concepts? Approaches?   
**(Double click on this to change in markdown)** 
- 
- 
- 
- 
- 

## Thank you for your time!