__Agenda__

1. Introduction to unsupervised learning

2. Clustering

3. Kmeans algorithm details

4. Implementation of kmeans with sklearn

5. How to choose number of clusters: Silhouette & Calinski-Harabasz score

6. Challenge

7. An interesting application of the kmeans algorithm with image processing.

8. Summary

# Unsupervised Learning

- Association Rules

- Cluster Analysis

- Principal Components, Curves and Surfaces

- Independent Component Analysis

- Multidimensional Scaling

- Non-linear Dimension Reduction

<img src="img/map_of_ml.png" width="650" height="650">

[Img source](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)


## Clustering

In a clustering problem, you want to discover the inherent groupings in the data without using any labels.

## K-Means  Algorithm


<img src="img/kmeans.png" width="650" height="650">


[Let's see kmeans in action](https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68)


This notebook is motivated by the [k-means chapter of the Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html).
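
K-means alternates between two steps: assign every point to its nearest center, then move every center to the mean of the points assigned to it. The sketch below shows that loop in plain NumPy. It is only an illustration and not scikit-learn's implementation: the function name `lloyd_kmeans`, the random initialization, and the toy data are assumptions made for this example.

```python
import numpy as np


def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch (Lloyd's algorithm); illustrative only."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Toy data: five points at (0, 0) and five points at (10, 10).
demo_X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
demo_labels, demo_centers = lloyd_kmeans(demo_X, k=2)
print(demo_labels)
print(demo_centers)
```

In practice `sklearn.cluster.KMeans` is the better choice, since it adds a smarter initialization and several restarts.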

In [None]:
from sklearn.datasets import make_blobs
from sklearn.datasets import make_moons

from sklearn.cluster import KMeans

import matplotlib.pyplot as plt
import numpy as np
import pickle
%matplotlib inline

In [None]:
np.random.seed(110119)

X, y = make_blobs(n_samples=700, n_features=2, centers=4, cluster_std=.5)

In [None]:
# can you plot this dataset

plt.scatter(X[:, 0], X[:, 1], c=y, s=25)

In [None]:
# let's instantiate kmeans algorithm
# don't forget to check its parameters
k_means = KMeans(n_clusters=4)

# don't forget to fit the model!
k_means.fit(X)

# we make a prediction for each point
y_hat = k_means.predict(X)

# we can access the coordinates of the cluster centers via the cluster_centers_ attribute
cl_centers = k_means.cluster_centers_

# note that the colors are different - Is this a problem?
plt.scatter(X[:, 0], X[:, 1], c=y_hat, s=25)


# also let's mark the cluster centers too.
plt.scatter(cl_centers[:, 0], cl_centers[:, 1], c='black', s=100)
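
On the question of the differing colors: cluster ids are arbitrary, so the same grouping can come back with its labels permuted. The sketch below, which uses its own small synthetic dataset (the names `X_demo`, `y_demo`, and the chosen centers are assumptions for this example), compares a naive element-wise match with the adjusted Rand index, a score that ignores how the clusters are named.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three well-separated blobs with known generating labels.
X_demo, y_demo = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.5,
    random_state=0,
)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

# Element-wise agreement depends on how the arbitrary ids happen to line up.
raw_agreement = np.mean(labels_demo == y_demo)
# The adjusted Rand index compares the groupings, not the id values.
ari = adjusted_rand_score(y_demo, labels_demo)
print(f"raw agreement: {raw_agreement:.2f}, adjusted Rand index: {ari:.2f}")
```

A low raw agreement together with a high adjusted Rand index means the clusters are right and only the label names differ.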

__Your Turn__

- Guess how many clusters there are in the figure below.

- Use kmeans to find clusters.

In [None]:
# load the pickled dataset; the context manager closes the file for us
with open('blobs_1.obj', 'rb') as dbfile:
    data = pickle.load(dbfile)

X = data[0]

# can you plot this dataset

plt.scatter(X[:, 0], X[:, 1], s=25)

__Compare your results with the actual values below.__

- Are they close to the actual values?

- What might go wrong?



In [None]:
# let's play with cluster_std and try to find number of clusters

np.random.seed(110119)

X, y = make_blobs(n_samples=700, n_features=2, centers=np.array([[-10, -20],
                                                                 [-5, -15],
                                                                 [-2, -9],
                                                                 [5, 12],
                                                                 [7, 17]
                                                                 ]), cluster_std=5)

In [None]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=25)

Q: How do we find the optimal value of K?

[Metrics](https://scikit-learn.org/stable/modules/clustering.html#k-mean)

[Calinski_Harabasz](https://scikit-learn.org/stable/modules/clustering.html#calinski-harabasz-index)

[Silhouette Coefficient](https://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient)
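
Both scores are also available directly in `sklearn.metrics`, so one way to compare values of K without any extra library is to loop over candidates and score each fit. This is a sketch on a synthetic dataset; the names `X_demo`, `scores`, and `best_k`, and the chosen blob centers, are assumptions for the example. For both scores, higher is better.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Three well-separated blobs, so the "right" answer is known in advance.
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = (
        silhouette_score(X_demo, labels),
        calinski_harabasz_score(X_demo, labels),
    )
    print(f"k={k}: silhouette={scores[k][0]:.3f}, calinski_harabasz={scores[k][1]:.1f}")

# Pick the k with the highest silhouette score.
best_k = max(scores, key=lambda k: scores[k][0])
print("best k by silhouette:", best_k)
```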

In [None]:
# !pip install yellowbrick

In [None]:
from yellowbrick.cluster import KElbowVisualizer

# Instantiate the clustering model and visualizer
model = KMeans()

visualizer = KElbowVisualizer(
    model, k=(2, 10), metric='calinski_harabasz', timings=False)

visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure

In [None]:
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model,
                              k=(2, 10),
                              metric='silhouette',
                              timings=False,
                              locate_elbow=True)

visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure

[Yellowbrick API](https://www.scikit-yb.org/en/latest/api/cluster/elbow.html)

## Hierarchical Clustering in action

**[This post](https://www.analyticsvidhya.com/blog/2019/05/beginners-guide-hierarchical-clustering/)** walks through cluster assignment step by step, in case a demo would be helpful.

Here, we will do it in _**scipy**_ and _**sklearn**_.

### Hierarchical clustering with `scipy`

In [None]:
# let's generate some data and look at an example of hierarchical agglomerative clustering
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# generate two clusters, a and b, with 50 points each:
np.random.seed(1000)
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[50, ])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50, ])
X = np.concatenate((a, b),)
print(X.shape)  # 100 samples with 2 dimensions
plt.scatter(X[:, 0], X[:, 1])
plt.title("Sample data for clustering demo")

In [None]:
# construct dendrogram in scipy
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X, 'single')

In [None]:
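from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

# cophenetic correlation: how faithfully the dendrogram preserves the
# original pairwise distances (closer to 1 is better)
c, coph_dists = cophenet(Z, pdist(X))
c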
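
The cophenetic correlation can also be used to compare linkage methods on the same data. The sketch below generates its own two-cluster sample rather than reusing `X` and `Z`; the names `X_demo`, `pairwise`, and `coph_corr` and the list of methods tried are assumptions for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# Two Gaussian clusters of 50 points each.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal([10, 0], 1.5, size=(50, 2)),
    rng.normal([0, 20], 1.5, size=(50, 2)),
])

# Condensed matrix of the original pairwise distances.
pairwise = pdist(X_demo)

coph_corr = {}
for method in ("single", "complete", "average", "ward"):
    Z_demo = linkage(X_demo, method)
    # cophenet returns the correlation and the cophenetic distances.
    coph_corr[method], _ = cophenet(Z_demo, pairwise)
    print(f"{method:>8}: {coph_corr[method]:.3f}")
```

A higher value means the merge heights in the dendrogram track the original distances more closely, but it does not by itself say which linkage gives the most useful clusters.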

In [None]:
# plot the full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Z,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.,  # font size for the x axis labels
)
plt.show()

In [None]:
# trimming and truncating the dendrogram
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Z,
    truncate_mode='lastp',  # truncate the tree to the last p merged clusters
    p=12,  # number of merged clusters to show
    show_leaf_counts=False,  # otherwise numbers in brackets are counts
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,  # to get a distribution impression in truncated branches
)
plt.show()

# from the documentation of "lastp":
# The last p non-singleton clusters formed in the linkage are the only non-leaf nodes in the linkage;
# they correspond to rows Z[n-p-2:end] in Z. All other non-singleton clusters are contracted into leaf nodes.

### Hierarchical clustering with `sklearn` on Iris (because it's there)

**[Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html)** for AgglomerativeClustering in `sklearn`


**[A great example of using manhattan distance](https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_clustering_metrics.html#sphx-glr-auto-examples-cluster-plot-agglomerative-clustering-metrics-py)** with agglomerative clustering in `sklearn`.

In [None]:
# we can also use scikit-learn's AgglomerativeClustering to perform the same task
from sklearn.datasets import make_blobs
from sklearn.datasets import make_moons
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import KernelDensity
np.random.seed(2000)

In [None]:
# try clustering on the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
# in this case, we won't be working with predicting labels, so we will only use the features (X)
X_iris = iris.data
y_iris = iris.target

In [None]:
# plot two of the features without the true labels (pass c=y_iris to color by species)
plt.scatter(X_iris[:, 0], X_iris[:, 2])

In [None]:
# fit agglomerative clustering with 3 clusters and color the points by predicted cluster
iris_cluster = AgglomerativeClustering(n_clusters=3)
pred_iris_clust = iris_cluster.fit_predict(X_iris)
plt.scatter(X_iris[:, 0], X_iris[:, 2], c=pred_iris_clust, s=10)

In [None]:
# compare it to the actual truth
plt.scatter(X_iris[:, 0], X_iris[:, 2], c=y_iris)

#### Evaluate

To evaluate, you might try different numbers of clusters and compare their silhouette scores, as you did with k-means.

In [None]:
# evaluation - silhouette score
from sklearn.metrics import silhouette_score
silhouette_score(X_iris, pred_iris_clust)
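
The comparison across different numbers of clusters can be sketched as a short loop. The code reloads Iris so that it runs on its own; the names `X_demo` and `sil_by_k` and the range of cluster counts are assumptions for the example. Note that the count with the best silhouette score need not match the three species in the dataset.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X_demo = load_iris().data

sil_by_k = {}
for k in range(2, 7):
    # Default settings: Ward linkage on Euclidean distances.
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X_demo)
    sil_by_k[k] = silhouette_score(X_demo, labels)
    print(f"n_clusters={k}: silhouette={sil_by_k[k]:.3f}")
```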

### Evaluating number of clusters / Cut points
For hierarchical agglomerative clustering, as for clustering in general, it is difficult to evaluate the results objectively because there are no ground-truth labels. It is therefore up to you, the data scientist, to decide where to cut.

**[Stanford has a good explanation on page 380](https://nlp.stanford.edu/IR-book/pdf/17hier.pdf)** of your options for picking the cut-off.

When viewing a dendrogram for hierarchical agglomerative clustering, we can visually look for a natural cutoff, even though this is not a strictly statistical procedure. We might also interpret the clusters and assign meanings to them based on domain knowledge and the shape of the dendrogram. In addition, we can evaluate the quality of the clusters with measures such as the silhouette score discussed in the k-means lectures.
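
Once a cutoff has been chosen, `scipy.cluster.hierarchy.fcluster` turns the linkage matrix into flat cluster labels. The sketch below shows two ways to cut: by a target number of clusters and by a height on the dendrogram. The data, the names `X_demo`, `Z_demo`, and `cut_height`, and the choice of Ward linkage are assumptions for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two Gaussian clusters of 50 points each.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal([10, 0], 1.5, size=(50, 2)),
    rng.normal([0, 20], 1.5, size=(50, 2)),
])
Z_demo = linkage(X_demo, "ward")

# Option 1: ask for at most a fixed number of flat clusters.
labels_k = fcluster(Z_demo, t=2, criterion="maxclust")

# Option 2: cut at a height, here halfway between the last two merges
# (column 2 of the linkage matrix holds the merge distances).
cut_height = (Z_demo[-2, 2] + Z_demo[-1, 2]) / 2
labels_d = fcluster(Z_demo, t=cut_height, criterion="distance")

print(np.unique(labels_k, return_counts=True))
print(np.unique(labels_d, return_counts=True))
```

Cutting by height corresponds to drawing a horizontal line across the dendrogram, while `maxclust` is convenient when the number of clusters is already decided.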



## Advantages & Disadvantages of hierarchical clustering

#### Advantages
- Intuitive and easy to implement
- More informative than k-means because it takes the relationships between individual points into consideration
- Allows us to look at dendrogram and decide number of clusters

#### Disadvantages
- Very sensitive to outliers
- Cannot undo a previous merge, which might lead to problems later on


### Further reading

- [from MIT on just hierarchical](http://web.mit.edu/6.S097/www/resources/Hierarchical.pdf)
- [from MIT comparing clustering methods](http://www.mit.edu/~9.54/fall14/slides/Class13.pdf)
- [fun CMU slides on clustering](http://www.cs.cmu.edu/afs/andrew/course/15/381-f08/www/lectures/clustering.pdf)

### Find those clusters!!! 

In [None]:
shop = pd.read_excel('Online Retail.xlsx')