In [None]:
# The standard start of our notebooks
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Clustering

## Data

In [None]:
# Load the data, which is included in sklearn.
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target).replace(to_replace=dict(enumerate(iris.target_names)))

In [None]:
X.head()

In [None]:
y.head()

## KMeans

Use KMeans to segment the iris data into two clusters.
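One possible way to fill in the cell below is sketched here. It reloads the data so that it runs on its own; `n_init=10` and `random_state=0` are illustrative choices made only for repeatability, not values prescribed by this notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# n_init and random_state are assumptions, fixed so repeated runs agree
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit_predict returns one integer cluster label per row of X
labels = kmeans.fit_predict(X)

# Cluster sizes and the centroid of each cluster in feature space
print(pd.Series(labels).value_counts())
print(pd.DataFrame(kmeans.cluster_centers_, columns=X.columns))
```

The resulting `labels` array is what the scatter-plot cell further down expects for its `c=` argument.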

In [None]:
from sklearn.cluster import KMeans



Plot every pairwise 2D projection of the features to see whether the clustering "makes sense".

In [None]:
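import matplotlib.cm as cm

# One scatter plot per ordered pair of distinct features, colored by cluster.
# `labels` must already hold the KMeans cluster assignments from the exercise above.
fig, axes = plt.subplots(nrows=X.shape[1], ncols=X.shape[1], sharex=False, sharey=False, figsize=(16, 16))

cmap = cm.jet
for i, f1 in enumerate(X.columns):
    for j, f2 in enumerate(X.columns):
        # Skip the diagonal: a feature plotted against itself carries no information
        if f1 != f2:
            X.plot(kind='scatter', x=f1, y=f2, c=labels, cmap=cmap, ax=axes[i, j])

plt.show()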

Compare the clusters to the actual labels.
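A contingency table is one simple way to make this comparison. The sketch below refits a two-cluster KMeans so that it is self-contained; the `adjusted_rand_score` summary is an addition not asked for by the exercise, and `n_init` and `random_state` are illustrative assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target).map(dict(enumerate(iris.target_names)))

# Illustrative settings; any fixed seed would do
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Rows are true species, columns are cluster ids; each cell counts flowers
table = pd.crosstab(y, labels, rownames=['species'], colnames=['cluster'])
print(table)

# Adjusted Rand index: 1.0 means identical partitions, about 0 means chance agreement
ari = adjusted_rand_score(y, labels)
print(f"adjusted Rand index: {ari:.3f}")
```

Cluster ids are arbitrary, so only the pattern of counts matters, not which column a species lands in.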

Repeat the above for three or more clusters.

## Pick number of clusters using scree plot

We would like a more data-driven approach to choosing the number of clusters, especially when we do not have any true labels.

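### Exercise
(a) Plot k vs. RSS (the within-cluster residual sum of squares, which scikit-learn's `KMeans` exposes as `inertia_`) for k between 1 and 10.

(b) See how easily you can add a `StandardScaler()` step before KMeans. (That is, standardize each column to zero mean and unit variance.)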
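A minimal sketch of both parts follows, assuming the RSS is read from the fitted model's `inertia_` attribute. The variable names (`rss_raw`, `rss_scaled`) and the `n_init` and `random_state` settings are illustrative, not part of the exercise.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

ks = list(range(1, 11))
rss_raw, rss_scaled = [], []
for k in ks:
    # (a) KMeans on the raw columns
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    rss_raw.append(km.inertia_)

    # (b) the same model with a standardization step in front
    pipe = Pipeline(steps=[
        ('scale', StandardScaler()),
        ('kmeans', KMeans(n_clusters=k, n_init=10, random_state=0)),
    ])
    pipe.fit(X)
    rss_scaled.append(pipe.named_steps['kmeans'].inertia_)

fig, axes = plt.subplots(ncols=2, figsize=(12, 4))
axes[0].plot(ks, rss_raw, marker='o')
axes[0].set(title='Raw features', xlabel='k', ylabel='RSS')
axes[1].plot(ks, rss_scaled, marker='o')
axes[1].set(title='Standardized features', xlabel='k', ylabel='RSS')
plt.show()
```

RSS always falls as k grows, so the quantity of interest is the "elbow" where the curve flattens rather than its minimum. The two panels are not comparable in absolute terms, because scaling changes the units of distance.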

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler




## Silhouette Scores

Silhouette scores are often a more informative measure of cluster 'goodness' than RSS, since they take into account both how tightly packed each cluster is and the distance _between_ clusters.
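For a single sample, the silhouette coefficient is $(b - a) / \max(a, b)$, where $a$ is the mean distance to the other points in its own cluster and $b$ is the mean distance to the points in the nearest other cluster. The sketch below computes the average score for a few values of k on standardized features; the range of k and the KMeans settings are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    # Mean silhouette coefficient over all samples, in [-1, 1]
    scores[k] = silhouette_score(X_scaled, labels)
    # Per-sample coefficients; their mean equals silhouette_score
    per_sample = silhouette_samples(X_scaled, labels)
    print(f"k={k}: mean silhouette = {scores[k]:.3f}, "
          f"negative samples = {(per_sample < 0).sum()}")
```

Values near 1 indicate compact, well-separated clusters, while negative per-sample values suggest points that sit closer to a neighboring cluster than to their own.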

In [None]:
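# silhouette_plot is not part of sklearn; it is presumably a local helper module
# (silhouette.py) that must sit next to this notebook.
from silhouette import silhouette_plot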
from sklearn.metrics import silhouette_samples, silhouette_score



In [None]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;

In [None]:
for i in range(2, 7):
    clusterer = Pipeline(steps=[
        ('scale', StandardScaler()),
        ('kmeans', KMeans(i))
    ])
    silhouette_plot(X, y, clusterer, i)

## Hierarchical Clustering

Use SciPy's `scipy.cluster.hierarchy` module for this one.

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

clusters = linkage(X, 'ward')
_ = dendrogram(clusters)

In [None]:
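# Cut the tree into flat clusters. With fcluster's default criterion ('inconsistent'),
# 7 is a threshold on the inconsistency coefficient, computed up to 10 levels deep.
labels = fcluster(clusters, 7, depth=10)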

In [None]:
pd.crosstab(y, labels)

## DBSCAN

For DBSCAN, we need to pick `min_samples` and $\epsilon$. One way to do this:

1. Fix a value of `min_samples` that makes sense.
2. Try a wide range of values for $\epsilon$ and record the number of unique labels for each one.
3. Look for a number of clusters that persists over a wide range of $\epsilon$.
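The three steps above can be sketched as follows. The choices `min_samples=5` and the $\epsilon$ grid from 0.1 to 2.0 are assumptions for illustration, and the features are left unscaled. Note that DBSCAN assigns the label `-1` to noise points, so this sketch counts clusters with the noise label excluded.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

min_samples = 5                        # step 1: fixed, illustrative value
eps_values = np.linspace(0.1, 2.0, 20)  # step 2: hypothetical grid of epsilon values

n_clusters, n_noise = [], []
for eps in eps_values:
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    # -1 marks noise, so it is not counted as a cluster
    n_clusters.append(int(np.unique(labels[labels != -1]).size))
    n_noise.append(int((labels == -1).sum()))

# Step 3: scan the table for a cluster count that stays constant over many rows
summary = pd.DataFrame({'eps': eps_values, 'clusters': n_clusters, 'noise': n_noise})
print(summary)
```

Tracking the number of noise points alongside the cluster count helps to spot values of $\epsilon$ that are so small that most of the data is discarded.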

In [None]:
from sklearn.cluster import DBSCAN



Pick a few representative choices for $\epsilon$ and see how the clusters compare to the true labels.
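As a sketch of that comparison, the snippet below cross-tabulates the true species against the DBSCAN labels. The three $\epsilon$ values and `min_samples=5` are hypothetical placeholders; in practice they would be read off the sweep over $\epsilon$.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target).map(dict(enumerate(iris.target_names)))

tables = {}
for eps in (0.4, 0.8, 1.2):  # hypothetical representative values
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    # Column -1, when present, holds the points DBSCAN treats as noise
    tables[eps] = pd.crosstab(y, labels, rownames=['species'], colnames=['cluster'])
    print(f"eps = {eps}")
    print(tables[eps], end="\n\n")
```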