Crunchy nut cluster analysis
================

In [None]:
from IPython.display import Image
Image(url= "http://www.kelloggs.com.au/content/dam/workarea/assetpushqueue/images/web-raw-approved/eng%20AU/71/65/prod_img-417165.jpg.thumb.319.319.png")

### Cluster Analysis...
1. ...is a bunch of different analyses
1. ...is unsupervised
1. ...classifies things into different groups, or 'clusters'
1. ...often looks at the 'distance' of points (and clusters) from each other. This can be Euclidean distance in an n-dimensional space.

Everything here is in 2D space - but imagine it in n dimensions!
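
To make the 'distance' idea concrete, here is a minimal sketch of Euclidean distance that works for any number of dimensions. The helper name `euclidean_distance` and the example points are made up for illustration.

```python
import numpy as np

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of dimensions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Square the per-dimension differences, sum them, take the root.
    return np.sqrt(np.sum((p - q) ** 2))

d2 = euclidean_distance([0, 0], [3, 4])              # two dimensions
d4 = euclidean_distance([1, 2, 3, 4], [2, 3, 4, 5])  # four dimensions
print(d2, d4)
```

The same function handles 2D and 4D points without modification, which is why the 2D pictures here carry over to n dimensions.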

In [None]:
!pip install scikit-learn plotly

In [None]:
import numpy as np
import pandas as pd
import sklearn as skl
import sklearn.cluster as sklc

from scipy.cluster.hierarchy import dendrogram, linkage

import plotly
plotly.offline.init_notebook_mode()

In [None]:
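def rand_circle(n=10000, Rc=20, Xc=-30, Yc=-40):
    """Return n points drawn uniformly from a disc of radius Rc centred on (Xc, Yc)."""
    theta = np.random.rand(n)*(2*np.pi)
    # Taking the sqrt of a uniform sample spreads points evenly over the disc's area
    r = Rc*np.sqrt(np.random.rand(n))
    x = Xc + r*np.cos(theta)
    y = Yc + r*np.sin(theta)
    return x, y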

In [None]:
x_a, y_a = rand_circle(200, 2.0, 17, 15.8)
a = pd.DataFrame({'x': x_a, 'y': y_a, 'type': ['a']*200})

x_b, y_b = rand_circle(200, 2.0, 23, 15.8)
b = pd.DataFrame({'x': x_b, 'y': y_b, 'type': ['b']*200})

x_c, y_c = rand_circle(2000, 4.0, 20, 10)
c = pd.DataFrame({'x': x_c, 'y': y_c, 'type': ['c']*2000})

pts = pd.concat([a, b, c], ignore_index=True)  # fresh 0..n-1 index, no duplicate labels

In [None]:
fig = {
    'data': [
        {
            'x': pts.x, 
            'y': pts.y,
            'mode': 'markers'}],
    'layout': {
        'xaxis': {'title': 'X'},
        'yaxis': {'title': "Y"},
        'title': 'Hello Mickey!'
        
    }
}
plotly.offline.iplot(fig)

In [None]:
Image(url= "http://data.freehdw.com/mickey-mouse-head.jpg")

In [None]:
fig = {
    'data': [
        {
            'x': pts[pts['type']=='a'].x, 
            'y': pts[pts['type']=='a'].y,
            'mode': 'markers',
            'name': 'A'},
        {
            'x': pts[pts['type']=='b'].x, 
            'y': pts[pts['type']=='b'].y, 
            'mode': 'markers',
            'name': 'B'},
        {
            'x': pts[pts['type']=='c'].x, 
            'y': pts[pts['type']=='c'].y,
            'mode': 'markers',
            'name': 'C'}
    ],
    'layout': {
        'xaxis': {'title': 'X'},
        'yaxis': {'title': "Y"},
        'title': "Source clusters"
        
    }
}
plotly.offline.iplot(fig)

K-means
===

1. Initialize k different starting cluster 'centers'
2. Allocate points to closest cluster
3. Recalculate cluster centers
4. Repeat steps 2 and 3 until the cluster assignments stop changing

[Example](http://onmyphd.com/?p=k-means.clustering)
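
The four steps above can be sketched in a few lines of NumPy. This is a bare-bones illustration, not what `sklc.KMeans` does internally (scikit-learn uses smarter initialization and several restarts). The function name `kmeans_sketch`, its parameters and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Plain k-means loop on an (n_points, n_dims) array; illustration only."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step 1: pick k distinct data points as the starting centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its closest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its points
        # (an empty cluster keeps its old center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two obvious blobs of three points each.
demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers = kmeans_sketch(demo, k=2)
print(labels)
```

The label numbers themselves are arbitrary: which blob ends up as 0 and which as 1 depends on the random starting centers.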

In [None]:
kmeans_clustering = sklc.KMeans(n_clusters=3).fit(pts[['x','y']])

In [None]:
pts["kmeans"] = pd.Series(kmeans_clustering.labels_, index=pts.index)

In [None]:
fig = {
    'data': [
        {
            'x': pts[pts['kmeans']==0].x, 
            'y': pts[pts['kmeans']==0].y,
            'mode': 'markers',
            'name': '0'},
        {
            'x': pts[pts['kmeans']==1].x, 
            'y': pts[pts['kmeans']==1].y, 
            'mode': 'markers',
            'name': '1'},
        {
            'x': pts[pts['kmeans']==2].x, 
            'y': pts[pts['kmeans']==2].y,
            'mode': 'markers',
            'name': '2'}
    ],
    'layout': {
        'xaxis': {'title': 'X'},
        'yaxis': {'title': "Y"},
        'title': 'K Means'
    }
}
plotly.offline.iplot(fig)

#### Gotchas
1. Clusters are assumed to contain a similar number of points
1. Clusters are assumed to have similar variance
1. Clusters are assumed to be 'spherical'
1. You need to guess the number of clusters

## Hierarchical Agglomerative

1. All points start in their own cluster
1. Join the two 'closest' clusters
1. Carry on until all the points are in one cluster
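
The steps above can be sketched naively as follows. This version uses average pairwise distance as 'closest' and stops early at a requested number of clusters rather than merging all the way down to one; both choices, and the name `agglomerate_sketch`, are assumptions for illustration. It is far too slow for real data, where `scipy.cluster.hierarchy.linkage` is the sensible choice.

```python
import numpy as np

def agglomerate_sketch(points, n_clusters):
    """Naive average-linkage agglomeration; roughly O(n^3), illustration only."""
    points = np.asarray(points, dtype=float)
    # Step 1: every point starts in its own cluster (lists of row indices).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        # Step 2: find the pair of clusters with the smallest average pairwise distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                diffs = points[clusters[i]][:, None, :] - points[clusters[j]][None, :, :]
                d = np.linalg.norm(diffs, axis=2).mean()
                if d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i, then carry on (step 3).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

demo = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [20.0, 20.0]])
groups = agglomerate_sketch(demo, n_clusters=3)
print(groups)
```

Calling it with `n_clusters=1` reproduces the full procedure, ending with every point in a single cluster.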

In [None]:
# ha_clustering_ward = sklc.AgglomerativeClustering(linkage='ward').fit(pts[['x','y']])

In [None]:
# DataFrame.as_matrix() was removed from pandas; to_numpy() is the replacement
ha_clustering = linkage(pts[['x','y']].to_numpy(), method='average')

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
dendrogram(ha_clustering,
    truncate_mode='lastp',
    p=12,
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True)
plt.show()

The vertical height of each join is the distance that had to be "bridged" to merge the two clusters below it.
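
Those heights can also be read directly from the linkage matrix, without drawing anything. A small sketch on four made-up points (the array `demo` is purely illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two tight pairs of points, far apart from each other.
demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
Z = linkage(demo, method='average')

# Each row of Z is one merge: [cluster_i, cluster_j, distance, size of new cluster].
merge_heights = Z[:, 2]
print(merge_heights)
```

The first two merges happen at a small distance (each pair joins up), and the last one needs a much larger distance. A big jump like that is a natural place to cut the tree.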

In [None]:
from scipy.cluster.hierarchy import fcluster
clusters = fcluster(ha_clustering, 3, criterion='maxclust')

In [None]:
fig = {
    'data': [
        {
            'x': pts[clusters==1].x, 
            'y': pts[clusters==1].y,
            'mode': 'markers',
            'name': '1'},
        {
            'x': pts[clusters==2].x, 
            'y': pts[clusters==2].y, 
            'mode': 'markers',
            'name': '2'},
        {
            'x': pts[clusters==3].x, 
            'y': pts[clusters==3].y,
            'mode': 'markers',
            'name': '3'}
    ],
    'layout': {
        'xaxis': {'title': 'X'},
        'yaxis': {'title': "Y"},
        'title': 'Hierarchical agglomerative (average linkage)'
    }
}
plotly.offline.iplot(fig)

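## What is distance anyway?

The 'distance' between two clusters can be measured in several ways:

1. Centroid (distance between the cluster centers)
1. Nearest (closest pair of points, 'single' linkage)
1. Farthest (most distant pair of points, 'complete' linkage)
1. Average (mean over all pairs of points)
1. Ward's method (increase in within-cluster variance)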
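
To see how much these definitions can disagree, here is a sketch that computes each one for two tiny made-up clusters `A` and `B`. The Ward figure below is the raw increase in within-cluster sum of squares caused by the merge; libraries may report a rescaled version of it, so treat the number as illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

pairwise = cdist(A, B)  # every A-to-B Euclidean distance

centroid_dist = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
nearest_dist = pairwise.min()    # single linkage
farthest_dist = pairwise.max()   # complete linkage
average_dist = pairwise.mean()   # average linkage

def ssw(points):
    """Sum of squared distances from each point to the cluster mean."""
    return ((points - points.mean(axis=0)) ** 2).sum()

# Ward: how much the within-cluster sum of squares grows if A and B are merged.
ward_cost = ssw(np.vstack([A, B])) - ssw(A) - ssw(B)

print(centroid_dist, nearest_dist, farthest_dist, average_dist, ward_cost)
```

Nearest and farthest differ by a factor of two even on this toy example, so the choice of linkage can change which clusters get joined first.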

## How do you know how many clusters?

There are lots of different ways and it's ultimately **subjective**!!

1. Look for a sharp decrease in $R^2 = 1-\frac{SSW}{SST}$ (within-cluster sum of squares over total sum of squares), i.e. the fraction of the variance explained by the given solution
1. A jump in the RMS distance between clusters indicates that two dissimilar clusters have been joined
1. 'Insignificant' clusters contain fewer than half the mean number of points per cluster (this assumes clusters of similar size)
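
As a rough sketch of the first criterion, the snippet below computes $R^2$ for several cluster counts, taking $R^2$ as one minus the within-cluster sum of squares over the total sum of squares. It swaps in k-means rather than the hierarchical method, simply because scikit-learn exposes the within-cluster sum of squares as `inertia_`. The three synthetic blobs are an assumption for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs of 100 points each.
blobs = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                   for c in ([0, 0], [5, 5], [0, 5])])

# Total sum of squares around the overall mean.
sst = ((blobs - blobs.mean(axis=0)) ** 2).sum()

r_squared = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(blobs)
    # inertia_ is the within-cluster sum of squares for this solution.
    r_squared[k] = 1 - km.inertia_ / sst

for k, value in r_squared.items():
    print(k, round(value, 3))
```

Reading the output from many clusters down to few, $R^2$ should stay high until the count drops below three and then fall sharply, which is the signal to stop merging. On real data the drop is rarely this clean.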

## Real world uses

[My WIBS paper!](http://www.atmos-meas-tech.net/6/337/2013/amt-6-337-2013.pdf)

[My back trajectory paper](http://www.atmos-chem-phys.net/11/9605/2011/acp-11-9605-2011.pdf)