### OVERVIEW

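* all methods accept a standard data matrix of shape [#samples, #features]
* AP, SC, DBSCAN also accept a precomputed matrix of shape [#samples, #samples] (similarities for AP & SC; distances for DBSCAN with metric='precomputed')
* fitting yields an array [integer_label,...] giving the cluster of each sample
* cluster labels are stored in the *labels_* attribute after fitting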
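
A minimal sketch of these conventions, assuming a recent scikit-learn install. The toy array `X` and the parameter values (`eps`, `min_samples`, `n_init`) are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import pairwise_distances

# Toy data: two well-separated groups, shape [n_samples, n_features] = [6, 2].
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Feature-matrix input: fit, then read one integer label per sample.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_.shape)

# Pairwise-matrix input, shape [n_samples, n_samples].
# DBSCAN expects distances (not similarities) when metric="precomputed".
D = pairwise_distances(X)
db = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit(D)
print(db.labels_)
```

The integer values themselves are arbitrary names; only the grouping they induce is meaningful.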

[comparison](plot_cluster_comparison.ipynb)

### K-MEANS (LLOYD'S ALGO)

* Tries to separate data into n groups of equal variance 
* Minimizes inertia (within-cluster sum-of-squares)
* Assumes clusters are convex & isotropic, not always true
* Poor response to elongated clusters or irregular manifolds
* Inertia not normalized; distances get big in high-D spaces
* Consider dimensionality reduction before running Kmeans
* Kmeans always converges, possibly at local minimum
* Use multiple initializations (n_init) to reduce the risk of a poor local minimum
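
The assign/update loop and the restart strategy can be sketched in plain Python. This is an illustrative toy, not scikit-learn's implementation; the helper names (`sq_dist`, `assign`, `lloyd`, `kmeans`) and the data are made up for the example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def lloyd(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm from a random initialization."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Move each center to the mean of its members (keep it if empty).
            new_centers.append(
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centers[j])
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    labels = assign(points, centers)
    # Inertia: sum of squared distances from each point to its center.
    inertia = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, inertia

def kmeans(points, k, n_init=10, seed=0):
    """Repeat Lloyd's algorithm and keep the run with the lowest inertia."""
    rng = random.Random(seed)
    return min((lloyd(points, k, rng) for _ in range(n_init)),
               key=lambda run: run[2])

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers, inertia = kmeans(points, k=2)
print(labels, inertia)
```

Each run can only lower inertia step by step, so it always stops, but where it stops depends on the starting centers. Keeping the best of several runs is the simplest defense.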

[API](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans)

[demo, iris](plot_cluster_iris.ipynb) | 
[demo, color quant](plot_color_quantization.ipynb) |
[demo, assumptions](plot_kmeans_assumptions.ipynb) |
[demo, digits](plot_kmeans_digits.ipynb)

[demo, silhouette](plot_kmeans_silhouette_analysis.ipynb) |
[demo, initialization](plot_kmeans_stability_low_dim_dense.ipynb)

* Visualization also available via [Voronoi diagrams](https://en.wikipedia.org/wiki/Voronoi_diagram)

### MINIBATCH KMEANS

* Goal: reduce computation time while optimizing same function
* Uses random subsets for training on each iteration

[API](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html#sklearn.cluster.MiniBatchKMeans) |
[learn faces dataset](plot_dict_face_patches.ipynb) |
[kmeans std vs minibatch](plot_mini_batch_kmeans.ipynb)

### AFFINITY PROPAGATION

* Creates clusters by passing messages between pairs of samples until convergence
* Describes the dataset with a small number of "exemplars" (the samples most representative of the others)
* Main issue is complexity: time & memory grow quadratically with #samples. Use on small-to-medium datasets

[API](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation) |
[example](plot_affinity_propagation.ipynb) |
[plot stock market](plot_stock_market.ipynb)

### MEAN SHIFT

* Looks for blobs in a smooth density of samples
* Centroid-based
* Automatically sets #clusters
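
A rough sketch of the idea on 1-D data, assuming a flat (uniform) kernel: every point climbs to the mean of its neighborhood until it stops moving, and points that end at the same mode share a cluster. The function name `mean_shift_1d` and the bandwidth are illustrative; scikit-learn's `MeanShift` differs in detail.

```python
def mean_shift_1d(xs, bandwidth, tol=1e-6, max_iter=100):
    """Flat-kernel mean shift on 1-D data; returns (centers, labels)."""
    modes = []
    for x in xs:
        m = x
        for _ in range(max_iter):
            # Shift to the mean of all samples within one bandwidth.
            window = [v for v in xs if abs(v - m) <= bandwidth]
            new_m = sum(window) / len(window)
            if abs(new_m - m) < tol:
                break
            m = new_m
        modes.append(new_m)

    # Merge modes that landed close together; the number of clusters
    # falls out of the data instead of being specified up front.
    centers, labels = [], []
    for m in modes:
        for j, c in enumerate(centers):
            if abs(m - c) < bandwidth:
                labels.append(j)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels

xs = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
centers, labels = mean_shift_1d(xs, bandwidth=1.0)
print(len(centers), labels)
```

The bandwidth is the one knob that matters here: a larger value merges neighboring blobs, a smaller one splits them.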

[API](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MeanShift.html#sklearn.cluster.MeanShift) | 
[example](plot_mean_shift.ipynb)

### SPECTRAL CLUSTERING

* Finds a low-D representation of affinity matrix, followed by K-means.

[API](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html#sklearn.cluster.SpectralClustering) |
[image segmentation](plot_segmentation_toy.ipynb) |
[raccoon face segmentation](plot_face_segmentation.ipynb)

### HIERARCHICAL CLUSTERING

* Hierarchy represented by tree or dendrogram
* Linkage criteria used for cluster merging:
   * 1) Ward/min-variance linkage
   * 2) Max/complete linkage
   * 3) Avg linkage
* Usually returns uneven cluster sizes
* Can use Euclidean, Manhattan, cosine distances for linkages (Ward works with Euclidean only)
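
A small sketch of bottom-up merging with complete and average linkage, written in plain Python for readability rather than speed. Ward linkage is left out to keep it short, and the function name `agglomerate` is made up for the example.

```python
from itertools import combinations
from math import dist

def agglomerate(points, n_clusters, linkage="complete"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per sample

    def link(a, b):
        ds = [dist(points[i], points[j]) for i in a for j in b]
        # complete: farthest pair; average: mean over all pairs
        return max(ds) if linkage == "complete" else sum(ds) / len(ds)

    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: link(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(agglomerate(points, 2, "complete"))
print(agglomerate(points, 2, "average"))
```

Recording the sequence of merges instead of stopping at `n_clusters` would give the tree that a dendrogram draws.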

[API](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering) |
[demo](plot_agglomerative_clustering.ipynb) |
[metrics](plot_agglomerative_clustering_metrics.ipynb) |
[2D digits](plot_digits_linkage.ipynb)

* Connectivity constraints (ex: only adjacent clusters can be merged) can be used.

[struct vs unstruct](plot_ward_structured_vs_unstructured.ipynb)

### DBSCAN

* Views clusters as high-density areas, separated by low-density areas
* This allows DBSCAN clusters to be any shape
* Uses "core samples" concept to characterize each cluster
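
A compact sketch of the core-sample idea in plain Python. It assumes the convention that a point counts as its own neighbor; the function name `dbscan` and the `eps` / `min_samples` values are illustrative, and this is not scikit-learn's implementation.

```python
from math import dist

def dbscan(points, eps, min_samples):
    """Return (labels, is_core); label -1 marks noise."""
    n = len(points)
    neigh = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
             for i in range(n)]
    # A core sample has at least min_samples points within eps (itself included).
    core = [len(nb) >= min_samples for nb in neigh]

    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cluster          # grow a new cluster from core sample i
        stack = [i]
        while stack:
            p = stack.pop()
            if not core[p]:
                continue             # border point: joins, but does not expand
            for q in neigh[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels, core

# An elongated chain, a compact triangle, and one isolated point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0),
          (10, 10), (10, 11), (11, 10),
          (50, 50)]
labels, core = dbscan(points, eps=1.5, min_samples=3)
print(labels)
print(core)
```

The chain ends are border points (reachable from a core sample, but not core themselves), and the isolated point stays at -1. Because clusters grow by chaining neighborhoods, the elongated chain is recovered as a single cluster.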

[API](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN) | [demo](plot_dbscan.ipynb)


### BIRCH

* Builds a CFT (clustering feature tree) from input data
* Doesn't scale well to high-dimensional data. If #features > 20, use MiniBatchKMeans instead

[API](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html#sklearn.cluster.Birch) | [birch vs kmeans](plot_birch_vs_minibatchkmeans.ipynb)

### SCORING: RAND INDEX

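* Adjusted Rand index (ARI) measures similarity of labels_true & labels_pred, corrected for chance
* Ignores how clusters are named; only the grouping matters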

In [1]:
from sklearn import metrics
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

metrics.adjusted_rand_score(labels_true, labels_pred)  

0.24242424242424246

In [2]:
labels_pred = [1, 1, 0, 0, 3, 3]
metrics.adjusted_rand_score(labels_true, labels_pred) 

0.24242424242424246

In [3]:
# ARS is symmetric; swapping arg doesn't change score
metrics.adjusted_rand_score(labels_pred, labels_true)  

0.24242424242424246

In [4]:
# perfect labeling:
labels_pred = labels_true[:]
metrics.adjusted_rand_score(labels_true, labels_pred)

1.0

In [5]:
# bad (independent) labeling:
labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
metrics.adjusted_rand_score(labels_true, labels_pred)  

-0.12903225806451613

[example: adjustment for chance](plot_adjusted_for_chance_measures.ipynb)
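
The ARI values above can be reproduced by counting pairs of samples. The sketch below uses only the standard library (Python 3.8+ for `math.comb`); the function name `adjusted_rand` is made up, and degenerate inputs where the denominator is zero are not handled.

```python
from collections import Counter
from math import comb

def adjusted_rand(labels_true, labels_pred):
    """Adjusted Rand index from pair counts (no degenerate-case handling)."""
    n = len(labels_true)
    # Pairs grouped together in both labelings, in truth only, in prediction only.
    pairs_both = sum(comb(c, 2)
                     for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = pairs_true * pairs_pred / comb(n, 2)   # agreement expected by chance
    max_index = (pairs_true + pairs_pred) / 2
    return (pairs_both - expected) / (max_index - expected)

print(adjusted_rand([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))
print(adjusted_rand([0, 1, 2, 0, 3, 4, 5, 1], [1, 1, 0, 0, 2, 2, 2, 2]))
```

Subtracting the chance-level agreement is what lets the score go negative for labelings that agree less than random ones would.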

### SCORING: MUTUAL INFORMATION

* measures agreement between labels_true & labels_pred
* two normalized versions (NMI, AMI) available

In [6]:
from sklearn import metrics
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
metrics.adjusted_mutual_info_score(labels_true, labels_pred) 

0.2250422831983088

In [7]:
labels_pred = [1, 1, 0, 0, 3, 3]
metrics.adjusted_mutual_info_score(labels_true, labels_pred) 

0.2250422831983088

In [8]:
# symmetric
metrics.adjusted_mutual_info_score(labels_pred, labels_true)  

0.2250422831983088

In [9]:
# perfect labeling
labels_pred = labels_true[:]
metrics.adjusted_mutual_info_score(labels_true, labels_pred)
metrics.normalized_mutual_info_score(labels_true, labels_pred)

1.0

In [10]:
# unnormalized mutual_info_score does not reach 1.0 for a perfect labeling
metrics.mutual_info_score(labels_true, labels_pred)  

0.69314718055994518

In [11]:
# bad (independent) labels:
labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
metrics.adjusted_mutual_info_score(labels_true, labels_pred)

-0.10526315789473674
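
To see where the 0.693 for a perfect labeling comes from, the unnormalized score can be computed by hand. This is a stdlib-only sketch using natural logarithms; the function names are made up for the example.

```python
from collections import Counter
from math import log

def mutual_info(labels_true, labels_pred):
    """Mutual information (in nats) between two labelings."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    return sum((c / n) * log(n * c / (count_true[t] * count_pred[p]))
               for (t, p), c in joint.items())

def entropy(labels):
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

labels_true = [0, 0, 0, 1, 1, 1]
print(mutual_info(labels_true, labels_true))  # equals the entropy of the labels
print(entropy(labels_true))                   # ln(2) for two balanced classes
```

For a perfect match the mutual information equals the label entropy, here ln 2 ≈ 0.693, so its upper bound depends on the data. The normalized and adjusted variants rescale by the entropies so that a perfect labeling scores 1.0.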

### SCORING: FOWLKES-MALLOWS

[API](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.fowlkes_mallows_score.html#sklearn.metrics.fowlkes_mallows_score)

* Measures similarity of two clusterings
* Defined as the geometric mean of pairwise precision & recall
* FMI = TP/sqrt((TP+FP)*(TP+FN))

In [12]:
# perfect labeling (a pure renaming of the labels also scores 1.0;
# the notebook displays only the last expression)
from sklearn.metrics.cluster import fowlkes_mallows_score
fowlkes_mallows_score([0, 0, 1, 1], [0, 0, 1, 1])
fowlkes_mallows_score([0, 0, 1, 1], [1, 1, 0, 0])

1.0

In [13]:
# if the members of a class are completely split across different clusters,
# no pair of samples is grouped together in both labelings, so the score is 0.
fowlkes_mallows_score([0, 0, 0, 0], [0, 1, 2, 3])

0.0

In [14]:
# bad (independent) labels:
labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
metrics.fowlkes_mallows_score(labels_true, labels_pred)  

0.0
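
The formula can be checked by counting sample pairs directly. In this stdlib-only sketch, TP is the number of pairs grouped together in both labelings, while FP and FN are the pairs grouped together in only one of them. The function name `fowlkes_mallows` is made up for the example.

```python
from collections import Counter
from math import comb, sqrt

def fowlkes_mallows(labels_true, labels_pred):
    """FMI = TP / sqrt((TP + FP) * (TP + FN)), computed from pair counts."""
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    tp = sum(comb(c, 2)
             for c in Counter(zip(labels_true, labels_pred)).values())
    fp = pairs_true - tp   # together in labels_true only
    fn = pairs_pred - tp   # together in labels_pred only
    if tp == 0:
        return 0.0         # also avoids 0/0 when no pairs exist at all
    return tp / sqrt((tp + fp) * (tp + fn))

print(fowlkes_mallows([0, 0, 1, 1], [1, 1, 0, 0]))
print(fowlkes_mallows([0, 0, 0, 0], [0, 1, 2, 3]))
```

Swapping FP and FN leaves the product under the square root unchanged, which is why the score is symmetric in its two arguments.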

### SCORING: SILHOUETTE COEFFICIENT

[API](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score) | [example](plot_kmeans_silhouette_analysis.ipynb)

In [15]:
# SC for set of samples = mean of SC for each sample.
from sklearn import metrics
from sklearn.metrics import pairwise_distances
from sklearn import datasets
dataset = datasets.load_iris()
X = dataset.data
y = dataset.target

In [16]:
# apply to results of a cluster analysis
import numpy as np
from sklearn.cluster import KMeans
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
metrics.silhouette_score(X, labels, metric='euclidean')

0.55259194452136762
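
For intuition, the per-sample coefficient is s = (b - a) / max(a, b), where a is the mean distance to the other members of the sample's own cluster and b is the mean distance to the nearest other cluster. A stdlib-only sketch follows (Python 3.8+ for `math.dist`); the function name is hypothetical, and it assumes at least two clusters and no duplicate points.

```python
from math import dist

def silhouette_per_sample(points, labels):
    """Silhouette coefficient of every sample (Euclidean distance)."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        own = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        if not own:
            scores.append(0.0)   # convention: singleton clusters score 0
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        # b: mean distance to the nearest other cluster
        b = min(
            sum(dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != lab)
        scores.append((b - a) / max(a, b))
    return scores

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
scores = silhouette_per_sample(points, labels)
print(sum(scores) / len(scores))  # overall score = mean over samples
```

Values near 1 mean tight, well-separated clusters. Values near 0 mean overlapping clusters, and negative values suggest a sample sits closer to another cluster than to its own.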

### SCORING: CALINSKI-HARABASZ INDEX

[API](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabaz_score.html#sklearn.metrics.calinski_harabaz_score)

* Defined as the ratio of mean between-cluster dispersion to mean within-cluster dispersion; higher means better-defined clusters
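
Written out, for $n$ samples in $k$ clusters, the ratio takes the following form. The symbols are introduced here for illustration: $C_q$ is the set of points in cluster $q$, $n_q$ its size, $c_q$ its center, and $c$ the center of the whole dataset.

$$
s = \frac{\operatorname{tr}(B_k)/(k-1)}{\operatorname{tr}(W_k)/(n-k)},
\qquad
W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^{T},
\qquad
B_k = \sum_{q=1}^{k} n_q \,(c_q - c)(c_q - c)^{T}
$$

The traces reduce to sums of squared distances: $\operatorname{tr}(W_k)$ is the K-means inertia, and $\operatorname{tr}(B_k)$ is the size-weighted squared distance of the cluster centers from the overall center.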

In [17]:
from sklearn import metrics
from sklearn.metrics import pairwise_distances
from sklearn import datasets
dataset = datasets.load_iris()
X = dataset.data
y = dataset.target

In [18]:
# apply to results of a cluster analysis
import numpy as np
from sklearn.cluster import KMeans
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
# note: newer scikit-learn releases spell this calinski_harabasz_score
metrics.calinski_harabaz_score(X, labels)  

560.39992424664024