# Gaussian Mixture Models (GMM)

A Gaussian mixture model (GMM) is a probabilistic approach to clustering that addresses many of these problems. Each cluster is described by its centroid (mean), its covariance, and its size (weight).

### Approach

* Rather than assigning each point to the "nearest" centroid as k-means does, we fit a set of k Gaussians to the data.
* We estimate the parameters of each Gaussian: its mean, its covariance (the variance, in one dimension), and the weight of the cluster.
* Once the parameters are learned, we can calculate, for each data point, the probability that it belongs to each of the clusters.
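The last step above, turning learned parameters into per-point membership probabilities, can be sketched in one dimension with NumPy. The weights, means and variances below are hypothetical values chosen for illustration; they are not fitted to the iris data.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Hypothetical parameters of a two-component mixture (assumed, not learned).
weights = np.array([0.5, 0.5])
means = np.array([0.0, 4.0])
variances = np.array([1.0, 1.0])

def responsibilities(x):
    """Probability that each point in x belongs to each component."""
    x = np.asarray(x, dtype=float)[:, None]                 # shape (n, 1)
    weighted = weights * gaussian_pdf(x, means, variances)  # shape (n, k)
    # Normalize each row so the probabilities for one point sum to 1.
    return weighted / weighted.sum(axis=1, keepdims=True)

probs = responsibilities([0.0, 2.0, 4.0])
print(probs)
```

A point halfway between the two means (here `x = 2.0`) is shared equally between the components, while points sitting on a mean are assigned almost entirely to that component. This soft assignment is what distinguishes a mixture model from the hard labels produced by k-means.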

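### How do we estimate the Gaussian distribution parameters?

* Expectation maximization (EM) is the technique most commonly used to estimate a mixture model's parameters. It alternates between computing each point's cluster membership probabilities under the current parameters (E-step) and re-estimating the parameters from those probabilities (M-step).
* In frequentist probability theory, models are typically learned with maximum likelihood estimation, which seeks the parameters that maximize the probability, or likelihood, of the observed data. EM is an iterative way of carrying out this maximization for mixture models.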
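To make the estimation procedure more concrete, here is a minimal sketch of EM for a one-dimensional mixture, written with NumPy only. The function name `fit_gmm_1d`, the quantile-based initialization and the synthetic data are assumptions made for this illustration; scikit-learn's `GaussianMixture` works in any number of dimensions and uses its own initialization and convergence checks.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def fit_gmm_1d(x, k, n_iter=100):
    """Illustrative EM for a 1-D Gaussian mixture (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # Simple deterministic start: equal weights, means at evenly spaced
    # quantiles of the data, and the overall variance for every component.
    weights = np.full(k, 1.0 / k)
    means = np.quantile(x, np.linspace(0.0, 1.0, k + 2)[1:-1])
    variances = np.full(k, x.var())
    for _ in range(n_iter):
        # E-step: membership probabilities under the current parameters.
        weighted = weights * gaussian_pdf(x[:, None], means, variances)
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from them.
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        variances = np.maximum(variances, 1e-6)  # guard against collapse
    return weights, means, variances

# Synthetic data: two well-separated groups of 300 points each.
rng = np.random.default_rng(42)
sample = np.concatenate([rng.normal(-5.0, 1.0, 300), rng.normal(5.0, 1.0, 300)])
weights, means, variances = fit_gmm_1d(sample, k=2)
print(weights, means, variances)
```

The sketch runs a fixed number of iterations for simplicity; a fuller implementation would stop once the log-likelihood no longer improves.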

These concepts are hard to visualize and grasp on a first pass. Here is some material for a deeper dive into GMMs:
* http://www.cse.iitm.ac.in/~vplab/courses/DVP/PDF/gmm.pdf

In [49]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="white", color_codes=True)

%matplotlib inline

In [50]:
data = pd.read_csv('my_machine-learning/datasets/iris.csv')
data = data.drop('Id', axis=1) # get rid of the Id column - don't need it
data.sample(5)

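|     | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species         |
|-----|---------------|--------------|---------------|--------------|-----------------|
| 75  | 6.6           | 3.0          | 4.4           | 1.4          | Iris-versicolor |
| 36  | 5.5           | 3.5          | 1.3           | 0.2          | Iris-setosa     |
| 129 | 7.2           | 3.0          | 5.8           | 1.6          | Iris-virginica  |
| 73  | 6.1           | 2.8          | 4.7           | 1.2          | Iris-versicolor |
| 83  | 6.0           | 2.7          | 5.1           | 1.6          | Iris-versicolor |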


In [51]:
X = data.iloc[:,0:4]
y = data.iloc[:,-1]

### Standard Scaling

In [52]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()

scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array, columns = X.columns)

In [53]:
from sklearn.cluster import KMeans

kmean = KMeans(n_clusters=3)
y_kmeans = kmean.fit_predict(X_scaled)

#### Adjusted Rand score

* You can't compare the Species labels directly with the cluster numbers, because the cluster numbers are arbitrarily assigned integers.
* Instead, you can use the *adjusted Rand score* to quantify how well the clustering agrees with Species (the true labels).

* For example, this gives a perfect score of 1.0 even though the labels are swapped: `adjusted_rand_score([0,0,1,1], [1,1,0,0])  # => 1.0`

See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html
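To show why the score ignores how the clusters are numbered, here is a small sketch that computes the adjusted Rand index by hand from pair counts, using only the standard library. The function `adjusted_rand` is a hypothetical helper written for this illustration; in practice you would call scikit-learn's `adjusted_rand_score`.

```python
from collections import Counter
from math import comb

def adjusted_rand(labels_true, labels_pred):
    """Adjusted Rand index from pair counts (illustrative helper)."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Count how many points fall in each (true label, cluster) cell.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    # Number of point pairs grouped together in both / in each labeling.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)

print(adjusted_rand([0, 0, 1, 1], [1, 1, 0, 0]))          # swapped labels
print(adjusted_rand(["a", "a", "b", "b"], [1, 1, 0, 0]))  # string vs integer labels
print(adjusted_rand([0, 0, 1, 1], [0, 1, 0, 1]))          # clusters cut across classes
```

Only the grouping of points matters, so the first two calls both report perfect agreement. The third grouping never places a same-class pair together, so it scores below zero, which means worse than chance.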


In [54]:
from sklearn.metrics.cluster import adjusted_rand_score

# First, see how the k-means clustering did.
score = adjusted_rand_score(y, y_kmeans)
score

0.6201351808870379

In [55]:
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3)
y_gmm = gmm.fit_predict(X_scaled)


* GMM fits normally distributed clusters, each with its own covariance, which is probably a reasonable description of this data,
* so it fits the data better here. k-means is biased towards spherical clusters of similar size.

In [56]:
score = adjusted_rand_score(y, y_gmm)
score

0.9038742317748124
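Beyond hard labels, a fitted mixture exposes the weight, mean and covariance of each cluster, along with per-point membership probabilities. The sketch below is self-contained: it loads the copy of iris bundled with scikit-learn rather than the CSV file used above, and `random_state=0` is an assumption added for reproducibility, so the numbers may differ from the run shown in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Bundled iris data, standardized as in the cells above (assumed equivalent
# to the CSV file, which is not available here).
X_demo = StandardScaler().fit_transform(load_iris().data)

gmm_demo = GaussianMixture(n_components=3, random_state=0).fit(X_demo)

print(gmm_demo.weights_)            # size (weight) of each cluster
print(gmm_demo.means_.shape)        # one 4-dimensional mean per cluster
print(gmm_demo.covariances_.shape)  # one 4x4 covariance matrix per cluster

# Soft assignments: probability of each point belonging to each cluster.
probs_demo = gmm_demo.predict_proba(X_demo)
print(np.round(probs_demo[:5], 3))
```

Each row of `probs_demo` sums to 1, and taking the `argmax` of a row gives the same label that `predict` would return.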

* **We can't plot all four dimensions in a single graph without a dimensionality reduction technique such as PCA.**
* So we will create the plot after learning PCA.