[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nkeriven/ensta-mt12/blob/main/notebooks/06_clustering/N5_EM_iris_data_example.ipynb)


# GMM covariances


Demonstration of several covariance types for Gaussian mixture models.

See https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture for more information on the estimator.

Although GMMs are often used for clustering, we can compare the obtained
clusters with the actual classes from the dataset. We initialize the means
of the Gaussians with the class means from the training set, so that
component k starts on class k and this comparison is valid.
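
As a minimal sketch of that comparison (the names `class_means`, `y_pred` and `agreement` are illustrative and not part of the original notebook), the class means can be passed through the `means_init` parameter and the predicted component indices compared directly with the labels:

```python
import numpy as np
from sklearn import datasets
from sklearn.mixture import GaussianMixture

iris = datasets.load_iris()
X, y = iris.data, iris.target
n_classes = len(np.unique(y))

# Class means used as initial component means, so component k starts on class k.
class_means = np.array([X[y == k].mean(axis=0) for k in range(n_classes)])

gmm = GaussianMixture(n_components=n_classes, covariance_type="full",
                      means_init=class_means, max_iter=50, random_state=0)
y_pred = gmm.fit(X).predict(X)

# Fraction of samples whose component index equals their class label.
agreement = np.mean(y_pred == y)
print(f"agreement with true classes: {agreement:.3f}")
```

Without the supervised initialization, the component indices would be arbitrary and this direct comparison would not be meaningful.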

We plot predicted labels on both training and held out test data using a
variety of GMM covariance types on the iris dataset.
We compare GMMs with spherical, diagonal, full, and tied covariance
matrices in increasing order of performance. Although one would
expect full covariance to perform best in general, it is prone to
overfitting on small datasets and does not generalize well to held out
test data.
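
To make the four covariance types concrete, the sketch below fits one model per type on the full iris data and prints the shape of the fitted `covariances_` attribute. The dictionary `shapes` is an illustrative name, and the settings (`n_components=3`, `random_state=0`) are assumptions chosen for the example.

```python
import numpy as np
from sklearn import datasets
from sklearn.mixture import GaussianMixture

X = datasets.load_iris().data  # 150 samples, 4 features

shapes = {}
for cov_type in ["spherical", "diag", "tied", "full"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          random_state=0).fit(X)
    # spherical: one variance per component; diag: one variance per feature
    # and component; tied: one full matrix shared by all components;
    # full: one full matrix per component.
    shapes[cov_type] = np.shape(gmm.covariances_)
    print(f"{cov_type:>9}: covariances_ shape = {shapes[cov_type]}")
```

The number of free covariance parameters grows from spherical to full, which is why the full model is the most flexible and also the most exposed to overfitting.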

On the plots, train data is shown as dots, while test data is shown as
crosses. The iris dataset is four-dimensional. Only the first two
dimensions are shown here, and thus some points are separated in other
dimensions.



This is a simplified version of the code that can be found here:
https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#sphx-glr-auto-examples-mixture-plot-gmm-covariances-py

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import StratifiedKFold
%matplotlib inline

# Iris data - EM clustering
Note that while labels are available for this data set, the EM fit itself does not use them. Below they serve only to initialize the means and to color the plots.

In [None]:
iris = datasets.load_iris()

X_train = iris.data
y_train = iris.target

n_classes = len(np.unique(y_train))  # number of distinct labels in y_train

## GMM model estimation

In [None]:
# Try GMMs using full covariance (no constraints imposed on cov)
estimator = GaussianMixture(n_components=n_classes, 
                             covariance_type='full', max_iter=50, random_state=0)

# !! 
The lines below initialize the means with the center of mass of each cluster, which is possible here because the labels are known.
Usually the 3 initial centers are not known and are chosen at random. In that case the clusters can still be recovered, but only up to a permutation of the labels.
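
As a hedged illustration of the permutation issue, the sketch below fits a GMM without `means_init` and then matches clusters to classes. The matching step uses `scipy.optimize.linear_sum_assignment` on the confusion matrix; this is one possible way to do it and is not part of the original notebook. The names `counts`, `relabel` and `y_matched` are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.mixture import GaussianMixture

iris = datasets.load_iris()
X, y = iris.data, iris.target

# No means_init: the cluster indices carry no relation to the class labels.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
clusters = gmm.fit_predict(X)

# counts[i, j] = number of samples of class i assigned to cluster j
counts = confusion_matrix(y, clusters)
class_idx, cluster_idx = linear_sum_assignment(-counts)  # maximize matches

# relabel[j] = class matched to cluster j
relabel = np.empty(3, dtype=int)
relabel[cluster_idx] = class_idx
y_matched = relabel[clusters]

print("cluster -> class:", dict(enumerate(relabel.tolist())))
print("agreement after matching:", np.mean(y_matched == y))
```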

In [None]:
# Since we have class labels for the training data, we can
# initialize the GMM parameters in a supervised manner.
estimator.means_init = np.array([X_train[y_train == i].mean(axis=0)
                                    for i in range(n_classes)])

estimator.fit(X_train)

In [None]:
print(estimator.covariances_)
#print(estimator.covariances_[1][1:3,1:3])

## Plotting results
Choose the pair of axes to visualize.

In [None]:
# for K clusters, specify K colors  (here K=3)
colors = ['navy', 'turquoise', 'darkorange']

fig, ax = plt.subplots(subplot_kw={'aspect': 'equal'})

axes = [1, 3]  # indices of the two features to plot, each between 0 and 3
for k in range(n_classes):
    data = iris.data[iris.target == k]
    plt.scatter(data[:, axes[0]], data[:, axes[1]], s=10, color=colors[k],
                label=iris.target_names[k])

    # ellipse parameters from the eigen-decomposition of the 2x2 sub-covariance
    covariances = estimator.covariances_[k][np.ix_(axes, axes)]
    Est_means = estimator.means_[k, axes]
    v, w = np.linalg.eigh(covariances)
    u = w[:, 0] / np.linalg.norm(w[:, 0])  # eigenvectors are the columns of w
    angle = np.arctan2(u[1], u[0])
    angle = 180 * angle / np.pi  # convert to degrees
    v = 2. * np.sqrt(2.) * np.sqrt(v)  # axis lengths from the eigenvalues
    # angle is passed by keyword, as recent matplotlib versions require
    ell = mpl.patches.Ellipse(Est_means, v[0], v[1],
                              angle=180 + angle, color=colors[k])
    # plot the ellipse
    ell.set_clip_box(ax.bbox)
    ell.set_alpha(0.5)
    ax.add_artist(ell)
    ax.set_aspect('auto')
     
        
# Equivalent slices for each feature pair (features numbered 1 to 4):
# features 1 and 2: covariances = estimator.covariances_[k][0:2, 0:2]
# features 1 and 3: covariances = estimator.covariances_[k][0::2, 0::2]
# features 1 and 4: covariances = estimator.covariances_[k][0::3, 0::3]
# features 2 and 3: covariances = estimator.covariances_[k][1:3, 1:3]
# features 2 and 4: covariances = estimator.covariances_[k][1::2, 1::2]
# features 3 and 4: covariances = estimator.covariances_[k][2:, 2:]
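
The sketch below checks, on a stand-in matrix, that indexing with `np.ix_` extracts the same 2x2 block as the slices listed in the comments. The matrix `cov` is random and only plays the role of one entry of `estimator.covariances_`; it is an assumption made so the example runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
cov = A @ A.T  # symmetric 4x4 matrix standing in for one covariances_[k]

axes = [1, 3]
sub = cov[np.ix_(axes, axes)]  # rows 1 and 3, columns 1 and 3

# Same 2x2 block written element by element
expected = np.array([[cov[1, 1], cov[1, 3]],
                     [cov[3, 1], cov[3, 3]]])
assert np.allclose(sub, expected)

# For this particular pair, the strided slice selects the same entries
assert np.allclose(sub, cov[1::2, 1::2])
print(sub)
```

The `np.ix_` form works for any pair of indices, whereas each slice has to be written by hand for its pair.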



In [None]:
import itertools
from scipy import linalg
from sklearn import mixture

lowest_bic = np.inf
bic = []
aic = []
n_components_range = range(2, 7)
cv_type = 'full'

    
for n_comp in n_components_range:
    # Fit a Gaussian mixture with EM
    gmm = GaussianMixture(n_components=n_comp,
                          covariance_type=cv_type, max_iter=1000, random_state=1)
    gmm.fit(X_train)
    bic.append(gmm.bic(X_train))
    aic.append(gmm.aic(X_train))
bic = np.array(bic)
aic = np.array(aic)

# Plot the BIC (blue) and AIC (red) scores against the number of components
n_values = list(n_components_range)
plt.plot(n_values, bic, 'b', label='BIC')
plt.plot(n_values, aic, 'r', label='AIC')
plt.xlabel('number of components')
plt.legend()

print("bic = {}".format(bic))
print("aic = {}".format(aic))
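
For both criteria, lower values indicate a better trade-off between fit and model complexity. A minimal sketch of how the number of components could be read off the BIC curve is given below; `best_n` is an illustrative name, and the settings mirror the cell above.

```python
import numpy as np
from sklearn import datasets
from sklearn.mixture import GaussianMixture

X = datasets.load_iris().data
n_components_range = range(2, 7)

# One BIC value per candidate number of components
bic = np.array([
    GaussianMixture(n_components=n, covariance_type="full",
                    max_iter=1000, random_state=1).fit(X).bic(X)
    for n in n_components_range
])

# The candidate with the lowest BIC is the preferred model
best_n = list(n_components_range)[int(np.argmin(bic))]
print("n_components with lowest BIC:", best_n)
```

The selected value depends on the covariance type and on the random initialization, so it is worth repeating the search with other settings before drawing conclusions.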


### Exercise
Play with the covariance type: see https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html

### If there is time...
What about PCA?