# Clustering

# 2. K-means vs Gaussian Mixture

In this lab, you will compare K-means and the Gaussian Mixture Model (GMM) on the [Iris dataset](https://fr.wikipedia.org/wiki/Iris_de_Fisher). Feel free to use any dataset of your choice.

## Imports

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [None]:
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

In [None]:
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score, silhouette_score
from sklearn.metrics.cluster import contingency_matrix

## Data

In [None]:
iris = datasets.load_iris()
X = iris.data  
y = iris.target
feature_names = iris.feature_names
label_names = iris.target_names

In [None]:
feature_names

In [None]:
print(label_names)

In [None]:
def show_data(X, y, features=(0, 1), feature_names=feature_names):
    '''Display the samples in 2D, one color per label.'''
    plt.figure(figsize=(5, 5))
    # np.unique returns the labels in sorted order, so colors are assigned consistently
    for label in np.unique(y):
        plt.scatter(X[y == label, features[0]], X[y == label, features[1]])
    plt.xlabel(feature_names[features[0]])
    plt.ylabel(feature_names[features[1]])
    plt.show()

In [None]:
show_data(X, y, [0, 1])

In [None]:
show_data(X, y, [2, 3])

## K-means

We first apply K-means and display the clusters.

In [None]:
n_clusters = 3
km = KMeans(n_clusters, n_init=10)
labels = km.fit_predict(X)

In [None]:
show_data(X, labels, [0, 1])

In [None]:
show_data(X, labels, [2, 3])

## Gaussian Mixture Model

We now compare with the Gaussian Mixture Model.

In [None]:
gm = GaussianMixture(n_components=n_clusters)
labels_ = gm.fit_predict(X)

In [None]:
show_data(X, labels_, [0, 1])

In [None]:
show_data(X, labels_, [2, 3])

## To do

* Display the contingency matrix for each clustering. Which one looks better?
* Confirm your guess using the ARI and AMI scores.
* Check that the optimal number of clusters is 3.
* Is it possible to guess the optimal number of clusters with the silhouette scores? Interpret the results.
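
The exercises above can be approached along the following lines. This is only a sketch of one possible starting point, not a reference solution: the `random_state=0` seed, the dictionary names `models`, `scores` and `silhouettes`, and the range of cluster counts tried are all assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score, silhouette_score
from sklearn.metrics.cluster import contingency_matrix

iris = datasets.load_iris()
X, y = iris.data, iris.target

# random_state=0 is an arbitrary choice, used only to make the run repeatable
models = {
    'K-means': KMeans(n_clusters=3, n_init=10, random_state=0),
    'GMM': GaussianMixture(n_components=3, random_state=0),
}

scores = {}
for name, model in models.items():
    pred = model.fit_predict(X)
    # Rows are the true species, columns are the predicted clusters
    table = contingency_matrix(y, pred)
    scores[name] = (adjusted_rand_score(y, pred), adjusted_mutual_info_score(y, pred))
    print(name)
    print(table)
    print('ARI = %.3f, AMI = %.3f' % scores[name])

# Silhouette score of K-means for several numbers of clusters (needs at least 2 clusters)
silhouettes = {}
for k in range(2, 8):
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouettes[k] = silhouette_score(X, pred)
    print(k, round(silhouettes[k], 3))
```

Note that ARI and AMI use the true labels `y`, whereas the silhouette score only uses the data `X` and the predicted clusters, so the two kinds of score need not agree on the best number of clusters.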

## Variants

Some constraints can be added on the covariance matrices so as to make the model simpler and less prone to overfitting.

In [None]:
gm = GaussianMixture(n_components=n_clusters, covariance_type='spherical')
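
As a possible way to explore these constraints, the sketch below loops over the four values accepted by `covariance_type` in scikit-learn (`'full'`, `'tied'`, `'diag'`, `'spherical'`) and records the ARI of each fit. The `random_state=0` seed and the `results` dictionary are assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

results = {}
for cov_type in ['full', 'tied', 'diag', 'spherical']:
    # random_state=0 is an arbitrary choice, used only to make the run repeatable
    gm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=0)
    pred = gm.fit_predict(X)
    results[cov_type] = adjusted_rand_score(y, pred)
    # The shape of covariances_ shows how many parameters each constraint leaves free
    print(cov_type, gm.covariances_.shape, round(results[cov_type], 3))
```

The shape of `covariances_` goes from one full matrix per component (`'full'`) down to a single variance per component (`'spherical'`), which makes the trade-off between flexibility and simplicity concrete.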

## To do 

* Test the various types of covariance matrices and interpret the results.
* Redo the experiments with a scaling factor of 10 on one of the 4 features and interpret the results.

In [None]:
X_scale = X.copy()
X_scale[:, 0] *= 10
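
One way to rerun the comparison on the rescaled data is sketched below. It is a minimal, self-contained example under stated assumptions: the `random_state=0` seed and the `comparison` dictionary are hypothetical choices, and only the ARI is reported.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Multiply the first feature by 10, leaving the other three unchanged
X_scale = X.copy()
X_scale[:, 0] *= 10

comparison = {}
for data_name, data in [('original', X), ('scaled', X_scale)]:
    # random_state=0 is an arbitrary choice, used only to make the run repeatable
    km_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)
    gm_pred = GaussianMixture(n_components=3, random_state=0).fit_predict(data)
    comparison[data_name] = {
        'K-means': adjusted_rand_score(y, km_pred),
        'GMM': adjusted_rand_score(y, gm_pred),
    }
    print(data_name, comparison[data_name])
```

Comparing the two rows of the output shows how sensitive each method is to the units of a single feature.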