# Clustering

# 1. Metrics

In this lab, you will use and interpret clustering metrics on the [Iris dataset](https://fr.wikipedia.org/wiki/Iris_de_Fisher). Feel free to use any dataset of your choice.

## Imports

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [None]:
from sklearn import datasets
from sklearn.cluster import KMeans

In [None]:
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

## Data

In [None]:
iris = datasets.load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
label_names = iris.target_names

In [None]:
feature_names

In [None]:
print(label_names)

In [None]:
def show_data(X, y, features=(0, 1), feature_names=feature_names):
    '''Display the samples in 2D, one color per label.'''
    plt.figure(figsize=(5, 5))
    # np.unique gives the labels in a deterministic, sorted order
    for label in np.unique(y):
        mask = (y == label)
        plt.scatter(X[mask, features[0]], X[mask, features[1]])
    plt.xlabel(feature_names[features[0]])
    plt.ylabel(feature_names[features[1]])
    plt.show()

In [None]:
show_data(X, y, [0, 1])

In [None]:
show_data(X, y, [2, 3])

## K-means

Let's apply k-means and display the clusters.

In [None]:
km = KMeans(n_clusters=3, n_init=10)
labels = km.fit_predict(X)

In [None]:
show_data(X, labels, [0, 1])

In [None]:
show_data(X, labels, [2, 3])

## Silhouette

We first try to assess the quality of the clustering using the silhouette score. Here we do not use the ground-truth labels.
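
As a reminder, the silhouette of a sample compares how close it is to its own cluster with how close it is to the nearest other cluster. The notation below ($a(i)$, $b(i)$, $s(i)$) is introduced here for illustration and is not part of the original lab text:

$$
s(i) = \frac{b(i) - a(i)}{\max\bigl(a(i),\, b(i)\bigr)}
$$

Here $a(i)$ is the mean distance from sample $i$ to the other samples of its own cluster, and $b(i)$ is the smallest mean distance from sample $i$ to the samples of any other cluster. The value lies in $[-1, 1]$: close to 1 means the sample is well inside its cluster, close to 0 means it sits between two clusters, and a negative value suggests it may be assigned to the wrong cluster. The silhouette score of a clustering is the average of $s(i)$ over all samples.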

## To do

* Compute the silhouette of each sample.
* What are the 3 samples of lowest silhouette? What are their clusters?
* Display the silhouette distribution of each cluster using Seaborn (check ``sns.kdeplot``).
* What are the worst clusters in terms of silhouette?
* Compute the average silhouette as the number of clusters grows from 2 to 6. What is the optimal number of clusters in terms of average silhouette?
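
The numerical steps above can be sketched as follows. This is one possible starting point rather than a reference solution: the fixed `random_state=0` and the names `sil`, `worst`, `avg_sil` and `best_k` are assumptions added here for reproducibility and readability.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = datasets.load_iris().data

# Fixed seed (an assumption, not in the lab) so that the run is reproducible.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

# One silhouette value per sample, each in [-1, 1].
sil = silhouette_samples(X, labels)

# Indices of the 3 samples with the lowest silhouette, and their clusters.
worst = np.argsort(sil)[:3]
for i in worst:
    print(f"sample {i}: silhouette={sil[i]:.3f}, cluster={labels[i]}")

# Mean silhouette per cluster: a low mean flags a poorly separated cluster.
for k in np.unique(labels):
    print(f"cluster {k}: mean silhouette={sil[labels == k].mean():.3f}")

# Average silhouette when the number of clusters grows from 2 to 6.
avg_sil = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    avg_sil[k] = silhouette_score(X, labels_k)

best_k = max(avg_sil, key=avg_sil.get)
print(avg_sil)
print("best number of clusters by average silhouette:", best_k)
```

For the distribution plot, one option is to call `sns.kdeplot(sil[labels == k])` once per cluster `k` on the same axes.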

## Contingency matrix

We now use the ground-truth labels. First, we compute and display the contingency matrix.

In [None]:
n_clusters = 3
km = KMeans(n_clusters, n_init=10)
labels = km.fit_predict(X)

In [None]:
contingency = contingency_matrix(y, labels)

In [None]:
sns.heatmap(contingency, annot=True, square=True, xticklabels=np.arange(n_clusters), yticklabels=label_names);
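
To read the matrix: entry `(i, j)` counts the samples of true class `i` that were assigned to cluster `j`, so each row sums to the size of a class and each column to the size of a cluster. The sketch below shows one way to extract the majority class of each cluster. The fixed `random_state=0` and the names `majority` and `agreement` are assumptions made for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import contingency_matrix

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Fixed seed (an assumption, not in the lab) so that the run is reproducible.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Rows are true classes, columns are clusters.
contingency = contingency_matrix(y, labels)

# Majority class of each cluster: row index of the largest count per column.
majority = contingency.argmax(axis=0)
for cluster, cls in enumerate(majority):
    print(f"cluster {cluster} -> {iris.target_names[cls]}")

# Fraction of samples that belong to the majority class of their cluster.
agreement = contingency.max(axis=0).sum() / contingency.sum()
print(f"agreement with majority classes: {agreement:.3f}")
```

Cluster indices are arbitrary, so the matrix is generally not diagonal even when the clustering matches the classes well.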

## Metrics

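Second, we use metrics that compare the clustering with the ground-truth labels (average F1 score, adjusted Rand index (ARI) and adjusted mutual information (AMI)) to find the optimal number of clusters.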

## To do

* Plot the ARI and AMI scores with respect to the number of clusters.
* What is the optimal number of clusters?
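
A possible starting point for these two questions is sketched below. The range of 2 to 6 clusters (reused from the silhouette exercise), the fixed `random_state=0` and the names `ks`, `ari` and `ami` are assumptions, not part of the lab statement.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Candidate numbers of clusters (assumed range, same as for the silhouette).
ks = list(range(2, 7))
ari, ami = [], []
for k in ks:
    # Fixed seed (an assumption) so that the scores are reproducible.
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Both scores compare the clustering with the ground-truth labels y.
    ari.append(adjusted_rand_score(y, labels_k))
    ami.append(adjusted_mutual_info_score(y, labels_k))

for k, r, m in zip(ks, ari, ami):
    print(f"k={k}: ARI={r:.3f}, AMI={m:.3f}")
```

Both scores equal 1 for a clustering identical to the labels (up to a permutation of cluster indices) and are close to 0 for a random assignment. The curves can then be drawn with `plt.plot(ks, ari)` and `plt.plot(ks, ami)`.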