# 🚀 Google Colab Setup

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ogautier1980/sandbox-ml/blob/main/cours/05_apprentissage_non_supervise/05_demo_clustering.ipynb)

**If you are running this notebook on Google Colab**, run the following cell to install the dependencies.

In [None]:
# Install dependencies (Google Colab only)
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('📦 Installing packages...')

    # Core ML packages
    !pip install -q numpy pandas matplotlib seaborn scikit-learn

    # Detect the chapter and install its specific dependencies
    notebook_name = '05_demo_clustering.ipynb'  # Will be replaced automatically

    # Ch 06-08: Deep Learning
    if any(x in notebook_name for x in ['06_', '07_', '08_']):
        !pip install -q torch torchvision torchaudio

    # Ch 08: NLP
    if '08_' in notebook_name:
        !pip install -q transformers datasets tokenizers
        if 'rag' in notebook_name:
            !pip install -q sentence-transformers faiss-cpu rank-bm25

    # Ch 09: Reinforcement Learning
    if '09_' in notebook_name:
        !pip install -q gymnasium[classic-control]

    # Ch 04: Boosting
    if '04_' in notebook_name and 'boosting' in notebook_name:
        !pip install -q xgboost lightgbm catboost

    # Ch 05: Advanced clustering
    if '05_' in notebook_name:
        !pip install -q umap-learn

    # Ch 11: Time series
    if '11_' in notebook_name:
        !pip install -q statsmodels prophet

    # Ch 12: Advanced vision
    if '12_' in notebook_name:
        !pip install -q ultralytics timm segmentation-models-pytorch

    # Ch 13: Recommendation
    if '13_' in notebook_name:
        !pip install -q scikit-surprise implicit

    # Ch 14: MLOps
    if '14_' in notebook_name:
        !pip install -q mlflow fastapi pydantic

    print('✅ Installation complete!')
else:
    print('ℹ️  Local environment detected; packages are assumed to be already installed.')

# Chapter 05 - Clustering Demonstration

This notebook explores three clustering algorithms: K-Means, DBSCAN, and hierarchical clustering.

## Objectives
- Understand K-Means and choose the optimal number of clusters
- Apply DBSCAN to detect complex shapes
- Use hierarchical clustering and dendrograms
- Evaluate clustering quality

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs, make_moons, make_circles
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    silhouette_score, davies_bouldin_score, calinski_harabasz_score,
    adjusted_rand_score, normalized_mutual_info_score
)
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import cdist
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Part 1: K-Means Clustering

### Principle
- Partitions the data into k clusters
- Minimizes the within-cluster variance (the inertia)
- Iterative: assign each point to its nearest centroid, then update the centroids
- Sensitive to initialization (k-means++ seeding reduces this sensitivity)
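The assign-then-update loop listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: it assumes plain random initialization instead of k-means++, and the names `lloyd_kmeans`, `nearest_centroid`, and `X_demo` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Return, for each row of X, the index of the closest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations with random initial centroids (assumption: no k-means++)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = nearest_centroid(X, centroids)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    labels = nearest_centroid(X, centroids)
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia

# Toy data: two well-separated groups of 50 points each.
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels_demo, centroids_demo, inertia_demo = lloyd_kmeans(X_demo, k=2)
print(centroids_demo.round(2), round(float(inertia_demo), 2))
```

Because the starting centroids are random, different seeds can converge to different local minima on harder data, which is the sensitivity that k-means++ and `n_init` are meant to reduce.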

In [None]:
# 1.1 Synthetic dataset with 4 clusters
np.random.seed(42)
X_blobs, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=42)

# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_true, cmap='viridis',
            s=50, alpha=0.7, edgecolors='k')
plt.title('Dataset with 4 Clusters (true labels)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Cluster')
plt.show()

In [None]:
# 1.2 Apply K-Means with k=4
kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42)
y_pred = kmeans.fit_predict(X_blobs)
centers = kmeans.cluster_centers_

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# True labels
axes[0].scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_true, cmap='viridis',
                s=50, alpha=0.7, edgecolors='k')
axes[0].set_title('True Labels')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')

# K-Means predictions
axes[1].scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_pred, cmap='viridis',
                s=50, alpha=0.7, edgecolors='k')
axes[1].scatter(centers[:, 0], centers[:, 1], c='red', s=300, alpha=0.8,
                marker='X', edgecolors='black', linewidths=2, label='Centroids')
axes[1].set_title('K-Means Predictions (k=4)')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].legend()

plt.tight_layout()
plt.show()

print(f"Inertia (sum of squared distances to the nearest centroid): {kmeans.inertia_:.2f}")
print(f"Number of iterations: {kmeans.n_iter_}")

In [None]:
# 1.3 Elbow method for choosing k
inertias = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans_k = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    y_pred_k = kmeans_k.fit_predict(X_blobs)
    
    inertias.append(kmeans_k.inertia_)
    silhouette_scores.append(silhouette_score(X_blobs, y_pred_k))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Elbow method
axes[0].plot(K_range, inertias, 'o-', linewidth=2, markersize=8)
axes[0].axvline(x=4, color='r', linestyle='--', label='optimal k = 4')
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Inertia (Within-Cluster Sum of Squares)')
axes[0].set_title('Elbow Method')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Silhouette score
axes[1].plot(K_range, silhouette_scores, 'o-', linewidth=2, markersize=8, color='orange')
axes[1].axvline(x=4, color='r', linestyle='--', label='optimal k = 4')
axes[1].set_xlabel('Number of Clusters (k)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Score vs k')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Interpretation:")
print("- Elbow: look for the 'elbow' where inertia starts decreasing more slowly")
print("- Silhouette: closer to 1 = better-separated clusters")
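The vertical line at k=4 in the plots is hard-coded because the data was generated with four centers. On real data the candidate k has to be read from the scores. A minimal sketch of that selection, assuming the silhouette score is the only criterion (the names `X_demo`, `scores`, and `best_k` are introduced for this example), could look like this:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same kind of synthetic data as in the notebook, regenerated so the sketch is self-contained.
X_demo, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=42)

# Silhouette score for each candidate k.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Keep the k with the highest silhouette score.
best_k = max(scores, key=scores.get)
print(f"k with the highest silhouette score: {best_k}")
```

The elbow of the inertia curve is harder to automate because inertia always decreases with k, so it is usually read off the plot.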

In [None]:
# 1.4 Effect of the number of clusters k
k_values = [2, 3, 4, 6]

fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, k in enumerate(k_values):
    kmeans_k = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    y_pred_k = kmeans_k.fit_predict(X_blobs)
    centers_k = kmeans_k.cluster_centers_
    
    silhouette = silhouette_score(X_blobs, y_pred_k)
    
    axes[idx].scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_pred_k, cmap='viridis',
                      s=50, alpha=0.7, edgecolors='k')
    axes[idx].scatter(centers_k[:, 0], centers_k[:, 1], c='red', s=300, alpha=0.8,
                      marker='X', edgecolors='black', linewidths=2)
    axes[idx].set_title(f'k={k}\nSilhouette: {silhouette:.3f}, Inertia: {kmeans_k.inertia_:.1f}')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

In [None]:
# 1.5 Limitations of K-Means: non-convex shapes
X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
X_circles, _ = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)

fig, axes = plt.subplots(2, 2, figsize=(14, 12))

datasets = [('Moons', X_moons), ('Circles', X_circles)]

for row, (name, X) in enumerate(datasets):
    # Raw data
    axes[row, 0].scatter(X[:, 0], X[:, 1], s=50, alpha=0.7, edgecolors='k')
    axes[row, 0].set_title(f'{name} Dataset')
    axes[row, 0].set_xlabel('Feature 1')
    axes[row, 0].set_ylabel('Feature 2')
    
    # K-Means (k=2)
    kmeans_shape = KMeans(n_clusters=2, random_state=42)
    y_pred_shape = kmeans_shape.fit_predict(X)
    centers_shape = kmeans_shape.cluster_centers_
    
    axes[row, 1].scatter(X[:, 0], X[:, 1], c=y_pred_shape, cmap='viridis',
                         s=50, alpha=0.7, edgecolors='k')
    axes[row, 1].scatter(centers_shape[:, 0], centers_shape[:, 1], 
                         c='red', s=300, alpha=0.8, marker='X', 
                         edgecolors='black', linewidths=2)
    axes[row, 1].set_title(f'{name} - K-Means (k=2)\nFails on non-convex shapes')
    axes[row, 1].set_xlabel('Feature 1')
    axes[row, 1].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

print("Limitations of K-Means:")
print("- Assumes roughly spherical (convex) clusters")
print("- Fails on complex shapes (moons, circles, etc.)")
print("- Alternative: DBSCAN or other density-based algorithms")

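## Part 2: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

### Principle
- Density-based clustering
- Detects arbitrarily shaped clusters
- Flags outliers (noise points)
- Hyperparameters: eps (neighborhood radius) and min_samples (minimum number of points within eps, the point itself included, for a point to count as a core point)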
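The density idea above splits points into three roles: core points (dense neighborhood), border points (inside a cluster but not dense themselves), and noise. The sketch below recovers the three groups from a fitted scikit-learn `DBSCAN` using its `core_sample_indices_` and `labels_` attributes. The dataset and the mask names are chosen here for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_demo, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
db = DBSCAN(eps=0.2, min_samples=5).fit(X_demo)

# Core points: at least min_samples points (itself included) within eps.
core_mask = np.zeros(len(X_demo), dtype=bool)
core_mask[db.core_sample_indices_] = True

# Noise points: labeled -1, reachable from no core point.
noise_mask = db.labels_ == -1

# Border points: assigned to a cluster without being core themselves.
border_mask = ~core_mask & ~noise_mask

print(f"core={core_mask.sum()}, border={border_mask.sum()}, noise={noise_mask.sum()}")
```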

In [None]:
# 2.1 DBSCAN on Moons and Circles
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

datasets = [('Moons', X_moons), ('Circles', X_circles)]

for row, (name, X) in enumerate(datasets):
    # Raw data
    axes[row, 0].scatter(X[:, 0], X[:, 1], s=50, alpha=0.7, edgecolors='k')
    axes[row, 0].set_title(f'{name} Dataset')
    axes[row, 0].set_xlabel('Feature 1')
    axes[row, 0].set_ylabel('Feature 2')
    
    # K-Means
    kmeans_db = KMeans(n_clusters=2, random_state=42)
    y_km = kmeans_db.fit_predict(X)
    
    axes[row, 1].scatter(X[:, 0], X[:, 1], c=y_km, cmap='viridis',
                         s=50, alpha=0.7, edgecolors='k')
    axes[row, 1].set_title(f'{name} - K-Means (fails)')
    axes[row, 1].set_xlabel('Feature 1')
    axes[row, 1].set_ylabel('Feature 2')
    
    # DBSCAN
    dbscan = DBSCAN(eps=0.2, min_samples=5)
    y_db = dbscan.fit_predict(X)
    
    # One color per cluster; outliers (label -1) are drawn in black further down
    unique_labels = set(y_db)
    colors = plt.cm.viridis(np.linspace(0, 1, len(unique_labels) - (1 if -1 in unique_labels else 0)))
    
    for k, col in zip([l for l in unique_labels if l != -1], colors):
        class_member_mask = (y_db == k)
        axes[row, 2].scatter(X[class_member_mask, 0], X[class_member_mask, 1],
                             c=[col], s=50, alpha=0.7, edgecolors='k')
    
    # Outliers
    if -1 in unique_labels:
        outliers_mask = (y_db == -1)
        axes[row, 2].scatter(X[outliers_mask, 0], X[outliers_mask, 1],
                             c='black', s=50, alpha=0.5, edgecolors='red', 
                             linewidths=1, label='Outliers')
    
    n_clusters = len(set(y_db)) - (1 if -1 in y_db else 0)
    n_outliers = list(y_db).count(-1)
    
    axes[row, 2].set_title(f'{name} - DBSCAN (success)\nClusters: {n_clusters}, Outliers: {n_outliers}')
    axes[row, 2].set_xlabel('Feature 1')
    axes[row, 2].set_ylabel('Feature 2')
    if n_outliers > 0:
        axes[row, 2].legend()

plt.tight_layout()
plt.show()

In [None]:
# 2.2 Effect of the eps hyperparameter (min_samples is kept fixed at 5)
eps_values = [0.1, 0.2, 0.3, 0.5]

fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, eps in enumerate(eps_values):
    dbscan_eps = DBSCAN(eps=eps, min_samples=5)
    y_pred_eps = dbscan_eps.fit_predict(X_moons)
    
    n_clusters = len(set(y_pred_eps)) - (1 if -1 in y_pred_eps else 0)
    n_outliers = list(y_pred_eps).count(-1)
    
    unique_labels = set(y_pred_eps)
    colors = plt.cm.viridis(np.linspace(0, 1, max(1, len(unique_labels) - (1 if -1 in unique_labels else 0))))
    
    for k, col in zip([l for l in unique_labels if l != -1], colors):
        class_member_mask = (y_pred_eps == k)
        axes[idx].scatter(X_moons[class_member_mask, 0], X_moons[class_member_mask, 1],
                         c=[col], s=50, alpha=0.7, edgecolors='k')
    
    if -1 in unique_labels:
        outliers_mask = (y_pred_eps == -1)
        axes[idx].scatter(X_moons[outliers_mask, 0], X_moons[outliers_mask, 1],
                         c='black', s=50, alpha=0.5, edgecolors='red', linewidths=1)
    
    axes[idx].set_title(f'eps={eps}, min_samples=5\nClusters: {n_clusters}, Outliers: {n_outliers}')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

print("Effect of eps:")
print("- Small eps: more clusters, more outliers")
print("- Large eps: fewer clusters, distinct groups may merge")
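A common heuristic for narrowing down eps, not used in this notebook, is the k-distance curve: sort every point's distance to its `min_samples`-th nearest neighbor and look for the knee. The sketch below assumes scikit-learn's `NearestNeighbors` and only prints a few percentiles instead of plotting. The names `X_demo` and `k_dist` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X_demo, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
min_samples = 5

# Querying the training points returns each point as its own first neighbor
# (distance 0), which matches DBSCAN counting the point itself in min_samples.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)

# Sorted distance to the min_samples-th neighbor: the "k-distance" curve.
k_dist = np.sort(distances[:, -1])

# Points whose k-distance is below eps are core points for that eps.
print(np.percentile(k_dist, [50, 90, 99]).round(3))
```

Plotting `k_dist` and picking eps near the point where the curve bends sharply upward is a starting point, not a guarantee. The grid of eps values shown in this section remains a useful check.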

## Part 3: Hierarchical Clustering

### Principle
- Builds a hierarchy of clusters
- Agglomerative (bottom-up) or divisive (top-down)
- Linkage: single, complete, average, ward
- Visualized with a dendrogram
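The linkage criteria differ only in how they measure the distance between two clusters. As a small worked example on hand-picked points (the arrays `A` and `B` are made up for illustration), the four criteria can be computed directly with `cdist`:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two tiny clusters on a line.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise Euclidean distances between a point of A and a point of B.
D = cdist(A, B)

single = D.min()     # closest pair
complete = D.max()   # farthest pair
average = D.mean()   # mean over all pairs

# Ward criterion: increase in within-cluster sum of squares if A and B merge.
n_a, n_b = len(A), len(B)
ward_increase = n_a * n_b / (n_a + n_b) * np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)

print(single, complete, average, ward_increase)
```

At each step, agglomerative clustering merges the pair of clusters for which the chosen criterion is smallest.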

In [None]:
# 3.1 Hierarchical clustering on a Blobs dataset
# Use a small dataset so the dendrograms stay readable
n_samples = 100
X_small, y_small = make_blobs(n_samples=n_samples, centers=3,
                              cluster_std=0.8, random_state=42)

# Linkage methods to compare
linkage_methods = ['single', 'complete', 'average', 'ward']

fig, axes = plt.subplots(2, 2, figsize=(18, 14))
axes = axes.ravel()

for idx, method in enumerate(linkage_methods):
    Z = linkage(X_small, method=method)
    
    dendrogram(Z, ax=axes[idx], truncate_mode='lastp', p=20)
    axes[idx].set_title(f'Dendrogram - Linkage: {method.capitalize()}')
    axes[idx].set_xlabel('Sample index or (cluster size)')
    axes[idx].set_ylabel('Distance')

plt.tight_layout()
plt.show()

print("Linkage methods:")
print("- Single: minimum distance between points (prone to chaining, sensitive to outliers)")
print("- Complete: maximum distance between points (yields compact clusters)")
print("- Average: mean distance between points (a compromise)")
print("- Ward: minimizes the increase in within-cluster variance (a common default)")

In [None]:
# 3.2 Hierarchical clustering with AgglomerativeClustering
n_clusters = 3

fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, method in enumerate(linkage_methods):
    agg = AgglomerativeClustering(n_clusters=n_clusters, linkage=method)
    y_agg = agg.fit_predict(X_small)
    
    axes[idx].scatter(X_small[:, 0], X_small[:, 1], c=y_agg, cmap='viridis',
                      s=100, alpha=0.7, edgecolors='k')
    axes[idx].set_title(f'Agglomerative Clustering\nLinkage: {method.capitalize()}, k={n_clusters}')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

In [None]:
# 3.3 Dendrogram with a cut threshold
Z_ward = linkage(X_small, method='ward')

fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# Full dendrogram
dendrogram(Z_ward, ax=axes[0])
axes[0].set_title('Full Dendrogram (Ward)')
axes[0].set_xlabel('Index')
axes[0].set_ylabel('Distance')

# Dendrogram with a cut line
threshold = 15
dendrogram(Z_ward, ax=axes[1], color_threshold=threshold)
axes[1].axhline(y=threshold, c='red', linestyle='--', linewidth=2, label=f'Threshold={threshold}')
axes[1].set_title('Dendrogram with Cut Threshold')
axes[1].set_xlabel('Index')
axes[1].set_ylabel('Distance')
axes[1].legend()

plt.tight_layout()
plt.show()

print("Interpretation:")
print("- Merge height = dissimilarity between the merged clusters")
print("- Cutting at a given height determines the number of clusters")
print("- A large vertical gap between successive merges suggests a good place to cut")
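The dendrogram only draws the cut. To obtain actual cluster labels from a linkage matrix, SciPy provides `fcluster`. The sketch below rebuilds the same kind of small dataset so it runs on its own, and shows both a cut by height and a cut by target number of clusters. The threshold of 15 mirrors the value used in the plot and is an illustrative choice, not a tuned one.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=100, centers=3, cluster_std=0.8, random_state=42)
Z = linkage(X_demo, method='ward')

# Cut by height: merges above distance 15 are undone.
labels_by_distance = fcluster(Z, t=15, criterion='distance')

# Cut by count: ask for at most 3 flat clusters.
labels_by_count = fcluster(Z, t=3, criterion='maxclust')

# fcluster labels start at 1, not 0.
print(len(np.unique(labels_by_distance)), len(np.unique(labels_by_count)))
```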

## Part 4: Evaluating Clustering Quality

In [None]:
# 4.1 Internal metrics (no true labels needed)
k_range = range(2, 10)
silhouette_scores = []
davies_bouldin_scores = []
calinski_harabasz_scores = []

for k in k_range:
    kmeans_eval = KMeans(n_clusters=k, random_state=42)
    labels = kmeans_eval.fit_predict(X_blobs)
    
    silhouette_scores.append(silhouette_score(X_blobs, labels))
    davies_bouldin_scores.append(davies_bouldin_score(X_blobs, labels))
    calinski_harabasz_scores.append(calinski_harabasz_score(X_blobs, labels))

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Silhouette Score (higher = better)
axes[0].plot(k_range, silhouette_scores, 'o-', linewidth=2, markersize=8)
axes[0].axvline(x=4, color='r', linestyle='--', label='k=4 (true)')
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Silhouette Score')
axes[0].set_title('Silhouette Score\n(higher = better)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Davies-Bouldin Index (lower = better)
axes[1].plot(k_range, davies_bouldin_scores, 'o-', linewidth=2, markersize=8, color='orange')
axes[1].axvline(x=4, color='r', linestyle='--', label='k=4 (true)')
axes[1].set_xlabel('Number of Clusters (k)')
axes[1].set_ylabel('Davies-Bouldin Index')
axes[1].set_title('Davies-Bouldin Index\n(lower = better)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Calinski-Harabasz Index (higher = better)
axes[2].plot(k_range, calinski_harabasz_scores, 'o-', linewidth=2, markersize=8, color='green')
axes[2].axvline(x=4, color='r', linestyle='--', label='k=4 (true)')
axes[2].set_xlabel('Number of Clusters (k)')
axes[2].set_ylabel('Calinski-Harabasz Index')
axes[2].set_title('Calinski-Harabasz Index\n(higher = better)')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
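To make the silhouette score less of a black box, the sketch below recomputes it point by point from its definition, s = (b - a) / max(a, b), and compares the result with scikit-learn's `silhouette_samples`. It assumes every cluster has at least two members, which holds for this toy data. The helper `silhouette_of_point` is introduced only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples

X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.8, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)
D = pairwise_distances(X_demo)

def silhouette_of_point(i):
    own = labels == labels[i]
    # a: mean distance to the other members of the point's own cluster
    # (D[i, i] is 0, so dividing by own.sum() - 1 excludes the point itself).
    a = D[i, own].sum() / (own.sum() - 1)
    # b: smallest mean distance to the members of any other cluster.
    b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of_point(i) for i in range(len(X_demo))])
reference = silhouette_samples(X_demo, labels)
print(np.allclose(manual, reference, atol=1e-6), manual.mean().round(3))
```

The value reported by `silhouette_score` is the mean of these per-point values.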

In [None]:
# 4.2 External metrics (require true labels)
kmeans_final = KMeans(n_clusters=4, random_state=42)
y_pred_final = kmeans_final.fit_predict(X_blobs)

# Adjusted Rand Index and Normalized Mutual Information
ari = adjusted_rand_score(y_true, y_pred_final)
nmi = normalized_mutual_info_score(y_true, y_pred_final)

print("External metrics (comparison with the true labels):")
print(f"Adjusted Rand Index (ARI): {ari:.4f}")
print(f"Normalized Mutual Information (NMI): {nmi:.4f}")
print("\nInterpretation:")
print("- ARI is near 0 for random labelings (it can be negative) and 1 for a perfect match; NMI ranges from 0 to 1")
print("- ARI and NMI close to 1 = clustering very similar to the true labels")
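Two properties of these external metrics are easy to check on a toy example: they ignore how clusters are named, and ARI drops below 0 when a clustering agrees with the truth less than chance would. The label vectors below are hand-made for illustration.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_ref = np.array([0, 0, 0, 1, 1, 1])
y_perm = np.array([1, 1, 1, 0, 0, 0])  # same partition, cluster names swapped
y_bad = np.array([0, 1, 2, 0, 1, 2])   # every predicted cluster straddles both true groups

ari_perm = adjusted_rand_score(y_ref, y_perm)
nmi_perm = normalized_mutual_info_score(y_ref, y_perm)
ari_bad = adjusted_rand_score(y_ref, y_bad)
nmi_bad = normalized_mutual_info_score(y_ref, y_bad)

print(f"permuted labels: ARI={ari_perm:.3f}, NMI={nmi_perm:.3f}")
print(f"discordant labels: ARI={ari_bad:.3f}, NMI={nmi_bad:.3f}")
```

This is why the K-Means predictions can be compared with `y_true` directly, even though K-Means numbers its clusters arbitrarily.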

## Summary

### K-Means

**Advantages:**
- Simple and fast
- Works well for spherical/convex clusters
- Scales to large datasets

**Drawbacks:**
- Requires choosing k in advance
- Sensitive to initialization
- Assumes roughly spherical clusters
- Sensitive to outliers

**Choosing k:**
- Elbow method
- Silhouette Score
- Davies-Bouldin Index

### DBSCAN

**Advantages:**
- Detects arbitrarily shaped clusters
- Identifies outliers
- No need to specify k
- Robust to noise

**Drawbacks:**
- Choosing eps and min_samples is delicate
- Struggles when clusters have different densities
- Less effective in high dimensions

### Hierarchical Clustering

**Advantages:**
- Produces a complete hierarchy
- The dendrogram makes interpretation easier
- No need to fix k in advance
- Deterministic

**Drawbacks:**
- High computational cost (O(n²) or O(n³))
- Does not scale to large datasets
- Greedy: a merge cannot be undone later

### Evaluation Metrics

**Internal (no labels needed):**
- Silhouette Score: [-1, 1], higher = better
- Davies-Bouldin Index: [0, ∞), lower = better
- Calinski-Harabasz Index: [0, ∞), higher = better

**External (require labels):**
- Adjusted Rand Index (ARI): [-0.5, 1], about 0 = random, 1 = perfect
- Normalized Mutual Information (NMI): [0, 1], 1 = perfect

### Which algorithm to use when?

**K-Means:**
- Spherical clusters
- k known or easy to estimate
- Large datasets

**DBSCAN:**
- Complex shapes
- Outliers present
- k unknown

**Hierarchical:**
- Small to medium datasets
- A hierarchy is needed
- Data exploration
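As a rough sanity check of the guidance above, the three algorithms can be run side by side on one non-convex dataset and scored against the known labels. This is a sketch under illustrative settings (two moons with low noise, `eps=0.2`, Ward linkage), not a benchmark. The dictionary names are introduced for this example.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X_demo, y_demo = make_moons(n_samples=300, noise=0.05, random_state=42)

models = {
    'K-Means (k=2)': KMeans(n_clusters=2, n_init=10, random_state=42),
    'DBSCAN (eps=0.2)': DBSCAN(eps=0.2, min_samples=5),
    'Agglomerative (ward, k=2)': AgglomerativeClustering(n_clusters=2, linkage='ward'),
}

# ARI against the generating labels; DBSCAN noise points (label -1) count as their own group.
ari_by_model = {
    name: adjusted_rand_score(y_demo, model.fit_predict(X_demo))
    for name, model in models.items()
}

for name, ari in ari_by_model.items():
    print(f"{name}: ARI = {ari:.3f}")
```

On data with compact, blob-like clusters the ranking can easily reverse, which is the point of matching the algorithm to the expected cluster shape.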