# 🚀 Google Colab Setup

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ogautier1980/sandbox-ml/blob/main/cours/05_apprentissage_non_supervise/05_exercices.ipynb)

**If you are running this notebook on Google Colab**, run the following cell to install the dependencies.

In [None]:
# Install dependencies (Google Colab only)
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('📦 Installing packages...')

    # Core ML packages
    !pip install -q numpy pandas matplotlib seaborn scikit-learn

    # Detect the chapter and install its specific dependencies
    notebook_name = '05_exercices.ipynb'  # Will be replaced automatically

    # Ch 06-08: Deep Learning
    if any(x in notebook_name for x in ['06_', '07_', '08_']):
        !pip install -q torch torchvision torchaudio

    # Ch 08: NLP
    if '08_' in notebook_name:
        !pip install -q transformers datasets tokenizers
        if 'rag' in notebook_name:
            !pip install -q sentence-transformers faiss-cpu rank-bm25

    # Ch 09: Reinforcement Learning
    if '09_' in notebook_name:
        !pip install -q gymnasium[classic-control]

    # Ch 04: Boosting
    if '04_' in notebook_name and 'boosting' in notebook_name:
        !pip install -q xgboost lightgbm catboost

    # Ch 05: Advanced clustering
    if '05_' in notebook_name:
        !pip install -q umap-learn

    # Ch 11: Time series
    if '11_' in notebook_name:
        !pip install -q statsmodels prophet

    # Ch 12: Advanced vision
    if '12_' in notebook_name:
        !pip install -q ultralytics timm segmentation-models-pytorch

    # Ch 13: Recommendation
    if '13_' in notebook_name:
        !pip install -q scikit-surprise implicit

    # Ch 14: MLOps
    if '14_' in notebook_name:
        !pip install -q mlflow fastapi pydantic

    print('✅ Installation complete!')
else:
    print('ℹ️  Local environment detected; packages are already installed.')

# Chapter 05 - Unsupervised Learning Exercises

Hands-on exercises on clustering, dimensionality reduction, and anomaly detection.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs, load_wine
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.metrics import silhouette_score
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Exercise 1: Customer Segmentation (K-Means + PCA)

**Objective**: Segment the Wine dataset, whose samples stand in for customers.

**Instructions**:
1. Load the Wine dataset
2. Apply K-Means with several values of k
3. Visualize the result with a 2D PCA
4. Interpret the segments

In [None]:
# 1. Loading
wine = load_wine()
X_wine = wine.data
y_wine = wine.target

# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_wine)

print(f"Shape: {X_wine.shape}")
print(f"Features: {wine.feature_names}")

In [None]:
# 2. Elbow method
inertias = []
silhouettes = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    inertias.append(kmeans.inertia_)
    silhouettes.append(silhouette_score(X_scaled, labels))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(K_range, inertias, 'o-', linewidth=2)
axes[0].set_xlabel('k')
axes[0].set_ylabel('Inertia')
axes[0].set_title('Elbow Method')
axes[0].grid(True, alpha=0.3)

axes[1].plot(K_range, silhouettes, 'o-', linewidth=2, color='orange')
axes[1].set_xlabel('k')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Score')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

optimal_k = K_range[np.argmax(silhouettes)]
print(f"Optimal k (highest silhouette): {optimal_k}")

In [None]:
# 3. Clustering and PCA visualization
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

# PCA 2D
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# True labels
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y_wine, cmap='viridis',
                s=80, alpha=0.7, edgecolors='k')
axes[0].set_title('True Labels')
axes[0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
axes[0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')

# K-Means clusters
axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis',
                s=80, alpha=0.7, edgecolors='k')
centers_pca = pca.transform(kmeans.cluster_centers_)
axes[1].scatter(centers_pca[:, 0], centers_pca[:, 1], c='red', s=300,
                alpha=0.8, marker='X', edgecolors='black', linewidths=2)
axes[1].set_title(f'K-Means Clusters (k={optimal_k})')
axes[1].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
axes[1].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')

plt.tight_layout()
plt.show()

In [None]:
# 4. Interpreting the segments
df = pd.DataFrame(X_wine, columns=wine.feature_names)
df['Cluster'] = clusters

print("Mean profile of each cluster:")
print(df.groupby('Cluster').mean().round(2))

print(f"\nCluster sizes: {np.bincount(clusters)}")
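
Because Wine ships with cultivar labels, the segments can also be checked against them. The sketch below is an addition to the exercise: it assumes k=3 (the number of cultivars) rather than the silhouette-selected `optimal_k`, and it uses `adjusted_rand_score` and `pd.crosstab`, which the notebook does not otherwise use. It also reruns K-Means on the raw features, to illustrate why standardization matters here.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_raw, y = wine.data, wine.target
X_std = StandardScaler().fit_transform(X_raw)


def cluster(X, k=3):
    """Run K-Means with k clusters (assumption: k=3, one per cultivar)."""
    return KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)


labels_raw = cluster(X_raw)
labels_std = cluster(X_std)

# ARI is 1.0 for a perfect match and close to 0 for random assignments
ari_raw = adjusted_rand_score(y, labels_raw)
ari_std = adjusted_rand_score(y, labels_std)
print(f"ARI, raw features:          {ari_raw:.3f}")
print(f"ARI, standardized features: {ari_std:.3f}")

# Contingency table: rows are cultivars, columns are K-Means clusters
contingency = pd.crosstab(pd.Series(y, name="cultivar"),
                          pd.Series(labels_std, name="cluster"))
print(contingency)
```

The adjusted Rand index ignores how clusters are numbered, so it can be compared directly across runs. The contingency table shows which cultivars each cluster mixes.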

## Exercise 2: Anomaly Detection

**Objective**: Detect outliers with Isolation Forest and One-Class SVM.

**Instructions**:
1. Generate a dataset with outliers
2. Apply Isolation Forest
3. Apply One-Class SVM
4. Compare the results

In [None]:
# 1. Generate a dataset with outliers
np.random.seed(42)
X_normal, _ = make_blobs(n_samples=300, centers=1, cluster_std=0.5)
X_outliers = np.random.uniform(low=-6, high=6, size=(20, 2))
X_anomaly = np.vstack([X_normal, X_outliers])
y_true = np.array([1]*300 + [-1]*20)  # 1=normal, -1=outlier

plt.figure(figsize=(10, 6))
plt.scatter(X_normal[:, 0], X_normal[:, 1], label='Normal', alpha=0.6)
plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='red', label='Outliers', 
            s=100, marker='x', linewidths=2)
plt.title('Dataset with Outliers')
plt.legend()
plt.show()

In [None]:
# 2. Isolation Forest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
y_pred_iso = iso_forest.fit_predict(X_anomaly)

# 3. One-Class SVM
ocsvm = OneClassSVM(nu=0.1, kernel='rbf', gamma='auto')
y_pred_svm = ocsvm.fit_predict(X_anomaly)

# 4. Visualisation
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

methods = [
    ('Ground Truth', y_true),
    ('Isolation Forest', y_pred_iso),
    ('One-Class SVM', y_pred_svm)
]

for idx, (name, y_pred) in enumerate(methods):
    normal_mask = (y_pred == 1)
    outlier_mask = (y_pred == -1)
    
    axes[idx].scatter(X_anomaly[normal_mask, 0], X_anomaly[normal_mask, 1],
                      label='Normal', alpha=0.6, s=50)
    axes[idx].scatter(X_anomaly[outlier_mask, 0], X_anomaly[outlier_mask, 1],
                      c='red', label='Outliers', s=100, marker='x', linewidths=2)
    axes[idx].set_title(f'{name}\n{outlier_mask.sum()} outliers detected')
    axes[idx].legend()

plt.tight_layout()
plt.show()

# Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

for name, y_pred in [('Isolation Forest', y_pred_iso), ('One-Class SVM', y_pred_svm)]:
    print(f"\n{name}:")
    print(f"  Accuracy: {accuracy_score(y_true, y_pred):.3f}")
    print(f"  Precision: {precision_score(y_true, y_pred, pos_label=-1):.3f}")
    print(f"  Recall: {recall_score(y_true, y_pred, pos_label=-1):.3f}")
    print(f"  F1: {f1_score(y_true, y_pred, pos_label=-1):.3f}")
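
Both detectors above are evaluated at a fixed threshold (`contamination=0.1`, `nu=0.1`), while only 20 of the 320 points (6.25%) are true outliers. A threshold-free comparison ranks points by anomaly score instead. The sketch below is an addition, not part of the exercise. It regenerates similar data with the blob fixed at the origin (an assumption, so that the example is self-contained) and compares the two models with ROC AUC.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.svm import OneClassSVM

# Assumption: blob centered at the origin so the data is reproducible
rng = np.random.RandomState(42)
X_normal, _ = make_blobs(n_samples=300, centers=[[0.0, 0.0]],
                         cluster_std=0.5, random_state=42)
X_outliers = rng.uniform(low=-6, high=6, size=(20, 2))
X = np.vstack([X_normal, X_outliers])
is_outlier = np.array([0] * 300 + [1] * 20)  # 1 = outlier, for ROC AUC

iso = IsolationForest(random_state=42).fit(X)
svm = OneClassSVM(nu=0.1, kernel="rbf", gamma="auto").fit(X)

# decision_function is higher for inliers, so negate it to get an anomaly score
auc_iso = roc_auc_score(is_outlier, -iso.decision_function(X))
auc_svm = roc_auc_score(is_outlier, -svm.decision_function(X))

print(f"Isolation Forest ROC AUC: {auc_iso:.3f}")
print(f"One-Class SVM ROC AUC:    {auc_svm:.3f}")
```

ROC AUC measures how well each model ranks outliers above normal points, independently of the threshold implied by `contamination` or `nu`.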

## Exercise 3: Complete Unsupervised Pipeline

**Objective**: Combine PCA + Clustering + Visualization.

**Instructions**:
1. Generate a high-dimensional dataset
2. Reduce it with PCA (95% of the variance)
3. Cluster with K-Means and DBSCAN
4. Visualize with t-SNE

In [None]:
# 1. High-dimensional dataset
from sklearn.datasets import make_classification

X_high, _ = make_classification(n_samples=500, n_features=50, n_informative=20,
                                n_redundant=10, n_clusters_per_class=2, random_state=42)

scaler_high = StandardScaler()
X_high_scaled = scaler_high.fit_transform(X_high)

print(f"Original shape: {X_high.shape}")

In [None]:
# 2. PCA keeping 95% of the variance
pca_95 = PCA(n_components=0.95)
X_pca_95 = pca_95.fit_transform(X_high_scaled)

print(f"Components for 95% variance: {pca_95.n_components_}")
print(f"Explained variance: {pca_95.explained_variance_ratio_.sum():.1%}")
print(f"Reduction: {X_high.shape[1]} → {X_pca_95.shape[1]}")
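
Passing a float to `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below reproduces that choice by hand from a full PCA, on data generated with the same parameters as the exercise. It is meant as a sanity check, not as a description of scikit-learn's internals.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=500, n_features=50, n_informative=20,
                           n_redundant=10, n_clusters_per_class=2,
                           random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Fit every component, then accumulate the explained-variance ratios
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.95
n_manual = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

pca_95 = PCA(n_components=0.95).fit(X_scaled)
print(f"Manual count: {n_manual}")
print(f"PCA(n_components=0.95): {pca_95.n_components_}")
```

Plotting `cumulative` against the component index gives the usual cumulative-variance curve, which makes it easy to see how the component count would change for another threshold such as 0.90 or 0.99.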

In [None]:
# 3. Clustering
# K-Means
kmeans_high = KMeans(n_clusters=4, random_state=42, n_init=10)
clusters_kmeans = kmeans_high.fit_predict(X_pca_95)

# DBSCAN
dbscan_high = DBSCAN(eps=3, min_samples=10)
clusters_dbscan = dbscan_high.fit_predict(X_pca_95)

print(f"K-Means: {len(np.unique(clusters_kmeans))} clusters")
print(f"DBSCAN: {len(set(clusters_dbscan)) - (1 if -1 in clusters_dbscan else 0)} clusters")
print(f"DBSCAN outliers: {list(clusters_dbscan).count(-1)}")
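
The `eps=3` value above is hard-coded, and DBSCAN is sensitive to it, especially in a PCA space that still has dozens of dimensions. One common heuristic, not used in the notebook, is the k-distance curve: sort every point's distance to its `min_samples`-th nearest neighbor and look for the knee. The sketch below takes the 90th percentile as a crude stand-in for reading the knee off a plot. That percentile is an arbitrary assumption, and the data is regenerated so that the example is self-contained.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=500, n_features=50, n_informative=20,
                           n_redundant=10, n_clusters_per_class=2,
                           random_state=42)
X_red = PCA(n_components=0.95).fit_transform(StandardScaler().fit_transform(X))

min_samples = 10

# Querying the training data returns each point as its own first neighbor,
# so the last column is the radius that contains min_samples points
# (the point itself included), which matches DBSCAN's core-point rule.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_red)
distances, _ = nn.kneighbors(X_red)
k_dist = np.sort(distances[:, -1])

# Assumption: 90th percentile instead of locating the knee visually
eps_guess = float(np.percentile(k_dist, 90))

labels = DBSCAN(eps=eps_guess, min_samples=min_samples).fit_predict(X_red)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print(f"eps guess: {eps_guess:.2f}")
print(f"DBSCAN: {n_clusters} clusters, {n_noise} noise points")
```

With this choice at least 90% of the points are core points, so noise is capped at roughly 10%. That says nothing about whether the resulting clusters are meaningful; in high dimensions DBSCAN may still merge everything into a single cluster.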

In [None]:
# 4. t-SNE visualization
tsne = TSNE(n_components=2, random_state=42, n_jobs=-1)
X_tsne = tsne.fit_transform(X_pca_95)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# K-Means
axes[0].scatter(X_tsne[:, 0], X_tsne[:, 1], c=clusters_kmeans, cmap='viridis',
                s=50, alpha=0.7, edgecolors='k', linewidths=0.5)
axes[0].set_title('K-Means Clustering (t-SNE visualization)')
axes[0].set_xlabel('t-SNE 1')
axes[0].set_ylabel('t-SNE 2')

# DBSCAN
unique_labels = set(clusters_dbscan)
colors = plt.cm.viridis(np.linspace(0, 1, len(unique_labels) - (1 if -1 in unique_labels else 0)))

for k, col in zip([l for l in unique_labels if l != -1], colors):
    class_member_mask = (clusters_dbscan == k)
    axes[1].scatter(X_tsne[class_member_mask, 0], X_tsne[class_member_mask, 1],
                    c=[col], s=50, alpha=0.7, edgecolors='k', linewidths=0.5)

if -1 in unique_labels:
    outliers_mask = (clusters_dbscan == -1)
    axes[1].scatter(X_tsne[outliers_mask, 0], X_tsne[outliers_mask, 1],
                    c='black', s=50, alpha=0.5, marker='x', linewidths=2, label='Outliers')
    axes[1].legend()

axes[1].set_title('DBSCAN Clustering (t-SNE visualization)')
axes[1].set_xlabel('t-SNE 1')
axes[1].set_ylabel('t-SNE 2')

plt.tight_layout()
plt.show()

## Summary

### Key points

1. **Customer Segmentation (K-Means + PCA)**
   - Standardization is essential
   - Elbow + silhouette to choose k
   - PCA for 2D visualization
   - Interpret the segments through their feature means

2. **Anomaly Detection**
   - Isolation Forest: fast, few parameters
   - One-Class SVM: can be more robust, but slower to train
   - `contamination` / `nu`: expected proportion of outliers (for `nu`, an upper bound on the fraction of training points treated as outliers)
   - Metrics: precision, recall, F1

3. **Unsupervised Pipeline**
   - PCA first, to reduce dimensionality
   - Then clustering in the reduced space
   - t-SNE only for the final visualization
   - Do not trust distances in a t-SNE embedding

### Recommended workflow

1. **Preparation**
   - Cleaning and standardization
   - Exploratory PCA

2. **Clustering**
   - Try several algorithms
   - Tune the hyperparameters
   - Validate with internal metrics

3. **Interpretation**
   - 2D/3D visualization
   - Cluster profiles
   - Business validation
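
The clustering step of the workflow can be sketched as a small comparison loop. The choice of algorithms (K-Means and Ward agglomerative clustering), the fixed k=3, and the two internal metrics (silhouette and Davies-Bouldin) are illustrative assumptions, not recommendations taken from the exercises.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)

# Assumption: k=3 for both candidates, purely for illustration
candidates = {
    "KMeans": KMeans(n_clusters=3, random_state=42, n_init=10),
    "Agglomerative (Ward)": AgglomerativeClustering(n_clusters=3, linkage="ward"),
}

results = {}
for name, model in candidates.items():
    labels = model.fit_predict(X)
    results[name] = {
        "silhouette": silhouette_score(X, labels),          # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
    }

for name, scores in results.items():
    print(f"{name}: silhouette={scores['silhouette']:.3f}, "
          f"davies_bouldin={scores['davies_bouldin']:.3f}")
```

Internal metrics only measure compactness and separation, so a higher score does not replace the business validation listed in the last step.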