# 🚀 Google Colab Setup

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ogautier1980/sandbox-ml/blob/main/cours/04_classification_supervisee/04_exercices.ipynb)

**If you are running this notebook on Google Colab**, run the following cell to install the dependencies.

In [None]:
# Install dependencies (Google Colab only)
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('📦 Installing packages...')

    # Core ML packages
    !pip install -q numpy pandas matplotlib seaborn scikit-learn

    # Detect the chapter and install chapter-specific dependencies
    notebook_name = '04_exercices.ipynb'  # Replaced automatically

    # Ch 06-08: Deep Learning
    if any(x in notebook_name for x in ['06_', '07_', '08_']):
        !pip install -q torch torchvision torchaudio

    # Ch 08: NLP
    if '08_' in notebook_name:
        !pip install -q transformers datasets tokenizers
        if 'rag' in notebook_name:
            !pip install -q sentence-transformers faiss-cpu rank-bm25

    # Ch 09: Reinforcement Learning
    if '09_' in notebook_name:
        !pip install -q gymnasium[classic-control]

    # Ch 04: Boosting
    if '04_' in notebook_name and 'boosting' in notebook_name:
        !pip install -q xgboost lightgbm catboost

    # Ch 05: Advanced clustering
    if '05_' in notebook_name:
        !pip install -q umap-learn

    # Ch 11: Time series
    if '11_' in notebook_name:
        !pip install -q statsmodels prophet

    # Ch 12: Advanced vision
    if '12_' in notebook_name:
        !pip install -q ultralytics timm segmentation-models-pytorch

    # Ch 13: Recommendation
    if '13_' in notebook_name:
        !pip install -q scikit-surprise implicit

    # Ch 14: MLOps
    if '14_' in notebook_name:
        !pip install -q mlflow fastapi pydantic

    print('✅ Installation complete!')
else:
    print('ℹ️  Local environment detected, packages are already installed.')

# Chapter 04 - Supervised Classification Exercises

This notebook contains hands-on exercises on classification algorithms.

## Objectives
- Apply KNN, decision trees, boosting, and SVM
- Compare their performance
- Tune hyperparameters
- Diagnose problems

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits, fetch_covtype
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import xgboost as xgb
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, ConfusionMatrixDisplay
)
from time import time
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Exercise 1: Handwritten Digit Classification (Digits Dataset)

**Goal**: Classify handwritten digits (0-9) with several algorithms.

**Instructions**:
1. Load the Digits dataset (8x8 images)
2. Explore the data and visualize a few examples
3. Train KNN, Decision Tree, Random Forest, and XGBoost
4. Compare performance (accuracy, F1, time)
5. Analyze the errors with the confusion matrix
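
The comparison in step 4 reports F1 with `average='weighted'`. As a minimal sketch of what that choice means, the toy labels below (made up for illustration, not taken from Digits) show that the weighted score is the support-weighted mean of the per-class F1 values, while the macro score is their plain mean. On a roughly balanced dataset such as Digits the two stay close; they diverge when a rare class is predicted poorly.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels, invented for illustration: class 0 is frequent, class 2 is rare
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])

# One F1 value per class; zero_division=0 silences the warning for class 2,
# which is never predicted
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
support = np.bincount(y_true)  # number of true samples per class

f1_macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
f1_weighted = f1_score(y_true, y_pred, average='weighted', zero_division=0)

print("Per-class F1:", np.round(per_class, 3))
print(f"Macro F1:    {f1_macro:.3f}")
print(f"Weighted F1: {f1_weighted:.3f}")
```

Here the missed rare class pulls the macro score down much more than the weighted one, which is why the averaging mode should be reported alongside the number.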

In [None]:
# 1. Load the data
digits = load_digits()
X_digits = digits.data
y_digits = digits.target

print(f"Shape: {X_digits.shape}")
print(f"Number of classes: {len(np.unique(y_digits))}")
print(f"Class distribution:\n{np.bincount(y_digits)}")

In [None]:
# 2. Visualize a few examples
fig, axes = plt.subplots(2, 10, figsize=(16, 4))
axes = axes.ravel()

for i in range(20):
    axes[i].imshow(digits.images[i], cmap='gray')
    axes[i].set_title(f'Label: {y_digits[i]}')
    axes[i].axis('off')

plt.tight_layout()
plt.show()

In [None]:
# 3. Prepare the data
X_train_dig, X_test_dig, y_train_dig, y_test_dig = train_test_split(
    X_digits, y_digits, test_size=0.2, random_state=42, stratify=y_digits
)

# Standardize features for KNN (distance-based, so feature scale matters)
scaler_dig = StandardScaler()
X_train_dig_scaled = scaler_dig.fit_transform(X_train_dig)
X_test_dig_scaled = scaler_dig.transform(X_test_dig)

print(f"Train set: {X_train_dig.shape}")
print(f"Test set: {X_test_dig.shape}")

In [None]:
# 4. Train and compare the models
models_dig = {
    'KNN (k=5)': (KNeighborsClassifier(n_neighbors=5), True),  # True = needs scaling
    'Decision Tree': (DecisionTreeClassifier(max_depth=10, random_state=42), False),
    'Random Forest': (RandomForestClassifier(n_estimators=100, random_state=42), False),
    'XGBoost': (xgb.XGBClassifier(n_estimators=100, random_state=42, eval_metric='mlogloss'), False)
}

results_dig = []

for name, (model, needs_scaling) in models_dig.items():
    # Pick the scaled or raw features depending on the model
    X_tr = X_train_dig_scaled if needs_scaling else X_train_dig
    X_te = X_test_dig_scaled if needs_scaling else X_test_dig

    # Training
    start = time()
    model.fit(X_tr, y_train_dig)
    train_time = time() - start

    # Prediction
    start = time()
    y_pred = model.predict(X_te)
    pred_time = time() - start

    # Metrics
    results_dig.append({
        'Model': name,
        'Train Acc': model.score(X_tr, y_train_dig),
        'Test Acc': accuracy_score(y_test_dig, y_pred),
        'Precision': precision_score(y_test_dig, y_pred, average='weighted'),
        'Recall': recall_score(y_test_dig, y_pred, average='weighted'),
        'F1': f1_score(y_test_dig, y_pred, average='weighted'),
        'Train Time': train_time,
        'Pred Time': pred_time
    })

results_dig_df = pd.DataFrame(results_dig)
print("Results on the Digits dataset:")
print(results_dig_df.to_string(index=False))

In [None]:
# Visualize performance
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Accuracy
axes[0].barh(results_dig_df['Model'], results_dig_df['Test Acc'], alpha=0.7)
axes[0].set_xlabel('Accuracy')
axes[0].set_title('Test Accuracy')
axes[0].set_xlim(0.8, 1.0)

# F1 Score
axes[1].barh(results_dig_df['Model'], results_dig_df['F1'], alpha=0.7, color='orange')
axes[1].set_xlabel('F1 Score')
axes[1].set_title('F1 Score (Weighted)')
axes[1].set_xlim(0.8, 1.0)

# Train Time
axes[2].barh(results_dig_df['Model'], results_dig_df['Train Time'], alpha=0.7, color='green')
axes[2].set_xlabel('Time (s)')
axes[2].set_title('Training Time')

plt.tight_layout()
plt.show()

In [None]:
# 5. Error analysis (best model)
best_idx = results_dig_df['Test Acc'].idxmax()
best_model_name = results_dig_df.loc[best_idx, 'Model']
best_model, needs_scaling = models_dig[best_model_name]

X_te = X_test_dig_scaled if needs_scaling else X_test_dig
y_pred_best = best_model.predict(X_te)

print(f"Best model: {best_model_name}")
print("\nClassification Report:")
print(classification_report(y_test_dig, y_pred_best))

# Confusion matrix
cm = confusion_matrix(y_test_dig, y_pred_best)
# Pass an explicit Axes so the display draws on this figure instead of creating a new one
fig, ax = plt.subplots(figsize=(10, 8))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=digits.target_names)
disp.plot(cmap='Blues', ax=ax)
ax.set_title(f'Confusion Matrix - {best_model_name}')
plt.show()

# Visualize a few misclassified examples
errors = y_test_dig != y_pred_best
X_errors = X_test_dig[errors]
y_true_errors = y_test_dig[errors]
y_pred_errors = y_pred_best[errors]

print(f"\nNumber of errors: {errors.sum()} / {len(y_test_dig)} ({100*errors.sum()/len(y_test_dig):.2f}%)")

fig, axes = plt.subplots(2, 5, figsize=(14, 6))
axes = axes.ravel()

for i in range(min(10, len(X_errors))):
    axes[i].imshow(X_errors[i].reshape(8, 8), cmap='gray')
    axes[i].set_title(f'True: {y_true_errors[i]}\nPred: {y_pred_errors[i]}', color='red')
    axes[i].axis('off')

plt.suptitle('Examples of Classification Errors')
plt.tight_layout()
plt.show()

## Exercise 2: Hyperparameter Tuning

**Goal**: Tune a Random Forest and an SVM on the Digits dataset.

**Instructions**:
1. Random Forest: Tune `n_estimators`, `max_depth`, `min_samples_split`
2. SVM RBF: Tune `C` and `gamma`
3. Compare performance before and after tuning
4. Generate the learning curves
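
One detail worth keeping in mind for step 2: if the scaler is fitted on the whole training set before cross-validation, every validation fold has already contributed to the scaling statistics. The following is a minimal sketch, assuming a smaller grid than the exercise's and 3 folds to keep it fast, that wraps `StandardScaler` and `SVC` in a pipeline so the scaling is refit inside each fold. The names `pipe` and `search` are illustrative and not part of the notebook.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# make_pipeline names each step after its lowercased class name ('svc' here),
# so SVC parameters are addressed as 'svc__<param>' in the grid
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
param_grid = {
    'svc__C': [1, 10],
    'svc__gamma': ['scale', 0.01],
}

# Reduced grid and cv=3 are assumptions made to keep this sketch quick
search = GridSearchCV(pipe, param_grid, cv=3, scoring='f1_weighted')
search.fit(X_train, y_train)

print(f"Best parameters: {search.best_params_}")
# search.score uses the configured scoring, i.e. weighted F1, not accuracy
test_f1 = search.score(X_test, y_test)
print(f"Test F1 (weighted): {test_f1:.4f}")
```

Because the pipeline is the estimator being searched, the raw (unscaled) training data can be passed to `fit`, and the best pipeline applies the same scaling at prediction time.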

In [None]:
# 1. Random Forest tuning
from sklearn.model_selection import RandomizedSearchCV

param_dist_rf = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf = RandomForestClassifier(random_state=42)
rf_search = RandomizedSearchCV(
    rf, param_dist_rf, n_iter=20, cv=5, 
    scoring='f1_weighted', random_state=42, n_jobs=-1, verbose=1
)

print("Tuning Random Forest...")
rf_search.fit(X_train_dig, y_train_dig)

print(f"\nBest parameters: {rf_search.best_params_}")
print(f"Best F1 score (CV): {rf_search.best_score_:.4f}")

# Before/after comparison
rf_default = RandomForestClassifier(random_state=42)
rf_default.fit(X_train_dig, y_train_dig)

y_pred_default = rf_default.predict(X_test_dig)
y_pred_optimized = rf_search.best_estimator_.predict(X_test_dig)

print("\nRandom Forest comparison:")
print(f"Default - Test Accuracy: {accuracy_score(y_test_dig, y_pred_default):.4f}")
print(f"Tuned - Test Accuracy: {accuracy_score(y_test_dig, y_pred_optimized):.4f}")

In [None]:
# 2. SVM RBF tuning
from sklearn.model_selection import GridSearchCV

param_grid_svm = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 0.001, 0.01, 0.1, 1]
}

svm = SVC(kernel='rbf')
svm_search = GridSearchCV(
    svm, param_grid_svm, cv=5, 
    scoring='f1_weighted', n_jobs=-1, verbose=1
)

print("Tuning SVM RBF...")
svm_search.fit(X_train_dig_scaled, y_train_dig)

print(f"\nBest parameters: {svm_search.best_params_}")
print(f"Best F1 score (CV): {svm_search.best_score_:.4f}")

# Before/after comparison
svm_default = SVC(kernel='rbf')
svm_default.fit(X_train_dig_scaled, y_train_dig)

y_pred_svm_default = svm_default.predict(X_test_dig_scaled)
y_pred_svm_optimized = svm_search.best_estimator_.predict(X_test_dig_scaled)

print("\nSVM comparison:")
print(f"Default - Test Accuracy: {accuracy_score(y_test_dig, y_pred_svm_default):.4f}")
print(f"Tuned - Test Accuracy: {accuracy_score(y_test_dig, y_pred_svm_optimized):.4f}")

In [None]:
# 4. Learning Curves
models_lc = {
    'RF Default': rf_default,
    'RF Tuned': rf_search.best_estimator_,
    'SVM Default': svm_default,
    'SVM Tuned': svm_search.best_estimator_
}

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.ravel()

for idx, (name, model) in enumerate(models_lc.items()):
    # Use scaled features for SVM, raw features otherwise
    if 'SVM' in name:
        X_lc = X_train_dig_scaled
    else:
        X_lc = X_train_dig
    
    train_sizes, train_scores, val_scores = learning_curve(
        model, X_lc, y_train_dig,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5,
        scoring='accuracy',
        n_jobs=-1
    )
    
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)
    
    axes[idx].plot(train_sizes, train_mean, 'o-', label='Train', linewidth=2)
    axes[idx].fill_between(train_sizes, train_mean - train_std, 
                            train_mean + train_std, alpha=0.2)
    
    axes[idx].plot(train_sizes, val_mean, 's-', label='Validation', linewidth=2)
    axes[idx].fill_between(train_sizes, val_mean - val_std, 
                            val_mean + val_std, alpha=0.2)
    
    axes[idx].set_xlabel('Training Set Size')
    axes[idx].set_ylabel('Accuracy')
    axes[idx].set_title(f'Learning Curve: {name}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Exercise 3: Handling Class Imbalance

**Goal**: Handle an imbalanced dataset with several techniques.

**Instructions**:
1. Create a synthetic imbalanced dataset (95% class 0, 5% class 1)
2. Train a baseline model (no imbalance handling)
3. Use class_weight='balanced'
4. Under-sample the majority class
5. Over-sample the minority class (SMOTE)
6. Compare performance using precision, recall, and F1
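
To make step 3 concrete, here is a small sketch of how scikit-learn derives the `'balanced'` weights: each class receives `n_samples / (n_classes * count_of_that_class)`. The label vector below is a made-up 95/5 split used only for illustration; it is not the dataset generated in the exercise.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels: 95 samples of class 0, 5 samples of class 1
y = np.array([0] * 95 + [1] * 5)
classes = np.unique(y)

# Formula behind class_weight='balanced'
manual_weights = len(y) / (len(classes) * np.bincount(y))

# Same computation through scikit-learn's helper
sk_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

for c, w_manual, w_sk in zip(classes, manual_weights, sk_weights):
    print(f"class {c}: manual={w_manual:.4f}, sklearn={w_sk:.4f}")
```

With this split, an error on a class 1 sample counts 19 times as much as an error on a class 0 sample, which is exactly the inverse of the class frequency ratio.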

In [None]:
# 1. Create an imbalanced dataset
from sklearn.datasets import make_classification

X_imb, y_imb = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=2,
    weights=[0.95, 0.05],  # 95% class 0, 5% class 1
    random_state=42
)

print(f"Class distribution: {np.bincount(y_imb)}")
print(f"Class 1 proportion: {100 * y_imb.sum() / len(y_imb):.2f}%")

X_train_imb, X_test_imb, y_train_imb, y_test_imb = train_test_split(
    X_imb, y_imb, test_size=0.3, random_state=42, stratify=y_imb
)

In [None]:
# 2. Baseline (no imbalance handling)
rf_baseline = RandomForestClassifier(n_estimators=100, random_state=42)
rf_baseline.fit(X_train_imb, y_train_imb)
y_pred_baseline = rf_baseline.predict(X_test_imb)

print("Baseline (no imbalance handling):")
print(classification_report(y_test_imb, y_pred_baseline, target_names=['Class 0', 'Class 1']))

In [None]:
# 3. class_weight='balanced'
rf_balanced = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf_balanced.fit(X_train_imb, y_train_imb)
y_pred_balanced = rf_balanced.predict(X_test_imb)

print("With class_weight='balanced':")
print(classification_report(y_test_imb, y_pred_balanced, target_names=['Class 0', 'Class 1']))

In [None]:
# 4. Under-sample the majority class
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = rus.fit_resample(X_train_imb, y_train_imb)

print(f"After under-sampling: {np.bincount(y_train_under)}")

rf_under = RandomForestClassifier(n_estimators=100, random_state=42)
rf_under.fit(X_train_under, y_train_under)
y_pred_under = rf_under.predict(X_test_imb)

print("\nWith under-sampling:")
print(classification_report(y_test_imb, y_pred_under, target_names=['Class 0', 'Class 1']))

In [None]:
# 5. Over-sample with SMOTE
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_imb, y_train_imb)

print(f"After SMOTE: {np.bincount(y_train_smote)}")

rf_smote = RandomForestClassifier(n_estimators=100, random_state=42)
rf_smote.fit(X_train_smote, y_train_smote)
y_pred_smote = rf_smote.predict(X_test_imb)

print("\nWith SMOTE:")
print(classification_report(y_test_imb, y_pred_smote, target_names=['Class 0', 'Class 1']))

In [None]:
# 6. Overall comparison
strategies = {
    'Baseline': y_pred_baseline,
    'Balanced Weights': y_pred_balanced,
    'Under-sampling': y_pred_under,
    'SMOTE': y_pred_smote
}

results_imb = []

for name, y_pred in strategies.items():
    results_imb.append({
        'Strategy': name,
        'Accuracy': accuracy_score(y_test_imb, y_pred),
        'Precision (Class 1)': precision_score(y_test_imb, y_pred),
        'Recall (Class 1)': recall_score(y_test_imb, y_pred),
        'F1 (Class 1)': f1_score(y_test_imb, y_pred)
    })

results_imb_df = pd.DataFrame(results_imb)
print("\nStrategy comparison:")
print(results_imb_df.to_string(index=False))

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

metrics = ['Precision (Class 1)', 'Recall (Class 1)', 'F1 (Class 1)']
for idx, metric in enumerate(metrics):
    axes[idx].barh(results_imb_df['Strategy'], results_imb_df[metric], alpha=0.7)
    axes[idx].set_xlabel(metric)
    axes[idx].set_title(metric)
    axes[idx].set_xlim(0, 1)

plt.tight_layout()
plt.show()

print("\nConclusions:")
print("- Baseline: High accuracy but poor recall on the minority class")
print("- Balanced Weights: Reasonable trade-off")
print("- Under-sampling: Discards information, which can hurt performance")
print("- SMOTE: Often the best trade-off, generates synthetic examples")

## Summary

### Key points covered

1. **Multi-class Classification**
   - Application to the Digits dataset (10 classes)
   - Comparison of KNN, trees, and boosting
   - Error analysis with the confusion matrix

2. **Hyperparameter Tuning**
   - GridSearchCV for exhaustive search
   - RandomizedSearchCV for random search (faster)
   - Learning curves to diagnose bias/variance
   - Impact of tuning on performance

3. **Class Imbalance**
   - A common problem in practice (fraud, rare diseases, etc.)
   - Accuracy can be misleading
   - Strategies:
     - `class_weight='balanced'`: Penalizes errors on the minority class more heavily
     - Under-sampling: Shrinks the majority class
     - Over-sampling (SMOTE): Grows the minority class
   - Appropriate metrics: Precision, Recall, F1, AUC-ROC
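
The point that accuracy can be misleading is easy to check with a small sketch. The labels below are a made-up 95/5 split, and the "model" simply predicts the majority class for every sample; neither comes from the exercise's dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Illustrative ground truth: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A trivial "classifier" that always predicts the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)  # recall on class 1
f1 = f1_score(y_true, y_pred, zero_division=0)       # F1 on class 1

print(f"Accuracy: {acc:.2f}")
print(f"Recall (class 1): {rec:.2f}")
print(f"F1 (class 1): {f1:.2f}")
```

The accuracy matches the majority-class share while not a single positive is found, which is why the minority-class metrics are the ones to watch on imbalanced problems.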

### Practical recommendations

1. **Preliminary exploration**
   - Always check the class distribution
   - Visualize a few examples
   - Understand the features

2. **Model selection**
   - Start simple (trees, KNN)
   - Try ensemble methods (RF, XGBoost)
   - Use SVM if the dataset is of moderate size

3. **Tuning**
   - Use cross-validation
   - Use RandomizedSearchCV when the search space is large
   - Do not over-tune against the test set

4. **Imbalance**
   - Detect imbalance early
   - Choose appropriate metrics
   - Try several strategies
   - SMOTE is often effective
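
For intuition about what SMOTE does, the sketch below illustrates its core interpolation idea: a synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbors. This is a simplified illustration, not the `imblearn` implementation; the function name `smote_like_samples` and the random demo data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority
    samples and their nearest minority neighbors (simplified SMOTE idea)."""
    if len(X_min) < 2:
        raise ValueError("Need at least two minority samples to interpolate.")
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)

    # Ask for k + 1 neighbors: column 0 is the query point itself
    # (assuming there are no duplicate rows)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)

    base = rng.integers(0, len(X_min), size=n_new)          # random base samples
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]   # one random neighbor each
    u = rng.random((n_new, 1))                              # interpolation factor in [0, 1)

    return X_min[base] + u * (X_min[neigh] - X_min[base])

# Hypothetical minority-class samples, for demonstration only
X_minority = np.random.default_rng(42).normal(size=(10, 2))
X_new = smote_like_samples(X_minority, n_new=20)
print(X_new.shape)
```

Since every synthetic point lies between two real minority samples, the method fills in the minority region rather than duplicating rows. As in the exercise, resampling of this kind belongs on the training split only, so that the test set keeps the original class distribution.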