# 🚀 Google Colab Setup

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ogautier1980/sandbox-ml/blob/main/cours/04_classification_supervisee/04_demo_boosting_svm.ipynb)

**If you are running this notebook on Google Colab**, run the following cell to install the dependencies.

In [None]:
# Install dependencies (Google Colab only)
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('📦 Installing packages...')

    # Core ML packages
    !pip install -q numpy pandas matplotlib seaborn scikit-learn

    # Detect the chapter and install its specific dependencies
    notebook_name = '04_demo_boosting_svm.ipynb'  # Will be replaced automatically

    # Ch 06-08: Deep Learning
    if any(x in notebook_name for x in ['06_', '07_', '08_']):
        !pip install -q torch torchvision torchaudio

    # Ch 08: NLP
    if '08_' in notebook_name:
        !pip install -q transformers datasets tokenizers
        if 'rag' in notebook_name:
            !pip install -q sentence-transformers faiss-cpu rank-bm25

    # Ch 09: Reinforcement Learning
    if '09_' in notebook_name:
        !pip install -q gymnasium[classic-control]

    # Ch 04: Boosting
    if '04_' in notebook_name and 'boosting' in notebook_name:
        !pip install -q xgboost lightgbm catboost

    # Ch 05: Advanced clustering
    if '05_' in notebook_name:
        !pip install -q umap-learn

    # Ch 11: Time series
    if '11_' in notebook_name:
        !pip install -q statsmodels prophet

    # Ch 12: Advanced vision
    if '12_' in notebook_name:
        !pip install -q ultralytics timm segmentation-models-pytorch

    # Ch 13: Recommendation
    if '13_' in notebook_name:
        !pip install -q scikit-surprise implicit

    # Ch 14: MLOps
    if '14_' in notebook_name:
        !pip install -q mlflow fastapi pydantic

    print('✅ Installation complete!')
else:
    print('ℹ️  Local environment detected; packages are already installed.')

# Chapter 04 - Gradient Boosting and SVM Demonstration

This notebook explores boosting methods (XGBoost, LightGBM, CatBoost) and Support Vector Machines.

## Objectives
- Understand the principle of boosting
- Compare XGBoost, LightGBM and CatBoost
- Master SVMs and the kernel trick
- Tune hyperparameters

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer, make_circles, make_classification
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, ConfusionMatrixDisplay,
    roc_curve, auc, RocCurveDisplay
)
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier
from matplotlib.colors import ListedColormap
from time import time
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Part 1: Gradient Boosting (XGBoost, LightGBM, CatBoost)

### Principle of Boosting
- Sequential ensemble learning
- Each model corrects the errors of the previous one
- Weighted combination of weak predictors
- Gradient descent on the prediction error: each new learner is fit to the negative gradient of the loss
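
The sequential idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not how XGBoost, LightGBM or CatBoost are implemented: it assumes a regression target with squared-error loss (so the negative gradient is simply the residual), a single feature, and one-split decision stumps as weak learners. The names `fit_stump`, `predict_stump` and `boost` are hypothetical helpers introduced for this example.

```python
import math


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    # Exclude the largest x so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit stumps one after another, each on the current residuals."""
    base = sum(ys) / len(ys)          # initial constant prediction
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For the loss 0.5 * (y - p)^2, the negative gradient is y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each correction by the learning rate before adding it.
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
        history.append(mse(ys, preds))
    return base, stumps, history


# Toy data (synthetic): a sine curve sampled on [0, 4).
xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]

base, stumps, history = boost(xs, ys)
print(f"MSE after 1 round:   {history[0]:.4f}")
print(f"MSE after {len(history)} rounds: {history[-1]:.4f}")
```

Each stump is weak on its own; the training error shrinks because every round targets what the current ensemble still gets wrong. For classification, the libraries used in this notebook apply the same scheme to a different loss (log loss in the binary case).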

In [None]:
# 1.1 Load the Breast Cancer dataset
cancer = load_breast_cancer()
X_cancer = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y_cancer = cancer.target

print(f"Shape: {X_cancer.shape}")
print(f"Classes: {cancer.target_names}")
print(f"Distribution: {np.bincount(y_cancer)}")

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)

In [None]:
# 1.2 Comparison of boosting algorithms (Random Forest included as a bagging baseline)
models_boosting = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'),
    'LightGBM': lgb.LGBMClassifier(n_estimators=100, random_state=42, verbose=-1),
    'CatBoost': CatBoostClassifier(iterations=100, random_state=42, verbose=0)
}

results_boosting = []

for name, model in models_boosting.items():
    # Training
    start = time()
    model.fit(X_train, y_train)
    train_time = time() - start
    
    # Predictions
    start = time()
    y_pred = model.predict(X_test)
    pred_time = time() - start
    
    # Predicted probabilities for AUC-ROC
    y_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    roc_auc = auc(fpr, tpr)
    
    results_boosting.append({
        'Model': name,
        'Train Acc': model.score(X_train, y_train),
        'Test Acc': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
        'AUC-ROC': roc_auc,
        'Train Time': train_time,
        'Pred Time': pred_time
    })

results_df = pd.DataFrame(results_boosting)
print("Comparison of Boosting Algorithms:")
print(results_df.to_string(index=False))

In [None]:
# 1.3 Performance visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Test Accuracy
axes[0, 0].barh(results_df['Model'], results_df['Test Acc'], alpha=0.7)
axes[0, 0].set_xlabel('Accuracy')
axes[0, 0].set_title('Test Accuracy')
axes[0, 0].set_xlim(0.9, 1.0)

# F1 Score
axes[0, 1].barh(results_df['Model'], results_df['F1'], alpha=0.7, color='orange')
axes[0, 1].set_xlabel('F1 Score')
axes[0, 1].set_title('F1 Score')
axes[0, 1].set_xlim(0.9, 1.0)

# AUC-ROC
axes[1, 0].barh(results_df['Model'], results_df['AUC-ROC'], alpha=0.7, color='green')
axes[1, 0].set_xlabel('AUC-ROC')
axes[1, 0].set_title('AUC-ROC Score')
axes[1, 0].set_xlim(0.95, 1.0)

# Train Time
axes[1, 1].barh(results_df['Model'], results_df['Train Time'], alpha=0.7, color='red')
axes[1, 1].set_xlabel('Time (s)')
axes[1, 1].set_title('Training Time')

plt.tight_layout()
plt.show()

In [None]:
# 1.4 ROC curves
plt.figure(figsize=(10, 8))

for name, model in models_boosting.items():
    y_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    roc_auc = auc(fpr, tpr)
    
    plt.plot(fpr, tpr, linewidth=2, label=f'{name} (AUC = {roc_auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves - Comparison of Boosting Models')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# 1.5 Feature Importance (XGBoost)
xgb_model = xgb.XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss')
xgb_model.fit(X_train, y_train)

# Extract the feature importances
importance = pd.DataFrame({
    'Feature': X_cancer.columns,
    'Importance': xgb_model.feature_importances_
}).sort_values(by='Importance', ascending=False).head(15)

print("Top 15 most important features (XGBoost):")
print(importance)

plt.figure(figsize=(12, 6))
plt.barh(importance['Feature'], importance['Importance'])
plt.xlabel('Importance')
plt.title('Feature Importance - XGBoost')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

In [None]:
# 1.6 Hyperparameter tuning with GridSearchCV (XGBoost)
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'n_estimators': [50, 100, 200]
}

xgb_grid = xgb.XGBClassifier(random_state=42, eval_metric='logloss')
grid_search = GridSearchCV(xgb_grid, param_grid, cv=5, scoring='f1', n_jobs=-1, verbose=1)

print("Searching for the best hyperparameters...")
grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best F1 score (CV): {grid_search.best_score_:.4f}")

# Evaluate the best model
best_xgb = grid_search.best_estimator_
y_pred_best = best_xgb.predict(X_test)
print(f"\nTest Accuracy: {accuracy_score(y_test, y_pred_best):.4f}")
print(f"Test F1: {f1_score(y_test, y_pred_best):.4f}")

## Part 2: Support Vector Machines (SVM)

### Principle
- Finds the optimal hyperplane that separates the classes
- Maximizes the margin between the classes
- Support vectors: the points closest to the decision boundary
- Kernel trick: implicit projection into a higher-dimensional space
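
The kernel trick can be checked numerically on a small case. The sketch below assumes 2-D inputs and the homogeneous degree-2 polynomial kernel, chosen only because its explicit feature map is short enough to write out; `poly2_kernel` and `phi` are hypothetical helper names. It shows that evaluating the kernel in the original space gives the same number as a dot product in the 3-D mapped space, without ever building that space.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel, computed in the input space."""
    return dot(x, z) ** 2


def phi(x):
    """Explicit feature map whose dot product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


x, z = (1.0, 2.0), (3.0, -1.0)

k_implicit = poly2_kernel(x, z)        # kernel evaluated directly on x and z
k_explicit = dot(phi(x), phi(z))       # dot product after the explicit mapping

print(f"kernel in input space : {k_implicit:.6f}")
print(f"dot product after phi : {k_explicit:.6f}")

# In the mapped space, the first and third coordinates sum to the squared
# radius, so concentric circles become separable by a linear function.
radius_sq = phi(x)[0] + phi(x)[2]
print(f"squared radius of x   : {radius_sq:.6f}")
```

This is why a kernel SVM can separate data such as concentric circles: a boundary that is linear in the mapped features corresponds to a curved boundary in the original ones.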

In [None]:
# 2.1 Linear SVM on a simple dataset
from sklearn.datasets import make_blobs

X_blob, y_blob = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=42)
X_train_blob, X_test_blob, y_train_blob, y_test_blob = train_test_split(
    X_blob, y_blob, test_size=0.3, random_state=42
)

# Linear SVM
svm_linear = SVC(kernel='linear', C=1.0)
svm_linear.fit(X_train_blob, y_train_blob)

# Visualization
def plot_svm_boundary(X, y, model, title):
    """Plot the decision boundary of an SVM together with its support vectors."""
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', 
                edgecolors='k', s=50, alpha=0.8)
    
    # Highlight support vectors
    plt.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
                s=200, linewidth=2, facecolors='none', edgecolors='red', 
                label=f'Support Vectors ({len(model.support_vectors_)})')
    
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.legend()

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_train_blob[:, 0], X_train_blob[:, 1], c=y_train_blob, 
            cmap='viridis', edgecolors='k', s=50, alpha=0.8)
plt.title('Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

plt.subplot(1, 2, 2)
plot_svm_boundary(X_train_blob, y_train_blob, svm_linear, 
                 f'Linear SVM\nAccuracy: {svm_linear.score(X_test_blob, y_test_blob):.3f}')

plt.tight_layout()
plt.show()

In [None]:
# 2.2 Effect of the C parameter (regularization)
C_values = [0.1, 1.0, 10.0, 100.0]

fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, C in enumerate(C_values):
    svm = SVC(kernel='linear', C=C)
    svm.fit(X_train_blob, y_train_blob)
    
    test_acc = svm.score(X_test_blob, y_test_blob)
    n_support = len(svm.support_vectors_)
    
    plt.sca(axes[idx])
    plot_svm_boundary(X_train_blob, y_train_blob, svm,
                     f'C={C}\nAccuracy: {test_acc:.3f}\nSupport Vectors: {n_support}')

plt.tight_layout()
plt.show()

print("Effect of C:")
print("- Small C: wide margin, more support vectors, risk of underfitting")
print("- Large C: narrow margin, fewer support vectors, risk of overfitting")

In [None]:
# 2.3 Kernel trick - Non-linear dataset
X_circle, y_circle = make_circles(n_samples=300, noise=0.1, factor=0.3, random_state=42)
X_train_circ, X_test_circ, y_train_circ, y_test_circ = train_test_split(
    X_circle, y_circle, test_size=0.3, random_state=42
)

# Different kernels
kernels = ['linear', 'poly', 'rbf', 'sigmoid']

fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, kernel in enumerate(kernels):
    svm = SVC(kernel=kernel, C=1.0, gamma='scale')
    svm.fit(X_train_circ, y_train_circ)
    
    test_acc = svm.score(X_test_circ, y_test_circ)
    
    plt.sca(axes[idx])
    plot_svm_boundary(X_train_circ, y_train_circ, svm,
                     f'Kernel: {kernel}\nAccuracy: {test_acc:.3f}')

plt.tight_layout()
plt.show()

In [None]:
# 2.4 RBF SVM - Effect of the gamma parameter
gamma_values = [0.1, 1.0, 10.0, 100.0]

fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, gamma in enumerate(gamma_values):
    svm = SVC(kernel='rbf', C=1.0, gamma=gamma)
    svm.fit(X_train_circ, y_train_circ)
    
    train_acc = svm.score(X_train_circ, y_train_circ)
    test_acc = svm.score(X_test_circ, y_test_circ)
    
    plt.sca(axes[idx])
    plot_svm_boundary(X_train_circ, y_train_circ, svm,
                     f'gamma={gamma}\nTrain: {train_acc:.3f}, Test: {test_acc:.3f}')

plt.tight_layout()
plt.show()

print("Effect of gamma (RBF):")
print("- Small gamma: far-reaching influence, smooth boundary, risk of underfitting")
print("- Large gamma: local influence, complex boundary, risk of overfitting")

In [None]:
# 2.5 SVM on the Breast Cancer dataset
# Standardization (important for SVMs!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training with different kernels
svm_models = {
    'SVM Linear': SVC(kernel='linear', C=1.0),
    'SVM RBF': SVC(kernel='rbf', C=1.0, gamma='scale'),
    'SVM Poly (d=3)': SVC(kernel='poly', degree=3, C=1.0)
}

results_svm = []

for name, model in svm_models.items():
    # Training
    start = time()
    model.fit(X_train_scaled, y_train)
    train_time = time() - start
    
    # Predictions
    y_pred = model.predict(X_test_scaled)
    
    results_svm.append({
        'Model': name,
        'Train Acc': model.score(X_train_scaled, y_train),
        'Test Acc': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
        'Train Time': train_time,
        'Support Vectors': len(model.support_vectors_)
    })

results_svm_df = pd.DataFrame(results_svm)
print("SVM performance on Breast Cancer:")
print(results_svm_df.to_string(index=False))

In [None]:
# 2.6 Grid search to tune the RBF SVM
param_grid_svm = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1]
}

svm_grid = SVC(kernel='rbf')
grid_search_svm = GridSearchCV(svm_grid, param_grid_svm, cv=5, scoring='f1', n_jobs=-1, verbose=1)

print("Searching for the best hyperparameters for the RBF SVM...")
grid_search_svm.fit(X_train_scaled, y_train)

print(f"\nBest parameters: {grid_search_svm.best_params_}")
print(f"Best F1 score (CV): {grid_search_svm.best_score_:.4f}")

# Evaluation
best_svm = grid_search_svm.best_estimator_
y_pred_best_svm = best_svm.predict(X_test_scaled)
print(f"\nTest Accuracy: {accuracy_score(y_test, y_pred_best_svm):.4f}")
print(f"Test F1: {f1_score(y_test, y_pred_best_svm):.4f}")

## Part 3: Overall Comparison, Boosting vs SVM

In [None]:
# 3.1 Final comparison on Breast Cancer
final_models = {
    'XGBoost': best_xgb,
    'SVM RBF': best_svm
}

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

for idx, (name, model) in enumerate(final_models.items()):
    if name == 'XGBoost':
        X_eval = X_test
    else:
        X_eval = X_test_scaled
    
    y_pred = model.predict(X_eval)
    cm = confusion_matrix(y_test, y_pred)
    
    plt.sca(axes[idx])
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, 
                                   display_labels=cancer.target_names)
    disp.plot(ax=axes[idx], cmap='Blues')
    axes[idx].set_title(f'{name}\nAccuracy: {accuracy_score(y_test, y_pred):.4f}')

plt.tight_layout()
plt.show()

# Classification reports
for name, model in final_models.items():
    if name == 'XGBoost':
        X_eval = X_test
    else:
        X_eval = X_test_scaled
    
    y_pred = model.predict(X_eval)
    print(f"\n{name} - Classification Report:")
    print(classification_report(y_test, y_pred, target_names=cancer.target_names))

## Summary

### Gradient Boosting

**XGBoost:**
- Strong accuracy, heavily optimized implementation
- L1/L2 regularization
- Handles missing values natively
- Efficient parallelization

**LightGBM:**
- Very fast (leaf-wise growth)
- Low memory usage
- Excellent for large datasets
- Prone to overfitting if not regularized

**CatBoost:**
- Native handling of categorical features
- Robust to overfitting
- Good defaults (little tuning needed)
- Often slower to train than LightGBM

**Key hyperparameters:**
- `n_estimators`: number of trees
- `learning_rate`: learning rate (shrinkage applied to each tree's contribution)
- `max_depth`: depth of the trees
- `min_child_weight` / `min_samples_leaf`: regularization through a minimum leaf weight or size
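
These knobs go by different names depending on the library. The lookup below is a small sketch: the generic keys (`n_trees`, `max_depth`, `min_leaf`) and the `translate` helper are hypothetical and introduced only for this example, while the right-hand names are constructor arguments of the scikit-learn style estimators. The leaf-size options are related but not equivalent (XGBoost's `min_child_weight` is a sum of hessians rather than a sample count), and CatBoost is left out of that mapping here.

```python
# Generic knob -> constructor argument, per library.
CORE_PARAMS = {
    "sklearn":  {"n_trees": "n_estimators", "learning_rate": "learning_rate",
                 "max_depth": "max_depth"},
    "xgboost":  {"n_trees": "n_estimators", "learning_rate": "learning_rate",
                 "max_depth": "max_depth"},
    "lightgbm": {"n_trees": "n_estimators", "learning_rate": "learning_rate",
                 "max_depth": "max_depth"},
    "catboost": {"n_trees": "iterations", "learning_rate": "learning_rate",
                 "max_depth": "depth"},
}

# Leaf-size regularizers: similar role, different definitions.
LEAF_PARAMS = {
    "sklearn": "min_samples_leaf",     # minimum number of samples in a leaf
    "xgboost": "min_child_weight",     # minimum sum of hessians in a child
    "lightgbm": "min_child_samples",   # minimum number of samples in a leaf
}


def translate(library, **generic):
    """Rename generic settings to the argument names of one library."""
    names = dict(CORE_PARAMS[library])
    if library in LEAF_PARAMS:
        names["min_leaf"] = LEAF_PARAMS[library]
    return {names[key]: value for key, value in generic.items()}


settings = {"n_trees": 200, "learning_rate": 0.05, "max_depth": 4}
for library in CORE_PARAMS:
    print(f"{library:9s} {translate(library, **settings)}")
```

The resulting dictionary can be unpacked into the matching constructor, for example `CatBoostClassifier(**translate("catboost", **settings))`.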

### Support Vector Machines

**Advantages:**
- Effective in high dimensions
- Flexible thanks to the kernel trick
- Solid mathematical foundations
- Works well with little data

**Drawbacks:**
- Slow on large datasets (training cost between O(n²) and O(n³))
- Sensitive to feature scale (standardization required)
- The choice of kernel and hyperparameters is crucial
- Less interpretable
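
The sensitivity to scale comes from the fact that the common kernels are built on dot products or distances between samples. The sketch below uses made-up values (not taken from the Breast Cancer dataset) for one large-scale and one small-scale feature, and measures how much of the squared distance between two samples is due to the first feature, before and after z-score standardization. The helpers `standardize` and `share_of_first_feature` are hypothetical names for this example.

```python
from statistics import mean, pstdev

# Made-up rows: (large-scale feature, small-scale feature).
rows = [(400.0, 0.08), (1000.0, 0.12), (550.0, 0.10), (850.0, 0.09)]


def standardize(rows):
    """Z-score each column using the population standard deviation."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas))
            for row in rows]


def share_of_first_feature(a, b):
    """Fraction of the squared distance between a and b due to feature 0."""
    d0 = (a[0] - b[0]) ** 2
    d1 = (a[1] - b[1]) ** 2
    return d0 / (d0 + d1)


raw_share = share_of_first_feature(rows[0], rows[1])
scaled = standardize(rows)
scaled_share = share_of_first_feature(scaled[0], scaled[1])

print(f"share of feature 0, raw data:          {raw_share:.6f}")
print(f"share of feature 0, standardized data: {scaled_share:.6f}")
```

On the raw values the small-scale feature is practically invisible to the distance; after standardization both features contribute on a comparable footing.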

**Kernels:**
- `linear`: linearly separable data
- `rbf` (Radial Basis Function): general-purpose, non-linear
- `poly`: polynomial, models feature interactions
- `sigmoid`: similar to a neural network activation
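
For reference, the four kernels can be written as plain functions. This is a sketch of the standard formulas with parameter names borrowed from `SVC` (`gamma`, `coef0`, `degree`); the default values below are chosen for the example and `gamma=1.0` in particular is an assumption, not the library default.

```python
import math


def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    """Plain dot product."""
    return dot(x, z)


def poly_kernel(x, z, gamma=1.0, coef0=0.0, degree=3):
    """(gamma * <x, z> + coef0) ** degree"""
    return (gamma * dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    """exp(-gamma * ||x - z||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    """tanh(gamma * <x, z> + coef0)"""
    return math.tanh(gamma * dot(x, z) + coef0)


x, z = (0.0, 0.0), (1.0, 0.0)   # two points at distance 1

# The RBF similarity between the same two points for several gamma values.
for gamma in (0.1, 1.0, 10.0):
    print(f"gamma={gamma:<5} rbf(x, z) = {rbf_kernel(x, z, gamma):.6f}")
```

With a small `gamma`, two points one unit apart still look similar, so each support vector influences a wide region; with a large `gamma`, the similarity drops to nearly zero and the influence becomes local.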

**Key hyperparameters:**
- `C`: trade-off between margin width and training errors (regularization)
- `gamma`: reach of each support vector's influence (`rbf`, `poly`, `sigmoid`)
- `degree`: degree of the polynomial (`poly`)
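
The role of `C` is easiest to see in the soft-margin objective for a linear SVM: half the squared norm of the weights plus `C` times the sum of hinge losses. The sketch below evaluates that objective for two hand-picked candidate hyperplanes on four made-up points; it does not solve the optimization, and the labels are assumed to be -1 or +1. The function name `soft_margin_objective` is hypothetical.

```python
def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum(max(0, 1 - y_i * (w . x_i + b)))."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = 0.0
    for x, yi in zip(X, y):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        hinge += max(0.0, 1.0 - yi * score)   # zero once the margin is met
    return margin_term + C * hinge


# Made-up 2-D points; two of them lie close to the boundary x1 = 0.
X = [(-2.0, 0.0), (-0.5, 0.0), (0.5, 0.0), (2.0, 0.0)]
y = [-1, -1, 1, 1]

wide = ((0.5, 0.0), 0.0)     # small ||w||: wide margin, two margin violations
narrow = ((2.0, 0.0), 0.0)   # large ||w||: narrow margin, no violations

for C in (0.1, 100.0):
    obj_wide = soft_margin_objective(*wide, X, y, C)
    obj_narrow = soft_margin_objective(*narrow, X, y, C)
    preferred = "wide" if obj_wide < obj_narrow else "narrow"
    print(f"C={C:<6} wide={obj_wide:.3f} narrow={obj_narrow:.3f} -> {preferred}")
```

A small `C` makes violations cheap, so the wide-margin candidate scores better; a large `C` penalizes them heavily and the narrow-margin candidate wins.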

### When to use which?

**Gradient Boosting (XGBoost/LightGBM/CatBoost):**
- Kaggle competitions
- Structured tabular datasets
- Heterogeneous features
- Need for feature importance
- Large datasets

**SVM:**
- Small to medium-sized datasets
- High dimensionality
- Complex non-linear boundaries
- Need for solid theoretical guarantees
- Binary classification