# 🚀 Google Colab Setup

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ogautier1980/sandbox-ml/blob/main/cours/XX_CHAPTER/XX_NOTEBOOK.ipynb)

**If you are running this notebook on Google Colab**, run the following cell to install the dependencies.

In [None]:
# Install dependencies (Google Colab only)
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('📦 Installing packages...')

    # Core ML packages
    !pip install -q numpy pandas matplotlib seaborn scikit-learn

    # Detect the chapter and install its specific dependencies
    notebook_name = '04_demo_knn_arbres.ipynb'  # Will be replaced automatically

    # Ch 06-08: Deep Learning
    if any(x in notebook_name for x in ['06_', '07_', '08_']):
        !pip install -q torch torchvision torchaudio

    # Ch 08: NLP
    if '08_' in notebook_name:
        !pip install -q transformers datasets tokenizers
        if 'rag' in notebook_name:
            !pip install -q sentence-transformers faiss-cpu rank-bm25

    # Ch 09: Reinforcement Learning
    if '09_' in notebook_name:
        !pip install -q gymnasium[classic-control]

    # Ch 04: Boosting
    if '04_' in notebook_name and 'boosting' in notebook_name:
        !pip install -q xgboost lightgbm catboost

    # Ch 05: Advanced clustering
    if '05_' in notebook_name:
        !pip install -q umap-learn

    # Ch 11: Time series
    if '11_' in notebook_name:
        !pip install -q statsmodels prophet

    # Ch 12: Advanced vision
    if '12_' in notebook_name:
        !pip install -q ultralytics timm segmentation-models-pytorch

    # Ch 13: Recommendation
    if '13_' in notebook_name:
        !pip install -q scikit-surprise implicit

    # Ch 14: MLOps
    if '14_' in notebook_name:
        !pip install -q mlflow fastapi pydantic

    print('✅ Installation complete!')
else:
    print('ℹ️  Local environment detected; packages are already installed.')

# Chapter 04 - KNN and Decision Trees Demonstration

This notebook explores two classification algorithms: K-Nearest Neighbors (KNN) and Decision Trees.

## Objectives
- Understand how KNN works
- Implement and visualize decision trees
- Tune hyperparameters
- Compare performance

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris, load_wine, make_classification, make_moons
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, ConfusionMatrixDisplay
)
from matplotlib.colors import ListedColormap
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Part 1: K-Nearest Neighbors (KNN)

### Principle
- Instance-based algorithm (lazy learning)
- Classification by majority vote among the k nearest neighbors
- Euclidean distance (or another metric)
- Sensitive to feature scale
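
The principle above can be sketched in a few lines of NumPy. This is a minimal illustration rather than how scikit-learn implements `KNeighborsClassifier`; the helper name `knn_predict` and the toy arrays are assumptions made for the example.

```python
import numpy as np
from collections import Counter


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.15, 0.15]), k=3))  # 0
print(knn_predict(X_toy, y_toy, np.array([0.95, 1.00]), k=3))  # 1
```

Nothing is learned ahead of time: every prediction scans the stored training points, which is why the method is called lazy.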

In [None]:
# 1.1 Synthetic 2D dataset for visualization
np.random.seed(42)
X_2d, y_2d = make_moons(n_samples=200, noise=0.15, random_state=42)

X_train_2d, X_test_2d, y_train_2d, y_test_2d = train_test_split(
    X_2d, y_2d, test_size=0.3, random_state=42
)

# Visualisation
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_train_2d[:, 0], X_train_2d[:, 1], c=y_train_2d, cmap='viridis', 
            edgecolors='k', s=80, alpha=0.8)
plt.title('Train Set')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Class')

plt.subplot(1, 2, 2)
plt.scatter(X_test_2d[:, 0], X_test_2d[:, 1], c=y_test_2d, cmap='viridis', 
            edgecolors='k', s=80, alpha=0.8)
plt.title('Test Set')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Class')

plt.tight_layout()
plt.show()

In [None]:
# 1.2 Decision boundaries for different values of k
def plot_decision_boundary(X, y, model, title):
    """Plot a model's decision boundary."""
    h = 0.02  # Grid resolution
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', 
                edgecolors='k', s=50, alpha=0.8)
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')

# Try different values of k
k_values = [1, 3, 5, 15]
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, k in enumerate(k_values):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_2d, y_train_2d)
    
    accuracy = knn.score(X_test_2d, y_test_2d)
    
    plt.sca(axes[idx])
    plot_decision_boundary(X_test_2d, y_test_2d, knn, 
                          f'KNN (k={k})\nAccuracy: {accuracy:.3f}')

plt.tight_layout()
plt.show()

In [None]:
# 1.3 Tuning k with cross-validation
k_range = range(1, 31)
cv_scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_2d, y_train_2d, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

# Visualisation
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(k_range, cv_scores, 'o-', linewidth=2)
best_k = k_range[np.argmax(cv_scores)]
plt.axvline(best_k, color='r', linestyle='--', label=f'Best k={best_k}')
plt.xlabel('k (number of neighbors)')
plt.ylabel('Accuracy (CV)')
plt.title('Choosing k by Cross-Validation')
plt.legend()
plt.grid(True, alpha=0.3)

# Bias-variance tradeoff
plt.subplot(1, 2, 2)
plt.plot(k_range, cv_scores, 'o-', linewidth=2)
plt.axvline(best_k, color='r', linestyle='--', alpha=0.5)
plt.annotate('Overfitting\n(high variance)', xy=(1, 0.92), fontsize=10, color='red')
plt.annotate('Underfitting\n(high bias)', xy=(25, 0.87), fontsize=10, color='red')
plt.xlabel('k (number of neighbors)')
plt.ylabel('Accuracy (CV)')
plt.title('Bias-Variance Tradeoff')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Best k: {best_k} (CV accuracy: {max(cv_scores):.4f})")

In [None]:
# 1.4 Application to the Iris dataset (multiclass)
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42
)

# Standardization (important for KNN!)
scaler = StandardScaler()
X_train_iris_scaled = scaler.fit_transform(X_train_iris)
X_test_iris_scaled = scaler.transform(X_test_iris)

# Train with k=5 (note: best_k above was tuned on the moons data, not on Iris)
knn_iris = KNeighborsClassifier(n_neighbors=5)
knn_iris.fit(X_train_iris_scaled, y_train_iris)

# Predictions
y_pred_iris = knn_iris.predict(X_test_iris_scaled)

# Evaluation
print("KNN on the Iris dataset:")
print(f"Accuracy: {accuracy_score(y_test_iris, y_pred_iris):.4f}")
print("\nClassification Report:")
print(classification_report(y_test_iris, y_pred_iris, target_names=iris.target_names))

# Confusion matrix
cm = confusion_matrix(y_test_iris, y_pred_iris)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names)
disp.plot(cmap='Blues')
plt.title('Confusion Matrix - KNN on Iris')
plt.show()

## Part 2: Decision Trees

### Principle
- Hierarchical model built from decision rules
- Split criterion: Gini impurity or entropy (information gain)
- Easy to interpret
- Prone to overfitting without regularization
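
The two split criteria named above can be computed by hand. The sketch below uses the standard definitions (Gini impurity as one minus the sum of squared class proportions, entropy as Shannon entropy in bits); the function names and the toy labels are assumptions made for illustration, not scikit-learn internals.

```python
import numpy as np


def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))


def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def impurity_decrease(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Hypothetical node with two balanced classes, split perfectly in two
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]

print(gini(parent))                                      # 0.5
print(entropy(parent))                                   # 1.0
print(impurity_decrease(parent, left, right, gini))      # 0.5
print(impurity_decrease(parent, left, right, entropy))   # 1.0
```

With `impurity=entropy`, the impurity decrease is the information gain of the split. A tree grows by choosing, at each node, the split that maximizes this decrease.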

In [None]:
# 2.1 Decision tree on the 2D dataset
tree_2d = DecisionTreeClassifier(max_depth=5, random_state=42)
tree_2d.fit(X_train_2d, y_train_2d)

# Plot the decision boundary
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plot_decision_boundary(X_train_2d, y_train_2d, tree_2d, 
                      f'Decision Tree (Train)\nAccuracy: {tree_2d.score(X_train_2d, y_train_2d):.3f}')

plt.subplot(1, 2, 2)
plot_decision_boundary(X_test_2d, y_test_2d, tree_2d, 
                      f'Decision Tree (Test)\nAccuracy: {tree_2d.score(X_test_2d, y_test_2d):.3f}')

plt.tight_layout()
plt.show()

In [None]:
# 2.2 Visualizing the tree
plt.figure(figsize=(20, 10))
plot_tree(tree_2d, filled=True, feature_names=['Feature 1', 'Feature 2'],
          class_names=['Class 0', 'Class 1'], fontsize=10)
plt.title('Decision Tree Visualization')
plt.show()

print(f"Tree depth: {tree_2d.get_depth()}")
print(f"Number of leaves: {tree_2d.get_n_leaves()}")

In [None]:
# 2.3 Effect of maximum depth (overfitting)
max_depths = [1, 2, 3, 5, 10, None]  # None = no limit

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

train_scores = []
test_scores = []

for idx, depth in enumerate(max_depths):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train_2d, y_train_2d)
    
    train_acc = tree.score(X_train_2d, y_train_2d)
    test_acc = tree.score(X_test_2d, y_test_2d)
    
    train_scores.append(train_acc)
    test_scores.append(test_acc)
    
    plt.sca(axes[idx])
    plot_decision_boundary(X_test_2d, y_test_2d, tree,
                          f'max_depth={depth}\nTrain: {train_acc:.3f}, Test: {test_acc:.3f}')

plt.tight_layout()
plt.show()

# Complexity curve
plt.figure(figsize=(10, 5))
x_labels = [str(d) if d is not None else 'None' for d in max_depths]
x_pos = np.arange(len(max_depths))

plt.plot(x_pos, train_scores, 'o-', label='Train Accuracy', linewidth=2, markersize=8)
plt.plot(x_pos, test_scores, 's-', label='Test Accuracy', linewidth=2, markersize=8)
plt.xticks(x_pos, x_labels)
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.title('Effect of Depth on Performance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# 2.4 Gini vs Entropy comparison
criterions = ['gini', 'entropy']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for idx, criterion in enumerate(criterions):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=5, random_state=42)
    tree.fit(X_train_2d, y_train_2d)
    
    test_acc = tree.score(X_test_2d, y_test_2d)
    
    plt.sca(axes[idx])
    plot_decision_boundary(X_test_2d, y_test_2d, tree,
                          f'Criterion: {criterion.capitalize()}\nTest Accuracy: {test_acc:.3f}')

plt.tight_layout()
plt.show()

print("Gini vs Entropy:")
print("- Gini: slightly faster to compute (no logarithm); tends to isolate the most frequent class in its own branch")
print("- Entropy: grounded in information theory; tends to produce slightly more balanced trees")
print("- In practice: similar performance; Gini is the more common default")

In [None]:
# 2.5 Application to the Wine dataset
wine = load_wine()
X_wine = wine.data
y_wine = wine.target

X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(
    X_wine, y_wine, test_size=0.3, random_state=42
)

# Training
tree_wine = DecisionTreeClassifier(max_depth=5, min_samples_split=10, random_state=42)
tree_wine.fit(X_train_wine, y_train_wine)

# Predictions
y_pred_wine = tree_wine.predict(X_test_wine)

# Evaluation
print("Decision Tree on the Wine dataset:")
print(f"Train Accuracy: {tree_wine.score(X_train_wine, y_train_wine):.4f}")
print(f"Test Accuracy: {accuracy_score(y_test_wine, y_pred_wine):.4f}")
print("\nClassification Report:")
print(classification_report(y_test_wine, y_pred_wine, target_names=wine.target_names))

In [None]:
# 2.6 Feature importance
feature_importance = pd.DataFrame({
    'Feature': wine.feature_names,
    'Importance': tree_wine.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("\nFeature importances:")
print(feature_importance)

plt.figure(figsize=(12, 6))
plt.barh(feature_importance['Feature'], feature_importance['Importance'])
plt.xlabel('Importance')
plt.title('Feature Importance - Decision Tree sur Wine')
plt.tight_layout()
plt.show()

## Part 3: KNN vs Decision Trees

In [None]:
# 3.1 Comparison across several metrics
from time import time

# Data preparation
X_comp, y_comp = make_classification(
    n_samples=1000, n_features=20, n_informative=15, n_redundant=5,
    n_classes=3, random_state=42
)

X_train_comp, X_test_comp, y_train_comp, y_test_comp = train_test_split(
    X_comp, y_comp, test_size=0.3, random_state=42
)

scaler_comp = StandardScaler()
X_train_comp_scaled = scaler_comp.fit_transform(X_train_comp)
X_test_comp_scaled = scaler_comp.transform(X_test_comp)

# Models
models = {
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'KNN (k=10)': KNeighborsClassifier(n_neighbors=10),
    'Tree (depth=5)': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Tree (depth=10)': DecisionTreeClassifier(max_depth=10, random_state=42),
}

results = []

for name, model in models.items():
    # Pick the appropriate data (scaled for KNN)
    if 'KNN' in name:
        X_tr, X_te = X_train_comp_scaled, X_test_comp_scaled
    else:
        X_tr, X_te = X_train_comp, X_test_comp
    
    # Training
    start = time()
    model.fit(X_tr, y_train_comp)
    train_time = time() - start
    
    # Prediction
    start = time()
    y_pred = model.predict(X_te)
    pred_time = time() - start
    
    # Metrics
    results.append({
        'Model': name,
        'Train Acc': model.score(X_tr, y_train_comp),
        'Test Acc': accuracy_score(y_test_comp, y_pred),
        'Precision': precision_score(y_test_comp, y_pred, average='weighted'),
        'Recall': recall_score(y_test_comp, y_pred, average='weighted'),
        'F1': f1_score(y_test_comp, y_pred, average='weighted'),
        'Train Time (s)': train_time,
        'Pred Time (s)': pred_time
    })

results_df = pd.DataFrame(results)
print("Model comparison:")
print(results_df.to_string(index=False))

In [None]:
# 3.2 Visualizing the comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Accuracy
axes[0, 0].bar(range(len(results_df)), results_df['Test Acc'], alpha=0.7)
axes[0, 0].set_xticks(range(len(results_df)))
axes[0, 0].set_xticklabels(results_df['Model'], rotation=45, ha='right')
axes[0, 0].set_ylabel('Accuracy')
axes[0, 0].set_title('Test Accuracy')
axes[0, 0].grid(True, alpha=0.3)

# F1 Score
axes[0, 1].bar(range(len(results_df)), results_df['F1'], alpha=0.7, color='orange')
axes[0, 1].set_xticks(range(len(results_df)))
axes[0, 1].set_xticklabels(results_df['Model'], rotation=45, ha='right')
axes[0, 1].set_ylabel('F1 Score')
axes[0, 1].set_title('F1 Score')
axes[0, 1].grid(True, alpha=0.3)

# Train Time
axes[1, 0].bar(range(len(results_df)), results_df['Train Time (s)'], alpha=0.7, color='green')
axes[1, 0].set_xticks(range(len(results_df)))
axes[1, 0].set_xticklabels(results_df['Model'], rotation=45, ha='right')
axes[1, 0].set_ylabel('Time (s)')
axes[1, 0].set_title('Training Time')
axes[1, 0].grid(True, alpha=0.3)

# Pred Time
axes[1, 1].bar(range(len(results_df)), results_df['Pred Time (s)'], alpha=0.7, color='red')
axes[1, 1].set_xticks(range(len(results_df)))
axes[1, 1].set_xticklabels(results_df['Model'], rotation=45, ha='right')
axes[1, 1].set_ylabel('Time (s)')
axes[1, 1].set_title('Prediction Time')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Summary

### K-Nearest Neighbors (KNN)

**Advantages:**
- Simple and intuitive
- No assumption about the data distribution
- Works well for nonlinear decision boundaries
- No real training phase (lazy learning): `fit` stores the training data and may build a search index

**Disadvantages:**
- Slow prediction (O(n) distance computations per query with brute-force search)
- Sensitive to feature scale (standardization required)
- Performance degrades in high dimensions (curse of dimensionality)
- Sensitive to noise and outliers
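
The curse of dimensionality mentioned above can be made concrete with a small experiment. As a rough sketch on uniformly random points (an assumption chosen for simplicity, not a property of any dataset in this notebook), the gap between the nearest and the farthest neighbor shrinks relative to the nearest distance as the number of dimensions grows, so "nearest" carries less information.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_points, n_dims):
    """(d_max - d_min) / d_min for distances from one random query to random points."""
    X = rng.random((n_points, n_dims))      # points drawn uniformly in the unit cube
    query = rng.random(n_dims)
    d = np.linalg.norm(X - query, axis=1)   # Euclidean distances to the query
    return (d.max() - d.min()) / d.min()


for n_dims in [2, 10, 100, 1000]:
    contrast = relative_contrast(500, n_dims)
    print(f"d={n_dims:>4}: relative contrast = {contrast:.2f}")
```

The helper name `relative_contrast` is hypothetical. The printed values depend on the random seed, but the contrast should fall sharply from the low-dimensional to the high-dimensional case.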

**Key hyperparameters:**
- `n_neighbors` (k): number of neighbors to consider
- `metric`: distance metric (euclidean, manhattan, etc.)
- `weights`: `uniform` or `distance` (closer neighbors count more)
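
As a hedged illustration of these hyperparameters, the sketch below compares a few `metric` and `weights` settings on the Wine dataset using cross-validation. Wrapping the scaler and the classifier in a pipeline is a design choice of this example: it keeps standardization inside each fold. The variable names are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_wine(return_X_y=True)

# Same classifier with and without standardization
raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

raw_acc = cross_val_score(raw_knn, X_demo, y_demo, cv=5).mean()
scaled_acc = cross_val_score(scaled_knn, X_demo, y_demo, cv=5).mean()
print(f"Unscaled: {raw_acc:.3f} | Scaled: {scaled_acc:.3f}")

# Vary the distance metric and the voting scheme
for metric in ["euclidean", "manhattan"]:
    for weights in ["uniform", "distance"]:
        model = make_pipeline(
            StandardScaler(),
            KNeighborsClassifier(n_neighbors=5, metric=metric, weights=weights),
        )
        acc = cross_val_score(model, X_demo, y_demo, cv=5).mean()
        print(f"metric={metric:<9} weights={weights:<8} CV accuracy={acc:.3f}")
```

On this dataset the features have very different ranges, so the scaled pipeline should score clearly higher than the unscaled classifier.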

### Decision Trees

**Advantages:**
- Easy to interpret (if-then rules)
- Handles numerical and categorical features (scikit-learn's implementation requires categorical features to be encoded numerically)
- No normalization needed
- Captures interactions between features
- Fast at prediction time

**Disadvantages:**
- Prone to overfitting without regularization
- Unstable (high variance)
- Biased toward features with many distinct values
- Can produce complex trees

**Key hyperparameters:**
- `max_depth`: maximum depth of the tree
- `min_samples_split`: minimum number of samples required to split a node
- `min_samples_leaf`: minimum number of samples required in a leaf
- `criterion`: `gini` or `entropy`
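
These hyperparameters can be tuned jointly with `GridSearchCV`. The sketch below is one possible setup on the Wine dataset; the grid values are arbitrary choices made for the example, not recommendations.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_gs, y_gs = load_wine(return_X_y=True)
X_train_gs, X_test_gs, y_train_gs, y_test_gs = train_test_split(
    X_gs, y_gs, test_size=0.3, random_state=42, stratify=y_gs
)

# Illustrative grid over the four hyperparameters listed above
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train_gs, y_train_gs)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Test accuracy: {search.score(X_test_gs, y_test_gs):.3f}")
```

The search only sees the training split, so the final test accuracy remains an honest estimate of how the selected tree generalizes.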

### When to use which?

**KNN:**
- Small to medium-sized datasets
- Complex decision boundaries
- Few features
- Prediction time is not critical

**Decision Trees:**
- Interpretability is needed
- Heterogeneous features (numerical + categorical)
- Base learner for ensemble methods (Random Forest, Boosting)
- Prediction time is critical