# 🚀 Google Colab Setup

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ogautier1980/sandbox-ml/blob/main/cours/03_regression/03_exercices.ipynb)

**If you are running this notebook on Google Colab**, run the following cell to install the dependencies.

In [None]:
# Install dependencies (Google Colab only)
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('📦 Installing packages...')

    # Core ML packages
    !pip install -q numpy pandas matplotlib seaborn scikit-learn

    # Detect the chapter and install chapter-specific dependencies
    notebook_name = '03_exercices.ipynb'  # Replaced automatically

    # Ch 06-08: Deep Learning
    if any(x in notebook_name for x in ['06_', '07_', '08_']):
        !pip install -q torch torchvision torchaudio

    # Ch 08: NLP
    if '08_' in notebook_name:
        !pip install -q transformers datasets tokenizers
        if 'rag' in notebook_name:
            !pip install -q sentence-transformers faiss-cpu rank-bm25

    # Ch 09: Reinforcement Learning
    if '09_' in notebook_name:
        !pip install -q gymnasium[classic-control]

    # Ch 04: Boosting
    if '04_' in notebook_name and 'boosting' in notebook_name:
        !pip install -q xgboost lightgbm catboost

    # Ch 05: Advanced clustering
    if '05_' in notebook_name:
        !pip install -q umap-learn

    # Ch 11: Time series
    if '11_' in notebook_name:
        !pip install -q statsmodels prophet

    # Ch 12: Advanced vision
    if '12_' in notebook_name:
        !pip install -q ultralytics timm segmentation-models-pytorch

    # Ch 13: Recommendation
    if '13_' in notebook_name:
        !pip install -q scikit-surprise implicit

    # Ch 14: MLOps
    if '14_' in notebook_name:
        !pip install -q mlflow fastapi pydantic

    print('✅ Installation complete!')
else:
    print('ℹ️  Local environment detected; packages are already installed.')

# Chapter 03 - Regression Exercises

This notebook contains hands-on exercises on linear regression, polynomial regression, and regularization.

## Objectives
- Apply linear regression to real data
- Diagnose regression problems
- Use regularization to improve models
- Compare different approaches

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing, load_diabetes
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

## Exercise 1: Linear Regression on the California Housing Dataset

**Objective**: Predict the median house value in California.

**Instructions**:
1. Load the California Housing dataset
2. Explore the data (descriptive statistics, correlations)
3. Train a linear regression model
4. Evaluate performance (MSE, RMSE, R², MAE)
5. Analyze the residuals
6. Identify the most important features

In [None]:
# 1. Load the data
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target  # Median house value (in units of $100,000)

print(f"Shape: {X.shape}")
print(f"\nFeatures: {list(X.columns)}")
print(f"\nTarget range: [{y.min():.2f}, {y.max():.2f}]")

In [None]:
# 2. Data exploration
print("Descriptive statistics:")
print(X.describe())

# Correlation of each feature with the target
target_corr = X.corrwith(pd.Series(y, name='Target')).sort_values(ascending=False)
print("\nCorrelations with the target:")
print(target_corr)

# Visualisation
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.ravel()

for idx, col in enumerate(X.columns):
    axes[idx].scatter(X[col], y, alpha=0.3, s=5)
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Price')
    axes[idx].set_title(f'Corr: {X[col].corr(pd.Series(y)):.3f}')

plt.tight_layout()
plt.show()

In [None]:
# 3. Preparation and training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardization: it does not change the predictions of an unregularized
# linear regression, but it makes the coefficients comparable across features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training
model = LinearRegression()
model.fit(X_train_scaled, y_train)

print("Model trained successfully!")

In [None]:
# 4. Performance evaluation
y_train_pred = model.predict(X_train_scaled)
y_test_pred = model.predict(X_test_scaled)

# Metrics
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

print("Model performance:")
print(f"\nTrain MSE:  {train_mse:.4f}")
print(f"Test MSE:   {test_mse:.4f}")
print(f"\nTrain RMSE: {np.sqrt(train_mse):.4f}")
print(f"Test RMSE:  {np.sqrt(test_mse):.4f}")
print(f"\nTrain R²: {train_r2:.4f}")
print(f"Test R²:  {test_r2:.4f}")
print(f"\nTest MAE: {mean_absolute_error(y_test, y_test_pred):.4f}")

# Predictions vs actual values
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(y_train, y_train_pred, alpha=0.3, s=10)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
plt.xlabel('Actual values')
plt.ylabel('Predictions')
plt.title(f'Train Set (R²={train_r2:.3f})')

plt.subplot(1, 2, 2)
plt.scatter(y_test, y_test_pred, alpha=0.3, s=10)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
plt.xlabel('Actual values')
plt.ylabel('Predictions')
plt.title(f'Test Set (R²={test_r2:.3f})')

plt.tight_layout()
plt.show()

In [None]:
# 5. Residual analysis
residuals = y_test - y_test_pred

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Residuals vs predictions
axes[0, 0].scatter(y_test_pred, residuals, alpha=0.3, s=10)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Predictions')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Predictions')

# Distribution of the residuals
axes[0, 1].hist(residuals, bins=50, edgecolor='black')
axes[0, 1].set_xlabel('Residuals')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Distribution of Residuals')

# Q-Q plot
from scipy import stats
stats.probplot(residuals, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('Q-Q Plot')

# Absolute residuals vs predictions
axes[1, 1].scatter(y_test_pred, np.abs(residuals), alpha=0.3, s=10)
axes[1, 1].set_xlabel('Predictions')
axes[1, 1].set_ylabel('|Residuals|')
axes[1, 1].set_title('Absolute Residuals vs Predictions')

plt.tight_layout()
plt.show()

print(f"Mean of residuals: {residuals.mean():.6f}")
print(f"Standard deviation of residuals: {residuals.std():.4f}")

In [None]:
# 6. Feature importance
# The features were standardized, so coefficient magnitudes are comparable
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
}).sort_values(by='Coefficient', key=abs, ascending=False)

print("Feature importance (coefficients):")
print(feature_importance)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Coefficient'])
plt.xlabel('Coefficient')
plt.title('Feature Importance (Linear Regression)')
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
plt.tight_layout()
plt.show()
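
Coefficients fitted on standardized features measure the effect of a one-standard-deviation change, which is why they can be ranked against each other. The sketch below uses hypothetical synthetic data (`X_demo`, `y_demo`) and a small helper `fit_ols` that is not part of the notebook; it illustrates how dividing a standardized coefficient by the feature's standard deviation recovers the coefficient in the original units.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two features on very different scales
n = 200
X_demo = np.column_stack([rng.normal(50, 10, n), rng.normal(0.5, 0.1, n)])
y_demo = 0.2 * X_demo[:, 0] + 30.0 * X_demo[:, 1] + rng.normal(0, 0.5, n)


def fit_ols(features, target):
    """Least-squares fit with an intercept; returns (intercept, coefficients)."""
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta[0], beta[1:]


# Fit on standardized features (population std, as StandardScaler uses)
mean, std = X_demo.mean(axis=0), X_demo.std(axis=0)
_, coef_scaled = fit_ols((X_demo - mean) / std, y_demo)

# Fit on the raw features for comparison
_, coef_raw = fit_ols(X_demo, y_demo)

# Effect of a one-unit change = effect of a one-std change / std
coef_back = coef_scaled / std

print("Standardized coefficients:", coef_scaled)
print("Converted to raw units:   ", coef_back)
print("Fitted on raw features:   ", coef_raw)
```

The two raw-unit vectors agree up to floating-point error, so standardization changes how the coefficients read, not what the unregularized model predicts.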

## Exercise 2: Polynomial Regression and Overfitting

**Objective**: Explore the impact of the polynomial degree on performance.

**Instructions**:
1. Create a noisy synthetic dataset
2. Fit polynomial regressions of degree 1, 3, 5, 10, and 15
3. Compare train and test performance
4. Visualize overfitting
5. Identify the best degree

In [None]:
# 1. Create the synthetic dataset
np.random.seed(42)
n_samples = 100
X_synth = np.sort(np.random.uniform(-3, 3, n_samples))
y_true = np.sin(X_synth) + 0.5 * X_synth  # True function
y_synth = y_true + np.random.normal(0, 0.5, n_samples)  # With noise

X_synth = X_synth.reshape(-1, 1)

# Split
X_s_train, X_s_test, y_s_train, y_s_test = train_test_split(
    X_synth, y_synth, test_size=0.3, random_state=42
)

# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(X_s_train, y_s_train, alpha=0.6, label='Train', s=50)
plt.scatter(X_s_test, y_s_test, alpha=0.6, label='Test', s=50)
# X_synth is already sorted, so y_true can be plotted against it directly
plt.plot(X_synth, y_true, 'r--', label='True function', linewidth=2)
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.title('Synthetic Dataset')
plt.show()

In [None]:
# 2-3. Try different polynomial degrees
degrees = [1, 3, 5, 10, 15]
results = []

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

X_plot = np.linspace(-3, 3, 300).reshape(-1, 1)

for idx, degree in enumerate(degrees):
    # Pipeline: polynomial features + linear regression
    pipeline = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])

    pipeline.fit(X_s_train, y_s_train)

    # Predictions
    y_train_pred = pipeline.predict(X_s_train)
    y_test_pred = pipeline.predict(X_s_test)
    y_plot = pipeline.predict(X_plot)

    # Metrics
    train_mse = mean_squared_error(y_s_train, y_train_pred)
    test_mse = mean_squared_error(y_s_test, y_test_pred)
    train_r2 = r2_score(y_s_train, y_train_pred)
    test_r2 = r2_score(y_s_test, y_test_pred)

    results.append({
        'Degree': degree,
        'Train RMSE': np.sqrt(train_mse),
        'Test RMSE': np.sqrt(test_mse),
        'Train R²': train_r2,
        'Test R²': test_r2
    })

    # Visualization
    axes[idx].scatter(X_s_train, y_s_train, alpha=0.6, label='Train', s=30)
    axes[idx].scatter(X_s_test, y_s_test, alpha=0.6, label='Test', s=30)
    axes[idx].plot(X_plot, y_plot, 'g-', label='Model', linewidth=2)
    axes[idx].plot(X_synth, y_true, 'r--', label='True', linewidth=1, alpha=0.7)
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('y')
    axes[idx].set_title(f'Degree {degree}\nTest R²={test_r2:.3f}, RMSE={np.sqrt(test_mse):.3f}')
    axes[idx].legend()
    axes[idx].set_ylim(-5, 5)

# Remove the unused last subplot
fig.delaxes(axes[-1])
plt.tight_layout()
plt.show()

# Results
results_df = pd.DataFrame(results)
print("\nResults by polynomial degree:")
print(results_df.to_string(index=False))

In [None]:
# 4. Visualizing overfitting
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# RMSE
axes[0].plot(results_df['Degree'], results_df['Train RMSE'], 'o-', label='Train RMSE', linewidth=2)
axes[0].plot(results_df['Degree'], results_df['Test RMSE'], 's-', label='Test RMSE', linewidth=2)
axes[0].set_xlabel('Polynomial Degree')
axes[0].set_ylabel('RMSE')
axes[0].set_title('RMSE vs Polynomial Degree')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# R²
axes[1].plot(results_df['Degree'], results_df['Train R²'], 'o-', label='Train R²', linewidth=2)
axes[1].plot(results_df['Degree'], results_df['Test R²'], 's-', label='Test R²', linewidth=2)
axes[1].set_xlabel('Polynomial Degree')
axes[1].set_ylabel('R²')
axes[1].set_title('R² vs Polynomial Degree')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# 5. Identify the best degree
# Note: the degree is selected on the test set here, so its score is optimistic
best_idx = results_df['Test R²'].idxmax()
best_degree = results_df.loc[best_idx, 'Degree']

print(f"\nBest degree: {best_degree}")
print("\nPerformance of the best model:")
print(results_df.loc[best_idx].to_string())

print("\n📊 Observations:")
print("- Degree 1 (linear): underfitting, too simple")
print("- Degrees 3-5: good bias-variance trade-off")
print("- Degrees 10-15: overfitting, too flexible")
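
The cell above picks the degree with the best test-set R², which makes the reported score optimistic because the test set has taken part in model selection. One common alternative is to choose the degree by k-fold cross-validation on the training data only. The sketch below is a NumPy-only illustration under assumptions: it regenerates data with the same recipe as this exercise, and the helper `cv_rmse` and the use of `numpy.polynomial.Polynomial.fit` (instead of the notebook's scikit-learn pipeline) are choices made for the example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

# Synthetic data following the same recipe as this exercise
x = rng.uniform(-3, 3, 100)
y = np.sin(x) + 0.5 * x + rng.normal(0, 0.5, 100)

# One fixed 5-fold partition, shared by all candidate degrees
folds = np.array_split(rng.permutation(len(x)), 5)


def cv_rmse(degree):
    """Mean validation RMSE of a polynomial of the given degree over the folds."""
    fold_rmse = []
    for k, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        # Fit on the training folds only, then score on the held-out fold
        poly = Polynomial.fit(x[train_idx], y[train_idx], degree)
        residuals = y[val_idx] - poly(x[val_idx])
        fold_rmse.append(np.sqrt(np.mean(residuals ** 2)))
    return float(np.mean(fold_rmse))


degrees = [1, 3, 5, 10, 15]
scores = {d: cv_rmse(d) for d in degrees}
best_degree = min(scores, key=scores.get)

for d, s in scores.items():
    print(f"degree {d:2d}: CV RMSE = {s:.3f}")
print(f"Selected degree: {best_degree}")
```

With this scheme the test set is touched only once, after the degree has been fixed, so its score remains an honest estimate.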

## Exercise 3: Regularization (Ridge, Lasso, ElasticNet)

**Objective**: Use regularization to control overfitting.

**Instructions**:
1. Use the Diabetes dataset from sklearn
2. Create polynomial features (degree 3)
3. Compare Linear, Ridge, Lasso, and ElasticNet
4. Tune the alpha hyperparameter
5. Analyze the feature selection performed by Lasso

In [None]:
# 1. Load the Diabetes dataset
diabetes = load_diabetes()
X_diab = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y_diab = diabetes.target

print(f"Shape: {X_diab.shape}")
print(f"Features: {list(X_diab.columns)}")
print(f"\nTarget statistics:")
print(f"Mean: {y_diab.mean():.2f}, Std: {y_diab.std():.2f}")
print(f"Range: [{y_diab.min():.2f}, {y_diab.max():.2f}]")

In [None]:
# 2. Create polynomial features (degree 3)
poly = PolynomialFeatures(degree=3, include_bias=False)
X_diab_poly = poly.fit_transform(X_diab)

print(f"Number of original features: {X_diab.shape[1]}")
print(f"Number of polynomial features: {X_diab_poly.shape[1]}")

# Split
X_d_train, X_d_test, y_d_train, y_d_test = train_test_split(
    X_diab_poly, y_diab, test_size=0.2, random_state=42
)

# Standardization
scaler_d = StandardScaler()
X_d_train_scaled = scaler_d.fit_transform(X_d_train)
X_d_test_scaled = scaler_d.transform(X_d_test)

In [None]:
# 3. Model comparison
models = {
    'Linear': LinearRegression(),
    'Ridge (α=1)': Ridge(alpha=1.0),
    'Ridge (α=10)': Ridge(alpha=10.0),
    'Lasso (α=1)': Lasso(alpha=1.0, max_iter=10000),
    'Lasso (α=5)': Lasso(alpha=5.0, max_iter=10000),
    'ElasticNet (α=1)': ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=10000)
}

results_reg = []

for name, model in models.items():
    # Training
    model.fit(X_d_train_scaled, y_d_train)

    # Predictions
    y_train_pred = model.predict(X_d_train_scaled)
    y_test_pred = model.predict(X_d_test_scaled)

    # Metrics
    train_rmse = np.sqrt(mean_squared_error(y_d_train, y_train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_d_test, y_test_pred))
    train_r2 = r2_score(y_d_train, y_train_pred)
    test_r2 = r2_score(y_d_test, y_test_pred)

    # Number of non-zero coefficients
    non_zero_coefs = np.sum(np.abs(model.coef_) > 1e-10)

    results_reg.append({
        'Model': name,
        'Train RMSE': train_rmse,
        'Test RMSE': test_rmse,
        'Train R²': train_r2,
        'Test R²': test_r2,
        'Non-Zero Coefs': non_zero_coefs
    })

results_reg_df = pd.DataFrame(results_reg)
print("Comparison of regularized models:")
print(results_reg_df.to_string(index=False))

In [None]:
# Visualize the results
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

x_pos = np.arange(len(results_reg_df))

# RMSE
width = 0.35
axes[0].bar(x_pos - width/2, results_reg_df['Train RMSE'], width, label='Train', alpha=0.8)
axes[0].bar(x_pos + width/2, results_reg_df['Test RMSE'], width, label='Test', alpha=0.8)
axes[0].set_xlabel('Model')
axes[0].set_ylabel('RMSE')
axes[0].set_title('RMSE by Model')
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(results_reg_df['Model'], rotation=45, ha='right')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# R²
axes[1].bar(x_pos - width/2, results_reg_df['Train R²'], width, label='Train', alpha=0.8)
axes[1].bar(x_pos + width/2, results_reg_df['Test R²'], width, label='Test', alpha=0.8)
axes[1].set_xlabel('Model')
axes[1].set_ylabel('R²')
axes[1].set_title('R² by Model')
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(results_reg_df['Model'], rotation=45, ha='right')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# 4. Hyperparameter tuning: R² as a function of alpha (validation curve)
# Note: each alpha is scored on the test set, so the best score is optimistic
alphas = np.logspace(-3, 2, 50)
ridge_scores = []
lasso_scores = []

for alpha in alphas:
    # Ridge
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_d_train_scaled, y_d_train)
    ridge_scores.append(r2_score(y_d_test, ridge.predict(X_d_test_scaled)))

    # Lasso
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_d_train_scaled, y_d_train)
    lasso_scores.append(r2_score(y_d_test, lasso.predict(X_d_test_scaled)))

# Visualization
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.semilogx(alphas, ridge_scores, 'o-', linewidth=2)
best_alpha_ridge = alphas[np.argmax(ridge_scores)]
plt.axvline(best_alpha_ridge, color='r', linestyle='--', label=f'Best α={best_alpha_ridge:.3f}')
plt.xlabel('Alpha')
plt.ylabel('R² Score (Test)')
plt.title('Ridge: R² vs Alpha')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.semilogx(alphas, lasso_scores, 'o-', linewidth=2)
best_alpha_lasso = alphas[np.argmax(lasso_scores)]
plt.axvline(best_alpha_lasso, color='r', linestyle='--', label=f'Best α={best_alpha_lasso:.3f}')
plt.xlabel('Alpha')
plt.ylabel('R² Score (Test)')
plt.title('Lasso: R² vs Alpha')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Best Ridge alpha: {best_alpha_ridge:.4f} (R²={max(ridge_scores):.4f})")
print(f"Best Lasso alpha: {best_alpha_lasso:.4f} (R²={max(lasso_scores):.4f})")
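
The sweep above scores every alpha on the test set, so the "best" alpha is tuned to that particular split. A possible alternative is to tune alpha by cross-validation on the training data and keep the test set for a single final check. The sketch below is one way to do this with scikit-learn's `GridSearchCV`; the helper `tune_alpha`, the degree-2 expansion, and the smaller alpha grid are assumptions chosen to keep the example fast, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X_raw, y_raw = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

alphas = np.logspace(-1, 2, 20)


def tune_alpha(estimator):
    """Grid-search alpha with 5-fold CV; preprocessing is refit in every fold."""
    pipe = Pipeline([
        ('poly', PolynomialFeatures(degree=2, include_bias=False)),
        ('scale', StandardScaler()),
        ('model', estimator),
    ])
    search = GridSearchCV(pipe, {'model__alpha': alphas}, cv=5, scoring='r2')
    return search.fit(X_tr, y_tr)


ridge_search = tune_alpha(Ridge())
lasso_search = tune_alpha(Lasso(max_iter=10000))

for name, search in [('Ridge', ridge_search), ('Lasso', lasso_search)]:
    best_alpha = search.best_params_['model__alpha']
    print(f"{name}: alpha={best_alpha:.4f}, "
          f"CV R²={search.best_score_:.4f}, "
          f"test R²={search.score(X_te, y_te):.4f}")
```

Because the scaler sits inside the pipeline, its statistics are computed from the training folds only, and the test R² is evaluated once with the selected alpha.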

In [None]:
# 5. Analysis of feature selection by Lasso
# Train Lasso with several alphas
alphas_lasso = [0.1, 0.5, 1.0, 5.0, 10.0]
lasso_models = {}

for alpha in alphas_lasso:
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_d_train_scaled, y_d_train)
    lasso_models[alpha] = lasso

# Visualize the coefficients
plt.figure(figsize=(14, 6))

for alpha, model in lasso_models.items():
    non_zero = np.sum(np.abs(model.coef_) > 1e-10)
    plt.plot(model.coef_, alpha=0.7, marker='o', markersize=2, 
             label=f'α={alpha} ({non_zero} features)')

plt.xlabel('Feature Index')
plt.ylabel('Coefficient')
plt.title('Lasso Coefficients for Different Alphas')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='black', linestyle='--', linewidth=0.8)
plt.tight_layout()
plt.show()

# Summary table
print("\nFeature selection by Lasso:")
for alpha, model in lasso_models.items():
    non_zero = np.sum(np.abs(model.coef_) > 1e-10)
    test_r2 = r2_score(y_d_test, model.predict(X_d_test_scaled))
    print(f"Alpha={alpha:5.1f}: {non_zero:3d}/{len(model.coef_)} features, Test R²={test_r2:.4f}")

## Exercise 4: Learning Curves and Cross-Validation

**Objective**: Diagnose bias/variance problems with learning curves.

**Instructions**:
1. Use the California Housing dataset
2. Generate learning curves for Linear, Ridge, and Polynomial (degree 5) models
3. Use cross-validation to estimate performance
4. Identify bias/variance problems

In [None]:
# 1. Data preparation
X_lc = X[:5000]  # Subsample for speed
y_lc = y[:5000]

X_lc_train, X_lc_test, y_lc_train, y_lc_test = train_test_split(
    X_lc, y_lc, test_size=0.2, random_state=42
)

scaler_lc = StandardScaler()
X_lc_train_scaled = scaler_lc.fit_transform(X_lc_train)
X_lc_test_scaled = scaler_lc.transform(X_lc_test)

In [None]:
# 2. Generate the learning curves
models_lc = {
    'Linear': LinearRegression(),
    'Ridge (α=10)': Ridge(alpha=10.0),
    'Polynomial (d=5)': Pipeline([
        ('poly', PolynomialFeatures(degree=5)),
        ('linear', LinearRegression())
    ])
}

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, (name, model) in enumerate(models_lc.items()):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X_lc_train_scaled, y_lc_train,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5,
        scoring='neg_mean_squared_error',
        n_jobs=-1
    )
    
    # Convert to RMSE
    train_scores = np.sqrt(-train_scores)
    val_scores = np.sqrt(-val_scores)
    
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)
    
    axes[idx].plot(train_sizes, train_mean, 'o-', label='Train', linewidth=2)
    axes[idx].fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.2)
    
    axes[idx].plot(train_sizes, val_mean, 's-', label='Validation', linewidth=2)
    axes[idx].fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.2)
    
    axes[idx].set_xlabel('Training Set Size')
    axes[idx].set_ylabel('RMSE')
    axes[idx].set_title(f'Learning Curve: {name}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# 3. Cross-validation
print("Cross-validation scores (5-fold):")
print("="*60)

for name, model in models_lc.items():
    scores = cross_val_score(model, X_lc_train_scaled, y_lc_train, 
                             cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
    rmse_scores = np.sqrt(-scores)
    
    print(f"\n{name}:")
    print(f"  Mean RMSE: {rmse_scores.mean():.4f} (+/- {rmse_scores.std():.4f})")
    print(f"  RMSE per fold: {[f'{s:.4f}' for s in rmse_scores]}")
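
In the cells above the data is standardized once, before cross-validation, so each validation fold has already contributed to the scaler's mean and variance. The effect is usually small, but a stricter setup places the scaler inside the estimator so it is refit on the training folds only. A minimal sketch, assuming the bundled Diabetes data as a stand-in (no download required) and hypothetical names `X_cv`, `y_cv`, and `model_cv`:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_diabetes(return_X_y=True)

# The scaler is part of the estimator, so each CV split fits it on
# the training folds and only applies it to the validation fold
model_cv = make_pipeline(StandardScaler(), Ridge(alpha=10.0))

scores = cross_val_score(
    model_cv, X_cv, y_cv, cv=5, scoring='neg_root_mean_squared_error'
)
rmse_per_fold = -scores  # the scorer returns negated RMSE

print(f"Mean RMSE: {rmse_per_fold.mean():.4f} (+/- {rmse_per_fold.std():.4f})")
```

The same pipeline object can be passed to `learning_curve` in place of a bare model.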

In [None]:
# 4. Bias-variance diagnosis
print("\n" + "="*60)
print("BIAS-VARIANCE DIAGNOSIS")
print("="*60)

print("""
1. Linear Regression:
   - High train RMSE, close to the validation RMSE
   - The curves converge quickly
   - Diagnosis: UNDERFITTING (high bias)
   - Remedy: add features or use a more complex model

2. Ridge (α=10):
   - Train RMSE slightly higher than Linear
   - Validation RMSE similar or slightly better
   - Diagnosis: GOOD TRADE-OFF (appropriate regularization)
   - Remedy: tune alpha for a small further improvement

3. Polynomial (d=5):
   - Very low train RMSE
   - Large gap between train and validation RMSE
   - The curves do not converge
   - Diagnosis: OVERFITTING (high variance)
   - Remedy: regularize or reduce the model complexity
""")

print("Key points:")
print("- High bias: train and validation RMSE both high and close to each other")
print("- High variance: low train RMSE, high validation RMSE, large gap")
print("- Good model: train and validation RMSE close and relatively low")

## Summary

### Key points covered

1. **Linear Regression**
   - Exploratory data analysis
   - Training and evaluation (MSE, RMSE, R², MAE)
   - Residual diagnostics
   - Feature importance

2. **Polynomial Regression**
   - Impact of the degree on performance
   - Visualizing overfitting
   - Bias-variance trade-off

3. **Regularization**
   - Ridge (L2): penalty on the magnitude of the coefficients
   - Lasso (L1): automatic feature selection
   - ElasticNet: combination of L1 and L2
   - Hyperparameter tuning

4. **Diagnostics and Validation**
   - Learning curves to detect bias/variance
   - Cross-validation to estimate performance
   - Identifying problems and remedies
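
For reference, the three penalties can be written as the objectives that scikit-learn documents for `Ridge`, `Lasso`, and `ElasticNet`. The notation is an assumption of this sketch: $n$ is the number of samples, $w$ the coefficient vector, $\alpha$ the `alpha` parameter, and $\rho$ the `l1_ratio` parameter.

```latex
\begin{aligned}
\text{Ridge:} \quad & \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} \quad & \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{ElasticNet:} \quad & \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2
    + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
```

The data-fit term is scaled by $1/(2n)$ for Lasso and ElasticNet but not for Ridge, so the same numeric `alpha` does not represent the same amount of regularization across the three models.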

### Practical recommendations

1. Always explore the data before modeling
2. Standardize the features, especially for regularized regression
3. Analyze the residuals to check the model assumptions
4. Use cross-validation to evaluate
5. Choose the regularization that fits the problem:
   - Ridge: multicollinearity, all features useful
   - Lasso: feature selection, sparsity
   - ElasticNet: compromise between Ridge and Lasso

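### Regression metrics

- **MSE**: Sensitive to outliers; expressed in squared units of the target
- **RMSE**: Sensitive to outliers; same unit as the target
- **MAE**: Less sensitive to outliers; same unit as the target
- **R²**: Proportion of variance explained (at most 1; negative when the model does worse than predicting the mean)
- **Adjusted R²**: Penalizes the number of features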
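
scikit-learn has no built-in adjusted R², so it is usually computed from R², the sample count `n`, and the feature count `p` as `1 - (1 - R²)(n - 1)/(n - p - 1)`. The sketch below spells out all the metrics listed above in plain Python; the helper name `regression_metrics` and the four-point toy data are assumptions made for illustration.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, RMSE, MAE, R² and adjusted R² for paired sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n

    # R² compares the model's squared error with that of predicting the mean
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    # Adjusted R² shrinks R² as features are added (requires n > p + 1)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {
        'MSE': mse,
        'RMSE': math.sqrt(mse),
        'MAE': mae,
        'R2': r2,
        'Adjusted R2': adj_r2,
    }


metrics = regression_metrics(
    [3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0], n_features=1
)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On this toy data the MSE is 0.375 and the MAE is 0.5; adjusted R² is always at most R², and the gap widens as `n_features` approaches `n`.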