# 🚀 Google Colab Setup

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ogautier1980/sandbox-ml/blob/main/cours/03_regression/03_demo_regression_lineaire.ipynb)

**If you are running this notebook on Google Colab**, run the following cell to install the dependencies.

In [None]:
# Install dependencies (Google Colab only)
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('📦 Installing packages...')

    # Core ML packages
    !pip install -q numpy pandas matplotlib seaborn scikit-learn

    # Detect the chapter and install its specific dependencies
    notebook_name = '03_demo_regression_lineaire.ipynb'  # Will be replaced automatically

    # Ch 06-08: Deep Learning
    if any(x in notebook_name for x in ['06_', '07_', '08_']):
        !pip install -q torch torchvision torchaudio

    # Ch 08: NLP
    if '08_' in notebook_name:
        !pip install -q transformers datasets tokenizers
        if 'rag' in notebook_name:
            !pip install -q sentence-transformers faiss-cpu rank-bm25

    # Ch 09: Reinforcement Learning
    if '09_' in notebook_name:
        !pip install -q gymnasium[classic-control]

    # Ch 04: Boosting
    if '04_' in notebook_name and 'boosting' in notebook_name:
        !pip install -q xgboost lightgbm catboost

    # Ch 05: Advanced clustering
    if '05_' in notebook_name:
        !pip install -q umap-learn

    # Ch 11: Time series
    if '11_' in notebook_name:
        !pip install -q statsmodels prophet

    # Ch 12: Advanced vision
    if '12_' in notebook_name:
        !pip install -q ultralytics timm segmentation-models-pytorch

    # Ch 13: Recommendation
    if '13_' in notebook_name:
        !pip install -q scikit-surprise implicit

    # Ch 14: MLOps
    if '14_' in notebook_name:
        !pip install -q mlflow fastapi pydantic

    print('✅ Installation complete!')
else:
    print('ℹ️  Local environment detected, packages are already installed.')

# Chapter 03 - Linear Regression

**Objectives:**
- Understand and implement simple linear regression
- Master multiple linear regression
- Use polynomial regression
- Diagnose residuals
- Evaluate performance (R², MSE, RMSE)

**Prerequisites:** Chapters 00, 01, 02

In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes, make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.pipeline import Pipeline
import scipy.stats as stats

# Configuration
np.random.seed(42)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ Imports successful")

## 1. Simple Linear Regression

### 1.1 Generating synthetic data

Let's create data with a linear relationship: $y = 3x + 5 + \epsilon$, where $\epsilon$ is Gaussian noise with standard deviation 2.

In [None]:
# Generate synthetic data
n_samples = 100
X_simple = np.linspace(0, 10, n_samples).reshape(-1, 1)
y_true = 3 * X_simple.ravel() + 5
noise = np.random.normal(0, 2, n_samples)
y_simple = y_true + noise

# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(X_simple, y_simple, alpha=0.6, label='Observed data')
plt.plot(X_simple, y_true, 'r--', label='True relationship (y=3x+5)', linewidth=2)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple Linear Regression - Synthetic Data')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Number of samples: {n_samples}")
print(f"Range of X: [{X_simple.min():.2f}, {X_simple.max():.2f}]")
print(f"Range of y: [{y_simple.min():.2f}, {y_simple.max():.2f}]")

### 1.2 Training the model with scikit-learn

In [None]:
# Create and train the model
model_simple = LinearRegression()
model_simple.fit(X_simple, y_simple)

# Predictions
y_pred = model_simple.predict(X_simple)

# Learned parameters
w1 = model_simple.coef_[0]
w0 = model_simple.intercept_

print("=== Learned Parameters ===")
print(f"Slope (w1): {w1:.4f} (true value: 3.0)")
print(f"Intercept (w0): {w0:.4f} (true value: 5.0)")
print(f"\nEquation: ŷ = {w1:.4f}x + {w0:.4f}")

### 1.3 Evaluating the model

In [None]:
# Metrics (computed on the training data, since this example has no held-out set)
mse = mean_squared_error(y_simple, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_simple, y_pred)
r2 = r2_score(y_simple, y_pred)

print("=== Performance Metrics ===")
print(f"MSE  : {mse:.4f}")
print(f"RMSE : {rmse:.4f}")
print(f"MAE  : {mae:.4f}")
print(f"R²   : {r2:.4f}")
print(f"\nInterpretation: the model explains {r2*100:.2f}% of the variance")
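
To make the four metrics less of a black box, here is a minimal NumPy sketch of their definitions: $\text{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$, $\text{RMSE} = \sqrt{\text{MSE}}$, $\text{MAE} = \frac{1}{n}\sum_i |y_i - \hat{y}_i|$ and $R^2 = 1 - SS_{res}/SS_{tot}$. The arrays `y_obs` and `y_hat` are hypothetical toy values chosen for illustration, not data from this notebook.

```python
import numpy as np

# Hypothetical toy values, chosen only to illustrate the formulas
y_obs = np.array([3.0, 5.0, 7.5, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

errors = y_obs - y_hat

mse_np = np.mean(errors ** 2)        # mean squared error
rmse_np = np.sqrt(mse_np)            # same unit as y
mae_np = np.mean(np.abs(errors))     # mean absolute error

ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)    # total sum of squares
r2_np = 1.0 - ss_res / ss_tot                   # share of variance explained

print(f"MSE={mse_np:.4f}  RMSE={rmse_np:.4f}  MAE={mae_np:.4f}  R²={r2_np:.4f}")
```

RMSE and MAE are in the same unit as the target, which usually makes them easier to interpret than MSE.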

### 1.4 Visualizing the predictions

In [None]:
plt.figure(figsize=(12, 5))

# Subplot 1: regression line
plt.subplot(1, 2, 1)
plt.scatter(X_simple, y_simple, alpha=0.6, label='Data')
plt.plot(X_simple, y_pred, 'g-', label=f'Regression (R²={r2:.3f})', linewidth=2)
plt.plot(X_simple, y_true, 'r--', label='True relationship', linewidth=2, alpha=0.5)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Regression Line')
plt.legend()
plt.grid(True, alpha=0.3)

# Subplot 2: residuals
residus = y_simple - y_pred
plt.subplot(1, 2, 2)
plt.scatter(y_pred, residus, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--', linewidth=2)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals (y - ŷ)')
plt.title('Residual Analysis')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Mean of residuals: {residus.mean():.6f} (should be ~0)")
print(f"Standard deviation of residuals: {residus.std():.4f}")
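
The residual mean is close to zero by construction, not by luck. For a least-squares fit with an intercept, the residuals sum to zero and are orthogonal to the feature. The sketch below checks this on freshly generated data; the names `x_demo`, `y_demo` and `demo_model` are illustrative, and a separate random generator is used so the notebook's global seed is left untouched.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # local generator, independent of np.random.seed
x_demo = np.linspace(0, 10, 50).reshape(-1, 1)
y_demo = 3 * x_demo.ravel() + 5 + rng.normal(0, 2, 50)

demo_model = LinearRegression().fit(x_demo, y_demo)
demo_residuals = y_demo - demo_model.predict(x_demo)

# Both quantities should be zero up to floating-point error
residual_sum = demo_residuals.sum()
residual_dot_x = demo_residuals @ x_demo.ravel()

print(f"Sum of residuals : {residual_sum:.2e}")
print(f"Residuals · x    : {residual_dot_x:.2e}")
```

A zero mean therefore says nothing about model quality. What matters in the residual plot is whether a pattern remains.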

### 1.5 Manual implementation (to understand the math)

Let's compute the parameters with the closed-form formulas:

$$w_1 = \frac{\text{Cov}(x, y)}{\text{Var}(x)}, \quad w_0 = \bar{y} - w_1 \bar{x}$$

In [None]:
# Manual implementation
x_flat = X_simple.ravel()
x_mean = x_flat.mean()
y_mean = y_simple.mean()

# Closed-form formula (the 1/n factors of Cov and Var cancel in the ratio,
# so plain sums are enough)
covariance = np.sum((x_flat - x_mean) * (y_simple - y_mean))
variance_x = np.sum((x_flat - x_mean)**2)

w1_manual = covariance / variance_x
w0_manual = y_mean - w1_manual * x_mean

print("=== Implementation Comparison ===")
print(f"scikit-learn : w1={w1:.6f}, w0={w0:.6f}")
print(f"Manual       : w1={w1_manual:.6f}, w0={w0_manual:.6f}")
print(f"\nDifference   : w1={abs(w1-w1_manual):.10f}, w0={abs(w0-w0_manual):.10f}")
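
The same slope can be obtained from NumPy's own covariance and variance functions, as long as both use the same `ddof` so the normalizing factors cancel. The sketch below is an illustration on freshly generated data (the variable names are assumptions, not from the notebook) and cross-checks the result against `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(1)  # local generator, independent of np.random.seed
x = np.linspace(0, 10, 100)
y = 3 * x + 5 + rng.normal(0, 2, 100)

# np.cov returns a 2x2 matrix; entry [0, 1] is Cov(x, y).
# Using ddof=1 in both calls keeps the ratio consistent.
slope_cov = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept_cov = y.mean() - slope_cov * x.mean()

# Raw sums of products, without any 1/n factor
slope_sum = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# np.polyfit with deg=1 returns [slope, intercept]
slope_fit, intercept_fit = np.polyfit(x, y, deg=1)

print(f"cov/var : w1={slope_cov:.6f}, w0={intercept_cov:.6f}")
print(f"sums    : w1={slope_sum:.6f}")
print(f"polyfit : w1={slope_fit:.6f}, w0={intercept_fit:.6f}")
```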

## 2. Multiple Linear Regression

### 2.1 Diabetes dataset (scikit-learn)

Let's predict diabetes progression from 10 biomedical features.

In [None]:
# Load the dataset
diabetes = load_diabetes()
X = diabetes.data  # type: ignore
y = diabetes.target  # type: ignore
feature_names = diabetes.feature_names  # type: ignore

print("=== Diabetes Dataset ===")
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print(f"\nFeatures: {feature_names}")
print("\nTarget statistics (y):")
print(f"  Min: {y.min():.2f}")
print(f"  Max: {y.max():.2f}")
print(f"  Mean: {y.mean():.2f}")
print(f"  Std: {y.std():.2f}")

# Visualize correlations
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y

plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Correlation Matrix - Diabetes Dataset')
plt.tight_layout()
plt.show()

### 2.2 Train/Test Split and Standardization

In [None]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("=== Train/Test Split ===")
print(f"Train: {X_train.shape[0]} samples")
print(f"Test : {X_test.shape[0]} samples")

# Standardization (optional but recommended): fit on the training set only,
# then apply the same transformation to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\n✅ Features standardized (training set: mean=0, std=1)")
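
Standardization is called optional here for a reason: for plain least squares without regularization, rescaling the features changes the coefficients but not the predictions. The sketch below illustrates this with scikit-learn's `make_pipeline`, which also enforces the fit-on-train-only rule automatically. The variable names are illustrative and the split parameters simply mirror the cell above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_d, y_d = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.2, random_state=42)

# Same model, with and without standardization
raw_model = LinearRegression().fit(X_tr, y_tr)
scaled_model = make_pipeline(StandardScaler(), LinearRegression()).fit(X_tr, y_tr)

pred_raw = raw_model.predict(X_te)
pred_scaled = scaled_model.predict(X_te)

# The largest difference between the two sets of predictions
max_gap = np.max(np.abs(pred_raw - pred_scaled))
print(f"Largest prediction difference: {max_gap:.2e}")
```

Scaling does change the fit once a penalty is added (Ridge, Lasso), and it makes coefficients comparable across features.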

### 2.3 Training the model

In [None]:
# Multiple linear regression model
model_multi = LinearRegression()
model_multi.fit(X_train_scaled, y_train)

# Predictions
y_train_pred = model_multi.predict(X_train_scaled)
y_test_pred = model_multi.predict(X_test_scaled)

# Metrics
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

print("=== Model Performance ===")
print(f"Train R² : {train_r2:.4f}")
print(f"Test R²  : {test_r2:.4f}")
print(f"\nTrain RMSE : {train_rmse:.2f}")
print(f"Test RMSE  : {test_rmse:.2f}")

# Rough heuristic: a train/test R² gap below 0.1 is treated as "no overfitting"
if abs(train_r2 - test_r2) < 0.1:
    print("\n✅ No overfitting detected")
else:
    print("\n⚠️  Possible overfitting")
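
Under the hood, multiple linear regression is a least-squares problem on a design matrix: a column of ones (for the intercept) followed by the features, with solution $\hat{w} = (X^\top X)^{-1} X^\top y$ when $X^\top X$ is invertible. The sketch below solves it with `np.linalg.lstsq` and compares the result with `LinearRegression`. It is an illustration on the full, unscaled Diabetes data, not a description of scikit-learn's internal implementation.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X_d, y_d = load_diabetes(return_X_y=True)

# Design matrix: first column of ones carries the intercept
design = np.column_stack([np.ones(X_d.shape[0]), X_d])

# Least-squares solution of design @ weights ≈ y_d
weights, *_ = np.linalg.lstsq(design, y_d, rcond=None)

reference = LinearRegression().fit(X_d, y_d)

print(f"Intercept  lstsq={weights[0]:.4f}  sklearn={reference.intercept_:.4f}")
print(f"Max coefficient gap: {np.max(np.abs(weights[1:] - reference.coef_)):.2e}")
```

Using `lstsq` rather than inverting $X^\top X$ explicitly is a numerical-stability choice.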

### 2.4 Feature Importance (Coefficients)

In [None]:
# Model coefficients
coefficients = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': model_multi.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

print("=== Model Coefficients ===")
print(coefficients.to_string(index=False))
print(f"\nIntercept: {model_multi.intercept_:.2f}")

# Visualization
plt.figure(figsize=(10, 6))
plt.barh(coefficients['Feature'], coefficients['Coefficient'])
plt.xlabel('Coefficient')
plt.title('Feature Importance (Regression Coefficients)')
plt.axvline(x=0, color='k', linestyle='--', linewidth=0.8)
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print("\nInterpretation: since the features are standardized, those with the largest")
print("coefficients (in absolute value) have the most impact on the prediction.")
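
A coefficient fitted on standardized features reads as "change in the target per one standard deviation of the feature". To express it per raw unit instead, divide by the scaler's `scale_`. The sketch below checks that this recovers the coefficients of a model fitted on unscaled data. It uses the full Diabetes dataset for brevity, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X_d, y_d = load_diabetes(return_X_y=True)

scaler_d = StandardScaler().fit(X_d)
model_std = LinearRegression().fit(scaler_d.transform(X_d), y_d)
model_raw = LinearRegression().fit(X_d, y_d)

# Change in y per one standard deviation -> change in y per raw unit
coef_per_unit = model_std.coef_ / scaler_d.scale_

coefs_match = np.allclose(coef_per_unit, model_raw.coef_)
print(f"Converted coefficients match the unscaled fit: {coefs_match}")
```

Even on a common scale, strongly correlated features share their effect between them, so individual coefficients should be read with caution.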

### 2.5 Diagnostic Plots

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Predictions vs actual values
axes[0, 0].scatter(y_test, y_test_pred, alpha=0.6)
axes[0, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
                'r--', linewidth=2, label='Perfect prediction')
axes[0, 0].set_xlabel('Actual Values')
axes[0, 0].set_ylabel('Predicted Values')
axes[0, 0].set_title(f'Predicted vs Actual (Test R²={test_r2:.3f})')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Distribution of residuals
residuals_test = y_test - y_test_pred
axes[0, 1].hist(residuals_test, bins=30, edgecolor='black', alpha=0.7)
axes[0, 1].axvline(x=0, color='r', linestyle='--', linewidth=2)
axes[0, 1].set_xlabel('Residuals')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title(f'Residual Distribution (Test, μ={residuals_test.mean():.2f})')
axes[0, 1].grid(True, alpha=0.3, axis='y')

# 3. Residuals vs predictions
axes[1, 0].scatter(y_test_pred, residuals_test, alpha=0.6)
axes[1, 0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[1, 0].set_xlabel('Predicted Values')
axes[1, 0].set_ylabel('Residuals')
axes[1, 0].set_title('Residuals vs Predictions')
axes[1, 0].grid(True, alpha=0.3)

# 4. Q-Q plot (normality of residuals)
stats.probplot(residuals_test, dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('Q-Q Plot (Normality Check of Residuals)')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Reading the diagnostics:")
print("1. Points should lie close to the diagonal (plot 1)")
print("2. Residuals should be centered on 0 (plot 2)")
print("3. No pattern in the residuals (plot 3)")
print("4. Points should follow the red line (Q-Q plot, normality)")
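
A Q-Q plot is judged by eye. If a numeric complement is wanted, one option is the Shapiro-Wilk test from `scipy.stats`, sketched below on two synthetic samples (one Gaussian, one clearly skewed). These samples are illustrative stand-ins, not the notebook's residuals.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # local generator, independent of np.random.seed
samples = {
    "normal": rng.normal(0, 1, 200),        # synthetic Gaussian "residuals"
    "skewed": rng.exponential(1.0, 200),    # synthetic, clearly non-Gaussian
}

results = {}
for name, sample in samples.items():
    # shapiro returns (W statistic, p-value); W close to 1 suggests normality
    w_stat, p_value = stats.shapiro(sample)
    results[name] = (w_stat, p_value)
    print(f"{name:>6}: W={w_stat:.3f}, p={p_value:.4g}")
```

A small p-value is evidence against normality. With large samples the test flags even minor deviations, so it is best read alongside the plot rather than instead of it.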

## 3. Polynomial Regression

### 3.1 Non-Linear Data

In [None]:
# Generate data with a polynomial relationship
np.random.seed(42)
X_poly = np.linspace(-3, 3, 100).reshape(-1, 1)
y_poly_true = 0.5 * X_poly.ravel()**2 - 2*X_poly.ravel() + 1
y_poly = y_poly_true + np.random.normal(0, 1, 100)

plt.figure(figsize=(10, 6))
plt.scatter(X_poly, y_poly, alpha=0.6, label='Data')
plt.plot(X_poly, y_poly_true, 'r--', label='True relationship (polynomial)', linewidth=2)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Data with a Polynomial Relationship')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

### 3.2 Comparing different polynomial degrees

In [None]:
degrees = [1, 2, 5, 10]
colors = ['blue', 'green', 'orange', 'red']

plt.figure(figsize=(14, 4))

results = []

for i, degree in enumerate(degrees):
    # Create polynomial features
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_poly_features = poly.fit_transform(X_poly)

    # Train the model
    model = LinearRegression()
    model.fit(X_poly_features, y_poly)
    y_pred_poly = model.predict(X_poly_features)

    # Metrics (on the training data)
    r2 = r2_score(y_poly, y_pred_poly)
    rmse = np.sqrt(mean_squared_error(y_poly, y_pred_poly))
    results.append({'Degree': degree, 'R²': r2, 'RMSE': rmse})

    # Subplot
    plt.subplot(1, 4, i+1)
    plt.scatter(X_poly, y_poly, alpha=0.4, s=20)
    plt.plot(X_poly, y_pred_poly, color=colors[i], linewidth=2,
             label=f'Degree {degree}')
    plt.plot(X_poly, y_poly_true, 'k--', linewidth=1, alpha=0.5, label='True')
    plt.title(f'Degree {degree}\nR²={r2:.3f}')
    plt.xlabel('X')
    if i == 0:
        plt.ylabel('y')
    plt.legend(fontsize=8)
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Comparison table
print("\n=== Model Comparison ===")
df_results = pd.DataFrame(results)
print(df_results.to_string(index=False))

print("\nObservation: degree 2 captures the relationship well; degrees above 5 overfit.")
print("R² is computed on the training data here, so it never decreases as the degree grows.")
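
Because the table above is computed on the training data, it cannot reveal overfitting on its own. Comparing training error with error on held-out points makes the effect visible. The sketch below regenerates similar data with a local random generator (so the values differ from the cell above) and uses an illustrative 70/30 split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)  # local generator, independent of np.random.seed
x_all = np.linspace(-3, 3, 100).reshape(-1, 1)
y_all = 0.5 * x_all.ravel() ** 2 - 2 * x_all.ravel() + 1 + rng.normal(0, 1, 100)

x_tr, x_te, y_tr, y_te = train_test_split(x_all, y_all, test_size=0.3, random_state=0)

train_rmse_by_degree = {}
test_rmse_by_degree = {}
for degree in [1, 2, 5, 10]:
    poly_model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    poly_model.fit(x_tr, y_tr)
    train_rmse_by_degree[degree] = np.sqrt(mean_squared_error(y_tr, poly_model.predict(x_tr)))
    test_rmse_by_degree[degree] = np.sqrt(mean_squared_error(y_te, poly_model.predict(x_te)))
    print(f"degree {degree:>2}: train RMSE={train_rmse_by_degree[degree]:.3f}, "
          f"test RMSE={test_rmse_by_degree[degree]:.3f}")
```

Training RMSE can only go down as the degree increases, since each model contains the previous one. A test RMSE that stops improving or rises while training RMSE keeps falling is the signature of overfitting.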

### 3.3 Full Pipeline with Cross-Validation

In [None]:
# Try different degrees with cross-validation
degrees_cv = range(1, 11)
cv_scores = []

for degree in degrees_cv:
    # Pipeline: polynomial features + scaling + regression
    pipeline = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('scaler', StandardScaler()),
        ('regressor', LinearRegression())
    ])

    # 5-fold cross-validation
    scores = cross_val_score(pipeline, X_poly, y_poly, cv=5,
                             scoring='r2')
    cv_scores.append(scores.mean())

# Visualization
plt.figure(figsize=(10, 6))
plt.plot(degrees_cv, cv_scores, 'o-', linewidth=2, markersize=8)
best_degree = degrees_cv[np.argmax(cv_scores)]
plt.axvline(x=best_degree, color='r', linestyle='--', linewidth=2,
            label=f'Best degree: {best_degree}')
plt.xlabel('Polynomial Degree')
plt.ylabel('R² (Cross-Validation)')
plt.title('Selecting the Polynomial Degree by Cross-Validation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(degrees_cv)
plt.show()

print(f"\n✅ Best degree: {best_degree}")
print(f"Mean R²: {max(cv_scores):.4f}")
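
One caveat about the cell above: `X_poly` is sorted, and an integer `cv` on a regressor uses `KFold` without shuffling. Each fold is therefore a contiguous range of x, so the model is scored on a region it never saw. The sketch below compares that default with a shuffled `KFold` on similar regenerated data. The names, the random generator and the fixed degree 2 are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)  # local generator, independent of np.random.seed
x_sorted = np.linspace(-3, 3, 100).reshape(-1, 1)
y_sorted = 0.5 * x_sorted.ravel() ** 2 - 2 * x_sorted.ravel() + 1 + rng.normal(0, 1, 100)

quad_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)

contiguous = KFold(n_splits=5)                              # equivalent to cv=5 here
shuffled = KFold(n_splits=5, shuffle=True, random_state=42)

r2_contiguous = cross_val_score(quad_model, x_sorted, y_sorted, cv=contiguous, scoring="r2")
r2_shuffled = cross_val_score(quad_model, x_sorted, y_sorted, cv=shuffled, scoring="r2")

print(f"Contiguous folds: mean R² = {r2_contiguous.mean():.3f}")
print(f"Shuffled folds  : mean R² = {r2_shuffled.mean():.3f}")
```

Shuffled folds measure interpolation within the observed range, while contiguous folds measure extrapolation. Which one is appropriate depends on how the model will be used.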

## 4. Summary and Key Points

### Key takeaways:

1. **Simple linear regression**: one feature, direct closed-form solution
2. **Multiple linear regression**: several features, design matrix
3. **Polynomial regression**: transforms the features to capture non-linearities
4. **Metrics**: R², MSE, RMSE, MAE
5. **Diagnostics**: residual analysis is crucial (normality, homoscedasticity)
6. **Cross-validation**: essential for detecting overfitting

### Next step:

See **03_demo_regularisation.ipynb** for Ridge, Lasso, Elastic Net

In [None]:
print("✅ Notebook complete!")
print("\nYou have now mastered:")
print("  - Simple and multiple linear regression")
print("  - Polynomial regression")
print("  - Model evaluation and diagnostics")
print("  - Cross-validation")