# 🚀 Google Colab Setup

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ogautier1980/sandbox-ml/blob/main/cours/14_best_practices/14_demo_pipeline_complet.ipynb)

**If you are running this notebook on Google Colab**, run the following cell to install the dependencies.

In [None]:
# Install dependencies (Google Colab only)
import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('📦 Installing packages...')

    # Core ML packages
    !pip install -q numpy pandas matplotlib seaborn scikit-learn

    # Detect the chapter and install its specific dependencies
    notebook_name = '14_demo_pipeline_complet.ipynb'  # Will be replaced automatically

    # Ch 06-08: Deep Learning
    if any(x in notebook_name for x in ['06_', '07_', '08_']):
        !pip install -q torch torchvision torchaudio

    # Ch 08: NLP
    if '08_' in notebook_name:
        !pip install -q transformers datasets tokenizers
        if 'rag' in notebook_name:
            !pip install -q sentence-transformers faiss-cpu rank-bm25

    # Ch 09: Reinforcement Learning
    if '09_' in notebook_name:
        !pip install -q gymnasium[classic-control]

    # Ch 04: Boosting
    if '04_' in notebook_name and 'boosting' in notebook_name:
        !pip install -q xgboost lightgbm catboost

    # Ch 05: Advanced clustering
    if '05_' in notebook_name:
        !pip install -q umap-learn

    # Ch 11: Time series
    if '11_' in notebook_name:
        !pip install -q statsmodels prophet

    # Ch 12: Advanced vision
    if '12_' in notebook_name:
        !pip install -q ultralytics timm segmentation-models-pytorch

    # Ch 13: Recommendation
    if '13_' in notebook_name:
        !pip install -q scikit-surprise implicit

    # Ch 14: MLOps
    if '14_' in notebook_name:
        !pip install -q mlflow fastapi pydantic

    print('✅ Installation complete!')
else:
    print('ℹ️  Local environment detected, packages are already installed.')

# Demo: Complete ML Pipeline with Scikit-Learn

This notebook illustrates **best practices** for building a complete, robust ML pipeline:

1. **Loading and EDA**: data exploration
2. **Feature Engineering**: creating relevant features
3. **scikit-learn Pipeline**: preprocessing + model
4. **Cross-Validation**: robust evaluation
5. **Hyperparameter Tuning**: GridSearchCV
6. **Persistence**: saving the model (joblib/pickle)

**Dataset**: Titanic (survival prediction) or California Housing (regression); this notebook uses California Housing.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import joblib
import pickle
import warnings
warnings.filterwarnings('ignore')

# Plot configuration
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")

## 1. Data Loading and Exploration (EDA)

We use the **California Housing** dataset (regression).

In [None]:
# Load the dataset
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)  # type: ignore
df['MedHouseVal'] = housing.target  # Target: median house value (in units of $100k)  # type: ignore

print("California Housing dataset loaded!")
print(f"Shape: {df.shape}")
print("\nFirst rows:")
df.head()

In [None]:
# Dataset information
print("Dataset information:")
df.info()  # info() prints directly and returns None, so it is not wrapped in print()
print("\nDescriptive statistics:")
df.describe()

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

In [None]:
# Visualize the target distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(df['MedHouseVal'], bins=50, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Median Price ($100k)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Price Distribution')
axes[0].axvline(df['MedHouseVal'].mean(), color='red', linestyle='--', label=f'Mean: {df["MedHouseVal"].mean():.2f}')
axes[0].legend()

axes[1].boxplot(df['MedHouseVal'], vert=True)
axes[1].set_ylabel('Median Price ($100k)')
axes[1].set_title('Price Boxplot')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Mean price: ${df['MedHouseVal'].mean() * 100:.0f}k")
print(f"Median price: ${df['MedHouseVal'].median() * 100:.0f}k")
print(f"Standard deviation: ${df['MedHouseVal'].std() * 100:.0f}k")

In [None]:
# Correlation matrix
plt.figure(figsize=(10, 8))
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

print("\nCorrelations with the target (MedHouseVal):")
print(corr_matrix['MedHouseVal'].sort_values(ascending=False))

In [None]:
# Scatter plots of the features most correlated with the target
top_features = corr_matrix['MedHouseVal'].abs().sort_values(ascending=False)[1:5].index

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for i, feature in enumerate(top_features):
    axes[i].scatter(df[feature], df['MedHouseVal'], alpha=0.3, s=10)
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('MedHouseVal')
    axes[i].set_title(f'{feature} vs MedHouseVal (r={corr_matrix.loc[feature, "MedHouseVal"]:.3f})')
    
    # Simple linear regression line
    z = np.polyfit(df[feature], df['MedHouseVal'], 1)
    p = np.poly1d(z)
    axes[i].plot(df[feature], p(df[feature]), "r--", alpha=0.8, linewidth=2)

plt.tight_layout()
plt.show()

## 2. Feature Engineering

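We create derived features to improve the model. These transformations are computed row by row with no fitted statistics, so applying them before the train/test split does not leak information.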

In [None]:
# Feature Engineering
def add_features(X):
    """Add derived features."""
    X = X.copy()
    
    # Rooms per occupant: AveRooms is already a per-household average,
    # so dividing by AveOccup (average household members) gives a per-person ratio
    X['RoomsPerHousehold'] = X['AveRooms'] / X['AveOccup']
    
    # Bedrooms per occupant (same reasoning as above)
    X['BedroomsPerHousehold'] = X['AveBedrms'] / X['AveOccup']
    
    # Approximate number of households in the block group (population / average occupancy)
    X['PopulationPerHousehold'] = X['Population'] / X['AveOccup']
    
    # Crude density proxy: population scaled by |latitude| + |longitude| (not an area-based density)
    X['PopulationDensity'] = X['Population'] / (X['Latitude'].abs() + X['Longitude'].abs())
    
    # House age category
    X['HouseAgeCategory'] = pd.cut(X['HouseAge'], bins=[0, 10, 30, 100], labels=['New', 'Mid', 'Old'])
    
    return X

# Apply the feature engineering
df_engineered = add_features(df)

print("Feature engineering applied!")
print("\nNew features:")
print(df_engineered[['RoomsPerHousehold', 'BedroomsPerHousehold', 
                      'PopulationPerHousehold', 'PopulationDensity', 'HouseAgeCategory']].head())

print(f"\nShape after feature engineering: {df_engineered.shape}")
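
A stateless function like the one above can also live inside the pipeline by wrapping it in `FunctionTransformer`, which the import cell already brings in. The sketch below is only one possible arrangement, not the notebook's method: `add_ratio_features`, the `RoomsPerOccupant` column, and the small `toy` frame with made-up values are hypothetical.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def add_ratio_features(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stateless helper: adds one ratio column, row by row."""
    X = X.copy()
    X["RoomsPerOccupant"] = X["AveRooms"] / X["AveOccup"]
    return X


# Toy data standing in for the housing frame (values are made up).
toy = pd.DataFrame({"AveRooms": [6.0, 4.0, 5.0], "AveOccup": [3.0, 2.0, 4.0]})

feature_pipeline = Pipeline(steps=[
    ("engineer", FunctionTransformer(add_ratio_features)),  # stateless step
    ("scaler", StandardScaler()),                           # fitted step
])

# Two input columns plus the derived ratio, all standardized.
transformed = feature_pipeline.fit_transform(toy)
print(transformed.shape)
```

With this layout the saved pipeline accepts raw columns at prediction time, so the caller does not have to remember to run the feature engineering first.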

## 3. Train/Test Split and Data Preparation

In [None]:
# Separate features and target
X = df_engineered.drop('MedHouseVal', axis=1)
y = df_engineered['MedHouseVal']

# Split train/test (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Split Train/Test:")
print(f"  Train: {X_train.shape}")
print(f"  Test:  {X_test.shape}")
print("\nTarget distribution:")
print(f"  Train - Mean: {y_train.mean():.3f}, Std: {y_train.std():.3f}")
print(f"  Test  - Mean: {y_test.mean():.3f}, Std: {y_test.std():.3f}")

## 4. Building the Scikit-Learn Pipeline

A **Pipeline** encapsulates preprocessing and model in a single object, which avoids data leakage and makes deployment to production easier.

In [None]:
# Identify numeric and categorical columns
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

print(f"Numeric features ({len(numeric_features)}): {numeric_features}")
print(f"\nCategorical features ({len(categorical_features)}): {categorical_features}")

In [None]:
# Preprocessing for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # Impute missing values
    ('scaler', StandardScaler())  # Standardization
])

# Preprocessing for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))  # One-hot encoding
])

# ColumnTransformer: applies type-specific transformations
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='drop'  # Unlisted columns are dropped
)

print("Preprocessor created successfully!")
print("\nTransformations:")
print("  - Numeric: imputation (median) + StandardScaler")
print("  - Categorical: imputation (mode) + OneHotEncoder")

In [None]:
# Full pipeline: preprocessing + model
pipeline_rf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1))
])

print("Pipeline created successfully!")
print("\nPipeline steps:")
for i, (name, step) in enumerate(pipeline_rf.steps, 1):
    print(f"  {i}. {name}: {step.__class__.__name__}")
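
The leak a Pipeline avoids is a scaler (or imputer) learning its statistics from rows that are later used for validation. The following sketch uses synthetic data and a hypothetical two-step pipeline, not the housing pipeline above, to show that when the whole pipeline is refit per fold, the scaler only ever sees that fold's training rows.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for any numeric dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

pipe = Pipeline(steps=[("scaler", StandardScaler()), ("model", Ridge(alpha=1.0))])

fold_means = []
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in splitter.split(X_demo):
    pipe.fit(X_demo[train_idx], y_demo[train_idx])
    # The scaler statistics come from the training rows of this fold only.
    fold_means.append(pipe.named_steps["scaler"].mean_.copy())

# What a scaler fitted once on all rows (the leaky variant) would learn.
global_mean = X_demo.mean(axis=0)
for i, fold_mean in enumerate(fold_means, 1):
    print(f"fold {i}: {np.round(fold_mean, 3)} vs global {np.round(global_mean, 3)}")
```

`cross_val_score` and `GridSearchCV` do the same thing internally: they clone and refit the full pipeline on each training fold.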

## 5. Cross-Validation

In [None]:
# 5-fold cross-validation
print("=" * 60)
print("CROSS-VALIDATION (5-FOLD)")
print("=" * 60)

cv_scores_r2 = cross_val_score(pipeline_rf, X_train, y_train, cv=5, scoring='r2', n_jobs=-1)
cv_scores_rmse = -cross_val_score(pipeline_rf, X_train, y_train, cv=5, 
                                   scoring='neg_root_mean_squared_error', n_jobs=-1)
cv_scores_mae = -cross_val_score(pipeline_rf, X_train, y_train, cv=5, 
                                  scoring='neg_mean_absolute_error', n_jobs=-1)

print(f"\nR² Scores: {cv_scores_r2}")
print(f"  Mean: {cv_scores_r2.mean():.4f} (+/- {cv_scores_r2.std() * 2:.4f})")

print(f"\nRMSE Scores: {cv_scores_rmse}")
print(f"  Mean: {cv_scores_rmse.mean():.4f} (+/- {cv_scores_rmse.std() * 2:.4f})")

print(f"\nMAE Scores: {cv_scores_mae}")
print(f"  Mean: {cv_scores_mae.mean():.4f} (+/- {cv_scores_mae.std() * 2:.4f})")

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

axes[0].bar(range(1, 6), cv_scores_r2, color='steelblue', alpha=0.7, edgecolor='black')
axes[0].axhline(cv_scores_r2.mean(), color='red', linestyle='--', label=f'Mean: {cv_scores_r2.mean():.4f}')
axes[0].set_xlabel('Fold')
axes[0].set_ylabel('R² Score')
axes[0].set_title('Cross-Validation - R²')
axes[0].legend()
axes[0].set_ylim([0, 1])

axes[1].bar(range(1, 6), cv_scores_rmse, color='coral', alpha=0.7, edgecolor='black')
axes[1].axhline(cv_scores_rmse.mean(), color='red', linestyle='--', label=f'Mean: {cv_scores_rmse.mean():.4f}')
axes[1].set_xlabel('Fold')
axes[1].set_ylabel('RMSE')
axes[1].set_title('Cross-Validation - RMSE')
axes[1].legend()

axes[2].bar(range(1, 6), cv_scores_mae, color='lightgreen', alpha=0.7, edgecolor='black')
axes[2].axhline(cv_scores_mae.mean(), color='red', linestyle='--', label=f'Mean: {cv_scores_mae.mean():.4f}')
axes[2].set_xlabel('Fold')
axes[2].set_ylabel('MAE')
axes[2].set_title('Cross-Validation - MAE')
axes[2].legend()

plt.tight_layout()
plt.show()
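
The cell above calls `cross_val_score` three times, so every fold is fitted three times. As a possible alternative, `cross_validate` accepts several scorers and fits each fold once. This is a sketch on synthetic data with a small stand-in model, not the housing pipeline; the dictionary keys `r2`, `rmse`, and `mae` are arbitrary labels chosen here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in data and a small forest to keep the example fast.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = RandomForestRegressor(n_estimators=20, random_state=0)

# One label per metric; error metrics use scikit-learn's negated scorers.
scoring = {
    "r2": "r2",
    "rmse": "neg_root_mean_squared_error",
    "mae": "neg_mean_absolute_error",
}
cv_results = cross_validate(model, X_demo, y_demo, cv=5, scoring=scoring)

# Results are stored under "test_<label>"; flip the sign of the negated errors.
r2_scores = cv_results["test_r2"]
rmse_scores = -cv_results["test_rmse"]
mae_scores = -cv_results["test_mae"]

print(f"R2 mean:   {r2_scores.mean():.4f}")
print(f"RMSE mean: {rmse_scores.mean():.4f}")
print(f"MAE mean:  {mae_scores.mean():.4f}")
```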

## 6. Hyperparameter Tuning with GridSearchCV

In [None]:
# Parameter grid to test
param_grid = {
    'regressor__n_estimators': [50, 100, 200],
    'regressor__max_depth': [10, 20, None],
    'regressor__min_samples_split': [2, 5, 10],
    'regressor__min_samples_leaf': [1, 2, 4]
}

print("=" * 60)
print("GRID SEARCH CV")
print("=" * 60)
print(f"\nNumber of combinations: {np.prod([len(v) for v in param_grid.values()])}")
print("Parameters to test:")
for param, values in param_grid.items():
    print(f"  {param}: {values}")

# GridSearchCV
grid_search = GridSearchCV(
    pipeline_rf,
    param_grid,
    cv=3,
    scoring='r2',
    n_jobs=-1,
    verbose=1
)

print("\nStarting Grid Search...")
grid_search.fit(X_train, y_train)

print("\nGrid Search finished!")
print("\nBest parameters:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")
print(f"\nBest score (R² CV): {grid_search.best_score_:.4f}")
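
Before launching a grid search it helps to estimate its cost. The sketch below counts the candidates for the same grid with `ParameterGrid` and multiplies by the number of folds; the extra fit accounts for `GridSearchCV` refitting the best candidate on the full training set (`refit=True` is the default). The variable names are chosen here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
param_grid = {
    "regressor__n_estimators": [50, 100, 200],
    "regressor__max_depth": [10, 20, None],
    "regressor__min_samples_split": [2, 5, 10],
    "regressor__min_samples_leaf": [1, 2, 4],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 3 * 3 combinations
cv_folds = 3
n_fits = n_candidates * cv_folds + 1           # +1 for the final refit

print(f"Candidates: {n_candidates}")  # 81
print(f"Total fits: {n_fits}")        # 244
```

When this number grows too large, `RandomizedSearchCV` is a common way to sample the grid instead of enumerating it.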

In [None]:
# Retrieve the best model
best_pipeline = grid_search.best_estimator_

# Evaluate on the test set
y_pred_test = best_pipeline.predict(X_test)

r2_test = r2_score(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
mae_test = mean_absolute_error(y_test, y_pred_test)

print("=" * 60)
print("TEST SET EVALUATION")
print("=" * 60)
print(f"R² Score:  {r2_test:.4f}")
print(f"RMSE:      {rmse_test:.4f} (${rmse_test * 100:.0f}k)")
print(f"MAE:       {mae_test:.4f} (${mae_test * 100:.0f}k)")
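
To make the three metrics concrete, here is a small sketch that computes them by hand with NumPy and checks the results against scikit-learn. The four values in `y_true` and `y_pred` are made up for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions, in the same $100k units as MedHouseVal.
y_true = np.array([2.0, 1.5, 3.0, 4.5])
y_pred = np.array([2.5, 1.0, 3.0, 4.0])

errors = y_true - y_pred

# MAE: average absolute error.
mae_manual = np.mean(np.abs(errors))
# RMSE: square root of the average squared error (penalizes large errors more).
rmse_manual = np.sqrt(np.mean(errors ** 2))
# R2: 1 minus residual sum of squares over total sum of squares.
r2_manual = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE:  {mae_manual:.4f} vs sklearn {mean_absolute_error(y_true, y_pred):.4f}")
print(f"RMSE: {rmse_manual:.4f} vs sklearn {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
print(f"R2:   {r2_manual:.4f} vs sklearn {r2_score(y_true, y_pred):.4f}")
```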

In [None]:
# Visualize the predictions
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Scatter plot: actual vs predicted values
axes[0].scatter(y_test, y_pred_test, alpha=0.5, s=20)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0].set_xlabel('Actual Values')
axes[0].set_ylabel('Predicted Values')
axes[0].set_title(f'Predictions vs Actual (R² = {r2_test:.4f})')
axes[0].grid(True, alpha=0.3)

# Histogram of residuals
residuals = y_test - y_pred_test
axes[1].hist(residuals, bins=50, edgecolor='black', alpha=0.7)
axes[1].axvline(0, color='red', linestyle='--', linewidth=2)
axes[1].set_xlabel('Residuals')
axes[1].set_ylabel('Frequency')
axes[1].set_title(f'Residual Distribution (Mean: {residuals.mean():.4f})')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Feature Importance

In [None]:
# Extract the RandomForest model from the pipeline
rf_model = best_pipeline.named_steps['regressor']

# Feature importance (after preprocessing)
# Note: the preprocessor transforms the features, so the names no longer match the input columns exactly
feature_importance = rf_model.feature_importances_

# Approximate feature names (numeric + one-hot encoded categorical)
feature_names_approx = numeric_features.copy()
if len(categorical_features) > 0:
    # OneHotEncoder creates new features
    cat_encoder = best_pipeline.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot']
    cat_feature_names = cat_encoder.get_feature_names_out(categorical_features)
    feature_names_approx.extend(cat_feature_names)

# Sort by importance
indices = np.argsort(feature_importance)[::-1]
top_n = min(15, len(feature_importance))  # never request more features than exist

print("=" * 60)
print(f"TOP {top_n} MOST IMPORTANT FEATURES")
print("=" * 60)
for i, idx in enumerate(indices[:top_n], 1):
    feature_name = feature_names_approx[idx] if idx < len(feature_names_approx) else f"Feature_{idx}"
    print(f"{i:2d}. {feature_name:30s} : {feature_importance[idx]:.4f}")

# Visualization
plt.figure(figsize=(10, 8))
top_features = [feature_names_approx[i] if i < len(feature_names_approx) else f"Feature_{i}" for i in indices[:top_n]]
top_importance = feature_importance[indices[:top_n]]

plt.barh(range(top_n), top_importance, color='steelblue', edgecolor='black', alpha=0.7)
plt.yticks(range(top_n), top_features)
plt.xlabel('Importance')
plt.title(f'Top {top_n} Features - Feature Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

## 8. Persistence: Saving the Model

We save the full pipeline (preprocessing + model) for production use.

In [None]:
import os

# Create the output directory
model_dir = '/tmp/models'
os.makedirs(model_dir, exist_ok=True)

# Save with joblib (recommended for scikit-learn)
model_path_joblib = os.path.join(model_dir, 'housing_pipeline.joblib')
joblib.dump(best_pipeline, model_path_joblib)
print(f"Model saved (joblib): {model_path_joblib}")
print(f"  Size: {os.path.getsize(model_path_joblib) / 1024:.2f} KB")

# Save with pickle (alternative)
model_path_pickle = os.path.join(model_dir, 'housing_pipeline.pkl')
with open(model_path_pickle, 'wb') as f:
    pickle.dump(best_pipeline, f)
print(f"\nModel saved (pickle): {model_path_pickle}")
print(f"  Size: {os.path.getsize(model_path_pickle) / 1024:.2f} KB")

In [None]:
# Load the model from disk
loaded_pipeline = joblib.load(model_path_joblib)

# Test a prediction with the loaded model
sample_data = X_test.head(5)
predictions = loaded_pipeline.predict(sample_data)

print("=" * 60)
print("LOADED MODEL TEST")
print("=" * 60)
print("\nData sample:")
print(sample_data)
print("\nPredictions:")
for i, (true_val, pred_val) in enumerate(zip(y_test.head(5), predictions), 1):
    print(f"  {i}. True: ${true_val * 100:.0f}k, Predicted: ${pred_val * 100:.0f}k, Error: ${(true_val - pred_val) * 100:+.0f}k")

print("\nModel loaded and tested successfully!")
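
A quick way to gain confidence in a saved artifact is a round-trip check: dump the pipeline, reload it, and confirm that both objects predict the same values. The sketch below does this on synthetic data with a small stand-in pipeline and a temporary directory; the file name `demo_pipeline.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline fitted on synthetic data.
X_demo, y_demo = make_regression(n_samples=50, n_features=4, noise=5.0, random_state=0)
pipe = Pipeline(steps=[("scaler", StandardScaler()), ("model", Ridge())])
pipe.fit(X_demo, y_demo)

# Dump and reload inside a temporary directory that is cleaned up afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.joblib")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same_predictions = np.allclose(pipe.predict(X_demo), restored.predict(X_demo))
print(f"Round trip preserved predictions: {same_predictions}")
```

Both joblib and pickle files can execute arbitrary code when loaded, so only load artifacts from a trusted source, and preferably with the same scikit-learn version that produced them.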

## 9. Comparison with Other Models

In [None]:
# Comparison with Ridge and GradientBoosting
models = {
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'Ridge': Ridge(alpha=1.0)
}

results = []

print("=" * 60)
print("MODEL COMPARISON")
print("=" * 60)

for name, model in models.items():
    # Pipeline with preprocessing
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])
    
    # Training
    pipeline.fit(X_train, y_train)
    
    # Predictions
    y_pred = pipeline.predict(X_test)
    
    # Metrics
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    
    results.append({
        'Model': name,
        'R²': r2,
        'RMSE': rmse,
        'MAE': mae
    })
    
    print(f"\n{name}:")
    print(f"  R²:   {r2:.4f}")
    print(f"  RMSE: {rmse:.4f}")
    print(f"  MAE:  {mae:.4f}")

# Results DataFrame
results_df = pd.DataFrame(results)
print("\n" + "=" * 60)
print(results_df.to_string(index=False))

In [None]:
# Visualize the comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

axes[0].bar(results_df['Model'], results_df['R²'], color=['steelblue', 'coral', 'lightgreen'], 
            edgecolor='black', alpha=0.7)
axes[0].set_ylabel('R² Score')
axes[0].set_title('R² Comparison')
axes[0].set_ylim([0, 1])
axes[0].tick_params(axis='x', rotation=15)

axes[1].bar(results_df['Model'], results_df['RMSE'], color=['steelblue', 'coral', 'lightgreen'], 
            edgecolor='black', alpha=0.7)
axes[1].set_ylabel('RMSE')
axes[1].set_title('RMSE Comparison (lower is better)')
axes[1].tick_params(axis='x', rotation=15)

axes[2].bar(results_df['Model'], results_df['MAE'], color=['steelblue', 'coral', 'lightgreen'], 
            edgecolor='black', alpha=0.7)
axes[2].set_ylabel('MAE')
axes[2].set_title('MAE Comparison (lower is better)')
axes[2].tick_params(axis='x', rotation=15)

plt.tight_layout()
plt.show()
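
The comparison above ranks models on a single test split. A complementary approach, sketched here on synthetic data with two stand-in models, is to compare cross-validated scores and to give each pipeline its own copy of the preprocessor with `clone`, so that no fitted state is shared between candidates. The names `shared_preprocessor`, `candidates`, and `summary` are chosen for this example.

```python
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mostly linear data (assumption: stands in for the training set).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
shared_preprocessor = StandardScaler()

candidates = {
    "Ridge": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=20, random_state=0),
}

summary = {}
for name, estimator in candidates.items():
    pipe = Pipeline(steps=[
        ("preprocessor", clone(shared_preprocessor)),  # fresh, unfitted copy
        ("regressor", estimator),
    ])
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")
    summary[name] = (scores.mean(), scores.std())

for name, (mean_r2, std_r2) in summary.items():
    print(f"{name}: R2 = {mean_r2:.4f} (+/- {2 * std_r2:.4f})")
```

Reporting a mean with its spread makes it easier to tell whether a gap between two models is larger than the fold-to-fold noise.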

## 10. Conclusion

### Key Points of the Pipeline

1. **EDA**: understand the data before modeling
2. **Feature Engineering**: create relevant features
3. **scikit-learn Pipeline**: encapsulates preprocessing + model
   - Avoids data leakage
   - Eases deployment to production
   - Reproducible
4. **Cross-Validation**: robust performance evaluation
5. **Hyperparameter Tuning**: systematic optimization
6. **Persistence**: saving with joblib/pickle

### Best Practices

- Always split train/test **before** fitting any transformation (stateless, row-wise feature engineering such as `add_features` can safely run earlier)
- Use Pipelines to encapsulate every step
- Validate with cross-validation, not only a single train/test split
- Save the full pipeline, not just the model
- Version both models and data
- Document preprocessing and hyperparameter choices
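
For the versioning and documentation points, one lightweight option is to store a small metadata record next to each saved model. The sketch below is an assumption about what such a record could contain: `build_model_metadata` is a hypothetical helper, and the parameter values passed to it are placeholders rather than results from this notebook.

```python
import json
import platform
from datetime import datetime, timezone

import sklearn


def build_model_metadata(model_file, params):
    """Hypothetical helper: collect the facts needed to reload a model later."""
    return {
        "model_file": model_file,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "sklearn_version": sklearn.__version__,
        "params": params,
    }


# Placeholder parameters for illustration only.
metadata = build_model_metadata(
    "housing_pipeline.joblib",
    {"regressor__n_estimators": 100, "regressor__max_depth": None},
)

# Serialize to JSON; this string could be written next to the model file.
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```

In this notebook, `grid_search.best_params_` could be passed as `params`, since its values come straight from the grid and are JSON-serializable.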

### Generated Files

- `/tmp/models/housing_pipeline.joblib`: full pipeline (recommended)
- `/tmp/models/housing_pipeline.pkl`: full pipeline (alternative)