# Project

## Project overview

You have joined a new team in retail banking that is currently seeing higher-than-expected default rates on personal loans. Personal loans are an important source of revenue for banks, but they carry the inherent risk that borrowers may default. A default occurs when a borrower stops making the required payments on a debt.

## Objective

The risk team is analyzing the existing loan book to forecast potential future defaults and estimate the expected loss. The main objective is to build a predictive model that estimates each customer's probability of default from their characteristics. Accurate predictions will allow the bank to set aside enough capital to cover potential losses and thereby maintain financial stability.
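Expected loss is commonly decomposed as probability of default × loss given default × exposure at default. The sketch below assumes that standard decomposition; the function name `expected_loss`, the 10% recovery rate, and the borrower figures are illustrative assumptions, not values from this project.

```python
def expected_loss(pd_default: float, exposure: float, recovery_rate: float = 0.10) -> float:
    """Expected loss = PD * LGD * EAD, with LGD = 1 - recovery_rate.

    recovery_rate is an assumed value; the project does not specify one.
    """
    if not 0.0 <= pd_default <= 1.0:
        raise ValueError("pd_default must be a probability in [0, 1]")
    if not 0.0 <= recovery_rate <= 1.0:
        raise ValueError("recovery_rate must be in [0, 1]")
    loss_given_default = 1.0 - recovery_rate
    return pd_default * loss_given_default * exposure


# Hypothetical borrower: 20% probability of default, 5,000 outstanding
loss = expected_loss(0.20, 5000.0)
print(round(loss, 2))
```

In this framing, the model built in the rest of the notebook supplies the probability of default, while the exposure would come from a column such as `loan_amt_outstanding`.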

### 1. Dataset exploration

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
# Load the CSV file
fichier = "Loan_Data.csv"
df = pd.read_csv(fichier)

In [3]:

df.head(10)

|   | customer_id | credit_lines_outstanding | loan_amt_outstanding | total_debt_outstanding | income | years_employed | fico_score | default |
|---|---|---|---|---|---|---|---|---|
| 0 | 8153374 | 0 | 5221.545193 | 3915.471226 | 78039.38546 | 5 | 605 | 0 |
| 1 | 7442532 | 5 | 1958.928726 | 8228.75252 | 26648.43525 | 2 | 572 | 1 |
| 2 | 2256073 | 0 | 3363.009259 | 2027.83085 | 65866.71246 | 4 | 602 | 0 |
| 3 | 4885975 | 0 | 4766.648001 | 2501.730397 | 74356.88347 | 5 | 612 | 0 |
| 4 | 4700614 | 1 | 1345.827718 | 1768.826187 | 23448.32631 | 6 | 631 | 0 |
| 5 | 4661159 | 0 | 5376.886873 | 7189.121298 | 85529.84591 | 2 | 697 | 0 |
| 6 | 8291909 | 1 | 3634.057471 | 7085.980095 | 68691.57707 | 6 | 722 | 0 |
| 7 | 4616950 | 4 | 3302.172238 | 13067.57021 | 50352.16821 | 3 | 545 | 1 |
| 8 | 3395789 | 0 | 2938.325123 | 1918.404472 | 53497.37754 | 4 | 676 | 0 |
| 9 | 4045948 | 0 | 5396.366774 | 5298.824524 | 92349.55399 | 2 | 447 | 0 |


In [4]:

print(df.dtypes)

customer_id                   int64
credit_lines_outstanding      int64
loan_amt_outstanding        float64
total_debt_outstanding      float64
income                      float64
years_employed                int64
fico_score                    int64
default                       int64
dtype: object


In [5]:
# Descriptive statistics (excluding 'customer_id')

print("\nDescriptive statistics:")
df.describe().drop(columns=['customer_id'])


Descriptive statistics:


|   | credit_lines_outstanding | loan_amt_outstanding | total_debt_outstanding | income | years_employed | fico_score | default |
|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 1.4612 | 4159.677034 | 8718.916797 | 70039.901401 | 4.5528 | 637.5577 | 0.1851 |
| std | 1.743846 | 1421.399078 | 6627.164762 | 20072.214143 | 1.566862 | 60.657906 | 0.388398 |
| min | 0.0 | 46.783973 | 31.652732 | 1000.0 | 0.0 | 408.0 | 0.0 |
| 25% | 0.0 | 3154.235371 | 4199.83602 | 56539.867903 | 3.0 | 597.0 | 0.0 |
| 50% | 1.0 | 4052.377228 | 6732.407217 | 70085.82633 | 5.0 | 638.0 | 0.0 |
| 75% | 2.0 | 5052.898103 | 11272.26374 | 83429.166133 | 6.0 | 679.0 | 0.0 |
| max | 5.0 | 10750.67781 | 43688.7841 | 148412.1805 | 10.0 | 850.0 | 1.0 |
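The `default` column has a mean of 0.1851 over 10,000 rows, so about 18.5% of the loans defaulted. The arithmetic sketch below spells out what that imbalance implies: a trivial rule that always predicts "no default" would already be right about 81% of the time while catching no defaulter at all. The variable names are illustrative.

```python
n_rows = 10_000          # count reported by describe()
default_rate = 0.1851    # mean of the binary `default` column

n_defaults = round(n_rows * default_rate)
n_non_defaults = n_rows - n_defaults

# Trivial baseline: always predict class 0 (no default)
baseline_accuracy = n_non_defaults / n_rows
baseline_recall = 0 / n_defaults  # no defaulter is ever flagged

print(n_defaults, n_non_defaults)
print(baseline_accuracy, baseline_recall)
```

This is one reason to look at recall, precision, and ROC AUC alongside accuracy when evaluating the classifiers.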


In [6]:
# Missing-value analysis
print("\nMissing values:")
print(df.isnull().sum())


Missing values:
customer_id                 0
credit_lines_outstanding    0
loan_amt_outstanding        0
total_debt_outstanding      0
income                      0
years_employed              0
fico_score                  0
default                     0
dtype: int64


In [7]:
# Check whether the DataFrame contains duplicate rows
nb_doublons = df.duplicated().sum()

if nb_doublons > 0:
    print(f"There are {nb_doublons} duplicated rows in the dataset.")
else:
    print("No duplicates detected in the dataset.")

No duplicates detected in the dataset.


### 2. Pre-processing

In [None]:
# Define the target variable
target = "default"
# X keeps the 7 remaining columns, including the customer_id identifier
X = df.drop(columns=[target])
y = df[target]

# Train/test split (stratified on the target)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"📊 X_train: {X_train.shape}, X_test: {X_test.shape}")

# Standardization (StandardScaler)
# Center and scale the features: (x - mean) / std
# The scaler is fitted on the training set only, then applied to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Standardization done (StandardScaler applied to X_train and X_test)")

📊 X_train: (8000, 7), X_test: (2000, 7)
Standardization done (StandardScaler applied to X_train and X_test)
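To make the `(x - mean) / std` step concrete, here is a minimal NumPy sketch of what fitting on the training set and transforming the test set amounts to. The synthetic income-like data and variable names are assumptions for illustration; `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# Synthetic, income-like feature (illustrative data, not the loan dataset)
rng = np.random.default_rng(42)
train = rng.normal(loc=70_000, scale=20_000, size=(8000, 1))
test = rng.normal(loc=70_000, scale=20_000, size=(2000, 1))

# "fit": statistics come from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0, as StandardScaler does

# "transform": the same statistics are applied to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(), train_scaled.std())  # essentially 0 and 1
print(test_scaled.mean(), test_scaled.std())    # close to 0 and 1, not exact
```

Reusing the training statistics on the test set keeps information about the test distribution out of the pre-processing step.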


### 3. Testing 3 ML models with MLflow

In [9]:
import mlflow 

from mlflow import MlflowClient
from pprint import pprint
from sklearn.ensemble import RandomForestRegressor

In [10]:
# Connect to the tracking server
client = MlflowClient(tracking_uri="http://127.0.0.1:8080")

### Logging our runs with MLflow

#### *ML 1 : Decision Tree Experiment*

In [11]:
import mlflow
import mlflow.sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
    confusion_matrix,
)
import numpy as np
import joblib

In [None]:
# ✅ Decision Tree: create OR retrieve the experiment without duplicating it
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://127.0.0.1:8080")

EXPERIMENT_NAME_DT = "DecisionTree_Experiment_ML1"
experiment_description_dt = (
    "Decision Tree model for loan default prediction. "
    "This experiment contains the runs for the Decision Tree model."
)
experiment_tags_dt = {
    "project_name": "loan-default-prediction",
    "model_type": "DecisionTree",
    "team": "mlops-bank",
    "project_quarter": "Q4-2025",
    "mlflow.note.content": experiment_description_dt,
}

client = MlflowClient()  # reuses the current tracking_uri
exp = client.get_experiment_by_name(EXPERIMENT_NAME_DT)

if exp is None:
    # first time only
    exp_id_dt = client.create_experiment(
        name=EXPERIMENT_NAME_DT,
        tags=experiment_tags_dt,
    )
else:
    exp_id_dt = exp.experiment_id
    # keep the experiment tags up to date
    for k, v in experiment_tags_dt.items():
        client.set_experiment_tag(exp_id_dt, k, v)

print("Decision Tree experiment id:", exp_id_dt)


Decision Tree experiment id: 1
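The same create-or-reuse logic is repeated for each model, so it could be factored into a small helper. This is only a sketch: the helper name `get_or_create_experiment` and the in-memory `FakeClient` used to exercise it are hypothetical. With a real tracking server one would pass an `MlflowClient` instance, relying on the same three methods used in the cell above (`get_experiment_by_name`, `create_experiment`, `set_experiment_tag`).

```python
from types import SimpleNamespace


def get_or_create_experiment(client, name, tags):
    """Return the experiment id for `name`, creating the experiment if needed."""
    exp = client.get_experiment_by_name(name)
    if exp is None:
        return client.create_experiment(name=name, tags=tags)
    # Experiment already exists: refresh its tags instead of duplicating it
    for key, value in tags.items():
        client.set_experiment_tag(exp.experiment_id, key, value)
    return exp.experiment_id


class FakeClient:
    """In-memory stand-in for MlflowClient (hypothetical, for illustration only)."""

    def __init__(self):
        self._experiments = {}

    def get_experiment_by_name(self, name):
        return self._experiments.get(name)

    def create_experiment(self, name, tags=None):
        exp_id = str(len(self._experiments) + 1)
        self._experiments[name] = SimpleNamespace(
            experiment_id=exp_id, name=name, tags=dict(tags or {})
        )
        return exp_id

    def set_experiment_tag(self, experiment_id, key, value):
        for exp in self._experiments.values():
            if exp.experiment_id == experiment_id:
                exp.tags[key] = value


fake_client = FakeClient()
first_id = get_or_create_experiment(
    fake_client, "DecisionTree_Experiment_ML1", {"team": "mlops-bank"}
)
second_id = get_or_create_experiment(
    fake_client, "DecisionTree_Experiment_ML1", {"team": "mlops-bank", "stage": "dev"}
)
print(first_id == second_id)  # the second call reuses the experiment
```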


In [None]:
# =========================
# Decision Tree 
# =========================

# Imports
import numpy as np, matplotlib.pyplot as plt, joblib
from datetime import datetime

import mlflow
from mlflow.models.signature import infer_signature
from mlflow.tracking import MlflowClient

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score,
    roc_auc_score, confusion_matrix, RocCurveDisplay
)

# ----------
# 0) MLflow wiring: use the existing ID, otherwise look the experiment up by name
# ----------
EXPERIMENT_NAME_DT = "DecisionTree_Experiment_ML1"

try:
    exp_id = exp_id_dt
except NameError:
    client = MlflowClient()
    exp = client.get_experiment_by_name(EXPERIMENT_NAME_DT)
    assert exp is not None, f"Experiment {EXPERIMENT_NAME_DT} not found. Run the 'create/retrieve experiment' cell first."
    exp_id = exp.experiment_id

# Explicit tag: "raw" if X_train exists (unscaled), otherwise "scaled"
features_tag = "raw" if "X_train" in globals() else "scaled"

# ----------
# 1) Data
# ----------
Xtr = X_train if "X_train" in globals() else X_train_scaled
Xte = X_test  if "X_test"  in globals() else X_test_scaled

# ----------
# 2) Pruning grid via cost_complexity_pruning_path
# ----------
tmp_tree = DecisionTreeClassifier(random_state=42)
path = tmp_tree.cost_complexity_pruning_path(Xtr, y_train)
candidate_alphas = path.ccp_alphas

if len(candidate_alphas) > 12:
    qs = np.linspace(0.05, 0.95, 10)          # 10 evenly spaced quantile levels
    ccp_grid = np.unique(np.quantile(candidate_alphas[:-1], qs))
else:
    ccp_grid = np.unique(candidate_alphas[:-1])

if ccp_grid.size == 0:
    ccp_grid = np.array([0.0, 1e-4, 1e-3, 1e-2])

# ----------
# 3) GridSearch + CV
# ----------
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

param_grid = {
    "criterion": ["gini", "entropy", "log_loss"],
    "max_depth": [3, 5, 7, 9, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [None, "sqrt", "log2"],
    "class_weight": [None, "balanced"],
    "ccp_alpha": list(ccp_grid),
    "random_state": [42],
}

scoring = {
    "roc_auc": "roc_auc",
    "f1": "f1",
    "precision": "precision",
    "recall": "recall",
    "accuracy": "accuracy",
}

grid = GridSearchCV(
    DecisionTreeClassifier(),
    param_grid=param_grid,
    scoring=scoring,
    refit="roc_auc",
    cv=cv,
    n_jobs=-1,
    verbose=0,
)

# ----------
# 4) Training + MLflow logging
# ----------
run_name = f"DecisionTree_CV_Prune_{datetime.now().strftime('%Y%m%d_%H%M%S')}"

with mlflow.start_run(experiment_id=exp_id, run_name=run_name):
    # Run tags
    mlflow.set_tags({
        "stage": "dev",
        "dataset": "Loan_Data.csv",
        "features": features_tag,
        "cv_folds": 5,
        "selector": "GridSearchCV(refit=roc_auc)",
        "model_type": "DecisionTree",
    })

    # Fit + selection
    grid.fit(Xtr, y_train)
    best_dt = grid.best_estimator_
    best_params = grid.best_params_

    # Test set
    y_pred = best_dt.predict(Xte)
    y_proba = best_dt.predict_proba(Xte)[:, 1] if hasattr(best_dt, "predict_proba") else None
    auc = roc_auc_score(y_test, y_proba) if y_proba is not None else float("nan")

    acc = accuracy_score(y_test, y_pred)
    f1  = f1_score(y_test, y_pred)
    pre = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)

    metrics = {
        "accuracy": acc,
        "precision": pre,
        "recall": rec,
        "f1_score": f1,
        "roc_auc": auc,
        "cv_best_score_roc_auc": float(grid.best_score_),
    }

    # Logs
    mlflow.log_params(best_params)
    mlflow.log_metrics(metrics)

    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    fig, ax = plt.subplots(figsize=(4,4))
    ax.imshow(cm, interpolation="nearest")
    ax.set_title("Confusion matrix (test)")
    ax.set_xlabel("Predicted"); ax.set_ylabel("True")
    for (i, j), v in np.ndenumerate(cm):
        ax.text(j, i, str(v), ha="center", va="center")
    plt.tight_layout()
    plt.savefig("confusion_matrix_dt.png", dpi=150); plt.close(fig)
    mlflow.log_artifact("confusion_matrix_dt.png")

    # ROC 
    if y_proba is not None:
        roc_fig, roc_ax = plt.subplots()
        RocCurveDisplay.from_predictions(y_test, y_proba, ax=roc_ax)
        roc_ax.set_title("ROC curve (test)")
        plt.tight_layout()
        plt.savefig("roc_curve_dt.png", dpi=150); plt.close(roc_fig)
        mlflow.log_artifact("roc_curve_dt.png")

    # Model + signature
    joblib.dump(best_dt, "model_decision_tree.pkl")
    mlflow.log_artifact("model_decision_tree.pkl")

    signature = infer_signature(Xtr[:50], best_dt.predict(Xtr[:50]))
    mlflow.sklearn.log_model(
        sk_model=best_dt,
        artifact_path="model",
        input_example=Xte[:5],
        signature=signature
    )

    print("✅ Best params:", best_params)
    print("📊 Test metrics:", {k: (round(v,4) if isinstance(v, float) else v) for k,v in metrics.items()})




✅ Best params: {'ccp_alpha': np.float64(0.004653836741748835), 'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': 5, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'random_state': 42}
📊 Test metrics: {'accuracy': 0.994, 'precision': 0.9735, 'recall': 0.9946, 'f1_score': 0.984, 'roc_auc': 0.9996, 'cv_best_score_roc_auc': 0.9992}
🏃 View run DecisionTree_CV_Prune_20251018_191822 at: http://127.0.0.1:8080/#/experiments/1/runs/455fc58d65d94888ba5d70d82cdb5530
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/1
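The pruning grid in the Decision Tree cell thins the candidate `ccp_alpha` values down to a handful of quantiles so the grid search stays tractable. The sketch below isolates that step as a function; the name `build_ccp_grid`, its parameters, and the example alphas are hypothetical. The last candidate alpha is dropped because it corresponds to the tree pruned down to a single node.

```python
import numpy as np


def build_ccp_grid(candidate_alphas, n_points=10, thin_above=12):
    """Reduce a cost-complexity pruning path to a small grid of ccp_alpha values."""
    alphas = np.asarray(candidate_alphas, dtype=float)
    trimmed = alphas[:-1]  # drop the largest alpha (root-only tree)
    if trimmed.size == 0:
        # Fallback grid when the path is too short to be useful
        return np.array([0.0, 1e-4, 1e-3, 1e-2])
    if alphas.size > thin_above:
        # Keep n_points values spread across the path by quantile
        levels = np.linspace(0.05, 0.95, n_points)
        return np.unique(np.quantile(trimmed, levels))
    return np.unique(trimmed)


# Hypothetical pruning path with 101 candidate alphas
example_alphas = np.linspace(0.0, 0.02, 101)
grid_values = build_ccp_grid(example_alphas)
print(len(grid_values), grid_values.min(), grid_values.max())
```

Sampling by quantile rather than on a linear scale follows the density of the pruning path, which usually has many small alphas and few large ones.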


#### *ML 2 : Logistic Regression*

In [None]:
# ===== MLflow: point to the server and make sure the LR experiment exists =====
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://127.0.0.1:8080")
print("Tracking URI:", mlflow.get_tracking_uri())

EXPERIMENT_NAME_LR = "LogisticRegression_Experiment_ml2"
experiment_tags_lr = {
    "project_name": "loan-default-prediction",
    "model_type": "LogisticRegression",
    "team": "mlops-bank",
    "project_quarter": "Q4-2025",
    "mlflow.note.content": (
        "Logistic Regression model for loan default prediction. "
        "This experiment contains the runs for the Logistic Regression model."
    ),
}

client = MlflowClient()
exp = client.get_experiment_by_name(EXPERIMENT_NAME_LR)
if exp is None:
    exp_id_lr = client.create_experiment(name=EXPERIMENT_NAME_LR, tags=experiment_tags_lr)
else:
    exp_id_lr = exp.experiment_id
    # keep the tags up to date
    for k, v in experiment_tags_lr.items():
        client.set_experiment_tag(exp_id_lr, k, v)

print("Logistic Regression experiment id:", exp_id_lr)


Tracking URI: http://127.0.0.1:8080
Logistic Regression experiment id: 3


In [15]:
# ===== Logistic Regression: training + logging =====
import numpy as np, pandas as pd, matplotlib.pyplot as plt, joblib
from datetime import datetime
from mlflow.models.signature import infer_signature

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score,
    roc_auc_score, confusion_matrix, RocCurveDisplay
)

# Data: scaling is handled inside the Pipeline
features_tag = "raw" if "X_train" in globals() else "scaled"
Xtr = X_train if "X_train" in globals() else X_train_scaled
Xte = X_test  if "X_test"  in globals() else X_test_scaled

# Pipeline LR
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, n_jobs=-1))
])

# Grille compatible penalty/solver (+ elasticnet)
param_grid = [
    {
        "clf__penalty": ["l2"],
        "clf__solver": ["lbfgs", "liblinear"],
        "clf__C": [0.01, 0.1, 1.0, 3.0, 10.0],
        "clf__class_weight": [None, "balanced"],
    },
    {
        "clf__penalty": ["l1"],
        "clf__solver": ["saga", "liblinear"],
        "clf__C": [0.01, 0.1, 1.0, 3.0, 10.0],
        "clf__class_weight": [None, "balanced"],
    },
    {
        "clf__penalty": ["elasticnet"],
        "clf__solver": ["saga"],
        "clf__l1_ratio": [0.2, 0.5, 0.8],
        "clf__C": [0.01, 0.1, 1.0, 3.0],
        "clf__class_weight": [None, "balanced"],
    },
]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = {
    "roc_auc": "roc_auc",
    "f1": "f1",
    "precision": "precision",
    "recall": "recall",
    "accuracy": "accuracy",
}

grid = GridSearchCV(
    pipe, param_grid=param_grid, scoring=scoring,
    refit="roc_auc", cv=cv, n_jobs=-1, verbose=0
)

run_name = f"LogReg_CV_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
with mlflow.start_run(experiment_id=exp_id_lr, run_name=run_name):
    # Run tags
    mlflow.set_tags({
        "stage": "dev",
        "dataset": "Loan_Data.csv",
        "features": features_tag,
        "cv_folds": 5,
        "selector": "GridSearchCV(refit=roc_auc)",
        "model_type": "LogisticRegression",
    })

    # Fit + selection
    grid.fit(Xtr, y_train)
    best_model = grid.best_estimator_
    best_params = grid.best_params_

    # Evaluation (threshold 0.5)
    y_proba = best_model.predict_proba(Xte)[:, 1]
    y_pred  = (y_proba >= 0.5).astype(int)

    metrics = {
        "roc_auc": roc_auc_score(y_test, y_proba),
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1_score": f1_score(y_test, y_pred),
        "cv_best_score_roc_auc": float(grid.best_score_),
    }

    mlflow.log_params(best_params)
    mlflow.log_metrics(metrics)

    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    fig, ax = plt.subplots(figsize=(4,4))
    ax.imshow(cm, interpolation="nearest")
    ax.set_title("Confusion matrix (test)")
    ax.set_xlabel("Predicted"); ax.set_ylabel("True")
    for (i, j), v in np.ndenumerate(cm):
        ax.text(j, i, str(v), ha="center", va="center")
    plt.tight_layout()
    plt.savefig("confusion_matrix_lr.png", dpi=150); plt.close(fig)
    mlflow.log_artifact("confusion_matrix_lr.png")

    # ROC curve
    roc_fig, roc_ax = plt.subplots()
    RocCurveDisplay.from_predictions(y_test, y_proba, ax=roc_ax)
    roc_ax.set_title("ROC curve (test)"); plt.tight_layout()
    plt.savefig("roc_curve_lr.png", dpi=150); plt.close(roc_fig)
    mlflow.log_artifact("roc_curve_lr.png")

    # Coefficients & odds ratios (interpretation)
    feature_names = list(X_train.columns) if hasattr(X_train, "columns") else [f"f{i}" for i in range(Xtr.shape[1])]
    lr = best_model.named_steps["clf"]
    coefs = pd.DataFrame({
        "feature": feature_names,
        "coef": lr.coef_.ravel(),
        "odds_ratio": np.exp(lr.coef_.ravel())
    }).sort_values("odds_ratio", ascending=False)
    coefs.to_csv("logreg_coefficients.csv", index=False)
    mlflow.log_artifact("logreg_coefficients.csv")

    top = coefs.head(15).iloc[::-1]
    plt.figure(figsize=(6,5))
    plt.barh(top["feature"], top["odds_ratio"])
    plt.title("Top Odds Ratios"); plt.tight_layout()
    plt.savefig("logreg_top_oddsratios.png", dpi=150); plt.close()
    mlflow.log_artifact("logreg_top_oddsratios.png")

    # Save + signature
    joblib.dump(best_model, "model_logreg.pkl")
    mlflow.log_artifact("model_logreg.pkl")

    # Infer the signature from predict(), which is what the logged model serves by default
    signature = infer_signature(Xtr[:50], best_model.predict(Xtr[:50]))
    mlflow.sklearn.log_model(
        sk_model=best_model,
        artifact_path="model",
        input_example=Xte[:5],
        signature=signature
    )

    print("✅ Best params (LR):", best_params)
    print("📊 Test metrics (LR):", {k: (round(v,4) if isinstance(v,float) else v) for k,v in metrics.items()})




✅ Best params (LR): {'clf__C': 10.0, 'clf__class_weight': None, 'clf__penalty': 'l1', 'clf__solver': 'liblinear'}
📊 Test metrics (LR): {'roc_auc': 1.0, 'accuracy': 0.999, 'precision': 0.9973, 'recall': 0.9973, 'f1_score': 0.9973, 'cv_best_score_roc_auc': 1.0}
🏃 View run LogReg_CV_20251018_192654 at: http://127.0.0.1:8080/#/experiments/3/runs/5d74b3dfaf37409791fa95d5f1cd2510
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/3
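The odds ratios logged for the logistic regression are `exp(coef)`. Because the pipeline standardizes the features first, each coefficient describes a one-standard-deviation increase in that feature. The sketch below checks this reading numerically; the coefficient and intercept values are hypothetical, not the fitted ones.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + np.exp(-z))


coef = 1.2        # hypothetical coefficient on one standardized feature
intercept = -2.0  # hypothetical intercept
odds_ratio = np.exp(coef)

# Probability of default with the feature at its mean (standardized value 0)
p_at_mean = sigmoid(intercept)
# Probability with the feature one standard deviation above its mean
p_plus_one_std = sigmoid(intercept + coef)

odds_at_mean = p_at_mean / (1.0 - p_at_mean)
odds_plus_one_std = p_plus_one_std / (1.0 - p_plus_one_std)

# The ratio of the two odds equals exp(coef)
print(odds_plus_one_std / odds_at_mean, odds_ratio)
```

An odds ratio above 1 therefore means the feature pushes the odds of default up, and below 1 means it pushes them down, holding the other features fixed.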


#### *ML 3 : Random Forest*

In [None]:
# ===== MLflow: point to the server and make sure the RF experiment exists =====
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://127.0.0.1:8080")
print("Tracking URI:", mlflow.get_tracking_uri())

EXPERIMENT_NAME_RF = "RandomForest_Experiment_ml3"
experiment_tags_rf = {
    "project_name": "loan-default-prediction",
    "model_type": "RandomForest",
    "team": "mlops-bank",
    "project_quarter": "Q4-2025",
    "mlflow.note.content": (
        "Random Forest model for loan default prediction. "
        "This experiment contains the runs for the Random Forest model."
    ),
}

client = MlflowClient()
exp = client.get_experiment_by_name(EXPERIMENT_NAME_RF)
if exp is None:
    exp_id_rf = client.create_experiment(name=EXPERIMENT_NAME_RF, tags=experiment_tags_rf)
else:
    exp_id_rf = exp.experiment_id
    # keep the tags up to date
    for k, v in experiment_tags_rf.items():
        client.set_experiment_tag(exp_id_rf, k, v)

print("Random Forest experiment id:", exp_id_rf)


Tracking URI: http://127.0.0.1:8080
Random Forest experiment id: 4


In [None]:
# ===== Random Forest: training + logging =====
import numpy as np, pandas as pd, matplotlib.pyplot as plt, joblib
from datetime import datetime
from mlflow.models.signature import infer_signature

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score,
    roc_auc_score, confusion_matrix, RocCurveDisplay
)

# Data
features_tag = "raw" if "X_train" in globals() else "scaled"
Xtr = X_train if "X_train" in globals() else X_train_scaled
Xte = X_test  if "X_test"  in globals() else X_test_scaled

# Hyperparameter grid
param_grid = {
    "n_estimators": [200, 400],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", None],
    "bootstrap": [True],
    "class_weight": [None, "balanced"],
    "random_state": [42],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = {
    "roc_auc": "roc_auc",
    "f1": "f1",
    "precision": "precision",
    "recall": "recall",
    "accuracy": "accuracy",
}

grid = GridSearchCV(
    RandomForestClassifier(n_jobs=-1),
    param_grid=param_grid,
    scoring=scoring,
    refit="roc_auc",
    cv=cv,
    n_jobs=-1,
    verbose=0,
)

run_name = f"RandomForest_CV_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
with mlflow.start_run(experiment_id=exp_id_rf, run_name=run_name):
    # Run tags
    mlflow.set_tags({
        "stage": "dev",
        "dataset": "Loan_Data.csv",
        "features": features_tag,
        "cv_folds": 5,
        "selector": "GridSearchCV(refit=roc_auc)",
        "model_type": "RandomForest",
    })

    # Fit + selection
    grid.fit(Xtr, y_train)
    best_model = grid.best_estimator_
    best_params = grid.best_params_

    # Evaluation (threshold 0.5)
    y_proba = best_model.predict_proba(Xte)[:, 1]
    y_pred  = (y_proba >= 0.5).astype(int)

    metrics = {
        "roc_auc": roc_auc_score(y_test, y_proba),
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1_score": f1_score(y_test, y_pred),
        "cv_best_score_roc_auc": float(grid.best_score_),
    }

    # Logs
    mlflow.log_params(best_params)
    mlflow.log_metrics(metrics)

    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    fig, ax = plt.subplots(figsize=(4,4))
    ax.imshow(cm, interpolation="nearest")
    ax.set_title("Confusion matrix (test)")
    ax.set_xlabel("Predicted"); ax.set_ylabel("True")
    for (i, j), v in np.ndenumerate(cm):
        ax.text(j, i, str(v), ha="center", va="center")
    plt.tight_layout()
    plt.savefig("confusion_matrix_rf.png", dpi=150); plt.close(fig)
    mlflow.log_artifact("confusion_matrix_rf.png")

    # ROC curve
    roc_fig, roc_ax = plt.subplots()
    RocCurveDisplay.from_predictions(y_test, y_proba, ax=roc_ax)
    roc_ax.set_title("ROC curve (test)"); plt.tight_layout()
    plt.savefig("roc_curve_rf.png", dpi=150); plt.close(roc_fig)
    mlflow.log_artifact("roc_curve_rf.png")

    # Feature importances (top 20)
    feature_names = list(X_train.columns) if hasattr(X_train, "columns") else [f"f{i}" for i in range(Xtr.shape[1])]
    imp = pd.DataFrame({
        "feature": feature_names,
        "importance": best_model.feature_importances_
    }).sort_values("importance", ascending=False)
    imp.to_csv("rf_feature_importances.csv", index=False)
    mlflow.log_artifact("rf_feature_importances.csv")

    top = imp.head(20).iloc[::-1]
    plt.figure(figsize=(7,6))
    plt.barh(top["feature"], top["importance"])
    plt.title("Random Forest — Top 20 importances"); plt.tight_layout()
    plt.savefig("rf_top_importances.png", dpi=150); plt.close()
    mlflow.log_artifact("rf_top_importances.png")

    # Save + signature
    joblib.dump(best_model, "model_random_forest.pkl")
    mlflow.log_artifact("model_random_forest.pkl")

    # infer_signature is already imported at the top of the cell.
    # Infer the signature from predict(), which is what the logged model serves by default
    signature = infer_signature(Xtr[:50], best_model.predict(Xtr[:50]))
    mlflow.sklearn.log_model(
        sk_model=best_model,
        artifact_path="model",
        input_example=Xte[:5],
        signature=signature
    )

    print("✅ Best params (RF):", best_params)
    print("📊 Test metrics (RF):", {k: (round(v,4) if isinstance(v,float) else v) for k,v in metrics.items()})




✅ Best params (RF): {'bootstrap': True, 'class_weight': None, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200, 'random_state': 42}
📊 Test metrics (RF): {'roc_auc': 0.9998, 'accuracy': 0.9965, 'precision': 0.9946, 'recall': 0.9865, 'f1_score': 0.9905, 'cv_best_score_roc_auc': 0.9998}
🏃 View run RandomForest_CV_20251018_194402 at: http://127.0.0.1:8080/#/experiments/4/runs/0de940efd4b946c08abd8c230635a447
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/4
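The test metrics printed by the three runs can be lined up for a quick comparison. The sketch below only re-enters the values reported in the outputs above; the dictionary layout and the choice to rank by `roc_auc` (the metric used for `refit` in each grid search) are illustrative.

```python
# Test-set metrics copied from the three run outputs
reported_metrics = {
    "DecisionTree": {
        "accuracy": 0.994, "precision": 0.9735, "recall": 0.9946,
        "f1_score": 0.984, "roc_auc": 0.9996,
    },
    "LogisticRegression": {
        "accuracy": 0.999, "precision": 0.9973, "recall": 0.9973,
        "f1_score": 0.9973, "roc_auc": 1.0,
    },
    "RandomForest": {
        "accuracy": 0.9965, "precision": 0.9946, "recall": 0.9865,
        "f1_score": 0.9905, "roc_auc": 0.9998,
    },
}

# Rank models by test ROC AUC, highest first
ranking = sorted(
    reported_metrics,
    key=lambda name: reported_metrics[name]["roc_auc"],
    reverse=True,
)

for name in ranking:
    m = reported_metrics[name]
    print(f"{name:<20} roc_auc={m['roc_auc']:.4f} recall={m['recall']:.4f} f1={m['f1_score']:.4f}")
```

On these reported figures the logistic regression ranks first on ROC AUC, recall, and F1, with the random forest and the decision tree close behind.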
