# Project

## Project overview

You have joined a new team in retail banking that is currently seeing higher-than-expected default rates on personal loans. Personal loans are an important source of revenue for banks, but they carry the inherent risk that borrowers may default. A default occurs when a borrower stops making the required payments on a debt.

## Objective

The risk team is analyzing the existing loan portfolio to forecast potential future defaults and estimate the expected loss. The main objective is to build a predictive model that estimates the probability of default for each customer based on their characteristics. Accurate predictions will allow the bank to set aside enough capital to cover potential losses and so maintain financial stability.
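
The expected loss mentioned above is commonly estimated per loan as probability of default × exposure × (1 − recovery rate). The sketch below is a minimal illustration under that assumption; the function name `expected_loss` and the 10% default recovery rate are hypothetical and do not come from the project brief.

```python
def expected_loss(prob_default: float, exposure: float, recovery_rate: float = 0.10) -> float:
    """Expected loss on one loan = PD * exposure * (1 - recovery rate).

    The 10% default recovery rate is an illustrative assumption.
    """
    if not 0.0 <= prob_default <= 1.0:
        raise ValueError("prob_default must be in [0, 1]")
    if not 0.0 <= recovery_rate <= 1.0:
        raise ValueError("recovery_rate must be in [0, 1]")
    return prob_default * exposure * (1.0 - recovery_rate)


# Example: a loan with 5,000 outstanding and a predicted 20% chance of default.
loss = expected_loss(prob_default=0.20, exposure=5000.0)
print(f"Expected loss: {loss:.2f}")
```

Summing this quantity over all loans, with `prob_default` taken from the model's predicted probabilities, gives a portfolio-level estimate.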

### 1. Dataset exploration

In [22]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [23]:
# Load the CSV file
fichier = "Loan_Data.csv"
df = pd.read_csv(fichier)

In [24]:

df.head(10)

| | customer_id | credit_lines_outstanding | loan_amt_outstanding | total_debt_outstanding | income | years_employed | fico_score | default |
|---|---|---|---|---|---|---|---|---|
| 0 | 8153374 | 0 | 5221.545193 | 3915.471226 | 78039.38546 | 5 | 605 | 0 |
| 1 | 7442532 | 5 | 1958.928726 | 8228.75252 | 26648.43525 | 2 | 572 | 1 |
| 2 | 2256073 | 0 | 3363.009259 | 2027.83085 | 65866.71246 | 4 | 602 | 0 |
| 3 | 4885975 | 0 | 4766.648001 | 2501.730397 | 74356.88347 | 5 | 612 | 0 |
| 4 | 4700614 | 1 | 1345.827718 | 1768.826187 | 23448.32631 | 6 | 631 | 0 |
| 5 | 4661159 | 0 | 5376.886873 | 7189.121298 | 85529.84591 | 2 | 697 | 0 |
| 6 | 8291909 | 1 | 3634.057471 | 7085.980095 | 68691.57707 | 6 | 722 | 0 |
| 7 | 4616950 | 4 | 3302.172238 | 13067.57021 | 50352.16821 | 3 | 545 | 1 |
| 8 | 3395789 | 0 | 2938.325123 | 1918.404472 | 53497.37754 | 4 | 676 | 0 |
| 9 | 4045948 | 0 | 5396.366774 | 5298.824524 | 92349.55399 | 2 | 447 | 0 |


In [25]:

print(df.dtypes)

customer_id                   int64
credit_lines_outstanding      int64
loan_amt_outstanding        float64
total_debt_outstanding      float64
income                      float64
years_employed                int64
fico_score                    int64
default                       int64
dtype: object


In [26]:
# Descriptive statistics (excluding 'customer_id')

print("\n Descriptive statistics:")
df.describe().drop(columns=['customer_id'])


 Descriptive statistics:


| | credit_lines_outstanding | loan_amt_outstanding | total_debt_outstanding | income | years_employed | fico_score | default |
|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 1.4612 | 4159.677034 | 8718.916797 | 70039.901401 | 4.5528 | 637.5577 | 0.1851 |
| std | 1.743846 | 1421.399078 | 6627.164762 | 20072.214143 | 1.566862 | 60.657906 | 0.388398 |
| min | 0.0 | 46.783973 | 31.652732 | 1000.0 | 0.0 | 408.0 | 0.0 |
| 25% | 0.0 | 3154.235371 | 4199.83602 | 56539.867903 | 3.0 | 597.0 | 0.0 |
| 50% | 1.0 | 4052.377228 | 6732.407217 | 70085.82633 | 5.0 | 638.0 | 0.0 |
| 75% | 2.0 | 5052.898103 | 11272.26374 | 83429.166133 | 6.0 | 679.0 | 0.0 |
| max | 5.0 | 10750.67781 | 43688.7841 | 148412.1805 | 10.0 | 850.0 | 1.0 |


In [27]:
# Missing-value analysis
print("\n Missing values:")
print(df.isnull().sum())


 Missing values:
customer_id                 0
credit_lines_outstanding    0
loan_amt_outstanding        0
total_debt_outstanding      0
income                      0
years_employed              0
fico_score                  0
default                     0
dtype: int64


In [28]:
# Check whether the DataFrame contains duplicate rows
nb_doublons = df.duplicated().sum()

if nb_doublons > 0:
    print(f"There are {nb_doublons} duplicated rows in the dataset.")
else:
    print(" No duplicates detected in the dataset.")

 No duplicates detected in the dataset.
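
The `default` mean of 0.1851 in the descriptive statistics means roughly 18.5% of loans defaulted, so the classes are imbalanced. A quick look at the class shares helps decide whether stratified splits and metrics beyond accuracy are needed. The sketch below uses a small hypothetical frame rather than `Loan_Data.csv`, and the helper name `class_shares` is an assumption:

```python
import pandas as pd


def class_shares(frame: pd.DataFrame, target: str = "default") -> pd.Series:
    """Return the share of each class in the target column, indexed by class label."""
    return frame[target].value_counts(normalize=True).sort_index()


# Toy data standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({"default": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]})
shares = class_shares(toy)
print(shares)
```

On the real data, `class_shares(df)` would report the proportion of defaulted and non-defaulted loans.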


### 2. Preprocessing

In [29]:
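# Define the target variable
target = "default"  # 🔁 adjust here if your column has a different name (e.g. "loan_status")
# Note: customer_id stays in X here (hence 7 feature columns); it is an identifier, not a borrower characteristic
X = df.drop(columns=[target])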
y = df[target]

# Split Train/Test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"📊 X_train: {X_train.shape}, X_test: {X_test.shape}")

# Scaling (StandardScaler)
# Standardize the features to zero mean and unit variance: (x - mean)/std
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Scaling done (StandardScaler applied to X_train and X_test)")

📊 X_train: (8000, 7), X_test: (2000, 7)
Scaling done (StandardScaler applied to X_train and X_test)
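
To make the standardization step concrete, the sketch below checks on synthetic data that `StandardScaler` computes `(x - mean) / std` from the training set's mean and population standard deviation (`ddof=0`), and reuses those same statistics on the test set. The arrays are made up for illustration and are not the loan data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X_train / X_test (illustrative values only).
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# Fit on the training data only, so no test-set information leaks into the scaler.
demo_scaler = StandardScaler().fit(train)

# Manual version of the same transform, using the training statistics.
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # NumPy's default ddof=0 matches StandardScaler
manual_test = (test - mu) / sigma

scaled_test = demo_scaler.transform(test)
print(np.allclose(scaled_test, manual_test))
```

Because the test set is scaled with the training statistics, its columns will generally not have exactly zero mean or unit variance.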


### 3. Testing 3 ML models with MLflow

In [30]:
import mlflow 

from mlflow import MlflowClient
from pprint import pprint
from sklearn.ensemble import RandomForestRegressor

In [None]:
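# Connect to the tracking server
TRACKING_URI = "http://127.0.0.1:8080"
mlflow.set_tracking_uri(TRACKING_URI)  # so mlflow.start_run() logs to the same server as the client
client = MlflowClient(tracking_uri=TRACKING_URI)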

### Logging our runs with MLflow

#### *ML 1 : Decision Tree Experiment*

In [34]:
import mlflow
import mlflow.sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
    confusion_matrix,
)
import numpy as np
import joblib

In [38]:
experiment_description_dt = (
    "Decision Tree model for loan default prediction. "
    "This experiment contains the runs for the Decision Tree model."
)

experiment_tags_dt = {
    "project_name": "loan-default-prediction",
    "model_type": "DecisionTree",
    "team": "mlops-bank",
    "project_quarter": "Q4-2025",
    "mlflow.note.content": experiment_description_dt,
}

experiment_ML_decision_tree = client.create_experiment(
    name="DecisionTree_Experiment_ML1",
    tags=experiment_tags_dt,
)
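
`create_experiment` raises an error when an experiment with the same name already exists, so re-running the cell above fails. One way around this is a small get-or-create helper, sketched here under the assumption that the client exposes `get_experiment_by_name` and `create_experiment` as `MlflowClient` does. The helper name `get_or_create_experiment` and the in-memory `FakeClient` used to exercise it without a tracking server are hypothetical:

```python
from types import SimpleNamespace


def get_or_create_experiment(client, name, tags=None):
    """Return the ID of the experiment called `name`, creating it only if missing."""
    existing = client.get_experiment_by_name(name)
    if existing is not None:
        return existing.experiment_id
    return client.create_experiment(name=name, tags=tags or {})


class FakeClient:
    """In-memory stand-in for MlflowClient, for illustration only."""

    def __init__(self):
        self._experiments = {}

    def get_experiment_by_name(self, name):
        return self._experiments.get(name)

    def create_experiment(self, name, tags=None):
        exp_id = str(len(self._experiments) + 1)
        self._experiments[name] = SimpleNamespace(experiment_id=exp_id, name=name, tags=tags)
        return exp_id


fake = FakeClient()
first = get_or_create_experiment(fake, "DecisionTree_Experiment_ML1")
second = get_or_create_experiment(fake, "DecisionTree_Experiment_ML1")
print(first == second)  # the second call reuses the existing experiment
```

With a real tracking server, the same helper could be called with the notebook's `client` and `experiment_tags_dt`.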

In [40]:
import os, numpy as np, matplotlib.pyplot as plt, joblib
from datetime import datetime

import mlflow
from mlflow.models.signature import infer_signature
from mlflow.tracking import MlflowClient

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score,
    roc_auc_score, confusion_matrix, RocCurveDisplay
)

# ==========
# 0) MLflow wiring -> reuse the experiment created above
# ==========
EXPERIMENT_NAME = "DecisionTree_Experiment_ML1"

try:
    exp_id = experiment_ML_decision_tree  # already created in the cell above
except NameError:
    # if the variable is missing (e.g. after a kernel restart), look the ID up by name
    client = MlflowClient(tracking_uri="http://127.0.0.1:8080")  # same server as above
    exp = client.get_experiment_by_name(EXPERIMENT_NAME)
    if exp is None:
        # safety net: if it does not exist (e.g. after a cleanup), create it with minimal tags
        exp_id = client.create_experiment(
            name=EXPERIMENT_NAME,
            tags={
                "project_name": "loan-default-prediction",
                "model_type": "DecisionTree",
                "team": "mlops-bank",
                "project_quarter": "Q4-2025",
                "mlflow.note.content": "Experiment recreated automatically."
            },
        )
    else:
        exp_id = exp.experiment_id

# ==========
# 1) Data
# ==========
# Trees do not need scaling; prefer the raw X_train/X_test when available
Xtr = X_train if "X_train" in globals() else X_train_scaled
Xte = X_test  if "X_test"  in globals() else X_test_scaled

# ==========
# 2) Pruning grid via cost_complexity_pruning_path
# ==========
tmp_tree = DecisionTreeClassifier(random_state=42)
path = tmp_tree.cost_complexity_pruning_path(Xtr, y_train)
candidate_alphas = path.ccp_alphas

if len(candidate_alphas) > 12:
    qs = np.linspace(0.05, 0.95, 10)  # 10 evenly spaced quantile levels
    ccp_grid = np.unique(np.quantile(candidate_alphas[:-1], qs))
else:
    ccp_grid = np.unique(candidate_alphas[:-1])

if ccp_grid.size == 0:
    ccp_grid = np.array([0.0, 1e-4, 1e-3, 1e-2])

# ==========
# 3) GridSearch + CV
# ==========
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

param_grid = {
    "criterion": ["gini", "entropy", "log_loss"],
    "max_depth": [3, 5, 7, 9, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [None, "sqrt", "log2"],
    "class_weight": [None, "balanced"],
    "ccp_alpha": list(ccp_grid),
    "random_state": [42],
}

scoring = {
    "roc_auc": "roc_auc",
    "f1": "f1",
    "precision": "precision",
    "recall": "recall",
    "accuracy": "accuracy",
}

grid = GridSearchCV(
    DecisionTreeClassifier(),
    param_grid=param_grid,
    scoring=scoring,
    refit="roc_auc",
    cv=cv,
    n_jobs=-1,
    verbose=0,
)

# ==========
# 4) MLflow run → attached to the experiment (via experiment_id)
# ==========
run_name = f"DecisionTree_CV_Prune_{datetime.now().strftime('%Y%m%d_%H%M%S')}"

with mlflow.start_run(experiment_id=exp_id, run_name=run_name) as run:
    # Run tags (in addition to the experiment tags)
    mlflow.set_tags({
        "stage": "dev",
        "dataset": "Loan_Data.csv",
        "features_scaled": str("X_train" not in globals()),  # True if the *_scaled arrays were used
        "cv_folds": 5,
        "selector": "GridSearchCV(refit=roc_auc)",
    })

    # Fit + model selection
    grid.fit(Xtr, y_train)
    best_dt = grid.best_estimator_
    best_params = grid.best_params_

    # Evaluation on the test set
    y_pred = best_dt.predict(Xte)
    if hasattr(best_dt, "predict_proba"):
        y_proba = best_dt.predict_proba(Xte)[:, 1]
        auc = roc_auc_score(y_test, y_proba)
    else:
        y_proba = None
        auc = float("nan")

    acc = accuracy_score(y_test, y_pred)
    f1  = f1_score(y_test, y_pred)
    pre = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)

    metrics = {
        "accuracy": acc,
        "precision": pre,
        "recall": rec,
        "f1_score": f1,
        "roc_auc": auc,
        "cv_best_score_roc_auc": grid.best_score_,
    }

    # Log MLflow
    mlflow.log_params(best_params)
    mlflow.log_metrics(metrics)

    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    fig, ax = plt.subplots(figsize=(4,4))
    im = ax.imshow(cm, interpolation="nearest")
    ax.set_title("Confusion matrix (test)")
    ax.set_xlabel("Predicted"); ax.set_ylabel("True")
    for (i, j), v in np.ndenumerate(cm):
        ax.text(j, i, str(v), ha="center", va="center")
    plt.tight_layout()
    cm_path = "confusion_matrix_dt.png"
    fig.savefig(cm_path, dpi=150); plt.close(fig)
    mlflow.log_artifact(cm_path)

    # ROC curve, if probabilities are available
    if y_proba is not None:
        roc_fig, roc_ax = plt.subplots()
        RocCurveDisplay.from_predictions(y_test, y_proba, ax=roc_ax)
        roc_ax.set_title("ROC curve (test)")
        roc_path = "roc_curve_dt.png"
        roc_fig.savefig(roc_path, dpi=150); plt.close(roc_fig)
        mlflow.log_artifact(roc_path)

    # Model + signature
    joblib.dump(best_dt, "model_decision_tree.pkl")
    mlflow.log_artifact("model_decision_tree.pkl")

    signature = infer_signature(Xtr[:50], best_dt.predict(Xtr[:50]))
    mlflow.sklearn.log_model(
        sk_model=best_dt,
        artifact_path="model",
        input_example=Xte[:5],
        signature=signature
    )

    print("✅ Best params:", best_params)
    print("📊 Test metrics:", {k: (round(v,4) if isinstance(v, float) else v) for k,v in metrics.items()})




✅ Best params: {'ccp_alpha': np.float64(0.004653836741748835), 'class_weight': 'balanced', 'criterion': 'entropy', 'max_depth': 5, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'random_state': 42}
📊 Test metrics: {'accuracy': 0.994, 'precision': 0.9735, 'recall': 0.9946, 'f1_score': 0.984, 'roc_auc': 0.9996, 'cv_best_score_roc_auc': np.float64(0.9992)}
🏃 View run DecisionTree_CV_Prune_20251018_025041 at: http://127.0.0.1:8080/#/experiments/1/runs/f6b808793b1642c1b34c13dc2885589f
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/1
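
The pruning grid in step 2 relies on `cost_complexity_pruning_path`, which returns the effective alphas at which subtrees are pruned, in increasing order. The last alpha corresponds to the trivial single-node tree, which is why it is dropped before building the grid. The sketch below reproduces the quantile subsampling on synthetic data from `make_classification` (not the loan dataset); the variable names are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, for illustration only.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Effective alphas along the cost-complexity pruning path of a fully grown tree.
demo_path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_demo, y_demo)
alphas = demo_path.ccp_alphas

# Drop the last alpha (single-node tree), then keep at most 10 quantile-spaced values.
candidates = alphas[:-1]
if candidates.size > 10:
    grid_alphas = np.unique(np.quantile(candidates, np.linspace(0.05, 0.95, 10)))
else:
    grid_alphas = np.unique(candidates)

print(f"{alphas.size} alphas on the path, {grid_alphas.size} kept for the grid")
```

Sampling by quantile keeps the grid small while still covering both the weakly and the strongly pruned ends of the path.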
