# Loan Default Prediction Pipeline with MLflow

This notebook builds a machine learning pipeline to predict loan default from the **lending_data.csv** dataset and logs the artifacts, metrics, and parameters of each run to **MLflow**.

## 1. Environment setup
Make sure `mlflow`, `pandas`, and `scikit-learn` are installed in the active environment. If you are using this project's *conda.yaml*, all dependencies are already included.
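
A quick way to confirm the environment before running the cells is to check that each package can be found by the interpreter. The sketch below uses only the standard library; the helper name `missing_packages` is illustrative and not part of this project. Note that scikit-learn is imported under the module name `sklearn`.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the module names that cannot be found in the active environment."""
    return [name for name in module_names if find_spec(name) is None]


# scikit-learn installs under the import name "sklearn".
required = ["mlflow", "pandas", "sklearn"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If anything is reported as missing, install it (or recreate the environment from *conda.yaml*) before continuing.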

In [4]:

import mlflow
import mlflow.sklearn
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

mlflow.set_experiment("loan_default_experiment")
#mlflow.autolog()  # uncomment to enable auto-logging (logs metrics, parameters, and the model automatically)


## 2. Load the dataset

In [5]:

# Adjust the path if necessary
DATA_PATH = "../data/lending_data.csv"
df = pd.read_csv(DATA_PATH)

TARGET = "default.payment.next.month"
ID_COL = "ID"

X = df.drop(columns=[TARGET, ID_COL])
y = df[TARGET]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train size: {X_train.shape}, test: {X_test.shape}")


Train size: (24000, 23), test: (6000, 23)
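
The `stratify=y` argument asks `train_test_split` to keep the share of each class roughly equal in the train and test sets, which matters when defaults are the minority class. The following is a minimal standard-library sketch of that idea on hypothetical toy labels; it is an illustration only, not scikit-learn's implementation, and `stratified_split` is a made-up helper name.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with 20% positives (hypothetical data, not the lending dataset).
labels = [1] * 20 + [0] * 80
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"Positive rate: train={train_rate:.2f}, test={test_rate:.2f}")
```

On this toy input both rates come out at 0.20, because the split is made within each class before the parts are combined.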


## 3. Define the Pipeline and hyperparameter space
Here we use a *Pipeline* made up of `StandardScaler` and `LogisticRegression`, tuned with `GridSearchCV`. The `liblinear` solver is used because it supports both the `l1` and `l2` penalties included in the grid.
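
Keeping the scaler inside the *Pipeline* means that, during cross-validation, its statistics are learned from each training fold only and then reused on the held-out fold, so no information leaks from validation data. The sketch below illustrates that fit/transform separation in plain Python on a hypothetical single feature; `fit_scaler` and `transform` are illustrative names, not scikit-learn functions.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return mean(values), pstdev(values)


def transform(values, center, scale):
    """Apply z = (x - center) / scale using previously learned statistics."""
    return [(x - center) / scale for x in values]


# Hypothetical feature values, split into a training fold and a validation fold.
train_fold = [10.0, 20.0, 30.0, 40.0]
valid_fold = [25.0, 100.0]

# Statistics come from the training fold only.
center, scale = fit_scaler(train_fold)
train_scaled = transform(train_fold, center, scale)
# The validation fold reuses the training statistics.
valid_scaled = transform(valid_fold, center, scale)

print(train_scaled)
print(valid_scaled)
```

The scaled training fold has mean 0 and unit standard deviation, while the validation fold generally does not, which is the expected behaviour.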

In [6]:

pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(max_iter=1000, solver="liblinear")),
    ]
)

param_grid = {
    "classifier__C": [0.01, 0.1, 1.0, 10.0],
    "classifier__penalty": ["l1", "l2"],
}

grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1,
    verbose=2,
)
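
To see how much work the search involves, the grid can be expanded by hand. This sketch uses `itertools.product` rather than scikit-learn's own utilities, and repeats the grid and fold count from the cell above so that it runs on its own.

```python
from itertools import product

# Same grid and fold count as in the cell above.
param_grid = {
    "classifier__C": [0.01, 0.1, 1.0, 10.0],
    "classifier__penalty": ["l1", "l2"],
}
cv_folds = 5

# Every combination of one value per hyperparameter is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

total_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

With 4 values of `C` and 2 penalties there are 8 candidates, so 5-fold cross-validation performs 40 fits. Because `refit=True` is the default in `GridSearchCV`, the best candidate is then fitted once more on the full training set.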


## 4. Training, evaluation, and logging to MLflow

In [7]:

with mlflow.start_run(run_name="log_reg_pipeline") as run:
    # Train the model (runs the full grid search)
    grid_search.fit(X_train, y_train)

    # Best estimator and its hyperparameters
    best_model = grid_search.best_estimator_
    best_params = grid_search.best_params_

    # Evaluate on the test set
    y_pred = best_model.predict(X_test)
    y_proba = best_model.predict_proba(X_test)[:, 1]

    acc = accuracy_score(y_test, y_pred)
    roc = roc_auc_score(y_test, y_proba)

    # Log metrics and the best hyperparameters explicitly (autolog is commented out above)
    mlflow.log_metric("test_accuracy", acc)
    mlflow.log_metric("test_roc_auc", roc)
    mlflow.log_params(best_params)

    # Log the dataset as an artifact
    mlflow.log_artifact(DATA_PATH, artifact_path="dataset")

    # Log the trained model
    mlflow.sklearn.log_model(best_model, artifact_path="model")

    print(f"Run ID: {run.info.run_id}")
    print("Best hyperparameters:", best_params)
    print(f"Accuracy  (test): {acc:.4f}")
    print(f"ROC AUC   (test): {roc:.4f}")


Fitting 5 folds for each of 8 candidates, totalling 40 fits




[CV] END .........classifier__C=0.01, classifier__penalty=l1; total time=   0.2s
[CV] END .........classifier__C=0.01, classifier__penalty=l1; total time=   0.5s
[CV] END .........classifier__C=0.01, classifier__penalty=l1; total time=   0.8s
[CV] END .........classifier__C=0.01, classifier__penalty=l1; total time=   0.2s
[CV] END .........classifier__C=0.01, classifier__penalty=l1; total time=   0.6s
[CV] END .........classifier__C=0.01, classifier__penalty=l2; total time=   0.3s
[CV] END .........classifier__C=0.01, classifier__penalty=l2; total time=   0.3s
[CV] END .........classifier__C=0.01, classifier__penalty=l2; total time=   0.3s
[CV] END .........classifier__C=0.01, classifier__penalty=l2; total time=   0.2s
[CV] END .........classifier__C=0.01, classifier__penalty=l2; total time=   0.2s
[CV] END ..........classifier__C=0.1, classifier__penalty=l1; total time=   0.2s
[CV] END ..........classifier__C=0.1, classifier__penalty=l1; total time=   0.3s

[CV] END ..........classifie
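
The run computes ROC AUC from `predict_proba` scores rather than from hard predictions, because the metric measures ranking quality: the chance that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The following sketch computes that quantity directly from all positive/negative pairs on hypothetical toy values. It is quadratic in the number of samples and meant only as an illustration; `pairwise_roc_auc` is a made-up helper, not the implementation behind `roc_auc_score`.

```python
def pairwise_roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("Both classes must be present to compute ROC AUC.")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_roc_auc(y_true, y_score))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly here, which gives 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.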



## 5. Explore the results in the UI
Run in the terminal:

```bash
mlflow ui
```

and open [http://localhost:5000](http://localhost:5000) in your browser to compare *runs* and artifacts.