# 04 – Best Model via AUC

This notebook trains several classification models on the provided loan dataset and
selects the best one by mean cross-validated Area Under the ROC Curve (AUC).
The training CSV is used for model selection and for the final fit; the test CSV is
used only to generate predictions from the selected model.
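
As a reminder of what the selection metric measures, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks that reading against `sklearn.metrics.roc_auc_score` on four invented data points; the helper `pairwise_auc` and the toy arrays are assumptions made for illustration and are not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, invented purely for illustration.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])


def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # diff[i, j] > 0 means positive i is scored above negative j.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


auc_manual = pairwise_auc(y_true, y_score)
auc_sklearn = float(roc_auc_score(y_true, y_score))
print(auc_manual, auc_sklearn)
```

Because AUC depends only on the ranking of scores, it is unaffected by the choice of a classification threshold, which makes it convenient for comparing models that output differently calibrated probabilities.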

In [None]:
import sys
import os
sys.path.append(os.path.abspath(".."))
import pandas as pd
import numpy as np
from pathlib import Path

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

DATA_DIR = Path('../data')
train_path = DATA_DIR / 'train.csv'
test_path = DATA_DIR / 'test.csv'

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

train_df.head()

The training data includes the `loan_paid_back` target column that we want to model.
The test data shares the same feature columns (minus the target), which we will use
for generating predictions once the best model is selected.

In [None]:
print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")
print(train_df['loan_paid_back'].value_counts(normalize=True))

## Feature engineering and preprocessing

We separate the target (`loan_paid_back`) from the predictor columns. Numerical
features will be imputed with the median and scaled, while categorical features will
be imputed with the most frequent value and one-hot encoded. This combined preprocessing
pipeline keeps the feature transformations consistent across any model we evaluate.

In [None]:
target_col = 'loan_paid_back'
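# Drop the target and the row identifier so `id` is not treated as a numeric feature.
X = train_df.drop(columns=[target_col, 'id'], errors='ignore')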
y = train_df[target_col]

numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
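
To see what this kind of preprocessor produces, the following minimal sketch rebuilds the same two-branch structure on a tiny invented frame. The column names `income` and `grade`, the toy values, and the `to_dense` helper are hypothetical and do not come from the loan dataset. It illustrates two behaviours relied on later: missing values are imputed before scaling or encoding, and a category unseen during fitting is encoded as all zeros because of `handle_unknown='ignore'`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; these columns are not from the loan dataset.
toy_train = pd.DataFrame({
    'income': [40.0, 60.0, np.nan, 80.0],
    'grade': pd.Series(['A', 'B', 'A', np.nan], dtype=object),
})
toy_test = pd.DataFrame({
    'income': [50.0],
    'grade': pd.Series(['C'], dtype=object),  # 'C' never appears in toy_train
})

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), ['income']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ]), ['grade']),
])


def to_dense(matrix):
    """Return a NumPy array whether the transformer output is sparse or dense."""
    return matrix.toarray() if hasattr(matrix, 'toarray') else np.asarray(matrix)


train_matrix = to_dense(toy_preprocessor.fit_transform(toy_train))
test_matrix = to_dense(toy_preprocessor.transform(toy_test))

print(train_matrix.shape)  # one scaled numeric column plus one column per category
print(test_matrix)         # the one-hot columns for the unseen 'C' are all zero
```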

## Model selection using AUC

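We benchmark three commonly used classification algorithms, each wrapped in a pipeline
that applies the preprocessing steps defined above. Stratified 5-fold cross-validation
keeps the class ratio of `loan_paid_back` roughly constant in every fold and yields five
out-of-fold AUC scores per model. Because the preprocessor sits inside the pipeline, it is
refit on each training fold, so no statistics leak from the validation fold. The model with
the highest mean AUC across folds is selected as the best.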

In [None]:
models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'RandomForest': RandomForestClassifier(
        n_estimators=300,
        max_depth=None,
        min_samples_leaf=2,
        random_state=42
    ),
    'GradientBoosting': GradientBoostingClassifier(random_state=42)
}

results = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])
    scores = cross_val_score(pipeline, X, y, cv=skf, scoring='roc_auc')
    results.append({
        'model': name,
        'mean_auc': scores.mean(),
        'std_auc': scores.std(),
    })

results_df = pd.DataFrame(results).sort_values('mean_auc', ascending=False).reset_index(drop=True)
results_df
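
The stratification used above can be illustrated in isolation. This is a small sketch on synthetic labels (80% positive, invented for the example, not the loan data) showing that each validation fold produced by `StratifiedKFold` keeps the overall class ratio; the names `y_toy`, `X_toy`, and `skf_toy` are placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels (80 positives, 20 negatives), for illustration only.
y_toy = np.array([1] * 80 + [0] * 20)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect how the folds are formed

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Share of positives in each validation fold.
fold_positive_rates = [
    float(y_toy[val_idx].mean()) for _, val_idx in skf_toy.split(X_toy, y_toy)
]
print(fold_positive_rates)
```

With a plain, unstratified split, a fold could by chance contain very few examples of the minority class, which makes the per-fold AUC noisier.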

In [None]:
best_model_name = results_df.loc[0, 'model']
best_model = models[best_model_name]
print(f"Best model based on CV AUC: {best_model_name}")

## Fit the best model on the full training data and score the test set

We now refit the pipeline containing the best-performing model on all available training data.
The resulting estimator is then used to generate the probability of `loan_paid_back = 1`
for every row in the test data. These probabilities can be used for downstream evaluation
or submission files.

In [None]:
best_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', best_model)])
best_pipeline.fit(X, y)

test_probabilities = best_pipeline.predict_proba(test_df)[:, 1]
submission = pd.DataFrame({
    'id': test_df['id'],
    'loan_paid_back': test_probabilities
})
submission.head()
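
Before writing the file, it may be worth running a few sanity checks on the predictions. The helper below is a hypothetical sketch, not part of the original notebook: `validate_submission` is an invented name, and the expected column layout (`id`, `loan_paid_back`) is taken from the cell above. It is demonstrated on toy data standing in for `submission` and `test_df['id']`.

```python
import pandas as pd


def validate_submission(submission: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Raise ValueError if the submission frame looks malformed (hypothetical helper)."""
    if list(submission.columns) != ['id', 'loan_paid_back']:
        raise ValueError(f"Unexpected columns: {list(submission.columns)}")
    if len(submission) != len(expected_ids):
        raise ValueError("Row count does not match the test set")
    if submission['id'].duplicated().any():
        raise ValueError("Duplicate ids found")
    probs = submission['loan_paid_back']
    if probs.isna().any() or not probs.between(0.0, 1.0).all():
        raise ValueError("Probabilities must be non-missing and within [0, 1]")


# Toy frame standing in for the real `submission`.
toy_submission = pd.DataFrame({'id': [1, 2, 3], 'loan_paid_back': [0.9, 0.2, 0.55]})
validate_submission(toy_submission, toy_submission['id'])  # passes silently
```

In the notebook itself, the equivalent call would be `validate_submission(submission, test_df['id'])`.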

In [None]:
# Write next to the input CSVs; DATA_DIR is defined in the first cell.
output_path = DATA_DIR / 'best_model_submission.csv'
submission.to_csv(output_path, index=False)
print(f'Saved predictions to {output_path}')