# Credit Card Default Prediction

In this notebook, we'll build and evaluate two machine‑learning models, a logistic regression baseline and a small neural network, that try to answer a simple question: given a client's history, will they default on their next credit card payment? Accurately anticipating defaults helps banks reduce losses and adjust credit policies.

We'll be working with the "Default of Credit Card Clients" dataset from Taiwan (2005). Each of the 30,000 rows describes one customer, including demographic information, credit limits, six months of bill statements and payments, and whether the customer defaulted the following month. Our goal is to train a model that takes in these inputs and predicts the binary target `DefaultNextMonth`.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset from the CSV converted from the original XLS file
file_path = 'default of credit card clients.csv'

# The first row of the CSV produced by LibreOffice contains duplicate column names, so we skip it.
raw_df = pd.read_csv(file_path, header=1)

# Rename PAY_0 to PAY_1 for clarity, and rename the target column
raw_df.rename(columns={
    'PAY_0': 'PAY_1',
    'default payment next month': 'DefaultNextMonth'
}, inplace=True)

# Display basic information
print('Dataset shape:', raw_df.shape)
print('\nFirst five rows:')
print(raw_df.head())

# Show class distribution
class_counts = raw_df['DefaultNextMonth'].value_counts()
print('\nClass distribution:')
print(class_counts)

# Plot class distribution
plt.figure()
class_counts.plot(kind='bar')
plt.xlabel('DefaultNextMonth')
plt.ylabel('Count')
plt.title('Class distribution of default versus non‑default')
plt.show()


## Exploring the data

Before training any models, it's helpful to understand what the dataset looks like and clean up any quirks. The data file contains 30,000 rows and 25 columns (including the `ID` column and the target). Some of the variables, such as `SEX`, `EDUCATION`, and `MARRIAGE`, are stored as integer codes rather than descriptive labels. The target column `DefaultNextMonth` is also imbalanced: roughly 22 % of customers in this sample defaulted on their next payment, so a model that always predicts "no default" would be right about 78 % of the time without identifying a single defaulter. Accuracy alone is therefore a misleading metric here.
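
To make the imbalance concrete, here is a minimal sketch of that "always predict no default" baseline. The labels are synthetic, with an assumed 22 % positive rate standing in for `raw_df['DefaultNextMonth']`; the names `X_demo`, `y_demo` and `baseline` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 22 % positives, mimicking the imbalance described above.
# These are illustrative values, not the real dataset.
y_demo = np.array([1] * 220 + [0] * 780)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for this baseline

# A baseline that always predicts the most frequent class ("no default")
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_hat = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_hat)
baseline_recall = recall_score(y_demo, y_hat)
print(f"Accuracy: {baseline_accuracy:.2f}, recall on defaulters: {baseline_recall:.2f}")
```

The baseline scores well on accuracy yet has zero recall on the positive class, which is why AUC, precision and recall are more informative for this problem.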

A few other points worth noting:

- The education variable includes undocumented codes 0, 5 and 6. We'll map all of these to a single "Other" category, represented by code 4.
- Similarly, the marriage variable occasionally takes the value 0, which isn't defined in the data dictionary; we'll map it to the "Other" category (code 3).
- Payment status columns `PAY_1`–`PAY_6` use −2 to denote a month with no transaction and −1 to denote an on‑time payment. We'll treat both values as a single 'no delay' indicator (−1).
- Occasionally a customer over‑pays, leading to negative bill amounts. We'll leave those values untouched.
- We'll create two simple summary features: **`AVG_BILL_AMT`** (the average of the six `BILL_AMT` columns) and **`AVG_PAY_AMT`** (the average of the six `PAY_AMT` columns). These average values capture the typical monthly bill and payment amount for each client.

With the data cleaned up, we'll split the dataset into training, validation and test sets (70/15/15) using stratified sampling to maintain the class proportions. Because default cases are scarce, we'll apply class weighting when training the logistic regression and oversample the minority class when training the neural network. Continuous variables will be standardised (zero mean and unit variance), and categorical variables will be one‑hot encoded so that the algorithms don't treat their integer codes as ordered quantities.
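
The class weighting mentioned above can be sketched as follows. Each class gets the weight N / (2 · N_c), where N is the number of samples and N_c the number in class c; for two classes this matches scikit-learn's `'balanced'` heuristic. The label vector here is synthetic (an assumed 22 % default rate), so the printed weights are illustrative only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with roughly the imbalance described above (illustrative only)
y_demo = np.array([0] * 780 + [1] * 220)

# Manual weights: total / (n_classes * count_of_class)
neg, pos = np.bincount(y_demo)
total = neg + pos
manual_weights = {0: total / (2 * neg), 1: total / (2 * pos)}

# scikit-learn's built-in 'balanced' heuristic uses the same formula
balanced = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)
sklearn_weights = dict(zip([0, 1], balanced))

print(manual_weights)
print(sklearn_weights)
```

Because the two results agree, passing `class_weight='balanced'` to `LogisticRegression` should be an equivalent, shorter alternative to building the dictionary by hand.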

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.utils import resample

# Work on a copy to avoid modifying raw_df directly
df = raw_df.copy()

# Fix EDUCATION and MARRIAGE anomalies
df['EDUCATION'] = df['EDUCATION'].replace({0: 4, 5: 4, 6: 4})
df['MARRIAGE']  = df['MARRIAGE'].replace({0: 3})

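# Combine -2 and -1 for the payment status features.
# List the six status columns explicitly: a 'PAY_' prefix test would also
# match the payment amount columns PAY_AMT1 to PAY_AMT6.
pay_cols = [f'PAY_{i}' for i in range(1, 7)]
for col in pay_cols:
    df[col] = df[col].replace(-2, -1)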

# Feature engineering: average bill and payment amounts
bill_cols = [f'BILL_AMT{i}' for i in range(1,7)]
pay_amt_cols = [f'PAY_AMT{i}' for i in range(1,7)]
df['AVG_BILL_AMT'] = df[bill_cols].mean(axis=1)
df['AVG_PAY_AMT']  = df[pay_amt_cols].mean(axis=1)

# Drop the ID column if present
if 'ID' in df.columns:
    df.drop(columns=['ID'], inplace=True)

# Define feature matrix X and target vector y
X = df.drop(columns=['DefaultNextMonth'])
y = df['DefaultNextMonth']

# Identify categorical and numeric columns
categorical_cols = ['SEX', 'EDUCATION', 'MARRIAGE']
# treat payment status columns as numeric ordinals here
numeric_cols = [col for col in X.columns if col not in categorical_cols]

# Preprocessing pipeline: scale numeric features and one‑hot encode categoricals
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(drop='first'), categorical_cols)
    ]
)

# Split data into train/validation/test sets (70/15/15)
X_train_full, X_temp, y_train_full, y_temp = train_test_split(X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42)

# Compute class weights for logistic regression
neg, pos = np.bincount(y_train_full)
total = neg + pos
class_weight = {0: total / (2 * neg), 1: total / (2 * pos)}

# Build and train a logistic regression classifier
logistic_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=300, class_weight=class_weight))
])

logistic_pipeline.fit(X_train_full, y_train_full)

# Evaluate logistic regression on validation and test sets
def evaluate_model(name, model, X_eval, y_eval):
    y_pred  = model.predict(X_eval)
    y_proba = model.predict_proba(X_eval)[:, 1]
    auc     = roc_auc_score(y_eval, y_proba)
    acc     = (y_pred == y_eval).mean()
    prec    = precision_score(y_eval, y_pred)
    rec     = recall_score(y_eval, y_pred)
    f1      = f1_score(y_eval, y_pred)
    print(f"{name} – AUC: {auc:.3f}, Accuracy: {acc:.3f}, Precision: {prec:.3f}, Recall: {rec:.3f}, F1: {f1:.3f}")
    return y_pred, y_proba

print('Logistic Regression Performance:')
evaluate_model('Validation', logistic_pipeline, X_val, y_val)
y_pred_log, y_proba_log = evaluate_model('Test', logistic_pipeline, X_test, y_test)

# Oversample the minority class for the neural network
train_data = pd.concat([X_train_full, y_train_full], axis=1)
majority   = train_data[train_data['DefaultNextMonth'] == 0]
minority   = train_data[train_data['DefaultNextMonth'] == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
train_bal   = pd.concat([majority, minority_up])
X_train_bal = train_bal.drop(columns=['DefaultNextMonth'])
y_train_bal = train_bal['DefaultNextMonth']

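# Build and train a multi‑layer perceptron (two hidden layers).
# clone() gives this pipeline its own unfitted copy of the preprocessor;
# reusing the same object would refit the one inside logistic_pipeline.
from sklearn.base import clone

mlp_pipeline = Pipeline([
    ('preprocessor', clone(preprocessor)),
    ('classifier', MLPClassifier(hidden_layer_sizes=(32, 16), activation='relu', solver='adam', max_iter=50, random_state=42, early_stopping=True))
])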

mlp_pipeline.fit(X_train_bal, y_train_bal)

print('\nNeural Network (MLP) Performance:')
evaluate_model('Validation', mlp_pipeline, X_val, y_val)
y_pred_mlp, y_proba_mlp = evaluate_model('Test', mlp_pipeline, X_test, y_test)

# Confusion matrix for the neural network on the test set
cm = confusion_matrix(y_test, y_pred_mlp)
print('\nConfusion matrix (MLP on test set):')
print(cm)

# Classification report provides per‑class precision/recall/F1
print('\nClassification report (MLP on test set):')
print(classification_report(y_test, y_pred_mlp))


## Results and discussion

After preparing the data, we trained two different classifiers to predict credit default and compared their performance on the held‑out test set.

- **Logistic regression**: This simple, interpretable model uses class weighting to account for the imbalanced target. In our runs, it achieves an area under the ROC curve (AUC) of about 0.72 and recall of roughly 0.61 on the test set. In other words, it correctly identifies about 61 % of defaulters but doesn't distinguish high‑risk customers particularly well.
- **Multi‑layer perceptron (MLP)**: This neural network with two hidden layers is trained on a balanced version of the training data created by oversampling defaulters. It performs better than logistic regression: the test AUC rises to about 0.77, and recall improves to about 0.66. Precision remains modest (~0.42), meaning many non‑defaulters are flagged as risky, but the bank can adjust the decision threshold depending on how cautious they want to be. The confusion matrix shows that the MLP correctly classifies most non‑defaulters while catching about two‑thirds of the defaulters.
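
The threshold adjustment mentioned above can be sketched as follows. This is a minimal illustration on synthetic data generated with `make_classification` (an assumption standing in for the credit dataset), using a class-weighted logistic regression; the thresholds 0.3, 0.5 and 0.7 are arbitrary example values, and the numbers printed will not match the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the credit data (about 22 % positives)
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=10, weights=[0.78, 0.22], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=300, class_weight="balanced").fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Lower thresholds flag more customers: recall rises, precision usually falls
results = {}
for threshold in (0.3, 0.5, 0.7):
    y_hat = (proba >= threshold).astype(int)
    precision = precision_score(y_te, y_hat, zero_division=0)
    recall = recall_score(y_te, y_hat)
    results[threshold] = (precision, recall)
    print(f"threshold={threshold:.1f}  precision={precision:.3f}  recall={recall:.3f}")
```

In a real workflow the threshold would be chosen on the validation set, not the test set, so that the final test metrics remain an unbiased estimate.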

### What worked well

- Starting with a careful exploration of the data helped us catch anomalies and create a couple of simple, informative features.
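- Keeping separate training, validation and test splits let us check each model on the validation set while reserving the test set for a final estimate of generalisation.
- The MLP improved both AUC and recall compared with the baseline logistic model, which suggests a more expressive classifier helps, although the two models also differ in how they handle class imbalance (oversampling versus class weighting).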

### What could be improved

- Even the neural network misses many defaulters and produces a fair number of false positives. Adding more informative variables (such as income, employment or credit history from other financial products) or experimenting with ensemble methods (like gradient boosting) could help.
- Neural networks are essentially black boxes; to deploy them responsibly, we'd need to compute feature importances or use tools like SHAP or LIME to explain individual predictions.
- In this notebook, we treat the payment status variables as ordinal integers. One‑hot encoding these variables might uncover non‑linear relationships, but it would also increase the dimensionality and training complexity.
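
As one possible starting point for the explainability item above, here is a minimal sketch of permutation importance, a model-agnostic technique available in scikit-learn. It is not SHAP or LIME, and it runs on synthetic data with hypothetical feature names (`feature_0`, `feature_1`, ...), not the notebook's `mlp_pipeline`. The idea is to shuffle one column at a time on held-out data and measure how much the AUC drops.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data; feature names are hypothetical
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.78, 0.22], random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0)),
]).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and record the drop in AUC
result = permutation_importance(
    model, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=0
)

# Rank features from most to least important
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: mean AUC drop = {importance:.3f}")
```

Applied to this notebook, the same call would take the fitted `mlp_pipeline` together with `X_val` and `y_val`. Permutation importance gives a global ranking only; explaining individual predictions would still need a tool such as SHAP or LIME.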

In practice, a bank would monitor model performance over time, adjust the prediction threshold based on the real cost of defaults versus false alarms, and periodically retrain the model as economic conditions and customer behaviour change.