# Credit Default Prediction using Classical Machine Learning Models

**Dataset:** UCI Credit Card Default  
**Source table:** `workspace.default.default_of_credit_card_clients`  
**Environment:** Databricks Serverless (SQL + Python)

## Purpose of the Study
This study analyzes customer credit behavior and evaluates whether classical machine
learning models can predict credit card default. Default prediction is a core problem in
financial risk management: extending credit to a customer who later defaults causes direct
losses, while turning away a reliable customer forgoes revenue.

We compare two interpretable models:
- Logistic Regression
- Decision Tree

The goal is not only predictive performance, but also interpretability and practical usability.


In [0]:
# Load raw table
TABLE = "workspace.default.default_of_credit_card_clients"
df_raw = spark.table(TABLE)
display(df_raw.limit(5))

In [0]:
# Cast every column to INT; values that are not numeric
# (such as a header row loaded as data) become NULL
from pyspark.sql import functions as F

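def safe_int(colname):
    """Return a Column that parses `colname` as INT, or NULL if it is not numeric."""
    # Normalize to a trimmed string and drop thousands separators ("1,000" -> "1000")
    cleaned = F.trim(F.col(colname).cast("string"))
    cleaned = F.regexp_replace(cleaned, ",", "")
    # Accept only optionally signed integers or decimals
    is_number = cleaned.rlike(r"^-?\d+(\.\d+)?$")
    return (
        # Go through double so that "12.0" parses; the int cast drops any fraction
        F.when(is_number, cleaned.cast("double").cast("int"))
         .otherwise(F.lit(None).cast("int"))
         .alias(colname)
    )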

df = df_raw.select(*[safe_int(c) for c in df_raw.columns])

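# Drop the header row that was loaded as data: its `_c0` value is not numeric,
# so safe_int has already turned it into NULL
df = df.filter(F.col("_c0").isNotNull())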

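# Rename the generic column names to descriptive ones.
# The source dataset labels the repayment-status columns PAY_0, PAY_2, ..., PAY_6
# (there is no PAY_1), so the jump from pay_0 to pay_2 is intentional.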
rename_map = {
    "_c0": "id",
    "X1": "limit_bal",
    "X2": "sex",
    "X3": "education",
    "X4": "marriage",
    "X5": "age",
    "X6": "pay_0",
    "X7": "pay_2",
    "X8": "pay_3",
    "X9": "pay_4",
    "X10": "pay_5",
    "X11": "pay_6",
    "X12": "bill_amt1",
    "X13": "bill_amt2",
    "X14": "bill_amt3",
    "X15": "bill_amt4",
    "X16": "bill_amt5",
    "X17": "bill_amt6",
    "X18": "pay_amt1",
    "X19": "pay_amt2",
    "X20": "pay_amt3",
    "X21": "pay_amt4",
    "X22": "pay_amt5",
    "X23": "pay_amt6",
    "Y": "default_payment_next_month"
}

for old, new in rename_map.items():
    if old in df.columns:
        df = df.withColumnRenamed(old, new)

df.printSchema()
display(df.limit(5))

df.createOrReplaceTempView("credit_default")
print("Temp view created: credit_default")
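
To make the cleaning rule easier to reason about, the following is a plain-Python sketch of what `safe_int` does to a single value. The helper `safe_int_py` and the sample strings are hypothetical and exist only for illustration; the truncation of decimals assumes Spark's double-to-int cast drops the fractional part, which is how the cell above is expected to behave.

```python
import re
from typing import Optional

# Same pattern as the rlike() check: optional minus sign, digits,
# optional decimal part. re.ASCII restricts \d to the characters 0-9.
NUMERIC = re.compile(r"-?\d+(\.\d+)?", re.ASCII)


def safe_int_py(value: object) -> Optional[int]:
    """Hypothetical single-value analogue of the Spark safe_int helper."""
    if value is None:
        return None
    # Trim surrounding spaces and remove thousands separators
    cleaned = str(value).strip(" ").replace(",", "")
    if NUMERIC.fullmatch(cleaned) is None:
        return None  # non-numeric text, e.g. a header cell, maps to NULL
    # Parse as float first, then truncate toward zero
    return int(float(cleaned))


# Illustrative inputs only; these are not rows from the real table
samples = ["20000", " 1,500 ", "-2", "12.7", "LIMIT_BAL", ""]
converted = [safe_int_py(s) for s in samples]
print(converted)
```

Header text and empty strings come back as `None`, which is why filtering on a non-null `_c0` is enough to remove the header row.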

In [0]:
%sql
SELECT
  COUNT(*) AS rows,
  SUM(default_payment_next_month) AS defaults,
  ROUND(AVG(default_payment_next_month),4) AS default_rate
FROM credit_default;

In [0]:
import pandas as pd
import matplotlib.pyplot as plt

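# Collect the Spark DataFrame to pandas on the driver;
# this assumes the table fits comfortably in memory
pdf = df.toPandas()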

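# sort_index() fixes the bar order to class 0, then class 1,
# so the two colors always map to the same class
pdf["default_payment_next_month"].value_counts().sort_index().plot(
    kind="bar", color=["#4DBBD5", "#E64B35"]
)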
plt.title("Class Balance: Default vs No Default")
plt.xlabel("Class (0 = No Default, 1 = Default)")
plt.ylabel("Count")
plt.show()

plt.boxplot(
    [pdf[pdf.default_payment_next_month==0].limit_bal,
     pdf[pdf.default_payment_next_month==1].limit_bal],
    labels=["No Default", "Default"],
    showfliers=False
)
plt.title("Credit Limit by Default Status")
plt.ylabel("Limit Balance")
plt.show()

plt.boxplot(
    [pdf[pdf.default_payment_next_month==0].age,
     pdf[pdf.default_payment_next_month==1].age],
    labels=["No Default", "Default"],
    showfliers=False
)
plt.title("Age by Default Status")
plt.ylabel("Age")
plt.show()

In [0]:
import numpy as np

num_cols = pdf.select_dtypes(include=[np.number])
corr = num_cols.corr()["default_payment_next_month"].drop("default_payment_next_month")
corr.abs().sort_values(ascending=False).head(15).plot(
    kind="barh", figsize=(7,5)
)
plt.title("Top Correlations with Default")
plt.xlabel("Absolute Correlation")
plt.show()

## Machine Learning Models

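We hold out 25% of the data as a test set and stratify the split on the target, so that
the training and test sets keep the same default rate despite the class imbalance.
Both models are trained with `class_weight="balanced"`, which gives the less frequent
default class a larger weight during fitting.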
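
As a rough illustration of what balanced class weighting amounts to, the sketch below reproduces the heuristic that scikit-learn documents for `class_weight="balanced"`, namely `n_samples / (n_classes * class_count)`. The function name `balanced_class_weights` and the toy labels are assumptions made for this example, not part of the notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * number of samples in that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Illustrative labels only (not the real dataset): 8 non-defaults, 2 defaults
toy_labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(toy_labels)
print(weights)  # {0: 0.625, 1: 2.5}
```

The rarer class receives the larger weight, so each misclassified default contributes more to the training objective than a misclassified non-default.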


In [0]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix, ConfusionMatrixDisplay

X = pdf.drop(columns=["default_payment_next_month", "id"])
y = pdf["default_payment_next_month"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Logistic Regression
lr = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=3000, class_weight="balanced"))
])
lr.fit(X_train, y_train)
lr_prob = lr.predict_proba(X_test)[:,1]
print("Logistic Regression AUC:", roc_auc_score(y_test, lr_prob))

# Decision Tree
dt = DecisionTreeClassifier(max_depth=6, class_weight="balanced", random_state=42)
dt.fit(X_train, y_train)
dt_prob = dt.predict_proba(X_test)[:,1]
print("Decision Tree AUC:", roc_auc_score(y_test, dt_prob))
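
For readers less familiar with AUC, one way to interpret it is as the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter, with ties counted as one half. The sketch below computes that quantity directly on made-up scores; `pairwise_auc` is a hypothetical helper written for illustration, and its quadratic loop is not meant to replace `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and scores, purely for illustration
toy_y = [0, 0, 1, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.40]
auc = pairwise_auc(toy_y, toy_scores)
print(round(auc, 4))  # 0.8333 (7.5 correctly ranked pairs out of 9)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is the scale on which the two printed AUC values should be read.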

In [0]:
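# Confusion matrices at a 0.5 probability threshold
lr_pred = (lr_prob >= 0.5).astype(int)
dt_pred = (dt_prob >= 0.5).astype(int)

# Label the axes with class names instead of the raw 0/1 codes
class_names = ["No Default", "Default"]

ConfusionMatrixDisplay(
    confusion_matrix(y_test, lr_pred), display_labels=class_names
).plot(cmap="Blues")
plt.title("Logistic Regression Confusion Matrix")
plt.show()

ConfusionMatrixDisplay(
    confusion_matrix(y_test, dt_pred), display_labels=class_names
).plot(cmap="Blues")
plt.title("Decision Tree Confusion Matrix")
plt.show()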
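
A confusion matrix is often summarized by a few rates. The sketch below derives precision, recall, and specificity from the four cells, assuming scikit-learn's binary layout `[[tn, fp], [fn, tp]]`. The helper `rates_from_confusion` and the counts are hypothetical and are not results from this notebook.

```python
def rates_from_confusion(tn, fp, fn, tp):
    """Summary rates for the positive (default) class from confusion-matrix cells."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # flagged customers who defaulted
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # defaulters that were flagged
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # non-defaulters left alone
    return {"precision": precision, "recall": recall, "specificity": specificity}


# Hypothetical counts for illustration only
rates = rates_from_confusion(tn=70, fp=10, fn=5, tp=15)
print(rates)
```

With the notebook's own matrices, the four cells could be obtained with `confusion_matrix(y_test, lr_pred).ravel()`, which returns them in the order `tn, fp, fn, tp` for a binary target.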

## Results & Conclusion

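- Both models rank defaulters better than random guessing, which corresponds to a test AUC above 0.5.
- Logistic Regression provides stable performance and strong interpretability: each coefficient gives the direction and strength of one feature's effect.
- Decision Tree captures non-linear relationships but may overfit, which is why its depth is capped at 6.
- Repayment status variables (`pay_0` to `pay_6`) are the strongest predictors of default.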
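
To illustrate what interpretability means for Logistic Regression, a coefficient can be converted into an odds ratio with `exp(coefficient)`. Because the pipeline standardizes the features, each coefficient refers to a change of one standard deviation. The coefficient values below are invented for illustration and are not outputs of this notebook; in the fitted pipeline they could be read from `lr.named_steps["lr"].coef_`.

```python
import math


def odds_ratio(coefficient, delta=1.0):
    """Multiplicative change in the odds of default when a feature rises by `delta`."""
    return math.exp(coefficient * delta)


# Hypothetical standardized coefficients, NOT results from the fitted model
hypothetical_coefs = {"pay_0": 0.6, "limit_bal": -0.2}
ratios = {name: odds_ratio(coef) for name, coef in hypothetical_coefs.items()}

for name, ratio in ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: +1 SD {direction} the odds of default by a factor of {ratio:.2f}")
```

A ratio above 1 means the feature pushes toward default, a ratio below 1 means it pushes away from it, and a ratio of exactly 1 means no effect.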

**Conclusion:**  
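Classical, interpretable machine learning models can support credit risk assessment:
they separate likely defaulters from non-defaulters better than chance, and their
coefficients and decision rules can be inspected directly when making financial decisions.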
