
# Credit Card Customers: Default Risk & Behavior Analysis (Python)

**What this does**
- Downloads the **Default of Credit Card Clients** dataset (UCI ML Repository)
- Cleans & explores customer attributes (limits, bill/paid amounts, delinquency history)
- Builds a simple **Logistic Regression** model to predict **default payment next month**
- Shows metrics (Accuracy, ROC-AUC) and most important features

**Run it online (recommended):**
- Open in **Google Colab**, then run all cells
- No local setup needed; internet required to download the dataset

**Dataset**
- Source: UCI ML Repository — *Default of Credit Card Clients* (Taiwan, 2005)
- Rows: 30,000 customers; Target: `default payment next month` (0/1), renamed to `default_next_month` during cleaning



## 🚀 Quick Start

1. Run the next cell to install dependencies (Colab: OK to run as-is).  
2. Run the **Data Load** cell — it will fetch the dataset from UCI.  
3. Run the remaining cells in order.


In [None]:

# If running on Colab, these ensure everything is available
# (pandas reads the legacy .xls format through xlrd; openpyxl only covers .xlsx)
!pip -q install pandas numpy matplotlib scikit-learn seaborn xlrd openpyxl


In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, ConfusionMatrixDisplay

pd.set_option('display.max_columns', 100)



## 📥 Load Data

We load the spreadsheet from the official UCI link. If the download fails, the cell prints a hint and re-raises the error.
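
The load cell has a single source. If a real fallback is wanted, one option is to try several sources in order. The sketch below is a minimal illustration: `read_first_available` and `fake_reader` are hypothetical names, and the stand-in reader only simulates a failed download so the example runs offline.

```python
import pandas as pd

def read_first_available(sources, reader=pd.read_excel, **kwargs):
    """Try each source in order; return (DataFrame, source) for the first that loads."""
    errors = []
    for src in sources:
        try:
            return reader(src, **kwargs), src
        except Exception as exc:  # network errors, missing files, parse errors
            errors.append(f"{src}: {exc}")
    raise RuntimeError("No source could be read:\n" + "\n".join(errors))

# Stand-in reader (hypothetical) that fails for URLs and succeeds for local paths
def fake_reader(src, **kwargs):
    if src.startswith("https://"):
        raise OSError("simulated network failure")
    return pd.DataFrame({"LIMIT_BAL": [20000, 120000]})

demo_df, used = read_first_available(
    ["https://example.invalid/data.xls", "local_copy.xls"], reader=fake_reader
)
print(used, demo_df.shape)
```

In this notebook the call might look like `read_first_available([uci_url, 'default_of_credit_card_clients.xls'], header=1)`, assuming the file has been uploaded to the working directory.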


In [None]:

import pandas as pd

uci_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"

try:
    df_raw = pd.read_excel(uci_url, header=1)  # header=1: take column names from the second spreadsheet row
    print("✅ Downloaded dataset from UCI. Shape:", df_raw.shape)
    display(df_raw.head())
except Exception as e:
    print("⚠️ Could not download from UCI. Error:\n", e)
    print("\nTry downloading the file manually from UCI, uploading it to this runtime as "
          "'default_of_credit_card_clients.xls', and setting `uci_url` to that local path.")
    raise



## 🧹 Clean & Prepare
- Rename target and a few column names for convenience
- Ensure numeric types where needed
- Helper features (e.g., utilization ratios) are left as an optional extension; the cleaning cell does not create them
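
A utilization ratio can be sketched as the latest bill divided by the credit limit. The example below runs on a toy frame with made-up values. The column name `utilization_1` and the handling of zero limits are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows using the cleaned column names; values are invented for illustration
toy = pd.DataFrame({
    "limit_bal": [20000, 50000, 0],
    "bill_amt1": [5000, 60000, 100],
})

# Utilization: share of the credit limit used by the latest bill.
# A zero limit would divide by zero, so it is treated as missing.
limit = toy["limit_bal"].replace(0, np.nan)
toy["utilization_1"] = toy["bill_amt1"] / limit

print(toy)
```

Values above 1 mean the bill exceeds the limit, so clipping or flagging them may be worth considering before modeling.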


In [None]:

df = df_raw.copy()

# Standardize column names
df.columns = [c.strip().replace(' ', '_').replace('.', '_').lower() for c in df.columns]

# Align known column names
rename_map = {
    'default_payment_next_month': 'default_next_month',
    'pay_0': 'pay_1'  # The UCI file labels the September status PAY_0 and has no PAY_1; rename for consistency with PAY_2..PAY_6
}
df = df.rename(columns=rename_map)

# Quick type checks
numeric_cols = [c for c in df.columns if c not in ['sex','education','marriage']]
for c in numeric_cols:
    df[c] = pd.to_numeric(df[c], errors='coerce')

# Drop rows with missing target
df = df.dropna(subset=['default_next_month']).reset_index(drop=True)

print("Shape after cleaning:", df.shape)
df.head()



## 🔎 Exploratory Data Analysis
- Class balance for `default_next_month`
- Distribution of key numeric features
- Correlations (heatmap for a subset to keep it readable)
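
Besides the plots, a grouped default rate is a quick way to see how the target varies with delinquency status. The sketch below uses a small invented frame with the cleaned column names, so the numbers are not results from the dataset.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "pay_1": [-1, 0, 0, 1, 2, 2, 2, 0],
    "default_next_month": [0, 0, 0, 1, 1, 1, 0, 1],
})

# Customer count and share of defaulters per repayment status
rate_by_status = (
    toy.groupby("pay_1")["default_next_month"]
       .agg(customers="count", default_rate="mean")
)
print(rate_by_status)
```

Reporting the count next to the rate helps avoid over-reading groups with very few customers.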


In [None]:

# Target balance
target_counts = df['default_next_month'].value_counts(dropna=False).sort_index()
print("Target distribution (0=no default, 1=default):\n", target_counts)

fig, ax = plt.subplots(figsize=(5,4))
target_counts.plot(kind='bar', ax=ax)
ax.set_title('Default Next Month — Class Balance')
ax.set_xlabel('Default (0/1)')
ax.set_ylabel('Count')
plt.show()

# Inspect some key numeric columns
cols_to_plot = ['limit_bal', 'age', 'bill_amt1', 'pay_amt1']
df[cols_to_plot].hist(bins=30, figsize=(10,6))
plt.tight_layout()
plt.show()

# Correlation (subset for readability)
subset_cols = ['default_next_month','limit_bal','age','bill_amt1','bill_amt2','bill_amt3','pay_amt1','pay_amt2','pay_amt3','pay_1','pay_2','pay_3']
corr = df[subset_cols].corr(numeric_only=True)
plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=False, cmap='viridis')
plt.title('Correlation (selected features)')
plt.show()



## 🧰 Features & Split
We use the credit limit, age, six months of bill and payment amounts, six months of repayment status (`pay_1` to `pay_6`), and the coded demographics (`sex`, `education`, `marriage`).  
All features are standardized before fitting **Logistic Regression**.


In [None]:

feature_cols = [
    'limit_bal','age',
    'bill_amt1','bill_amt2','bill_amt3','bill_amt4','bill_amt5','bill_amt6',
    'pay_amt1','pay_amt2','pay_amt3','pay_amt4','pay_amt5','pay_amt6',
    'pay_1','pay_2','pay_3','pay_4','pay_5','pay_6',
    'sex','education','marriage'
]

# Some columns may not exist; keep only the feature columns that are present
X = df[[c for c in feature_cols if c in df.columns]].copy()
y = df['default_next_month'].astype(int)

# Fill missing with 0 (simple strategy for demo; improve in production)
X = X.fillna(0)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale all features to zero mean and unit variance
# (the coded categoricals sex/education/marriage are scaled too; simple choice for demo)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

X_train.shape, X_test.shape



## 🤖 Model: Logistic Regression
We train a basic model and report **Accuracy**, **ROC-AUC**, a classification report, and a **Confusion Matrix**.
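
Accuracy is easier to judge against a trivial reference when the classes are imbalanced. The sketch below fits a majority-class `DummyClassifier` on toy labels. The 80/20 split is an assumption chosen for illustration, not the dataset's actual class balance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels: 80 non-defaults and 20 defaults
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((100, 1))  # features are ignored by the dummy model

# Always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
base_acc = accuracy_score(y_toy, baseline.predict(X_toy))
base_auc = roc_auc_score(y_toy, baseline.predict_proba(X_toy)[:, 1])
print(f"Baseline accuracy: {base_acc:.2f} | ROC-AUC: {base_auc:.2f}")
```

A model only adds value to the extent that it beats this reference, which is why ROC-AUC is reported alongside accuracy.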


In [None]:

logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train_scaled, y_train)

pred = logreg.predict(X_test_scaled)
proba = logreg.predict_proba(X_test_scaled)[:,1]

acc  = accuracy_score(y_test, pred)
auc  = roc_auc_score(y_test, proba)

print(f"Accuracy: {acc:.3f} | ROC-AUC: {auc:.3f}\n")
print("Classification Report:\n", classification_report(y_test, pred))

disp = ConfusionMatrixDisplay.from_predictions(y_test, pred)
plt.title("Confusion Matrix — Logistic Regression")
plt.show()



## 📌 Most Influential Features
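We rank features by absolute coefficient value as a simple proxy for importance. The comparison is meaningful only because the features were standardized to a common scale.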
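
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that can be easier to read. The coefficients below are hypothetical placeholders, not results from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients on standardized features (not real results)
coefs = pd.Series({"pay_1": 0.60, "limit_bal": -0.10, "age": 0.05})

# exp(coef) is the multiplicative change in the odds of default
# for a one-standard-deviation increase in that feature
summary = pd.DataFrame({"coef": coefs, "odds_ratio": np.exp(coefs)})

# Order rows by absolute coefficient, largest first
summary = summary.reindex(coefs.abs().sort_values(ascending=False).index)
print(summary.round(3))
```

An odds ratio above 1 points toward higher default odds and below 1 toward lower odds, holding the other features fixed.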


In [None]:

coef_series = pd.Series(logreg.coef_[0], index=X.columns).sort_values(key=np.abs, ascending=False)
display(coef_series.head(15).to_frame('coef').style.background_gradient(axis=0))



## ✅ Conclusions & Next Steps
- We demonstrated a **banking use case** (credit card customer default) end-to-end
- Simple logistic regression gives a quick, interpretable baseline; try tree models (XGBoost, Random Forest) to capture nonlinearity
- Improve preprocessing (one-hot encode categorical features, better imputation, outlier handling)
- Perform **hyperparameter tuning** and **cross-validation**
- Consider **cost-sensitive learning** if default has asymmetric costs
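
Two of these next steps, cross-validation and cost-sensitive learning, can be sketched together. The example uses synthetic imbalanced data from `make_classification` as a stand-in for the credit data. Using `class_weight="balanced"` is one simple form of cost-sensitive learning, chosen here as an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the credit data (about 20% positives)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    weights=[0.8, 0.2], random_state=42,
)

# Scaling lives inside the pipeline so each fold is scaled on its own training part
cv_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(cv_model, X_syn, y_syn, cv=cv, scoring="roc_auc")
print(f"ROC-AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Stratified folds keep the class ratio similar across splits, which matters when the positive class is the minority.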

---

### Optional: Extend to Bank Product KPIs
- Aggregate monthly **bill/paid** amounts for portfolio-level trends
- Build dashboards (Plotly) for **collections, delinquency buckets**, and **roll rates**
- Add **survival analysis** (time-to-default) for deeper risk insights
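
The portfolio-level trend idea can be sketched by summing each month's bill and payment columns. The month mapping follows the UCI data description (suffix 1 is September 2005, suffix 6 is April 2005). The toy values and the `paid_to_billed` ratio are assumptions for illustration.

```python
import pandas as pd

# Suffix 1 = September 2005 ... suffix 6 = April 2005
months = ["2005-09", "2005-08", "2005-07", "2005-06", "2005-05", "2005-04"]

# Two invented customers with the cleaned column names
toy = pd.DataFrame({
    **{f"bill_amt{i}": [1000 * i, 2000] for i in range(1, 7)},
    **{f"pay_amt{i}": [100 * i, 500] for i in range(1, 7)},
})

# One row per month with portfolio totals, sorted chronologically
trend = pd.DataFrame(
    {
        "total_billed": [toy[f"bill_amt{i}"].sum() for i in range(1, 7)],
        "total_paid": [toy[f"pay_amt{i}"].sum() for i in range(1, 7)],
    },
    index=pd.PeriodIndex(months, freq="M"),
).sort_index()
trend["paid_to_billed"] = trend["total_paid"] / trend["total_billed"]
print(trend)
```

A frame shaped like this can feed a line chart or dashboard directly, with one series per KPI.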
