# Credit Card Default Risk Analysis
### Predictive Modeling · Risk Analysis · Profit Optimization

This notebook walks through an end-to-end credit risk workflow:
- Data loading and basic inspection
- Data cleaning and feature engineering
- Exploratory Data Analysis (EDA)
- Predictive modeling (Logistic Regression and Random Forest)
- Risk and profit calculations
- Cutoff optimization for approval strategy


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, classification_report

# Display settings
pd.set_option('display.max_columns', 50)

# Load dataset
df = pd.read_csv('credit_risk_dataset.csv')
df.head()

## 1. Basic Data Inspection
We first look at the structure of the dataset, data types, and basic statistics.

In [None]:
# Shape and dtypes
print('Shape:', df.shape)
print('\nData types:')
print(df.dtypes)

# Summary statistics for numeric columns
df.describe().T

In [None]:
# Check target balance
print(df['default_12m'].value_counts(normalize=True))
df['default_12m'].value_counts(normalize=True).plot(kind='bar')
plt.xlabel('Default in 12 months')
plt.ylabel('Proportion')
plt.title('Target Class Distribution')
plt.show()

## 2. Data Cleaning and Feature Engineering
We drop duplicate rows, derive the utilization rate, sanity-check months on book, and report any missing values.

In [None]:
# Remove duplicate rows if any
df = df.drop_duplicates()

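# Recompute utilization from balance and limit, then cap it to [0, 1.5].
# A zero credit limit gives inf (capped to 1.5) or NaN for 0/0 (reported by the missing-value check below).
df['utilization_rate'] = df['current_balance'] / df['credit_limit']
df['utilization_rate'] = df['utilization_rate'].clip(lower=0, upper=1.5)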

# Months on book is already computed, but ensure non-negative
df['months_on_book'] = df['months_on_book'].clip(lower=0)

# Simple handling of missing values (if any)
missing = df.isna().mean()
print('Missing value fraction per column:')
print(missing[missing > 0])

# For this simulated dataset, we expect no missing values. In a real project, we would impute or drop as appropriate.
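
If missing values did appear, one simple option is median imputation for numeric columns together with a flag that records which rows were filled. The sketch below runs on a small made-up frame; the values and the helper name `impute_median` are illustrative assumptions, not part of the dataset or the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; values are made up for illustration only
toy = pd.DataFrame({
    'income': [42000.0, np.nan, 58000.0, 61000.0],
    'employment_length_years': [3.0, 7.0, np.nan, 12.0],
})

def impute_median(frame, cols):
    """Fill numeric gaps with the column median and flag the imputed rows."""
    out = frame.copy()
    for col in cols:
        # Record missingness before filling so the model can still see it
        out[f'{col}_was_missing'] = out[col].isna().astype(int)
        out[col] = out[col].fillna(out[col].median())
    return out

toy_imputed = impute_median(toy, ['income', 'employment_length_years'])
print(toy_imputed)
```

In a modeling pipeline the medians would normally be computed on the training split only and then applied to the test split, so that no information leaks from test data.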

## 3. Exploratory Data Analysis (EDA)
We explore key drivers of risk: age, income, utilization, late payments, and delinquency history.

In [None]:
# Histograms for selected numeric features
numeric_cols = ['age', 'income', 'credit_limit', 'current_balance', 'utilization_rate',
                'num_late_payments_12m', 'max_days_past_due', 'months_on_book']

df[numeric_cols].hist(figsize=(12, 8), bins=30)
plt.tight_layout()
plt.show()

In [None]:
# Default rate by utilization band
util_bins = [0, 0.25, 0.5, 0.75, 1.0, 1.5]
df['util_band'] = pd.cut(df['utilization_rate'], bins=util_bins, include_lowest=True)
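# observed=False keeps empty bands in the output and avoids the pandas FutureWarning for categorical keys
util_default = df.groupby('util_band', observed=False)['default_12m'].mean()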
print(util_default)

util_default.plot(kind='bar')
plt.xlabel('Utilization band')
plt.ylabel('Default rate')
plt.title('Default Rate by Utilization Band')
plt.show()
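
A default rate per band is easier to trust when the number of accounts behind it is visible, since sparsely populated bands can show noisy rates. The following sketch reports counts next to rates. It uses synthetic data with an assumed relationship between utilization and default, purely to make the example runnable; `toy` and `band_summary` are hypothetical names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the notebook's columns (not the real data)
toy = pd.DataFrame({'utilization_rate': rng.uniform(0, 1.5, size=1000)})
# Assumed relationship: default probability rises with utilization
p_default = 0.02 + 0.10 * toy['utilization_rate'].to_numpy()
toy['default_12m'] = rng.binomial(1, p_default)

util_bins = [0, 0.25, 0.5, 0.75, 1.0, 1.5]
toy['util_band'] = pd.cut(toy['utilization_rate'], bins=util_bins, include_lowest=True)

# Accounts, defaults, and default rate per utilization band
band_summary = (
    toy.groupby('util_band', observed=False)['default_12m']
       .agg(n_accounts='count', n_defaults='sum', default_rate='mean')
)
print(band_summary)
```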

In [None]:
# Correlation matrix for key numeric features
corr_cols = ['age', 'income', 'employment_length_years', 'credit_limit', 'current_balance',
             'utilization_rate', 'num_late_payments_12m', 'max_days_past_due',
             'num_credit_cards', 'delinq_ever', 'months_on_book', 'default_12m']

corr = df[corr_cols].corr()
plt.figure(figsize=(10, 8))
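# Fix the color scale to [-1, 1] with a diverging colormap so sign and strength are comparable
plt.imshow(corr, interpolation='nearest', cmap='coolwarm', vmin=-1, vmax=1)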
plt.xticks(range(len(corr_cols)), corr_cols, rotation=90)
plt.yticks(range(len(corr_cols)), corr_cols)
plt.title('Correlation Matrix')
plt.colorbar()
plt.tight_layout()
plt.show()

## 4. Modeling Setup
We prepare the feature matrix and target vector, then split into training and test sets.

In [None]:
target = 'default_12m'
drop_cols = ['account_id', 'account_open_date', 'observation_date', 'util_band']

features = [c for c in df.columns if c not in drop_cols + [target]]
X = df[features]
y = df[target]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
X_train.shape, X_test.shape
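
Because the split uses `stratify=y`, the default rate should be nearly identical in the training and test sets. A quick check of that property, sketched on synthetic data with an assumed default rate of about 8% (the arrays `X_toy` and `y_toy` are placeholders, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(2000, 3))
y_toy = rng.binomial(1, 0.08, size=2000)  # assumed ~8% default rate

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# With stratification the three rates should agree to within rounding
print('Overall default rate:', round(y_toy.mean(), 4))
print('Train default rate:  ', round(y_tr.mean(), 4))
print('Test default rate:   ', round(y_te.mean(), 4))
```

The same three lines applied to `y`, `y_train`, and `y_test` would confirm the split in this notebook.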

## 5. Logistic Regression Model
We fit a baseline interpretable model and evaluate using AUC and classification metrics.

In [None]:
# Fit logistic regression
logit = LogisticRegression(max_iter=1000)
logit.fit(X_train, y_train)

y_pred_proba_logit = logit.predict_proba(X_test)[:, 1]
y_pred_logit = (y_pred_proba_logit >= 0.5).astype(int)

auc_logit = roc_auc_score(y_test, y_pred_proba_logit)
print('Logistic Regression AUC:', round(auc_logit, 3))

print('\nClassification report (threshold = 0.5):')
print(classification_report(y_test, y_pred_logit))

cm_logit = confusion_matrix(y_test, y_pred_logit)
print('Confusion matrix:\n', cm_logit)
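
The features here sit on very different scales (income in the tens of thousands, utilization between 0 and 1.5), which can slow solver convergence and makes coefficients hard to compare. One common option, not used in the cell above, is to standardize features inside a pipeline. The sketch below shows this on synthetic data; the data-generating coefficients are assumptions chosen only to produce a plausible signal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 3000

# Synthetic features on very different scales (assumed distributions)
income = rng.normal(60000, 15000, size=n)
utilization = rng.uniform(0, 1.5, size=n)
true_logit = -3.0 + 2.0 * utilization - 0.00002 * (income - 60000)
y_toy = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))
X_toy = np.column_stack([income, utilization])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# The scaler is fit on the training data only, then applied to the test data
scaled_logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logit.fit(X_tr, y_tr)

auc_scaled = roc_auc_score(y_te, scaled_logit.predict_proba(X_te)[:, 1])
print('Scaled logistic regression AUC:', round(auc_scaled, 3))
```

With standardized inputs, each coefficient describes the effect of a one standard deviation change, which makes the baseline model easier to interpret.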

In [None]:
# ROC curve for logistic regression
fpr_logit, tpr_logit, _ = roc_curve(y_test, y_pred_proba_logit)
plt.figure()
plt.plot(fpr_logit, tpr_logit, label='Logistic Regression')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Logistic Regression')
plt.legend()
plt.show()

## 6. Random Forest Model
We fit a non-linear model that can capture interactions and threshold effects a linear model misses, then compare its performance with the logistic regression baseline.

In [None]:
rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

y_pred_proba_rf = rf.predict_proba(X_test)[:, 1]
y_pred_rf = (y_pred_proba_rf >= 0.5).astype(int)

auc_rf = roc_auc_score(y_test, y_pred_proba_rf)
print('Random Forest AUC:', round(auc_rf, 3))

print('\nClassification report (threshold = 0.5):')
print(classification_report(y_test, y_pred_rf))

cm_rf = confusion_matrix(y_test, y_pred_rf)
print('Confusion matrix:\n', cm_rf)

In [None]:
# ROC curve comparison
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_proba_rf)

plt.figure()
plt.plot(fpr_logit, tpr_logit, label='Logistic Regression')
plt.plot(fpr_rf, tpr_rf, label='Random Forest')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves - Model Comparison')
plt.legend()
plt.show()

### 6.1 Feature Importance (Random Forest)
We inspect which features are most influential in the Random Forest model.

In [None]:
importances = rf.feature_importances_
feat_imp = pd.Series(importances, index=features).sort_values(ascending=False)
print(feat_imp.head(15))

plt.figure(figsize=(8, 6))
feat_imp.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.xlabel('Importance')
plt.title('Top 15 Feature Importances - Random Forest')
plt.tight_layout()
plt.show()
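
The `feature_importances_` attribute is impurity-based and computed on training data, so it can overstate features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below is a swapped-in technique rather than the notebook's own method, and it runs on synthetic data with one informative and one pure-noise feature (both assumed for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1500

# One feature drives the target, the other is unrelated noise (assumed setup)
X_toy = pd.DataFrame({
    'signal': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
p_toy = 1 / (1 + np.exp(-(2.0 * X_toy['signal'].to_numpy() - 1.0)))
y_toy = rng.binomial(1, p_toy)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

rf_toy = RandomForestClassifier(n_estimators=50, min_samples_leaf=5, random_state=42)
rf_toy.fit(X_tr, y_tr)

# Mean drop in test AUC when each feature is shuffled
perm = permutation_importance(
    rf_toy, X_te, y_te, scoring='roc_auc', n_repeats=5, random_state=42
)
perm_imp = pd.Series(perm.importances_mean, index=X_toy.columns).sort_values(ascending=False)
print(perm_imp)
```

Applied to this notebook, the call would take `rf`, `X_test`, and `y_test`; features that rank high on both measures are the more credible drivers.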

## 7. Risk and Profit Calculations
We use the Random Forest's predicted probabilities as PD estimates (check the AUC comparison above to confirm it is the stronger model on this data) and compute expected loss and expected profit per account.

In [None]:
# Choose Random Forest as the production model for this analysis
df_test = X_test.copy()
df_test['default_12m'] = y_test.values

df_test['pd_hat'] = y_pred_proba_rf

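# Financial assumptions (illustrative)
df_test['exposure'] = df_test['credit_limit']  # exposure at default = full credit limit
LGD_rate = 0.6        # loss given default, as a fraction of exposure
interest_rate = 0.18  # annual interest earned on exposure
df_test['LGD_amount'] = df_test['exposure'] * LGD_rate
df_test['annual_interest_revenue'] = df_test['exposure'] * interest_rate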

df_test['Expected_Loss'] = df_test['pd_hat'] * df_test['LGD_amount']
df_test['Expected_Profit'] = (1 - df_test['pd_hat']) * df_test['annual_interest_revenue'] - df_test['Expected_Loss']

df_test[['pd_hat', 'exposure', 'LGD_amount', 'annual_interest_revenue',
         'Expected_Loss', 'Expected_Profit']].head()
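
The profit formula above implies a break-even PD that depends only on the two rate assumptions. Per account, expected profit equals exposure × (interest rate − PD × (interest rate + LGD rate)), which is positive exactly when PD is below interest rate / (interest rate + LGD rate). The sketch below computes that threshold; the function names are hypothetical helpers, and the defaults mirror the 18% interest and 60% LGD assumed in the cell above.

```python
def break_even_pd(interest_rate, lgd_rate):
    """PD at which expected profit per unit of exposure is zero."""
    return interest_rate / (interest_rate + lgd_rate)

def expected_profit(pd_hat, exposure, interest_rate=0.18, lgd_rate=0.6):
    """Same formula as the notebook cell, written for scalars or arrays."""
    expected_loss = pd_hat * exposure * lgd_rate
    return (1 - pd_hat) * exposure * interest_rate - expected_loss

pd_star = break_even_pd(0.18, 0.6)
print('Break-even PD:', round(pd_star, 4))

# Accounts below the break-even PD add profit; accounts above it subtract
print(expected_profit(0.10, 5000.0) > 0)
print(expected_profit(0.30, 5000.0) < 0)
```

Under these assumptions, any cutoff sweep should find its profit maximum close to this break-even PD regardless of which model produced the scores; the model only determines how many accounts fall on each side.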

## 8. Cutoff Optimization
We simulate different approval strategies by varying the PD cutoff. Accounts with predicted PD at or below the cutoff are approved; all others are declined.

In [None]:
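# PD cutoffs from 1% to 30% in 1% steps (linspace plus rounding avoids float drift from arange)
cutoffs = np.round(np.linspace(0.01, 0.30, 30), 2)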
results = []

for c in cutoffs:
    approved = df_test[df_test['pd_hat'] <= c]
    if len(approved) == 0:
        continue
    total_profit = approved['Expected_Profit'].sum()
    avg_pd = approved['pd_hat'].mean()
    total_expected_loss = approved['Expected_Loss'].sum()
    results.append({
        'cutoff': c,
        'n_approved': len(approved),
        'total_profit': total_profit,
        'avg_pd': avg_pd,
        'total_expected_loss': total_expected_loss
    })

results_df = pd.DataFrame(results)
results_df.sort_values('total_profit', ascending=False).head()

In [None]:
# Plot cutoff vs total expected profit
plt.figure()
plt.plot(results_df['cutoff'], results_df['total_profit'])
plt.xlabel('PD Cutoff')
plt.ylabel('Total Expected Profit')
plt.title('Total Expected Profit vs PD Cutoff')
plt.show()

# Plot cutoff vs average PD of approved portfolio
plt.figure()
plt.plot(results_df['cutoff'], results_df['avg_pd'])
plt.xlabel('PD Cutoff')
plt.ylabel('Average PD of Approved Accounts')
plt.title('Average PD vs PD Cutoff')
plt.show()

## 9. Summary and Recommended Strategy
We select the PD cutoff that maximizes total expected profit on the test set and report the approved volume and risk at that cutoff.

In [None]:
# Identify cutoff with maximum profit
best_row = results_df.loc[results_df['total_profit'].idxmax()]
print('Best cutoff by total expected profit:')
print(best_row)

print('\nRecommended PD cutoff (approx):', round(best_row['cutoff'], 3))
print('Number of approved accounts:', int(best_row['n_approved']))
print('Average PD of approved accounts:', round(best_row['avg_pd'], 4))
print('Total expected loss for approved accounts:', round(best_row['total_expected_loss'], 2))
print('Total expected profit for approved accounts:', round(best_row['total_profit'], 2))

### Interpretation
- The **Random Forest** model is used for scoring on the expectation that it discriminates better than the baseline logistic regression; the AUC values printed above should confirm this before the choice is adopted.
- By sweeping the PD cutoff, we quantify the trade-off between **volume** (number of approved accounts), **risk** (expected loss), and **profit**.
- The recommended cutoff near the profit-maximizing point can be proposed as the **risk appetite threshold** for the portfolio.
- In a real-world setting, this cutoff would be further stress-tested and aligned with regulatory, capital, and business constraints.
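
One simple form of stress test is to repeat the cutoff sweep under different loss assumptions and see how far the recommended cutoff moves. The sketch below does this for three LGD scenarios. The PD and exposure arrays are synthetic stand-ins for `df_test['pd_hat']` and `df_test['exposure']`, and the scenario values and the helper `best_cutoff` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 5000

# Synthetic PDs and exposures (assumed distributions, not the notebook's data)
pd_hat = rng.beta(1.2, 10.0, size=n)
exposure = rng.uniform(1000, 20000, size=n)

def best_cutoff(pd_hat, exposure, lgd_rate, interest_rate=0.18):
    """Return the grid cutoff with the highest total expected profit."""
    profit = (1 - pd_hat) * exposure * interest_rate - pd_hat * exposure * lgd_rate
    cutoffs = np.round(np.linspace(0.01, 0.30, 30), 2)
    # Approve accounts with PD at or below each cutoff and sum their profit
    totals = [profit[pd_hat <= c].sum() for c in cutoffs]
    return float(cutoffs[int(np.argmax(totals))])

# Re-run the sweep under a mild, base, and severe LGD scenario
stress = pd.DataFrame({'lgd_rate': [0.4, 0.6, 0.8]})
stress['best_cutoff'] = [best_cutoff(pd_hat, exposure, lgd) for lgd in stress['lgd_rate']]
print(stress)
```

A higher LGD makes each default more costly, so the profit-maximizing cutoff should tighten as the scenario becomes more severe. A cutoff that shifts sharply between scenarios signals that the recommendation is sensitive to the loss assumption.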