# 📊 Exploratory Data Analysis: German Credit Dataset

**Project:** Credit Risk Prediction with Explainable AI  
**Milestone 2, Issue #7**  
**Dataset:** UCI Statlog (German Credit Data); 1,000 instances, 20 features  

This notebook explores the raw dataset to understand feature distributions, class balance, correlations, and key patterns before building ML models.

In [None]:
import sys
from pathlib import Path

# Add project root to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from src.preprocessing.data_loader import (
    load_german_credit,
    get_feature_target_split,
    NUMERICAL_COLUMNS,
    CATEGORICAL_COLUMNS,
    TARGET_LABELS,
)

# Style
sns.set_theme(style="whitegrid", palette="muted", font_scale=1.1)
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["figure.dpi"] = 100

print("✅ Imports loaded")

## 1. Load Dataset & Overview

In [None]:
df = load_german_credit()

print(f"Shape: {df.shape}")
print(f"Numerical features  ({len(NUMERICAL_COLUMNS)}): {NUMERICAL_COLUMNS}")
print(f"Categorical features ({len(CATEGORICAL_COLUMNS)}): {CATEGORICAL_COLUMNS}")
print(f"\nMissing values: {df.isnull().sum().sum()}")
df.head()

In [None]:
df.describe()

In [None]:
df.dtypes

## 2. Target Variable: Class Distribution

The dataset is **imbalanced**: 70% Good (0) vs. 30% Bad (1) credit risk. Model training will need to account for this, for example with SMOTE or class weights.
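
As a rough illustration of what balanced class weights mean for a 70/30 split, the sketch below applies the heuristic `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight='balanced'`. The label list is synthetic and the helper name `balanced_class_weights` is hypothetical; it is not part of this project's code.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Synthetic labels mirroring the 70% Good (0) / 30% Bad (1) split
labels = [0] * 700 + [1] * 300
weights = balanced_class_weights(labels)
print({cls: round(w, 3) for cls, w in sorted(weights.items())})
```

Under this heuristic the minority class (Bad) receives a weight of roughly 1.67 and the majority class roughly 0.71, so each class contributes equally to the total weighted loss.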

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Count plot
counts = df["credit_risk"].value_counts().sort_index()
colors = ["#2ecc71", "#e74c3c"]
axes[0].bar([TARGET_LABELS[i] for i in counts.index], counts.values, color=colors)
axes[0].set_title("Credit Risk Class Distribution")
axes[0].set_ylabel("Count")
for i, v in enumerate(counts.values):
    axes[0].text(i, v + 10, str(v), ha="center", fontweight="bold")

# Pie chart
axes[1].pie(counts.values, labels=[f"{TARGET_LABELS[i]}\n({v})" for i, v in zip(counts.index, counts.values)],
            colors=colors, autopct="%1.1f%%", startangle=90, textprops={"fontsize": 12})
axes[1].set_title("Class Balance")

plt.suptitle("Target Variable: credit_risk (0=Good, 1=Bad)", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()

## 3. Numerical Feature Distributions

Histograms of all 7 numerical features, split by credit risk class.

In [None]:
fig, axes = plt.subplots(2, 4, figsize=(18, 9))
axes = axes.flatten()

for i, col in enumerate(NUMERICAL_COLUMNS):
    ax = axes[i]
    for risk, color, label in [(0, "#2ecc71", "Good"), (1, "#e74c3c", "Bad")]:
        subset = df[df["credit_risk"] == risk][col]
        ax.hist(subset, bins=25, alpha=0.6, color=color, label=label, edgecolor="white")
    ax.set_title(col, fontweight="bold")
    ax.legend(fontsize=8)

# Hide unused subplot
axes[-1].set_visible(False)

plt.suptitle("Numerical Feature Distributions by Credit Risk", fontsize=15, fontweight="bold")
plt.tight_layout()
plt.show()

## 4. Box Plots: Key Numerical Features by Credit Risk

Compare distributions of the three most informative numerical features between Good and Bad credit risk.

In [None]:
key_features = ["credit_amount", "duration_months", "age"]
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for i, col in enumerate(key_features):
    sns.boxplot(data=df, x="credit_risk", y=col, ax=axes[i],
                palette={0: "#2ecc71", 1: "#e74c3c"}, width=0.5)
    axes[i].set_xticklabels(["Good (0)", "Bad (1)"])
    axes[i].set_title(col, fontweight="bold")
    axes[i].set_xlabel("")

plt.suptitle("Key Numerical Features by Credit Risk", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()

## 5. Correlation Heatmap: Numerical Features

In [None]:
corr_cols = NUMERICAL_COLUMNS + ["credit_risk"]
corr = df[corr_cols].corr()

fig, ax = plt.subplots(figsize=(10, 8))
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap="RdYlGn_r",
            center=0, square=True, linewidths=0.5, ax=ax,
            vmin=-1, vmax=1)
ax.set_title("Correlation Matrix (Numerical Features + Target)", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()

## 6. Categorical Feature Distributions by Credit Risk

Stacked proportional bar charts for all 13 categorical features, showing the default rate within each category.
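
To make the row normalization behind these charts concrete, here is a minimal plain-Python sketch of what a row-normalized cross-tabulation computes: within each category, the share of each target value. The helper `row_normalized_crosstab` and the toy data are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter, defaultdict


def row_normalized_crosstab(categories, targets):
    """Return {category: {target: share}}; each category's shares sum to 1."""
    counts = defaultdict(Counter)
    for cat, tgt in zip(categories, targets):
        counts[cat][tgt] += 1
    return {
        cat: {tgt: n / sum(c.values()) for tgt, n in c.items()}
        for cat, c in counts.items()
    }


# Hypothetical toy data: "A" has 1 bad loan out of 4, "B" has 1 out of 2
categories = ["A", "A", "A", "A", "B", "B"]
targets = [0, 0, 0, 1, 0, 1]
shares = row_normalized_crosstab(categories, targets)
print(shares["A"][1], shares["B"][1])  # 0.25 0.5
```

The share of target `1` within a category is that category's default rate, which is the red segment of each bar.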

In [None]:
fig, axes = plt.subplots(5, 3, figsize=(20, 28))
axes = axes.flatten()

for i, col in enumerate(CATEGORICAL_COLUMNS):
    ax = axes[i]
    ct = pd.crosstab(df[col], df["credit_risk"], normalize="index")
    ct.columns = ["Good", "Bad"]
    ct.sort_values("Bad", ascending=True).plot.barh(
        stacked=True, ax=ax, color=["#2ecc71", "#e74c3c"], edgecolor="white"
    )
    ax.set_title(col, fontweight="bold")
    ax.set_xlabel("Proportion")
    ax.set_ylabel("")
    ax.legend(loc="lower right", fontsize=8)

# Hide unused subplots
for j in range(len(CATEGORICAL_COLUMNS), len(axes)):
    axes[j].set_visible(False)

plt.suptitle("Categorical Features: Default Rate by Category", fontsize=16, fontweight="bold", y=1.01)
plt.tight_layout()
plt.show()

## 7. Top Categorical Predictors: Default Rate Analysis

Which categories show the highest deviation from the overall 30% default rate?
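
One caveat when ranking by deviation is that categories with very few samples can show extreme rates by chance. The sketch below ranks categories by absolute deviation after dropping small groups. The helper `rank_by_deviation`, the `min_count=30` threshold, and the rows are all hypothetical, chosen only for illustration; they are not values from the dataset.

```python
def rank_by_deviation(rows, overall_rate, min_count=30):
    """Sort category rows by absolute deviation from the overall rate.

    rows: dicts with keys "feature", "category", "default_rate", "count".
    Categories with fewer than min_count samples are dropped as too noisy.
    """
    ranked = [
        {**r, "deviation": r["default_rate"] - overall_rate}
        for r in rows
        if r["count"] >= min_count
    ]
    return sorted(ranked, key=lambda r: abs(r["deviation"]), reverse=True)


# Hypothetical rows, not values from the dataset
rows = [
    {"feature": "f1", "category": "a", "default_rate": 0.50, "count": 200},
    {"feature": "f1", "category": "b", "default_rate": 0.12, "count": 150},
    {"feature": "f2", "category": "c", "default_rate": 0.90, "count": 5},
    {"feature": "f2", "category": "d", "default_rate": 0.35, "count": 400},
]
ranked = rank_by_deviation(rows, overall_rate=0.30)
for r in ranked:
    print(r["feature"], r["category"], f"{r['deviation']:+.2f}", r["count"])
```

Sorting by absolute deviation surfaces both unusually risky and unusually safe categories in a single list; category `c` is excluded because its rate rests on only 5 samples.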

In [None]:
overall_rate = df["credit_risk"].mean()
print(f"Overall default rate: {overall_rate:.1%}\n")

rows = []
for col in CATEGORICAL_COLUMNS:
    for cat, grp in df.groupby(col)["credit_risk"]:
        rate = grp.mean()
        rows.append({"feature": col, "category": cat, "default_rate": rate, "count": len(grp)})

rates_df = pd.DataFrame(rows)
rates_df["deviation"] = rates_df["default_rate"] - overall_rate

# Show top 10 most risky categories
print("🔴 Top 10 categories with HIGHEST default rate:")
print(rates_df.nlargest(10, "default_rate")[["feature", "category", "default_rate", "count"]].to_string(index=False))

print("\n🟢 Top 10 categories with LOWEST default rate:")
print(rates_df.nsmallest(10, "default_rate")[["feature", "category", "default_rate", "count"]].to_string(index=False))

## 8. Summary & Key Findings

### Dataset Overview
- **1,000 samples**, 20 features (7 numerical, 13 categorical), no missing values
- **Class imbalance**: 70% Good / 30% Bad; needs SMOTE or `class_weight='balanced'`

### Key Observations

**Numerical features:**
- `duration_months` and `credit_amount` are the strongest numerical predictors: bad loans tend to have longer durations and higher amounts
- `age` shows a mild effect: younger applicants have slightly higher default rates
- `installment_rate` (% of disposable income) is higher for bad loans
- Low multicollinearity: pairwise linear correlations between the numerical features are mostly weak

**Categorical features (strongest predictors):**
- `checking_account`: "< 0 DM" has the highest default rate; "no checking account" has the lowest
- `credit_history`: "critical account" counterintuitively has a low default rate, possibly reflecting survivorship bias in how the sample was collected
- `purpose`: "education" and "car (new)" have higher default rates
- `savings_account`: "unknown / no savings" has higher default rates
- `foreign_worker`: almost all applicants are foreign workers (limited predictive value)

### Implications for Modeling
1. Use **all 20 features**: both numerical and categorical carry signal
2. Apply **class balancing** (SMOTE or balanced class weights)
3. Use **ColumnTransformer**: StandardScaler for numerical, OneHotEncoder for categorical
4. `checking_account`, `duration_months`, `credit_amount`, `credit_history` are likely the top predictors
5. Tree-based models (RF, XGBoost) should capture non-linear patterns in categorical variables well
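
The preprocessing and balancing steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration under stated assumptions, not the project's actual training code: the toy frame, its values, and the category labels are made up, only three of the 20 features are included, and `RandomForestClassifier` with `class_weight="balanced"` stands in for whichever model is eventually chosen.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame; column names mimic the dataset, values are made up
toy = pd.DataFrame({
    "duration_months": [6, 48, 12, 36, 24, 60, 9, 30],
    "credit_amount": [1200, 6000, 2100, 7900, 3400, 9000, 1500, 4200],
    "checking_account": ["none", "<0", "none", "<0", "0-200", "<0", "none", "0-200"],
    "credit_risk": [0, 1, 0, 1, 0, 1, 0, 1],
})
numerical = ["duration_months", "credit_amount"]
categorical = ["checking_account"]

# Scale numerical columns, one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numerical),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Balanced class weights are one of the two balancing options named above
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=0)),
])

X, y = toy[numerical + categorical], toy["credit_risk"]
model.fit(X, y)
predictions = model.predict(X)
print(predictions.shape)
```

Keeping the transformers inside the `Pipeline` means scaling and encoding are fit only on the training portion of each split during cross-validation, which avoids leaking information from held-out rows.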