# Credit Risk Classification - Exploratory Data Analysis

This notebook contains exploratory analysis of the German Credit Data.

## Contents:
1. Data loading and overview
2. Class distribution analysis
3. Feature distributions
4. Correlation analysis
5. Feature importance (quick check)

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Make the project's src/ directory importable
import sys
sys.path.append('../src')

from load_data import load_german_credit_data, preprocess_data

%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')

## 1. Load Data

In [None]:
# Load data
df = load_german_credit_data()
print(f"Dataset shape: {df.shape}")
df.head()
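
Since later steps assume complete data, it can help to check for gaps right after loading. The sketch below uses a small hypothetical stand-in frame (`demo`, with made-up values and assumed column names) so that it runs on its own; in this notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for df; column names and values are illustrative only
demo = pd.DataFrame({
    "duration": [12, 24, 36, 48],
    "credit_amount": [1000, 2500, 4000, 8000],
    "credit_risk": [1, 2, 1, 1],
})

# Count missing entries per column, then in total
missing_per_column = demo.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(f"Total missing values: {total_missing}")
```

Note that `isna()` only detects `NaN`/`None`; category codes that stand for "unknown" would not be counted by this check.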

## 2. Class Distribution Analysis

In [None]:
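# Check class distribution (1 = good credit, 2 = bad credit)
# sort_index() fixes the order to 1, 2 so the bar colors below match the classes
target_dist = df['credit_risk'].value_counts().sort_index()
print("Class Distribution:")
print(target_dist)
print(f"\nImbalance ratio (good:bad): {target_dist.loc[1] / target_dist.loc[2]:.2f}")

# Plot
plt.figure(figsize=(8, 5))
target_dist.plot(kind='bar', color=['green', 'red'])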
plt.title('Class Distribution (1=Good, 2=Bad)')
plt.xlabel('Credit Risk')
plt.ylabel('Count')
plt.show()

## 3. Feature Distributions

In [None]:
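# Numerical features (target excluded so only predictors are plotted)
numerical_cols = df.select_dtypes(include=[np.number]).columns
numerical_cols = numerical_cols.drop('credit_risk', errors='ignore').tolist()
print(f"Numerical features: {numerical_cols}")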

# Plot distributions
df[numerical_cols].hist(figsize=(15, 10), bins=30)
plt.tight_layout()
plt.show()

## 4. Correlation Analysis

In [None]:
# Preprocess data
X, y, feature_names = preprocess_data(df)

# Add target to X for correlation
X_with_target = X.copy()
X_with_target['target'] = y

# Correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(X_with_target.corr(), cmap='coolwarm', center=0, annot=False)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

## 5. Feature Importance (Quick Check)

In [None]:
# Rank features by absolute correlation with the target
correlations = (
    X_with_target.corr()['target']
    .drop('target')  # the target's self-correlation (1.0) is not informative
    .abs()
    .sort_values(ascending=False)
)
print("Top 10 features by absolute correlation with the target:")
print(correlations.head(10))

## Conclusions

Key findings from exploratory analysis:
1. **Class Imbalance:** ~30% of records are bad-risk (default) cases, a good-to-bad ratio of about 2.3:1
2. **Key Features:** Duration, credit amount, and age correlate with default
3. **No Missing Values:** Data is complete
4. **Next Steps:** Train models with stratified sampling and class weights
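
As a rough illustration of the next step, the sketch below draws a stratified hold-out sample and derives inverse-frequency class weights using pandas only. The frame `demo`, its 70/30 label split, the 20% test fraction, and the weight formula `n_samples / (n_classes * class_count)` are assumptions chosen for illustration, not the project's actual training setup.

```python
import pandas as pd

# Hypothetical stand-in: 70 good (1) and 30 bad (2) records, mirroring the ~30% bad rate
demo = pd.DataFrame({
    "feature": range(100),
    "credit_risk": [1] * 70 + [2] * 30,
})

# Stratified split: sample 20% within each class so both splits keep the class ratio
test = demo.groupby("credit_risk").sample(frac=0.2, random_state=42)
train = demo.drop(test.index)

# Inverse-frequency ("balanced") class weights computed on the training labels only
counts = train["credit_risk"].value_counts()
class_weights = (len(train) / (len(counts) * counts)).to_dict()

print(test["credit_risk"].value_counts().sort_index().to_dict())
print(train["credit_risk"].value_counts().sort_index().to_dict())
print(class_weights)
```

A mapping of this shape can be passed to scikit-learn estimators that accept a `class_weight` argument (for example `LogisticRegression`), so that errors on the minority class are penalized more heavily.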