# Credit Risk Assessment - Exploratory Data Analysis (EDA)

This notebook analyzes the German Credit Dataset with a focus on the cost-sensitive nature of credit risk assessment: it is **5 times more expensive to classify an unworthy customer as creditworthy** than to make the opposite error.

## Dataset Overview
The German Credit Dataset contains 1000 instances with 20 features for credit risk assessment. The target variable indicates:
- 1 = creditworthy
- 2 = not creditworthy

## Cost Matrix
| Actual \ Predicted | Creditworthy (1) | Not Creditworthy (2) |
|-------------------|------------------|----------------------|
| Creditworthy (1)  | 0                | 1                    |
| Not Creditworthy (2) | **5**         | 0                    |

This means misclassifying a bad customer as good costs 5 times more than misclassifying a good customer as bad.
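
As a minimal sketch of how this matrix turns a set of predictions into a total cost, the snippet below scores a handful of made-up labels. The function name `total_cost` and the toy arrays are illustrative assumptions and are not taken from the dataset.

```python
import numpy as np

# Cost matrix from the table above: rows = actual class, columns = predicted class,
# both in the order (creditworthy = 1, not creditworthy = 2).
COST_MATRIX = np.array([[0, 1],
                        [5, 0]])

def total_cost(y_true, y_pred, cost_matrix=COST_MATRIX):
    """Sum the misclassification cost over all samples (labels are 1 or 2)."""
    rows = np.asarray(y_true) - 1  # map labels 1/2 to row indices 0/1
    cols = np.asarray(y_pred) - 1  # map labels 1/2 to column indices 0/1
    return int(cost_matrix[rows, cols].sum())

# Hypothetical toy labels, for illustration only.
y_true = np.array([1, 1, 1, 2, 2])
y_pred = np.array([1, 2, 1, 1, 2])

print(total_cost(y_true, y_pred))           # one good->bad (1) + one bad->good (5) = 6
print(total_cost(y_true, np.ones(5, int)))  # always "good": 2 bad customers * 5 = 10
print(total_cost(y_true, np.full(5, 2)))    # always "bad": 3 good customers * 1 = 3
```

On these toy labels, always predicting "bad" is the cheaper of the two naive strategies even though it misclassifies more customers, which is the asymmetry the cost matrix encodes.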

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import chi2_contingency
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 10

## 1. Data Loading and Initial Exploration
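
The loading cell in this section reads the file and then swaps `'?'` for `NaN`. As a hedged alternative sketch, `pd.read_csv` can do that conversion itself through its `na_values` parameter, which also lets pandas infer numeric dtypes for columns that contain `'?'`. The three-row sample text and the reduced column list below are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row sample in a tab-separated, header-less layout.
sample = "A11\t6\t1169\nA12\t?\t5951\n?\t12\t2096\n"

toy = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["checking_account", "duration", "credit_amount"],
    na_values="?",  # treat '?' as missing while parsing
)

print(toy.dtypes)          # 'duration' is parsed as a float column, not as strings
print(toy.isnull().sum())  # one missing value each in the first two columns
```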

In [None]:
# Define feature names based on German Credit Dataset documentation
feature_names = [
    'checking_account',    # Status of existing checking account
    'duration',           # Duration in months
    'credit_history',     # Credit history
    'purpose',            # Purpose of credit
    'credit_amount',      # Credit amount
    'savings_account',    # Savings account/bonds
    'employment',         # Present employment since
    'installment_rate',   # Installment rate in percentage of disposable income
    'personal_status',    # Personal status and sex
    'other_debtors',      # Other debtors/guarantors
    'residence_since',    # Present residence since
    'property',           # Property
    'age',               # Age in years
    'other_installments', # Other installment plans
    'housing',           # Housing
    'existing_credits',   # Number of existing credits at this bank
    'job',               # Job
    'dependents',        # Number of people liable to provide maintenance for
    'telephone',         # Telephone
    'foreign_worker',    # Foreign worker
    'target'             # Target variable (1=good, 2=bad)
]

# Load the dataset
data = pd.read_csv('kredit.dat', sep='\t', header=None, names=feature_names)

# Replace '?' with NaN for proper missing value handling
data = data.replace('?', np.nan)

print(f"Dataset shape: {data.shape}")
print(f"\nFirst few rows:")
data.head()

In [None]:
# Basic dataset information
print("Dataset Info:")
data.info()

print("\n" + "="*50)
print("Missing Values Count:")
missing_counts = data.isnull().sum()
missing_percentage = (missing_counts / len(data)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_counts,
    'Percentage': missing_percentage
})
print(missing_df[missing_df['Missing Count'] > 0])

print("\n" + "="*50)
print("Basic Statistics:")
data.describe()

## 2. Target Variable Analysis

In [None]:
# Analyze target distribution
target_counts = data['target'].value_counts().sort_index()
target_percentages = data['target'].value_counts(normalize=True).sort_index() * 100

print("Target Variable Distribution:")
print(f"Creditworthy (1): {target_counts[1]} ({target_percentages[1]:.1f}%)")
print(f"Not Creditworthy (2): {target_counts[2]} ({target_percentages[2]:.1f}%)")

# Calculate class imbalance ratio
imbalance_ratio = target_counts[1] / target_counts[2]
print(f"\nClass Imbalance Ratio (Good:Bad): {imbalance_ratio:.2f}:1")

# Visualize target distribution with cost implications
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Target distribution
labels = ['Creditworthy\n(Class 1)', 'Not Creditworthy\n(Class 2)']
colors = ['lightgreen', 'lightcoral']
ax1.pie(target_counts.values, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
ax1.set_title('Target Variable Distribution')

# Cost implications visualization
cost_matrix = np.array([[0, 1], [5, 0]])
sns.heatmap(cost_matrix, annot=True, fmt='d', cmap='Reds', 
           xticklabels=['Pred: Good', 'Pred: Bad'], 
           yticklabels=['True: Good', 'True: Bad'], ax=ax2)
ax2.set_title('Cost Matrix\n(5x more expensive to misclassify bad as good)')

plt.tight_layout()
plt.show()

# Calculate expected cost for naive strategies
print("\n" + "="*50)
print("Expected Costs for Naive Strategies:")
print(f"Always predict 'Good': Cost = {target_counts[2] * 5} (misclassify {target_counts[2]} bad customers)")
print(f"Always predict 'Bad': Cost = {target_counts[1] * 1} (misclassify {target_counts[1]} good customers)")
print(f"Random prediction (70% 'Good', 30% 'Bad') would have expected cost around: {(target_counts[2] * 5 * 0.7 + target_counts[1] * 1 * 0.3):.0f}")

## 3. Feature Distribution Analysis

In [None]:
# Identify numerical and categorical features
numerical_features = ['duration', 'credit_amount', 'installment_rate', 'residence_since', 'age', 
                     'existing_credits', 'dependents']
categorical_features = [col for col in data.columns if col not in numerical_features and col != 'target']

print(f"Numerical features ({len(numerical_features)}): {numerical_features}")
print(f"Categorical features ({len(categorical_features)}): {categorical_features}")

# Convert numerical columns to numeric (in case they're stored as strings)
for col in numerical_features:
    data[col] = pd.to_numeric(data[col], errors='coerce')

In [None]:
# Visualize numerical feature distributions
fig, axes = plt.subplots(3, 3, figsize=(18, 15))
axes = axes.flatten()

for i, feature in enumerate(numerical_features):
    if i < len(axes):
        ax = axes[i]
        
        # Handle missing values for visualization
        feature_data = data[feature].dropna()
        
        # Histogram
        ax.hist(feature_data, bins=20, alpha=0.7, color='skyblue', edgecolor='black')
        ax.set_title(f'{feature.replace("_", " ").title()}\nMean: {feature_data.mean():.1f}, Std: {feature_data.std():.1f}')
        ax.set_xlabel(feature.replace('_', ' ').title())
        ax.set_ylabel('Frequency')
        
        # Add statistics
        ax.axvline(feature_data.mean(), color='red', linestyle='--', alpha=0.8, label=f'Mean: {feature_data.mean():.1f}')
        ax.axvline(feature_data.median(), color='orange', linestyle='--', alpha=0.8, label=f'Median: {feature_data.median():.1f}')
        ax.legend()

# Remove empty subplots
for i in range(len(numerical_features), len(axes)):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.suptitle('Distribution of Numerical Features', y=1.02, fontsize=16)
plt.show()

In [None]:
# Visualize categorical feature distributions
n_categorical = len(categorical_features)
n_cols = 3
n_rows = (n_categorical + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 4 * n_rows))
if n_rows == 1:
    axes = axes.reshape(1, -1)
axes = axes.flatten()

for i, feature in enumerate(categorical_features):
    if i < len(axes):
        ax = axes[i]
        
        # Count values including missing
        feature_counts = data[feature].fillna('Missing').value_counts()
        
        # Bar plot
        feature_counts.plot(kind='bar', ax=ax, color='lightblue', alpha=0.7)
        ax.set_title(f'{feature.replace("_", " ").title()}\nUnique Values: {data[feature].nunique()}')
        ax.set_xlabel('Categories')
        ax.set_ylabel('Frequency')
        ax.tick_params(axis='x', rotation=45)
        
        # Add value counts on bars
        for j, (category, count) in enumerate(feature_counts.items()):
            ax.text(j, count + max(feature_counts) * 0.01, str(count), 
                   ha='center', va='bottom', fontsize=8)

# Remove empty subplots
for i in range(len(categorical_features), len(axes)):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.suptitle('Distribution of Categorical Features', y=1.02, fontsize=16)
plt.show()

## 4. Relationship Analysis with Target

In [None]:
# Numerical features vs target - Box plots
fig, axes = plt.subplots(3, 3, figsize=(18, 15))
axes = axes.flatten()

for i, feature in enumerate(numerical_features):
    if i < len(axes):
        ax = axes[i]
        
        # Create box plot
        data_for_plot = data[[feature, 'target']].dropna()
        
        box_data = [data_for_plot[data_for_plot['target'] == 1][feature],
                   data_for_plot[data_for_plot['target'] == 2][feature]]
        
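        bp = ax.boxplot(box_data, patch_artist=True)
        # Set tick labels separately; the `labels` keyword of boxplot is deprecated in newer Matplotlib
        ax.set_xticklabels(['Creditworthy\n(1)', 'Not Creditworthy\n(2)'])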
        bp['boxes'][0].set_facecolor('lightgreen')
        bp['boxes'][1].set_facecolor('lightcoral')
        
        ax.set_title(f'{feature.replace("_", " ").title()} by Credit Risk')
        ax.set_ylabel(feature.replace('_', ' ').title())
        
        # Add statistical test
        good_data = data_for_plot[data_for_plot['target'] == 1][feature]
        bad_data = data_for_plot[data_for_plot['target'] == 2][feature]
        
        if len(good_data) > 0 and len(bad_data) > 0:
            stat, p_value = stats.mannwhitneyu(good_data, bad_data, alternative='two-sided')
            ax.text(0.02, 0.98, f'Mann-Whitney U p-value: {p_value:.4f}', 
                   transform=ax.transAxes, verticalalignment='top',
                   bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

# Remove empty subplots
for i in range(len(numerical_features), len(axes)):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.suptitle('Numerical Features vs Credit Risk', y=1.02, fontsize=16)
plt.show()

In [None]:
# Categorical features vs target - Contingency tables and chi-square tests
print("Categorical Features vs Target - Statistical Analysis")
print("="*60)

chi_square_results = []

for feature in categorical_features[:6]:  # Show first 6 to avoid too much output
    print(f"\n{feature.replace('_', ' ').title()}:")
    
    # Create contingency table
    contingency_table = pd.crosstab(data[feature].fillna('Missing'), data['target'], margins=True)
    print(contingency_table)
    
    # Chi-square test (exclude margins)
    contingency_no_margins = contingency_table.iloc[:-1, :-1]
    if contingency_no_margins.shape[0] > 1 and contingency_no_margins.shape[1] > 1:
        chi2, p_value, dof, expected = chi2_contingency(contingency_no_margins)
        print(f"Chi-square statistic: {chi2:.4f}, p-value: {p_value:.4f}")
        
        chi_square_results.append({
            'Feature': feature,
            'Chi2': chi2,
            'p_value': p_value,
            'Significant': p_value < 0.05
        })
    
    print("-" * 40)

# Summary of chi-square results
if chi_square_results:
    chi_df = pd.DataFrame(chi_square_results)
    print("\nSummary of Chi-square Test Results:")
    print(chi_df.sort_values('Chi2', ascending=False))

In [None]:
# Visualize top categorical features vs target
top_categorical = ['checking_account', 'credit_history', 'savings_account', 'employment', 'purpose', 'housing']

fig, axes = plt.subplots(2, 3, figsize=(20, 12))
axes = axes.flatten()

for i, feature in enumerate(top_categorical):
    if i < len(axes):
        ax = axes[i]
        
        # Create stacked bar chart
        cross_tab = pd.crosstab(data[feature].fillna('Missing'), data['target'], normalize='index') * 100
        
        cross_tab.plot(kind='bar', stacked=True, ax=ax, 
                      color=['lightgreen', 'lightcoral'],
                      legend=(i == 0))
        
        ax.set_title(f'{feature.replace("_", " ").title()} vs Credit Risk\n(Percentage within each category)')
        ax.set_xlabel('Categories')
        ax.set_ylabel('Percentage')
        ax.tick_params(axis='x', rotation=45)
        
        if i == 0:
            ax.legend(['Creditworthy (1)', 'Not Creditworthy (2)'], loc='upper right')

plt.tight_layout()
plt.suptitle('Categorical Features vs Credit Risk (Normalized)', y=1.02, fontsize=16)
plt.show()

## 5. Correlation and Association Analysis
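
Cramér's V, used later in this section, ranges from 0 (no association) to 1 (one variable fully determines the other). The sketch below shows those two extremes on made-up 2×2 tables. It assumes SciPy 1.7 or later, where `scipy.stats.contingency.association` is available. That function computes the uncorrected statistic by default, so on real data it can differ slightly from a bias-corrected version.

```python
import numpy as np
from scipy.stats.contingency import association

# Hypothetical contingency tables, for illustration only.
perfect = np.array([[10, 0],
                    [0, 10]])      # each row category maps to exactly one column
independent = np.array([[5, 5],
                        [5, 5]])   # the row category says nothing about the column

v_perfect = association(perfect, method="cramer")
v_independent = association(independent, method="cramer")

print(f"{v_perfect:.3f}")      # 1.000
print(f"{v_independent:.3f}")  # 0.000
```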

In [None]:
# Correlation matrix for numerical features
numerical_data = data[numerical_features + ['target']]
correlation_matrix = numerical_data.corr()

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, fmt='.3f', ax=ax)
ax.set_title('Correlation Matrix - Numerical Features and Target')
plt.tight_layout()
plt.show()

# Show correlations with target
print("Correlations with Target Variable:")
target_correlations = correlation_matrix['target'].drop('target').sort_values(key=abs, ascending=False)
print(target_correlations)

print(f"\nStrongest positive correlation with bad credit: {target_correlations.idxmax()} ({target_correlations.max():.3f})")
print(f"Strongest negative correlation with bad credit: {target_correlations.idxmin()} ({target_correlations.min():.3f})")

In [None]:
# Association analysis for key categorical features
# Calculate Cramér's V for categorical variables
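def cramers_v(contingency_table):
    """Calculate the bias-corrected Cramér's V for a 2D contingency table."""
    table = np.asarray(contingency_table)
    chi2 = stats.chi2_contingency(table)[0]
    n = table.sum()
    phi2 = chi2 / n
    r, k = table.shape
    # Bias correction (Bergsma, 2013): shrink phi² and the table dimensions
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    denom = min(kcorr - 1, rcorr - 1)
    # Guard against a zero denominator for degenerate tables
    return np.sqrt(phi2corr / denom) if denom > 0 else 0.0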

print("Association Strength (Cramér's V) between Categorical Features:")
print("="*60)

# Calculate associations between key categorical features
key_categorical = ['checking_account', 'credit_history', 'savings_account', 'employment']
association_matrix = np.zeros((len(key_categorical), len(key_categorical)))

for i, feature1 in enumerate(key_categorical):
    for j, feature2 in enumerate(key_categorical):
        if i != j:
            # Create contingency table
            contingency = pd.crosstab(data[feature1].fillna('Missing'), 
                                    data[feature2].fillna('Missing'))
            if contingency.shape[0] > 1 and contingency.shape[1] > 1:
                association_matrix[i, j] = cramers_v(contingency.values)
        else:
            association_matrix[i, j] = 1.0

# Plot association matrix
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(association_matrix, annot=True, cmap='viridis', 
           xticklabels=[f.replace('_', ' ').title() for f in key_categorical],
           yticklabels=[f.replace('_', ' ').title() for f in key_categorical],
           square=True, fmt='.3f', ax=ax)
ax.set_title('Association Matrix (Cramér\'s V)\nCategorical Features')
plt.tight_layout()
plt.show()

## 6. Cost-Sensitive Analysis
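
With the cost matrix from the overview, predicting "good" for a customer with P(good | x) = p has expected cost 5·(1 − p), while predicting "bad" has expected cost 1·p. Predicting "good" is therefore the cheaper choice only when 5·(1 − p) < p, that is, when p > 5/6 ≈ 0.833. A minimal sketch of that rule follows. The function names and example probabilities are illustrative assumptions, and the rule presumes reasonably well-calibrated probability estimates.

```python
import numpy as np

def good_threshold(cost_bad_as_good=5.0, cost_good_as_bad=1.0):
    """Minimum P(good | x) at which predicting 'good' has the lower expected cost."""
    return cost_bad_as_good / (cost_bad_as_good + cost_good_as_bad)

def decide(p_good, threshold):
    """Return label 1 (creditworthy) where p_good exceeds the threshold, else 2."""
    return np.where(np.asarray(p_good) > threshold, 1, 2)

threshold = good_threshold()
print(round(threshold, 3))                    # 0.833

# Hypothetical probability estimates, for illustration only.
print(decide([0.50, 0.80, 0.90], threshold))  # [2 2 1]
```

A customer estimated at 80% likely to be creditworthy is still rejected under this rule, because the expected loss from a default (0.2 × 5 = 1.0) exceeds the expected loss from turning away a good customer (0.8 × 1 = 0.8).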

In [None]:
# Analyze misclassification costs and their implications
def calculate_expected_cost(true_labels, predicted_labels, cost_matrix=np.array([[0, 1], [5, 0]])):
    """Calculate expected cost given true labels, predictions, and cost matrix."""
    # Convert to 0-1 indexing
    true_binary = (true_labels == 2).astype(int)
    pred_binary = (predicted_labels == 2).astype(int)
    
    total_cost = 0
    for true_val, pred_val in zip(true_binary, pred_binary):
        total_cost += cost_matrix[true_val, pred_val]
    
    return total_cost

# Calculate costs for different prediction strategies
n_samples = len(data)
n_good = sum(data['target'] == 1)
n_bad = sum(data['target'] == 2)

strategies = {
    'Always Predict Good': {
        'predictions': np.ones(n_samples),
        'description': 'Classify everyone as creditworthy'
    },
    'Always Predict Bad': {
        'predictions': np.ones(n_samples) * 2,
        'description': 'Classify everyone as not creditworthy'
    },
    'Random (70% Good)': {
        # Seeded generator so the random baseline is reproducible across runs
        'predictions': np.random.default_rng(42).choice([1, 2], size=n_samples, p=[0.7, 0.3]),
        'description': 'Random classification with 70% good'
    },
    'Perfect Classifier': {
        'predictions': data['target'].values,
        'description': 'Perfect predictions (baseline cost = 0)'
    }
}

print("Cost Analysis for Different Prediction Strategies:")
print("="*60)

strategy_results = []
for strategy_name, strategy_info in strategies.items():
    cost = calculate_expected_cost(data['target'].values, strategy_info['predictions'])
    cost_per_sample = cost / n_samples
    
    strategy_results.append({
        'Strategy': strategy_name,
        'Total Cost': cost,
        'Cost per Sample': cost_per_sample,
        'Description': strategy_info['description']
    })
    
    print(f"{strategy_name}:")
    print(f"  Total Cost: {cost}")
    print(f"  Cost per Sample: {cost_per_sample:.3f}")
    print(f"  Description: {strategy_info['description']}")
    print()

# Visualize strategy costs
strategy_df = pd.DataFrame(strategy_results)
strategy_df = strategy_df.sort_values('Total Cost')

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Total cost comparison
bars1 = ax1.bar(range(len(strategy_df)), strategy_df['Total Cost'], 
                color=['green', 'blue', 'orange', 'red'])
ax1.set_xlabel('Strategy')
ax1.set_ylabel('Total Cost')
ax1.set_title('Total Cost by Prediction Strategy')
ax1.set_xticks(range(len(strategy_df)))
ax1.set_xticklabels(strategy_df['Strategy'], rotation=45, ha='right')

# Add value labels on bars
for i, bar in enumerate(bars1):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 5,
             f'{int(height)}', ha='center', va='bottom')

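# Cost breakdown visualization (expected values for the random strategy)
# "Always Good" only produces Bad→Good errors; "Always Bad" only produces Good→Bad errors
cost_breakdown = pd.DataFrame({
    'False Positive (Good→Bad)': [0, n_good * 1, n_good * 0.3 * 1, 0],
    'False Negative (Bad→Good)': [n_bad * 5, 0, n_bad * 0.7 * 5, 0]
}, index=['Always Good', 'Always Bad', 'Random', 'Perfect'])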

cost_breakdown.plot(kind='bar', stacked=True, ax=ax2, color=['orange', 'red'])
ax2.set_title('Cost Breakdown by Error Type')
ax2.set_xlabel('Strategy')
ax2.set_ylabel('Cost')
ax2.tick_params(axis='x', rotation=45)
ax2.legend(title='Error Type')

plt.tight_layout()
plt.show()

In [None]:
# Impact of class imbalance on total cost
print("Impact of Class Imbalance on Cost-Sensitive Learning:")
print("="*60)

# Calculate the break-even point for the classification threshold:
# at what probability should we classify a customer as "good"?
# "Positive" here means "not creditworthy", matching the error labels in the cost breakdown.
cost_fp = 1  # Cost of a false positive (classify good as bad)
cost_fn = 5  # Cost of a false negative (classify bad as good)
prior_good = n_good / n_samples
prior_bad = n_bad / n_samples

# With p = P(good|x): predicting "good" costs cost_fn * (1 - p) in expectation,
# predicting "bad" costs cost_fp * p. "Good" is cheaper when p > cost_fn / (cost_fp + cost_fn).
break_even_threshold = cost_fn / (cost_fp + cost_fn)

print(f"Class distribution:")
print(f"  Good customers: {n_good} ({prior_good:.3f})")
print(f"  Bad customers: {n_bad} ({prior_bad:.3f})")
print(f"\nCost matrix:")
print(f"  Cost of FP (classify good as bad): {cost_fp}")
print(f"  Cost of FN (classify bad as good): {cost_fn}")
print(f"\nOptimal decision threshold:")
print(f"  Classify as 'good' if P(good|features) > {break_even_threshold:.3f}")
print(f"  This means we need {break_even_threshold*100:.1f}% confidence to classify as good")

# Sensitivity analysis: how does cost change with different thresholds?
# NOTE: no model has been trained yet, so the "probabilities" below are random draws
# that are unrelated to the features or the target. The curve only illustrates the
# mechanics of a threshold sweep; its minimum is not a meaningful estimate.
thresholds = np.linspace(0, 1, 21)
expected_costs = []

# Draw the illustrative P(good) scores once so every threshold sees the same values
np.random.seed(42)
probabilities = np.random.beta(2, 3, n_samples)  # Skewed towards lower probabilities

for threshold in thresholds:
    # Classify as good (1) when P(good) exceeds the threshold, otherwise bad (2)
    predictions = np.where(probabilities > threshold, 1, 2)
    
    cost = calculate_expected_cost(data['target'].values, predictions)
    expected_costs.append(cost)

# Plot threshold sensitivity
plt.figure(figsize=(12, 6))
plt.plot(thresholds, expected_costs, 'b-', linewidth=2, marker='o')
plt.axvline(x=break_even_threshold, color='red', linestyle='--', 
           label=f'Theoretical optimal: {break_even_threshold:.3f}')
plt.xlabel('Classification Threshold (Probability to classify as Good)')
plt.ylabel('Expected Total Cost')
plt.title('Cost Sensitivity Analysis: Impact of Classification Threshold')
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()

print(f"\nMinimum cost on the simulated scores occurs at threshold: {thresholds[np.argmin(expected_costs)]:.3f}")
print(f"Minimum cost on the simulated scores: {min(expected_costs)}")

## 7. Missing Value Analysis

In [None]:
# Detailed analysis of missing values
print("Detailed Missing Value Analysis:")
print("="*50)

# Features with missing values
features_with_missing = data.columns[data.isnull().any()].tolist()
if 'target' in features_with_missing:
    features_with_missing.remove('target')

print(f"Features with missing values: {features_with_missing}")

# Missing value statistics
missing_stats = []
for feature in features_with_missing:
    missing_count = data[feature].isnull().sum()
    missing_pct = (missing_count / len(data)) * 100
    missing_stats.append({
        'Feature': feature,
        'Missing Count': missing_count,
        'Missing Percentage': missing_pct
    })

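# Pass the column names explicitly so the sort below also works when no feature has missing values
missing_df = pd.DataFrame(missing_stats, columns=['Feature', 'Missing Count', 'Missing Percentage'])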
missing_df = missing_df.sort_values('Missing Percentage', ascending=False)
print("\nMissing Value Statistics:")
print(missing_df)

# Visualize missing value patterns
if len(features_with_missing) > 0:
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    axes = axes.flatten()
    
    # Missing value heatmap
    missing_matrix = data[features_with_missing].isnull()
    sns.heatmap(missing_matrix.T, cbar=True, cmap='viridis', ax=axes[0])
    axes[0].set_title('Missing Value Pattern\n(Yellow = Missing, Purple = Present)')
    axes[0].set_xlabel('Samples')
    axes[0].set_ylabel('Features')
    
    # Missing value counts
    missing_counts = missing_df.set_index('Feature')['Missing Count']
    missing_counts.plot(kind='bar', ax=axes[1], color='orange')
    axes[1].set_title('Missing Value Counts by Feature')
    axes[1].set_ylabel('Count')
    axes[1].tick_params(axis='x', rotation=45)
    
    # Missing value co-occurrence
    if len(features_with_missing) > 1:
        missing_corr = data[features_with_missing].isnull().corr()
        sns.heatmap(missing_corr, annot=True, cmap='coolwarm', center=0,
                   square=True, fmt='.3f', ax=axes[2])
        axes[2].set_title('Missing Value Correlation\n(1 = Always missing together)')
    
    # Missing value impact on target
    if len(features_with_missing) > 0:
        missing_target_impact = []
        for feature in features_with_missing:
            missing_mask = data[feature].isnull()
            if missing_mask.sum() > 0:
                good_rate_missing = (data[missing_mask]['target'] == 1).mean()
                good_rate_present = (data[~missing_mask]['target'] == 1).mean()
                impact = good_rate_missing - good_rate_present
                missing_target_impact.append({
                    'Feature': feature.replace('_', ' ').title(),
                    'Impact': impact
                })
        
        if missing_target_impact:
            impact_df = pd.DataFrame(missing_target_impact)
            bars = axes[3].bar(range(len(impact_df)), impact_df['Impact'],
                              color=['red' if x < 0 else 'green' for x in impact_df['Impact']])
            axes[3].set_xticks(range(len(impact_df)))
            axes[3].set_xticklabels(impact_df['Feature'], rotation=45, ha='right')
            axes[3].set_ylabel('Difference in Good Customer Rate')
            axes[3].set_title('Impact of Missing Values on Target\n(+ means missing values → more good customers)')
            axes[3].axhline(y=0, color='black', linestyle='-', alpha=0.5)
            
            # Add value labels
            for bar, value in zip(bars, impact_df['Impact']):
                height = bar.get_height()
                axes[3].text(bar.get_x() + bar.get_width()/2.,
                           height + (0.01 if height >= 0 else -0.01),
                           f'{value:.3f}', ha='center', 
                           va='bottom' if height >= 0 else 'top')
    
    plt.tight_layout()
    plt.show()

In [None]:
# Statistical tests for missing value impact
print("Statistical Tests for Missing Value Impact on Target:")
print("="*60)

for feature in features_with_missing:
    missing_mask = data[feature].isnull()
    if missing_mask.sum() > 0 and missing_mask.sum() < len(data):
        # Create contingency table
        missing_target_crosstab = pd.crosstab(
            data[feature].isnull(),
            data['target'],
            margins=True
        )
        
        print(f"\n{feature.replace('_', ' ').title()}:")
        print("Contingency Table (Missing vs Target):")
        print(missing_target_crosstab)
        
        # Chi-square test
        contingency_no_margins = missing_target_crosstab.iloc[:-1, :-1]
        if contingency_no_margins.shape == (2, 2):
            chi2, p_value, dof, expected = chi2_contingency(contingency_no_margins)
            print(f"Chi-square test: χ² = {chi2:.4f}, p-value = {p_value:.4f}")
            
            # Calculate effect size (Phi coefficient for 2x2 tables)
            n = contingency_no_margins.sum().sum()
            phi = np.sqrt(chi2 / n)
            print(f"Effect size (Phi): {phi:.4f}")
            
            # Interpretation
            if p_value < 0.05:
                print("*** Significant relationship between missing values and target ***")
            else:
                print("No significant relationship between missing values and target")
        
        print("-" * 40)

# Missing value pattern analysis
if len(features_with_missing) > 1:
    print("\n\nMissing Value Pattern Analysis:")
    print("="*40)
    
    # Count unique missing patterns
    missing_patterns = data[features_with_missing].isnull()
    pattern_counts = missing_patterns.value_counts()
    
    print(f"Number of unique missing patterns: {len(pattern_counts)}")
    print("\nTop 10 missing patterns:")
    for i, (pattern, count) in enumerate(pattern_counts.head(10).items()):
        pattern_str = ", ".join([f"{feat}: {'Missing' if val else 'Present'}" 
                                for feat, val in zip(features_with_missing, pattern)])
        print(f"{i+1:2d}. Count: {count:3d} | {pattern_str}")

## 8. Key Insights and Recommendations

In [None]:
# Summary of key insights
print("KEY INSIGHTS AND RECOMMENDATIONS FOR CREDIT RISK ASSESSMENT")
print("="*70)

print("\n1. COST-SENSITIVE NATURE:")
print(f"   • 5:1 cost ratio makes False Negatives (bad customers classified as good) extremely expensive")
print(f"   • Optimal threshold for 'good' classification: {break_even_threshold:.3f}")
print(f"   • Always predicting 'bad' costs {n_good * 1} vs always 'good' costs {n_bad * 5}")

print("\n2. CLASS IMBALANCE:")
print(f"   • {n_good} good customers ({prior_good:.1%}) vs {n_bad} bad customers ({prior_bad:.1%})")
print(f"   • Imbalance ratio: {imbalance_ratio:.2f}:1 (Good:Bad)")
print(f"   • Consider cost-sensitive learning algorithms or sampling techniques")

print("\n3. FEATURE IMPORTANCE (based on correlation and statistical tests):")
# Show top correlated features with target
if 'target_correlations' in locals():
    print("   Numerical features with strongest associations:")
    for feature, corr in target_correlations.head(3).items():
        direction = "increases" if corr > 0 else "decreases"
        print(f"   • {feature.replace('_', ' ').title()}: {corr:.3f} ({direction} bad credit risk)")

print("\n4. MISSING VALUES:")
if len(features_with_missing) > 0:
    print(f"   • {len(features_with_missing)} features have missing values")
    for feature in features_with_missing:
        missing_pct = (data[feature].isnull().sum() / len(data)) * 100
        print(f"   • {feature.replace('_', ' ').title()}: {missing_pct:.1f}% missing")
    print("   • Consider imputation strategies or missing indicator variables")
else:
    print("   • No missing values detected in the dataset")

print("\n5. PREPROCESSING RECOMMENDATIONS:")
print("   • Handle missing values with domain-appropriate imputation")
print("   • Encode categorical variables (one-hot or ordinal encoding)")
print("   • Consider feature scaling for numerical variables")
print("   • Create interaction terms for strongly associated features")

print("\n6. MODEL SELECTION RECOMMENDATIONS:")
print("   • Use cost-sensitive algorithms (e.g., cost-sensitive SVM, Random Forest)")
print("   • Consider ensemble methods with cost-sensitive learning")
print("   • Evaluate using cost-sensitive metrics, not just accuracy")
print("   • Use techniques like SMOTE with cost adjustment")
print("   • Use stratified cross-validation so every fold keeps the class ratio, and apply the same cost matrix in each fold")

print("\n7. EVALUATION STRATEGY:")
print("   • Primary metric: Total cost (not accuracy)")
print("   • Secondary metrics: Precision/Recall with cost weighting")
print("   • ROC curves with cost-sensitive thresholds")
print("   • Business impact assessment (expected profit/loss)")

print("\n" + "="*70)
print("ANALYSIS COMPLETE - Dataset ready for modeling phase")
print("="*70)
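
The preprocessing recommendations above (missing indicators, simple imputation, one-hot encoding) can be sketched on a tiny made-up frame as follows. The column values, the `_missing` suffix, and the choice of median imputation are illustrative assumptions rather than decisions taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one categorical and one numerical feature.
toy = pd.DataFrame({
    "purpose": ["car", np.nan, "furniture", "car"],
    "duration": [12.0, 24.0, np.nan, 36.0],
})

# 1) Record where values were missing before filling them in.
for col in ["purpose", "duration"]:
    toy[f"{col}_missing"] = toy[col].isnull().astype(int)

# 2) Impute: median for the numerical column, an explicit category for the categorical one.
toy["duration"] = toy["duration"].fillna(toy["duration"].median())
toy["purpose"] = toy["purpose"].fillna("Missing")

# 3) One-hot encode the categorical column.
encoded = pd.get_dummies(toy, columns=["purpose"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Keeping the indicator columns lets a later model learn from the fact that a value was missing, which matters if missingness turns out to be related to the target.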