# Week 5 Lab: Classification & Logistic Regression for Credit Risk Assessment

## Business Scenario

You've been hired as a senior data scientist at **SecureBank**, a regional financial institution that processes thousands of loan applications monthly. The bank is facing several challenges:

1. **Rising Default Rates**: Recent economic volatility has increased loan defaults
2. **Manual Review Bottleneck**: Current credit assessment takes 5-7 days per application
3. **Inconsistent Decisions**: Different loan officers make different decisions for similar profiles
4. **Regulatory Compliance**: Need to demonstrate fair and explainable lending practices

Your mission is to build an automated credit risk assessment system that can:
- **Predict loan approval/denial** with high accuracy
- **Handle class imbalance** (most applicants are approved)
- **Optimize business outcomes** by balancing approval rates with default risk
- **Provide explainable results** for regulatory compliance

## Learning Objectives
By completing this lab, you will:
- Understand binary classification vs. regression
- Work with imbalanced datasets using appropriate techniques
- Master logistic regression and interpret coefficients
- Evaluate models using classification metrics (precision, recall, F1, AUC)
- Optimize decision thresholds for business objectives
- Create and interpret confusion matrices
- Analyze ROC and Precision-Recall curves
- Calculate business costs and benefits of model decisions

## Part 1: Setup and Data Generation

First, let's import necessary libraries and generate synthetic loan application data that reflects real-world patterns.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_curve, precision_recall_curve,
    roc_auc_score, average_precision_score, accuracy_score, precision_score, 
    recall_score, f1_score, make_scorer
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure visualization settings
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)

print("Setup complete!")

### Generate Synthetic Credit Risk Data

We'll create realistic loan application data with the following features:
- **Demographics**: Age, income, employment length
- **Credit History**: Credit score, previous defaults, credit utilization
- **Loan Details**: Requested amount, loan purpose, debt-to-income ratio
- **Financial Profile**: Assets, existing debts, savings

The approval decision will be based on these factors with realistic business rules.
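
The generator below turns a weighted score into an approval probability with a logistic (sigmoid) curve centered at 0.4 with steepness 5. As a minimal sketch of that single step, the function below reproduces the mapping on a handful of hand-picked scores. The names `approval_probability`, `midpoint`, and `steepness` are illustrative and do not appear in the lab code.

```python
import numpy as np

def approval_probability(risk_score, midpoint=0.4, steepness=5.0):
    """Logistic curve: 0.5 at the midpoint, rising as the score increases.

    `midpoint` and `steepness` are hypothetical parameter names; the values
    0.4 and 5 mirror the constants used in the data generator.
    """
    return 1.0 / (1.0 + np.exp(-steepness * (risk_score - midpoint)))

# Hand-picked example scores (not drawn from the lab dataset)
scores = np.array([0.0, 0.2, 0.4, 0.6, 0.8])
probs = approval_probability(scores)

for s, p in zip(scores, probs):
    print(f"score={s:.1f} -> approval probability={p:.3f}")
```

A score exactly at the midpoint maps to 0.5, and scores equally far above and below it map to probabilities that sum to 1. A larger steepness would make the curve behave more like a hard cutoff.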

In [None]:
def generate_loan_applications(n_samples=5000):
    """
    Generate synthetic loan application data with realistic credit risk patterns.
    """
    np.random.seed(42)
    
    # Demographics
    age = np.random.normal(40, 12, n_samples)
    age = np.clip(age, 18, 75)
    
    # Income follows log-normal distribution (realistic for income)
    log_income = np.random.normal(10.5, 0.6, n_samples)  # log of income
    annual_income = np.exp(log_income)
    annual_income = np.clip(annual_income, 20000, 500000)
    
    employment_length = np.random.exponential(5, n_samples)
    employment_length = np.clip(employment_length, 0, 40)
    
    # Credit History
    credit_score = np.random.normal(680, 80, n_samples)
    credit_score = np.clip(credit_score, 300, 850)
    
    # Credit utilization (percentage of credit limit used)
    credit_utilization = np.random.beta(2, 3, n_samples) * 100
    
    # Previous defaults (binary)
    default_prob = 1 / (1 + np.exp((credit_score - 600) / 50))  # Sigmoid based on credit score
    previous_defaults = np.random.binomial(1, default_prob, n_samples)
    
    # Loan Details
    loan_amount = np.random.lognormal(10, 0.8, n_samples)
    loan_amount = np.clip(loan_amount, 1000, 100000)
    
    # Debt-to-income ratio
    existing_debt = np.random.exponential(annual_income * 0.3, n_samples)
    debt_to_income = (existing_debt + loan_amount * 0.1) / annual_income
    debt_to_income = np.clip(debt_to_income, 0, 2)
    
    # Loan purpose
    loan_purposes = ['home', 'auto', 'business', 'personal', 'education', 'medical']
    purpose_weights = [0.3, 0.25, 0.15, 0.15, 0.1, 0.05]
    loan_purpose = np.random.choice(loan_purposes, n_samples, p=purpose_weights)
    
    # Assets and savings
    liquid_assets = np.random.exponential(annual_income * 0.2, n_samples)
    liquid_assets = np.clip(liquid_assets, 0, annual_income * 2)
    
    # Calculate a creditworthiness score (higher = more likely to be approved, despite the name)
    risk_score = (
        0.4 * ((credit_score - 300) / 550)  # Credit score impact
        + 0.2 * np.minimum(annual_income / 100000, 1)  # Income impact (capped)
        + 0.15 * np.minimum(employment_length / 10, 1)  # Employment stability
        + 0.1 * (1 - debt_to_income / 2)  # Debt-to-income (lower is better)
        + 0.1 * np.minimum(liquid_assets / annual_income, 0.5) * 2  # Assets
        + 0.05 * (1 - credit_utilization / 100)  # Credit utilization (lower is better)
        - 0.3 * previous_defaults  # Previous defaults penalty
    )
    
    # Business-specific adjustments by loan purpose
    purpose_adjustments = {
        'home': 0.1, 'auto': 0.05, 'business': -0.05,
        'personal': -0.1, 'education': 0.0, 'medical': 0.0
    }
    
    # Vectorized lookup: map each purpose to its adjustment and add it to the score
    risk_score += pd.Series(loan_purpose).map(purpose_adjustments).to_numpy()
    
    # Add random noise
    risk_score += np.random.normal(0, 0.1, n_samples)
    
    # Convert to probability and make binary decision
    approval_probability = 1 / (1 + np.exp(-5 * (risk_score - 0.4)))  # Sigmoid
    
    # Sample the binary decision from the approval probability
    # (the resulting classes are imbalanced toward approvals, but not extremely so)
    approved = np.random.binomial(1, approval_probability, n_samples)
    
    # Create DataFrame
    data = pd.DataFrame({
        'age': age,
        'annual_income': annual_income,
        'employment_length': employment_length,
        'credit_score': credit_score,
        'credit_utilization': credit_utilization,
        'previous_defaults': previous_defaults,
        'loan_amount': loan_amount,
        'debt_to_income': debt_to_income,
        'loan_purpose': loan_purpose,
        'liquid_assets': liquid_assets,
        'approved': approved
    })
    
    return data

# Generate the dataset
df = generate_loan_applications(5000)
print(f"Dataset shape: {df.shape}")
print(f"\nClass distribution:")
print(df['approved'].value_counts(normalize=True))
print(f"\nFirst few rows:")
df.head()

## Part 2: Exploratory Data Analysis

Before building our classification model, let's understand the data and identify patterns that distinguish approved from denied applications.

### Exercise 2.1: Basic Statistics and Class Balance
**Task**: Analyze the dataset structure and examine class imbalance.
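
Why does class balance matter before any modeling? A quick sketch on made-up labels shows that a "model" which approves everyone can score high accuracy while catching no denials at all. The 80/20 split and the names `toy_labels` and `always_approve` are assumptions for illustration, not values from the lab dataset.

```python
import numpy as np

# Hypothetical labels: 1 = approved, 0 = denied (80 approvals, 20 denials)
toy_labels = np.array([1] * 80 + [0] * 20)

counts = np.bincount(toy_labels)           # counts[0] = denied, counts[1] = approved
imbalance_ratio = counts[1] / counts[0]    # approved : denied

# A trivial classifier that approves every application
always_approve = np.ones_like(toy_labels)
baseline_accuracy = (always_approve == toy_labels).mean()
denials_caught = int(((always_approve == 0) & (toy_labels == 0)).sum())

print(f"Imbalance ratio (approved:denied): {imbalance_ratio:.1f}:1")
print(f"Accuracy of approving everyone:    {baseline_accuracy:.1%}")
print(f"Risky applications caught:         {denials_caught}")
```

Any real model should be judged against this majority-class baseline, which is one reason the later parts rely on precision, recall, and AUC rather than accuracy alone.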

In [None]:
# TODO: Display basic statistics and check data quality
print("Dataset Summary Statistics:")
# YOUR CODE HERE: Display summary statistics for numeric columns
______

print("\nMissing Values Check:")
# YOUR CODE HERE: Check for missing values
______

print("\nClass Distribution:")
approval_counts = df['approved'].value_counts()
approval_pct = df['approved'].value_counts(normalize=True)
print(f"Denied (0): {approval_counts[0]:,} ({approval_pct[0]:.1%})")
print(f"Approved (1): {approval_counts[1]:,} ({approval_pct[1]:.1%})")

# Calculate imbalance ratio
imbalance_ratio = approval_counts[1] / approval_counts[0]
print(f"\nImbalance Ratio (Approved:Denied): {imbalance_ratio:.2f}:1")
if imbalance_ratio > 2 or imbalance_ratio < 0.5:
    print("⚠️  Class imbalance detected - will need special handling")
else:
    print("✅ Classes are reasonably balanced")

### Exercise 2.2: Visualize Feature Distributions by Class
**Task**: Compare feature distributions between approved and denied applications.

In [None]:
# Create visualizations comparing approved vs denied applications
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.flatten()

# Numeric features to plot
numeric_features = ['credit_score', 'annual_income', 'debt_to_income', 'credit_utilization',
                   'age', 'employment_length', 'loan_amount', 'liquid_assets', 'previous_defaults']

for i, feature in enumerate(numeric_features):
    # Create overlapping histograms
    denied = df[df['approved'] == 0][feature]
    approved = df[df['approved'] == 1][feature]
    
    axes[i].hist(denied, bins=30, alpha=0.7, label='Denied', color='red', density=True)
    axes[i].hist(approved, bins=30, alpha=0.7, label='Approved', color='green', density=True)
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Density')
    axes[i].set_title(f'Distribution of {feature}')
    axes[i].legend()

plt.tight_layout()
plt.show()

# TODO: Calculate and display mean differences for key features
print("Feature Comparison (Approved vs Denied):")
print("="*50)
for feature in ['credit_score', 'annual_income', 'debt_to_income', 'credit_utilization']:
    denied_mean = df[df['approved'] == 0][feature].mean()
    approved_mean = ______  # YOUR CODE HERE: Calculate mean for approved
    difference = approved_mean - denied_mean
    # Two decimals keep small-scale features such as debt_to_income from rounding to 0
    print(f"{feature:18s}: Denied={denied_mean:10.2f}, Approved={approved_mean:10.2f}, Diff={difference:+10.2f}")

### Exercise 2.3: Categorical Feature Analysis
**Task**: Analyze approval rates by loan purpose and other categorical variables.

In [None]:
# Analyze approval rates by loan purpose
purpose_analysis = df.groupby('loan_purpose')['approved'].agg(['count', 'mean', 'sum']).round(3)
purpose_analysis.columns = ['Applications', 'Approval_Rate', 'Approved_Count']
purpose_analysis = purpose_analysis.sort_values('Approval_Rate', ascending=False)

print("Approval Rates by Loan Purpose:")
print("="*40)
print(purpose_analysis)

# Visualize approval rates by purpose
plt.figure(figsize=(12, 6))
purpose_analysis['Approval_Rate'].plot(kind='bar', color='steelblue')
plt.title('Approval Rate by Loan Purpose')
plt.xlabel('Loan Purpose')
plt.ylabel('Approval Rate')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# TODO: Create a crosstab for previous defaults vs approval
print("\nApproval Rates by Previous Defaults:")
print("="*40)
defaults_crosstab = ______  # YOUR CODE HERE: Create crosstab with margins
print(defaults_crosstab)

## Part 3: Data Preprocessing

Before training our classification model, we need to prepare the data properly.

### Exercise 3.1: Handle Categorical Variables
**Task**: Encode categorical variables for machine learning.
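
As a small illustration of what one-hot encoding produces, the sketch below encodes a made-up `channel` column (hypothetical, not part of the lab data) with `pd.get_dummies`. It also shows the `drop_first=True` variant, which removes one reference category.

```python
import pandas as pd

# Hypothetical categorical column used only for this illustration
toy = pd.DataFrame({"channel": ["web", "branch", "phone", "web"]})

# One indicator column per category; exactly one is "on" in every row
dummies = pd.get_dummies(toy["channel"], prefix="channel")
print(dummies.astype(int))

# Dropping the first category makes it the reference level
dummies_reduced = pd.get_dummies(toy["channel"], prefix="channel", drop_first=True)
print(dummies_reduced.astype(int))
```

With all categories kept, the indicator columns always sum to 1, so they are collinear with the intercept. scikit-learn's `LogisticRegression` applies L2 regularization by default, so the model still fits, but each coefficient is then easier to read as a comparison against a dropped reference category.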

In [None]:
# Create a copy for preprocessing
df_processed = df.copy()

# TODO: One-hot encode loan_purpose
# YOUR CODE HERE: Use pd.get_dummies() to encode loan_purpose
purpose_encoded = ______

# Drop original categorical column and add encoded ones
df_processed = df_processed.drop('loan_purpose', axis=1)
df_processed = pd.concat([df_processed, purpose_encoded], axis=1)

print(f"Original features: {df.shape[1]}")
print(f"After encoding: {df_processed.shape[1]}")
print(f"\nNew encoded features:")
print(purpose_encoded.columns.tolist())

### Exercise 3.2: Create Train/Validation/Test Splits
**Task**: Split data while maintaining class proportions.
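
The two-step split works because 25% of the remaining 80% equals 20% of the full dataset. The sketch below checks that arithmetic and the effect of stratification on synthetic data. The arrays `toy_X` and `toy_y` and the 70/30 class mix are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(1000, 3))          # hypothetical features
toy_y = np.array([1] * 700 + [0] * 300)     # hypothetical labels, 70% positive

# Step 1: hold out 20% as the test set
X_rest, X_hold, y_rest, y_hold = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0, stratify=toy_y
)
# Step 2: 25% of the remaining 80% becomes validation (20% of the total)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

for name, labels in [("train", y_tr), ("validation", y_va), ("test", y_hold)]:
    print(f"{name:10s}: n={len(labels):4d}, positive rate={labels.mean():.3f}")
```

Without `stratify`, the positive rate in each split would drift by chance, which matters more as the minority class gets smaller.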

In [None]:
# Separate features and target
X = df_processed.drop('approved', axis=1)
y = df_processed['approved']

# TODO: Create stratified splits to maintain class balance
# First split: separate test set (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=______  # YOUR CODE HERE: stratify by target
)

# Second split: 25% of the remaining 80% becomes validation (60% train / 20% validation overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=______  # YOUR CODE HERE
)

# Display split information
print(f"Training set: {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Validation set: {X_val.shape[0]:,} samples ({X_val.shape[0]/len(X)*100:.1f}%)")
print(f"Test set: {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.1f}%)")

# Check class distributions are preserved
print("\nClass distribution across splits:")
print(f"Training:   {y_train.mean():.3f}")
print(f"Validation: {y_val.mean():.3f}")
print(f"Test:       {y_test.mean():.3f}")
print(f"Original:   {y.mean():.3f}")

### Exercise 3.3: Feature Scaling
**Task**: Scale features for logistic regression.
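
The scaler should learn its mean and standard deviation from the training data only, and then reuse those statistics on every other set. Fitting on validation or test data would leak information from them into training. A minimal sketch with made-up income values follows; `demo_scaler`, `train_income`, and `new_income` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical incomes: three training rows and one unseen applicant
train_income = np.array([[20000.0], [40000.0], [60000.0]])
new_income = np.array([[80000.0]])

demo_scaler = StandardScaler().fit(train_income)     # statistics come from training data only
train_scaled = demo_scaler.transform(train_income)
new_scaled = demo_scaler.transform(new_income)       # reuses the training mean and std

print("Training mean/std used:", demo_scaler.mean_[0], demo_scaler.scale_[0])
print("Scaled training values:", train_scaled.ravel())
print("Scaled unseen value:   ", new_scaled.ravel())
```

The unseen value lands well outside the scaled training range, which is expected: scaled validation and test features do not need to have mean 0 and standard deviation 1.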

In [None]:
# TODO: Apply feature scaling
scaler = StandardScaler()

# Fit on training data and transform all sets
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = ______  # YOUR CODE HERE: Transform validation set
X_test_scaled = ______  # YOUR CODE HERE: Transform test set

print("Feature scaling completed!")
print(f"\nExample scaling effect on 'annual_income':")
income_idx = list(X.columns).index('annual_income')
print(f"  Original: mean={X_train.iloc[:, income_idx].mean():.0f}, std={X_train.iloc[:, income_idx].std():.0f}")
print(f"  Scaled:   mean={X_train_scaled[:, income_idx].mean():.3f}, std={X_train_scaled[:, income_idx].std():.3f}")

## Part 4: Baseline Model - Logistic Regression

Let's start with a basic logistic regression model.

### Exercise 4.1: Train Basic Logistic Regression
**Task**: Train and evaluate a baseline logistic regression model.

In [None]:
# TODO: Train baseline logistic regression
baseline_model = LogisticRegression(random_state=42)

# YOUR CODE HERE: Fit the model on scaled training data
______

# Make predictions on all sets
y_train_pred = baseline_model.predict(X_train_scaled)
y_val_pred = ______  # YOUR CODE HERE
y_test_pred = ______  # YOUR CODE HERE

# Get prediction probabilities for ROC analysis
y_train_proba = baseline_model.predict_proba(X_train_scaled)[:, 1]
y_val_proba = ______  # YOUR CODE HERE: Get probabilities for validation set
y_test_proba = ______  # YOUR CODE HERE: Get probabilities for test set

print("Baseline logistic regression model trained!")

### Exercise 4.2: Evaluate with Classification Metrics
**Task**: Calculate accuracy, precision, recall, F1-score, and AUC for all sets.
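
Before calling the scikit-learn helpers, it can help to see how precision, recall, and F1 follow from raw counts. The sketch below computes them by hand on ten made-up predictions and compares the results with the library functions. The arrays `y_true_demo` and `y_pred_demo` are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels and predictions (1 = approved, 0 = denied)
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred_demo = np.array([1, 1, 0, 1, 1, 0, 0, 1, 0, 0])

tp = int(((y_pred_demo == 1) & (y_true_demo == 1)).sum())  # approved and truly approved
fp = int(((y_pred_demo == 1) & (y_true_demo == 0)).sum())  # approved but should be denied
fn = int(((y_pred_demo == 0) & (y_true_demo == 1)).sum())  # denied but should be approved

precision_manual = tp / (tp + fp)          # of predicted approvals, share that were correct
recall_manual = tp / (tp + fn)             # of actual approvals, share that were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TP={tp}, FP={fp}, FN={fn}")
print(f"precision={precision_manual:.3f}, recall={recall_manual:.3f}, F1={f1_manual:.3f}")
print("matches sklearn:",
      np.isclose(precision_manual, precision_score(y_true_demo, y_pred_demo)),
      np.isclose(recall_manual, recall_score(y_true_demo, y_pred_demo)),
      np.isclose(f1_manual, f1_score(y_true_demo, y_pred_demo)))
```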

In [None]:
def evaluate_classification_model(y_true, y_pred, y_proba, set_name):
    """
    Calculate and display comprehensive classification metrics.
    """
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = ______  # YOUR CODE HERE: Calculate recall
    f1 = f1_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_proba)
    
    print(f"{set_name} Set Metrics:")
    print(f"  Accuracy:  {accuracy:.4f}")
    print(f"  Precision: {precision:.4f} (of predicted approvals, what % were correct)")
    print(f"  Recall:    {recall:.4f} (of actual approvals, what % were caught)")
    print(f"  F1-Score:  {f1:.4f} (harmonic mean of precision and recall)")
    print(f"  AUC-ROC:   {auc:.4f} (area under ROC curve)")
    
    return accuracy, precision, recall, f1, auc

# Evaluate baseline model
print("BASELINE MODEL PERFORMANCE")
print("="*50)
train_metrics = evaluate_classification_model(y_train, y_train_pred, y_train_proba, "Training")
print()
val_metrics = evaluate_classification_model(______) # YOUR CODE HERE: Validation metrics
print()
test_metrics = evaluate_classification_model(______) # YOUR CODE HERE: Test metrics

# Check for overfitting
print("\n" + "="*50)
print("OVERFITTING ANALYSIS:")
train_acc, val_acc = train_metrics[0], val_metrics[0]
print(f"Training-Validation Accuracy Gap: {train_acc - val_acc:.4f}")
if abs(train_acc - val_acc) > 0.05:
    print("⚠️  Potential overfitting detected")
else:
    print("✅ Model appears well-generalized")

### Exercise 4.3: Confusion Matrix Analysis
**Task**: Create and interpret confusion matrices.
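
scikit-learn lays out the binary confusion matrix with actual classes as rows and predicted classes as columns, so `ravel()` returns the counts in the order TN, FP, FN, TP. The sketch below confirms that ordering on eight made-up predictions. The array names are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (0 = denied, 1 = approved)
actual_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1])
predicted_demo = np.array([0, 1, 0, 1, 1, 0, 1, 1])

cm_demo = confusion_matrix(actual_demo, predicted_demo)
# Row 0 = actual denied, row 1 = actual approved
# Column 0 = predicted denied, column 1 = predicted approved
tn_demo, fp_demo, fn_demo, tp_demo = cm_demo.ravel()

print(cm_demo)
print(f"TN={tn_demo}, FP={fp_demo}, FN={fn_demo}, TP={tp_demo}")
```

When the matrix is drawn as a heatmap, row 0 (actual denied) appears at the top, so true negatives sit in the top-left cell and true positives in the bottom-right.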

In [None]:
# TODO: Create confusion matrices
def plot_confusion_matrix(y_true, y_pred, set_name, ax=None):
    """
    Plot a formatted confusion matrix with business context.
    """
    cm = confusion_matrix(y_true, y_pred)
    
    if ax is None:
        plt.figure(figsize=(6, 5))
        ax = plt.gca()
    
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
                xticklabels=['Denied', 'Approved'],
                yticklabels=['Denied', 'Approved'])
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
    ax.set_title(f'Confusion Matrix - {set_name} Set')
    
    # Add business context labels in data coordinates: seaborn centers cell (row i, col j)
    # at (j + 0.5, i + 0.5) and draws row 0 (actual Denied) at the top.
    # The counts themselves are already annotated by annot=True.
    ax.text(0.5, 0.25, 'True Negatives\n(Correctly Denied)',
            ha='center', va='center', fontsize=9)
    ax.text(1.5, 0.25, 'False Positives\n(Incorrectly Approved)',
            ha='center', va='center', fontsize=9)
    ax.text(0.5, 1.25, 'False Negatives\n(Incorrectly Denied)',
            ha='center', va='center', fontsize=9)
    ax.text(1.5, 1.25, 'True Positives\n(Correctly Approved)',
            ha='center', va='center', fontsize=9)
    
    return cm

# Create confusion matrices for all sets
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

cm_train = plot_confusion_matrix(y_train, y_train_pred, "Training", axes[0])
cm_val = ______  # YOUR CODE HERE: Plot validation confusion matrix
cm_test = ______  # YOUR CODE HERE: Plot test confusion matrix

plt.tight_layout()
plt.show()

# Calculate business impact metrics
tn, fp, fn, tp = cm_test.ravel()
print("\nBUSINESS IMPACT ANALYSIS (Test Set):")
print("="*40)
print(f"✅ Correct Approvals: {tp:,} (will generate revenue)")
print(f"✅ Correct Denials: {tn:,} (avoided bad loans)")
print(f"❌ Missed Opportunities: {fn:,} (denied good applicants)")
print(f"💰 Potential Bad Loans: {fp:,} (approved risky applicants)")
print(f"\nError Rate: {(fp + fn) / len(y_test):.1%}")
print(f"Approval Rate: {(tp + fp) / len(y_test):.1%}")

## Part 5: Handle Class Imbalance

Let's address class imbalance using different techniques.

### Exercise 5.1: Class Weights
**Task**: Train logistic regression with balanced class weights.
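
The `'balanced'` option weights each class by `n_samples / (n_classes * class_count)`, so every class contributes the same total weight to the loss. The sketch below reproduces that formula by hand on made-up labels and checks it against `compute_class_weight`. The 80/20 label mix is an assumption for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 80 approvals (1) and 20 denials (0)
demo_labels = np.array([1] * 80 + [0] * 20)
demo_classes = np.unique(demo_labels)

# Manual formula: n_samples / (n_classes * count of each class)
manual_weights = len(demo_labels) / (len(demo_classes) * np.bincount(demo_labels))
sklearn_weights = compute_class_weight(
    class_weight="balanced", classes=demo_classes, y=demo_labels
)

print("Manual weights: ", dict(zip(demo_classes.tolist(), manual_weights.tolist())))
print("sklearn weights:", dict(zip(demo_classes.tolist(), sklearn_weights.tolist())))
print("Total weight per class:", (manual_weights * np.bincount(demo_labels)).tolist())
```

In this example the rarer class (denials) receives the larger weight, so a mistake on a denied application costs the model four times as much as a mistake on an approved one.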

In [None]:
# TODO: Calculate class weights
class_weights = compute_class_weight(
    'balanced',
    classes=______,  # YOUR CODE HERE: unique classes
    y=______         # YOUR CODE HERE: training target
)
class_weight_dict = dict(zip([0, 1], class_weights))

print(f"Class weights: {class_weight_dict}")
# The minority class receives the larger weight, whichever label it has
print(f"This gives {class_weights.max()/class_weights.min():.2f}x more weight to the minority class")

# Train balanced model
balanced_model = LogisticRegression(class_weight='balanced', random_state=42)
balanced_model.fit(______) # YOUR CODE HERE: Fit on scaled training data

# Make predictions
y_val_pred_balanced = balanced_model.predict(X_val_scaled)
y_val_proba_balanced = balanced_model.predict_proba(X_val_scaled)[:, 1]

print("\nBalanced model trained!")

### Exercise 5.2: Compare Model Performance
**Task**: Compare baseline vs balanced models.

In [None]:
# Compare models on validation set
print("MODEL COMPARISON ON VALIDATION SET")
print("="*50)

print("Baseline Model:")
evaluate_classification_model(y_val, y_val_pred, y_val_proba, "Validation")

print("\n" + "-"*30)
print("Balanced Model:")
evaluate_classification_model(______) # YOUR CODE HERE: Evaluate balanced model

# Compare confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

cm_baseline = confusion_matrix(y_val, y_val_pred)
cm_balanced = confusion_matrix(______) # YOUR CODE HERE: Confusion matrix for balanced model

sns.heatmap(cm_baseline, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['Denied', 'Approved'], yticklabels=['Denied', 'Approved'])
axes[0].set_title('Baseline Model')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')

sns.heatmap(cm_balanced, annot=True, fmt='d', cmap='Oranges', ax=axes[1],
            xticklabels=['Denied', 'Approved'], yticklabels=['Denied', 'Approved'])
axes[1].set_title('Balanced Model')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')

plt.tight_layout()
plt.show()

# Calculate recall improvement
baseline_recall = recall_score(y_val, y_val_pred)
balanced_recall = ______  # YOUR CODE HERE: Calculate balanced model recall
recall_improvement = balanced_recall - baseline_recall

print(f"\nRecall Improvement: {recall_improvement:+.3f} ({recall_improvement/baseline_recall:+.1%})")

## Part 6: ROC and Precision-Recall Curves

Let's create ROC and Precision-Recall curves to better understand model performance.

### Exercise 6.1: ROC Curve Analysis
**Task**: Plot and interpret ROC curves.
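
One way to read AUC: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes that pairwise fraction on six made-up scores and compares it with `roc_auc_score`. The arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities
y_demo = np.array([0, 0, 1, 1, 0, 1])
score_demo = np.array([0.1, 0.4, 0.35, 0.8, 0.5, 0.7])

pos_scores = score_demo[y_demo == 1]
neg_scores = score_demo[y_demo == 0]

# Compare every positive score with every negative score
diff = pos_scores[:, None] - neg_scores[None, :]
pairwise_auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()

print(f"Pairwise ranking estimate: {pairwise_auc:.4f}")
print(f"roc_auc_score:             {roc_auc_score(y_demo, score_demo):.4f}")
```

Because AUC depends only on how cases are ranked, it does not change when the decision threshold moves. That is why the threshold is tuned separately in Part 7.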

In [None]:
# TODO: Create ROC curves
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# ROC Curve
fpr_baseline, tpr_baseline, _ = roc_curve(y_val, y_val_proba)
fpr_balanced, tpr_balanced, _ = ______  # YOUR CODE HERE: ROC curve for balanced model

# Plot ROC curves
axes[0].plot(fpr_baseline, tpr_baseline, label=f'Baseline (AUC = {roc_auc_score(y_val, y_val_proba):.3f})', linewidth=2)
axes[0].plot(fpr_balanced, tpr_balanced, label=f'Balanced (AUC = {roc_auc_score(y_val, y_val_proba_balanced):.3f})', linewidth=2)
axes[0].plot([0, 1], [0, 1], 'k--', label='Random Classifier')
axes[0].set_xlabel('False Positive Rate (1 - Specificity)')
axes[0].set_ylabel('True Positive Rate (Sensitivity/Recall)')
axes[0].set_title('ROC Curve')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Precision-Recall Curve
precision_baseline, recall_baseline, _ = precision_recall_curve(y_val, y_val_proba)
precision_balanced, recall_balanced, _ = ______  # YOUR CODE HERE: PR curve for balanced model

# Plot PR curves
baseline_ap = average_precision_score(y_val, y_val_proba)
balanced_ap = ______  # YOUR CODE HERE: Average precision for balanced model

axes[1].plot(recall_baseline, precision_baseline, label=f'Baseline (AP = {baseline_ap:.3f})', linewidth=2)
axes[1].plot(recall_balanced, precision_balanced, label=f'Balanced (AP = {balanced_ap:.3f})', linewidth=2)
axes[1].axhline(y=y_val.mean(), color='k', linestyle='--', label=f'Random (AP = {y_val.mean():.3f})')
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("CURVE INTERPRETATION:")
print("="*30)
print("ROC Curve:")
print("  • Shows trade-off between sensitivity and specificity")
print("  • Area Under Curve (AUC): 0.5 is random guessing, 1.0 is perfect ranking (below 0.5 is worse than random)")
print("  • Good for balanced datasets")
print("\nPrecision-Recall Curve:")
print("  • Shows trade-off between precision and recall")
print("  • More informative for imbalanced datasets")
print("  • Average Precision (AP) summarizes the PR curve (a step-wise estimate of the area under it)")

## Part 7: Threshold Optimization

Let's optimize the decision threshold based on business objectives.

### Exercise 7.1: Business Cost Analysis
**Task**: Define business costs and find optimal threshold.
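
Given the cost figures used in this exercise, a break-even probability can be derived before running any search. Approving an applicant who is good with probability `p` has expected value `p * revenue - (1 - p) * loss`, while denying has expected value `-p * opportunity_cost`. Setting the two equal gives the threshold. The sketch below assumes the model's probabilities are well calibrated, which is not guaranteed (class weighting in particular shifts them), so treat the result as a sanity check rather than the answer. The names `expected_value` and `break_even` are illustrative.

```python
# Cost figures copied from this exercise
REVENUE_GOOD_LOAN = 15000      # profit from approving a good applicant
LOSS_BAD_LOAN = 50000          # loss from approving a bad applicant
OPPORTUNITY_COST = 5000        # cost of denying a good applicant

def expected_value(p_good, approve):
    """Expected value of one decision, assuming p_good is a calibrated probability."""
    if approve:
        return p_good * REVENUE_GOOD_LOAN - (1 - p_good) * LOSS_BAD_LOAN
    return -p_good * OPPORTUNITY_COST

# Approve when p * (revenue + opportunity cost) > (1 - p) * loss
break_even = LOSS_BAD_LOAN / (LOSS_BAD_LOAN + REVENUE_GOOD_LOAN + OPPORTUNITY_COST)
print(f"Break-even approval probability: {break_even:.3f}")

for p in (0.6, break_even, 0.8):
    print(f"p={p:.3f}: approve={expected_value(p, True):>10,.0f}  "
          f"deny={expected_value(p, False):>10,.0f}")
```

Because a bad loan costs far more than a missed good one, the break-even point sits well above the default 0.5 cutoff.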

In [None]:
# Define business costs
COST_FALSE_POSITIVE = 50000  # Average loss from a bad loan
COST_FALSE_NEGATIVE = 5000   # Opportunity cost of rejecting good applicant
REVENUE_TRUE_POSITIVE = 15000  # Average profit from good loan
COST_TRUE_NEGATIVE = 0       # No cost for correctly rejecting

def calculate_business_value(y_true, y_pred, verbose=True):
    """
    Calculate total business value based on confusion matrix.
    """
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    
    total_value = (
        tp * REVENUE_TRUE_POSITIVE +  # Revenue from good loans
        tn * COST_TRUE_NEGATIVE +     # Cost of correct rejections (0)
        fp * (-COST_FALSE_POSITIVE) + # Loss from bad loans
        fn * (-COST_FALSE_NEGATIVE)   # Opportunity cost
    )
    
    if verbose:
        print(f"Business Value Breakdown:")
        print(f"  Revenue (TP): {tp} × ${REVENUE_TRUE_POSITIVE:,} = ${tp * REVENUE_TRUE_POSITIVE:,}")
        print(f"  Losses (FP):  {fp} × ${COST_FALSE_POSITIVE:,} = ${fp * COST_FALSE_POSITIVE:,}")
        print(f"  Missed (FN):  {fn} × ${COST_FALSE_NEGATIVE:,} = ${fn * COST_FALSE_NEGATIVE:,}")
        print(f"  Total Value: ${total_value:,}")
    
    return total_value

# Calculate current business value
print("CURRENT BUSINESS VALUE ANALYSIS")
print("="*40)
print("Baseline Model:")
baseline_value = calculate_business_value(y_val, y_val_pred)

print("\nBalanced Model:")
balanced_value = ______  # YOUR CODE HERE: Calculate balanced model value

print(f"\nValue Difference: ${balanced_value - baseline_value:,}")

### Exercise 7.2: Optimize Decision Threshold
**Task**: Find the threshold that maximizes business value.
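
Moving the threshold trades recall for precision: a higher cutoff approves fewer applicants, but a larger share of those approvals are correct. The sketch below applies three cutoffs to eight made-up probabilities. The arrays and the `sweep_results` dictionary are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted approval probabilities
y_sweep = np.array([0, 0, 1, 0, 1, 1, 1, 1])
proba_sweep = np.array([0.2, 0.45, 0.5, 0.65, 0.7, 0.8, 0.9, 0.95])

sweep_results = {}
for cutoff in (0.3, 0.6, 0.85):
    pred = (proba_sweep >= cutoff).astype(int)   # approve when probability clears the cutoff
    sweep_results[cutoff] = (
        precision_score(y_sweep, pred, zero_division=0),
        recall_score(y_sweep, pred, zero_division=0),
    )
    p, r = sweep_results[cutoff]
    print(f"cutoff={cutoff:.2f}: approved={pred.sum()}, precision={p:.3f}, recall={r:.3f}")
```

Passing `zero_division=0` keeps `precision_score` well defined if a very high cutoff leaves no predicted approvals at all.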

In [None]:
# TODO: Test different thresholds to maximize business value
thresholds = np.arange(0.1, 0.9, 0.05)
business_values = []
precisions = []
recalls = []
f1_scores = []

for threshold in thresholds:
    # Make predictions with custom threshold
    y_pred_thresh = (y_val_proba_balanced >= threshold).astype(int)
    
    # Calculate metrics
    business_value = calculate_business_value(y_val, y_pred_thresh, verbose=False)
    precision = precision_score(y_val, y_pred_thresh)
    recall = ______  # YOUR CODE HERE: Calculate recall
    f1 = f1_score(y_val, y_pred_thresh)
    
    business_values.append(business_value)
    precisions.append(precision)
    recalls.append(recall)
    f1_scores.append(f1)

# Find optimal threshold
optimal_idx = np.argmax(business_values)
optimal_threshold = thresholds[optimal_idx]
max_business_value = business_values[optimal_idx]

print(f"OPTIMAL THRESHOLD ANALYSIS")
print("="*40)
print(f"Optimal Threshold: {optimal_threshold:.3f}")
print(f"Maximum Business Value: ${max_business_value:,}")
print(f"Precision at optimal: {precisions[optimal_idx]:.3f}")
print(f"Recall at optimal: {recalls[optimal_idx]:.3f}")
print(f"F1-Score at optimal: {f1_scores[optimal_idx]:.3f}")

# Visualize threshold optimization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Business value vs threshold
axes[0, 0].plot(thresholds, business_values, 'b-', linewidth=2)
axes[0, 0].axvline(optimal_threshold, color='r', linestyle='--', label=f'Optimal: {optimal_threshold:.3f}')
axes[0, 0].set_xlabel('Threshold')
axes[0, 0].set_ylabel('Business Value ($)')
axes[0, 0].set_title('Business Value vs Threshold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Precision vs threshold
axes[0, 1].plot(thresholds, precisions, 'g-', linewidth=2)
axes[0, 1].axvline(optimal_threshold, color='r', linestyle='--')
axes[0, 1].set_xlabel('Threshold')
axes[0, 1].set_ylabel('Precision')
axes[0, 1].set_title('Precision vs Threshold')
axes[0, 1].grid(True, alpha=0.3)

# Recall vs threshold
axes[1, 0].plot(thresholds, recalls, 'orange', linewidth=2)
axes[1, 0].axvline(optimal_threshold, color='r', linestyle='--')
axes[1, 0].set_xlabel('Threshold')
axes[1, 0].set_ylabel('Recall')
axes[1, 0].set_title('Recall vs Threshold')
axes[1, 0].grid(True, alpha=0.3)

# F1-Score vs threshold
axes[1, 1].plot(thresholds, f1_scores, 'purple', linewidth=2)
axes[1, 1].axvline(optimal_threshold, color='r', linestyle='--')
axes[1, 1].set_xlabel('Threshold')
axes[1, 1].set_ylabel('F1-Score')
axes[1, 1].set_title('F1-Score vs Threshold')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Part 8: Feature Importance and Model Interpretation

Let's understand which features are most important for credit decisions.

### Exercise 8.1: Logistic Regression Coefficients
**Task**: Interpret model coefficients for business insights.
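
Because the model is trained on standardized features, each coefficient describes a one-standard-deviation change, not a one-unit change. To express an effect in raw units, divide the coefficient by the feature's training standard deviation (available from the fitted scaler's `scale_` attribute). The numbers below are made up for illustration and are not results from this lab.

```python
import numpy as np

coef_per_sd = 0.9      # hypothetical coefficient on a standardized feature
feature_sd = 80.0      # hypothetical training standard deviation in raw units (e.g. score points)

odds_ratio_per_sd = np.exp(coef_per_sd)          # odds multiplier for +1 standard deviation
coef_per_unit = coef_per_sd / feature_sd         # log-odds change for +1 raw unit
odds_ratio_per_unit = np.exp(coef_per_unit)      # odds multiplier for +1 raw unit
odds_ratio_per_20 = np.exp(20 * coef_per_unit)   # odds multiplier for +20 raw units

print(f"Odds ratio per standard deviation: {odds_ratio_per_sd:.3f}")
print(f"Odds ratio per raw unit:           {odds_ratio_per_unit:.4f}")
print(f"Odds ratio per 20 raw units:       {odds_ratio_per_20:.3f}")
```

Odds ratios multiply, so the per-unit ratio raised to the power of the standard deviation recovers the per-standard-deviation ratio.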

In [None]:
# Get feature importance from logistic regression coefficients
feature_names = X.columns
coefficients = balanced_model.coef_[0]

# Create feature importance dataframe
feature_importance = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefficients,
    'abs_coefficient': np.abs(coefficients),
    'odds_ratio': np.exp(coefficients)  # Odds ratio interpretation
}).sort_values('abs_coefficient', ascending=False)

print("TOP 15 MOST IMPORTANT FEATURES")
print("="*60)
print(f"{'Feature':<25} {'Coefficient':<12} {'Odds Ratio':<12} {'Impact'}")
print("-" * 60)

for idx, row in feature_importance.head(15).iterrows():
    if row['coefficient'] > 0:
        impact = "↑ Increases approval odds"
    else:
        impact = "↓ Decreases approval odds"
    
    print(f"{row['feature']:<25} {row['coefficient']:>10.3f}   {row['odds_ratio']:>10.3f}   {impact}")

# TODO: Visualize top 10 features
plt.figure(figsize=(10, 6))
top_10 = feature_importance.head(10)
colors = ['green' if coef > 0 else 'red' for coef in top_10['coefficient']]

plt.barh(range(len(top_10)), top_10['coefficient'], color=colors, alpha=0.7)
plt.yticks(range(len(top_10)), top_10['feature'])
plt.xlabel('Logistic Regression Coefficient')
plt.title('Top 10 Feature Importance (Logistic Regression)')
plt.axvline(x=0, color='black', linestyle='-', alpha=0.3)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nCOEFFICIENT INTERPRETATION:")
print("• Features were standardized, so each coefficient is the effect of a one-standard-deviation increase")
print("• Positive coefficient: Increases log-odds of approval")
print("• Negative coefficient: Decreases log-odds of approval")
print("• Odds Ratio > 1: Feature increases odds of approval")
print("• Odds Ratio < 1: Feature decreases odds of approval")

### Exercise 8.2: Feature Importance with Random Forest
**Task**: Compare with Random Forest feature importance for validation.

In [None]:
# TODO: Train Random Forest for feature importance comparison
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(______) # YOUR CODE HERE: Fit on scaled training data

# Get Random Forest feature importance
rf_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

# Compare top features from both models
print("FEATURE IMPORTANCE COMPARISON")
print("="*50)
print(f"{'Feature':<25} {'Logistic':<10} {'RandomForest':<12}")
print("-" * 50)

# Merge the dataframes for comparison
comparison = feature_importance[['feature', 'abs_coefficient']].merge(
    rf_importance, on='feature', how='outer'
).fillna(0)

# Normalize for comparison
comparison['logistic_norm'] = comparison['abs_coefficient'] / comparison['abs_coefficient'].max()
comparison['rf_norm'] = comparison['importance'] / comparison['importance'].max()

# Show top 10 by average importance
comparison['avg_importance'] = (comparison['logistic_norm'] + comparison['rf_norm']) / 2
top_features = comparison.sort_values('avg_importance', ascending=False).head(10)

for idx, row in top_features.iterrows():
    print(f"{row['feature']:<25} {row['logistic_norm']:>8.3f}   {row['rf_norm']:>10.3f}")

# Visualize comparison
plt.figure(figsize=(12, 6))
x = range(len(top_features))
width = 0.35

plt.bar([i - width/2 for i in x], top_features['logistic_norm'], width, 
        label='Logistic Regression', alpha=0.7)
plt.bar([i + width/2 for i in x], top_features['rf_norm'], width, 
        label='Random Forest', alpha=0.7)

plt.xlabel('Features')
plt.ylabel('Normalized Importance')
plt.title('Feature Importance Comparison')
plt.xticks(x, top_features['feature'], rotation=45, ha='right')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Part 9: Final Model Evaluation

Let's evaluate our optimized model on the test set.

### Exercise 9.1: Test Set Performance
**Task**: Apply optimal threshold to test set and calculate final metrics.

In [None]:
# Apply optimal model to test set
y_test_proba_final = balanced_model.predict_proba(X_test_scaled)[:, 1]
y_test_pred_final = (y_test_proba_final >= optimal_threshold).astype(int)

# Calculate final performance metrics
print("FINAL MODEL PERFORMANCE ON TEST SET")
print("="*50)
final_metrics = evaluate_classification_model(
    y_test, y_test_pred_final, y_test_proba_final, "Test"
)

# Calculate final business value
print("\nFINAL BUSINESS VALUE ANALYSIS")
print("="*40)
final_business_value = calculate_business_value(y_test, y_test_pred_final)

# Compare with baseline performance on test set
y_test_pred_baseline = baseline_model.predict(X_test_scaled)
baseline_test_value = calculate_business_value(y_test, y_test_pred_baseline, verbose=False)

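value_improvement = final_business_value - baseline_test_value
print(f"\nValue Improvement over Baseline: ${value_improvement:,}")
# Guard against a zero baseline, which would make the percentage undefined
if baseline_test_value != 0:
    print(f"Percentage Improvement: {value_improvement / abs(baseline_test_value) * 100:+.1f}%")
else:
    print("Percentage Improvement: n/a (baseline value is zero)")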

# Create final confusion matrix
plt.figure(figsize=(8, 6))
cm_final = confusion_matrix(y_test, y_test_pred_final)
sns.heatmap(cm_final, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Denied', 'Approved'],
            yticklabels=['Denied', 'Approved'])
plt.title('Final Model - Test Set Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

# Add business context. The captions use axes coordinates (transAxes), so
# x=0.25 and x=0.75 centre them under the left and right halves of the heatmap.
tn, fp, fn, tp = cm_final.ravel()
plt.text(0.25, -0.15, f'Correctly Denied: {tn}\nIncorrectly Approved: {fp}', 
         ha='center', va='top', transform=plt.gca().transAxes, fontsize=10)
plt.text(0.75, -0.15, f'Incorrectly Denied: {fn}\nCorrectly Approved: {tp}', 
         ha='center', va='top', transform=plt.gca().transAxes, fontsize=10)

plt.tight_layout()
plt.show()
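
The four cells unpacked by `cm_final.ravel()` map directly onto the headline metrics. As a minimal sketch, with made-up counts that are illustrative only and not results from this lab, and treating "Approved" as the positive class:

```python
# Illustrative counts only (hypothetical, not lab output).
# Order matches confusion_matrix(...).ravel() with labels 0=Denied, 1=Approved.
tn, fp, fn, tp = 120, 30, 50, 800

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)     # of predicted approvals, share that were truly good
recall = tp / (tp + fn)        # of truly good applications, share we approved
specificity = tn / (tn + fp)   # of truly bad applications, share we denied
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:    {accuracy:.3f}")
print(f"Precision:   {precision:.3f}")
print(f"Recall:      {recall:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"F1-score:    {f1:.3f}")
```

Reading the cells this way makes the business trade-off explicit: `fp` counts bad loans that slipped through, while `fn` counts good customers who were turned away.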

## Part 10: Business Recommendations and Deployment Considerations

### Exercise 10.1: Generate Executive Summary
**Task**: Create a comprehensive business report.

In [None]:
# Generate comprehensive business report
print("\n" + "="*80)
print("SECUREBANK CREDIT RISK ASSESSMENT - EXECUTIVE SUMMARY")
print("="*80)

# Model performance summary
# (unpacked as roc_auc so the name cannot shadow sklearn.metrics.auc if it was imported)
accuracy, precision, recall, f1, roc_auc = final_metrics
print("\n📊 MODEL PERFORMANCE METRICS")
print("-" * 40)
print(f"• Accuracy: {accuracy:.1%} (correctly classified applications)")
print(f"• Precision: {precision:.1%} (of approved loans, % that are good)")
print(f"• Recall: {recall:.1%} (of good applications, % we approved)")
print(f"• AUC-ROC: {roc_auc:.3f} (model discrimination ability)")
print(f"• Optimal Threshold: {optimal_threshold:.3f} (custom business threshold)")

# Business impact
print(f"\n💰 BUSINESS IMPACT")
print("-" * 40)
print(f"• Expected Value: ${final_business_value:,} (on test set)")
print(f"• Improvement: ${final_business_value - baseline_test_value:,} over baseline")
# tn = bad applications correctly denied; fp are the bad loans the model still approved
print(f"• Risk Reduction: {tn} potentially bad loans identified and denied")
print(f"• Revenue Protection: ${tn * COST_FALSE_POSITIVE:,} in potential losses avoided")

# Key insights
print(f"\n🔍 KEY RISK FACTORS")
print("-" * 40)
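# Sort explicitly so the top 3 do not depend on an earlier cell's ordering
top_3_features = feature_importance.sort_values('abs_coefficient', ascending=False).head(3)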
for idx, row in top_3_features.iterrows():
    direction = "increases" if row['coefficient'] > 0 else "decreases"
    print(f"• {row['feature']}: {direction} approval probability")

# Operational benefits
print("\n⚡ OPERATIONAL BENEFITS")
print("-" * 40)
print("• Automated Decision Making: Reduce review time from 5-7 days to minutes")
print("• Consistent Criteria: Eliminate subjective decision variations")
print("• Scalability: Handle increased application volume without proportional staff increase")
print("• Compliance: Explainable AI features support regulatory requirements")

# Implementation recommendations
print("\n🚀 IMPLEMENTATION RECOMMENDATIONS")
print("-" * 40)
print("1. PHASE 1 - PILOT (Month 1-2)")
print("   • Deploy model for 20% of applications")
print("   • Human oversight for all model decisions")
print("   • Monitor performance vs human decisions")
print()
print("2. PHASE 2 - GRADUAL ROLLOUT (Month 3-4)")
print("   • Increase to 50% of applications")
print("   • Auto-approve obvious accepts (high confidence)")
print("   • Human review for borderline cases")
print()
print("3. PHASE 3 - FULL DEPLOYMENT (Month 5+)")
print("   • 100% automated pre-screening")
print("   • Human review only for edge cases")
print("   • Continuous model monitoring and retraining")

# Monitoring and maintenance
print("\n📈 MONITORING & MAINTENANCE")
print("-" * 40)
print("• Model Performance: Track precision/recall monthly")
print("• Data Drift: Monitor feature distributions for changes")
print("• Business Metrics: Actual default rates vs predictions")
print("• Retraining: Quarterly model updates with new data")
print("• A/B Testing: Continuous threshold optimization")

# Risk mitigation
print("\n⚠️  RISK MITIGATION")
print("-" * 40)
print("• Bias Monitoring: Regular fairness audits across demographic groups")
print("• Explainability: Provide reasons for each decision")
print("• Human Override: Maintain ability to override model decisions")
print("• Backup Systems: Fallback to manual review if model fails")
print("• Regulatory Compliance: Document all model changes and performance")

print(f"\n" + "="*80)
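
The phased rollout in the report depends on routing each application by how confident the model is. A minimal sketch of that routing follows. The function name `route_application` and both cutoffs are hypothetical placeholders rather than values derived in this lab, and the automatic-denial band is an assumption that goes beyond what Phase 2 describes.

```python
from collections import Counter

def route_application(p_approve, auto_approve_at=0.90, auto_deny_below=0.20):
    """Route one application by its predicted approval probability.

    Both cutoffs are illustrative assumptions, not tuned values.
    """
    if not 0.0 <= p_approve <= 1.0:
        raise ValueError("p_approve must be a probability in [0, 1]")
    if p_approve >= auto_approve_at:
        return "auto_approve"      # obvious accept (high confidence)
    if p_approve < auto_deny_below:
        return "auto_deny"         # assumed band for obvious declines
    return "human_review"          # borderline case goes to a loan officer

# Made-up probabilities standing in for model output
sample_probabilities = [0.97, 0.91, 0.55, 0.42, 0.15, 0.88]
routes = [route_application(p) for p in sample_probabilities]
route_counts = Counter(routes)

for route, count in route_counts.items():
    print(f"{route:<14} {count}")
```

In the lab, the probabilities would come from `y_test_proba_final`. Widening or narrowing the middle band trades reviewer workload against the share of decisions made without human oversight.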

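One common way to put the "Data Drift" monitoring item into practice is the Population Stability Index (PSI), which compares a feature's binned distribution at training time with its distribution in newer data. The sketch below is one possible implementation under stated assumptions: equal-width bins over the reference range, a small `eps` to avoid taking the log of zero, and a hypothetical `income` feature generated at random.

```python
import math
import random

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a newer sample of one feature."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / n_bins

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # Floor each share at eps so empty bins do not produce log(0)
        return [max(c / len(values), eps) for c in counts]

    ref_shares = bin_shares(reference)
    cur_shares = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))

random.seed(42)
# Hypothetical "income" feature: training-time sample vs. two later samples
train_income = [random.gauss(60_000, 15_000) for _ in range(2_000)]
stable_income = list(train_income)
shifted_income = [x + 10_000 for x in train_income]

psi_stable = population_stability_index(train_income, stable_income)
psi_shifted = population_stability_index(train_income, shifted_income)
print(f"PSI (no change): {psi_stable:.4f}")
print(f"PSI (shifted):   {psi_shifted:.4f}")
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but these cutoffs are conventions rather than guarantees and should be validated against the bank's own data.
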
## Conclusion

Congratulations! You've successfully completed a comprehensive classification analysis for credit risk assessment. 

### Key Takeaways:

1. **Classification vs Regression**: Learned to predict categories rather than continuous values
2. **Class Imbalance**: Addressed imbalanced datasets using class weights and custom thresholds
3. **Multiple Metrics**: Used precision, recall, F1-score, and AUC for comprehensive evaluation
4. **ROC and PR Curves**: Visualized model performance across different thresholds
5. **Business Optimization**: Optimized decision threshold based on business costs and benefits
6. **Feature Interpretation**: Understood which factors drive credit decisions
7. **Model Comparison**: Compared different approaches and validation techniques
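
As a compact recap of takeaway 5, the sketch below sweeps candidate thresholds and keeps the one with the highest total value. The payoff constants, the helper `business_value`, and the toy data are all hypothetical; the lab's own `calculate_business_value` and cost constants may differ.

```python
# Hypothetical per-decision payoffs; the lab's own constants may differ
PROFIT_GOOD_LOAN = 1_000   # approved and repaid (true positive)
LOSS_BAD_LOAN = 5_000      # approved but defaulted (false positive)
MISSED_PROFIT = 1_000      # good applicant denied (false negative)

def business_value(y_true, p_approve, threshold):
    """Total value of approving every application with p >= threshold."""
    value = 0
    for actual, p in zip(y_true, p_approve):
        approved = p >= threshold
        if approved and actual == 1:
            value += PROFIT_GOOD_LOAN
        elif approved and actual == 0:
            value -= LOSS_BAD_LOAN
        elif not approved and actual == 1:
            value -= MISSED_PROFIT
        # Correctly denied bad loans contribute zero in this toy payoff model
    return value

# Toy labels (1 = good applicant) and predicted approval probabilities
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
p_approve = [0.95, 0.80, 0.70, 0.60, 0.65, 0.30, 0.45, 0.55]

thresholds = [t / 100 for t in range(5, 100, 5)]
best_threshold = max(thresholds, key=lambda t: business_value(y_true, p_approve, t))

print(f"Value at default 0.50: {business_value(y_true, p_approve, 0.5):,}")
print(f"Best threshold: {best_threshold:.2f}")
print(f"Value at best threshold: {business_value(y_true, p_approve, best_threshold):,}")
```

Because a defaulted loan costs far more than a missed good one in this toy setup, the value-maximizing threshold sits above the default 0.5.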

### Skills Practiced:
- Binary classification modeling
- Handling imbalanced datasets
- Logistic regression implementation and interpretation
- Classification metrics calculation and interpretation
- ROC and Precision-Recall curve analysis
- Threshold optimization for business objectives
- Feature importance analysis
- Model deployment considerations
- Business communication of technical results

### Next Steps:
In the next lab, we'll explore K-Nearest Neighbors for customer segmentation, learning about distance-based classification and the importance of feature scaling in instance-based learning algorithms.