# Student Loan Risk Demo - Model Training and Evaluation

This notebook demonstrates the complete machine learning pipeline for predicting student loan delinquency risk.

**Project Overview:**
- **Client:** Maximus (Student Loan Processing Company)
- **Partner:** FiServ (Follow-up with at-risk students)
- **Objective:** Train and evaluate ML models for delinquency prediction
- **Platform:** Cloudera Machine Learning

## 📋 Notebook Contents

1. **Data Loading and Preprocessing**
2. **Feature Engineering Pipeline**
3. **Model Training (Multiple Algorithms)**
4. **Model Evaluation and Comparison**
5. **Feature Importance Analysis**
6. **Model Selection and Validation**
7. **Performance Metrics and Visualization**


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
import sys
import os
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score
import joblib

# Add utils to path
sys.path.append('../utils')

# Import custom modules
from data_preprocessing import StudentLoanPreprocessor
from ml_models import StudentLoanRiskModels

# Configure plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("Environment setup complete!")


## 1. Load and Prepare Data

First, let's load the synthetic student loan dataset and check its shape and overall delinquency rate.


In [None]:
# Load the synthetic dataset
data_path = '../data/synthetic/student_loan_master_dataset.csv'

if os.path.exists(data_path):
    df = pd.read_csv(data_path)
    print("Dataset loaded successfully!")
    print(f"Shape: {df.shape}")
    print(f"Delinquency rate: {df['is_delinquent'].mean():.1%}")
    display(df.head())
else:
    print(f"Dataset not found at {data_path}")
    print("Please run the data generation notebook first or execute:")
    print("python main.py --generate-data")
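
Before moving on, it can help to run a few quick sanity checks on the loaded frame. The sketch below is illustrative only: `summarize_dataset` is a hypothetical helper, and every column name except `is_delinquent` is made up for this stand-in data.

```python
import pandas as pd


def summarize_dataset(df: pd.DataFrame, target: str = "is_delinquent") -> dict:
    """Return basic sanity-check statistics for a loaded dataset."""
    if target not in df.columns:
        raise KeyError(f"Target column '{target}' not found")
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "delinquency_rate": float(df[target].mean()),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Tiny stand-in frame; columns other than is_delinquent are hypothetical.
demo_df = pd.DataFrame({
    "loan_amount": [12000.0, 8500.0, None, 20000.0],
    "credit_score": [680, 720, 590, 640],
    "is_delinquent": [0, 0, 1, 0],
})

summary = summarize_dataset(demo_df)
print(summary)
```

On the real dataset, the same helper could be called as `summarize_dataset(df)` once the CSV has loaded.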


## 2. Data Preprocessing and Feature Engineering

Now let's preprocess the data, engineer features, and split the result into training (80%) and test (20%) sets.


In [None]:
# Initialize the preprocessor and prepare training data
if 'df' in locals():
    print("Initializing data preprocessing...")
    
    # Create preprocessor instance
    preprocessor = StudentLoanPreprocessor()
    
    # Prepare training and testing datasets
    X_train, X_test, y_train, y_test = preprocessor.prepare_training_data(df, test_size=0.2, random_state=42)
    
    print("✅ Data preprocessing completed!")
    print(f"Training set: {X_train.shape}")
    print(f"Testing set: {X_test.shape}")
    print(f"Target distribution - Train: {y_train.mean():.3f}, Test: {y_test.mean():.3f}")
    
    # Display feature names
    print(f"\nNumber of features: {len(preprocessor.get_feature_importance_names())}")
    print("Sample features:", preprocessor.get_feature_importance_names()[:10])
    
else:
    print("⚠️ Please load the dataset first by running the previous cell.")
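
The internals of `prepare_training_data` are not shown in this notebook. As a rough sketch of what a split like this commonly involves, the example below does a stratified 80/20 split and then fits a scaler on the training portion only. The synthetic columns and the use of `StandardScaler` are assumptions made for illustration, not a description of `StudentLoanPreprocessor`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; column names other than is_delinquent are hypothetical.
rng = np.random.default_rng(42)
n = 1000
demo = pd.DataFrame({
    "loan_amount": rng.normal(15000, 5000, n),
    "credit_score": rng.normal(650, 60, n),
})
demo["is_delinquent"] = (rng.random(n) < 0.2).astype(int)

X_all = demo.drop(columns="is_delinquent")
y_all = demo["is_delinquent"]

# stratify keeps the delinquency rate nearly identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

# Fit the scaler on training data only, then reuse it on the test data,
# so that no test-set statistics leak into training.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
print(f"Train rate: {y_tr.mean():.3f}, Test rate: {y_te.mean():.3f}")
```

Closely matching train and test target rates, as printed by the cell above, are what a stratified split would be expected to produce.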


## 3. Train Multiple ML Models

Let's train and compare multiple machine learning algorithms for delinquency prediction.


In [None]:
# Train multiple ML models
if 'X_train' in locals():
    print("🚀 Starting model training...")
    
    # Initialize ML models
    ml_models = StudentLoanRiskModels(random_state=42)
    
    # Train all models with hyperparameter tuning
    results = ml_models.train_all_models(X_train, y_train, X_test, y_test)
    
    print("\n📊 Model Performance Summary:")
    print("="*50)
    
    # Display results for each model
    for model_name, model_results in results.items():
        print(f"\n{model_name.replace('_', ' ').title()}:")
        print(f"  AUC Score: {model_results['test_auc']:.4f}")
        print(f"  Precision: {model_results['test_precision']:.4f}")
        print(f"  Recall: {model_results['test_recall']:.4f}")
        print(f"  F1-Score: {model_results['test_f1']:.4f}")
    
    print(f"\n🏆 Best model: {ml_models.best_model_name}")
    print(f"🎯 Best AUC score: {ml_models.model_scores[ml_models.best_model_name]['test_auc']:.4f}")
    
    # Save models
    os.makedirs('../models', exist_ok=True)
    ml_models.save_models('../models')
    print("\n💾 Models saved to ../models/")
    
else:
    print("⚠️ Please run the data preprocessing step first.")
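
How `train_all_models` works internally is not shown here. The sketch below illustrates the general pattern of fitting several candidate classifiers and ranking them by test AUC. The two algorithms, the synthetic data from `make_classification`, and the selection rule are all assumptions. Only the result keys (`test_auc`, `test_precision`, `test_recall`, `test_f1`) mirror the ones used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data standing in for the loan features (about 20% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Hypothetical candidate set; the real model list may differ
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

demo_results = {}
for name, model in candidates.items():
    model.fit(Xd_train, yd_train)
    proba = model.predict_proba(Xd_test)[:, 1]  # probability of the positive class
    pred = model.predict(Xd_test)
    demo_results[name] = {
        "test_auc": roc_auc_score(yd_test, proba),
        "test_precision": precision_score(yd_test, pred, zero_division=0),
        "test_recall": recall_score(yd_test, pred, zero_division=0),
        "test_f1": f1_score(yd_test, pred, zero_division=0),
    }

# Pick the candidate with the highest test AUC
best_name = max(demo_results, key=lambda k: demo_results[k]["test_auc"])
print(f"Best model: {best_name} (AUC {demo_results[best_name]['test_auc']:.4f})")
```

AUC is computed from predicted probabilities, while precision, recall, and F1 depend on the default 0.5 decision threshold, so on imbalanced data the two kinds of metric can rank models differently.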
