## Model Training and Ensemble Creation

This notebook builds on the feature engineering from Notebook 2 to train three machine learning models and combine them into a weighted ensemble for credit risk prediction.

### Self-Contained Implementation:

Since each Databricks notebook runs independently, this notebook includes:
- Complete data loading and feature engineering replication
- Preprocessor pipeline setup with One-Hot Encoding
- Training of three distinct model architectures
- Ensemble creation combining all models

### Model Architecture Strategy:

We train three complementary model types:
- **Random Forest**: Ensemble of decision trees, robust to outliers and able to capture nonlinear relationships
- **Gradient Boosting**: Builds trees sequentially, each one correcting the errors of the previous ones; often highly accurate
- **Logistic Regression**: Linear model with regularization, provides an interpretable baseline

### Performance Evaluation:
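- **Accuracy**: Share of test samples classified correctly; can look high on imbalanced data even when the minority class is missed
- **AUC-ROC**: Area under the Receiver Operating Characteristic curve; measures how well the model ranks positive cases above negative ones across all thresholds
- **Ensemble Performance**: AUC-ROC of the weighted combination of all three models, compared with each individual model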
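
To make the AUC-ROC definition concrete, the sketch below computes it directly as the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as half. The helper names `pairwise_auc` and `accuracy` and the toy labels and scores are illustrative assumptions, not part of the notebook. The training cells use `roc_auc_score` from scikit-learn, which should agree with this pairwise definition for binary labels.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(y_true, y_pred):
    """Fraction of labels predicted exactly."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, hand-made data: 8 negatives and 2 positives (imbalanced on purpose)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A model that always predicts the majority class looks accurate but cannot rank
constant_scores = [0.1] * 10
constant_preds = [0] * 10
acc_constant = accuracy(y_true, constant_preds)
auc_constant = pairwise_auc(y_true, constant_scores)

# A model that scores both positives above most of the negatives
ranking_scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.7, 0.6, 0.9]
auc_ranking = pairwise_auc(y_true, ranking_scores)

print(f"Constant model: accuracy={acc_constant:.2f}, AUC={auc_constant:.2f}")
print(f"Ranking model:  AUC={auc_ranking:.4f}")
```

The constant model reaches 80% accuracy while its AUC stays at 0.5, which is why both metrics are reported for every model here.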

In [0]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline as SkPipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np
import pandas as pd

print("Notebook 3: Model Training and Ensemble Creation")
print("=" * 60)

# Recreate the feature engineering function from Notebook 2
def create_real_time_features(df):
    """Create features suitable for real-time scoring"""
    df_enhanced = df.copy()
    
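    # Financial ratios (real-time calculable)
    # Income is treated as annual and CCAvg as monthly, hence the /12 and *12 conversions
    df_enhanced['Debt_to_Income_Ratio'] = df_enhanced['CCAvg'] / (df_enhanced['Income']/12 + 1e-6)
    # Small epsilon guards against division by zero, consistent with the other ratios
    df_enhanced['Savings_Rate'] = (df_enhanced['Income'] - df_enhanced['CCAvg'] * 12) / (df_enhanced['Income'] + 1e-6)
    # Note: identical to Debt_to_Income_Ratio as defined; kept so the feature set matches Notebook 2
    df_enhanced['Credit_Usage_Intensity'] = df_enhanced['CCAvg'] / (df_enhanced['Income']/12 + 1e-6)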
    
    # Behavioral features
    df_enhanced['Digital_Engagement'] = df_enhanced['Online'] + df_enhanced['CreditCard']
    df_enhanced['Investment_Profile'] = df_enhanced['Securities Account'] + df_enhanced['CD Account']
    
    # Stability indicators
    df_enhanced['Career_Stage'] = df_enhanced['Experience'] / (df_enhanced['Age'] + 1e-6)
    df_enhanced['Family_Financial_Stress'] = df_enhanced['Family'] / (df_enhanced['Income']/1000 + 1e-6)
    
    # Binning for categorical encoding
    df_enhanced['Income_Bin'] = pd.cut(df_enhanced['Income'], 
                                      bins=[0, 50, 100, 200, 500], 
                                      labels=['Low', 'Medium', 'High', 'Very High'])
    
    df_enhanced['Age_Group'] = pd.cut(df_enhanced['Age'],
                                     bins=[0, 30, 45, 60, 100],
                                     labels=['Young', 'Adult', 'Middle', 'Senior'])
    
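    # include_lowest=True so a CCAvg of exactly 0 falls in the first bin instead of becoming NaN
    df_enhanced['CCAvg_Level'] = pd.cut(df_enhanced['CCAvg'],
                                       bins=[0, 1, 3, 6, 10],
                                       labels=['Low', 'Medium', 'High', 'Very High'],
                                       include_lowest=True)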
    
    return df_enhanced

# Load and process the data (same as Notebook 2)
print("Loading data from Databricks...")
df = spark.table("personal_catalog.default.bank_loan_modelling")
pandas_df = df.toPandas()

print("Applying feature engineering...")
enhanced_pandas_df = create_real_time_features(pandas_df)
print(f"Enhanced dataset shape: {enhanced_pandas_df.shape}")

In [0]:
# Define features for the model (same as Notebook 2)
categorical_features = ['Income_Bin', 'Age_Group', 'CCAvg_Level', 'Education']
numerical_features = [
    'Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Mortgage',
    'Debt_to_Income_Ratio', 'Savings_Rate', 'Credit_Usage_Intensity',
    'Digital_Engagement', 'Investment_Profile', 'Career_Stage', 
    'Family_Financial_Stress'
]

# Target variable
target = 'Personal Loan'

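# Create preprocessing pipeline with One-Hot Encoding
# Built by a factory so each model pipeline fits its own preprocessor instance
# instead of all three pipelines refitting one shared object
def build_preprocessor():
    return ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numerical_features),
            # sparse_output requires scikit-learn >= 1.2.
            # With drop='first' and handle_unknown='ignore', an unseen category is
            # encoded as all zeros, the same as the dropped first category.
            ('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), categorical_features)
        ])

# Create the full pipeline with multiple model options
def create_sklearn_pipeline(model_type='random_forest'):
    if model_type == 'random_forest':
        model = RandomForestClassifier(
            n_estimators=100,
            max_depth=10,
            random_state=42,
            class_weight='balanced'
        )
    elif model_type == 'gradient_boosting':
        # GradientBoostingClassifier has no class_weight parameter
        model = GradientBoostingClassifier(
            n_estimators=100,
            max_depth=6,
            random_state=42
        )
    elif model_type == 'logistic_regression':
        model = LogisticRegression(
            random_state=42,
            class_weight='balanced',
            max_iter=1000
        )
    else:
        # Fail loudly on a typo instead of silently training logistic regression
        raise ValueError(f"Unknown model_type: {model_type!r}")
    
    return SkPipeline(steps=[
        ('preprocessor', build_preprocessor()),
        ('classifier', model)
    ])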

# Prepare data for training
X = enhanced_pandas_df[numerical_features + categorical_features]
y = enhanced_pandas_df[target]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Data preparation complete!")
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Number of input features (before one-hot encoding): {X_train.shape[1]}")

In [0]:
# Train multiple models for ensemble scoring
models = {}
model_performance = {}

print("Starting model training...")
print("-" * 50)

for model_name in ['random_forest', 'gradient_boosting', 'logistic_regression']:
    print(f"Training {model_name}...")
    
    # Create and train model
    pipeline = create_sklearn_pipeline(model_name)
    pipeline.fit(X_train, y_train)
    
    # Make predictions
    y_pred = pipeline.predict(X_test)
    y_pred_proba = pipeline.predict_proba(X_test)[:, 1]
    
    # Store model and performance
    models[model_name] = pipeline
    model_performance[model_name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'roc_auc': roc_auc_score(y_test, y_pred_proba)
    }
    
    print(f"{model_name} - Accuracy: {model_performance[model_name]['accuracy']:.4f}, "
          f"AUC: {model_performance[model_name]['roc_auc']:.4f}")

## Ensemble Model Creation

This section combines the individual models into a weighted ensemble for improved performance and stability.

### Ensemble Strategy:

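The ensemble takes a weighted average of each model's predicted probability for the positive class. The weights are fixed in the code rather than tuned on validation data:
- **Random Forest**: 40% weight (expected to be a strong overall performer)
- **Gradient Boosting**: 40% weight (expected to have high predictive power)
- **Logistic Regression**: 20% weight (adds a linear perspective)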
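
As a worked example of this weighting, the following sketch combines three sets of made-up probabilities using the 40/40/20 split. The probability values and the names `model_probas` and `weighted_average` are hypothetical and chosen only for illustration; in the notebook the inputs would come from each pipeline's `predict_proba`.

```python
import numpy as np

# Weights from the ensemble strategy, keyed by model name
weights = {"random_forest": 0.4, "gradient_boosting": 0.4, "logistic_regression": 0.2}

# Made-up positive-class probabilities for three applicants (illustrative only)
model_probas = {
    "random_forest": np.array([0.90, 0.20, 0.50]),
    "gradient_boosting": np.array([0.80, 0.10, 0.60]),
    "logistic_regression": np.array([0.70, 0.30, 0.40]),
}


def weighted_average(probas, weights):
    """Weighted mean of per-model probabilities, normalized by the weight total."""
    total = sum(weights[name] for name in probas)
    combined = sum(weights[name] * p for name, p in probas.items())
    return combined / total


ensemble = weighted_average(model_probas, weights)
print(np.round(ensemble, 2))
```

For the first applicant the arithmetic is 0.4 × 0.90 + 0.4 × 0.80 + 0.2 × 0.70 = 0.82. Dividing by the weight total keeps the result a valid probability even if the weights are later changed and no longer sum to 1.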

### Benefits of Ensemble:
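- **Reduced Variance**: Averaging smooths out errors that the individual models do not share
- **Improved Robustness**: Less sensitive to noisy data than any single model
- **Better Generalization**: Combines different learning biases
- **Enhanced Performance**: Often matches or beats the best single model, though this is not guaranteed; the comparison at the end of this notebook checks it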
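
The variance-reduction point can be illustrated with a small simulation. The sketch below assumes each model's predicted probability equals a true value plus independent noise of the same size. This is a simplifying assumption: models trained on the same data usually make correlated errors, so the real reduction tends to be smaller. All names and numbers here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

true_proba = 0.3          # hypothetical true probability for one applicant
noise_std = 0.1           # assumed per-model error, identical for all three models
n_trials = 10_000
weights = np.array([0.4, 0.4, 0.2])  # RF, GBM, LR weights from the ensemble strategy

# Each row is one trial; each column is one model's noisy estimate
estimates = true_proba + rng.normal(0.0, noise_std, size=(n_trials, 3))

# Weighted average across the three models for every trial
ensemble_estimates = estimates @ weights

single_variances = estimates.var(axis=0)
ensemble_variance = ensemble_estimates.var()

# Under independence, Var(ensemble) = sum(w_i^2) * noise_std^2 = 0.36 * 0.01
expected_variance = np.sum(weights ** 2) * noise_std ** 2

print("Single-model variances:", np.round(single_variances, 4))
print("Ensemble variance:     ", round(float(ensemble_variance), 4))
print("Theoretical value:     ", round(float(expected_variance), 4))
```

Under these assumptions the ensemble variance is roughly a third of any single model's variance. With correlated errors the gain shrinks, which is one reason to mix tree-based models with a linear one.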

In [0]:
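# Create ensemble predictions
def ensemble_predict_proba(X, models, weights=None):
    """Weighted average of each model's positive-class probability."""
    if weights is None:
        # Keyed by model name so the result does not depend on dict ordering
        weights = {'random_forest': 0.4, 'gradient_boosting': 0.4, 'logistic_regression': 0.2}
    
    total_weight = sum(weights[name] for name in models)
    ensemble = np.zeros(len(X))
    for name, model in models.items():
        ensemble += weights[name] * model.predict_proba(X)[:, 1]
    
    # Normalize so the output stays a valid probability even if weights do not sum to 1
    return ensemble / total_weight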

# Get ensemble predictions
ensemble_proba = ensemble_predict_proba(X_test, models)
ensemble_auc = roc_auc_score(y_test, ensemble_proba)

print("\n" + "=" * 50)
print("ENSEMBLE MODEL RESULTS")
print("=" * 50)
print(f"Ensemble Model AUC: {ensemble_auc:.4f}")

# Compare ensemble performance with individual models
print("\nPerformance Comparison:")
print("-" * 30)
for model_name in model_performance:
    print(f"{model_name:20} AUC: {model_performance[model_name]['roc_auc']:.4f}")
print(f"{'Ensemble':20} AUC: {ensemble_auc:.4f}")

print("\nModel training completed successfully!")