# Loan Payback Prediction - Training & Evaluation

This notebook loads pre-created training and test splits (optionally SMOTE-balanced), preprocesses the features, trains a RandomForest model and, when TensorFlow is installed, a neural network, evaluates the best model on the test split, and writes auto-numbered submission files to a dedicated folder.

## 1. Import Required Libraries

Import pandas for data manipulation. train_test_split from sklearn.model_selection is imported as well, although the splits are loaded from disk and no cell in this notebook calls it.

In [35]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [36]:
# Configuration: Set to True to use SMOTE-balanced training data
USE_SMOTE = False  # Change to True to use train_split_smote.csv

print(f"Configuration: USE_SMOTE = {USE_SMOTE}")

Configuration: USE_SMOTE = False


In [37]:
# Optional: Enable iterative training to search for best hyperparameters
ENABLE_ITERATIVE_TRAINING = False  # Set to True to try multiple configurations
N_ITERATIONS = 5  # Number of different configurations to try

# Model selection: Train both models and compare
TRAIN_RANDOM_FOREST = True   # Train RandomForest model
TRAIN_NEURAL_NETWORK = True  # Train Neural Network model

# Neural Network configuration
NN_EPOCHS = 50
NN_BATCH_SIZE = 32
NN_VALIDATION_SPLIT = 0.1

print(f"Iterative training: {'ENABLED' if ENABLE_ITERATIVE_TRAINING else 'DISABLED'}")
if ENABLE_ITERATIVE_TRAINING:
    print(f"Will train {N_ITERATIONS} models with different configurations")
print(f"\nModels to train:")
print(f"  - RandomForest: {'YES' if TRAIN_RANDOM_FOREST else 'NO'}")
print(f"  - Neural Network: {'YES' if TRAIN_NEURAL_NETWORK else 'NO'}")
if TRAIN_NEURAL_NETWORK:
    print(f"    Epochs: {NN_EPOCHS}, Batch size: {NN_BATCH_SIZE}, Validation split: {NN_VALIDATION_SPLIT}")

Iterative training: DISABLED

Models to train:
  - RandomForest: YES
  - Neural Network: YES
    Epochs: 50, Batch size: 32, Validation split: 0.1
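When iterative training is enabled, every configuration has to be scored on some data, and scoring on the test split means the reported test ROC-AUC is no longer an unbiased estimate. One common alternative is to carve a validation set out of the training data and pick the configuration there. The sketch below assumes that approach; the synthetic arrays, the 20% validation size, and the two candidate configurations are illustrative placeholders, not values from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_train_processed / y_train_array (assumption, not the real data)
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=600) > -0.8).astype(int)  # imbalanced, mostly class 1

# Hold out part of the training data for model selection; stratify keeps the class ratio
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

candidate_configs = [  # hypothetical configurations
    {"n_estimators": 20, "max_depth": 3},
    {"n_estimators": 40, "max_depth": 6},
]

best_auc, best_config = -1.0, None
for config in candidate_configs:
    rf = RandomForestClassifier(class_weight="balanced", random_state=42, **config)
    rf.fit(X_fit, y_fit)
    auc = roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1])
    if auc > best_auc:
        best_auc, best_config = auc, config

print(f"Selected {best_config} with validation ROC-AUC {best_auc:.4f}")
```

With this layout the test split is touched only once, after the configuration has been chosen.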


## 2. Configuration

Set whether to use SMOTE-balanced training data.

## 2b. Training Configuration (Optional)

Enable iterative training to find the best hyperparameters by training multiple times with validation monitoring.

## 3. Load Training and Test Split Data

Load the train_split.csv (or train_split_smote.csv) and test_split.csv files from the Data/splits directory.

In [38]:
# Load the training and test split datasets
train_file = 'Data/splits/train_split_smote.csv' if USE_SMOTE else 'Data/splits/train_split.csv'
train_df = pd.read_csv(train_file)
test_df = pd.read_csv('Data/splits/test_split.csv')

print(f"Training file: {train_file}")
print(f"Train split shape: {train_df.shape}")
print(f"Test split shape: {test_df.shape}")

Training file: Data/splits/train_split.csv
Train split shape: (475195, 13)
Test split shape: (118799, 13)


## 4. Prepare Features and Target Variables

Separate features (X) and target variable (y) for both splits by dropping the 'loan_paid_back' column.

In [39]:
# Separate features and target variables for both splits
X_train = train_df.drop("loan_paid_back", axis=1)
y_train = train_df["loan_paid_back"]

X_test = test_df.drop("loan_paid_back", axis=1)
y_test = test_df["loan_paid_back"]

print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape:  {X_test.shape}")
print(f"y_test shape:  {y_test.shape}")

X_train shape: (475195, 12)
y_train shape: (475195,)
X_test shape:  (118799, 12)
y_test shape:  (118799,)


## 5. Check Class Distribution in Provided Splits

Confirm the class balance in y_train and y_test to ensure splits look reasonable.

In [40]:
# Check class distribution in the provided splits
print("Class distribution in y_train (proportion):")
print(y_train.value_counts(normalize=True).sort_index())
print("\nClass distribution in y_test (proportion):")
print(y_test.value_counts(normalize=True).sort_index())

Class distribution in y_train (proportion):
loan_paid_back
0.0    0.201181
1.0    0.798819
Name: proportion, dtype: float64

Class distribution in y_test (proportion):
loan_paid_back
0.0    0.20118
1.0    0.79882
Name: proportion, dtype: float64
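With roughly 80% of the loans paid back, accuracy on its own says little: a model that always predicts the majority class already scores about 0.80. A minimal sketch of that baseline follows, using the proportions printed above; the helper name `majority_baseline_accuracy` and the toy label list are hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Toy labels with the same ~80/20 ratio as y_train (illustrative, not the real data)
toy_labels = [1] * 8 + [0] * 2
print(majority_baseline_accuracy(toy_labels))

# The same baseline read directly off the proportions printed above for y_test
test_proportions = {0.0: 0.20118, 1.0: 0.79882}
baseline = max(test_proportions.values())
print(f"Majority-class baseline accuracy: {baseline:.4f}")
```

Any accuracy figure for this dataset is therefore better judged against about 0.80 than against 0.50.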


## 6. Sanity-Check Feature Columns

Verify that train and test splits have matching feature columns.

In [41]:
# Ensure feature columns align between train and test splits
train_cols = list(X_train.columns)
test_cols = list(X_test.columns)
print(f"X_train columns: ({len(train_cols)})")
print(train_cols)
print(f"\nX_test columns:  ({len(test_cols)})")
print(test_cols)
print("\nColumns match:", train_cols == test_cols)

X_train columns: (12)
['id', 'annual_income', 'debt_to_income_ratio', 'credit_score', 'loan_amount', 'interest_rate', 'gender', 'marital_status', 'education_level', 'employment_status', 'loan_purpose', 'grade_subgrade']

X_test columns:  (12)
['id', 'annual_income', 'debt_to_income_ratio', 'credit_score', 'loan_amount', 'interest_rate', 'gender', 'marital_status', 'education_level', 'employment_status', 'loan_purpose', 'grade_subgrade']

Columns match: True
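The check above returns a single boolean, which is enough when the columns match. If it ever printed False, a diagnostic along the following lines could show what differs. This is a sketch only; the helper `describe_column_mismatch` and the example column lists are hypothetical.

```python
def describe_column_mismatch(train_cols, test_cols):
    """Return which columns are missing on either side and whether only the order differs."""
    train_set, test_set = set(train_cols), set(test_cols)
    return {
        "only_in_train": sorted(train_set - test_set),
        "only_in_test": sorted(test_set - train_set),
        "same_columns_different_order": (
            train_set == test_set and list(train_cols) != list(test_cols)
        ),
    }

# Hypothetical example: one column was renamed in the test file
train_cols = ["id", "annual_income", "credit_score"]
test_cols = ["id", "annual_income", "creditscore"]
report = describe_column_mismatch(train_cols, test_cols)
print(report)
```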


## 7. Identify Categorical and Numerical Columns

Separate columns into categorical and numerical types for appropriate preprocessing.

In [42]:
# Identify categorical columns (object dtype)
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()

# Identify numerical columns (excluding 'id' if present, as it's not a useful feature)
numerical_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
if 'id' in numerical_cols:
    numerical_cols.remove('id')

print(f"Categorical columns ({len(categorical_cols)}): {categorical_cols}")
print(f"Numerical columns ({len(numerical_cols)}): {numerical_cols}")

Categorical columns (6): ['gender', 'marital_status', 'education_level', 'employment_status', 'loan_purpose', 'grade_subgrade']
Numerical columns (5): ['annual_income', 'debt_to_income_ratio', 'credit_score', 'loan_amount', 'interest_rate']


## 8. Import Preprocessing Tools

Import OneHotEncoder for categorical features and StandardScaler for numerical features from sklearn.

In [43]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import numpy as np

## 9. One-Hot Encode Categorical Features

Fit the OneHotEncoder on the training data and transform train and test sets.

In [44]:
# Initialize and fit OneHotEncoder on training data
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe.fit(X_train[categorical_cols])

# Transform categorical columns for both datasets
X_train_cat_encoded = ohe.transform(X_train[categorical_cols])
X_test_cat_encoded = ohe.transform(X_test[categorical_cols])

print(f"Encoded categorical features shape (train): {X_train_cat_encoded.shape}")
print(f"Encoded categorical features shape (test): {X_test_cat_encoded.shape}")
print(f"Total one-hot encoded features: {X_train_cat_encoded.shape[1]}")

Encoded categorical features shape (train): (475195, 55)
Encoded categorical features shape (test): (118799, 55)
Total one-hot encoded features: 55
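The encoder is created with handle_unknown='ignore', which matters when later data contains a category that never appeared in training. The toy sketch below illustrates that behavior under assumed values ("Student" is a made-up unseen category); it is not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy column (assumption): the training data only ever sees two categories
train_values = np.array([["Employed"], ["Retired"], ["Employed"]], dtype=object)
new_values = np.array([["Retired"], ["Student"]], dtype=object)  # "Student" is unseen

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_values)

# .toarray() converts the default sparse result to a dense array
encoded = encoder.transform(new_values).toarray()
print(encoder.categories_[0])  # categories learned from the training values
print(encoded)                 # the row for the unseen category is all zeros
```

An unseen category therefore does not raise an error; it is encoded as a row of zeros for that feature, and the output width stays the same as in training.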


## 10. Scale Numerical Features

Fit the StandardScaler on the training data and transform train and test sets.

In [45]:
# Initialize and fit StandardScaler on training data
scaler = StandardScaler()
scaler.fit(X_train[numerical_cols])

# Transform numerical columns for both datasets
X_train_num_scaled = scaler.transform(X_train[numerical_cols])
X_test_num_scaled = scaler.transform(X_test[numerical_cols])

print(f"Scaled numerical features shape (train): {X_train_num_scaled.shape}")
print(f"Scaled numerical features shape (test): {X_test_num_scaled.shape}")

Scaled numerical features shape (train): (475195, 5)
Scaled numerical features shape (test): (118799, 5)
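Fitting on the training data only means the mean and standard deviation come from the training rows and are then reused unchanged for every other dataset. StandardScaler uses the population standard deviation, so the same arithmetic can be sketched with the standard library. The helper names and the toy credit scores below are illustrative assumptions.

```python
from statistics import fmean, pstdev

def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from the training values only."""
    return fmean(train_values), pstdev(train_values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(x - mean) / std for x in values]

# Toy credit scores (illustrative values, not taken from the dataset)
train_scores = [600.0, 650.0, 700.0, 750.0]
test_scores = [675.0, 800.0]

mean, std = fit_standardizer(train_scores)
train_scaled = standardize(train_scores, mean, std)
test_scaled = standardize(test_scores, mean, std)  # reuses the training mean and std

print(train_scaled)
print(test_scaled)
```

The scaled training values have zero mean by construction, while the scaled test values generally do not, which is expected.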


## 11. Combine Encoded and Scaled Features

Concatenate the one-hot encoded categorical features with scaled numerical features into final NumPy arrays.

In [46]:
# Combine categorical and numerical features
X_train_processed = np.concatenate([X_train_num_scaled, X_train_cat_encoded], axis=1)
X_test_processed = np.concatenate([X_test_num_scaled, X_test_cat_encoded], axis=1)

# Convert target variables to NumPy arrays
y_train_array = y_train.values
y_test_array = y_test.values

print("Final preprocessed data shapes:")
print(f"  X_train_processed: {X_train_processed.shape}")
print(f"  X_test_processed:  {X_test_processed.shape}")
print(f"  y_train_array:     {y_train_array.shape}")
print(f"  y_test_array:      {y_test_array.shape}")
print(f"\nTotal features: {X_train_processed.shape[1]} ({len(numerical_cols)} numerical + {X_train_cat_encoded.shape[1]} categorical)")

Final preprocessed data shapes:
  X_train_processed: (475195, 60)
  X_test_processed:  (118799, 60)
  y_train_array:     (475195,)
  y_test_array:      (118799,)

Total features: 60 (5 numerical + 55 categorical)
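The separate fit, transform, and concatenate steps can also be bundled into a single object, which makes it harder to apply the transformers in the wrong order later. The sketch below uses scikit-learn's ColumnTransformer as one possible alternative, not as the method this notebook uses; the two-column frames are toy data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative frames (assumption); the real column lists come from section 7
train = pd.DataFrame({
    "credit_score": [600, 650, 700, 750],
    "gender": ["Female", "Male", "Female", "Male"],
})
test = pd.DataFrame({"credit_score": [675], "gender": ["Other"]})

numerical_cols = ["credit_score"]
categorical_cols = ["gender"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train_demo = preprocessor.fit_transform(train)  # fit on training data only
X_test_demo = preprocessor.transform(test)        # reuse the fitted statistics

print(X_train_demo.shape, X_test_demo.shape)
print(X_test_demo)
```

The output keeps the same column order as the manual concatenation (numerical first, then one-hot columns). Columns not listed, such as id, are dropped by default.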


## 12. Save Preprocessed Data

Save the preprocessed NumPy arrays to disk for model training and evaluation.

In [47]:
import os

# Create directory for preprocessed data
os.makedirs('Data/preprocessed', exist_ok=True)

# Save preprocessed arrays
np.save('Data/preprocessed/X_train.npy', X_train_processed)
np.save('Data/preprocessed/X_test.npy', X_test_processed)
np.save('Data/preprocessed/y_train.npy', y_train_array)
np.save('Data/preprocessed/y_test.npy', y_test_array)

print("Preprocessed data saved to Data/preprocessed/:")
print("  - X_train.npy")
print("  - X_test.npy")
print("  - y_train.npy")
print("  - y_test.npy")

Preprocessed data saved to Data/preprocessed/:
  - X_train.npy
  - X_test.npy
  - y_train.npy
  - y_test.npy
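A later script can read these files back with np.load. The round trip below is a minimal sketch on a small stand-in array written to a temporary directory, so it does not touch Data/preprocessed.

```python
import os
import tempfile

import numpy as np

# Small stand-in array (assumption); the notebook saves the full processed matrices
X_small = np.arange(12, dtype=np.float64).reshape(4, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X_train.npy")
    np.save(path, X_small)

    # np.load restores shape and dtype exactly
    X_loaded = np.load(path)

print(X_loaded.shape, X_loaded.dtype)
print(np.array_equal(X_small, X_loaded))
```

For arrays the size of the real training matrix, passing mmap_mode="r" to np.load is an option that avoids reading the whole file into memory at once.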


## 13. Train Models (RandomForest and/or Neural Network)

Train the selected models with balanced class weights to handle the class imbalance, then compare their ROC-AUC on the held-out test split (test_split.csv), which this notebook uses as its validation set.

In [48]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, f1_score, precision_score, recall_score, accuracy_score
import joblib

# Calculate class weights for handling imbalance
n_neg = (y_train_array == 0).sum()
n_pos = (y_train_array == 1).sum()
class_weight_ratio = n_neg / n_pos

print(f"Class counts: Negative={n_neg}, Positive={n_pos}")
print(f"Class weight ratio (neg/pos) = {class_weight_ratio:.4f}")
print(f"\n{'='*70}")
print("TRAINING MODELS")
print(f"{'='*70}\n")

# Dictionary to store all trained models and their scores
models = {}
model_scores = {}

# ============================================================================
# RANDOM FOREST TRAINING
# ============================================================================
if TRAIN_RANDOM_FOREST:
    print("🌲 RANDOM FOREST")
    print("-" * 70)
    
    if ENABLE_ITERATIVE_TRAINING:
        # Iterative training: try multiple configurations
        print("Iterative training mode: trying multiple configurations\n")
        
        best_rf_roc_auc = 0
        best_rf_model = None
        best_rf_config = None
        
        # Define configurations to try
        rf_configs = [
            {"n_estimators": 100, "max_depth": 15, "min_samples_split": 10, "min_samples_leaf": 4},
            {"n_estimators": 150, "max_depth": 20, "min_samples_split": 8, "min_samples_leaf": 3},
            {"n_estimators": 200, "max_depth": 25, "min_samples_split": 5, "min_samples_leaf": 2},
            {"n_estimators": 100, "max_depth": 10, "min_samples_split": 15, "min_samples_leaf": 5},
            {"n_estimators": 250, "max_depth": 30, "min_samples_split": 4, "min_samples_leaf": 2},
        ]
        
        for i, config in enumerate(rf_configs[:N_ITERATIONS], 1):
            print(f"  Config {i}/{N_ITERATIONS}: {config}")
            
            temp_rf = RandomForestClassifier(
                class_weight='balanced',
                random_state=42,
                n_jobs=-1,
                **config
            )
            
            temp_rf.fit(X_train_processed, y_train_array)
            
            # Score on the held-out test split (this notebook uses it as the validation set)
            temp_pred_proba = temp_rf.predict_proba(X_test_processed)[:, 1]
            temp_roc_auc = roc_auc_score(y_test_array, temp_pred_proba)
            
            print(f"    → ROC-AUC: {temp_roc_auc:.4f}")
            
            if temp_roc_auc > best_rf_roc_auc:
                best_rf_roc_auc = temp_roc_auc
                best_rf_model = temp_rf
                best_rf_config = config
                print(f"    ✓ New best!")
            print()
        
        models['RandomForest'] = best_rf_model
        model_scores['RandomForest'] = best_rf_roc_auc
        print(f"Best RandomForest: ROC-AUC = {best_rf_roc_auc:.4f}")
        print(f"Config: {best_rf_config}\n")
    else:
        # Single training with default configuration
        rf_model = RandomForestClassifier(
            class_weight='balanced',
            random_state=42,
            n_estimators=100,
            max_depth=15,
            min_samples_split=10,
            min_samples_leaf=4,
            n_jobs=-1
        )
        
        print("Training RandomForest model...")
        rf_model.fit(X_train_processed, y_train_array)
        
        # Score on the held-out test split (this notebook uses it as the validation set)
        rf_pred_proba = rf_model.predict_proba(X_test_processed)[:, 1]
        rf_roc_auc = roc_auc_score(y_test_array, rf_pred_proba)
        
        models['RandomForest'] = rf_model
        model_scores['RandomForest'] = rf_roc_auc
        print(f"✓ Training complete!")
        print(f"RandomForest ROC-AUC: {rf_roc_auc:.4f}\n")

# ============================================================================
# NEURAL NETWORK TRAINING
# ============================================================================
if TRAIN_NEURAL_NETWORK:
    print("🧠 NEURAL NETWORK")
    print("-" * 70)
    
    try:
        from tensorflow import keras
        from tensorflow.keras import layers
        from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
        import tensorflow as tf
        
        # Set random seed for reproducibility
        tf.random.set_seed(42)
        
        # Calculate class weights for neural network
        class_weight_dict = {
            0: len(y_train_array) / (2 * n_neg),
            1: len(y_train_array) / (2 * n_pos)
        }
        
        print(f"Building neural network architecture...")
        print(f"Input features: {X_train_processed.shape[1]}")
        
        # Build neural network
        nn_model = keras.Sequential([
            layers.Input(shape=(X_train_processed.shape[1],)),
            layers.Dense(256, activation='relu'),
            layers.BatchNormalization(),
            layers.Dropout(0.3),
            layers.Dense(128, activation='relu'),
            layers.BatchNormalization(),
            layers.Dropout(0.3),
            layers.Dense(64, activation='relu'),
            layers.BatchNormalization(),
            layers.Dropout(0.2),
            layers.Dense(32, activation='relu'),
            layers.Dropout(0.2),
            layers.Dense(1, activation='sigmoid')
        ])
        
        # Compile model
        nn_model.compile(
            optimizer=keras.optimizers.Adam(learning_rate=0.001),
            loss='binary_crossentropy',
            metrics=[
                'accuracy',
                keras.metrics.AUC(name='auc'),
                keras.metrics.Precision(name='precision'),
                keras.metrics.Recall(name='recall')
            ]
        )
        
        print(f"\nModel architecture:")
        nn_model.summary()
        
        # Callbacks
        early_stopping = EarlyStopping(
            monitor='val_auc',
            patience=10,
            restore_best_weights=True,
            mode='max',
            verbose=1
        )
        
        reduce_lr = ReduceLROnPlateau(
            monitor='val_auc',
            factor=0.5,
            patience=5,
            min_lr=1e-6,
            mode='max',
            verbose=1
        )
        
        print(f"\nTraining neural network:")
        print(f"  Epochs: {NN_EPOCHS}")
        print(f"  Batch size: {NN_BATCH_SIZE}")
        print(f"  Validation split: {NN_VALIDATION_SPLIT}")
        print(f"  Class weights: {class_weight_dict}")
        print()
        
        # Train model
        history = nn_model.fit(
            X_train_processed, y_train_array,
            epochs=NN_EPOCHS,
            batch_size=NN_BATCH_SIZE,
            validation_split=NN_VALIDATION_SPLIT,
            class_weight=class_weight_dict,
            callbacks=[early_stopping, reduce_lr],
            verbose=1
        )
        
        # Evaluate on test set
        nn_pred_proba = nn_model.predict(X_test_processed, verbose=0).flatten()
        nn_roc_auc = roc_auc_score(y_test_array, nn_pred_proba)
        
        models['NeuralNetwork'] = nn_model
        model_scores['NeuralNetwork'] = nn_roc_auc
        
        print(f"\n✓ Neural Network training complete!")
        print(f"Neural Network ROC-AUC: {nn_roc_auc:.4f}")
        # With restore_best_weights=True the kept weights come from the epoch with the highest val_auc
        best_nn_epoch = int(np.argmax(history.history['val_auc'])) + 1
        print(f"Best epoch: {best_nn_epoch}\n")
        
        # Store training history for later visualization
        nn_training_history = history.history
        
    except ImportError:
        print("⚠️  TensorFlow/Keras not installed. Skipping Neural Network training.")
        print("Install with: pip install tensorflow")
        TRAIN_NEURAL_NETWORK = False

# ============================================================================
# MODEL COMPARISON
# ============================================================================
print("="*70)
print("MODEL COMPARISON")
print("="*70)

if model_scores:
    for model_name, score in sorted(model_scores.items(), key=lambda x: x[1], reverse=True):
        print(f"{model_name:20s}: ROC-AUC = {score:.4f}")
    
    # Select best model
    best_model_name = max(model_scores, key=model_scores.get)
    best_model = models[best_model_name]
    best_score = model_scores[best_model_name]
    
    print(f"\n🏆 BEST MODEL: {best_model_name} (ROC-AUC = {best_score:.4f})")
    print("="*70)
    
    # Set the main 'model' variable to the best model for later use
    model = best_model
    model_type = best_model_name
else:
    print("No models were trained!")
    raise RuntimeError("Please enable at least one model type")

Class counts: Negative=95600, Positive=379595
Class weight ratio (neg/pos) = 0.2518

TRAINING MODELS

🌲 RANDOM FOREST
----------------------------------------------------------------------
Training RandomForest model...
✓ Training complete!
RandomForest ROC-AUC: 0.9121

🧠 NEURAL NETWORK
----------------------------------------------------------------------
⚠️  TensorFlow/Keras not installed. Skipping Neural Network training.
Install with: pip install tensorflow
MODEL COMPARISON
RandomForest        : ROC-AUC = 0.9121

🏆 BEST MODEL: RandomForest (ROC-AUC = 0.9121)
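Both models are trained with "balanced" class weights. scikit-learn documents this heuristic as n_samples / (n_classes * class_count), and the class_weight_dict built for the neural network applies the same formula by hand. The sketch below reproduces the arithmetic with the class counts printed by the training cell; the helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Weight per class: n_samples / (n_classes * count), the 'balanced' heuristic."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}

# Class counts printed by the training cell (Negative=95600, Positive=379595)
counts = {0: 95600, 1: 379595}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"class {label}: weight {weight:.4f}")

# After reweighting, each class contributes the same total weight
print(counts[0] * weights[0], counts[1] * weights[1])
```

The minority class (loans not paid back) ends up weighted roughly four times as heavily per sample as the majority class, which is the inverse of the class ratio.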


In [49]:
if TRAIN_NEURAL_NETWORK and 'nn_training_history' in locals():
    import matplotlib.pyplot as plt
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Plot 1: Loss
    axes[0, 0].plot(nn_training_history['loss'], label='Training Loss', linewidth=2)
    axes[0, 0].plot(nn_training_history['val_loss'], label='Validation Loss', linewidth=2)
    axes[0, 0].set_xlabel('Epoch')
    axes[0, 0].set_ylabel('Loss')
    axes[0, 0].set_title('Model Loss over Epochs')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    
    # Plot 2: AUC
    axes[0, 1].plot(nn_training_history['auc'], label='Training AUC', linewidth=2)
    axes[0, 1].plot(nn_training_history['val_auc'], label='Validation AUC', linewidth=2)
    axes[0, 1].set_xlabel('Epoch')
    axes[0, 1].set_ylabel('AUC')
    axes[0, 1].set_title('Model AUC over Epochs')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    
    # Plot 3: Accuracy
    axes[1, 0].plot(nn_training_history['accuracy'], label='Training Accuracy', linewidth=2)
    axes[1, 0].plot(nn_training_history['val_accuracy'], label='Validation Accuracy', linewidth=2)
    axes[1, 0].set_xlabel('Epoch')
    axes[1, 0].set_ylabel('Accuracy')
    axes[1, 0].set_title('Model Accuracy over Epochs')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)
    
    # Plot 4: Precision & Recall
    axes[1, 1].plot(nn_training_history['precision'], label='Training Precision', linewidth=2)
    axes[1, 1].plot(nn_training_history['val_precision'], label='Validation Precision', linewidth=2, linestyle='--')
    axes[1, 1].plot(nn_training_history['recall'], label='Training Recall', linewidth=2)
    axes[1, 1].plot(nn_training_history['val_recall'], label='Validation Recall', linewidth=2, linestyle='--')
    axes[1, 1].set_xlabel('Epoch')
    axes[1, 1].set_ylabel('Score')
    axes[1, 1].set_title('Precision & Recall over Epochs')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)
    
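    plt.tight_layout()
    os.makedirs('submissions', exist_ok=True)  # the folder is otherwise only created in the submission cell
    plt.savefig('submissions/nn_training_history.png', dpi=100, bbox_inches='tight')
    print("✓ Training history plot saved to: submissions/nn_training_history.png")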
    plt.show()
    
    # Print best epoch info
    best_epoch = np.argmax(nn_training_history['val_auc']) + 1
    best_val_auc = max(nn_training_history['val_auc'])
    print(f"\nBest validation AUC: {best_val_auc:.4f} at epoch {best_epoch}")
else:
    print("Neural Network not trained or history not available.")

Neural Network not trained or history not available.


## 13b. Visualize Neural Network Training (if applicable)

Plot training history to see how the neural network learned over epochs.

## 14. Evaluate Model on Test Split

Evaluate the trained model using multiple metrics: ROC-AUC, F1, precision, recall, and accuracy on the provided test split.

In [50]:
# Make predictions on test split using the best model
if model_type == 'NeuralNetwork':
    y_test_pred_proba = model.predict(X_test_processed, verbose=0).flatten()
    y_test_pred = (y_test_pred_proba > 0.5).astype(int)
else:  # RandomForest
    y_test_pred = model.predict(X_test_processed)
    y_test_pred_proba = model.predict_proba(X_test_processed)[:, 1]

# Calculate metrics
roc_auc = roc_auc_score(y_test_array, y_test_pred_proba)
f1 = f1_score(y_test_array, y_test_pred)
precision = precision_score(y_test_array, y_test_pred)
recall = recall_score(y_test_array, y_test_pred)
accuracy = accuracy_score(y_test_array, y_test_pred)

# Print metrics
print("="*70)
print(f"TEST SPLIT EVALUATION - {model_type}")
print("="*70)
print(f"ROC-AUC Score:  {roc_auc:.4f}")
print(f"F1 Score:       {f1:.4f}")
print(f"Precision:      {precision:.4f}")
print(f"Recall:         {recall:.4f}")
print(f"Accuracy:       {accuracy:.4f}")
print("="*70)

# Print classification report
print("\nCLASSIFICATION REPORT")
print("="*70)
print(classification_report(y_test_array, y_test_pred, target_names=['Not Paid Back', 'Paid Back']))

# If we trained both models, show comparison
if len(model_scores) > 1:
    print("\n" + "="*70)
    print("ALL MODELS COMPARISON ON TEST SPLIT")
    print("="*70)
    for name, val_score in sorted(model_scores.items(), key=lambda x: x[1], reverse=True):
        # Evaluate each model on test split
        if name == 'NeuralNetwork':
            temp_pred_proba = models[name].predict(X_test_processed, verbose=0).flatten()
        else:
            temp_pred_proba = models[name].predict_proba(X_test_processed)[:, 1]
        temp_roc = roc_auc_score(y_test_array, temp_pred_proba)
        print(f"{name:20s}: ROC-AUC = {temp_roc:.4f} {'🏆' if name == model_type else ''}")
    print("="*70)

TEST SPLIT EVALUATION - RandomForest
ROC-AUC Score:  0.9121
F1 Score:       0.9195
Precision:      0.9356
Recall:         0.9040
Accuracy:       0.8736

CLASSIFICATION REPORT
               precision    recall  f1-score   support

Not Paid Back       0.66      0.75      0.71     23900
    Paid Back       0.94      0.90      0.92     94899

     accuracy                           0.87    118799
    macro avg       0.80      0.83      0.81    118799
 weighted avg       0.88      0.87      0.88    118799
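The headline precision, recall, and F1 figures refer to the positive class (Paid Back), and all of them derive from the four confusion-matrix counts. A minimal sketch of those formulas follows; the confusion counts are toy values, not the notebook's actual confusion matrix, and the helper `binary_metrics` is hypothetical. The last lines check that the reported F1 is consistent with the reported precision and recall.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy for the positive class from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Toy confusion counts (illustrative only)
precision, recall, f1, accuracy = binary_metrics(tp=90, fp=10, fn=5, tn=20)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")

# Consistency check on the reported values: F1 is the harmonic mean of precision and recall
reported_precision, reported_recall = 0.9356, 0.9040
f1_from_report = 2 * reported_precision * reported_recall / (reported_precision + reported_recall)
print(f"F1 recomputed from reported precision and recall: {f1_from_report:.4f}")
```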



## 15. Load Real Test Data

Load the unlabeled competition test set (Data/test.csv, 254,569 rows) that the final submission is generated from.

In [51]:
# Load the real test data (no labels - for final submission)
real_test_df = pd.read_csv('Data/test.csv')

print(f"Real test data shape: {real_test_df.shape}")
print(f"Real test columns: {list(real_test_df.columns)}")
print(f"\nExpected submission rows: {len(real_test_df)}")
print(f"First few rows:")
print(real_test_df.head())

Real test data shape: (254569, 12)
Real test columns: ['id', 'annual_income', 'debt_to_income_ratio', 'credit_score', 'loan_amount', 'interest_rate', 'gender', 'marital_status', 'education_level', 'employment_status', 'loan_purpose', 'grade_subgrade']

Expected submission rows: 254569
First few rows:
       id  annual_income  debt_to_income_ratio  credit_score  loan_amount  \
0  593994       28781.05                 0.049           626     11461.42   
1  593995       46626.39                 0.093           732     15492.25   
2  593996       54954.89                 0.367           611      3796.41   
3  593997       25644.63                 0.110           671      6574.30   
4  593998       25169.64                 0.081           688     17696.89   

   interest_rate  gender marital_status education_level employment_status  \
0          14.73  Female         Single     High School          Employed   
1          12.85  Female        Married        Master's          Employed   
2   

In [52]:
# Preprocess real test data using the same transformers
# (Already fitted on training data)

# One-hot encode categorical features
real_test_cat_encoded = ohe.transform(real_test_df[categorical_cols])

# Scale numerical features
real_test_num_scaled = scaler.transform(real_test_df[numerical_cols])

# Combine features
real_test_processed = np.concatenate([real_test_num_scaled, real_test_cat_encoded], axis=1)

print(f"Real test preprocessed shape: {real_test_processed.shape}")
print(f"Features: {real_test_processed.shape[1]} ({len(numerical_cols)} numerical + {real_test_cat_encoded.shape[1]} categorical)")

Real test preprocessed shape: (254569, 60)
Features: 60 (5 numerical + 55 categorical)


In [53]:
# Generate predictions on real test data using the best model
if model_type == 'NeuralNetwork':
    real_test_pred_proba = model.predict(real_test_processed, verbose=0).flatten()
    real_test_predictions = (real_test_pred_proba > 0.5).astype(int)
else:  # RandomForest
    real_test_predictions = model.predict(real_test_processed)

print(f"Generating predictions with: {model_type}")
print(f"Real test predictions shape: {real_test_predictions.shape}")
print(f"Unique predictions: {np.unique(real_test_predictions)}")
print(f"Prediction distribution:")
print(f"  Class 0 (Not Paid): {(real_test_predictions == 0).sum()}")
print(f"  Class 1 (Paid): {(real_test_predictions == 1).sum()}")

# Verify there is exactly one prediction per row of test.csv
assert len(real_test_predictions) == len(real_test_df), f"Expected {len(real_test_df)} predictions, got {len(real_test_predictions)}"
print(f"\n✓ Correct number of predictions: {len(real_test_predictions)}")

Generating predictions with: RandomForest
Real test predictions shape: (254569,)
Unique predictions: [0. 1.]
Prediction distribution:
  Class 0 (Not Paid): 57453
  Class 1 (Paid): 197116

✓ Correct number of predictions: 254569


## 16. Preprocess Real Test Data

Apply the same preprocessing (one-hot encoding and scaling) to the real test data using the fitted transformers.

## 17. Generate Final Submission Predictions

Predict on the real test data (254,569 rows) to create the final submission file.

## 18. Save Validation Results and Final Submission

Create the submissions folder, then save the validation results (test_split evaluation) and the final submission (predictions for the real test.csv).

In [54]:
import os
import glob

# Create submissions directory
os.makedirs('submissions', exist_ok=True)

# Find the next submission number
existing_submissions = glob.glob('submissions/submission_*.csv')
if existing_submissions:
    # Extract numbers from filenames
    numbers = []
    for f in existing_submissions:
        try:
            num = int(f.split('_')[-1].replace('.csv', ''))
            numbers.append(num)
        except ValueError:  # filename suffix is not a number
            pass
    next_num = max(numbers) + 1 if numbers else 1
else:
    next_num = 1

# Generate submission filenames
submission_filename = f'submission_{next_num:03d}.csv'
submission_path = os.path.join('submissions', submission_filename)
validation_filename = f'validation_{next_num:03d}.csv'
validation_path = os.path.join('submissions', validation_filename)
metrics_filename = f'metrics_{next_num:03d}.csv'
metrics_path = os.path.join('submissions', metrics_filename)

# Save FINAL SUBMISSION (real test.csv predictions - 254,569 rows)
final_submission = pd.DataFrame({
    "id": real_test_df["id"].astype(int),
    "loan_paid_back": real_test_predictions.astype(int)
})
final_submission.to_csv(submission_path, index=False)
print(f"✓ FINAL SUBMISSION saved to: {submission_path}")
print(f"  Rows: {len(final_submission)} (should be 254,569)")

# Save VALIDATION RESULTS (test_split evaluation with labels)
validation_results = pd.DataFrame({
    "id": X_test["id"].astype(int) if "id" in X_test.columns else np.arange(len(y_test_array)),
    "y_true": y_test_array.astype(int),
    "y_pred": y_test_pred.astype(int),
    "y_proba": y_test_pred_proba.astype(float)
})
validation_results.to_csv(validation_path, index=False)
print(f"✓ Validation results saved to: {validation_path}")

# Save aggregate metrics with configuration info
metrics_df = pd.DataFrame([
    {
        "submission_num": next_num,
        "model_type": model_type,
        "use_smote": USE_SMOTE,
        "iterative_training": ENABLE_ITERATIVE_TRAINING,
        "roc_auc": roc_auc,
        "f1": f1,
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "train_samples": len(y_train_array),
        "validation_samples": len(y_test_array),
        "submission_samples": len(final_submission)
    }
])
metrics_df.to_csv(metrics_path, index=False)
print(f"✓ Metrics saved to: {metrics_path}")

# Also save final submission to root for easy access
final_submission.to_csv('submission.csv', index=False)
print(f"✓ Copy saved to: submission.csv")

print(f"\n{'='*70}")
print(f"SUBMISSION #{next_num}")
print(f"{'='*70}")
print(f"Model: {model_type}")
print(f"Training: {'SMOTE-balanced' if USE_SMOTE else 'Original'} ({len(y_train_array)} samples)")
print(f"Validation (test_split): {len(y_test_array)} samples")
print(f"  → ROC-AUC: {roc_auc:.4f} | F1: {f1:.4f} | Accuracy: {accuracy:.4f}")
print(f"Final Submission (test.csv): {len(final_submission)} predictions")
print(f"{'='*70}")
print("\nFirst few submission rows:")
print(final_submission.head(10))
print("\nLast few submission rows:")
print(final_submission.tail(10))

✓ FINAL SUBMISSION saved to: submissions/submission_001.csv
  Rows: 254569 (should be 254,569)
✓ Validation results saved to: submissions/validation_001.csv
✓ Metrics saved to: submissions/metrics_001.csv
✓ Copy saved to: submission.csv

SUBMISSION #1
Model: RandomForest
Training: Original (475195 samples)
Validation (test_split): 118799 samples
  → ROC-AUC: 0.9121 | F1: 0.9195 | Accuracy: 0.8736
Final Submission (test.csv): 254569 predictions

First few submission rows:
       id  loan_paid_back
0  593994               1
1  593995               1
2  593996               0
3  593997               1
4  593998               1
5  593999               1
6  594000               1
7  594001               1
8  594002               1
9  594003               0

Last few submission rows:
            id  loan_paid_back
254559  848553               1
254560  848554               1
254561  848555               1
254562  848556               0
254563  848557               1
254564  848558 

## 19. Save Model and Preprocessors

Save the trained model and preprocessing objects (scaler and encoder) for reuse.

In [55]:
import os

import joblib  # used below to serialize the model, scaler, and encoder

# Create models directory
os.makedirs('models', exist_ok=True)

# Save the best model (based on model type)
if model_type == 'NeuralNetwork':
    model_path = 'models/loan_model_nn.keras'
    model.save(model_path)
    print(f"✓ Neural Network model saved to: {model_path}")
else:  # RandomForest
    model_path = 'models/loan_model_rf.pkl'
    joblib.dump(model, model_path)
    print(f"✓ RandomForest model saved to: {model_path}")

# Save ALL trained models
if TRAIN_RANDOM_FOREST and 'RandomForest' in models:
    rf_path = 'models/loan_model_rf.pkl'
    joblib.dump(models['RandomForest'], rf_path)
    print(f"✓ RandomForest saved to: {rf_path}")

if TRAIN_NEURAL_NETWORK and 'NeuralNetwork' in models:
    nn_path = 'models/loan_model_nn.keras'
    models['NeuralNetwork'].save(nn_path)
    print(f"✓ Neural Network saved to: {nn_path}")

# Save the scaler
scaler_path = 'models/scaler.pkl'
joblib.dump(scaler, scaler_path)
print(f"✓ Scaler saved to: {scaler_path}")

# Save the encoder
encoder_path = 'models/encoder.pkl'
joblib.dump(ohe, encoder_path)
print(f"✓ Encoder saved to: {encoder_path}")

# Save model comparison results
if len(model_scores) > 1:
    comparison_df = pd.DataFrame([
        {"model": name, "roc_auc": score} 
        for name, score in sorted(model_scores.items(), key=lambda x: x[1], reverse=True)
    ])
    comparison_path = 'models/model_comparison.csv'
    comparison_df.to_csv(comparison_path, index=False)
    print(f"✓ Model comparison saved to: {comparison_path}")

print("\n" + "="*70)
print("✅ TRAINING COMPLETE")
print("="*70)
print(f"Best Model: {model_type}")
print(f"Configuration: USE_SMOTE = {USE_SMOTE}")
print(f"ROC-AUC: {roc_auc:.4f} | F1: {f1:.4f} | Accuracy: {accuracy:.4f}")
print(f"Submission saved in: submissions/")
if len(model_scores) > 1:
    print(f"\nModels trained: {', '.join(model_scores.keys())}")
    print(f"Winner: {model_type}")
print("="*70)

✓ RandomForest model saved to: models/loan_model_rf.pkl
✓ RandomForest saved to: models/loan_model_rf.pkl
✓ Scaler saved to: models/scaler.pkl
✓ Encoder saved to: models/encoder.pkl

✅ TRAINING COMPLETE
Best Model: RandomForest
Configuration: USE_SMOTE = False
ROC-AUC: 0.9121 | F1: 0.9195 | Accuracy: 0.8736
Submission saved in: submissions/
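
To reuse the saved artifacts later, they have to be loaded back and applied in the same order as during training. The following is a minimal sketch of that round trip, not the notebook's own inference code. It assumes the RandomForest path (`loan_model_rf.pkl`) and assumes the features are built by concatenating scaled numeric columns with one-hot encoded categorical columns; the helper names `load_artifacts` and `prepare_features` and the toy data are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def load_artifacts(model_dir="models"):
    """Load the model, scaler, and encoder using the file names from the cell above."""
    model = joblib.load(os.path.join(model_dir, "loan_model_rf.pkl"))
    scaler = joblib.load(os.path.join(model_dir, "scaler.pkl"))
    encoder = joblib.load(os.path.join(model_dir, "encoder.pkl"))
    return model, scaler, encoder


def prepare_features(numeric, categorical, scaler, encoder):
    """ASSUMPTION: features are scaled numeric columns followed by one-hot columns."""
    scaled = scaler.transform(numeric)
    encoded = encoder.transform(categorical)
    if hasattr(encoded, "toarray"):  # OneHotEncoder returns a sparse matrix by default
        encoded = encoded.toarray()
    return np.hstack([scaled, encoded])


# Toy data standing in for the real loan features (hypothetical values)
numeric = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
categorical = np.array([["a"], ["b"], ["a"], ["b"]])
labels = np.array([0, 1, 0, 1])

# Fit the preprocessors and a small model, mirroring the training step
scaler = StandardScaler().fit(numeric)
ohe = OneHotEncoder(handle_unknown="ignore").fit(categorical)
features = prepare_features(numeric, categorical, scaler, ohe)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(features, labels)
original_preds = model.predict(features)

with tempfile.TemporaryDirectory() as model_dir:
    # Save with the same file names the notebook uses
    joblib.dump(model, os.path.join(model_dir, "loan_model_rf.pkl"))
    joblib.dump(scaler, os.path.join(model_dir, "scaler.pkl"))
    joblib.dump(ohe, os.path.join(model_dir, "encoder.pkl"))

    # Reload and predict again; the results should match the in-memory model
    loaded_model, loaded_scaler, loaded_ohe = load_artifacts(model_dir)
    reloaded_features = prepare_features(numeric, categorical, loaded_scaler, loaded_ohe)
    reloaded_preds = loaded_model.predict(reloaded_features)

print("Predictions match after reload:", np.array_equal(original_preds, reloaded_preds))
```

Files written by `joblib.dump` are pickle-based, so they are generally only safe to load with library versions compatible with the ones used for training, and only from sources you trust. A Keras model saved as `.keras` would need to be loaded through Keras rather than `joblib`.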


In [56]:
import glob
import pandas as pd

# Find all metrics files
metrics_files = sorted(glob.glob('submissions/metrics_*.csv'))

if metrics_files:
    # Load and combine all metrics
    all_metrics = []
    for f in metrics_files:
        try:
            df = pd.read_csv(f)
            all_metrics.append(df)
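        except (OSError, pd.errors.EmptyDataError, pd.errors.ParserError) as exc:
            # Skip unreadable or malformed files, but report them instead of failing silently
            print(f"Skipping {f}: {exc}")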
    
    if all_metrics:
        history_df = pd.concat(all_metrics, ignore_index=True)
        history_df = history_df.sort_values('submission_num')
        
        print("="*80)
        print("SUBMISSION HISTORY")
        print("="*80)
        print(history_df.to_string(index=False))
        print("="*80)
        
        # Find best submission by ROC-AUC
        best_idx = history_df['roc_auc'].idxmax()
        best_sub = history_df.loc[best_idx]
        print(f"\n🏆 BEST SUBMISSION: #{int(best_sub['submission_num'])} with ROC-AUC = {best_sub['roc_auc']:.4f}")
        print(f"   Configuration: USE_SMOTE = {bool(best_sub['use_smote'])}")
    else:
        print("No valid metrics files found.")
else:
    print("No submission history yet. This is your first submission!")

SUBMISSION HISTORY
 submission_num   model_type  use_smote  iterative_training  roc_auc       f1  precision   recall  accuracy  train_samples  validation_samples  submission_samples
              1 RandomForest      False               False 0.912135 0.919517   0.935619 0.903961  0.873593         475195              118799              254569

🏆 BEST SUBMISSION: #1 with ROC-AUC = 0.9121
   Configuration: USE_SMOTE = False
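
The metrics in the history row can be sanity-checked against each other. F1 is the harmonic mean of precision and recall, so recomputing it from the `precision` and `recall` columns should reproduce the `f1` column up to rounding. The sketch below does this for submission #1, assuming the notebook reports the standard binary F1 score; the variable names are illustrative.

```python
# Values copied from the submission history row for submission #1
precision = 0.935619
recall = 0.903961
reported_f1 = 0.919517

# F1 is the harmonic mean of precision and recall
f1_check = 2 * precision * recall / (precision + recall)

print(f"Recomputed F1: {f1_check:.6f} (reported: {reported_f1:.6f})")
print(f"Absolute difference: {abs(f1_check - reported_f1):.2e}")
```

A difference in the last printed digit is expected, because the inputs are already rounded to six decimal places. A larger gap would suggest that the metrics in a row were computed on different data or with a different averaging mode.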


## 20. View Submission History

Display all previous submissions and their metrics for comparison.