# Load and Split Loan Data

This notebook loads the training and test datasets, performs a stratified train-validation split on the training data, preprocesses the features, trains a RandomForest classifier, and writes a submission file.

## 1. Import Required Libraries

Import pandas for data manipulation and train_test_split from sklearn.model_selection for splitting the data.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

## 2. Load Training and Test Data

Load the train.csv and test.csv files from the Data directory using the pandas read_csv() function.

In [2]:
# Load the training and test datasets
train_df = pd.read_csv('Data/train.csv')
test_df = pd.read_csv('Data/test.csv')

print(f"Training data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")

Training data shape: (593994, 13)
Test data shape: (254569, 12)


## 3. Prepare Features and Target Variable

Separate the features (X) from the target variable (y) by dropping the 'loan_paid_back' column from the training dataframe.

In [3]:
# Separate features and target variable
X = train_df.drop("loan_paid_back", axis=1)
y = train_df["loan_paid_back"]

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

Features shape: (593994, 12)
Target shape: (593994,)


## 4. Perform Stratified Train-Validation Split

Use train_test_split with test_size=0.2, stratify=y, and random_state=42 to split the data into 80% training and 20% validation sets while maintaining class distribution.

In [4]:
# Perform stratified train-validation split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, 
    test_size=0.2, 
    stratify=y, 
    random_state=42
)

print(f"X_train shape: {X_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_val shape: {y_val.shape}")
print(f"\nClass distribution in y_train:\n{y_train.value_counts(normalize=True).sort_index()}")
print(f"\nClass distribution in y_val:\n{y_val.value_counts(normalize=True).sort_index()}")

X_train shape: (475195, 12)
X_val shape: (118799, 12)
y_train shape: (475195,)
y_val shape: (118799,)

Class distribution in y_train:
loan_paid_back
0.0    0.201181
1.0    0.798819
Name: proportion, dtype: float64

Class distribution in y_val:
loan_paid_back
0.0    0.20118
1.0    0.79882
Name: proportion, dtype: float64
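
The effect of stratify=y can be checked on a small synthetic target. The sketch below is illustrative only: the arrays and the `_demo` names are made up and are not part of the loan dataset. It compares the validation-set positive rate with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 1,000 rows with a 20/80 class imbalance
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 200 + [1] * 800)

# Stratified split: the validation set keeps the 20/80 ratio
_, _, _, y_val_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Plain random split: the ratio is only approximately preserved
_, _, _, y_val_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

strat_pos_rate = y_val_strat.mean()
plain_pos_rate = y_val_plain.mean()
print(f"Stratified positive rate:   {strat_pos_rate:.3f}")
print(f"Unstratified positive rate: {plain_pos_rate:.3f}")
```

With nearly 600,000 rows the unstratified drift would be small, but stratifying removes it at no cost and keeps the validation metrics comparable to the training distribution.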


## 5. Prepare Test Features

Extract the features from the test dataset (test.csv doesn't have the target column 'loan_paid_back').

In [5]:
# Test data doesn't have the target column, so use it as-is for features
X_test = test_df.copy()

print(f"X_test shape: {X_test.shape}")
print(f"Test data columns: {list(X_test.columns)}")

X_test shape: (254569, 12)
Test data columns: ['id', 'annual_income', 'debt_to_income_ratio', 'credit_score', 'loan_amount', 'interest_rate', 'gender', 'marital_status', 'education_level', 'employment_status', 'loan_purpose', 'grade_subgrade']


## 6. Identify Categorical and Numerical Columns

Separate columns into categorical and numerical types for appropriate preprocessing.

In [6]:
# Identify categorical columns (object dtype)
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()

# Identify numerical columns (excluding 'id' if present, as it's not a useful feature)
numerical_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
if 'id' in numerical_cols:
    numerical_cols.remove('id')

print(f"Categorical columns ({len(categorical_cols)}): {categorical_cols}")
print(f"Numerical columns ({len(numerical_cols)}): {numerical_cols}")

Categorical columns (6): ['gender', 'marital_status', 'education_level', 'employment_status', 'loan_purpose', 'grade_subgrade']
Numerical columns (5): ['annual_income', 'debt_to_income_ratio', 'credit_score', 'loan_amount', 'interest_rate']


## 7. Import Preprocessing Tools

Import OneHotEncoder for categorical features and StandardScaler for numerical features from sklearn.preprocessing, plus NumPy for the array operations used later.

In [7]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import numpy as np

## 8. One-Hot Encode Categorical Features

Fit the OneHotEncoder on the training data and transform train, validation, and test sets.

In [8]:
# Initialize and fit OneHotEncoder on training data
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe.fit(X_train[categorical_cols])

# Transform categorical columns for all datasets
X_train_cat_encoded = ohe.transform(X_train[categorical_cols])
X_val_cat_encoded = ohe.transform(X_val[categorical_cols])
X_test_cat_encoded = ohe.transform(X_test[categorical_cols])

print(f"Encoded categorical features shape (train): {X_train_cat_encoded.shape}")
print(f"Encoded categorical features shape (val): {X_val_cat_encoded.shape}")
print(f"Encoded categorical features shape (test): {X_test_cat_encoded.shape}")
print(f"Total one-hot encoded features: {X_train_cat_encoded.shape[1]}")

Encoded categorical features shape (train): (475195, 55)
Encoded categorical features shape (val): (118799, 55)
Encoded categorical features shape (test): (254569, 55)
Total one-hot encoded features: 55


## 9. Scale Numerical Features

Fit the StandardScaler on the training data and transform train, validation, and test sets.

In [9]:
# Initialize and fit StandardScaler on training data
scaler = StandardScaler()
scaler.fit(X_train[numerical_cols])

# Transform numerical columns for all datasets
X_train_num_scaled = scaler.transform(X_train[numerical_cols])
X_val_num_scaled = scaler.transform(X_val[numerical_cols])
X_test_num_scaled = scaler.transform(X_test[numerical_cols])

print(f"Scaled numerical features shape (train): {X_train_num_scaled.shape}")
print(f"Scaled numerical features shape (val): {X_val_num_scaled.shape}")
print(f"Scaled numerical features shape (test): {X_test_num_scaled.shape}")

Scaled numerical features shape (train): (475195, 5)
Scaled numerical features shape (val): (118799, 5)
Scaled numerical features shape (test): (254569, 5)
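
Fitting the scaler on the training split only means the validation and test sets are standardized with the training mean and standard deviation, so no information leaks from held-out data. A minimal sketch with made-up numbers (the `_demo` arrays are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative single-feature arrays
train_demo = np.array([[10.0], [20.0], [30.0]])
val_demo = np.array([[20.0], [40.0]])

# Statistics come from the training data only
scaler_demo = StandardScaler().fit(train_demo)
train_scaled = scaler_demo.transform(train_demo)
val_scaled = scaler_demo.transform(val_demo)

print(f"Training mean used for scaling: {scaler_demo.mean_[0]}")
print(f"Scaled training mean: {train_scaled.mean():.3f}")
print(f"Scaled validation mean: {val_scaled.mean():.3f}")  # not zero, by design
```

The scaled validation mean is not zero because the validation values are measured against the training distribution, which is the behavior the model will see at prediction time.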


## 10. Combine Encoded and Scaled Features

Concatenate the one-hot encoded categorical features with scaled numerical features into final NumPy arrays.

In [10]:
# Combine categorical and numerical features
X_train_processed = np.concatenate([X_train_num_scaled, X_train_cat_encoded], axis=1)
X_val_processed = np.concatenate([X_val_num_scaled, X_val_cat_encoded], axis=1)
X_test_processed = np.concatenate([X_test_num_scaled, X_test_cat_encoded], axis=1)

# Convert target variables to NumPy arrays
y_train_array = y_train.values
y_val_array = y_val.values

print("Final preprocessed data shapes:")
print(f"  X_train_processed: {X_train_processed.shape}")
print(f"  X_val_processed: {X_val_processed.shape}")
print(f"  X_test_processed: {X_test_processed.shape}")
print(f"  y_train_array: {y_train_array.shape}")
print(f"  y_val_array: {y_val_array.shape}")
print(f"\nTotal features: {X_train_processed.shape[1]} ({len(numerical_cols)} numerical + {X_train_cat_encoded.shape[1]} categorical)")

Final preprocessed data shapes:
  X_train_processed: (475195, 60)
  X_val_processed: (118799, 60)
  X_test_processed: (254569, 60)
  y_train_array: (475195,)
  y_val_array: (118799,)

Total features: 60 (5 numerical + 55 categorical)


## 11. Save Preprocessed Data

Save the preprocessed NumPy arrays to disk for model training and evaluation.

In [11]:
import os

# Create directory for preprocessed data
os.makedirs('Data/preprocessed', exist_ok=True)

# Save preprocessed arrays
np.save('Data/preprocessed/X_train.npy', X_train_processed)
np.save('Data/preprocessed/X_val.npy', X_val_processed)
np.save('Data/preprocessed/X_test.npy', X_test_processed)
np.save('Data/preprocessed/y_train.npy', y_train_array)
np.save('Data/preprocessed/y_val.npy', y_val_array)

print("Preprocessed data saved to Data/preprocessed/:")
print("  - X_train.npy")
print("  - X_val.npy")
print("  - X_test.npy")
print("  - y_train.npy")
print("  - y_val.npy")

Preprocessed data saved to Data/preprocessed/:
  - X_train.npy
  - X_val.npy
  - X_test.npy
  - y_train.npy
  - y_val.npy
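
Arrays written with np.save can be read back with np.load, preserving shape and dtype. The sketch below checks a round trip in a temporary directory, so it does not touch Data/preprocessed; the file name X_demo.npy is hypothetical.

```python
import os
import tempfile

import numpy as np

# Illustrative array standing in for a preprocessed feature matrix
arr = np.arange(6, dtype=np.float64).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X_demo.npy")
    np.save(path, arr)
    restored = np.load(path)

# Shape, dtype, and values should all survive the round trip
round_trip_ok = np.array_equal(arr, restored) and arr.dtype == restored.dtype
print(f"Round trip preserved the array: {round_trip_ok}")
```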


## 12. Train RandomForest Model

Train a RandomForest classifier with balanced class weights to handle class imbalance.

In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    roc_auc_score,
    f1_score,
    precision_score,
    recall_score,
    accuracy_score,
)
import joblib

# Report class counts for reference; the model itself relies on class_weight='balanced' below
n_neg = (y_train_array == 0).sum()
n_pos = (y_train_array == 1).sum()
class_weight_ratio = n_neg / n_pos

print(f"Class counts: Negative={n_neg}, Positive={n_pos}")
print(f"Class weight ratio (neg/pos) = {class_weight_ratio:.4f}")

# Initialize RandomForest classifier with balanced class weights
model = RandomForestClassifier(
    class_weight='balanced',  # Automatically adjusts weights inversely proportional to class frequencies
    random_state=42,
    n_estimators=100,
    max_depth=15,
    min_samples_split=10,
    min_samples_leaf=4,
    n_jobs=-1  # Use all CPU cores
)

# Train the model
print("\nTraining RandomForest model...")
model.fit(X_train_processed, y_train_array)
print("✓ Model training complete!")

Class counts: Negative=95600, Positive=379595
Class weight ratio (neg/pos) = 0.2518

Training RandomForest model...
✓ Model training complete!
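
For reference, class_weight='balanced' assigns each class the weight n_samples / (n_classes * class_count). The sketch below reproduces those weights from the class counts printed above using scikit-learn's compute_class_weight helper; the `y_demo` array is a synthetic stand-in for y_train_array.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts copied from the training output above
n_neg, n_pos = 95600, 379595
y_demo = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])

# Weights as scikit-learn computes them for class_weight='balanced'
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0.0, 1.0]), y=y_demo
)

# The same weights from the formula n_samples / (n_classes * class_count)
manual = len(y_demo) / (2 * np.array([n_neg, n_pos]))

print(f"Weight for class 0 (not paid back): {weights[0]:.4f}")
print(f"Weight for class 1 (paid back):     {weights[1]:.4f}")
```

The minority class receives the larger weight, and the ratio of the two weights equals n_pos / n_neg, the inverse of the class weight ratio printed in the cell above.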


## 13. Evaluate Model on Validation Set

Evaluate the trained model using multiple metrics: ROC-AUC, F1, precision, recall, and accuracy.

In [13]:
# Make predictions on validation set
y_val_pred = model.predict(X_val_processed)
y_val_pred_proba = model.predict_proba(X_val_processed)[:, 1]

# Calculate metrics
roc_auc = roc_auc_score(y_val_array, y_val_pred_proba)
f1 = f1_score(y_val_array, y_val_pred)
precision = precision_score(y_val_array, y_val_pred)
recall = recall_score(y_val_array, y_val_pred)
accuracy = accuracy_score(y_val_array, y_val_pred)

# Print metrics
print("="*60)
print("VALIDATION SET METRICS")
print("="*60)
print(f"ROC-AUC Score:  {roc_auc:.4f}")
print(f"F1 Score:       {f1:.4f}")
print(f"Precision:      {precision:.4f}")
print(f"Recall:         {recall:.4f}")
print(f"Accuracy:       {accuracy:.4f}")
print("="*60)

# Print classification report
print("\nCLASSIFICATION REPORT")
print("="*60)
print(classification_report(y_val_array, y_val_pred, target_names=['Not Paid Back', 'Paid Back']))

VALIDATION SET METRICS
ROC-AUC Score:  0.9121
F1 Score:       0.9195
Precision:      0.9356
Recall:         0.9040
Accuracy:       0.8736

CLASSIFICATION REPORT
               precision    recall  f1-score   support

Not Paid Back       0.66      0.75      0.71     23900
    Paid Back       0.94      0.90      0.92     94899

     accuracy                           0.87    118799
    macro avg       0.80      0.83      0.81    118799
 weighted avg       0.88      0.87      0.88    118799
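
ROC-AUC is computed from the predicted probabilities, while F1, precision, recall, and accuracy depend on the hard labels, which for a binary classifier correspond roughly to a 0.5 cutoff on the positive-class probability. The sketch below uses made-up labels and scores to show how moving the cutoff trades recall for precision while ROC-AUC stays fixed.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Illustrative labels and positive-class scores
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_score = np.array([0.2, 0.45, 0.6, 0.4, 0.55, 0.7, 0.8, 0.9])

# ROC-AUC depends only on how the scores rank the rows
auc = roc_auc_score(y_true, y_score)

# Precision and recall depend on the chosen cutoff
results = {}
for threshold in (0.5, 0.65):
    y_pred = (y_score >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
    )

print(f"ROC-AUC: {auc:.2f}")
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```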



## 14. Generate Test Predictions

Apply the trained model to the preprocessed test set to generate class-label predictions (0 or 1).

In [14]:
# Generate predictions on test set
test_predictions = model.predict(X_test_processed)

print(f"Test predictions shape: {test_predictions.shape}")
print(f"Unique predictions: {np.unique(test_predictions)}")
print("Prediction distribution:")
print(f"  Class 0 (Not Paid): {(test_predictions == 0).sum()}")
print(f"  Class 1 (Paid): {(test_predictions == 1).sum()}")

Test predictions shape: (254569,)
Unique predictions: [0. 1.]
Prediction distribution:
  Class 0 (Not Paid): 57453
  Class 1 (Paid): 197116


## 15. Create Submission File

Create a submission DataFrame matching the format of sample_submission.csv with id and loan_paid_back columns.

In [15]:
# Create submission DataFrame
submission = pd.DataFrame({
    "id": test_df["id"],
    "loan_paid_back": test_predictions
})

# Verify format matches sample_submission.csv
sample_sub = pd.read_csv('Data/sample_submission.csv')
print("Sample submission columns:", list(sample_sub.columns))
print("Our submission columns:", list(submission.columns))
print(f"\nSample submission shape: {sample_sub.shape}")
print(f"Our submission shape: {submission.shape}")

# Ensure data types match
submission['id'] = submission['id'].astype(int)
submission['loan_paid_back'] = submission['loan_paid_back'].astype(int)

print("\nFirst few rows of submission:")
print(submission.head(10))
print("\nLast few rows of submission:")
print(submission.tail(10))

Sample submission columns: ['id', 'loan_paid_back']
Our submission columns: ['id', 'loan_paid_back']

Sample submission shape: (254569, 2)
Our submission shape: (254569, 2)

First few rows of submission:
       id  loan_paid_back
0  593994               1
1  593995               1
2  593996               0
3  593997               1
4  593998               1
5  593999               1
6  594000               1
7  594001               1
8  594002               1
9  594003               0

Last few rows of submission:
            id  loan_paid_back
254559  848553               1
254560  848554               1
254561  848555               1
254562  848556               0
254563  848557               1
254564  848558               1
254565  848559               1
254566  848560               1
254567  848561               1
254568  848562               1


## 16. Save Model and Submission

Save the trained model, preprocessing objects (scaler and encoder), and the submission CSV file.

In [17]:
import os

# Create models directory
os.makedirs('models', exist_ok=True)

# Save the trained model
model_path = 'models/loan_model.pkl'
joblib.dump(model, model_path)
print(f"✓ Model saved to: {model_path}")

# Save the scaler
scaler_path = 'models/scaler.pkl'
joblib.dump(scaler, scaler_path)
print(f"✓ Scaler saved to: {scaler_path}")

# Save the encoder
encoder_path = 'models/encoder.pkl'
joblib.dump(ohe, encoder_path)
print(f"✓ Encoder saved to: {encoder_path}")

# Save submission file
submission_path = 'submission.csv'
submission.to_csv(submission_path, index=False)
print(f"✓ Submission file saved to: {submission_path}")

print("\n" + "="*60)
print("✅ Model trained and submission file generated successfully")
print("="*60)

✓ Model saved to: models/loan_model.pkl
✓ Scaler saved to: models/scaler.pkl
✓ Encoder saved to: models/encoder.pkl
✓ Submission file saved to: submission.csv

✅ Model trained and submission file generated successfully
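
The saved artifacts can later be restored with joblib.load and applied in the same order as during training: transform first, then predict. The sketch below demonstrates the round trip with tiny stand-in objects in a temporary directory; the data and `_demo` names are illustrative, and the encoder is left out to keep the example short.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Tiny illustrative stand-ins for the real scaler and model
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y_demo = np.array([0, 0, 1, 1])
scaler_demo = StandardScaler().fit(X_demo)
model_demo = RandomForestClassifier(n_estimators=10, random_state=42)
model_demo.fit(scaler_demo.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Persist both artifacts, then load them back
    joblib.dump(scaler_demo, os.path.join(tmp_dir, "scaler.pkl"))
    joblib.dump(model_demo, os.path.join(tmp_dir, "loan_model.pkl"))
    loaded_scaler = joblib.load(os.path.join(tmp_dir, "scaler.pkl"))
    loaded_model = joblib.load(os.path.join(tmp_dir, "loan_model.pkl"))

# The reloaded objects should reproduce the original predictions
original_pred = model_demo.predict(scaler_demo.transform(X_demo))
reloaded_pred = loaded_model.predict(loaded_scaler.transform(X_demo))
same_predictions = np.array_equal(original_pred, reloaded_pred)
print(f"Reloaded artifacts reproduce predictions: {same_predictions}")
```

Pickled scikit-learn objects are generally only safe to load with the same library version that saved them, so recording the scikit-learn version alongside the files is a reasonable precaution.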
