# Classification Workflow: Loan Creditworthiness Prediction

## 1. Introduction

This notebook demonstrates a comprehensive classification workflow for predicting loan creditworthiness based on various features such as income, credit history, and property area. 
The dataset contains loan application data with 13 features, including categorical (e.g., Gender, Education) and numerical (e.g., ApplicantIncome, LoanAmount) variables.

- **Goal:** Build a classification model to predict whether a loan application will be approved (Loan_Status) using the given features.

## 2. Exploratory Data Analysis (EDA)
This section includes visualizations and insights to **understand the dataset.**

### Importing necessary libraries

In [None]:
!pip show category_encoders

In [None]:
import sys
print(sys.executable)

There was a mismatch between the Python environment where `category_encoders` was installed and the one this Jupyter kernel uses, so I had to install `category_encoders` into the kernel's environment. I did this by running `"correct_filepath" -m pip install category_encoders` in my command prompt, where `correct_filepath` is the output of `sys.executable`.

- **You can ignore the previous two cells; you most likely won't need them.**
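
The same fix can be sketched from inside the notebook by building the `pip` command from `sys.executable`, so the package is installed into the environment the kernel is actually running. The helper name `pip_install_command` is hypothetical, and the install call itself is left commented out.

```python
import sys
import subprocess  # only needed if you uncomment the call below


def pip_install_command(package):
    """Return the pip command bound to the interpreter running this kernel."""
    return [sys.executable, "-m", "pip", "install", package]


cmd = pip_install_command("category_encoders")
print(" ".join(cmd))

# Uncomment to run the install in the kernel's own environment:
# subprocess.check_call(cmd)
```

In recent IPython versions, the `%pip install category_encoders` magic should have the same effect.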

In [None]:
# Uncomment the following line to install category_encoders, used for target encoding (first time only)
# !pip install category_encoders

In [None]:
import pandas as pd              # For data manipulation and analysis
import numpy as np               # For numerical operations
import matplotlib.pyplot as plt  # For visualization
%matplotlib inline
import seaborn as sns            # For statistical plots

import category_encoders as ce                         # For target encoding
from sklearn.model_selection import train_test_split   # For splitting the data into train and test sets
from sklearn.preprocessing import StandardScaler       # For scaling numerical features

# For Evaluation
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

### Loading the data

In [None]:
train_data = pd.read_csv('Train.csv')
test_data = pd.read_csv('Test.csv')

In [None]:
train_data.head()

In [None]:
test_data.head()

### Summaries

In [None]:
# Get a brief summary of the data
train_data.info()

The non-null counts show that there are no missing values.

In [None]:
train_data['Loan_Status'].value_counts()

In [None]:
# Statistical summary of numeric columns
train_data.describe()

- Several columns, such as `Gender` and `Married`, are stored as numbers even though they represent categories.
- Their means and quartiles show that they are also heavily imbalanced: each is either mostly 1 or mostly 0.

In [None]:
# Statistical summary of objects
train_data.describe(include = 'object')

- `Loan_ID` has too many unique values for one-hot encoding, so we'll use target encoding for it.

#### Datatype adjustments

In [None]:
# Convert relevant columns to categorical data types so we can do relevant EDA on them
train_data['Gender'] = train_data['Gender'].astype('object')
train_data['Married'] = train_data['Married'].astype('object')
train_data['Education'] = train_data['Education'].astype('object')
train_data['Self_Employed'] = train_data['Self_Employed'].astype('object')
train_data['Credit_History'] = train_data['Credit_History'].astype('object')
train_data['Property_Area'] = train_data['Property_Area'].astype('object')

# Confirm the data types
print(train_data.dtypes)

#### Univariate Analysis

In [None]:
# Set up plotting style
sns.set(style="whitegrid")

# Categorical Variables Univariate Analysis
categorical_columns = train_data.select_dtypes(include=['object']).columns

# Loop through categorical columns for count plots
for col in categorical_columns:
    plt.figure(figsize=(8, 4))
    sns.countplot(data=train_data, x=col)
    plt.title(f'Distribution of {col}')
    plt.show()

In [None]:
# Numerical Variables Univariate Analysis
numerical_columns = train_data.select_dtypes(include=['int64', 'float64']).columns

# Loop through numerical columns for histograms
for col in numerical_columns:
    plt.figure(figsize=(8, 4))
    sns.histplot(train_data[col], bins=30, kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()

#### Bivariate Analysis
Here, we’ll examine relationships between `Loan_Status` and each of the other features. This will give us insights into how each feature might influence the outcome.

In [None]:
# Bivariate Analysis with Categorical Variables vs Target Variable
for col in categorical_columns:
    plt.figure(figsize=(8, 4))
    sns.countplot(data=train_data, x=col, hue='Loan_Status')
    plt.title(f'{col} vs Loan_Status')
    plt.show()

In [None]:
# Bivariate Analysis with Numerical Variables vs Target Variable
for col in numerical_columns:
    plt.figure(figsize=(8, 4))
    sns.boxplot(data=train_data, x='Loan_Status', y=col)
    plt.title(f'{col} vs Loan_Status')
    plt.show()

In [None]:
# Correlation heatmap for numerical features
plt.figure(figsize=(10, 8))
sns.heatmap(train_data[numerical_columns].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Numerical Features')
plt.show()

## 3. Preprocessing and Feature Engineering
This section covers data cleaning, handling missing values, and preparing the dataset for analysis as well as creating new features and selecting the most relevant ones.

#### Deal with outliers

In [None]:
# List of columns to check for outliers
# (TotalIncome is only created later in this section, so it is not included here)
columns_to_check = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount']

# Plot boxplots for each column
plt.figure(figsize=(12, 8))
for i, col in enumerate(columns_to_check, 1):
    plt.subplot(2, 2, i)
    sns.boxplot(data=train_data, y=col)
    plt.title(f"Boxplot of {col}")
plt.tight_layout()
plt.show()

Treating the outliers doesn't help, so we leave them as they are.

### Target Encoding on Loan_ID
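- Target encoding can be helpful for categorical variables with high cardinality, like `Loan_ID` in this case (`Loan_ID` has too many unique values for one-hot encoding).
- In target encoding, each category in `Loan_ID` is replaced with the mean target value (`Loan_Status`) for that category.
- Caveat: if every `Loan_ID` appears only once, IDs in the test set are unseen during fitting, so the encoded feature carries little information that generalizes. Treat it with caution.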
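
To make the idea concrete, here is a minimal sketch of plain mean target encoding on a low-cardinality column. The toy values are invented for illustration, and this version applies no smoothing.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Urban", "Semiurban", "Rural", "Urban"],
    "Loan_Status": [1, 0, 1, 1, 1, 0],
})

# Mean of the target for each category
category_means = toy.groupby("Property_Area")["Loan_Status"].mean()

# Replace each category with its mean target value
toy["Property_Area_encoded"] = toy["Property_Area"].map(category_means)
print(toy)
```

`ce.TargetEncoder` additionally smooths each category mean towards the overall mean, so its output will differ from this plain version for small categories.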

In [None]:
# Define the Target Encoder for Loan_ID
target_encoder = ce.TargetEncoder(cols=['Loan_ID'])

In [None]:
# Fit the encoder on the training set and transform both train and test sets
train_data['Loan_ID_encoded'] = target_encoder.fit_transform(train_data['Loan_ID'], train_data['Loan_Status'])
test_data['Loan_ID_encoded'] = target_encoder.transform(test_data['Loan_ID'])

### Combine Train and Test Data for Preprocessing
To ensure consistency in preprocessing, we'll combine the train and test sets, apply preprocessing, and split them back afterwards.

In [None]:
# Add a source column so that we can easily split back later
train_data['source'] = 'train'
test_data['source'] = 'test'

# Combine train and test sets
X_combined = pd.concat([train_data.drop(columns=['Loan_Status']), test_data])

# Separate the target variable from features
y = train_data['Loan_Status']

In [None]:
X_combined.head()
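
The combine, transform, and split-back pattern can be seen end to end on a toy example. The frames and the `Income_k` column below are invented for illustration.

```python
import pandas as pd

# Toy frames, invented for illustration only
toy_train = pd.DataFrame({"ApplicantIncome": [5000, 3000], "Loan_Status": [1, 0]})
toy_test = pd.DataFrame({"ApplicantIncome": [4000]})

# Tag each row with its origin
toy_train["source"] = "train"
toy_test["source"] = "test"

# Stack features only; the target stays with the training rows
combined = pd.concat(
    [toy_train.drop(columns=["Loan_Status"]), toy_test],
    ignore_index=True,
)

# Apply a transformation once to all rows
combined["Income_k"] = combined["ApplicantIncome"] / 1000

# Split back using the marker column
train_part = combined[combined["source"] == "train"].drop(columns=["source"])
test_part = combined[combined["source"] == "test"].drop(columns=["source"])
print(train_part)
print(test_part)
```

This works well for row-wise transformations like the ones in this section. Anything that learns statistics from the data (for example a scaler) is safer to fit on the training rows only.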

### Create `TotalIncome` and `IncomeRatio` Columns
- Summing `ApplicantIncome` and `CoapplicantIncome` gives a combined financial profile of the borrower, which can be a useful indicator of total earning potential.
- The ratio provides insight into the relative contributions of the primary applicant versus the co-applicant. For instance, a very high ratio (where `ApplicantIncome` dominates) could signal a single-income household, while a balanced ratio might indicate both parties contribute meaningfully.

In [None]:
# Create TotalIncome as the sum of ApplicantIncome and CoapplicantIncome
X_combined['TotalIncome'] = X_combined['ApplicantIncome'] + X_combined['CoapplicantIncome']

In [None]:
# Create IncomeRatio, handling cases where CoapplicantIncome might be zero
X_combined['IncomeRatio'] = X_combined['ApplicantIncome'] / (X_combined['CoapplicantIncome'] + 1e-5)
# The 1e-5 is added to avoid division by zero in some entries in `CoapplicantIncome`
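
A quick sketch of how the `1e-5` offset behaves. When `CoapplicantIncome` is zero, the ratio becomes very large, which is worth keeping in mind for scale-sensitive models. The `applicant_share` helper is a hypothetical bounded alternative and is not used in this notebook.

```python
def income_ratio(applicant_income, coapplicant_income, eps=1e-5):
    """ApplicantIncome / (CoapplicantIncome + eps), as in the cell above."""
    return applicant_income / (coapplicant_income + eps)


def applicant_share(applicant_income, coapplicant_income):
    """Hypothetical bounded alternative: applicant's share of total income."""
    total = applicant_income + coapplicant_income
    return applicant_income / total if total > 0 else 0.0


print(income_ratio(5000, 2500))     # close to 2
print(income_ratio(5000, 0))        # very large, on the order of 5e8
print(applicant_share(5000, 2500))  # about 0.667
print(applicant_share(5000, 0))     # 1.0
```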

### Create a `FamilySize` feature by combining `Dependents` and `Married`
The idea is that if `Married` is 1, it implies there is a spouse, so the `FamilySize` is `Dependents` + 2. If `Married` is 0, there’s no spouse, so `FamilySize` would be `Dependents` + 1. 

In [None]:
X_combined['Dependents'].value_counts()

In [None]:
# Replace '3+' with 3 and convert Dependents to integer
X_combined['Dependents'] = X_combined['Dependents'].replace('3+', 3).astype(int)

In [None]:
# Create FamilySize feature based on Dependents and Married columns
X_combined['FamilySize'] = X_combined['Dependents'] + 1  # Adding 1 for the applicant
X_combined['FamilySize'] += X_combined['Married']  # Add 1 if married (indicating a spouse)
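
The rule can be checked by hand on a few cases. The sketch below restates it as a plain function (`family_size` is a hypothetical helper, not part of the notebook), assuming `Married` is coded as 0 or 1.

```python
def family_size(dependents, married):
    """Applicant (1) + spouse (1 if married) + dependents; '3+' counts as 3."""
    if dependents == "3+":
        dependents = 3
    return int(dependents) + 1 + int(married)


examples = [("0", 0), ("2", 1), ("3+", 1)]
for dependents, married in examples:
    print(dependents, married, family_size(dependents, married))
```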

### Convert `Loan_Amount_Term` from months to years

In [None]:
X_combined['Loan_Amount_Term_Years'] = X_combined['Loan_Amount_Term'] / 12

### Create `Debt-to-Income` Ratio

In [None]:
X_combined['Debt_to_Income_Ratio'] = (X_combined['LoanAmount'] / X_combined['TotalIncome']).round(2)

### Create `Income_to_Loan` Ratio

In [None]:
# Interaction: Applicant Income divided by Loan Amount
X_combined['Income_to_Loan_Ratio'] = (X_combined['ApplicantIncome'] / X_combined['LoanAmount']).round(2)

### Create `Income_Per_Person` 

In [None]:
X_combined['FamilySize'] = X_combined['FamilySize'].astype('int')

In [None]:
# Interaction: Total Income divided by Family Size
X_combined['Income_Per_Person'] = (X_combined['TotalIncome'] / X_combined['FamilySize']).round(2)

### Convert continuous variables into categorical bins

In [None]:
X_combined[['TotalIncome']].describe()

In [None]:
# Binning TotalIncome into categories

# Define bins and labels for TotalIncome
bins_income = [0, 3755.83, 7813.54, 11012.97, X_combined['TotalIncome'].max()]
labels_income = ['Very Low', 'Low', 'Moderate', 'High']

# Apply binning
X_combined['TotalIncome_Bin'] = pd.cut(X_combined['TotalIncome'], bins=bins_income, labels=labels_income)

In [None]:
X_combined[['LoanAmount']].describe()

In [None]:
# Define bins and labels for LoanAmount
bins_loan = [0, 40, 173, X_combined['LoanAmount'].max()]
labels_loan = ['Very Small', 'Small', 'Large']

# Apply binning
X_combined['LoanAmount_Bin'] = pd.cut(X_combined['LoanAmount'], bins=bins_loan, labels=labels_loan)
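
One detail of `pd.cut` worth knowing: by default the intervals are open on the left, so a value equal to the first bin edge gets no label. A small sketch on invented amounts, showing the `include_lowest` option:

```python
import pandas as pd

# Toy loan amounts, invented for illustration only
amounts = pd.Series([0, 30, 40, 100, 200])
bins = [0, 40, 173, 200]
labels = ["Very Small", "Small", "Large"]

# Default: intervals are (left, right], so a value equal to the first edge is left out
default_bins = pd.cut(amounts, bins=bins, labels=labels)

# include_lowest=True also places the first edge in the first bin
inclusive_bins = pd.cut(amounts, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    "amount": amounts,
    "default": default_bins,
    "inclusive": inclusive_bins,
}))
```

If any `TotalIncome` or `LoanAmount` value can be exactly 0, passing `include_lowest=True` avoids introducing missing values in the binned column.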

In [None]:
X_combined.columns

In [None]:
# Convert relevant columns from object to int

X_combined['Married'] = X_combined['Married'].astype('int')
X_combined['Gender'] = X_combined['Gender'].astype('int')
X_combined['FamilySize'] = X_combined['FamilySize'].astype('int')
X_combined['Education'] = X_combined['Education'].astype('int')
X_combined['Self_Employed'] = X_combined['Self_Employed'].astype('int')
X_combined['Credit_History'] = X_combined['Credit_History'].astype('int')
X_combined['Dependents'] = X_combined['Dependents'].astype('int')
X_combined['Property_Area'] = X_combined['Property_Area'].astype('int')

#### One hot encoding for selected columns

In [None]:
X_combined = pd.get_dummies(X_combined, columns=['TotalIncome_Bin', 'LoanAmount_Bin'])

In [None]:
X_combined.info()

### Drop columns that we don't need

In [None]:
X_combined.drop(columns=['ID', 'Loan_ID'], inplace = True) 

In [None]:
X_combined.drop(columns=['Married', 'Gender'], inplace = True)

### Split Data into Train and Test Sets

In [None]:
pd.set_option('display.max_columns', None)
X_combined.head()

In [None]:
# Separate combined data back into train and test sets
X = X_combined[X_combined['source'] == 'train'].drop(columns=['source'])
test_data_processed = X_combined[X_combined['source'] == 'test'].drop(columns=['source'])

In [None]:
# Split X and y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Oversampling
Oversampling increases the number of instances in the minority class to balance the dataset, which can improve model performance on imbalanced data.

In [None]:
from imblearn.over_sampling import SMOTE

# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Oversampling reduced model performance, so the resampled data is not used in the models below
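
For intuition, here is a minimal sketch of the simplest form of oversampling: duplicating minority-class rows at random. This is not SMOTE, which creates new synthetic points by interpolating between neighbouring minority samples. The function name and the data are invented for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate minority-class rows at random until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


rows = [[5000, 120], [3000, 80], [4000, 100], [6000, 150], [2500, 60]]
labels = [1, 1, 1, 1, 0]
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

As in the SMOTE cell above, resampling is applied to the training split only, so the test split keeps its original class balance.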

### Scaling

In [None]:
# Initialize the scaler
scaler = StandardScaler()

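# Fit on the training data and transform both train and test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Keep the scaled submission features under a separate name: the final CatBoost
# model is trained on unscaled features, so test_data_processed must stay unscaled
test_data_scaled = scaler.transform(test_data_processed)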
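
What standard scaling does can be sketched for a single feature: the mean and standard deviation come from the training values only, and the same two numbers are reused on the test values. The values below are invented; `pstdev` is used because `StandardScaler` divides by the population standard deviation.

```python
from statistics import fmean, pstdev

# Toy values for one feature, invented for illustration only
train_values = [2000.0, 4000.0, 6000.0]
test_values = [4000.0, 8000.0]

# Statistics are learned from the training values only
mean = fmean(train_values)
std = pstdev(train_values)  # population standard deviation

# The same mean and std are applied to both splits
train_scaled = [(v - mean) / std for v in train_values]
test_scaled = [(v - mean) / std for v in test_values]

print(train_scaled)
print(test_scaled)
```

Fitting on the training split alone keeps information about the test split out of the preprocessing step.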

## 4. Modelling
This section compares different classification models.

#### 1. Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
logistic_model = LogisticRegression(random_state=40, max_iter=1000)

# Train the model
logistic_model.fit(X_train_scaled, y_train)

# Predict on the test set and evaluate
y_pred = logistic_model.predict(X_test_scaled)
print("Accuracy Score:", accuracy_score(y_test, y_pred))

# Classification report
print("Classification Report:\n", classification_report(y_test, y_pred))

# Confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
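
The accuracy, precision, and recall figures reported for each model all derive from the confusion matrix. A small sketch of that arithmetic, assuming scikit-learn's binary layout `[[TN, FP], [FN, TP]]`; the counts are invented and are not results from this notebook.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision and recall for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Invented counts, laid out as [[TN, FP], [FN, TP]]
matrix = [[20, 10], [5, 65]]
(tn, fp), (fn, tp) = matrix

accuracy, precision, recall = binary_metrics(tn, fp, fn, tp)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With an imbalanced target, accuracy alone can look good while recall on the minority class stays poor, which is why the classification report is printed alongside it.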

In [None]:
# Feature importance
features = X_train.columns

# Get coefficients
feature_importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': logistic_model.coef_[0]  # For binary classification
})

# Sort by absolute importance
feature_importance_df['Absolute Importance'] = abs(feature_importance_df['Importance'])
feature_importance_df = feature_importance_df.sort_values(by='Absolute Importance', ascending=False)

feature_importance_df

#### 2. CatBoost Classifier

In [None]:
!pip install catboost

In [None]:
from catboost import CatBoostClassifier

In [None]:
# Initialize the CatBoost model
catboost_model = CatBoostClassifier(iterations=100, random_seed=42, verbose=0)

# Train the model
catboost_model.fit(X_train, y_train)

# Predict on the test set and evaluate
y_pred_catboost = catboost_model.predict(X_test)
print("CatBoost Accuracy Score:", accuracy_score(y_test, y_pred_catboost))

# Classification report
print("CatBoost Classification Report:\n", classification_report(y_test, y_pred_catboost))

# Confusion matrix
print("CatBoost Confusion Matrix:\n", confusion_matrix(y_test, y_pred_catboost))

In [None]:
# Get feature importances
features = X_train.columns

feature_importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': catboost_model.get_feature_importance()
}).sort_values(by='Importance', ascending=False)

feature_importance_df

#### 3. AdaBoost Classifier

In [None]:
from sklearn.ensemble import AdaBoostClassifier

In [None]:
# Initialize the AdaBoost model
adaboost_model = AdaBoostClassifier(n_estimators=100, random_state=42)

# Train the model
adaboost_model.fit(X_train, y_train)

# Predict on the test set and evaluate
y_pred_adaboost = adaboost_model.predict(X_test)
print("AdaBoost Accuracy Score:", accuracy_score(y_test, y_pred_adaboost))

# Classification report
print("AdaBoost Classification Report:\n", classification_report(y_test, y_pred_adaboost))

# Confusion matrix
print("AdaBoost Confusion Matrix:\n", confusion_matrix(y_test, y_pred_adaboost))

#### 4. LightGBM Classifier

In [None]:
from lightgbm import LGBMClassifier

In [None]:
# Initialize the LightGBM model
lightgb_model = LGBMClassifier(n_estimators=100, random_state=42)

# Train the model
lightgb_model.fit(X_train_scaled, y_train)

# Predict on the test set and evaluate
y_pred = lightgb_model.predict(X_test_scaled)
print("Accuracy Score:", accuracy_score(y_test, y_pred))

# Classification report
print("Classification Report:\n", classification_report(y_test, y_pred))

# Confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

In [None]:
# Feature Importance
importances = lightgb_model.feature_importances_
features = X_train.columns
indices = np.argsort(importances)[::-1]

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.title("Feature Importance")
plt.barh(range(X_train.shape[1]), importances[indices], align="center")  # Horizontal bar plot
plt.yticks(range(X_train.shape[1]), [features[i] for i in indices])     # Labels on the y-axis
plt.xlabel("Importance")  # Label for the x-axis
plt.ylabel("Features")    # Label for the y-axis
plt.gca().invert_yaxis()  # Invert y-axis to have the most important feature at the top
plt.show()

In [None]:
# Create a DataFrame for feature importances
feature_importance_df = pd.DataFrame({
    'Feature': [features[i] for i in indices],
    'Importance': importances[indices]
})


# Sort the DataFrame by importance for better visualization
feature_importance_df = feature_importance_df.sort_values(by="Importance", ascending=False)

# Display sorted DataFrame
feature_importance_df

#### 5. XGBoost Classifier

In [None]:
from xgboost import XGBClassifier

# Initialize the XGBoost model
xgboost_model = XGBClassifier(random_state=40)

# Train the model
xgboost_model.fit(X_train, y_train)

# Predict on the test set and evaluate
y_pred = xgboost_model.predict(X_test)
print("Accuracy Score:", accuracy_score(y_test, y_pred))

# Classification report
print("Classification Report:\n", classification_report(y_test, y_pred))

# Confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

#### 6. Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest model
randomforest_model = RandomForestClassifier(random_state=40)

# Train the model
randomforest_model.fit(X_train, y_train)

# Predict on the test set and evaluate
y_pred = randomforest_model.predict(X_test)
print("Accuracy Score:", accuracy_score(y_test, y_pred))

# Classification report
print("Classification Report:\n", classification_report(y_test, y_pred))

# Confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

#### 7. Support Vector Classifier (SVC)

In [None]:
from sklearn.svm import SVC

# Initialize the SVC model
svc_model = SVC(random_state=40, max_iter=1000)

# Train the model
svc_model.fit(X_train_scaled, y_train)

# Predict on the test set and evaluate
y_pred = svc_model.predict(X_test_scaled)
print("Accuracy Score:", accuracy_score(y_test, y_pred))

# Classification report
print("Classification Report:\n", classification_report(y_test, y_pred))

# Confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

#### 8. Feed Forward Neural Networks (FNNs)

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import Precision, Recall
from tensorflow.keras.callbacks import EarlyStopping

# Build the model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    Dropout(0.3),  # Dropout for regularization
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')  # Output layer for binary classification
])

# Define early stopping
early_stopping = EarlyStopping(monitor='val_loss', # Metric to monitor
                               patience=5,         # Number of epochs to wait for improvement
                               restore_best_weights=True)  # Restore best model weights after training


# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy', Precision(), Recall()])

# Train the model
history = model.fit(X_train_scaled, y_train, epochs=100, batch_size=32,
                    validation_split=0.2, verbose=1,
                    callbacks=[early_stopping])  # Apply early stopping

In [None]:
# Evaluate the model on the test set
test_loss, test_accuracy, test_precision, test_recall = model.evaluate(X_test_scaled, y_test)

# Print the results
print(f'Test Accuracy: {test_accuracy}')
print(f'Test Precision: {test_precision}')
print(f'Test Recall: {test_recall}')

# Predict on the test set
y_pred = (model.predict(X_test_scaled) > 0.5).astype("int32")

# Display the Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred))

# Display the Confusion Matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

## 5. Hyperparameter Tuning with RandomizedSearchCV
This section demonstrates how to optimize model parameters.

In [None]:
# We'll use the best performing model, CatBoost
from sklearn.model_selection import RandomizedSearchCV

# Initialize the CatBoost model
catboost_model = CatBoostClassifier(random_seed=42, verbose=0)

# Define parameter grid
param_grid = {
    'depth': [2, 5, 8, 10],
    'learning_rate': [0.001, 0.01, 0.1, 0.2],
    'iterations': [100, 300, 500, 1000, 1500],
    'l2_leaf_reg': [0.0001, 0.001, 0.01, 0.1],
    'bagging_temperature': [0, 0.5, 1],
    'border_count': [32, 64, 128]
}

# Use RandomizedSearchCV for hyperparameter tuning
catboost_search = RandomizedSearchCV(
    estimator=catboost_model,
    param_distributions=param_grid,
    n_iter=100,
    scoring='accuracy',  
    cv=5,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

# Fit the model
catboost_search.fit(X_train, y_train)

# Get the best parameters
best_params = catboost_search.best_params_
print("Best Parameters:", best_params)
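
To get a feel for the cost of this search, the sketch below counts how many parameter combinations the grid contains and how many model fits `n_iter=100` with `cv=5` implies. It uses only the values from the cell above.

```python
from math import prod

# Same candidate values as the search above
param_grid = {
    "depth": [2, 5, 8, 10],
    "learning_rate": [0.001, 0.01, 0.1, 0.2],
    "iterations": [100, 300, 500, 1000, 1500],
    "l2_leaf_reg": [0.0001, 0.001, 0.01, 0.1],
    "bagging_temperature": [0, 0.5, 1],
    "border_count": [32, 64, 128],
}
n_iter, cv = 100, 5

# Size of the full grid versus what the randomized search samples
total_combinations = prod(len(values) for values in param_grid.values())
cv_fits = n_iter * cv
sampled_fraction = n_iter / total_combinations

print(f"Full grid: {total_combinations} combinations")
print(f"Randomized search: {cv_fits} cross-validation fits")
print(f"Share of the grid sampled: {sampled_fraction:.1%}")
```

The search therefore tries a small share of the grid, and with the default `refit=True` it trains one more model on the full training split using the best parameters.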

In [None]:
# Train the CatBoost model with the best parameters
optimized_catboost_model = CatBoostClassifier(**best_params, random_seed=42, verbose=0)
optimized_catboost_model.fit(X_train, y_train)

# Predict on the test set
y_pred_catboost = optimized_catboost_model.predict(X_test)

# Evaluate the model
print("Optimized CatBoost Accuracy Score:", accuracy_score(y_test, y_pred_catboost))
print("Optimized CatBoost Classification Report:\n", classification_report(y_test, y_pred_catboost))
print("Optimized CatBoost Confusion Matrix:\n", confusion_matrix(y_test, y_pred_catboost))

### Submission

In [None]:
X_scaled = scaler.transform(X)

In [None]:
# Initialize the CatBoost model
catboost_model = CatBoostClassifier(iterations=100, random_seed=42, verbose=0)

# Train the model on the entire dataset
catboost_model.fit(X, y)

In [None]:
# Predict using the trained model
predictions = catboost_model.predict(test_data_processed)

In [None]:
# Create a DataFrame with required columns
submission = pd.DataFrame({
    'ID': test_data['ID'],
    'Loan_Status': predictions  
})

In [None]:
submission

In [None]:
# Save to CSV
submission.to_csv('submission.csv', index=False)

---
_**Your Dataness**_,  
**`Obinna Oliseneku`** (_**Hybraid**_)  
**[LinkedIn](https://www.linkedin.com/in/obinnao/)** | **[GitHub](https://github.com/hybraid6)**  