<a href="https://colab.research.google.com/github/mkm-world/Loan-Credit-Worthiness-Prediction/blob/main/Loan_Credit_Worthiness_Prediction_Starter_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Loan Creditworthiness: Baseline Approach


This starter notebook provides a simple baseline approach to the Bluechip Data & AI 2024 Summit Hackathon on **Predicting Loan Creditworthiness**.

## Import Libraries

In [28]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

## Loading the Data

In [3]:
input_dir = "/content/"

train = pd.read_csv(input_dir + "Train.csv")

train.head()

Unnamed: 0,ID,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,Total_Income
0,74768,LP002231,1,1,0,1,0,8328,0.0,17,363,1,2,1,6000
1,79428,LP001448,1,1,0,0,0,150,3857.458782,188,370,1,1,0,6000
2,70497,LP002231,0,0,0,0,0,4989,314.472511,17,348,1,0,0,6000
3,87480,LP001385,1,1,0,0,0,150,0.0,232,359,1,1,1,3750
4,33964,LP002231,1,1,1,0,0,8059,0.0,17,372,1,0,1,3750


In [4]:
test = pd.read_csv(input_dir + "Test.csv")

test.head()

Unnamed: 0,ID,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Total_Income
0,70607,LP002560,1,1,0,1,0,15890,871.075952,188,371,1,1,6000
1,58412,LP001379,1,1,0,0,1,6582,896.718887,17,373,0,1,6000
2,88755,LP002560,0,0,0,0,0,7869,572.900354,17,373,1,1,6000
3,97271,LP002560,1,1,0,0,0,150,0.0,247,349,1,2,6000
4,70478,LP002231,1,1,0,0,0,8362,0.0,17,12,1,2,3750


In [6]:
sample_sub = pd.read_csv(input_dir + "Sample Submission.csv")

sample_sub.head()

Unnamed: 0,ID,Loan_Status
0,70607,
1,58412,
2,88755,
3,97271,
4,70478,


## Data Cleaning and Preprocessing

### Missing Values Check

In [8]:
train.isnull().sum()

Unnamed: 0,0
ID,0
Loan_ID,0
Gender,0
Married,0
Dependents,0
Education,0
Self_Employed,0
ApplicantIncome,0
CoapplicantIncome,0
LoanAmount,0


In [9]:
test.isnull().sum()

Unnamed: 0,0
ID,0
Loan_ID,0
Gender,0
Married,0
Dependents,0
Education,0
Self_Employed,0
ApplicantIncome,0
CoapplicantIncome,0
LoanAmount,0


Both the train and test sets contain no missing values.

### Checking Data Types

In [10]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5898 entries, 0 to 5897
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 5898 non-null   int64  
 1   Loan_ID            5898 non-null   object 
 2   Gender             5898 non-null   int64  
 3   Married            5898 non-null   int64  
 4   Dependents         5898 non-null   object 
 5   Education          5898 non-null   int64  
 6   Self_Employed      5898 non-null   int64  
 7   ApplicantIncome    5898 non-null   int64  
 8   CoapplicantIncome  5898 non-null   float64
 9   LoanAmount         5898 non-null   int64  
 10  Loan_Amount_Term   5898 non-null   int64  
 11  Credit_History     5898 non-null   int64  
 12  Property_Area      5898 non-null   int64  
 13  Loan_Status        5898 non-null   int64  
 14  Total_Income       5898 non-null   int64  
dtypes: float64(1), int64(12), object(2)
memory usage: 691.3+ KB


In [11]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2528 entries, 0 to 2527
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 2528 non-null   int64  
 1   Loan_ID            2528 non-null   object 
 2   Gender             2528 non-null   int64  
 3   Married            2528 non-null   int64  
 4   Dependents         2528 non-null   object 
 5   Education          2528 non-null   int64  
 6   Self_Employed      2528 non-null   int64  
 7   ApplicantIncome    2528 non-null   int64  
 8   CoapplicantIncome  2528 non-null   float64
 9   LoanAmount         2528 non-null   int64  
 10  Loan_Amount_Term   2528 non-null   int64  
 11  Credit_History     2528 non-null   int64  
 12  Property_Area      2528 non-null   int64  
 13  Total_Income       2528 non-null   int64  
dtypes: float64(1), int64(11), object(2)
memory usage: 276.6+ KB


Only two columns, **Loan_ID** and **Dependents**, have a non-numeric data type. Loan_ID is not important at the moment, so we will drop it. For the Dependents column, we will perform one-hot encoding using the `pd.get_dummies()` function.

### One-Hot Encoding

In [22]:
train_processed = pd.get_dummies(train, columns=['Dependents'])
test_processed = pd.get_dummies(test, columns=['Dependents'])
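
Encoding the train and test frames separately only yields matching columns if both contain the same set of `Dependents` categories. The sketch below, using made-up toy frames rather than the competition data, shows one way to guard against a mismatch by reindexing the encoded test frame to the training columns. The names `toy_train`, `toy_test`, and the category values are assumptions for illustration.

```python
import pandas as pd

# Toy frames (hypothetical values): the test frame lacks the '3+' category.
toy_train = pd.DataFrame({"Dependents": ["0", "1", "3+"], "LoanAmount": [17, 188, 232]})
toy_test = pd.DataFrame({"Dependents": ["0", "1"], "LoanAmount": [17, 247]})

toy_train_enc = pd.get_dummies(toy_train, columns=["Dependents"])
toy_test_enc = pd.get_dummies(toy_test, columns=["Dependents"])

# Reindex the test frame to the training columns; categories that never
# appear in the test data become all-zero columns.
toy_test_enc = toy_test_enc.reindex(columns=toy_train_enc.columns, fill_value=0)

print(list(toy_test_enc.columns))
```

If the real train and test sets already share the same categories, the reindex step leaves the test frame unchanged.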

### Feature Selection

In [23]:
features = train_processed.columns.difference(['ID', 'Loan_Status', 'Loan_ID'])

X = train_processed[features]
y = train_processed['Loan_Status']

## Data Splitting

In [55]:
# hold out 25% of the training data for validation, stratified on the target
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
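
The `stratify=y` argument keeps the class proportions roughly equal in both parts of the split. A minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are assumptions, not the competition data) shows how to check this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary target (roughly 80% positives).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.8).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# With stratification, the share of positives is nearly identical in both parts.
print(round(y_tr.mean(), 3), round(y_val.mean(), 3))
```

The same check can be run on `y_train` and `y_test` from the cell above.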

## Modeling

### Import Models

In [25]:
!pip install catboost -qq

In [26]:
# import models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

### Logistic regression

In [57]:
# train Logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# make predictions
y_pred = log_reg.predict(X_test)

# evaluate model
accuracy_score(y_test, y_pred)

0.8332203389830508
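
Logistic regression with default settings may struggle to converge when features sit on very different scales, and any such warning would be hidden here because warnings are silenced at the top of the notebook. One common remedy, sketched below on synthetic data, is to standardize the features inside a pipeline. This is an optional variation, not part of the baseline above; `X_demo`, `y_demo`, and the scale factors are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with deliberately mixed feature scales.
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)
X_demo = X_demo * np.array([1, 10, 100, 1000, 1, 1])

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training part only, then applied to validation data.
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X_tr, y_tr)

acc = accuracy_score(y_val, scaled_log_reg.predict(X_val))
print(acc)
```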

### Random Forest

In [58]:
# train random forest model
rf = RandomForestClassifier(max_depth=8, random_state=42, n_estimators=1000)
rf.fit(X_train, y_train)

# make predictions
y_pred = rf.predict(X_test)

# evaluate model
accuracy_score(y_test, y_pred)

0.8332203389830508

### XGBoost

In [59]:
# train xgboost model
xgb = XGBClassifier(max_depth=4, random_state=42, n_estimators=400, learning_rate=0.05)
xgb.fit(X_train, y_train)

# make predictions
y_pred = xgb.predict(X_test)

# evaluate model
accuracy_score(y_test, y_pred)

0.8298305084745763

### CatBoost

In [60]:
# train catboost model
cat = CatBoostClassifier(max_depth=6, random_state=42, n_estimators=400, learning_rate=0.05, verbose=False)
cat.fit(X_train, y_train)

# make predictions
y_pred = cat.predict(X_test)

# evaluate model
accuracy_score(y_test, y_pred)

0.8305084745762712
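
The four accuracy scores above are very close to one another. Accuracy alone does not show whether a model does better than always predicting the most common class, so it can help to compare against a majority-class baseline and inspect a confusion matrix. The sketch below uses a small hand-made label vector (an assumption, not the competition data) to show the idea.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 8 positives and 2 negatives.
y_true = np.array([1, 1, 1, 1, 1, 0, 1, 1, 0, 1])
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by the baseline

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

baseline_acc = accuracy_score(y_true, y_base)
cm = confusion_matrix(y_true, y_base)  # rows: true class, columns: predicted class

print(baseline_acc)
print(cm)
```

Here the baseline reaches 0.8 accuracy without learning anything, and the confusion matrix shows that no negative case is ever predicted. Running the same comparison on `y_test` gives a reference point for the model scores.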

## Make predictions - Generate submission file


### Re-train model on the whole data

In [62]:
# re-train the random forest on the whole training data
# (it tied with logistic regression for the best hold-out accuracy)
rf = RandomForestClassifier(max_depth=8, random_state=42, n_estimators=1000)
rf.fit(X, y)

### Make predictions on the test dataset

In [64]:
# make predictions
test_predictions = rf.predict(test_processed[features])

# write predictions to the sample submission file
sample_sub['Loan_Status'] = test_predictions

# display sample submission
display(sample_sub.head())

# save sample submission
sample_sub.to_csv('baseline-submission.csv', index=False)

Unnamed: 0,ID,Loan_Status
0,70607,1
1,58412,1
2,88755,1
3,97271,1
4,70478,1
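
Assigning the prediction array directly to `sample_sub['Loan_Status']` relies on the submission file listing IDs in the same order as the test set. A minimal sketch of a safeguard, using toy frames with hypothetical predictions, checks the order first and falls back to an ID-based lookup otherwise:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the test set and the sample submission (hypothetical predictions).
toy_test = pd.DataFrame({"ID": [70607, 58412, 88755]})
toy_sub = pd.DataFrame({"ID": [70607, 58412, 88755], "Loan_Status": [np.nan] * 3})
toy_preds = np.array([1, 0, 1])

ids_match = toy_sub["ID"].equals(toy_test["ID"])
if ids_match:
    # Same IDs in the same order: positional assignment is safe.
    toy_sub["Loan_Status"] = toy_preds
else:
    # Otherwise look up each submission ID in a prediction series keyed by test ID.
    pred_by_id = pd.Series(toy_preds, index=toy_test["ID"])
    toy_sub["Loan_Status"] = toy_sub["ID"].map(pred_by_id)

print(toy_sub)
```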


## Next Steps
- Exploratory Data Analysis (EDA) to help understand the data and uncover insights that might aid feature engineering.
- Hyperparameter tuning: explore manual tuning, GridSearch, Optuna.
- Cross-validation with KFold.
- Ensembling methods; two heads are better than one.
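
As a starting point for the cross-validation and tuning steps, the sketch below combines `KFold` with `cross_val_score` and `GridSearchCV` from scikit-learn. It runs on synthetic data, and the parameter grid is an arbitrary assumption chosen to keep the example fast, not a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in for the training features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# 5-fold cross-validation with shuffling for a more stable accuracy estimate.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())

# Small, illustrative grid; every combination is scored with the same folds.
param_grid = {"max_depth": [4, 8], "n_estimators": [50, 100]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=cv, scoring="accuracy"
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

On the real data, `X` and `y` from the feature selection step would replace `X_demo` and `y_demo`.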