# ML Challenge 

<img src="https://imageio.forbes.com/specials-images/imageserve/5ecd179f798e4c00060d2c7c/0x0.jpg?format=jpg&height=600&width=1200&fit=bounds" width="500" height="300">

In the bustling city of Financia, the Central Lending Institution (CLI) is the largest provider of loans to individuals and businesses. With a mission to support economic growth and financial stability, CLI processes thousands of loan applications every month. However, the traditional manual review process is time-consuming and prone to human error, leading to delays and inconsistencies in loan approvals.
To address these challenges, CLI has decided to use machine learning to streamline its loan approval process. It has compiled a comprehensive dataset of historical loan application records, including credit scores, income levels, employment status, loan terms (measured in years), loan amounts, asset values, and the final loan status (approved or denied).


**Your task is to develop a predictive model that can accurately determine the likelihood of loan approval based on the provided features. By doing so, you will help CLI make faster, more accurate, and fairer lending decisions, ultimately contributing to the financial well-being of the community.**

It is recommended that you follow the typical machine learning workflow, though you are not required to follow every step strictly:
1. Data Collection: Gather the data you need for your model. (Already done for you)

2. Data Preprocessing: Clean and prepare the data for analysis. (Already done for you)

3. Exploratory Data Analysis (EDA): Understand the data and its patterns. (Partially done for you)

4. Feature Engineering: Create new features or modify existing ones to improve model performance. (Partially done for you)

5. Model Selection: Choose the appropriate machine learning algorithm.

6. Model Training: Train the model using the training dataset.

7. Model Evaluation: Evaluate the model's performance using a validation dataset.

8. Model Optimization: Optimize the model's parameters to improve performance.

9. Model Testing: Test the final model on a separate test dataset.
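
Steps 7 and 9 refer to a validation dataset and a separate test dataset. One possible way to obtain both is to call `train_test_split` twice. The sketch below uses synthetic data and hypothetical variable names (`X_demo`, `y_demo`, and so on), so treat it as an illustration rather than the required approach.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a feature matrix and binary labels (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.integers(0, 2, size=1000)

# First hold out 20% as the final test set, then split the remainder 75/25,
# which gives 60% train, 20% validation, and 20% test overall.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

The validation rows would then be used for model comparison and tuning, leaving the test rows untouched until the very end.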

**Please include ALL your work and thought process in this notebook**

In [1]:
# You may include any package you deem fit. We suggest looking into scikit-learn
import pandas as pd

## Dataset


In [2]:
# DO NOT MODIFY
loan_data = pd.read_csv("../../data/loan_approval.csv")


## EDA
Uncomment the blocks below to see the desired output. Add more analysis if you like.

In [3]:

import matplotlib.pyplot as plt

# ------ Display basic information ------
print(loan_data.columns)
print(loan_data.describe())

# ------ Visualize the distribution of loan status ------
# loan_status_counts = loan_data['loan_status'].value_counts()
# plt.bar(loan_status_counts.index, loan_status_counts.values)
# plt.title('Distribution of Loan Status')
# plt.xlabel('Loan Status')
# plt.ylabel('Count')

# ------ Visualize the distribution of numerical features ------ 
# loan_data.hist(bins=30, figsize=(20, 15))

# ------ Correlation matrix ------
# corr_matrix = loan_data.corr(numeric_only=True)  # numeric_only skips string columns such as education (pandas >= 1.5)
# fig, ax = plt.subplots(figsize=(10, 8))
# cax = ax.matshow(corr_matrix, cmap='coolwarm')
# fig.colorbar(cax)
# plt.xticks(range(len(corr_matrix.columns)), corr_matrix.columns, rotation=90)
# plt.yticks(range(len(corr_matrix.columns)), corr_matrix.columns)

# ----- MORE (Encouraged but not required) ------
# TODO 

Index(['loan_id', 'no_of_dependents', 'education', 'self_employed',
       'income_annum', 'loan_amount', 'loan_term', 'cibil_score',
       'residential_assets_value', 'commercial_assets_value',
       'luxury_assets_value', 'bank_asset_value', 'loan_status'],
      dtype='object')
           loan_id  no_of_dependents  income_annum   loan_amount    loan_term  \
count  4269.000000       4269.000000  4.269000e+03  4.269000e+03  4269.000000   
mean   2135.000000          2.498712  5.059124e+06  1.513345e+07    10.900445   
std    1232.498479          1.695910  2.806840e+06  9.043363e+06     5.709187   
min       1.000000          0.000000  2.000000e+05  3.000000e+05     2.000000   
25%    1068.000000          1.000000  2.700000e+06  7.700000e+06     6.000000   
50%    2135.000000          3.000000  5.100000e+06  1.450000e+07    10.000000   
75%    3202.000000          4.000000  7.500000e+06  2.150000e+07    16.000000   
max    4269.000000          5.000000  9.900000e+06  3.950000e+07    

In [5]:
# ------ Check for missing values ------
print(loan_data.isnull().sum())

loan_id                     0
no_of_dependents            0
education                   0
self_employed               0
income_annum                0
loan_amount                 0
loan_term                   0
cibil_score                 0
residential_assets_value    0
commercial_assets_value     0
luxury_assets_value         0
bank_asset_value            0
loan_status                 0
dtype: int64


In [6]:
loan_data.head()

Unnamed: 0,loan_id,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
0,1,2,Graduate,No,9600000,29900000,12,778,2400000,17600000,22700000,8000000,Approved
1,2,0,Not Graduate,Yes,4100000,12200000,8,417,2700000,2200000,8800000,3300000,Rejected
2,3,3,Graduate,No,9100000,29700000,20,506,7100000,4500000,33300000,12800000,Rejected
3,4,3,Graduate,No,8200000,30700000,8,467,18200000,3300000,23300000,7900000,Rejected
4,5,5,Not Graduate,Yes,9800000,24200000,20,382,12400000,8200000,29400000,5000000,Rejected


## Feature Engineering

You may want to convert categorical variables to numerical ones. For example, `education` takes the values `Graduate` and `Not Graduate`, but most machine learning algorithms expect numeric input, so we encode it as 1 or 0.
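
One thing worth checking with `Series.map` is that any label missing from the dictionary silently becomes `NaN`. The sketch below uses a tiny made-up frame (`demo`) with a stray leading space to show the effect; the actual dataset may not have this problem.

```python
import pandas as pd

# Tiny hypothetical frame; the second value carries a stray leading space.
demo = pd.DataFrame({"education": ["Graduate", " Not Graduate", "Not Graduate"]})
mapping = {"Graduate": 1, "Not Graduate": 0}

# Without cleaning, the unmatched label silently becomes NaN.
raw_encoded = demo["education"].map(mapping)
print(raw_encoded.isna().sum())  # number of rows that failed to map

# Stripping whitespace first lets every label match the mapping.
clean_encoded = demo["education"].str.strip().map(mapping)
print(clean_encoded.tolist())
```

Running `isna().sum()` on each encoded column right after mapping is a cheap way to confirm that nothing was lost.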

In [7]:
loan_data['education'] = loan_data['education'].map({'Graduate': 1, 'Not Graduate': 0})
# Hint: Other categorical variables are self_employed and loan_status
# TODO
loan_data['self_employed'] = loan_data['self_employed'].map({'Yes': 1, 'No': 0})
loan_data['loan_status'] = loan_data['loan_status'].map({'Approved': 1, 'Rejected': 0})
loan_data.head()

Unnamed: 0,loan_id,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
0,1,2,1,0,9600000,29900000,12,778,2400000,17600000,22700000,8000000,1
1,2,0,0,1,4100000,12200000,8,417,2700000,2200000,8800000,3300000,0
2,3,3,1,0,9100000,29700000,20,506,7100000,4500000,33300000,12800000,0
3,4,3,1,0,8200000,30700000,8,467,18200000,3300000,23300000,7900000,0
4,5,5,0,1,9800000,24200000,20,382,12400000,8200000,29400000,5000000,0


In [13]:
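from sklearn import preprocessing  # imported here because this cell comes before the Model Selection imports

# Standardize each numeric column to zero mean and unit variance.
# Note: the scalers are fit on the full dataset, before any train/test split.
numeric_cols = ['no_of_dependents', 'income_annum', 'loan_amount', 'loan_term', 'cibil_score', 'residential_assets_value', 'commercial_assets_value', 'luxury_assets_value', 'bank_asset_value']
for label in numeric_cols:
    stdscaler = preprocessing.StandardScaler()
    loan_data[label] = stdscaler.fit_transform(loan_data[[label]]).ravel()
loan_data.head()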

Unnamed: 0,loan_id,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
0,1,-0.294102,1,0,1.617979,1.633052,0.192617,1.032792,-0.780058,2.877289,0.832028,0.930304,1
1,2,-1.473548,0,1,-0.34175,-0.324414,-0.508091,-1.061051,-0.733924,-0.631921,-0.694993,-0.515936,0
2,3,0.295621,1,0,1.439822,1.610933,1.594031,-0.54484,-0.0573,-0.107818,1.99652,2.407316,0
3,4,0.295621,1,0,1.119139,1.721525,-0.508091,-0.771045,1.649637,-0.381263,0.897943,0.899533,0
4,5,1.475067,0,1,1.689242,1.002681,1.594031,-1.264055,0.757724,0.735304,1.568075,0.007172,0
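
The cell above standardizes the columns using statistics from every row, including rows that later end up in the test set. A common alternative is to fit the scaler on the training rows only and reuse it for the test rows. The sketch below shows that pattern on synthetic data; the variable names are hypothetical, and this is not necessarily what the challenge expects.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features on a large scale, similar in spirit to income or asset values.
rng = np.random.default_rng(1)
X_num = rng.normal(loc=5e6, scale=2e6, size=(500, 3))
y_num = rng.integers(0, 2, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_num, y_num, test_size=0.2, random_state=1)

# Fit on the training rows only, then apply the same statistics to the test rows.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1, while the test columns will be close to, but not exactly, those values.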


In [24]:
# loan_id is only a row identifier, so drop it before modeling
loan_data = loan_data.drop(columns="loan_id")
loan_data.head()

Unnamed: 0,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
0,-0.294102,1,0,1.617979,1.633052,0.192617,1.032792,-0.780058,2.877289,0.832028,0.930304,1
1,-1.473548,0,1,-0.34175,-0.324414,-0.508091,-1.061051,-0.733924,-0.631921,-0.694993,-0.515936,0
2,0.295621,1,0,1.439822,1.610933,1.594031,-0.54484,-0.0573,-0.107818,1.99652,2.407316,0
3,0.295621,1,0,1.119139,1.721525,-0.508091,-0.771045,1.649637,-0.381263,0.897943,0.899533,0
4,1.475067,0,1,1.689242,1.002681,1.594031,-1.264055,0.757724,0.735304,1.568075,0.007172,0


## Model Selection

You are free to use any classification model you like, for example Logistic Regression, Decision Trees/Random Forests, Support Vector Machines, or KNN.
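
One way to compare candidate models before committing to one is k-fold cross-validation. The sketch below runs two of the model families named above on synthetic data from `make_classification`; the data, the `candidates` dictionary, and the chosen hyperparameters are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the loan features.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=11, n_informative=4, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# 5-fold cross-validated accuracy is usually steadier than a single split.
cv_means = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```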

In [25]:
# TODO
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
# Separate the features from the target, then hold out 20% of the rows for testing.
X = loan_data.drop(columns='loan_status')
y = loan_data['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

## Model Training and Evaluation

### Support Vector Machine 

In [53]:
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
score = accuracy_score(y_test, y_pred)  # scikit-learn convention: true labels first
score

0.9473067915690867

### Logistic Regression

In [54]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()
LR.fit(X_train, y_train)

y_pred = LR.predict(X_test)

from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)  # scikit-learn convention: true labels first
print(score)

0.9215456674473068


In [55]:
LR.intercept_

array([1.6537967])

In [56]:
LR.coef_

array([[-0.00567979,  0.14117614,  0.02619463, -1.49542206,  1.27558917,
        -0.81680569,  4.19111657, -0.04806466,  0.07802968,  0.11998631,
         0.19309334]])
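
The raw `coef_` array is easier to read when each value is paired with its column name. In the output above, the largest coefficient (about 4.19) is in the seventh position, which corresponds to `cibil_score` in the column order of `X`. The sketch below shows one way to build such a table; it uses synthetic data and hypothetical feature names `f0` to `f3`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with hypothetical column names, used only for illustration.
X_arr, y_arr = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_named = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

model = LogisticRegression().fit(X_named, y_arr)

# For a binary problem coef_ has shape (1, n_features); row 0 lines up with the columns.
coef_table = pd.Series(model.coef_[0], index=X_named.columns)

# Order by absolute size so the most influential features come first.
coef_table = coef_table.reindex(coef_table.abs().sort_values(ascending=False).index)
print(coef_table)
```

Comparing magnitudes this way is most meaningful when the features are on a common scale, as they are after standardization.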

### Decision Tree Classifier

In [58]:
from sklearn.tree import DecisionTreeClassifier
DTC = DecisionTreeClassifier(max_depth=3,random_state=123)
DTC.fit(X_train,y_train)

y_pred = DTC.predict(X_test)

from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)  # scikit-learn convention: true labels first
print(score)

0.9637002341920374


In [59]:
features = X.columns
importances = DTC.feature_importances_
importances_df = pd.DataFrame()
importances_df['Feature Name']  = features
importances_df['Feature Importance'] = importances
importances_df.sort_values('Feature Importance',ascending=False)

Unnamed: 0,Feature Name,Feature Importance
6,cibil_score,0.90812
5,loan_term,0.080826
4,loan_amount,0.010089
7,residential_assets_value,0.000579
8,commercial_assets_value,0.000386
0,no_of_dependents,0.0
1,education,0.0
2,self_employed,0.0
3,income_annum,0.0
9,luxury_assets_value,0.0


### Gaussian Naive Bayes

In [60]:
from sklearn.naive_bayes import GaussianNB
GNB = GaussianNB()
GNB.fit(X_train,y_train)

y_pred = GNB.predict(X_test)

from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)  # scikit-learn convention: true labels first
print(score)

0.9355971896955504


### AdaBoostClassifier

In [64]:
from sklearn.ensemble import AdaBoostClassifier
ABC = AdaBoostClassifier(random_state=123)
ABC.fit(X_train, y_train)
y_pred = ABC.predict(X_test)

from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)  # scikit-learn convention: true labels first
print(score)

0.968384074941452




In [65]:
features = X.columns
importances = ABC.feature_importances_

importances_df = pd.DataFrame()
importances_df['Feature Name']  = features
importances_df['Feature Importance'] = importances
importances_df.sort_values('Feature Importance',ascending=False)

Unnamed: 0,Feature Name,Feature Importance
6,cibil_score,0.26
4,loan_amount,0.16
10,bank_asset_value,0.16
3,income_annum,0.14
5,loan_term,0.12
7,residential_assets_value,0.04
8,commercial_assets_value,0.04
9,luxury_assets_value,0.04
0,no_of_dependents,0.02
1,education,0.02
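
The decision tree and AdaBoost rank the features quite differently, which is common for impurity-based importances. Permutation importance on held-out data is one model-agnostic cross-check. The following is a minimal sketch on synthetic data with hypothetical variable names, not a result for the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the loan features.
X_pi, y_pi = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
X_pi_tr, X_pi_te, y_pi_tr, y_pi_te = train_test_split(
    X_pi, y_pi, test_size=0.2, random_state=0
)

booster = AdaBoostClassifier(random_state=0).fit(X_pi_tr, y_pi_tr)

# Shuffle one column at a time on held-out rows and record the drop in accuracy.
result = permutation_importance(
    booster, X_pi_te, y_pi_te, n_repeats=10, random_state=0
)
for idx in result.importances_mean.argsort()[::-1]:
    print(
        f"feature {idx}: {result.importances_mean[idx]:.3f} "
        f"+/- {result.importances_std[idx]:.3f}"
    )
```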


## Model Optimization and Testing

In [66]:
from sklearn.model_selection import GridSearchCV
import numpy as np

parameters = {'max_depth':np.arange(1,10),'criterion':['gini','entropy'],'min_samples_split':np.arange(5,20)}

DTC = DecisionTreeClassifier()

grid_search = GridSearchCV(DTC, parameters, scoring='roc_auc', cv=5)
grid_search.fit(X_train, y_train)

best = grid_search.best_params_
cri, mdpt, mss = best['criterion'], best['max_depth'], best['min_samples_split']
print(cri,mdpt,mss)

gini 4 9
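
With its default `refit=True`, `GridSearchCV` retrains the best configuration on the full training set and exposes it as `best_estimator_`, so rebuilding the model by hand is optional. The sketch below illustrates this on synthetic data with a smaller, hypothetical parameter grid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the loan features.
X_gs, y_gs = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
X_gs_tr, X_gs_te, y_gs_tr, y_gs_te = train_test_split(
    X_gs, y_gs, test_size=0.2, random_state=0
)

# A reduced grid for illustration; random_state makes the tree reproducible.
param_grid = {"max_depth": np.arange(1, 6), "criterion": ["gini", "entropy"]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, scoring="roc_auc", cv=5
)
search.fit(X_gs_tr, y_gs_tr)

# best_estimator_ is already refit on all training rows with the best parameters.
best_tree = search.best_estimator_
test_accuracy = best_tree.score(X_gs_te, y_gs_te)
print(search.best_params_, round(search.best_score_, 3), round(test_accuracy, 3))
```

Note that `best_score_` here is the mean cross-validated ROC AUC, while `score` on a classifier reports accuracy, so the two numbers are not directly comparable.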


In [67]:
from sklearn.tree import DecisionTreeClassifier
DTC = DecisionTreeClassifier(max_depth=mdpt,criterion=cri,min_samples_split=mss)
DTC.fit(X_train,y_train)

y_pred = DTC.predict(X_test)

from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)  # scikit-learn convention: true labels first
print(score)

0.9660421545667447
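
Accuracy alone does not show which kind of mistake a model makes, and for lending decisions a wrongly approved loan and a wrongly rejected one may carry different costs. The sketch below shows a few additional scikit-learn metrics on synthetic data; the variable names and the class balance are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced data standing in for the loan features.
X_ev, y_ev = make_classification(
    n_samples=500, n_features=8, n_informative=4, weights=[0.4, 0.6], random_state=0
)
X_ev_tr, X_ev_te, y_ev_tr, y_ev_te = train_test_split(
    X_ev, y_ev, test_size=0.2, stratify=y_ev, random_state=0
)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_ev_tr, y_ev_tr)
y_ev_pred = tree.predict(X_ev_te)
y_ev_prob = tree.predict_proba(X_ev_te)[:, 1]  # probability of class 1

# Rows are true classes and columns are predicted classes.
cm = confusion_matrix(y_ev_te, y_ev_pred)
auc = roc_auc_score(y_ev_te, y_ev_prob)
print(cm)
print(classification_report(y_ev_te, y_ev_pred))
print(f"ROC AUC: {auc:.3f}")
```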
