# Loan predictions

## Problem Statement

We want to automate the loan eligibility process using the customer details provided when an online application form is filled in. You can find the dataset [here](https://drive.google.com/file/d/1h_jl9xqqqHflI5PsuiQd_soNYxzFfjKw/view?usp=sharing). These details include the customer's Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others.

|Variable| Description|
|------------- |-------------|
|Loan_ID| Unique Loan ID|
|Gender| Male/ Female|
|Married| Applicant married (Y/N)|
|Dependents| Number of dependents|
|Education| Applicant Education (Graduate/ Not Graduate)|
|Self_Employed| Self employed (Y/N)|
|ApplicantIncome| Applicant income|
|CoapplicantIncome| Coapplicant income|
|LoanAmount| Loan amount in thousands|
|Loan_Amount_Term| Term of loan in months|
|Credit_History| credit history meets guidelines|
|Property_Area| Urban/ Semi Urban/ Rural|
|Loan_Status| Loan approved (Y/N)|



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

## 1. Hypothesis Generation

Generating hypotheses is a major step in analyzing data. It involves understanding the problem and formulating meaningful hypotheses about what could strongly influence the outcome. This is done BEFORE looking at the data, and the result is a list of analyses we can perform if the data is available.

#### Possible hypotheses
Which applicants are more likely to get a loan?

1. Applicants having a credit history 
2. Applicants with higher applicant and co-applicant incomes
3. Applicants with higher education level
4. Properties in urban areas with high growth prospects

Do more brainstorming and create some hypotheses of your own. Remember that the data might not be sufficient to test all of these, but forming these enables a better understanding of the problem.

To avoid bias, I will exclude the applicant's gender and marital status. The number of dependents and the total (applicant + coapplicant) income should be sufficient to assess their ability to meet their family's financial needs.

## 2. Data Exploration
Let's do some basic data exploration here and come up with some inferences about the data. Go ahead and try to figure out some irregularities and address them in the next section. 

In [505]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import plotly.express as px

df = pd.read_csv("../data/data.csv") 
df.head()

| |Loan_ID|Gender|Married|Dependents|Education|Self_Employed|ApplicantIncome|CoapplicantIncome|LoanAmount|Loan_Amount_Term|Credit_History|Property_Area|Loan_Status|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|LP001002|Male|No|0|Graduate|No|5849|0.0|NaN|360.0|1.0|Urban|Y|
|1|LP001003|Male|Yes|1|Graduate|No|4583|1508.0|128.0|360.0|1.0|Rural|N|
|2|LP001005|Male|Yes|0|Graduate|Yes|3000|0.0|66.0|360.0|1.0|Urban|Y|
|3|LP001006|Male|Yes|0|Not Graduate|No|2583|2358.0|120.0|360.0|1.0|Urban|Y|
|4|LP001008|Male|No|0|Graduate|No|6000|0.0|141.0|360.0|1.0|Urban|Y|


In [506]:
df.shape

(614, 13)

In [507]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


One of the key challenges in any data set is missing values. Let's start by checking which columns contain missing values.

In [508]:
df.isna().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
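
Raw counts are easier to judge as a share of all rows. The following is a minimal sketch on a made-up `toy` frame; `missing_summary` is a hypothetical helper, not part of the notebook, and on the real data it would be called on `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame(
        {"n_missing": counts, "pct_missing": (counts / len(frame) * 100).round(2)}
    )
    return summary.sort_values("n_missing", ascending=False)


# Toy data, made up for illustration only.
toy = pd.DataFrame(
    {
        "LoanAmount": [128.0, np.nan, 66.0, np.nan],
        "Credit_History": [1.0, 1.0, np.nan, 0.0],
        "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
    }
)
summary = missing_summary(toy)
print(summary)
```

Columns with a large share of missing values deserve more care during imputation than columns with only a handful.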

In [509]:
NUMERICAL_FEATURES = [
    "ApplicantIncome",
    "CoapplicantIncome",
    "LoanAmount",
    "Loan_Amount_Term"
]
CATEGORICAL_FEATURES = [
    "Dependents",
    "Education",
    "Self_Employed",
    "Credit_History",
    "Property_Area",
]
TARGET = ["Loan_Status"]
FEATURES = NUMERICAL_FEATURES + CATEGORICAL_FEATURES

In [510]:
for col in NUMERICAL_FEATURES:
    print(df[col].value_counts(dropna=False))

2500    9
4583    6
6000    6
2600    6
3333    5
       ..
3244    1
4408    1
3917    1
3992    1
7583    1
Name: ApplicantIncome, Length: 505, dtype: int64
0.00        273
2,500.00      5
2,083.00      5
1,666.00      5
2,250.00      3
           ... 
2,791.00      1
1,010.00      1
1,695.00      1
2,598.00      1
240.00        1
Name: CoapplicantIncome, Length: 287, dtype: int64
NaN       22
120.00    20
110.00    17
100.00    15
160.00    12
          ..
240.00     1
214.00     1
59.00      1
166.00     1
253.00     1
Name: LoanAmount, Length: 204, dtype: int64
360.00    512
180.00     44
480.00     15
NaN        14
300.00     13
240.00      4
84.00       4
120.00      3
60.00       2
36.00       2
12.00       1
Name: Loan_Amount_Term, dtype: int64


Look at some basic statistics for numerical variables.

In [511]:
df.describe()

| |ApplicantIncome|CoapplicantIncome|LoanAmount|Loan_Amount_Term|Credit_History|
|---|---|---|---|---|---|
|count|614.0|614.0|592.0|600.0|564.0|
|mean|5403.46|1621.25|146.41|342.0|0.84|
|std|6109.04|2926.25|85.59|65.12|0.36|
|min|150.0|0.0|9.0|12.0|0.0|
|25%|2877.5|0.0|100.0|360.0|1.0|
|50%|3812.5|1188.5|128.0|360.0|1.0|
|75%|5795.0|2297.25|168.0|360.0|1.0|
|max|81000.0|41667.0|700.0|480.0|1.0|


In [512]:
df.select_dtypes("number").mode()

| |ApplicantIncome|CoapplicantIncome|LoanAmount|Loan_Amount_Term|Credit_History|
|---|---|---|---|---|---|
|0|2500|0.0|120.0|360.0|1.0|


In [513]:
df.skew(axis=0, skipna=True, numeric_only=True)

ApplicantIncome      6.54
CoapplicantIncome    7.49
LoanAmount           2.68
Loan_Amount_Term    -2.36
Credit_History      -1.88
dtype: float64

In [514]:
for col in NUMERICAL_FEATURES:
    px.histogram(df, x=col, width=800, height=500).show()

1. How many applicants have a `Credit_History`? (`Credit_History` has value 1 for those who have a credit history and 0 otherwise)
2. Is the `ApplicantIncome` distribution in line with your expectation? Similarly, what about `CoapplicantIncome`?
3. Tip: Can you see possible skewness in the data by comparing the mean of a feature to its median (the 50% figure)?
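
The tip in the last question can be checked numerically: when the mean is well above the median, the distribution is likely right-skewed. For example, `ApplicantIncome` has a mean of 5403.46 against a median of 3812.5 in the `describe()` output. A minimal sketch, using a made-up `toy` frame and a hypothetical helper `mean_median_gap`:

```python
import pandas as pd


def mean_median_gap(frame: pd.DataFrame) -> pd.DataFrame:
    """Compare mean and median per numeric column.

    A positive gap hints at right skew, a negative gap at left skew.
    """
    numeric = frame.select_dtypes("number")
    out = pd.DataFrame({"mean": numeric.mean(), "median": numeric.median()})
    out["mean_minus_median"] = out["mean"] - out["median"]
    return out


# Toy data, made up for illustration only.
toy = pd.DataFrame(
    {
        "income": [2500, 3000, 3500, 4000, 81000],  # one very large value
        "term": [360, 360, 360, 360, 180],  # one unusually small value
    }
)
gap = mean_median_gap(toy)
print(gap)
```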



Let's discuss the nominal (categorical) variables. Look at the number of unique values in each of them.

In [515]:
df.loc[:, CATEGORICAL_FEATURES].nunique(dropna=False)

Dependents        5
Education         2
Self_Employed     3
Credit_History    3
Property_Area     3
dtype: int64

Explore further using the frequency of the different categories in each nominal variable. Exclude the ID for obvious reasons.

In [516]:
for col in CATEGORICAL_FEATURES:
    px.histogram(df, x=col, width=800, height=500).show()
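
The same frequencies can be read as numbers rather than from a plot. This is a small sketch on made-up data; `category_shares` is a hypothetical helper that counts `NaN` as its own category so that missing values stay visible.

```python
import numpy as np
import pandas as pd


def category_shares(frame: pd.DataFrame, column: str) -> pd.Series:
    """Share of rows per category, counting NaN as its own category."""
    return frame[column].value_counts(dropna=False, normalize=True)


# Toy data, made up for illustration only.
toy = pd.DataFrame({"Self_Employed": ["No", "No", "Yes", np.nan, "No"]})
shares = category_shares(toy, "Self_Employed")
print(shares)
```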

### Distribution analysis

Study the distribution of the variables. Plot a histogram of `ApplicantIncome` and try different numbers of bins.



In [517]:
px.histogram(df, "ApplicantIncome", nbins=250).show()


Look at box plots to understand the distributions. 

In [518]:
for col in NUMERICAL_FEATURES:
    px.box(df, x="Dependents", y=col, points="suspectedoutliers").show()

Look at the distribution of income segregated by `Education`.

In [519]:
px.box(df, x="Education", y="ApplicantIncome", points="suspectedoutliers").show()

Look at the histogram and boxplot of LoanAmount

In [520]:
px.box(df, x="Education", y="LoanAmount", points="suspectedoutliers").show()

There might be some extreme values. Both `ApplicantIncome` and `LoanAmount` require some data munging. `LoanAmount` has missing as well as extreme values, while `ApplicantIncome` has a few extreme values that demand a deeper understanding.
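
One common way to put a number on "extreme" is the 1.5 × IQR rule that box plots are usually based on. The sketch below is an assumption about how one might flag such values, not the method used in this notebook; `iqr_outliers` and the toy values are made up for illustration.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR].

    NaNs are ignored when computing the quartiles and are never flagged.
    """
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy data, made up for illustration only.
toy = pd.Series([100.0, 120.0, 128.0, 140.0, 168.0, 700.0, None])
mask = iqr_outliers(toy)
print(toy[mask])
```

Flagging is only a first step; whether to cap, transform or keep such values depends on whether they are plausible for the problem.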

### Categorical variable analysis

Try to understand the categorical variables in more detail using `pandas.DataFrame.pivot_table` and some visualizations.

In [521]:
def count_na(x: pd.Series) -> int:
    return sum(x.isna())

def count_zero(x: pd.Series) -> int:
    return sum(x == 0)


In [522]:
with pd.option_context("display.float_format", "{:,.2f}".format, "display.max_columns", 30):
    for col in CATEGORICAL_FEATURES:
        display(
            df.pivot_table(
                index=col,
                dropna=False,
                aggfunc={x: ["count", "median", "mean", np.std, count_na, count_zero, pd.Series.skew] for x in NUMERICAL_FEATURES},
            )
        )

Dependents,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,LoanAmount,LoanAmount,LoanAmount,LoanAmount,LoanAmount,LoanAmount,LoanAmount,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term
,count,count_na,count_zero,mean,median,skew,std,count,count_na,count_zero,mean,median,skew,std,count,count_na,count_zero,mean,median,skew,std,count,count_na,count_zero,mean,median,skew,std
0,345,0,0,4917.42,3598.0,6.26,5029.42,345,0,157,1623.7,1330.0,6.61,2630.54,334,11,0,135.89,120.0,2.88,75.85,334,11,0,348.11,360.0,-2.81,57.95
1,102,0,0,5962.27,4051.5,3.47,5587.4,102,0,43,1426.24,1219.5,2.27,1830.04,98,4,0,158.62,139.0,2.24,95.1,101,1,0,329.35,360.0,-2.07,75.45
2,101,0,0,4926.78,4006.0,2.14,3153.83,101,0,37,1687.25,1387.0,4.21,2556.27,98,3,0,150.22,133.0,1.83,71.28,101,0,0,340.87,360.0,-2.42,64.91
3+,51,0,0,8581.22,4691.0,4.06,13603.94,51,0,28,2024.31,0.0,5.93,6050.79,49,2,0,190.9,130.0,2.04,134.89,50,1,0,325.2,360.0,-1.6,79.57


Education,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,LoanAmount,LoanAmount,LoanAmount,LoanAmount,LoanAmount,LoanAmount,LoanAmount,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term
,count,count_na,count_zero,mean,median,skew,std,count,count_na,count_zero,mean,median,skew,std,count,count_na,count_zero,mean,median,skew,std,count,count_na,count_zero,mean,median,skew,std
Graduate,480,0,0,5857.43,4000.0,6.02,6739.8,480,0,221,1717.47,1059.0,7.03,3230.97,465,15,0,154.06,132.0,2.46,92.88,472,8,0,344.67,360.0,-2.62,61.3
Not Graduate,134,0,0,3777.28,3357.5,3.82,2237.08,134,0,52,1276.54,1356.5,1.11,1310.34,127,7,0,118.41,115.0,0.69,39.77,128,6,0,332.16,360.0,-1.71,77.08


Self_Employed,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,LoanAmount,LoanAmount,LoanAmount,LoanAmount,LoanAmount,LoanAmount,LoanAmount,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term
,count,count_na,count_zero,mean,median,skew,std,count,count_na,count_zero,mean,median,skew,std,count,count_na,count_zero,mean,median,skew,std,count,count_na,count_zero,mean,median,skew,std
No,500,0,0,5049.75,3705.5,7.42,5682.9,500,0,213,1580.99,1293.5,6.13,2438.16,482,18,0,141.75,125.0,2.72,79.78,489,11,0,343.26,360.0,-2.36,64.7
Yes,82,0,0,7380.82,5809.0,2.47,5883.56,82,0,47,1501.34,0.0,4.07,2780.71,79,3,0,172.0,150.0,2.4,108.63,80,2,0,336.3,360.0,-2.47,69.4


Credit_History,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,LoanAmount,LoanAmount,LoanAmount,LoanAmount,LoanAmount,LoanAmount,LoanAmount,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term
,count,count_na,count_zero,mean,median,skew,std,count,count_na,count_zero,mean,median,skew,std,count,count_na,count_zero,mean,median,skew,std,count,count_na,count_zero,mean,median,skew,std
0.0,89,0,0,5679.44,3547.0,6.77,9301.9,89,0,41,1542.18,1330.0,2.19,2023.91,85,4,0,146.72,125.0,3.08,85.16,83,6,0,341.93,360.0,-1.33,66.74
1.0,475,0,0,5426.53,3859.0,5.35,5535.39,475,0,213,1528.25,1040.0,6.31,2548.73,458,17,0,144.79,128.0,2.56,83.05,467,8,0,342.19,360.0,-2.46,64.27


Property_Area,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,LoanAmount,LoanAmount,LoanAmount,LoanAmount,LoanAmount,LoanAmount,LoanAmount,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term
,count,count_na,count_zero,mean,median,skew,std,count,count_na,count_zero,mean,median,skew,std,count,count_na,count_zero,mean,median,skew,std,count,count_na,count_zero,mean,median,skew,std
Rural,179,0,0,5554.08,3975.0,8.18,6782.66,179,0,72,1645.54,1587.0,1.11,1785.1,173,6,0,152.26,133.0,2.35,80.23,175,4,0,345.53,360.0,-2.91,54.42
Semiurban,233,0,0,5292.26,3859.0,4.42,5279.63,233,0,106,1520.13,983.0,3.57,2249.36,228,5,0,145.5,127.5,2.7,81.67,230,3,0,347.11,360.0,-2.68,60.5
Urban,202,0,0,5398.25,3505.0,5.97,6392.93,202,0,95,1716.35,1007.9,7.15,4175.1,191,11,0,142.2,120.0,2.88,94.55,195,7,0,332.8,360.0,-1.8,77.39


In [523]:
with pd.option_context("display.float_format", "{:,.2f}".format, "display.max_columns", 30):
    display(df.pivot_table(index=["Education", "Self_Employed", "Dependents"], dropna=False, aggfunc={x: ["count", "median", "mean", np.std, count_na, count_zero, pd.Series.skew] for x in NUMERICAL_FEATURES}))

Education,Self_Employed,Dependents,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,ApplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,CoapplicantIncome,LoanAmount,LoanAmount,LoanAmount,LoanAmount,LoanAmount,LoanAmount,LoanAmount,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term,Loan_Amount_Term
,,,count,count_na,count_zero,mean,median,skew,std,count,count_na,count_zero,mean,median,skew,std,count,count_na,count_zero,mean,median,skew,std,count,count_na,count_zero,mean,median,skew,std
Graduate,No,0,231,0,0,4566.9,3625.0,2.72,3343.93,231,0,96,1863.7,1666.0,6.26,3029.06,224,7,0,137.16,122.0,2.44,73.67,226,5,0,352.35,360.0,-3.31,51.06
Graduate,No,1,59,0,0,6449.08,4384.0,3.34,6566.49,59,0,26,1385.81,1032.0,2.71,1954.16,58,1,0,158.57,140.0,1.77,96.19,58,1,0,334.55,360.0,-2.5,78.29
Graduate,No,2,59,0,0,5133.32,4283.0,1.94,2997.8,59,0,26,1342.23,985.8,1.75,1775.66,58,1,0,150.09,137.0,0.24,54.21,59,0,0,343.12,360.0,-2.73,63.27
Graduate,No,3+,30,0,0,11490.5,5208.5,3.06,17149.1,30,0,17,1599.27,0.0,2.25,2655.51,30,0,0,219.8,157.5,1.7,155.58,30,0,0,332.0,360.0,-2.06,62.5
Graduate,Yes,0,29,0,0,8282.69,6400.0,2.6,7668.94,29,0,21,1003.52,0.0,1.44,1768.84,29,0,0,158.03,141.0,3.49,107.96,29,0,0,341.38,360.0,-1.99,68.23
Graduate,Yes,1,17,0,0,7682.41,7787.0,0.81,5153.67,17,0,10,1131.0,0.0,1.23,1605.62,15,2,0,193.6,155.0,2.08,139.15,17,0,0,335.29,360.0,-2.38,60.22
Graduate,Yes,2,14,0,0,6861.93,5873.0,1.43,4802.81,14,0,5,3695.14,1691.5,2.36,5408.34,13,1,0,223.31,176.0,0.98,124.98,14,0,0,327.43,360.0,-2.77,78.11
Graduate,Yes,3+,4,0,0,7609.75,7517.0,0.06,2495.24,4,0,2,799.5,712.0,0.12,934.18,4,0,0,209.75,223.5,-0.7,85.35,4,0,0,360.0,360.0,0.0,0.0
Not Graduate,No,0,58,0,0,3373.33,3013.5,1.17,1369.58,58,0,26,1231.66,1164.0,0.64,1321.04,55,3,0,110.53,110.0,0.43,39.1,54,4,0,338.44,360.0,-2.17,75.62
Not Graduate,No,1,17,0,0,3571.94,3399.0,0.89,1314.42,17,0,3,1972.71,1500.0,1.72,1822.58,16,1,0,136.38,133.0,1.89,47.27,17,0,0,300.0,360.0,-0.75,84.85


In [524]:
px.histogram(df, x="Loan_Status").show()
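
To test a hypothesis such as "applicants with a credit history are more likely to get a loan", the approval rate per category is more informative than the overall count. A minimal sketch with `pd.crosstab` on made-up data (the toy values are not taken from the dataset):

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame(
    {
        "Credit_History": [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
        "Loan_Status": ["Y", "Y", "N", "N", "N", "Y"],
    }
)

# normalize="index" turns each row into shares that sum to 1,
# so the "Y" column is the approval rate within each category.
rates = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")
print(rates)
```

On the real data, the same call with `df` in place of `toy` would give the approval rate for each `Credit_History` value.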

## 3. Data Cleaning

This step typically involves imputing missing values and treating outliers. 

### Imputing Missing Values

Missing values are not always NaNs. For instance, a `Loan_Amount_Term` of 0 would not make sense and should be treated as missing.
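
In this data the minimum `Loan_Amount_Term` reported by `describe()` is 12, so there are no such zeros here, but the check is cheap. The sketch below shows one way to convert disguised zeros to NaN for selected columns only; `zeros_to_nan` is a hypothetical helper and the toy values are made up. Note that a zero `CoapplicantIncome` is legitimate and is left alone.

```python
import numpy as np
import pandas as pd


def zeros_to_nan(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy where zeros in the given columns are replaced by NaN."""
    out = frame.copy()
    out[columns] = out[columns].replace(0, np.nan)
    return out


# Toy data, made up for illustration only.
toy = pd.DataFrame(
    {
        "Loan_Amount_Term": [360.0, 0.0, 180.0],
        "CoapplicantIncome": [0.0, 1508.0, 0.0],
    }
)
cleaned = zeros_to_nan(toy, ["Loan_Amount_Term"])
print(cleaned)
```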



Impute missing values for all columns. Use the values you find most meaningful (mean, mode, median, zero, or perhaps different means for different groups).

In [525]:
# Because the data is skewed, the median is used instead of the mean.
# LoanAmount appears to vary most across the Education, Self_Employed and Dependents groups,
# so the median is taken within each combination of those groups.
mapLoanAmount = (
    df.groupby(by=["Education", "Self_Employed", "Dependents"])["LoanAmount"]
    .median()
    .to_dict()
)

In [526]:
def fillna_group_median(
    df: pd.DataFrame, group: list[str], fill_column: str
) -> pd.DataFrame:
    """
    Return a copy of df with NaNs in fill_column replaced by the median of
    fill_column within each group. Rows whose group keys are missing stay NaN.
    """
    print(df.shape)
    return_df = df.copy()
    group_values = df.groupby(by=group)[fill_column].median().to_dict()
    isna_mask = df[fill_column].isna()

    for key, value in group_values.items():
        groupna_mask = (df[group] == key).all(axis=1) & isna_mask
        # Assign to fill_column only; indexing with the mask alone would
        # overwrite every column of the matching rows.
        return_df.loc[groupna_mask, fill_column] = value

    return return_df

In [527]:
df.isna().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [528]:
filled = fillna_group_median(df, group=["Education", "Self_Employed", "Dependents"], fill_column="LoanAmount")
filled.loc[filled["LoanAmount"].isna()]

(614, 13)


| |Loan_ID|Gender|Married|Dependents|Education|Self_Employed|ApplicantIncome|CoapplicantIncome|LoanAmount|Loan_Amount_Term|Credit_History|Property_Area|Loan_Status|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|95|LP001326|Male|No|0.0|Graduate|NaN|6782.0|0.0|NaN|360.0|NaN|Urban|N|
|102|LP001350|Male|Yes|NaN|Graduate|No|13650.0|0.0|NaN|360.0|1.0|Urban|Y|
|435|LP002393|Female|NaN|NaN|Graduate|No|10047.0|0.0|NaN|240.0|1.0|Semiurban|Y|
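
The rows that remain unfilled are those where one of the grouping columns is itself missing, so no group median applies to them. A more compact alternative, sketched here under the assumption of a recent pandas version (1.5 or later), uses `groupby(...).transform("median")` and falls back to the overall median for such rows. `fill_group_median` and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd


def fill_group_median(frame: pd.DataFrame, group: list[str], column: str) -> pd.Series:
    """Fill NaNs in `column` with the group median, then with the overall median."""
    # transform returns one value per row, aligned with the original index;
    # rows with a missing group key get NaN here.
    group_median = frame.groupby(group)[column].transform("median")
    return frame[column].fillna(group_median).fillna(frame[column].median())


# Toy data, made up for illustration only.
toy = pd.DataFrame(
    {
        "Education": ["Graduate", "Graduate", "Graduate", "Not Graduate", "Not Graduate", "Graduate"],
        "Self_Employed": ["No", "No", "No", "No", "No", np.nan],
        "LoanAmount": [100.0, 140.0, np.nan, 90.0, np.nan, np.nan],
    }
)
filled = fill_group_median(toy, ["Education", "Self_Employed"], "LoanAmount")
print(filled)
```

The last toy row has no `Self_Employed` value, so it receives the overall median instead of a group median.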


### Extreme values
Try a log transformation to reduce the influence of the extreme values in `LoanAmount`. Plot the histogram before and after the transformation.

In [529]:
logLoanAmount = df["LoanAmount"].apply(np.log)
px.histogram(logLoanAmount).show()
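
Besides comparing histograms, the effect of the transformation can be summarized by the skewness before and after. A minimal sketch on made-up values (not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only: mostly moderate values plus one large one.
toy = pd.Series([66.0, 100.0, 120.0, 128.0, 141.0, 168.0, 700.0], name="LoanAmount")

logged = np.log(toy)  # natural log; requires strictly positive values

skew_before = toy.skew()
skew_after = logged.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The log compresses large values more than small ones, which pulls the long right tail in without removing any rows.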

Combine both incomes as total income and take a log transformation of the same.

In [530]:
totalIncome = (df["ApplicantIncome"] + df["CoapplicantIncome"]).apply(np.log)
px.histogram(totalIncome).show()

## 4. Building a Predictive Model

### Preprocessing
#### Categorical features:
1. Fill missing values
    1. Dependents = mode
    2. Self_Employed = mode
    3. Credit_History = mode
2. OneHotEncoder

#### Numerical features:
1. Fill missing values with the median
    1. LoanAmount = group median
    2. Loan_Amount_Term = median
2. Create totalIncome column (ApplicantIncome + CoapplicantIncome)
3. Log-transform totalIncome
4. StandardScaler all columns

### Main Pipeline
1. FeatureUnion categorical features and numerical features.
2. Classifiers:
    1. RandomForestClassifier
    2. LogisticRegression
3. `GridSearchCV` for cross-validation and hyperparameter tuning

In [531]:
import numpy.typing as npt
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV

### Test / Train split

In [532]:
y = df["Loan_Status"].map({"Y": 1, "N": 0})
X = df.drop(columns=["Loan_Status"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
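
The split above is random on every run, so scores will vary between runs. If reproducibility matters, `train_test_split` accepts `random_state`, and `stratify` keeps the class proportions similar in both parts. A small sketch on made-up data (the seed `42` and the toy frame are arbitrary choices, not from the notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration only: 14 approvals and 6 rejections.
toy_X = pd.DataFrame({"ApplicantIncome": range(20)})
toy_y = pd.Series([1] * 14 + [0] * 6)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X,
    toy_y,
    test_size=0.3,
    stratify=toy_y,  # keep the share of approvals similar in train and test
    random_state=42,  # fixed seed so the split is repeatable
)
print(len(X_tr), len(X_te), y_te.mean())
```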

### Custom transformers

In [533]:
# class CreateTotalIncome():
#     def fit(self, X, y=None):
#         return self
    
#     def transform(self, X: npt.NDArray):
#         total_income = (X[:, 0] + X[:, 1]).reshape(-1, 1)
#         X = np.concatenate((X, total_income), axis=1)
#         return total_income.reshape(-1, 1)

from custom_transformers import CreateTotalIncome
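
The `custom_transformers` module is not shown in this notebook. As an assumption about what it contains, a transformer that sums the two income columns into a single column could look like the sketch below. `TotalIncomeSketch` is a hypothetical stand-in, not the actual `CreateTotalIncome` class.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class TotalIncomeSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: sums the columns of a 2-D array row by row."""

    def fit(self, X, y=None):
        # Stateless transformer: nothing is learned from the data.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Return a single column so later steps still receive a 2-D array.
        return X.sum(axis=1).reshape(-1, 1)


# Applicant and coapplicant income of the first two rows shown in df.head().
incomes = np.array([[5849.0, 0.0], [4583.0, 1508.0]])
total = TotalIncomeSketch().fit_transform(incomes)
print(total)
```

Inheriting from `BaseEstimator` and `TransformerMixin` provides `get_params`, `set_params` and `fit_transform`, which pipelines and grid search rely on.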

### Pipelines

In [534]:
fillna_pipe = Pipeline(
    [
        (
            "fillna",
            ColumnTransformer(
                [
                    (
                        "categorical",
                        SimpleImputer(strategy="most_frequent"),
                        CATEGORICAL_FEATURES,
                    ),
                    ("numerical", SimpleImputer(strategy="mean"), NUMERICAL_FEATURES),
                ],
                remainder="drop",
            ),
        )
    ]
)

categorical_pipe = Pipeline(
    [
        (
            "fillna",
            ColumnTransformer(
                [
                    (
                        "categorical",
                        SimpleImputer(strategy="most_frequent"),
                        CATEGORICAL_FEATURES,
                    ),
                ],
                remainder="drop",
            ),
        ),
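        # On scikit-learn >= 1.2 pass sparse_output=False instead; the `sparse` argument was renamed and later removed.
        ("one-hot-encoding", OneHotEncoder(drop="if_binary", sparse=False)),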
    ]
)

totalIncome_pipe = Pipeline(
    [
        (
            "fillna",
            ColumnTransformer(
                [
                    (
                        "income",
                        SimpleImputer(strategy="mean"),
                        ["ApplicantIncome", "CoapplicantIncome"],
                    ),
                ],
                remainder="drop",
            ),
        ),
        ("create_feature", CreateTotalIncome()),
        ("log_transform", FunctionTransformer(np.log)),
        ("scale", StandardScaler()),
    ]
)

other_numerical_columns = Pipeline(
    [
        (
            "fillna",
            ColumnTransformer(
                [
                    (
                        "other_numerical",
                        SimpleImputer(strategy="median"),
                        ["LoanAmount", "Loan_Amount_Term"],
                    ),
                ],
                remainder="drop",
            ),
        ),
        ("scale", StandardScaler()),
    ]
)

rfc = RandomForestClassifier(max_depth=5, n_jobs=-1)
lr = LogisticRegression(n_jobs=-1)

classifier_pipe = Pipeline(
    [
        ("join_features", FeatureUnion([("categorical", categorical_pipe), ("other_numerical", other_numerical_columns), ("total_income", totalIncome_pipe)])),
        ("classifier", rfc),
    ]
)

In [535]:
classifier_pipe.fit(X_train, y_train)

In [536]:
classifier_pipe.score(X_test, y_test)

0.7783783783783784
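
For a classifier pipeline, `score` returns the mean accuracy. Accuracy is easier to interpret next to a trivial baseline that always predicts the most frequent class, especially if the classes are imbalanced. A minimal sketch on made-up labels (the 7:3 ratio is an arbitrary example, not the ratio in this dataset):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels, made up for illustration only: 7 approvals, 3 rejections.
y_true = np.array([1] * 7 + [0] * 3)
X_dummy = np.zeros((10, 1))  # features are ignored by the dummy classifier

baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_acc = accuracy_score(y_true, baseline.predict(X_dummy))
print(baseline_acc)
```

On the real data, fitting the dummy classifier on `X_train, y_train` and scoring it on `X_test, y_test` gives the number the pipeline has to beat.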

Try a parameter grid search to improve the results.

In [537]:
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

params = {
    "classifier": [
        KNeighborsClassifier(3),
        SVC(kernel="linear", C=0.025),
        SVC(gamma=2, C=1),
        GaussianProcessClassifier(1.0 * RBF(1.0)),
        RandomForestClassifier(max_depth=5),
        AdaBoostClassifier(),
        GaussianNB(),
    ],
}

grid = GridSearchCV(classifier_pipe, param_grid=params, n_jobs=-1, scoring="roc_auc", cv=10)
grid.fit(X_train, y_train)

In [538]:
print(grid.best_score_)
print(grid.best_params_)

0.7606134546457126
{'classifier': GaussianNB()}
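
Note that `best_score_` here is the cross-validated ROC AUC on the training data, so it is not directly comparable to the accuracy of 0.778 reported earlier. The grid above only swaps the classifier with default settings. Hyperparameters of each candidate can be tuned in the same search by using a list of grids and the `step__parameter` naming convention. The sketch below uses synthetic data from `make_classification` and a simplified two-step pipeline; the parameter values are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("classifier", LogisticRegression())])

# One dict per classifier; "classifier__<name>" reaches into the pipeline step.
param_grid = [
    {
        "classifier": [LogisticRegression(max_iter=1000)],
        "classifier__C": [0.1, 1.0, 10.0],
    },
    {
        "classifier": [RandomForestClassifier(n_estimators=50, random_state=0)],
        "classifier__max_depth": [3, 5],
    },
]

search = GridSearchCV(pipe, param_grid=param_grid, scoring="roc_auc", cv=5)
search.fit(X_toy, y_toy)
print(search.best_params_)
```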


## 5. Using Pipeline
If you haven't used pipelines so far, move your data preparation, feature engineering and modeling steps into a `Pipeline`. This will be helpful for deployment.

The goal here is to create the pipeline that will take one row of our dataset and predict the probability of being granted a loan.

`pipeline.predict(x)`

In [539]:
sample = X_test.sample()
print(grid.predict_proba(sample))
print(grid.predict(sample))

[[1.00000000e+00 1.63207034e-20]]
[0]
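
The columns of `predict_proba` follow the order of the fitted classifier's `classes_`, so with the 0/1 encoding used here the second column is the probability of approval. Gaussian naive Bayes often returns probabilities very close to 0 or 1, as in the output above, so they are best read as a ranking rather than as calibrated probabilities. A minimal sketch on made-up one-feature data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data, made up for illustration only.
X_toy = np.array([[0.0], [0.2], [0.9], [1.0]])
y_toy = np.array([0, 0, 1, 1])

model = GaussianNB().fit(X_toy, y_toy)

row = np.array([[0.95]])  # a single sample must still be 2-D
proba = model.predict_proba(row)

# Look up the column for class 1 instead of assuming its position.
approve_col = list(model.classes_).index(1)
p_approve = proba[0, approve_col]
print(p_approve, model.predict(row))
```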


## 6. Deploy your model to the cloud and test it with Postman, Bash or Python

In [540]:
import joblib

In [541]:
with open("../src/model.joblib", "wb") as f:
    joblib.dump(grid, f)    
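
Before deploying, it is worth checking that the saved file loads and predicts the same as the in-memory model. The sketch below does a round trip with a small stand-in model in a temporary directory; the model and data are made up for illustration. Keep in mind that joblib stores references to classes by module path, so any custom module such as `custom_transformers` must be importable wherever the model is loaded.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy model, made up for illustration only.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.joblib"
    joblib.dump(model, path)  # joblib accepts a path as well as a file object
    restored = joblib.load(path)

same = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same)
```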