# Loan Defaulter Detection

### A. Import dependencies

In [1]:
import pandas as pd
from matplotlib import pyplot

### B. Import / Explore the Dataset
Make sure Loan_Default.csv is in the root directory before running this cell. The link to the dataset's Kaggle page is in the README file.

In [2]:
df = pd.read_csv("Loan_Default.csv")

# Let's take a look at the first 5 rows
df.head()

Unnamed: 0,ID,year,loan_limit,Gender,approv_in_adv,loan_type,loan_purpose,Credit_Worthiness,open_credit,business_or_commercial,...,credit_type,Credit_Score,co-applicant_credit_type,age,submission_of_application,LTV,Region,Security_Type,Status,dtir1
0,24890,2019,cf,Sex Not Available,nopre,type1,p1,l1,nopc,nob/c,...,EXP,758,CIB,25-34,to_inst,98.728814,south,direct,1,45.0
1,24891,2019,cf,Male,nopre,type2,p1,l1,nopc,b/c,...,EQUI,552,EXP,55-64,to_inst,,North,direct,1,
2,24892,2019,cf,Male,pre,type1,p1,l1,nopc,nob/c,...,EXP,834,CIB,35-44,to_inst,80.019685,south,direct,0,46.0
3,24893,2019,cf,Male,nopre,type1,p4,l1,nopc,nob/c,...,EXP,587,CIB,45-54,not_inst,69.3769,North,direct,0,42.0
4,24894,2019,cf,Joint,pre,type1,p1,l1,nopc,nob/c,...,CRIF,602,EXP,25-34,not_inst,91.886544,North,direct,0,39.0


### C. Data Cleaning / Preparation

First, look at the column breakdown for the dataset to see what we are dealing with.

In [5]:
# Show column information: the number of non-null rows and the data type of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148670 entries, 0 to 148669
Data columns (total 34 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   ID                         148670 non-null  int64  
 1   year                       148670 non-null  int64  
 2   loan_limit                 145326 non-null  object 
 3   Gender                     148670 non-null  object 
 4   approv_in_adv              147762 non-null  object 
 5   loan_type                  148670 non-null  object 
 6   loan_purpose               148536 non-null  object 
 7   Credit_Worthiness          148670 non-null  object 
 8   open_credit                148670 non-null  object 
 9   business_or_commercial     148670 non-null  object 
 10  loan_amount                148670 non-null  int64  
 11  rate_of_interest           112231 non-null  float64
 12  Interest_rate_spread       112031 non-null  float64
 13  Upfront_charges            10

From this output, we can see that several columns contain null values. Training a model on incomplete rows can negatively affect results, so we drop every row that contains at least one null.
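As a complement to reading the df.info() output by eye, the null counts can be computed directly. The sketch below is a minimal illustration on a small stand-in DataFrame; toy_df and its values are made up for this example and are not taken from Loan_Default.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "loan_amount": [406500, 696500, 706500, 346500],
    "rate_of_interest": [4.56, np.nan, 3.99, np.nan],
    "loan_limit": ["cf", "cf", None, "ncf"],
})

# Number of missing values per column, largest first.
null_counts = toy_df.isna().sum().sort_values(ascending=False)

# Share of rows that have at least one missing value.
rows_with_nulls = toy_df.isna().any(axis=1).mean()

print(null_counts)
print(f"{rows_with_nulls:.0%} of rows contain at least one null")
```

Running the same two expressions on df should show which columns account for most of the missing data before deciding to drop rows.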

Remove rows that contain null values

In [7]:
no_na_df = df.dropna()
# All columns in this new dataframe should have the same number of non-null rows
no_na_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 98187 entries, 2 to 148669
Data columns (total 34 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   ID                         98187 non-null  int64  
 1   year                       98187 non-null  int64  
 2   loan_limit                 98187 non-null  object 
 3   Gender                     98187 non-null  object 
 4   approv_in_adv              98187 non-null  object 
 5   loan_type                  98187 non-null  object 
 6   loan_purpose               98187 non-null  object 
 7   Credit_Worthiness          98187 non-null  object 
 8   open_credit                98187 non-null  object 
 9   business_or_commercial     98187 non-null  object 
 10  loan_amount                98187 non-null  int64  
 11  rate_of_interest           98187 non-null  float64
 12  Interest_rate_spread       98187 non-null  float64
 13  Upfront_charges            98187 non-null  float64
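Instead of comparing the non-null counts line by line, the result of dropna() can be checked programmatically. This is a small sketch under the assumption that a toy DataFrame behaves like the real one; toy_df and toy_no_na are hypothetical names, with a few ID, LTV, and dtir1 values copied from the first rows shown earlier.

```python
import numpy as np
import pandas as pd

# Toy stand-in built from a few values in the df.head() output.
toy_df = pd.DataFrame({
    "ID": [24890, 24891, 24892, 24893],
    "LTV": [98.728814, np.nan, 80.019685, 69.3769],
    "dtir1": [45.0, np.nan, 46.0, 42.0],
})
toy_no_na = toy_df.dropna()

# True only if no column has any missing value left.
is_complete = bool(toy_no_na.notna().all().all())

# Fraction of the original rows that survived dropna().
retained = len(toy_no_na) / len(toy_df)

print(is_complete, f"{retained:.0%} of rows retained")
```

In the notebook's run, 98187 of 148670 rows remain after dropna(), which is roughly 66% of the original data.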

View the first 5 rows again (just to revisit the data)

In [8]:
no_na_df.head()

Unnamed: 0,ID,year,loan_limit,Gender,approv_in_adv,loan_type,loan_purpose,Credit_Worthiness,open_credit,business_or_commercial,...,credit_type,Credit_Score,co-applicant_credit_type,age,submission_of_application,LTV,Region,Security_Type,Status,dtir1
2,24892,2019,cf,Male,pre,type1,p1,l1,nopc,nob/c,...,EXP,834,CIB,35-44,to_inst,80.019685,south,direct,0,46.0
4,24894,2019,cf,Joint,pre,type1,p1,l1,nopc,nob/c,...,CRIF,602,EXP,25-34,not_inst,91.886544,North,direct,0,39.0
5,24895,2019,cf,Joint,pre,type1,p1,l1,nopc,nob/c,...,EXP,864,EXP,35-44,not_inst,70.089286,North,direct,0,40.0
6,24896,2019,cf,Joint,pre,type1,p3,l1,nopc,nob/c,...,EXP,860,EXP,55-64,to_inst,79.109589,North,direct,0,44.0
8,24898,2019,cf,Joint,nopre,type1,p3,l1,nopc,nob/c,...,CIB,580,EXP,55-64,to_inst,78.76569,central,direct,0,44.0


Before visualizing anything, we can one-hot encode the data. This replaces each categorical column with a set of indicator (True/False) columns, one per category, so that ML models can work with the data.
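To make the idea concrete, here is a minimal sketch of one-hot encoding a single column with pd.get_dummies. The DataFrame toy_df is a hypothetical three-row sample using loan_amount and Region values that appear in the rows shown above.

```python
import pandas as pd

# Three rows with one numeric and one categorical column.
toy_df = pd.DataFrame({
    "loan_amount": [406500, 696500, 376500],
    "Region": ["south", "North", "central"],
})

# Replace "Region" with one indicator column per distinct value.
toy_encoded = pd.get_dummies(toy_df, columns=["Region"])

print(toy_encoded.columns.tolist())
print(toy_encoded)
```

Each new column is named after the original column and the category, and exactly one of them is set in every row. Columns not listed in the columns argument, such as loan_amount here, pass through unchanged.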

The categorical columns (names written exactly as they appear in the dataframe) are:
1. loan_limit
2. Gender
3. approv_in_adv
4. loan_type
5. loan_purpose
6. Credit_Worthiness
7. open_credit
8. business_or_commercial
9. Neg_ammortization
10. lump_sum_payment
11. construction_type
12. occupancy_type
13. Secured_by
14. total_units
15. credit_type
16. co-applicant_credit_type
17. age
18. submission_of_application
19. Region
20. Security_Type
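A hand-written list like the one above is easy to get slightly wrong, so it may help to cross-check it against the column dtypes. The following is a sketch on a hypothetical toy_df; manual_list is an illustrative stand-in for whatever list is passed to the encoder, not the notebook's actual list.

```python
import pandas as pd

# Toy stand-in with two text columns and two numeric columns.
toy_df = pd.DataFrame({
    "loan_amount": [406500, 696500],
    "Gender": ["Male", "Joint"],
    "interest_only": ["not_int", "not_int"],
    "Status": [0, 0],
})

# Keep every column that is neither numeric nor boolean,
# which in practice means the text (categorical) columns.
text_columns = toy_df.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# Hypothetical hand-written list that forgets one column.
manual_list = ["Gender"]

# Text columns that the manual list would leave unencoded.
not_listed = sorted(set(text_columns) - set(manual_list))

print(text_columns)
print(not_listed)
```

Any name reported in not_listed would stay as text after encoding. In the notebook's own encoded output, interest_only still holds the value not_int, which suggests it is one such column.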

In [14]:
columns_to_encode = [
    'loan_limit',
    'Gender',
    'approv_in_adv',
    'loan_type',
    'loan_purpose',
    'Credit_Worthiness',
    'open_credit',
    'business_or_commercial',
    'Neg_ammortization',
    'lump_sum_payment',
    'construction_type',
    'occupancy_type',
    'Secured_by',
    'total_units',
    'credit_type',
    'co-applicant_credit_type',
    'age',
    'submission_of_application',
    'Region',
    'Security_Type',
]

# Perform the encoding and view the new df
one_hot_encoded_df = pd.get_dummies(no_na_df, columns=columns_to_encode)
one_hot_encoded_df.head()

Unnamed: 0,ID,year,loan_amount,rate_of_interest,Interest_rate_spread,Upfront_charges,term,interest_only,property_value,income,...,age_65-74,age_<25,age_>74,submission_of_application_not_inst,submission_of_application_to_inst,Region_North,Region_North-East,Region_central,Region_south,Security_Type_direct
2,24892,2019,406500,4.56,0.2,595.0,360.0,not_int,508000.0,9480.0,...,False,False,False,False,True,False,False,False,True,True
4,24894,2019,696500,4.0,0.3042,0.0,360.0,not_int,758000.0,10440.0,...,False,False,False,True,False,True,False,False,False,True
5,24895,2019,706500,3.99,0.1523,370.0,360.0,not_int,1008000.0,10080.0,...,False,False,False,True,False,True,False,False,False,True
6,24896,2019,346500,4.5,0.9998,5120.0,360.0,not_int,438000.0,5040.0,...,False,False,False,False,True,True,False,False,False,True
8,24898,2019,376500,4.875,0.7395,1150.0,360.0,not_int,478000.0,5580.0,...,False,False,False,False,True,False,False,True,False,True
