# Predicting Loan Default: Part 1

### Background

Banks lose money when customers do not repay their loans on time. These losses run into the millions every year and can weigh heavily on a country's economic growth. Predicting at the application stage whether a customer will default would let a bank avoid much of this loss before the loan is approved. If the prediction is wrong, however, the bank risks turning away good business.

### Problem Statement

Using the given dataset, this project has three primary objectives:
1. To use the information given and quantify feature importance in order to predict loan defaults accurately.
2. To engineer features that help predict loan defaults better.
3. To act as a stepping stone toward better models that address this problem for banks and reduce their monetary losses.


### Metrics

I will use accuracy, F1, and log loss to evaluate the models in this project.

1. Accuracy: Banks need to predict potential defaults accurately and quickly. Correctly identifying both good business and likely losses is essential.
2. F1: False negatives and false positives both matter when loan approvals are based on the predictions. F1 balances precision and recall, which is especially useful on an imbalanced dataset like this one.
3. Log Loss: Log loss indicates how close the predicted probability is to the actual value (0 or 1 in binary classification). The further the predicted probability diverges from the actual value, the higher the log loss. It was also the metric used to evaluate a MachineHack competition, so it is a good way to benchmark my model against others who have tried this.
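
As a minimal sketch of how the three metrics are computed with scikit-learn, the example below uses made-up labels and probabilities (they are not taken from the dataset). The "model" here never predicts a default, which shows how the metrics can disagree with each other.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss

# Toy labels, invented for illustration: 9 non-defaults (0) and 1 default (1)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# A "model" that never predicts default and assigns every loan
# a constant 10% probability of default
y_pred = np.zeros_like(y_true)
y_prob = np.full(len(y_true), 0.1)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when there are no positive predictions
f1 = f1_score(y_true, y_pred, zero_division=0)
# A 1-D probability array is interpreted as P(class 1)
ll = log_loss(y_true, y_prob)

print(f"accuracy={accuracy:.3f}, f1={f1:.3f}, log_loss={ll:.3f}")
```

Accuracy comes out at 0.9 even though no default is ever caught, while F1 is 0. This is the reason for reporting all three scores rather than accuracy alone.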

In [1]:
# Import the basic libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, f1_score, log_loss


import re

%matplotlib inline

In [2]:
# Read in the dataset
train = pd.read_csv('data/train.csv')

In [3]:
dataframes = [train]

# Converting column names to snake case
def camel_to_snake(name):
    name = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', name).lower()

for df in dataframes:
    df.columns = [camel_to_snake(col) for col in df.columns]

In [4]:
# remove spaces in the column names for train
train.columns = train.columns.map(lambda x: x.replace(' ', ''))
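
To see what cells 3 and 4 do together, the sketch below applies the same function and the same space removal to a few raw column names. The raw names are assumptions, since the original headers are not shown in this notebook. They were chosen because they reproduce cleaned names that appear in the output of cell 5.

```python
import re

def camel_to_snake(name):
    # Insert an underscore before a capitalised word that follows any character
    name = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    # Insert an underscore between a lowercase letter or digit and an uppercase letter
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', name).lower()

# Hypothetical raw headers; the real ones are not shown in this notebook
raw_names = ['Loan Amount', 'Debit to Income', 'Delinquency - two years']

# Same two steps as above: snake-case conversion, then space removal
cleaned = [camel_to_snake(name).replace(' ', '') for name in raw_names]
print(cleaned)
```

Under these assumed inputs the result is `['loan_amount', 'debitto_income', 'delinquency-twoyears']`. The hyphen survives both steps, which is why some column names still contain `-` at this point.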

In [5]:
# initial observations for train data
pd.set_option('display.max_columns',None)
print(train.shape, train['id'].nunique())
print(train.columns)
train.head()

(67463, 35) 67463
Index(['id', 'loan_amount', 'funded_amount', 'funded_amount_investor', 'term',
       'batch_enrolled', 'interest_rate', 'grade', 'sub_grade',
       'employment_duration', 'home_ownership', 'verification_status',
       'payment_plan', 'loan_title', 'debitto_income', 'delinquency-twoyears',
       'inquires-sixmonths', 'open_account', 'public_record',
       'revolving_balance', 'revolving_utilities', 'total_accounts',
       'initial_list_status', 'total_received_interest',
       'total_received_late_fee', 'recoveries', 'collection_recovery_fee',
       'collection12months_medical', 'application_type', 'lastweek_pay',
       'accounts_delinquent', 'total_collection_amount',
       'total_current_balance', 'total_revolving_credit_limit', 'loan_status'],
      dtype='object')


Unnamed: 0,id,loan_amount,funded_amount,funded_amount_investor,term,batch_enrolled,interest_rate,grade,sub_grade,employment_duration,home_ownership,verification_status,payment_plan,loan_title,debitto_income,delinquency-twoyears,inquires-sixmonths,open_account,public_record,revolving_balance,revolving_utilities,total_accounts,initial_list_status,total_received_interest,total_received_late_fee,recoveries,collection_recovery_fee,collection12months_medical,application_type,lastweek_pay,accounts_delinquent,total_collection_amount,total_current_balance,total_revolving_credit_limit,loan_status
0,65087372,10000,32236,12329.36286,59,BAT2522922,11.135007,B,C4,MORTGAGE,176346.6267,Not Verified,n,Debt Consolidation,16.284758,1,0,13,0,24246,74.932551,7,w,2929.646315,0.102055,2.498291,0.793724,0,INDIVIDUAL,49,0,31,311301,6619,0
1,1450153,3609,11940,12191.99692,59,BAT1586599,12.237563,C,D3,RENT,39833.921,Source Verified,n,Debt consolidation,15.412409,0,0,12,0,812,78.297186,13,f,772.769385,0.036181,2.377215,0.974821,0,INDIVIDUAL,109,0,53,182610,20885,0
2,1969101,28276,9311,21603.22455,59,BAT2136391,12.545884,F,D4,MORTGAGE,91506.69105,Source Verified,n,Debt Consolidation,28.137619,0,0,14,0,1843,2.07304,20,w,863.324396,18.77866,4.316277,1.020075,0,INDIVIDUAL,66,0,34,89801,26155,0
3,6651430,11170,6954,17877.15585,59,BAT2428731,16.731201,C,C3,MORTGAGE,108286.5759,Source Verified,n,Debt consolidation,18.04373,1,0,7,0,13819,67.467951,12,w,288.173196,0.044131,0.10702,0.749971,0,INDIVIDUAL,39,0,40,9189,60214,0
4,14354669,16890,13226,13539.92667,59,BAT5341619,15.0083,C,D4,MORTGAGE,44234.82545,Source Verified,n,Credit card refinancing,17.209886,1,3,13,1,1544,85.250761,22,w,129.239553,19.306646,1294.818751,0.368953,0,INDIVIDUAL,18,0,430,126029,22579,0


In [6]:
# check for duplicates
train.duplicated().value_counts()

False    67463
dtype: int64

In [7]:
# preview the data with id as the index (set_index returns a new DataFrame, so train itself is unchanged)
train.set_index('id')

id,loan_amount,funded_amount,funded_amount_investor,term,batch_enrolled,interest_rate,grade,sub_grade,employment_duration,home_ownership,verification_status,payment_plan,loan_title,debitto_income,delinquency-twoyears,inquires-sixmonths,open_account,public_record,revolving_balance,revolving_utilities,total_accounts,initial_list_status,total_received_interest,total_received_late_fee,recoveries,collection_recovery_fee,collection12months_medical,application_type,lastweek_pay,accounts_delinquent,total_collection_amount,total_current_balance,total_revolving_credit_limit,loan_status
65087372,10000,32236,12329.36286,59,BAT2522922,11.135007,B,C4,MORTGAGE,176346.62670,Not Verified,n,Debt Consolidation,16.284758,1,0,13,0,24246,74.932551,7,w,2929.646315,0.102055,2.498291,0.793724,0,INDIVIDUAL,49,0,31,311301,6619,0
1450153,3609,11940,12191.99692,59,BAT1586599,12.237563,C,D3,RENT,39833.92100,Source Verified,n,Debt consolidation,15.412409,0,0,12,0,812,78.297186,13,f,772.769385,0.036181,2.377215,0.974821,0,INDIVIDUAL,109,0,53,182610,20885,0
1969101,28276,9311,21603.22455,59,BAT2136391,12.545884,F,D4,MORTGAGE,91506.69105,Source Verified,n,Debt Consolidation,28.137619,0,0,14,0,1843,2.073040,20,w,863.324396,18.778660,4.316277,1.020075,0,INDIVIDUAL,66,0,34,89801,26155,0
6651430,11170,6954,17877.15585,59,BAT2428731,16.731201,C,C3,MORTGAGE,108286.57590,Source Verified,n,Debt consolidation,18.043730,1,0,7,0,13819,67.467951,12,w,288.173196,0.044131,0.107020,0.749971,0,INDIVIDUAL,39,0,40,9189,60214,0
14354669,16890,13226,13539.92667,59,BAT5341619,15.008300,C,D4,MORTGAGE,44234.82545,Source Verified,n,Credit card refinancing,17.209886,1,3,13,1,1544,85.250761,22,w,129.239553,19.306646,1294.818751,0.368953,0,INDIVIDUAL,18,0,430,126029,22579,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16164945,13601,6848,13175.28583,59,BAT3193689,9.408858,C,A4,MORTGAGE,83961.15003,Verified,n,Credit card refinancing,28.105127,1,0,13,0,4112,97.779389,19,w,1978.945960,0.023478,564.614852,0.865230,0,INDIVIDUAL,69,0,48,181775,34301,1
35182714,8323,11046,15637.46301,59,BAT1780517,9.972104,C,B3,RENT,65491.12817,Source Verified,n,Credit card refinancing,17.694279,0,0,12,0,9737,15.690703,14,w,3100.803125,0.027095,2.015494,1.403368,0,INDIVIDUAL,14,0,37,22692,8714,0
16435904,15897,32921,12329.45775,59,BAT1761981,19.650943,A,F3,MORTGAGE,34813.96985,Verified,n,Lending loan,10.295774,0,0,7,1,2195,1.500090,9,w,2691.995532,0.028212,5.673092,1.607093,0,INDIVIDUAL,137,0,17,176857,42330,0
5300325,16567,4975,21353.68465,59,BAT2333412,13.169095,D,E3,OWN,96938.83564,Not Verified,n,Debt consolidation,7.614624,0,0,14,0,1172,68.481882,15,f,3659.334202,0.074508,1.157454,0.207608,0,INDIVIDUAL,73,0,61,361339,39075,0
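
Note that `set_index` returns a new DataFrame by default. The cell above therefore only displays a re-indexed view, and `id` remains an ordinary column of `train` afterwards. A small sketch with a made-up frame:

```python
import pandas as pd

# Made-up frame for illustration
demo = pd.DataFrame({'id': [101, 102], 'loan_amount': [10000, 3609]})

preview = demo.set_index('id')   # returns a new DataFrame; demo is not modified

print(list(demo.columns))        # 'id' is still a regular column in demo
print(list(preview.columns))     # in the returned frame, 'id' has moved to the index
```

To make the change stick, the result would have to be assigned back, for example `demo = demo.set_index('id')`.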


In [8]:
train['loan_status'].describe()

count    67463.000000
mean         0.092510
std          0.289747
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: loan_status, dtype: float64

In [9]:
# checking for imbalanced data
train['loan_status'].value_counts()

0    61222
1     6241
Name: loan_status, dtype: int64

The data is highly imbalanced: negative observations outnumber positive ones by almost 10 to 1 (61,222 vs. 6,241, so about 9.25% of observations are positive).
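
The counts above also give a useful baseline for later modelling. The arithmetic below uses only the two numbers printed by `value_counts()`.

```python
# Class counts copied from the value_counts() output above
n_negative = 61222   # loan_status == 0
n_positive = 6241    # loan_status == 1
n_total = n_negative + n_positive

positive_rate = n_positive / n_total
imbalance_ratio = n_negative / n_positive
# Accuracy of a trivial model that always predicts the majority class (0)
majority_baseline_accuracy = n_negative / n_total

print(f"positive rate: {positive_rate:.4f}")
print(f"negative-to-positive ratio: {imbalance_ratio:.2f}")
print(f"majority-class accuracy: {majority_baseline_accuracy:.4f}")
```

Any model worth keeping has to beat the majority-class figure of roughly 0.907 on accuracy.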

In [10]:
train.isnull().sum()

id                              0
loan_amount                     0
funded_amount                   0
funded_amount_investor          0
term                            0
batch_enrolled                  0
interest_rate                   0
grade                           0
sub_grade                       0
employment_duration             0
home_ownership                  0
verification_status             0
payment_plan                    0
loan_title                      0
debitto_income                  0
delinquency-twoyears            0
inquires-sixmonths              0
open_account                    0
public_record                   0
revolving_balance               0
revolving_utilities             0
total_accounts                  0
initial_list_status             0
total_received_interest         0
total_received_late_fee         0
recoveries                      0
collection_recovery_fee         0
collection12months_medical      0
application_type                0
lastweek_pay  

There are no null values and no duplicates in the train dataset.

In [11]:
train.dtypes

id                                int64
loan_amount                       int64
funded_amount                     int64
funded_amount_investor          float64
term                              int64
batch_enrolled                   object
interest_rate                   float64
grade                            object
sub_grade                        object
employment_duration              object
home_ownership                  float64
verification_status              object
payment_plan                     object
loan_title                       object
debitto_income                  float64
delinquency-twoyears              int64
inquires-sixmonths                int64
open_account                      int64
public_record                     int64
revolving_balance                 int64
revolving_utilities             float64
total_accounts                    int64
initial_list_status              object
total_received_interest         float64
total_received_late_fee         float64


In [12]:
train['loan_title'] = train['loan_title'].astype(str)
train['loan_title'] = train['loan_title'].str.lower()

In [13]:
train['loan_title'].unique()

array(['debt consolidation', 'credit card refinancing',
       'home improvement', 'credit consolidation', 'green loan', 'other',
       'moving and relocation', 'credit cards', 'medical expenses',
       'refinance', 'credit card consolidation', 'lending club',
       'debt consolidation loan', 'major purchase', 'vacation',
       'business', 'credit card payoff', 'credit card',
       'credit card refi', 'personal loan', 'cc refi', 'consolidate',
       'medical', 'loan 1', 'consolidation', 'card consolidation',
       'car financing', 'debt', 'home buying', 'freedom', 'consolidated',
       'get out of debt', 'consolidation loan', 'dept consolidation',
       'personal', 'cards', 'bathroom', 'refi', 'credit card loan',
       'credit card debt', 'house', 'debt consolidation 2013',
       'debt loan', 'cc refinance', 'home', 'cc consolidation',
       'credit card refinance', 'credit loan', 'payoff',
       'bill consolidation', 'credit card paydown', 'credit card pay off',
       'g

In [14]:
# grouping similar categories into one main category in loan title
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('consolidated', 'debt consolidation'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('debt consolidation loan', 'debt consolidation'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('consolidate', 'debt consolidation'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('dept consolidation', 'debt consolidation'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('consolidation loan', 'debt consolidation'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('debt consolidation 2013', 'debt consolidation'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('debt loan', 'debt consolidation'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('loan consolidation', 'debt consolidation'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('debt consolidation loan', 'debt consolidation'))

train['loan_title'] = train['loan_title'].map(lambda x: x.replace('home loan', 'home'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('house', 'home'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('home buying', 'home'))

train['loan_title'] = train['loan_title'].map(lambda x: x.replace('credit cards', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('credit card refinancing', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('credit consolidation', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('credit card consolidation', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('credit card payoff', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('credit card refi', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('cc refi', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('credit card pay off', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('credit loan', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('credit card paydown', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('cc-refinance', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('credit payoff', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('credit card refinance loan', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('cc loan', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('cc', 'credit'))

In [15]:
# checking the unique values
train['loan_title'].unique()

array(['debt consolidation', 'credit', 'home improvement', 'green loan',
       'other', 'moving and relocation', 'medical expenses', 'refinance',
       'lending club', 'major purchase', 'vacation', 'business',
       'credit card', 'personal loan', 'medical', 'loan 1',
       'consolidation', 'card consolidation', 'car financing', 'debt',
       'home', 'freedom', 'get out of debt', 'personal', 'cards',
       'bathroom', 'refi', 'credit card loan', 'credit card debt',
       'creditnance', 'credit consolidation', 'payoff',
       'bill consolidation', 'get debt free', 'myloan', 'credit pay off',
       'my loan', 'loan', 'bill payoff', 'debt reduction', 'medical loan',
       'wedding loan', 'pay off bills', 'refinance loan', 'debt payoff',
       'car loan', 'pay off', 'pool', 'creditnance loan', 'debt free',
       'conso', 'home improvement loan', 'lending loan', 'relief',
       'loan1', 'getting ahead', 'bills'], dtype=object)

In [16]:
# grouping 2nd round
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('credit card', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('card consolidation', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('credit card loan', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('credit card debt', 'credit'))

train['loan_title'] = train['loan_title'].map(lambda x: x.replace('refinance loan', 'refi'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('refinance', 'refi'))

In [17]:
train['loan_title'].unique()

array(['debt consolidation', 'credit', 'home improvement', 'green loan',
       'other', 'moving and relocation', 'medical expenses', 'refi',
       'lending club', 'major purchase', 'vacation', 'business',
       'personal loan', 'medical', 'loan 1', 'consolidation',
       'car financing', 'debt', 'home', 'freedom', 'get out of debt',
       'personal', 'cards', 'bathroom', 'credit loan', 'credit debt',
       'creditnance', 'credit consolidation', 'payoff',
       'bill consolidation', 'get debt free', 'myloan', 'credit pay off',
       'my loan', 'loan', 'bill payoff', 'debt reduction', 'medical loan',
       'wedding loan', 'pay off bills', 'debt payoff', 'car loan',
       'pay off', 'pool', 'creditnance loan', 'debt free', 'conso',
       'home improvement loan', 'lending loan', 'relief', 'loan1',
       'getting ahead', 'bills'], dtype=object)

In [18]:
# grouping 3rd round
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('cards', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('credit loan', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('credit debt', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('creditnance loan', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('credit pay off', 'credit'))

train['loan_title'] = train['loan_title'].map(lambda x: x.replace('get out of debt', 'debt'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('debt reduction', 'debt'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('debt payoff', 'debt'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('debt free', 'debt'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('pay off', 'debt'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('payoff bills', 'debt'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('bill payoff', 'debt'))

In [19]:
train['loan_title'].unique()

array(['debt consolidation', 'credit', 'home improvement', 'green loan',
       'other', 'moving and relocation', 'medical expenses', 'refi',
       'lending club', 'major purchase', 'vacation', 'business',
       'personal loan', 'medical', 'loan 1', 'consolidation',
       'car financing', 'debt', 'home', 'freedom', 'personal', 'bathroom',
       'creditnance', 'credit consolidation', 'payoff',
       'bill consolidation', 'get debt', 'myloan', 'my loan', 'loan',
       'medical loan', 'wedding loan', 'debt bills', 'car loan', 'pool',
       'conso', 'home improvement loan', 'lending loan', 'relief',
       'loan1', 'getting ahead', 'bills'], dtype=object)

In [20]:
# grouping 4th round
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('pool', 'home improvement'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('bathroom', 'home improvement'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('home improvement loan', 'home improvement'))

train['loan_title'] = train['loan_title'].map(lambda x: x.replace('credit consolidation', 'credit'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('creditnance', 'credit'))

train['loan_title'] = train['loan_title'].map(lambda x: x.replace('debt bills', 'debt'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('debt consolitation', 'debt'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('bill consolitation', 'debt'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('get debt', 'debt'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('consolidation', 'debt'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('bills', 'debt'))

train['loan_title'] = train['loan_title'].map(lambda x: x.replace('loan 1', 'loan'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('loan1', 'loan'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('getting ahead', 'loan'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('lending loan', 'loan'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('myloan', 'loan'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('my loan', 'loan'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('personal loan', 'loan'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('personal', 'loan'))

In [21]:
train['loan_title'].unique()

array(['debt debt', 'credit', 'home improvement', 'green loan', 'other',
       'moving and relocation', 'medical expenses', 'refi',
       'lending club', 'major purchase', 'vacation', 'business', 'loan',
       'medical', 'debt', 'car financing', 'home', 'freedom', 'payoff',
       'bill debt', 'medical loan', 'wedding loan', 'car loan', 'conso',
       'relief'], dtype=object)

In [22]:
# grouping 5th round
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('conso', 'debt'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('bill debt', 'debt'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('payoff', 'debt'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('debt debt', 'debt'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('freedom', 'debt'))

train['loan_title'] = train['loan_title'].map(lambda x: x.replace('medical expenses', 'medical'))
train['loan_title'] = train['loan_title'].map(lambda x: x.replace('medical loan', 'medical'))

train['loan_title'] = train['loan_title'].map(lambda x: x.replace('car financing', 'car loan'))

In [23]:
train['loan_title'].unique()

array(['debt', 'credit', 'home improvement', 'green loan', 'other',
       'moving and relocation', 'medical', 'refi', 'lending club',
       'major purchase', 'vacation', 'business', 'loan', 'car loan',
       'home', 'wedding loan', 'relief'], dtype=object)
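
The chained `str.replace` calls above match substrings, which is how intermediate values such as 'creditnance' and 'debt debt' appeared and needed extra rounds to clean up. One alternative is an exact-match lookup table. This is only a sketch and not the approach used in this notebook, and the mapping below is a hypothetical, partial example rather than the full grouping.

```python
import pandas as pd

# Hypothetical, partial mapping from cleaned title to target category
title_map = {
    'debt consolidation loan': 'debt',
    'debt consolidation': 'debt',
    'consolidate': 'debt',
    'cc refinance': 'credit',
    'cc refi': 'credit',
    'credit card refinancing': 'credit',
    'house': 'home',
}

# Made-up titles for illustration
titles = pd.Series(['Debt Consolidation', 'cc refinance', 'CC Refi', 'House', 'vacation'])

# Normalise case and whitespace, then look up the whole string.
# Titles without an entry in the mapping are left unchanged.
normalised = titles.str.lower().str.strip()
grouped = normalised.map(lambda title: title_map.get(title, title))

print(grouped.tolist())
```

Because the lookup is on the whole value, 'cc refinance' cannot be partially rewritten by the 'cc refi' entry, and the order of the entries does not matter.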

In [24]:
# replace hyphens in the column names with underscores
train.columns = train.columns.str.replace('-', '_')
# payment_plan and accounts_delinquent each contain only one value, so they are useless as features
train.drop(columns = ['payment_plan', 'accounts_delinquent'], inplace=True)
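
Single-valued columns can also be found programmatically rather than by inspection. A minimal sketch on a made-up frame (the values are illustrative only):

```python
import pandas as pd

# Made-up frame for illustration
demo = pd.DataFrame({
    'payment_plan': ['n', 'n', 'n'],
    'accounts_delinquent': [0, 0, 0],
    'grade': ['B', 'C', 'F'],
})

# nunique() counts distinct values per column; a count of 1 means the column is constant
constant_cols = demo.columns[demo.nunique() == 1].tolist()
demo = demo.drop(columns=constant_cols)

print(constant_cols)
print(list(demo.columns))
```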

In [25]:
# split into numeric and categorical dataframes
train_num = train.select_dtypes(include='number')
# .copy() so that adding a column later does not trigger a SettingWithCopyWarning
train_cat = train.select_dtypes(include=['object', 'category']).copy()

train_cat

Unnamed: 0,batch_enrolled,grade,sub_grade,employment_duration,verification_status,loan_title,initial_list_status,application_type
0,BAT2522922,B,C4,MORTGAGE,Not Verified,debt,w,INDIVIDUAL
1,BAT1586599,C,D3,RENT,Source Verified,debt,f,INDIVIDUAL
2,BAT2136391,F,D4,MORTGAGE,Source Verified,debt,w,INDIVIDUAL
3,BAT2428731,C,C3,MORTGAGE,Source Verified,debt,w,INDIVIDUAL
4,BAT5341619,C,D4,MORTGAGE,Source Verified,credit,w,INDIVIDUAL
...,...,...,...,...,...,...,...,...
67458,BAT3193689,C,A4,MORTGAGE,Verified,credit,w,INDIVIDUAL
67459,BAT1780517,C,B3,RENT,Source Verified,credit,w,INDIVIDUAL
67460,BAT1761981,A,F3,MORTGAGE,Verified,loan,w,INDIVIDUAL
67461,BAT2333412,D,E3,OWN,Not Verified,debt,f,INDIVIDUAL


In [26]:
train_cat['loan_status'] = train['loan_status']

In [27]:
print(train_num.shape)
print(train_cat.shape)

(67463, 25)
(67463, 9)


In [28]:
# loop through all the columns for categorical features
for column in train_cat.columns:
    # get all the unique values in the column
    unique_values = train_cat[column].unique()
    
    # print the number of unique values in each column
    print(f"Number of unique values in {column}: {len(unique_values)}")

Number of unique values in batch_enrolled: 41
Number of unique values in grade: 7
Number of unique values in sub_grade: 35
Number of unique values in employment_duration: 3
Number of unique values in verification_status: 3
Number of unique values in loan_title: 17
Number of unique values in initial_list_status: 2
Number of unique values in application_type: 2
Number of unique values in loan_status: 2


In [29]:
# saving clean train data to csv file
train.to_csv('train_clean.csv', index=False)

In [30]:
# saving continuous and categorical features in separate datasets in case it becomes useful later on
train_cat.to_csv('train_cat.csv', index=False)
train_num.to_csv('train_num.csv', index=False)