# Home Credit Default Risk

#### In this homework, we first prepare the data by cleaning and merging the different datasets of this project. We then fill the missing values and encode the non-numerical columns.

#### Lastly, we build two models: one based on logistic regression, the other on LightGBM.

In [36]:
# File system management
import os
import warnings

import numpy as np
import pandas as pd
import scipy.stats as ss

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# sklearn preprocessing for dealing with categorical variables and scaling
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer (available since 0.20) is aliased to keep the name used below
from sklearn.impute import SimpleImputer as Imputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

import lightgbm
from lightgbm import LGBMClassifier

# Suppress warnings
warnings.filterwarnings("ignore")

#### We first read all the datasets

In [37]:
train = pd.read_csv("application_train.csv")
test = pd.read_csv("application_test.csv")
finaltest = test.copy()

# We will also use the other files to make the prediction more accurate

bureau = pd.read_csv('bureau.csv')
credit_card_balance = pd.read_csv('credit_card_balance.csv')
previous_application = pd.read_csv('previous_application.csv')
installments_payments = pd.read_csv('installments_payments.csv')
POS_CASH_balance = pd.read_csv('POS_CASH_balance.csv')
bureau_balance = pd.read_csv('bureau_balance.csv')

In [33]:
train.head()

Unnamed: 0,NAME_CONTRACT_TYPE,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT_x,AMT_ANNUITY_x,AMT_GOODS_PRICE_x,REGION_POPULATION_RELATIVE,DAYS_BIRTH,...,HOUSETYPE_MODE_terraced house,WALLSMATERIAL_MODE_Block,WALLSMATERIAL_MODE_Mixed,WALLSMATERIAL_MODE_Monolithic,WALLSMATERIAL_MODE_Others,WALLSMATERIAL_MODE_Panel,"WALLSMATERIAL_MODE_Stone, brick",WALLSMATERIAL_MODE_Wooden,EMERGENCYSTATE_MODE_No,EMERGENCYSTATE_MODE_Yes
0,-0.324395,-0.717914,0.664531,-0.577538,0.142129,-0.478095,-0.166143,-0.507236,-0.149452,1.50688,...,-0.062904,-0.176135,-0.086733,-0.076281,-0.072886,-0.522963,1.935056,-0.133215,0.963763,-0.08734
1,-0.324395,-0.717914,-1.50482,-0.577538,0.426792,1.72545,0.592683,1.600873,-1.25275,-0.166821,...,-0.062904,5.677469,-0.086733,-0.076281,-0.072886,-0.522963,-0.516781,-0.133215,0.963763,-0.08734
2,3.082659,1.392925,0.664531,-0.577538,-0.427196,-1.152888,-1.404669,-1.092145,-0.783451,-0.689509,...,-0.062904,-0.176135,-0.086733,-0.076281,-0.072886,-0.522963,-0.516781,-0.133215,-1.037599,-0.08734
3,-0.324395,-0.717914,0.664531,-0.577538,-0.142533,-0.71143,0.177874,-0.653463,-0.928991,-0.680114,...,-0.062904,-0.176135,-0.086733,-0.076281,-0.072886,-0.522963,-0.516781,-0.133215,-1.037599,-0.08734
4,-0.324395,-0.717914,0.664531,-0.577538,-0.199466,-0.213734,-0.361749,-0.068554,0.56357,-0.892535,...,-0.062904,-0.176135,-0.086733,-0.076281,-0.072886,-0.522963,-0.516781,-0.133215,-1.037599,-0.08734


Some columns contain "NaN" values, which means that some values are missing. We decided to handle this issue separately for each of the files.

Once this is done, we gather all of them into a single dataset, which will be the training file.

We use two functions. The first, objectcol, returns the names of the columns that are not of a numerical type. The second, missing, returns the names of the columns in which more than 70% of the values are null. We drop those columns.

In [38]:
# We could have also used a list comprehension but the use of a loop makes the code more readable

def objectcol(data):
    final = []
    for i in data.columns:
        if data[i].dtype == 'object':
            final.append(i)
    return(final)

If we had filled the columns with more than 70% of missing values with their mean, most of their entries would have been imputed rather than observed, so these columns would have carried little relevant information.

In [39]:
def missing(data):
    miss = pd.DataFrame(data.isnull().mean() > 0.7,columns = ["GreaterThan70PerCent"])
    miss = miss[miss["GreaterThan70PerCent"] == True]
    return(list(miss.index))
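
As a quick sanity check, the two helpers can be exercised on a toy frame. The sketch below is illustrative only: `toy` and its columns are made up, the `threshold` parameter is an addition, and the helpers are restated in a compact form that should behave like the versions above.

```python
import numpy as np
import pandas as pd

def objectcol(data):
    # Names of the columns stored with the generic 'object' dtype
    return [col for col in data.columns if data[col].dtype == object]

def missing(data, threshold=0.7):
    # Names of the columns whose share of null values exceeds the threshold
    share = data.isnull().mean()
    return list(share[share > threshold].index)

# Hypothetical toy data: 'mostly_empty' is 80% null, 'label' is non-numerical
toy = pd.DataFrame({
    "amount": [1.0, 2.0, np.nan, 4.0, 5.0],
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "label": pd.Series(["a", "b", "a", "b", "a"], dtype=object),
})

object_columns = objectcol(toy)
sparse_columns = missing(toy)
cleaned = toy.drop(sparse_columns, axis=1)

print(object_columns, sparse_columns, list(cleaned.columns))
```

Only `mostly_empty` crosses the 70% threshold here, so it is the only column dropped, while `amount` (20% null) is kept for later imputation.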

In [40]:
train.drop(missing(train), axis = 1, inplace=True)
test.drop(missing(test), axis = 1, inplace=True)

bureau.drop(missing(bureau), axis = 1, inplace=True)
credit_card_balance.drop(missing(credit_card_balance), axis = 1, inplace=True)
previous_application.drop(missing(previous_application), axis = 1, inplace=True)
installments_payments.drop(missing(installments_payments), axis = 1, inplace=True)
POS_CASH_balance.drop(missing(POS_CASH_balance), axis = 1, inplace=True)
bureau_balance.drop(missing(bureau_balance), axis = 1, inplace=True)

In [41]:
# Installments_payments 

installments_payments = pd.get_dummies(installments_payments, columns=objectcol(installments_payments))

countinstallments = installments_payments[['SK_ID_CURR', 'SK_ID_PREV']].groupby('SK_ID_CURR').count()
installments_payments['SK_ID_PREV'] = installments_payments['SK_ID_CURR'].map(countinstallments['SK_ID_PREV'])
installments_payments = installments_payments.groupby('SK_ID_CURR').mean()


#POS_CASH_balance

POS_CASH_balance = pd.get_dummies(POS_CASH_balance, columns=objectcol(POS_CASH_balance))
countPOS = POS_CASH_balance[['SK_ID_CURR', 'SK_ID_PREV']].groupby('SK_ID_CURR').count()
POS_CASH_balance['SK_ID_PREV'] = POS_CASH_balance['SK_ID_CURR'].map(countPOS['SK_ID_PREV'])
POS_CASH_balance = POS_CASH_balance.groupby('SK_ID_CURR').mean()



# Credit_card_balance

credit_card_balance = pd.get_dummies(credit_card_balance, columns=objectcol(credit_card_balance))

countcredit = credit_card_balance[['SK_ID_CURR', 'SK_ID_PREV']].groupby('SK_ID_CURR').count()
credit_card_balance['SK_ID_PREV'] = credit_card_balance['SK_ID_CURR'].map(countcredit['SK_ID_PREV'])

credit_card_balance = credit_card_balance.groupby('SK_ID_CURR').mean()



# Previous_application

previous_application = pd.get_dummies(previous_application, columns=objectcol(previous_application))
countprevious = previous_application[['SK_ID_CURR', 'SK_ID_PREV']].groupby('SK_ID_CURR').count()
previous_application['SK_ID_PREV'] = previous_application['SK_ID_CURR'].map(countprevious['SK_ID_PREV'])
previous_application = previous_application.groupby('SK_ID_CURR').mean()



# Bureau and bureau_balance

bureau = pd.get_dummies(bureau, columns=objectcol(bureau))

bureau_balance = pd.get_dummies(bureau_balance, columns=objectcol(bureau_balance))
bureau_balance = bureau_balance.groupby('SK_ID_BUREAU').mean()

# We can now merge the bureau dataset and the bureau_balance one
mergedbureau = bureau.merge(right=bureau_balance.reset_index(), how='left', on='SK_ID_BUREAU')

countbureau = mergedbureau[['SK_ID_CURR', 'SK_ID_BUREAU']].groupby('SK_ID_CURR').count()

mergedbureau['SK_ID_BUREAU'] = mergedbureau['SK_ID_CURR'].map(countbureau['SK_ID_BUREAU'])

mergedbureau = mergedbureau.groupby('SK_ID_CURR').mean()
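
The same pattern is applied to each auxiliary table: count the rows recorded for each applicant, store that count in place of the secondary identifier, then average every column per applicant. A minimal sketch on made-up data (the `payments` frame and its values are hypothetical; only the column names are borrowed from the dataset):

```python
import pandas as pd

payments = pd.DataFrame({
    "SK_ID_CURR": [100, 100, 100, 200],
    "SK_ID_PREV": [1, 1, 2, 3],
    "AMT_PAYMENT": [10.0, 20.0, 30.0, 40.0],
})

# Number of rows recorded for each applicant
counts = payments.groupby("SK_ID_CURR")["SK_ID_PREV"].count()

# Overwrite the SK_ID_PREV identifier with that count
payments["SK_ID_PREV"] = payments["SK_ID_CURR"].map(counts)

# One row per applicant: every remaining column is averaged
aggregated = payments.groupby("SK_ID_CURR").mean()
print(aggregated)
```

Since the count is constant within each applicant, averaging leaves it unchanged, so `SK_ID_PREV` ends up holding the number of records per applicant.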

#### Now that we have grouped each file (except the training and test sets) by SK_ID_CURR, we can join them all before creating the model and fitting it to our training set.

In [42]:
train = train.merge(mergedbureau, on = 'SK_ID_CURR', how = 'left')                
train = train.merge(credit_card_balance, on = 'SK_ID_CURR', how = 'left')    
train = train.merge(previous_application, on = 'SK_ID_CURR', how = 'left')   
train = train.merge(installments_payments, on = 'SK_ID_CURR', how = 'left') 
train = train.merge(POS_CASH_balance, on = 'SK_ID_CURR', how = 'left') 

In [43]:
test = test.merge(mergedbureau, on = 'SK_ID_CURR', how = 'left')                
test = test.merge(credit_card_balance, on = 'SK_ID_CURR', how = 'left')    
test = test.merge(previous_application, on = 'SK_ID_CURR', how = 'left')   
test = test.merge(installments_payments, on = 'SK_ID_CURR', how = 'left')
test = test.merge(POS_CASH_balance, on = 'SK_ID_CURR', how = 'left') 
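
A left merge keeps every application, including those with no record in the auxiliary table, and fills the added columns with NaN for them. This is one reason the merged sets contain many missing values. A small sketch with hypothetical frames (`applications` and `history` are made up for illustration):

```python
import pandas as pd

applications = pd.DataFrame({
    "SK_ID_CURR": [100, 200, 300],
    "AMT_CREDIT": [1.0, 2.0, 3.0],
})

# Aggregated table indexed by SK_ID_CURR, as produced by groupby(...).mean()
history = pd.DataFrame(
    {"AMT_BALANCE": [50.0, 70.0]},
    index=pd.Index([100, 300], name="SK_ID_CURR"),
)

# Applicant 200 has no history, so its AMT_BALANCE becomes NaN
merged = applications.merge(history, on="SK_ID_CURR", how="left")

# Total number of missing cells in the merged frame
n_missing = int(merged.isna().sum().sum())
print(merged)
print(n_missing)
```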

In [44]:
print("The number of missing values for the training and the test sets are : " + str(sum(list(train.isna().sum())))+ "," + str(sum(list(test.isna().sum()))))

The number of missing values for the training and the test sets are : 22207220,2826727


In [45]:
le = LabelEncoder()

for col in train.columns:
    try:
        if train[col].dtype == object and len(train[col].unique()) <= 2:
            print(col)
            le.fit(train[col])
            train[col] = le.transform(train[col])
            test[col] = le.transform(test[col])
    except:
        pass

NAME_CONTRACT_TYPE
FLAG_OWN_CAR
FLAG_OWN_REALTY


In [46]:
traincopy = train.copy()

In [47]:
train = pd.get_dummies(train)

In [48]:
test = pd.get_dummies(test)

We now want to fill the missing values.

In [49]:
train.isna().sum()

SK_ID_CURR                                       0
TARGET                                           0
NAME_CONTRACT_TYPE                               0
FLAG_OWN_CAR                                     0
FLAG_OWN_REALTY                                  0
CNT_CHILDREN                                     0
AMT_INCOME_TOTAL                                 0
AMT_CREDIT_x                                     0
AMT_ANNUITY_x                                   12
AMT_GOODS_PRICE_x                              278
REGION_POPULATION_RELATIVE                       0
DAYS_BIRTH                                       0
DAYS_EMPLOYED                                    0
DAYS_REGISTRATION                                0
DAYS_ID_PUBLISH                                  0
OWN_CAR_AGE                                 202929
FLAG_MOBIL                                       0
FLAG_EMP_PHONE                                   0
FLAG_WORK_PHONE                                  0
FLAG_CONT_MOBILE               

We also need the same columns in the training and test datasets. That is why we save TARGET separately and then drop every column that appears in the training set but not in the test set (in particular, the TARGET one):

In [50]:
target = train['TARGET']
for col in set(train.columns).difference(set(test.columns)):
    train.drop(col, axis=1, inplace=True)
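
One possible alternative for this alignment step is `DataFrame.align`, which keeps only the columns shared by both frames and gives them the same order. The sketch below uses hypothetical toy frames and is not the code that produced the results in this notebook.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "TARGET": [0, 1],
    "a": [1, 2],
    "b_x": [1, 0],
    "b_y": [0, 1],   # dummy column that only exists in the training data
})
toy_test = pd.DataFrame({"b_x": [1, 1], "a": [3, 4]})

# Save the label before it is dropped by the alignment
toy_target = toy_train["TARGET"]

# Keep the intersection of the columns, identically ordered in both frames
toy_train, toy_test = toy_train.align(toy_test, join="inner", axis=1)

print(list(toy_train.columns), list(toy_test.columns))
```

Unlike a one-sided set difference, this also removes columns that exist only in the test frame.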

We can now fill the remaining columns, which don't have a disproportionate number of missing values, with the median of their values.

In [51]:
imputer = Imputer(strategy = "median")
imputer.fit(train)
train.loc[:] = imputer.transform(train)
imputer = Imputer(strategy = "median")
imputer.fit(test)
test.loc[:] = imputer.transform(test)

We can now standardize the train and test datasets with StandardScaler, so that each column has zero mean and unit variance.

In [52]:
scaler = StandardScaler()
scaler.fit(train)
train.loc[:] = scaler.transform(train)

scaler = StandardScaler()
scaler.fit(test)
test.loc[:] = scaler.transform(test)
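
The cells above fit a separate imputer and scaler on the test set. A common alternative, sketched here on made-up arrays and not what produced the scores reported in this notebook, is to learn the medians, means and standard deviations on the training data only and reuse them on the test data, so that both sets go through exactly the same transformation. The sketch assumes a scikit-learn version that provides `sklearn.impute.SimpleImputer` (0.20 or later).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with a few missing entries
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 50.0]])
X_test = np.array([[np.nan, 20.0], [4.0, np.nan]])

# Median imputation followed by standardization
prep = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Statistics are learned on the training data only...
X_train_prepared = prep.fit_transform(X_train)
# ...and reused unchanged on the test data
X_test_prepared = prep.transform(X_test)

print(X_train_prepared)
print(X_test_prepared)
```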

In [18]:
train.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT_x,AMT_ANNUITY_x,AMT_GOODS_PRICE_x,REGION_POPULATION_RELATIVE,...,HOUSETYPE_MODE_terraced house,WALLSMATERIAL_MODE_Block,WALLSMATERIAL_MODE_Mixed,WALLSMATERIAL_MODE_Monolithic,WALLSMATERIAL_MODE_Others,WALLSMATERIAL_MODE_Panel,"WALLSMATERIAL_MODE_Stone, brick",WALLSMATERIAL_MODE_Wooden,EMERGENCYSTATE_MODE_No,EMERGENCYSTATE_MODE_Yes
0,-1.733423,-0.324395,-0.717914,0.664531,-0.577538,0.142129,-0.478095,-0.166143,-0.507236,-0.149452,...,-0.062904,-0.176135,-0.086733,-0.076281,-0.072886,-0.522963,1.935056,-0.133215,0.963763,-0.08734
1,-1.733413,-0.324395,-0.717914,-1.50482,-0.577538,0.426792,1.72545,0.592683,1.600873,-1.25275,...,-0.062904,5.677469,-0.086733,-0.076281,-0.072886,-0.522963,-0.516781,-0.133215,0.963763,-0.08734
2,-1.733403,3.082659,1.392925,0.664531,-0.577538,-0.427196,-1.152888,-1.404669,-1.092145,-0.783451,...,-0.062904,-0.176135,-0.086733,-0.076281,-0.072886,-0.522963,-0.516781,-0.133215,-1.037599,-0.08734
3,-1.733384,-0.324395,-0.717914,0.664531,-0.577538,-0.142533,-0.71143,0.177874,-0.653463,-0.928991,...,-0.062904,-0.176135,-0.086733,-0.076281,-0.072886,-0.522963,-0.516781,-0.133215,-1.037599,-0.08734
4,-1.733374,-0.324395,-0.717914,0.664531,-0.577538,-0.199466,-0.213734,-0.361749,-0.068554,0.56357,...,-0.062904,-0.176135,-0.086733,-0.076281,-0.072886,-0.522963,-0.516781,-0.133215,-1.037599,-0.08734


In [53]:
print("The number of missing values for the training and the test sets are : " + str(sum(list(train.isna().sum())))+ "," + str(sum(list(test.isna().sum()))))

The number of missing values for the training and the test sets are : 0,0


In [54]:
print(test.shape,train.shape)

(48744, 494) (307511, 494)



We drop the SK_ID_CURR column from the training and test sets, since it is an identifier rather than a feature.

In [55]:
if 'SK_ID_CURR' in train.columns:
    train.drop('SK_ID_CURR', axis=1, inplace=True)
if 'SK_ID_CURR' in test.columns:
    test.drop('SK_ID_CURR', axis=1, inplace=True)

**Now that there are no missing values left, we can fit the logistic regression.**

In [22]:
lr = LogisticRegression(C=1)
lr.fit(train, target)

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

The logistic model is now trained. We can use the predict_proba method of sklearn to build the prediction file we need.

In [23]:
Pt = lr.predict_proba(train)[:,1]

print("The AUC score is",roc_auc_score(target, Pt))

The AUC score is 0.7721209206755005


In [24]:
P = lr.predict_proba(test)[:,1]

pred = pd.DataFrame()
pred['SK_ID_CURR'] = pd.read_csv("application_test.csv")['SK_ID_CURR']
pred['TARGET'] = P
pred.to_csv('predictionlr.csv', index = False)

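We get an AUC of about 0.772, so the model distinguishes the two classes reasonably well: given a randomly chosen loan with payment difficulties (TARGET = 1) and a randomly chosen repaid loan (TARGET = 0), it assigns the higher predicted probability to the former about 77% of the time. This score is computed on the training data itself, so it is an optimistic estimate.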
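
One way to read an AUC value is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below checks this reading on a handful of made-up labels and scores by comparing a brute-force pairwise count with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every (positive, negative) pair; ties count for one half
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

sklearn_auc = roc_auc_score(y_true, y_score)
print(pairwise_auc, sklearn_auc)
```

Here 6 of the 9 pairs are ranked correctly, so both computations give 2/3.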

## LightGBM

To improve the score of our prediction, we also used LightGBM. We train the model until the validation score stops improving.

We first split the train set into a training part and a validation part, which early stopping needs in order to monitor the validation score. Please note that we chose the hyperparameters of this method based on values suggested on different websites and forums.

In [56]:
train_x, valid_x, train_y, valid_y = train_test_split(train, target, test_size=0.2, shuffle=True)

traingbm=lightgbm.Dataset(train_x,label=train_y)
validgbm=lightgbm.Dataset(valid_x,label=valid_y)
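
Because defaults are a minority class, it can help to keep the class ratio identical in both parts of the split and to fix the random seed so that the validation score is reproducible. The following sketch uses synthetic data. The `stratify` and `random_state` arguments are not used in the cell above, and the 10% positive rate is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced label (100 positives out of 1000)
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 100 + [0] * 900)

# stratify=y preserves the 10% positive rate in both parts;
# random_state makes the split repeatable
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=42
)

print(y_tr.mean(), y_va.mean())
```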

We can now train the LightGBM model with these two datasets.

In [57]:
lgbm = lightgbm.train({'boosting_type': 'gbdt',
          'max_depth' : 10,
          'reg_alpha': 5,
          'reg_lambda': 10,
          'min_split_gain': 0.5,
          'objective': 'binary',
          'nthread': 5,
          'num_leaves': 64,
          'learning_rate': 0.05,
          'max_bin': 512,
          'min_child_weight': 1,
          'min_child_samples': 5,
          'scale_pos_weight': 1,
          'num_class' : 1,
          'subsample_for_bin': 200,
          'subsample': 1,
          'subsample_freq': 1,
          'colsample_bytree': 0.8,
          'metric' : 'auc'},
                 traingbm,
                 2500,
                 valid_sets=validgbm,
                 early_stopping_rounds= 40,
                 verbose_eval= 10
                 )


 
# We can now predict on the test set
probslgbm = lgbm.predict(test)

Training until validation scores don't improve for 40 rounds.
[10]	valid_0's auc: 0.739341
[20]	valid_0's auc: 0.746867
[30]	valid_0's auc: 0.751659
[40]	valid_0's auc: 0.756253
[50]	valid_0's auc: 0.760194
[60]	valid_0's auc: 0.763261
[70]	valid_0's auc: 0.766316
[80]	valid_0's auc: 0.768489
[90]	valid_0's auc: 0.77055
[100]	valid_0's auc: 0.771897
[110]	valid_0's auc: 0.77329
[120]	valid_0's auc: 0.774217
[130]	valid_0's auc: 0.775297
[140]	valid_0's auc: 0.776069
[150]	valid_0's auc: 0.776726
[160]	valid_0's auc: 0.777752
[170]	valid_0's auc: 0.778234
[180]	valid_0's auc: 0.778645
[190]	valid_0's auc: 0.779278
[200]	valid_0's auc: 0.779624
[210]	valid_0's auc: 0.78001
[220]	valid_0's auc: 0.780273
[230]	valid_0's auc: 0.780602
[240]	valid_0's auc: 0.780923
[250]	valid_0's auc: 0.781079
[260]	valid_0's auc: 0.781297
[270]	valid_0's auc: 0.781605
[280]	valid_0's auc: 0.781722
[290]	valid_0's auc: 0.781866
[300]	valid_0's auc: 0.78197
[310]	valid_0's auc: 0.782077
[320]	valid_0's auc: 

In [27]:
score = roc_auc_score(target, lgbm.predict(train))
score

0.8773793487119629

In [59]:
preds = pd.DataFrame()
preds['SK_ID_CURR'] = pd.read_csv("application_test.csv")['SK_ID_CURR']
preds['TARGET'] = probslgbm

Now that we have built the submission dataset, with the SK_ID_CURR column and the probabilities predicted by LightGBM, we can create the submission file (in CSV format).

In [60]:
preds.to_csv('submissionlgbm.csv', index = False)

# Conclusion

In conclusion, the more carefully we "clean" our data, the better the score of the model tends to be. Here, for instance, we decided to get rid of the columns that contain too many missing values. Keeping them would likely have made it harder for our model to predict the target properly.