# Simply Rational - ASSIGNMENT 1: Comparative Model Analysis
### Notebook created by: Jiacheng Yao, 09/02/2021



-------------
ASSIGNMENT 1: Comparative Model Analysis

Introduction
Attached to this email you will find two scientific papers describing two types of “simple algorithms”: Fast-and-Frugal Trees (Phillips et al., 2017) and Select-Regress-and-Round (Jung et al., 2020). Please include at least one of these models in your comparative model analysis.


The scenario
Many people struggle to get loans due to insufficient or non-existent credit histories. Unfortunately, this population is often taken advantage of by untrustworthy lenders and credit sharks. An organization strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. To make sure this underserved population has a positive loan experience, the organization makes use of a variety of alternative data—including telco and transactional information—to predict their clients' repayment abilities.

In addition, the organization currently operates in a country in which great economic and societal changes are taking place. Factors predictive of successful loan repayment or of loan default today may no longer be predictive a year from now. In order to be able to continue smooth operations during this time of upheaval, the organization wishes to also assess the effectiveness of simple and transparent algorithms. This would allow the employees of the organization to effectively evaluate the predictions made by the new system and integrate their knowledge of these societal and economic changes into the decision-making process.




Assessment of the task
Using ten-fold cross-validation, determine the predictive accuracy (balanced accuracy, BACC) with regard to TARGET in the main data set “application.csv”. Please use at least one of the simple models and compare it to other models of your choice.
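
For reference, balanced accuracy for a binary target is the mean of the true-positive rate and the true-negative rate, so a model that always predicts the majority class scores 0.5 however imbalanced the classes are. The following is a minimal sketch with made-up labels; the function name `balanced_accuracy` is illustrative and not part of the assignment. scikit-learn provides the same metric as `sklearn.metrics.balanced_accuracy_score`, and ten stratified folds can be generated with `sklearn.model_selection.StratifiedKFold(n_splits=10)`.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of true-positive rate and true-negative rate for binary 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    positives = y_true == 1
    negatives = y_true == 0
    sensitivity = np.mean(y_pred[positives] == 1)  # share of defaults that were flagged
    specificity = np.mean(y_pred[negatives] == 0)  # share of repaid loans that were recognised
    return (sensitivity + specificity) / 2

# Toy labels (made up): 2 of 10 loans default and the "model" predicts 0 for everyone.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.zeros(10, dtype=int)

plain_accuracy = float(np.mean(y_true == y_pred))  # looks good only because of the imbalance
bacc = float(balanced_accuracy(y_true, y_pred))    # exposes that no default was detected
print(plain_accuracy, bacc)
```

In the toy case plain accuracy is 0.8 while BACC is 0.5, which is why BACC is the more informative metric for an imbalanced target.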

Your performance will be evaluated along two metrics:

1. Predictive accuracy of your best model.
2. A discussion regarding simplicity and interpretability of the simple models vs black box models and under which circumstances you would recommend the use of the simple models.

Please prepare the code as well as slides for a presentation that showcase your approach, the results and recommendations.

Data Description

1.  application.csv

    This is the main table.
    Static data for all applications. One row represents one loan in our data sample.

2.  bureau.csv

    All previous credits provided by other financial institutions that were reported to Credit Bureau (for clients who have a loan in our sample).
    For every loan in our sample, there are as many rows as number of credits the client had in Credit Bureau before the application date.
    There are more IDs in this file than in the “application.csv”-file. Please ignore those IDs that are not included in “application.csv”.

3.  columns_description.csv

    This file contains descriptions for the columns in the various data files.
    
## 1. Explorative Analysis

In [1]:
import logging

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')

logger = logging.getLogger()
logger.setLevel(logging.INFO)
logging.info("Start 1. Explorative Analysis")

INFO:root:Start 1. Explorative Analysis


### 1. Read the input data

In [2]:
df_app = pd.read_csv("application.csv")

df_bureau = pd.read_csv("bureau.csv", sep = ",")

df_col_des = pd.read_csv("columns_description.csv", encoding="ISO-8859-1")

### 2. Take a first look at the data - Application:

In [3]:
df_app.head(10)

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,random_number
0,"100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,...",,,,,,,,,,...,,,,,,,,,,7
1,100003,0.0,Cash loans,F,N,N,0.0,270000.0,1293502.5,35698.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
2,100004,0.0,Revolving loans,M,Y,Y,0.0,67500.0,135000.0,6750.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
3,100006,0.0,Cash loans,F,N,Y,0.0,135000.0,312682.5,29686.5,...,0.0,0.0,0.0,,,,,,,8
4,100007,0.0,Cash loans,M,N,Y,0.0,121500.0,513000.0,21865.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3
5,"100008,0,Cash loans,M,N,Y,0,99000.0,490495.5,2...",,,,,,,,,,...,,,,,,,,,,8
6,100009,0.0,Cash loans,F,Y,Y,1.0,171000.0,1560726.0,41301.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,2.0,10
7,100010,0.0,Cash loans,M,Y,Y,0.0,360000.0,1530000.0,42075.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
8,100011,0.0,Cash loans,F,N,Y,0.0,112500.0,1019610.0,33826.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,10
9,100012,0.0,Revolving loans,M,N,Y,0.0,135000.0,405000.0,20250.0,...,0.0,0.0,0.0,,,,,,,6


**Comment**: _Some rows have been read incorrectly and need to be handled._

In [4]:
n_rows_old = df_app.shape[0]

# incorrectly read rows
df_app_p1 = df_app[df_app['TARGET'].isnull()]

# correctly read rows
df_app_p2 = df_app[df_app['TARGET'].notnull()]

In [5]:
df_app_p2.head(10)

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,random_number
1,100003,0.0,Cash loans,F,N,N,0.0,270000.0,1293502.5,35698.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
2,100004,0.0,Revolving loans,M,Y,Y,0.0,67500.0,135000.0,6750.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
3,100006,0.0,Cash loans,F,N,Y,0.0,135000.0,312682.5,29686.5,...,0.0,0.0,0.0,,,,,,,8
4,100007,0.0,Cash loans,M,N,Y,0.0,121500.0,513000.0,21865.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3
6,100009,0.0,Cash loans,F,Y,Y,1.0,171000.0,1560726.0,41301.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,2.0,10
7,100010,0.0,Cash loans,M,Y,Y,0.0,360000.0,1530000.0,42075.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
8,100011,0.0,Cash loans,F,N,Y,0.0,112500.0,1019610.0,33826.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,10
9,100012,0.0,Revolving loans,M,N,Y,0.0,135000.0,405000.0,20250.0,...,0.0,0.0,0.0,,,,,,,6
10,100014,0.0,Cash loans,F,N,Y,1.0,112500.0,652500.0,21177.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,4
11,100015,0.0,Cash loans,F,N,Y,0.0,38419.155,148365.0,10678.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,8


In [6]:
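# reformat incorrectly read rows: the whole record ended up as one string in the first column
# caution: a plain split on ',' also splits values that themselves contain a comma, which shifts the
# later columns, and the renaming below overwrites random_number (7 for row 0 in In [3], 1.0 in In [7])
df_app_p1 = df_app_p1.iloc[:, 0].str.split(',', expand=True)
df_app_p1 = df_app_p1[df_app_p1.columns[:-1]]  # drop the surplus last column produced by the split
df_app_p1.columns = df_app_p2.columns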
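
The sketch below reproduces the reading problem on a made-up three-column miniature and shows one possible, more defensive repair: the mis-read string is parsed with the standard-library `csv` module, so that commas inside quoted values stay in their field, and the last column, which was read correctly, is kept. The toy file layout, the helper `reparse`, and the assumption that embedded commas are protected by inner quotes are all hypothetical; whether the real `application.csv` is quoted this way is not known from the notebook.

```python
import csv
import io

import pandas as pd

# Made-up miniature: the second record is wrapped in quotes except for its last column.
raw = (
    'SK_ID_CURR,TARGET,WALLSMATERIAL_MODE,random_number\n'
    '100003,0,Block,2\n'
    '"100002,1,""Stone, brick""",,,7\n'
)
toy = pd.read_csv(io.StringIO(raw))

bad_mask = toy['TARGET'].isnull()   # mis-read rows have no TARGET
data_cols = toy.columns[:-1]        # columns that ended up inside the quoted string

# A plain split also cuts "Stone, brick" in two and yields one field too many.
naive_width = toy.loc[bad_mask, 'SK_ID_CURR'].str.split(',', expand=True).shape[1]

def reparse(field):
    """Split one mis-read record with the csv module so quoted commas stay in their field."""
    return next(csv.reader([field]))

fixed = pd.DataFrame(
    toy.loc[bad_mask, 'SK_ID_CURR'].map(reparse).tolist(),
    columns=data_cols,
    index=toy.index[bad_mask],
)
fixed['random_number'] = toy.loc[bad_mask, 'random_number']  # keep the correctly parsed value

repaired = pd.concat([toy[~bad_mask], fixed]).sort_index()
print(naive_width)
print(repaired)
```

The repaired rows still hold strings, so a numeric conversion step is needed afterwards in either variant.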

In [7]:
df_app_p1.head(10)

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,random_number
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
5,100008,0,Cash loans,M,N,Y,0,99000.0,490495.5,27517.5,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,1.0
18,100022,0,Revolving loans,F,N,Y,0,112500.0,157500.0,7875.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
20,100024,0,Revolving loans,M,Y,Y,0,135000.0,427500.0,21375.0,...,0,0,0,0,,,,,,
24,100030,0,Cash loans,F,N,Y,0,90000.0,225000.0,11074.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
32,100040,0,Cash loans,F,N,Y,0,135000.0,1125000.0,32895.0,...,0,0,0,0,,,,,,
33,100041,0,Cash loans,F,N,N,0,112500.0,450000.0,44509.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
36,100045,0,Cash loans,F,N,Y,0,99000.0,247275.0,17338.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0
37,100047,1,Cash loans,M,N,Y,0,202500.0,1193580.0,35028.0,...,0,0,0,0,0.0,0.0,0.0,2.0,0.0,4.0
49,100060,0,Cash loans,M,Y,N,0,76500.0,454500.0,14661.0,...,0,0,0,0,0.0,0.0,0.0,1.0,0.0,0.0


In [8]:
# combine all the data into one dataframe
df_app = pd.concat([df_app_p1, df_app_p2])

In [9]:
# make sure the one dataframe has the same number of rows as the original dataframe
assert df_app.shape[0] == n_rows_old

In [10]:
df_app.head(10)

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,random_number
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
5,100008,0,Cash loans,M,N,Y,0,99000.0,490495.5,27517.5,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,1.0
18,100022,0,Revolving loans,F,N,Y,0,112500.0,157500.0,7875.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
20,100024,0,Revolving loans,M,Y,Y,0,135000.0,427500.0,21375.0,...,0,0,0,0,,,,,,
24,100030,0,Cash loans,F,N,Y,0,90000.0,225000.0,11074.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
32,100040,0,Cash loans,F,N,Y,0,135000.0,1125000.0,32895.0,...,0,0,0,0,,,,,,
33,100041,0,Cash loans,F,N,N,0,112500.0,450000.0,44509.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
36,100045,0,Cash loans,F,N,Y,0,99000.0,247275.0,17338.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0
37,100047,1,Cash loans,M,N,Y,0,202500.0,1193580.0,35028.0,...,0,0,0,0,0.0,0.0,0.0,2.0,0.0,4.0
49,100060,0,Cash loans,M,Y,N,0,76500.0,454500.0,14661.0,...,0,0,0,0,0.0,0.0,0.0,1.0,0.0,0.0


In [11]:
df_app.shape

(276686, 123)

### 3. Take a first look at the data - Bureau:

In [12]:
df_bureau.head(10)

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.5,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.0,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.0,,,0.0,Consumer credit,-21,
5,215354,5714467,Active,currency 1,-273,0,27460.0,,0.0,0,180000.0,71017.38,108982.62,0.0,Credit card,-31,
6,215354,5714468,Active,currency 1,-43,0,79.0,,0.0,0,42103.8,42103.8,0.0,0.0,Consumer credit,-22,
7,162297,5714469,Closed,currency 1,-1896,0,-1684.0,-1710.0,14985.0,0,76878.45,0.0,0.0,0.0,Consumer credit,-1710,
8,162297,5714470,Closed,currency 1,-1146,0,-811.0,-840.0,0.0,0,103007.7,0.0,0.0,0.0,Consumer credit,-840,
9,162297,5714471,Active,currency 1,-1146,0,-484.0,,0.0,0,4500.0,0.0,0.0,0.0,Credit card,-690,


In [13]:
df_bureau.shape

(1716428, 17)

### 4. Take a first look at the data - Column Description:

In [14]:
df_col_des.head(10)

Unnamed: 0.1,Unnamed: 0,Table,Row,Description,Special
0,1,application.csv,SK_ID_CURR,ID of loan in our sample,
1,"2,application.csv,TARGET,""Target variable (1 -...",,,,
2,5,application.csv,NAME_CONTRACT_TYPE,Identification if loan is cash or revolving,
3,6,application.csv,CODE_GENDER,Gender of the client,
4,7,application.csv,FLAG_OWN_CAR,Flag if the client owns a car,
5,8,application.csv,FLAG_OWN_REALTY,Flag if client owns a house or flat,
6,9,application.csv,CNT_CHILDREN,Number of children the client has,
7,10,application.csv,AMT_INCOME_TOTAL,Income of the client,
8,11,application.csv,AMT_CREDIT,Credit amount of the loan,
9,12,application.csv,AMT_ANNUITY,Loan annuity,


**Comment**: _As with the application data, some rows have been read incorrectly and need to be handled._

In [15]:
n_rows_old = df_col_des.shape[0]

# incorrectly read rows
df_col_des_p1 = df_col_des[df_col_des['Table'].isnull()]

# correctly read rows
df_col_des_p2 = df_col_des[df_col_des['Table'].notnull()]

df_col_des_p2.columns = ['Row_ID' if x==df_col_des.columns[0] else x for x in df_col_des_p2.columns]

In [16]:
df_col_des_p2.head(10)

Unnamed: 0,Row_ID,Table,Row,Description,Special
0,1,application.csv,SK_ID_CURR,ID of loan in our sample,
2,5,application.csv,NAME_CONTRACT_TYPE,Identification if loan is cash or revolving,
3,6,application.csv,CODE_GENDER,Gender of the client,
4,7,application.csv,FLAG_OWN_CAR,Flag if the client owns a car,
5,8,application.csv,FLAG_OWN_REALTY,Flag if client owns a house or flat,
6,9,application.csv,CNT_CHILDREN,Number of children the client has,
7,10,application.csv,AMT_INCOME_TOTAL,Income of the client,
8,11,application.csv,AMT_CREDIT,Credit amount of the loan,
9,12,application.csv,AMT_ANNUITY,Loan annuity,
10,13,application.csv,AMT_GOODS_PRICE,For consumer loans it is the price of the good...,


In [17]:
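# reformat incorrectly read rows: split into Row_ID, Table, Row and the remaining text (Description)
# n must be passed by keyword in pandas >= 2.0
df_col_des_p1 = df_col_des_p1.iloc[:, 0].str.split(',', n=3, expand=True)
df_col_des_p1['Special'] = np.nan
df_col_des_p1.columns = df_col_des_p2.columns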
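
Because the split stops after the third comma, the `Description` of a repaired row keeps its surrounding quotes and the trailing comma that preceded the empty `Special` field. A minimal sketch of an optional clean-up follows. The two sample strings are modelled on the descriptions shown in the notebook output, and the sketch assumes that `Special` is empty for these rows, as the notebook itself does when it sets the column to NaN.

```python
import pandas as pd

# Sample values modelled on repaired rows: quoted text followed by a trailing comma.
toy_desc = pd.Series(
    [
        '"Did client provide mobile phone (1=YES, 0=NO)",',
        '"Was mobile phone reachable (1=YES, 0=NO)",',
    ],
    name='Description',
)

# Remove the trailing comma first, then the quotes that wrapped the description.
cleaned = toy_desc.str.rstrip(',').str.strip('"')
print(cleaned.tolist())
```

Note that `str.strip('"')` would also remove a quotation mark that legitimately starts or ends a description, so the result is worth a spot check.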

In [18]:
df_col_des_p1.head(10)

Unnamed: 0,Row_ID,Table,Row,Description,Special
1,2,application.csv,TARGET,"""Target variable (1 - client with payment diff...",
12,15,application.csv,NAME_INCOME_TYPE,"""Clients income type (businessman, working, ma...",
15,18,application.csv,NAME_HOUSING_TYPE,"""What is the housing situation of the client (...",
22,25,application.csv,FLAG_MOBIL,"""Did client provide mobile phone (1=YES, 0=NO)"",",
23,26,application.csv,FLAG_EMP_PHONE,"""Did client provide work phone (1=YES, 0=NO)"",",
24,27,application.csv,FLAG_WORK_PHONE,"""Did client provide home phone (1=YES, 0=NO)"",",
25,28,application.csv,FLAG_CONT_MOBILE,"""Was mobile phone reachable (1=YES, 0=NO)"",",
26,29,application.csv,FLAG_PHONE,"""Did client provide home phone (1=YES, 0=NO)"",",
27,30,application.csv,FLAG_EMAIL,"""Did client provide email (1=YES, 0=NO)"",",
30,33,application.csv,REGION_RATING_CLIENT,"""Our rating of the region where client lives (...",


In [19]:
# combine all the data into one dataframe
df_col_des = pd.concat([df_col_des_p1, df_col_des_p2])
# Row_ID is a string in the repaired rows; convert it so that the sort below is numeric
df_col_des['Row_ID'] = pd.to_numeric(df_col_des['Row_ID'])
df_col_des.sort_values(by='Row_ID', ascending=True, inplace=True)

In [20]:
# make sure the one dataframe has the same number of rows as the original dataframe
assert df_col_des.shape[0] == n_rows_old

In [21]:
df_col_des.head(10)

Unnamed: 0,Row_ID,Table,Row,Description,Special
0,1,application.csv,SK_ID_CURR,ID of loan in our sample,
1,2,application.csv,TARGET,"""Target variable (1 - client with payment diff...",
2,5,application.csv,NAME_CONTRACT_TYPE,Identification if loan is cash or revolving,
3,6,application.csv,CODE_GENDER,Gender of the client,
4,7,application.csv,FLAG_OWN_CAR,Flag if the client owns a car,
5,8,application.csv,FLAG_OWN_REALTY,Flag if client owns a house or flat,
6,9,application.csv,CNT_CHILDREN,Number of children the client has,
7,10,application.csv,AMT_INCOME_TOTAL,Income of the client,
8,11,application.csv,AMT_CREDIT,Credit amount of the loan,
9,12,application.csv,AMT_ANNUITY,Loan annuity,


In [22]:
df_col_des.to_csv("columns_description(cleaned).csv", index = False)

### 5. Summarize the dataframe

In [29]:
df_app.describe([.1,.2,.3,.6,.7,.8,.9,.95,.98,.99,.999])

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,random_number
count,276686,276686.0,276686,276686,276686,276686,276686.0,276686.0,276686.0,276678.0,...,276686.0,276686.0,276686.0,247998.0,247998.0,247998.0,247998.0,247998.0,247998.0,276686
unique,276686,4.0,2,3,2,2,23.0,2833.0,8765.0,22502.0,...,4.0,4.0,4.0,7.0,14.0,17.0,31.0,29.0,40.0,30
top,339485,0.0,Cash loans,F,N,Y,0.0,135000.0,450000.0,9000.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
freq,1,192643.0,250263,182057,182593,191871,147155.0,24386.0,6601.0,4349.0,...,209914.0,209944.0,209965.0,180244.0,180355.0,175533.0,151871.0,146779.0,48345.0,23599


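**Comment**: _`describe()` returns only count/unique/top/freq instead of numeric statistics, which means the columns have object dtype: the repaired rows were produced by `str.split` and therefore contain strings. Preprocessing is needed._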

In [30]:
# One example
df_app['AMT_INCOME_TOTAL'][0]

'202500.0'

In [31]:
# convert to numeric where possible, else keep strings (explicit try/except replaces the deprecated errors='ignore')
for col in df_app.columns:
    try:
        df_app[col] = pd.to_numeric(df_app[col])
    except (ValueError, TypeError):
        pass
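
Why the conversion matters can be seen on a toy column. The sketch below uses made-up values and assumes, as in the repair step above, that some rows hold strings while the correctly read rows hold floats. Before conversion `'1'` and `1.0` count as different values, which would explain why `TARGET` shows four unique values in the first `describe()` output.

```python
import pandas as pd

# Repaired rows contribute strings, correctly read rows contribute floats.
toy_target = pd.concat(
    [pd.Series(['1', '0']), pd.Series([0.0, 1.0])],
    ignore_index=True,
)
n_unique_before = toy_target.nunique()   # '1', '0', 0.0 and 1.0 are four distinct objects

converted = pd.to_numeric(toy_target)    # raises ValueError if a value cannot be parsed
n_unique_after = converted.nunique()

print(toy_target.dtype, n_unique_before)
print(converted.dtype, n_unique_after)
```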

In [32]:
df_app.describe([.1,.2,.3,.6,.7,.8,.9,.95,.98,.99,.999])

Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,...,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,random_number
count,276686.0,276686.0,276686.0,276686.0,276686.0,276675.0,276427.0,276686.0,276686.0,276686.0,...,276686.0,276686.0,276686.0,247998.0,239635.0,239386.0,239386.0,239386.0,239386.0,268074.0
mean,278228.098939,0.080824,0.416013,168855.9,598720.1,27100.156769,538123.3,-15461.279088,61415.656513,-2840.702092,...,0.002356,0.000549,0.000419,0.004762,0.006685,0.027771,0.206077,0.267509,1.508969,5.096723
std,102800.236957,0.272566,0.719639,247193.5,402439.4,14490.114074,369373.5,5251.822942,139918.505204,27832.315504,...,0.048486,0.023432,0.020471,0.072662,0.103747,0.185793,0.793481,0.703983,1.873693,2.982774
min,100002.0,0.0,0.0,25650.0,45000.0,1615.5,40500.0,-25229.0,-25152.0,-23738.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10%,135688.5,0.0,0.0,81000.0,180000.0,11070.0,180000.0,-22125.0,-5934.0,-9878.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
20%,171375.0,0.0,0.0,99000.0,254700.0,14679.0,225000.0,-20374.0,-3605.0,-8149.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
30%,207168.0,0.0,0.0,112500.0,306000.0,18157.5,270000.0,-18734.0,-2566.0,-6657.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
50%,278256.5,0.0,0.0,148500.0,513000.0,24898.5,450000.0,-15511.0,-1296.0,-4378.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,5.0
60%,313896.0,0.0,0.0,162000.0,604152.0,28062.0,522000.0,-14147.0,-875.0,-3380.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,6.0
70%,349493.5,0.0,0.0,180000.0,755190.0,32017.5,675000.0,-12764.0,-494.0,-2394.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,7.0


In [33]:
df_bureau.describe([.1,.2,.3,.6,.7,.8,.9,.95,.98,.99,.999])

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
count,1716428.0,1716428.0,1716428.0,1716428.0,1610875.0,1082775.0,591940.0,1716428.0,1716415.0,1458759.0,1124648.0,1716428.0,1716428.0,489637.0
mean,278214.9,5924434.0,-1142.108,0.8181666,510.5174,-1017.437,3825.418,0.006410406,354994.6,137085.1,6229.515,37.91276,-593.7483,15712.76
std,102938.6,532265.7,795.1649,36.54443,4994.22,714.0106,206031.6,0.09622391,1149811.0,677401.1,45032.03,5937.65,720.7473,325826.9
min,100001.0,5000000.0,-2922.0,0.0,-42060.0,-42023.0,0.0,0.0,0.0,-4705600.0,-586406.1,0.0,-41947.0,0.0
10%,135602.0,5184875.0,-2443.0,0.0,-1922.0,-2159.0,0.0,0.0,22500.0,0.0,0.0,0.0,-1561.0,0.0
20%,171220.0,5370915.0,-1879.0,0.0,-1357.0,-1677.0,0.0,0.0,42762.6,0.0,0.0,0.0,-1039.0,0.0
30%,206727.0,5556600.0,-1501.0,0.0,-953.0,-1325.0,0.0,0.0,65119.5,0.0,0.0,0.0,-797.0,0.0
50%,278055.0,5926304.0,-987.0,0.0,-330.0,-897.0,0.0,0.0,125518.5,0.0,0.0,0.0,-395.0,0.0
60%,314000.0,6109587.0,-764.0,0.0,-59.0,-699.0,0.0,0.0,171585.0,0.0,0.0,0.0,-183.0,5125.5
70%,349549.0,6293764.0,-567.0,0.0,248.0,-511.0,0.0,0.0,239265.0,0.0,0.0,0.0,-51.0,10454.03


**Comment**: _The bureau data appears to have been read correctly._

## 2. Preprocessing

### 1. Focus on bureau data first and turn it into features for application

In [34]:
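def most_common(series):
    # most frequent non-null value of the group; NaN if the group has no non-null values
    counts = series.value_counts()
    return counts.index[0] if len(counts) > 0 else np.nan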

In [35]:
df_bureau_grouped = df_bureau.groupby(['SK_ID_CURR']).agg({'CREDIT_ACTIVE': most_common, 
                                                           'CREDIT_CURRENCY': most_common, 
                                                           'DAYS_CREDIT': 'median', 
                                                           'CREDIT_DAY_OVERDUE': 'median', 
                                                           'DAYS_CREDIT_ENDDATE': 'median', 
                                                           'DAYS_ENDDATE_FACT': 'median', 
                                                           'AMT_CREDIT_MAX_OVERDUE': 'median', 
                                                           'CNT_CREDIT_PROLONG': 'median', 
                                                           'AMT_CREDIT_SUM': 'median', 
                                                           'AMT_CREDIT_SUM_DEBT': 'median', 
                                                           'AMT_CREDIT_SUM_LIMIT': 'median', 
                                                           'AMT_CREDIT_SUM_OVERDUE': 'median',
                                                           'CREDIT_TYPE': most_common, 
                                                           'DAYS_CREDIT_UPDATE': 'median', 
                                                           'AMT_ANNUITY': 'median'}).reset_index()
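
To make the aggregation concrete, here is a small sketch on made-up bureau rows. It collapses several credits per client into one row using the mode for a categorical column and the median for a numeric one, as above. The extra `N_BUREAU_CREDITS` count is a hypothetical addition, not something the notebook computes; it only illustrates that the median alone hides how many credits a client had.

```python
import numpy as np
import pandas as pd

def most_common(series):
    """Most frequent non-null value, or NaN if there is none."""
    counts = series.value_counts()
    return counts.index[0] if len(counts) > 0 else np.nan

# Made-up data: client 1 has three credits, client 2 has two (one amount missing).
toy_bureau = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 1, 2, 2],
    'CREDIT_ACTIVE': ['Closed', 'Closed', 'Active', 'Active', 'Active'],
    'AMT_CREDIT_SUM': [100.0, 300.0, 200.0, 50.0, np.nan],
})

toy_grouped = toy_bureau.groupby('SK_ID_CURR').agg(
    CREDIT_ACTIVE=('CREDIT_ACTIVE', most_common),
    AMT_CREDIT_SUM_MEDIAN=('AMT_CREDIT_SUM', 'median'),  # NaN values are skipped
    N_BUREAU_CREDITS=('CREDIT_ACTIVE', 'size'),          # hypothetical extra feature
).reset_index()
print(toy_grouped)
```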

In [36]:
df_bureau_grouped.head()

Unnamed: 0,SK_ID_CURR,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,100001,Closed,currency 1,-857.0,0.0,-179.0,-715.0,,0.0,168345.0,0.0,0.0,0.0,Consumer credit,-155.0,0.0
1,100002,Closed,currency 1,-1042.5,0.0,-424.5,-939.0,40.5,0.0,54130.5,0.0,0.0,0.0,Credit card,-402.5,0.0
2,100003,Closed,currency 1,-1205.5,0.0,-480.0,-621.0,0.0,0.0,92576.25,0.0,0.0,0.0,Credit card,-545.0,
3,100004,Closed,currency 1,-867.0,0.0,-488.5,-532.5,0.0,0.0,94518.9,0.0,0.0,0.0,Consumer credit,-532.0,
4,100005,Active,currency 1,-137.0,0.0,122.0,-123.0,0.0,0.0,58500.0,25321.5,0.0,0.0,Consumer credit,-31.0,0.0


### 2. Merge application and bureau data

In [41]:
df_merged = df_app.merge(df_bureau_grouped, on='SK_ID_CURR', how='left', suffixes=('', '_BUREAU'))
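
A left merge keeps every application and drops bureau IDs that do not occur in `application.csv`, which matches the instruction in the data description. The sketch below, on made-up frames, adds two optional safeguards that pandas provides: `validate` raises an error if a key is duplicated, and `indicator` makes it possible to count applications without any bureau history. The frame names and values are illustrative only.

```python
import pandas as pd

toy_app = pd.DataFrame({'SK_ID_CURR': [100002, 100008, 100024], 'TARGET': [1, 0, 0]})
toy_bureau_grouped = pd.DataFrame({
    'SK_ID_CURR': [100001, 100002, 100008],   # 100001 is not an application in the sample
    'DAYS_CREDIT': [-857.0, -1042.5, -757.0],
})

toy_merged = toy_app.merge(
    toy_bureau_grouped,
    on='SK_ID_CURR',
    how='left',
    validate='one_to_one',   # fails loudly if either side has duplicate IDs
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)
n_without_bureau = int((toy_merged['_merge'] == 'left_only').sum())
toy_merged = toy_merged.drop(columns='_merge')

print(n_without_bureau)
print(toy_merged)
```

Rows counted in `n_without_bureau` carry NaN in every bureau column, so the later modelling step needs a strategy for those missing values.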

In [42]:
df_merged.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY_BUREAU
0,100002,1.0,Cash loans,M,N,Y,0.0,202500.0,406597.5,24700.5,...,-939.0,40.5,0.0,54130.5,0.0,0.0,0.0,Credit card,-402.5,0.0
1,100008,0.0,Cash loans,M,N,Y,0.0,99000.0,490495.5,27517.5,...,-909.0,0.0,0.0,105705.0,0.0,0.0,0.0,Consumer credit,-790.0,
2,100022,0.0,Revolving loans,F,N,Y,0.0,112500.0,157500.0,7875.0,...,,0.0,0.0,528750.0,205276.5,0.0,0.0,Consumer credit,-28.0,
3,100024,0.0,Revolving loans,M,Y,Y,0.0,135000.0,427500.0,21375.0,...,,,,,,,,,,
4,100030,0.0,Cash loans,F,N,Y,0.0,90000.0,225000.0,11074.5,...,-598.0,0.0,0.0,33487.785,0.0,0.0,0.0,Consumer credit,-244.0,


In [43]:
df_merged.to_csv("app_merged.csv", index = False)

### 3. Correlation
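
A full correlation matrix with more than a hundred columns is hard to read. One way to condense it is to rank the features by the absolute value of their correlation with `TARGET`, sketched below on made-up data (only the column names are borrowed from the notebook). `numeric_only=True`, available in pandas 1.5 and later, skips string columns.

```python
import numpy as np
import pandas as pd

# Made-up data: EXT_SOURCE_2 is clearly related to TARGET, random_number is not.
toy = pd.DataFrame({
    'TARGET':             [1, 1, 1, 0, 0, 0, 0, 0],
    'EXT_SOURCE_2':       [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9],
    'random_number':      [5, 1, 7, 3, 5, 1, 7, 3],
    'NAME_CONTRACT_TYPE': ['Cash loans'] * 8,   # string column, ignored by corr
})

corr = toy.corr(numeric_only=True)
target_corr = (
    corr['TARGET']
    .drop('TARGET')                             # the self-correlation of 1.0 is not informative
    .sort_values(key=np.abs, ascending=False)   # strongest relationships first, sign preserved
)
print(target_corr)
```

Pearson correlation only captures linear, pairwise relationships, so a low value here does not rule out that a feature is useful in a tree or in combination with others.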

In [3]:
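# restrict to numeric columns; pandas >= 2.0 raises an error if string columns are included
corr = df_merged.corr(numeric_only=True)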
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,random_number,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY_BUREAU
SK_ID_CURR,1.0,-0.002474,-0.000715,-0.001703,0.000436,0.000545,0.000428,-0.002131,0.001445,-0.001687,-0.000273,-0.000103,0.000278,-0.001427,-0.00049,0.000584,0.001909,0.000646,-0.000876,-0.001915,0.000708,0.000884,0.003733,-0.000785,-0.002224,2.5e-05,0.00324,0.00134,0.002671,-0.00199,-0.001318,0.008284,0.001009,0.004314,-0.003106,0.004877,0.004642,0.003787,0.001251,0.001922,-0.000359,0.003055,0.00114,-0.001514,-0.00102,0.00777,0.001114,0.004475,-0.00324,0.004639,0.003743,0.004129,0.002029,0.002422,7.9e-05,0.002323,0.001114,-0.001515,-0.001306,0.008172,0.001093,0.004559,-0.002863,0.004714,0.004338,0.004121,0.001379,0.002345,-0.000909,0.002783,-0.000782,-0.000325,-0.000211,-0.00197,0.001838,-0.000391,-0.001849,0.003949,-0.002086,4e-06,0.002173,0.000488,0.00044,-0.00408,0.000186,-0.002739,0.0025,-0.000567,-0.001096,0.000317,-0.000854,-0.002415,0.002222,-0.001806,-0.001896,0.00074,0.000459,0.00168,0.003517,0.001848,0.004098,-0.003686,-0.000227,0.000342,0.002943,0.00104,0.001701,0.001058,6e-05,-0.002628,0.002817,-0.005909
TARGET,-0.002474,1.0,0.01796,-0.003026,-0.030508,-0.012918,-0.039702,0.062352,-0.043391,-0.001937,0.049793,-3e-06,0.001633,0.044571,0.025756,0.003867,-0.023366,-0.00403,0.054418,0.061094,-0.002665,0.006011,0.00301,0.043044,0.049784,0.032547,-0.159187,-0.178523,-0.02882,-0.022553,-0.001305,-0.021268,-0.009496,-0.032901,-0.019179,-0.042417,-0.034735,-0.011335,-0.022185,-0.032169,-0.006585,-0.012831,-0.025486,-0.019962,-0.001363,-0.021522,-0.008143,-0.030815,-0.0175,-0.041779,-0.033721,-0.010955,-0.020514,-0.030042,-0.005469,-0.012068,-0.027336,-0.021858,-0.001418,-0.021584,-0.009431,-0.032647,-0.018926,-0.042261,-0.034515,-0.01177,-0.021678,-0.031931,-0.006304,-0.012726,0.013453,0.005806,0.03792,0.024775,0.003793,0.000219,0.000215,-0.022217,-0.017213,-0.007834,-0.003408,-0.000875,-0.004309,-0.001662,-0.009713,-0.009314,-0.008144,-0.009977,-0.006507,-0.006095,-0.006481,0.000404,0.003643,0.000224,0.003017,0.000235,-0.008804,-0.005269,0.019495,0.010624,0.085905,0.008661,0.040828,0.051749,0.002164,0.00195,-0.015403,0.001317,-0.006626,0.003746,0.067967,-0.002177
CNT_CHILDREN,-0.000715,0.01796,1.0,0.012956,0.002178,0.021375,-0.001786,0.284344,-0.235824,-0.012929,-0.015446,-0.018077,0.023723,0.236889,0.071653,-0.028404,-0.016546,0.023591,0.093037,0.025388,0.037134,0.005093,0.013636,0.019881,0.066549,0.07085,-0.021391,-0.041282,0.024442,-0.00734,-0.027041,0.046054,0.031744,-0.006736,-0.012849,-0.012576,-0.008493,0.006003,-0.012586,-0.009813,0.016951,-0.00163,-0.016142,-0.007021,-0.026706,0.045293,0.031919,-0.005884,-0.011393,-0.01249,-0.007861,0.007052,-0.012368,-0.008587,0.017443,-0.00103,-0.017312,-0.007334,-0.027149,0.046018,0.031924,-0.006404,-0.012834,-0.012229,-0.007829,0.006004,-0.011861,-0.009734,0.0173,-0.001564,0.016348,0.001084,-0.009847,0.007102,-0.008318,0.004571,-0.004059,-0.134617,-0.073472,0.044753,0.01916,0.002421,-0.006,-0.000424,0.003402,-0.00161,0.000653,0.007663,0.004381,0.00456,0.002047,-0.000479,-0.001535,-0.000864,0.000591,-0.001644,-0.009791,-0.010691,-0.028911,-0.001015,0.022172,-0.001088,0.012148,0.012106,-0.002243,-0.005737,0.017686,0.021556,-0.002016,-0.001437,0.01898,-0.000362
AMT_INCOME_TOTAL,-0.001703,-0.003026,0.012956,1.0,0.149994,0.182211,0.152465,0.016024,-0.059855,-0.009252,0.010996,0.078337,-0.003354,0.05995,-0.016477,0.00472,-0.002716,0.031232,-0.077722,-0.087912,-0.006347,0.059418,0.055552,0.004763,0.0064,0.006971,0.057553,-0.026313,0.011783,0.016576,0.007799,0.023242,0.024235,0.04204,0.007689,0.056547,0.140172,-0.000142,0.105187,0.037303,0.026862,0.075597,0.028523,0.012503,0.00761,0.018601,0.016079,0.038232,0.004523,0.054209,0.132249,-0.002114,0.091465,0.032496,0.022055,0.0623,0.031993,0.015741,0.007802,0.023062,0.02322,0.041222,0.007131,0.056094,0.138905,-0.00048,0.103195,0.036662,0.025628,0.071448,-0.013764,-0.004347,-0.009529,-0.012287,0.000147,0.006762,-0.000946,-0.037497,-0.019065,0.058794,0.036069,0.008209,0.000833,0.002892,0.019268,0.019147,0.013573,0.006791,0.004402,0.002948,0.002372,0.000285,-7e-05,0.00021,0.0019,0.001692,0.020302,0.010377,0.00628,-0.009061,-0.003969,-0.001495,0.007411,-0.001179,0.00621,0.000279,0.062479,0.031721,0.013868,-0.001049,0.018509,0.037782
AMT_CREDIT,0.000436,-0.030508,0.002178,0.149994,1.0,0.770384,0.986961,-0.029265,-0.069921,0.002965,-0.01016,0.003933,0.014761,0.068226,-0.006761,-0.026905,0.032811,0.020594,-0.091079,-0.11029,0.027393,0.05116,0.052674,-0.024949,-0.019477,-0.000141,0.130918,0.045258,0.060708,0.042033,-0.019578,0.041664,0.046763,0.077953,0.014036,0.098652,0.078925,0.016514,0.055384,0.070397,0.033403,0.035457,0.047754,0.034071,-0.020098,0.039504,0.042752,0.0721,0.008955,0.095717,0.075605,0.01321,0.047674,0.062502,0.030502,0.029677,0.052864,0.040214,-0.019731,0.041336,0.046134,0.076396,0.012815,0.097966,0.07833,0.015685,0.053783,0.069102,0.032864,0.033428,-0.002551,-0.000789,-0.048928,-0.033191,-0.013641,0.043381,-0.001721,-0.040673,-0.023395,0.069581,0.045241,0.016297,0.024057,0.016559,0.043583,0.046035,0.038729,0.050134,0.037079,0.032608,0.023136,0.023278,0.001908,-0.003283,0.000692,-0.002199,0.045715,0.028076,-0.038807,-0.020824,-0.07508,0.001176,-0.018394,-0.047125,0.006169,-0.00516,0.084762,0.031809,0.017784,-0.003412,-0.025836,0.015929
AMT_ANNUITY,0.000545,-0.012918,0.021375,0.182211,0.770384,1.0,0.775343,0.020241,-0.105688,-0.001233,0.008935,0.020239,0.013426,0.104722,-0.010791,-0.022377,0.016799,0.068681,-0.11556,-0.140466,0.024396,0.078541,0.07464,-0.004355,0.001045,0.010498,0.125502,0.03282,0.063121,0.047772,-0.01709,0.03896,0.051541,0.09873,0.014764,0.125069,0.101036,0.019688,0.070885,0.088069,0.040227,0.047592,0.060754,0.03759,-0.017169,0.035986,0.04666,0.090302,0.007625,0.120989,0.09551,0.014961,0.061042,0.077808,0.035709,0.039023,0.067882,0.045812,-0.017143,0.03839,0.050998,0.096782,0.013393,0.12388,0.099815,0.018639,0.068714,0.086603,0.039459,0.044798,-0.012585,-0.006017,-0.042614,-0.028936,-0.011903,0.042056,0.00224,-0.063774,-0.033713,0.111923,0.068561,0.020859,-0.004252,-1.8e-05,0.021068,0.030791,0.023875,0.005834,0.005334,-0.006673,-0.002428,0.006048,-0.006185,0.002148,0.000871,0.010969,0.033965,0.018817,-0.010982,-0.01296,-0.06182,-0.000656,-0.007155,-0.027451,0.006726,-0.002224,0.100228,0.037617,0.021955,-0.003325,-0.024404,0.01611
AMT_GOODS_PRICE,0.000428,-0.039702,-0.001786,0.152465,0.986961,0.775343,1.0,-0.029072,-0.067805,0.003174,-0.011636,0.007946,0.012923,0.066121,0.012968,-0.023758,0.046919,0.021725,-0.093825,-0.111649,0.025431,0.052516,0.05281,-0.025179,-0.020399,-0.001427,0.13889,0.049659,0.061375,0.046759,-0.017061,0.046399,0.045564,0.081156,0.018158,0.103874,0.08174,0.022206,0.058584,0.075613,0.034607,0.039673,0.052499,0.038746,-0.017594,0.044128,0.04142,0.075473,0.013243,0.100861,0.078146,0.018773,0.050441,0.067709,0.031538,0.033875,0.05763,0.04491,-0.017191,0.04602,0.044924,0.079655,0.016997,0.103144,0.081073,0.021469,0.05698,0.074327,0.034014,0.03762,-0.003514,0.000633,-0.050449,-0.035076,-0.013272,0.038726,0.002846,-0.043484,-0.024022,0.067461,0.044588,0.016121,0.029142,0.020062,0.043124,0.0459,0.039271,0.047199,0.035648,0.031352,0.022397,0.022905,0.003845,-0.002755,0.001055,-0.002205,0.046961,0.029131,-0.041171,-0.022597,-0.076932,0.000884,-0.019168,-0.045731,0.005788,-0.004952,0.087375,0.032233,0.01904,-0.003581,-0.02745,0.015437
DAYS_BIRTH,-0.002131,0.062352,0.284344,0.016024,-0.029265,0.020241,-0.029072,1.0,-0.562755,0.261843,0.087511,-0.680653,0.303256,0.5567,0.303746,-0.50788,0.136282,0.164103,0.102325,0.015006,0.558939,0.058658,0.061478,0.131317,0.158333,0.15027,-0.091336,-0.160364,0.427853,0.033994,-0.476652,0.275863,0.501277,-0.018882,-0.07333,-0.048626,-0.009289,0.152707,-0.032723,-0.005552,0.247014,-0.017984,-0.067949,0.031044,-0.470725,0.273322,0.506032,-0.017476,-0.071526,-0.049669,-0.009416,0.151503,-0.03757,-0.000954,0.251043,-0.016853,-0.071246,0.035253,-0.475852,0.275169,0.502037,-0.018343,-0.073129,-0.048425,-0.009386,0.150259,-0.032383,-0.005122,0.249532,-0.017185,-0.030327,0.093018,0.148094,-0.071228,-0.177282,0.144105,0.11597,-0.310623,-0.094621,0.060461,0.07567,0.053225,0.028351,0.013235,0.014038,0.025031,0.022271,0.009824,0.028143,0.030048,0.027165,0.011723,0.017612,-0.003715,0.001232,-0.012928,-0.021886,-0.024722,-0.117067,-0.12388,0.139088,-0.000935,0.096588,0.108449,-0.000558,0.013272,0.026569,0.048608,0.007678,0.003391,0.107501,0.004715
DAYS_EMPLOYED,0.001445,-0.043391,-0.235824,-0.059855,-0.069921,-0.105688,-0.067805,-0.562755,1.0,-0.071954,-0.222021,0.12674,-0.051413,-0.999563,-0.251682,0.097866,-0.017521,-0.075579,0.012883,0.031866,-0.10567,-0.102743,-0.093985,-0.086381,-0.243544,-0.21557,-0.019449,0.111033,-0.097181,-0.007576,0.087093,-0.054945,-0.09837,-0.006228,0.015615,-0.007377,-0.012478,-0.034045,-0.012469,-0.010942,-0.051427,-0.008262,-0.001701,-0.00603,0.086299,-0.053778,-0.099004,-0.005165,0.01649,-0.006406,-0.012139,-0.032652,-0.010882,-0.010069,-0.051982,-0.007262,-0.002795,-0.007983,0.086924,-0.055055,-0.098589,-0.006313,0.015842,-0.007544,-0.012915,-0.033506,-0.012727,-0.011169,-0.051571,-0.008213,0.016063,-0.015191,-0.005097,0.031217,0.032426,-0.088683,-0.034665,0.514713,0.22738,-0.102225,-0.062556,-0.016493,-0.024824,-0.013381,-0.021792,-0.02257,-0.017917,-0.035805,-0.024249,-0.034636,-0.021468,-0.010948,-0.008774,-0.00181,0.001163,0.005089,-0.025128,0.00777,0.054968,0.032715,-0.034094,0.001261,-0.059964,-0.027887,-0.003202,-0.012473,-0.049112,-0.040705,-0.015058,-0.001881,-0.048341,-0.012105
DAYS_REGISTRATION,-0.001687,-0.001937,-0.012929,-0.009252,0.002965,-0.001233,0.003174,0.261843,-0.071954,1.0,-0.138225,-0.45156,0.275809,0.062062,-0.027372,-0.42233,0.106765,0.087735,0.009111,0.020386,0.345874,-0.011018,-0.011368,-0.013575,-0.02255,-0.024743,0.01179,-0.017434,0.282677,0.018749,-0.349265,0.171835,0.333993,-0.013979,-0.056084,-0.029124,-0.004532,0.111336,-0.025409,-0.004795,0.138974,-0.009885,-0.049447,0.016364,-0.345117,0.170391,0.336727,-0.013421,-0.055408,-0.029932,-0.005162,0.110335,-0.029378,-0.002147,0.140363,-0.009266,-0.051393,0.019359,-0.348802,0.171316,0.334421,-0.01338,-0.056072,-0.028961,-0.005012,0.109405,-0.025326,-0.004647,0.140543,-0.00941,-0.024793,0.063367,0.072592,-0.059709,-0.1065,0.036819,0.034177,-0.033516,0.215586,0.012348,0.000689,0.002572,-0.001675,-0.001611,-0.002171,-0.001509,-0.001735,-0.002962,-0.001711,-0.002898,-0.002059,-0.00081,-0.000972,-0.005723,-0.002713,-0.009228,-0.017642,-0.015123,-0.053186,-0.082279,0.00449,-0.001182,-0.004641,0.006792,-0.000343,-0.002774,0.003779,0.004718,-0.000692,-0.000112,0.003086,-0.000252


### 4. Check if there are missing values in the columns:

In [6]:
# Check if there are missing values in the columns:
def missing_value_checker(df):
    for col in df.columns:
        tmp_flag = 'Numerical'
        if df[col].dtype == object:
            tmp_flag = 'Categorical'
        logging.info('{}: {} ({})'.format(col, str(df[col].isnull().sum()/float(df.shape[0])), tmp_flag))

In [7]:
missing_value_checker(df_merged)

INFO:root:SK_ID_CURR: 0.0 (Numerical)
INFO:root:TARGET: 0.0 (Numerical)
INFO:root:NAME_CONTRACT_TYPE: 0.0 (Categorical)
INFO:root:CODE_GENDER: 0.0 (Categorical)
INFO:root:FLAG_OWN_CAR: 0.0 (Categorical)
INFO:root:FLAG_OWN_REALTY: 0.0 (Categorical)
INFO:root:CNT_CHILDREN: 0.0 (Numerical)
INFO:root:AMT_INCOME_TOTAL: 0.0 (Numerical)
INFO:root:AMT_CREDIT: 0.0 (Numerical)
INFO:root:AMT_ANNUITY: 3.975625799642916e-05 (Numerical)
INFO:root:AMT_GOODS_PRICE: 0.0009360791655522868 (Numerical)
INFO:root:NAME_TYPE_SUITE: 0.004221391758166297 (Categorical)
INFO:root:NAME_INCOME_TYPE: 0.0 (Categorical)
INFO:root:NAME_EDUCATION_TYPE: 0.0 (Categorical)
INFO:root:NAME_FAMILY_STATUS: 0.0 (Categorical)
INFO:root:NAME_HOUSING_TYPE: 0.0 (Categorical)
INFO:root:REGION_POPULATION_RELATIVE: 0.0 (Categorical)
INFO:root:DAYS_BIRTH: 0.0 (Numerical)
INFO:root:DAYS_EMPLOYED: 0.0 (Numerical)
INFO:root:DAYS_REGISTRATION: 0.0 (Numerical)
INFO:root:DAYS_ID_PUBLISH: 0.0 (Numerical)
INFO:root:OWN_CAR_AGE: 0.638691513123

**Comment**: _Missing data were found in the dataframe (for example, roughly 64% of the OWN_CAR_AGE values are missing), so they need to be handled before modelling._
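
As a complement to the per-column log above, the following is a minimal sketch of how the columns could be ranked by their share of missing values. It runs on a small made-up frame (`toy`), not on the real `df_merged`, and the values in it are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy data standing in for df_merged (values are made up for illustration).
toy = pd.DataFrame({
    "AMT_ANNUITY": [100.0, np.nan, 300.0, 400.0],
    "OWN_CAR_AGE": [np.nan, np.nan, np.nan, 5.0],
    "CODE_GENDER": ["F", "M", "F", "M"],
})

# Fraction of missing values per column, largest first.
missing_fraction = toy.isnull().mean().sort_values(ascending=False)

# Keep only the columns that actually contain missing values.
columns_with_gaps = missing_fraction[missing_fraction > 0]
print(columns_with_gaps)
```

Sorting the fractions makes it easier to see which columns need imputation and which ones are so sparse that dropping them might be worth considering.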

### 5. See how many unique values there are for each column:

In [8]:
# Count the unique values in each column; a column with only one unique value carries no information, so we drop it
def unique_value_printer(df):
    cols_to_drop = []
    for col in df.columns:
        tmp_num_unique = len(df[col].unique())
        tmp_flag = 'Numerical'
        if df[col].dtype == object:
            tmp_flag = 'Categorical'
        logging.info('{}: {} ({})'.format(col, str(tmp_num_unique), tmp_flag))
        if tmp_num_unique == 1:
            cols_to_drop.append(col)
    # Pass the labels by keyword: newer pandas versions no longer accept the positional axis argument
    df.drop(columns=cols_to_drop, inplace=True)

In [9]:
unique_value_printer(df_merged)

INFO:root:SK_ID_CURR: 276686 (Numerical)
INFO:root:TARGET: 2 (Numerical)
INFO:root:NAME_CONTRACT_TYPE: 2 (Categorical)
INFO:root:CODE_GENDER: 3 (Categorical)
INFO:root:FLAG_OWN_CAR: 2 (Categorical)
INFO:root:FLAG_OWN_REALTY: 2 (Categorical)
INFO:root:CNT_CHILDREN: 14 (Numerical)
INFO:root:AMT_INCOME_TOTAL: 2348 (Numerical)
INFO:root:AMT_CREDIT: 5439 (Numerical)
INFO:root:AMT_ANNUITY: 13418 (Numerical)
INFO:root:AMT_GOODS_PRICE: 941 (Numerical)
INFO:root:NAME_TYPE_SUITE: 8 (Categorical)
INFO:root:NAME_INCOME_TYPE: 9 (Categorical)
INFO:root:NAME_EDUCATION_TYPE: 10 (Categorical)
INFO:root:NAME_FAMILY_STATUS: 11 (Categorical)
INFO:root:NAME_HOUSING_TYPE: 11 (Categorical)
INFO:root:REGION_POPULATION_RELATIVE: 170 (Categorical)
INFO:root:DAYS_BIRTH: 17513 (Numerical)
INFO:root:DAYS_EMPLOYED: 17766 (Numerical)
INFO:root:DAYS_REGISTRATION: 15534 (Numerical)
INFO:root:DAYS_ID_PUBLISH: 8686 (Numerical)
INFO:root:OWN_CAR_AGE: 4538 (Numerical)
INFO:root:FLAG_MOBIL: 46 (Numerical)
INFO:root:FLAG_EM

### 6. Encoding of the categorical features

**Comment**: _Some categorical features have hundreds or even thousands of unique values (e.g. TOTALAREA_MODE, WALLSMATERIAL_MODE), so one-hot encoding is not suitable here. Because some of the categorical features are nominal, label encoding would also be a bad idea, since it would impose an arbitrary order on the categories._

We use frequency encoding here to ensure scalability of the encoding method and help avoid the explosion of dimensionality.

In [2]:
for col in df_merged.columns:
    if df_merged[col].dtype == object:
        df_tmp_group = df_merged.groupby(col).size()/len(df_merged)
        df_merged.loc[:, col+'_ENCODED'] = df_merged[col].map(df_tmp_group)

INFO:numexpr.utils:Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
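
To make the effect of the frequency encoding above concrete, here is a small sketch that applies the same `groupby(...).size() / len(...)` idea to a made-up column. The column name `CATEGORY` and its values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Toy frame with one categorical column; the last row has a missing category.
toy = pd.DataFrame({"CATEGORY": ["A", "A", "B", "A", None]})

# Relative frequency of each category, measured against all rows.
# groupby drops missing keys by default, so None gets no frequency.
freq = toy.groupby("CATEGORY").size() / len(toy)

# Replace each category by its relative frequency.
toy["CATEGORY_ENCODED"] = toy["CATEGORY"].map(freq)
print(toy)
```

Two things are worth keeping in mind with this encoding: rows with a missing category stay missing after the mapping, and two categories that happen to occur equally often receive the same encoded value and can no longer be told apart.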


In [3]:
# Boolean version of the target: True where TARGET == 1, False otherwise
df_merged['TARGET_BOOL'] = df_merged['TARGET'] == 1

In [5]:
# Keep every numerical column except the ID and the target columns
features_drop = ['SK_ID_CURR', 'TARGET', 'TARGET_BOOL']
features_train = [col for col in df_merged.columns
                  if col not in features_drop and df_merged[col].dtype != object]

df_X = df_merged[features_train]

### 7. Fill the missing values with Multiple Imputation by Chained Equations (MICE)

In [10]:
# Fill the missing values with IterativeImputer (a MICE-style, chained-equations imputer).
# Alternative (commented out): KNN imputation
#from sklearn.impute import KNNImputer
#imputer = KNNImputer(n_neighbors=3)

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(imputation_order='ascending',max_iter=10,random_state=42,n_nearest_features=None)
df_filled = imputer.fit_transform(df_X) # np.array format

# generate new imputed dataframe
df_X = pd.DataFrame(data=df_filled, index=df_X.index, columns=df_X.columns)
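
The following is a minimal sketch of what the imputation step does, run on a tiny made-up frame (`toy`, with hypothetical columns `x1` to `x3`) instead of the full feature matrix. Note that scikit-learn's `IterativeImputer` is inspired by MICE but, with its default settings, returns a single imputed data set rather than multiple imputations.

```python
import numpy as np
import pandas as pd
# The experimental flag has to be imported before IterativeImputer itself.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data: three correlated columns with two missing entries.
toy = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 4.0, np.nan, 8.0, 10.0, 12.0],
    "x3": [1.5, np.nan, 4.5, 6.0, 7.5, 9.0],
})

# Each column with gaps is regressed on the other columns, round after round.
imputer = IterativeImputer(max_iter=10, random_state=42)
filled = pd.DataFrame(imputer.fit_transform(toy),
                      index=toy.index, columns=toy.columns)
print(filled)
```

Observed values are left unchanged; only the missing cells are replaced by model-based estimates.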

In [11]:
df_X.to_csv("app_final.csv", index = False)

In [5]:
df_X['TARGET'] = df_merged['TARGET']
df_X.to_csv("app_final(w_target).csv", index = False) # for FFTrees R code

## 3. Comparative Model Analysis

In [1]:
import logging

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')

logger = logging.getLogger()
logger.setLevel(logging.INFO)
logging.info("Start 1. Explorative Analysis")

df_merged = pd.read_csv("app_merged.csv", sep = ",")
df_X = pd.read_csv("app_final.csv", sep = ",")
df_y = df_merged['TARGET']

INFO:root:Start 1. Explorative Analysis


### 1. FFT

**Comment**: _FFT was run with the original FFTrees library in R (script: r_fftrees.r); the results can be seen in the presentation._
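
The R script is not reproduced here. As a rough illustration of how a fast-and-frugal tree makes a prediction, the sketch below walks through an ordered list of cues, each with one exit, and falls back to a default decision after the last cue. The cues, the thresholds, the helper `fft_predict`, and the reading of `1` as "predicted default" are all hypothetical; this is not the tree that FFTrees produced, and it does not show how FFTrees builds a tree.

```python
from typing import Callable, List, Tuple

# A node is (cue test, decision returned if the test is True).
Node = Tuple[Callable[[dict], bool], int]

def fft_predict(applicant: dict, nodes: List[Node], default: int) -> int:
    """Check the cues in order and exit at the first one that fires."""
    for test, decision in nodes:
        if test(applicant):
            return decision
    # No cue fired: take the second exit of the last node.
    return default

# Hypothetical cues and thresholds, chosen only for illustration.
nodes = [
    (lambda a: a["AMT_CREDIT"] > 1_000_000, 1),
    (lambda a: a["DAYS_EMPLOYED"] < -2000, 0),
    (lambda a: a["AMT_ANNUITY"] > 0.5 * a["AMT_INCOME_TOTAL"], 1),
]

applicant = {"AMT_CREDIT": 500_000, "DAYS_EMPLOYED": -100,
             "AMT_ANNUITY": 60_000, "AMT_INCOME_TOTAL": 100_000}
print(fft_predict(applicant, nodes, default=0))
```

Because every decision can be traced back to the single cue that triggered the exit, a tree of this kind is easy for employees to inspect and to override when circumstances change.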


### 2. Gradient Boosting

In [2]:
import warnings

import xgboost as xgb
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score

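# Objective for the Bayesian optimizer: mean ROC AUC over 10-fold cross-validation.
# max_depth and n_estimators arrive as floats and are cast to int below.
def xgboostcv(max_depth,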
              learning_rate,
              n_estimators,
              gamma,
              min_child_weight,
              max_delta_step,
              subsample,
              colsample_bytree,
              reg_alpha,
              reg_lambda,
              silent=True,
              nthread=-1,
              random_state=1):
    return cross_val_score(xgb.XGBClassifier(max_depth=int(max_depth),
                                             learning_rate=learning_rate,
                                             n_estimators=int(n_estimators),
                                             silent=silent,
                                             nthread=nthread,
                                             gamma=gamma,
                                             min_child_weight=min_child_weight,
                                             max_delta_step=max_delta_step,
                                             subsample=subsample,
                                             colsample_bytree=colsample_bytree,
                                             reg_alpha=reg_alpha,
                                             reg_lambda = reg_lambda),
                           df_X,
                           df_y,
                           cv=10,
                           scoring="roc_auc",
                           n_jobs=-1).mean()

xgboostBO = BayesianOptimization(xgboostcv,
                                 {'max_depth': (2, 5),
                                  'learning_rate': (0.01, 0.3),
                                  'n_estimators': (1000, 2500),
                                  'gamma': (0.01, 1.),  # bounds are given as (lower, upper)
                                  'min_child_weight': (1, 10),
                                  'max_delta_step': (0, 0.1),
                                  'subsample': (0.5, 0.8),
                                  'colsample_bytree' :(0.1, 0.99),
                                  'reg_alpha':(0.1, 0.5),
                                  'reg_lambda':(0.1, 0.9)
                                  })

with warnings.catch_warnings():
    warnings.filterwarnings('ignore')
    xgboostBO.maximize(init_points=2, n_iter=5, acq='ei', xi=0.0)

|   iter    |  target   | colsam... |   gamma   | learni... | max_de... | max_depth | min_ch... | n_esti... | reg_alpha | reg_la... | subsample |
-------------------------------------------------------------------------------------------------------------------------------------------------
|  1        |  0.7251   |  0.5119   |  0.9797   |  0.0372   |  0.03446  |  2.271    |  6.51     |  2.317e+0 |  0.3296   |  0.3515   |  0.5632   |
|  2        |  0.7539   |  0.1925   |  0.8523   |  0.09321  |  0.09195  |  3.998    |  1.289    |  1.65e+03 |  0.5      |  0.1692   |  0.6227   |
|  3        |  0.72     |  0.1181   |  0.01     |  0.1454   |  0.01845  |  4.187    |  8.38     |  1e+03    |  0.2854   |  0.348

In [3]:
logging.info('-'*100)
logging.info('Final Results')
logging.info('Maximum XGBOOST value: %f' % xgboostBO.max['target'])
logger.info("Logging dict ---> {0}".format(xgboostBO.max['params']))

INFO:root:----------------------------------------------------------------------------------------------------
INFO:root:Final Results
INFO:root:Maximum XGBOOST value: 0.754424
INFO:root:Logging dict ---> {'colsample_bytree': 0.8298562600739037, 'gamma': 0.01, 'learning_rate': 0.11449854987508573, 'max_delta_step': 0.08635751811725184, 'max_depth': 2.5514452669206, 'min_child_weight': 1.0361971974799622, 'n_estimators': 1740.1264302275176, 'reg_alpha': 0.2407365410765918, 'reg_lambda': 0.46582272420563375, 'subsample': 0.6872749370334974}


In [4]:
max_params = xgboostBO.max['params']
max_params['max_depth'] = int(max_params['max_depth'])

# Rebuild the classifier with the best hyperparameters found by the optimizer
xgb_final_model = xgb.XGBClassifier(max_depth=max_params['max_depth'],
                                    learning_rate=max_params['learning_rate'],
                                    n_estimators=int(max_params['n_estimators']),
                                    silent=True,
                                    nthread=-1,
                                    gamma=max_params['gamma'],
                                    min_child_weight=max_params['min_child_weight'],
                                    max_delta_step=max_params['max_delta_step'],
                                    subsample=max_params['subsample'],
                                    colsample_bytree=max_params['colsample_bytree'],
                                    reg_alpha=max_params['reg_alpha'],
                                    reg_lambda=max_params['reg_lambda'])

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import balanced_accuracy_score

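kf10 = KFold(n_splits=10, shuffle=False)

bacc_scores = []
for i, (train_index, test_index) in enumerate(kf10.split(df_X), start=1):
    # KFold yields positional indices, so use .iloc for both features and labels
    X_train = df_X.iloc[train_index]
    X_test = df_X.iloc[test_index]
    y_train = df_y.iloc[train_index]
    y_test = df_y.iloc[test_index]

    # Train the model on the training folds and score it on the held-out fold
    xgb_final_model.fit(X_train, y_train)
    bacc = balanced_accuracy_score(y_test, xgb_final_model.predict(X_test))
    bacc_scores.append(bacc)
    logging.info(f"BACC for the fold no. {i} on the test set: {bacc}")

logging.info(f"Mean BACC over {len(bacc_scores)} folds: {np.mean(bacc_scores)}")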
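
For reference, balanced accuracy in the binary case is the mean of sensitivity (true positive rate) and specificity (true negative rate). The sketch below computes it by hand on made-up labels; the helper name `balanced_accuracy` and the example vectors are illustrative only and are not part of the analysis above.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary 0/1 labels.

    Assumes both classes occur in y_true (otherwise a division by zero occurs).
    """
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    sensitivity = (y_true & y_pred).sum() / y_true.sum()
    specificity = (~y_true & ~y_pred).sum() / (~y_true).sum()
    return (sensitivity + specificity) / 2

# Made-up labels: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A model that always predicts the majority class.
always_zero = [0] * 10
plain_accuracy = float(np.mean(np.array(y_true) == np.array(always_zero)))
bacc_always_zero = balanced_accuracy(y_true, always_zero)

# A model that finds one of the two positives at the cost of two false alarms.
some_hits = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
bacc_some_hits = balanced_accuracy(y_true, some_hits)

print(plain_accuracy, bacc_always_zero, bacc_some_hits)
```

The always-zero model reaches a plain accuracy of 0.8 but a balanced accuracy of only 0.5, which is why BACC is the more informative metric whenever the two classes are not equally frequent.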
