# Predicting Home Loans' Repayment and Analysis of Unbanked Population

The unbanked population, consumers without adequate credit history, will go to an untrustworthy lender whom could take unfair advantage of their situation because reputable banks will not provide them with loans. 
However, there could be other ways to determine the credit worthiness of applicants. The vision of Home Credit Group is to provide fair and equal services to a non-traditional home loan applicant. 


1) Determine if the applicant will likely repay their loan by using previous credit information via the Home Credit Group's databases and the Credit Bureau

2) Find non-traditional methods to determine the credit worthiness of an applicant (unbanked

In [26]:
import pandas as pd
import numpy as np
import pickle

pd.set_option?


pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',500)
pd.set_option('display.width',1000)

# Load datasets into dataframes

In [2]:
df_application_train = pd.read_csv('application_train.csv',header=0,error_bad_lines=False,nrows=10000)
df_previous_application = pd.read_csv('previous_application.csv',header=0,error_bad_lines=False,nrows=50000)
df_installment_payments = pd.read_csv('installments_payments.csv',header=0,error_bad_lines=False,nrows=50000)
df_bureau = pd.read_csv('bureau.csv',header=0,error_bad_lines=False,nrows=50000)
df_bureau_bal = pd.read_csv('bureau_balance.csv',header=0,error_bad_lines=False,nrows=5000)
df_pos_cash_bal = pd.read_csv('POS_CASH_balance.csv',header=0,error_bad_lines=False,nrows=5000)
df_credit_card_bal = pd.read_csv('credit_card_balance.csv',header=0,error_bad_lines=False,nrows=5000)

dflist = [df_application_train,df_previous_application,df_installment_payments,df_bureau,df_bureau_bal,df_pos_cash_bal,df_credit_card_bal]


# Cleaning Datasets

Started with Previous Application Dataset because it had historical information about the application. My main focus was on 3 noticable columns. 

1. FLAG_LAST_APPL_PER_CONTRACT: If this column had the value of 'N', then it was more than likely a mistake or duplicate of another application. My thought is that this maybe unnecessary data to the purpose of this project.

2. AMT_APPLICATION: This column states how much the applicant has asked for from the bank. 

3. AMT_CREDIT: This column states how much the bank as credited the applicant.

My thought is if (2) and (3) are 0, then the application is incomplete and it would not provide me with additional information about the application. Thus I should remove these rows from my dataset.
        

In [3]:
df_previous_application = df_previous_application.drop(df_previous_application[df_previous_application['FLAG_LAST_APPL_PER_CONTRACT']=='N'].index)
df_previous_application = df_previous_application.drop(df_previous_application[(df_previous_application['AMT_APPLICATION']==0.0) & (df_previous_application['AMT_CREDIT']==0.0)].index)



In the next cell, the 'RATE_DOWN_PAYMENT','RATE_INTEREST_PRIMARY', and 'RATE_INTEREST_PRIVILEGED' columns has 'NaN' values. Because I will be aggregating the columns to merge them into the main dataset (application_train), I have replaced the NaN values with 0. 

In [4]:
df_previous_application[['RATE_DOWN_PAYMENT','RATE_INTEREST_PRIMARY','RATE_INTEREST_PRIVILEGED']].isnull().sum()

RATE_DOWN_PAYMENT           15511
RATE_INTEREST_PRIMARY       40148
RATE_INTEREST_PRIVILEGED    40148
dtype: int64

In [5]:
df_previous_application[['RATE_DOWN_PAYMENT','RATE_INTEREST_PRIMARY','RATE_INTEREST_PRIVILEGED']] = df_previous_application[['RATE_DOWN_PAYMENT','RATE_INTEREST_PRIMARY','RATE_INTEREST_PRIVILEGED']].fillna(0.0)

Previous application dataset has all of the previous application available. Currently, it has a 1:many relationship to the application dataset, so I will need to aggregate the columns before merging. 

In [27]:
df_previous_application.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,FLAG_LAST_APPL_PER_CONTRACT,NFLAG_LAST_APPL_IN_DAY,RATE_DOWN_PAYMENT,RATE_INTEREST_PRIMARY,RATE_INTEREST_PRIVILEGED,NAME_CASH_LOAN_PURPOSE,NAME_CONTRACT_STATUS,DAYS_DECISION,NAME_PAYMENT_TYPE,CODE_REJECT_REASON,NAME_TYPE_SUITE,NAME_CLIENT_TYPE,NAME_GOODS_CATEGORY,NAME_PORTFOLIO,NAME_PRODUCT_TYPE,CHANNEL_TYPE,SELLERPLACE_AREA,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL
0,2030495,271877,Consumer loans,1730.43,17145.0,17145.0,0.0,17145.0,SATURDAY,15,Y,1,0.0,0.182832,0.867336,XAP,Approved,-73,Cash through the bank,XAP,,Repeater,Mobile,POS,XNA,Country-wide,35,Connectivity,12.0,middle,POS mobile with interest,365243.0,-42.0,300.0,-42.0,-37.0,0.0
1,2802425,108129,Cash loans,25188.615,607500.0,679671.0,,607500.0,THURSDAY,11,Y,1,0.0,0.0,0.0,XNA,Approved,-164,XNA,XAP,Unaccompanied,Repeater,XNA,Cash,x-sell,Contact center,-1,XNA,36.0,low_action,Cash X-Sell: low,365243.0,-134.0,916.0,365243.0,365243.0,1.0
2,2523466,122040,Cash loans,15060.735,112500.0,136444.5,,112500.0,TUESDAY,11,Y,1,0.0,0.0,0.0,XNA,Approved,-301,Cash through the bank,XAP,"Spouse, partner",Repeater,XNA,Cash,x-sell,Credit and cash offices,-1,XNA,12.0,high,Cash X-Sell: high,365243.0,-271.0,59.0,365243.0,365243.0,1.0
3,2819243,176158,Cash loans,47041.335,450000.0,470790.0,,450000.0,MONDAY,7,Y,1,0.0,0.0,0.0,XNA,Approved,-512,Cash through the bank,XAP,,Repeater,XNA,Cash,x-sell,Credit and cash offices,-1,XNA,12.0,middle,Cash X-Sell: middle,365243.0,-482.0,-152.0,-182.0,-177.0,1.0
4,1784265,202054,Cash loans,31924.395,337500.0,404055.0,,337500.0,THURSDAY,9,Y,1,0.0,0.0,0.0,Repairs,Refused,-781,Cash through the bank,HC,,Repeater,XNA,Cash,walk-in,Credit and cash offices,-1,XNA,24.0,high,Cash Street: high,,,,,,


Below is where I aggreagated the previous_application dataset by using index 'SK_ID_CURR'. The purpose of this is so I can provide a 1:1 relationship to the main

In [7]:
df_prevapp_pivot = pd.pivot_table(df_previous_application,values=df_previous_application.columns,index='SK_ID_CURR',
                                  aggfunc={'SK_ID_PREV':(lambda x: len(x.unique())), 'NAME_CONTRACT_TYPE': (lambda x: len(x.unique())), 
                                           'AMT_ANNUITY':'sum', 'AMT_APPLICATION': 'sum','AMT_CREDIT':'sum', 'AMT_DOWN_PAYMENT':'sum', 
                                           'AMT_GOODS_PRICE':'sum','WEEKDAY_APPR_PROCESS_START':(lambda x: len(x.unique())), 
                                           'HOUR_APPR_PROCESS_START':np.mean,'FLAG_LAST_APPL_PER_CONTRACT':(lambda x: len(x.unique())), 
                                           'NFLAG_LAST_APPL_IN_DAY':(lambda x: len(x.unique())),'RATE_DOWN_PAYMENT':np.mean, 
                                           'RATE_INTEREST_PRIMARY':np.mean,'RATE_INTEREST_PRIVILEGED':np.mean, 
                                           'NAME_CASH_LOAN_PURPOSE':(lambda x: len(x.unique())),
                                           'NAME_CONTRACT_STATUS':(lambda x:x.map({'Approved':1,'Canceled':0,'Refused':0,'Unused offer':1}).mean()), 
                                           'DAYS_DECISION':np.mean, 'NAME_PAYMENT_TYPE':(lambda x: len(x.unique())),
                                           'CODE_REJECT_REASON':(lambda x: len(x.unique())),'NAME_TYPE_SUITE':(lambda x: len(x.unique())), 
                                           'NAME_CLIENT_TYPE':(lambda x: len(x.unique())),'NAME_GOODS_CATEGORY':(lambda x: len(x.unique())), 
                                           'NAME_PORTFOLIO':(lambda x: len(x.unique())), 'NAME_PRODUCT_TYPE':(lambda x: len(x.unique())),
                                           'CHANNEL_TYPE':(lambda x: len(x.unique())), 'SELLERPLACE_AREA':(lambda x: len(x.unique())), 
                                           'NAME_SELLER_INDUSTRY':(lambda x: len(x.unique())),'CNT_PAYMENT':'sum', 
                                           'NAME_YIELD_GROUP':(lambda x: len(x.unique())), 'PRODUCT_COMBINATION':(lambda x: len(x.unique())),
                                           'DAYS_FIRST_DRAWING':'sum', 'DAYS_FIRST_DUE':'sum', 'DAYS_LAST_DUE_1ST_VERSION':'sum',
                                           'DAYS_LAST_DUE':'sum', 'DAYS_TERMINATION':'sum', 'NFLAG_INSURED_ON_APPROVAL':np.mean})

In [8]:
prevapp_agg = df_previous_application.select_dtypes(exclude=['object']).drop(columns = ['SK_ID_PREV']).groupby('SK_ID_CURR', as_index = False).agg(['count', 'mean', 'max', 'min', 'sum']).reset_index()

# Merging previous_application dataset into application_train dataset 

In [13]:
df = df_application_train.merge(df_prevapp_pivot,on =['SK_ID_CURR'],suffixes=('_app_train','_prev_app'),how='left')

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Columns: 158 entries, SK_ID_CURR to WEEKDAY_APPR_PROCESS_START_prev_app
dtypes: float64(101), int64(41), object(16)
memory usage: 12.1+ MB


# Aggreagated Bureau & Installment datasets and then merged into dataset

In [15]:
df_bureau_pivot = pd.pivot_table(df_bureau,values=df_bureau.columns,index='SK_ID_CURR',
                                 aggfunc={'SK_ID_BUREAU':'count','CREDIT_ACTIVE':'count',
                                          'CREDIT_CURRENCY':(lambda x: len(x.unique())),'DAYS_CREDIT':np.min,
                                          'CREDIT_DAY_OVERDUE':'sum','DAYS_CREDIT_ENDDATE':'sum', 
                                          'DAYS_ENDDATE_FACT':'sum', 'AMT_CREDIT_MAX_OVERDUE':'sum', 
                                          'CNT_CREDIT_PROLONG':'sum', 'AMT_CREDIT_SUM':'sum','AMT_CREDIT_SUM_DEBT':'sum', 
                                          'AMT_CREDIT_SUM_LIMIT':'sum', 'AMT_CREDIT_SUM_OVERDUE':'sum',
                                          'CREDIT_TYPE':(lambda x: len(x.unique())), 'DAYS_CREDIT_UPDATE':np.min, 
                                          'AMT_ANNUITY':'sum'})

In [16]:
df = df.merge(df_bureau_pivot,on =['SK_ID_CURR'],suffixes=('_curr','_bur'),how='left')

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Columns: 174 entries, SK_ID_CURR to SK_ID_BUREAU
dtypes: float64(117), int64(41), object(16)
memory usage: 13.4+ MB


# Merging Installment Payment Dataset


AMT_INSTAL_PAY_DIFF: I believe that knowing how much the applicant has paid back in the past would be a good indicator for the future

NUM_INSTAL_VERSION_NUM_DIFF: I believe knowing how fast or slow the previous load was paid would also be a good indicator.

DAY_INSTAL_ENTRY_DIFF: Similar to NUM_INSTAL_VERSION_NUM_DIFF

In [19]:
df_installment_payments_pivot = pd.pivot_table(df_installment_payments,values=df_installment_payments.columns,index='SK_ID_CURR',
                                               aggfunc={'SK_ID_PREV':(lambda x: len(x.unique())),'NUM_INSTALMENT_VERSION':'sum', 
                                                        'NUM_INSTALMENT_NUMBER':'sum','DAYS_INSTALMENT':'sum', 'DAYS_ENTRY_PAYMENT':'sum', 
                                                        'AMT_INSTALMENT':'sum','AMT_PAYMENT':'sum'})                                                                                                                      

In [20]:
df_installment_payments_pivot['AMT_INSTAL_PAY_DIFF'] = df_installment_payments_pivot['AMT_INSTALMENT']-df_installment_payments_pivot['AMT_PAYMENT']
df_installment_payments_pivot['NUM_INSTAL_VERSION_NUM_DIFF'] = df_installment_payments_pivot['NUM_INSTALMENT_VERSION']-df_installment_payments_pivot['NUM_INSTALMENT_NUMBER']
df_installment_payments_pivot['DAY_INSTAL_ENTRY_DIFF'] = df_installment_payments_pivot['DAYS_INSTALMENT']-df_installment_payments_pivot['DAYS_ENTRY_PAYMENT']


In [21]:
df = df.merge(df_installment_payments_pivot,on =['SK_ID_CURR'],suffixes=('_curr2','_instpay'),how='left')

In [33]:
df['BANKED']= df.index.isin(df_bureau.index)
df['BANKED']=df.BANKED.map({True:1,False:0})

In [34]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Columns: 185 entries, SK_ID_CURR to BANKED
dtypes: float64(127), int64(42), object(16)
memory usage: 14.2+ MB
None


# Eliminating Outliers and Unnessary columns

To find the outliers and NaNs, I went through each column using the code below and switched only the column index value of the iloc function. Through this method, I was able to drop about 83 columns because it did not have enough values to fill, did not seem to have any factor to the business case (e.g. weekday of application), or had low variety of values (e.g. 99% of values in the given column are True). 

As a result, I decreased the size of the dataframe from 140.4MB to 84MB as seen below.  



In [35]:
df.head(50)

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE_app_train,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT_app_train,AMT_ANNUITY_app_train,AMT_GOODS_PRICE_app_train,NAME_TYPE_SUITE_app_train,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START_app_train,HOUR_APPR_PROCESS_START_app_train,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,AMT_ANNUITY_prev_app,AMT_APPLICATION,AMT_CREDIT_prev_app,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE_prev_app,CHANNEL_TYPE,CNT_PAYMENT,CODE_REJECT_REASON,DAYS_DECISION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_TERMINATION,FLAG_LAST_APPL_PER_CONTRACT,HOUR_APPR_PROCESS_START_prev_app,NAME_CASH_LOAN_PURPOSE,NAME_CLIENT_TYPE,NAME_CONTRACT_STATUS,NAME_CONTRACT_TYPE_prev_app,NAME_GOODS_CATEGORY,NAME_PAYMENT_TYPE,NAME_PORTFOLIO,NAME_PRODUCT_TYPE,NAME_SELLER_INDUSTRY,NAME_TYPE_SUITE_prev_app,NAME_YIELD_GROUP,NFLAG_INSURED_ON_APPROVAL,NFLAG_LAST_APPL_IN_DAY,PRODUCT_COMBINATION,RATE_DOWN_PAYMENT,RATE_INTEREST_PRIMARY,RATE_INTEREST_PRIVILEGED,SELLERPLACE_AREA,SK_ID_PREV_curr2,WEEKDAY_APPR_PROCESS_START_prev_app,AMT_ANNUITY,AMT_CREDIT_MAX_OVERDUE,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CNT_CREDIT_PROLONG,CREDIT_ACTIVE,CREDIT_CURRENCY,CREDIT_DAY_OVERDUE,CREDIT_TYPE,DAYS_CREDIT,DAYS_CREDIT_ENDDATE,DAYS_CREDIT_UPDATE,DAYS_ENDDATE_FACT,SK_ID_BUREAU,AMT_INSTALMENT,AMT_PAYMENT,DAYS_ENTRY_PAYMENT,DAYS_INSTALMENT,NUM_INSTALMENT_NUMBER,NUM_INSTALMENT_VERSION,SK_ID_PREV_instpay,AMT_INSTAL_PAY_DIFF,NUM_INSTAL_VERSION_NUM_DIFF,DAY_INSTAL_ENTRY_DIFF,BANKED
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.018801,-9461,-637,-3648.0,-2120,,1,1,0,1,1,0,Laborers,1.0,2,2,WEDNESDAY,10,0,0,0,0,0,0,Business Entity Type 3,0.083037,0.262949,0.139376,0.0247,0.0369,0.9722,0.6192,0.0143,0.0,0.069,0.0833,0.125,0.0369,0.0202,0.019,0.0,0.0,0.0252,0.0383,0.9722,0.6341,0.0144,0.0,0.069,0.0833,0.125,0.0377,0.022,0.0198,0.0,0.0,0.025,0.0369,0.9722,0.6243,0.0144,0.0,0.069,0.0833,0.125,0.0375,0.0205,0.0193,0.0,0.0,reg oper account,block of flats,0.0149,"Stone, brick",No,2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,Family,State servant,Higher education,Married,House / apartment,0.003541,-16765,-1188,-1186.0,-291,,1,1,0,1,1,0,Core staff,2.0,1,1,MONDAY,11,0,0,0,0,0,0,School,0.311267,0.622246,,0.0959,0.0529,0.9851,0.796,0.0605,0.08,0.0345,0.2917,0.3333,0.013,0.0773,0.0549,0.0039,0.0098,0.0924,0.0538,0.9851,0.804,0.0497,0.0806,0.0345,0.2917,0.3333,0.0128,0.079,0.0554,0.0,0.0,0.0968,0.0529,0.9851,0.7987,0.0608,0.08,0.0345,0.2917,0.3333,0.0132,0.0787,0.0558,0.0039,0.01,reg oper account,block of flats,0.0714,Block,No,1.0,0.0,1.0,0.0,-828.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,98356.995,98356.995,-690.0,-686.0,2.0,1.0,1.0,0.0,-1.0,4.0,1
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.010032,-19046,-225,-4260.0,-2531,26.0,1,1,1,1,1,0,Laborers,1.0,2,2,MONDAY,9,0,0,0,0,0,0,Government,,0.555912,0.729567,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-815.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,0.008019,-19005,-3039,-9833.0,-2437,,1,1,0,1,0,0,Laborers,2.0,2,2,WEDNESDAY,17,0,0,0,0,0,0,Business Entity Type 3,,0.650442,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,0.0,2.0,0.0,-617.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.028663,-19932,-3038,-4311.0,-3458,,1,1,0,1,0,0,Core staff,1.0,2,2,THURSDAY,11,0,0,0,0,1,1,Religion,,0.322738,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-1106.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,16509.6,180000.0,180000.0,0.0,180000.0,1.0,18.0,1.0,-865.0,365243.0,-834.0,-354.0,-324.0,-347.0,1.0,14.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,1
5,100008,0,Cash loans,M,N,Y,0,99000.0,490495.5,27517.5,454500.0,"Spouse, partner",State servant,Secondary / secondary special,Married,House / apartment,0.035792,-16941,-1588,-4970.0,-477,,1,1,1,1,1,0,Laborers,2.0,2,2,WEDNESDAY,16,0,0,0,0,0,0,Other,,0.354225,0.621226,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-2536.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1
6,100009,0,Cash loans,F,Y,Y,1,171000.0,1560726.0,41301.0,1395000.0,Unaccompanied,Commercial associate,Higher education,Married,House / apartment,0.035792,-13778,-3130,-1213.0,-619,17.0,1,1,0,1,1,0,Accountants,3.0,2,2,SUNDAY,16,0,0,0,0,0,0,Business Entity Type 3,0.774761,0.724,0.49206,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,0.0,1.0,0.0,-1562.0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0.0,0.0,0.0,1.0,1.0,2.0,8996.76,98239.5,98239.5,0.0,98239.5,1.0,12.0,1.0,-449.0,365243.0,-418.0,-88.0,-88.0,-84.0,1.0,18.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,,,,,,,,,,,,,,,,,17818.02,17818.02,-421.0,-392.0,12.0,2.0,2.0,0.0,-10.0,29.0,1
7,100010,0,Cash loans,M,Y,Y,0,360000.0,1530000.0,42075.0,1530000.0,Unaccompanied,State servant,Higher education,Married,House / apartment,0.003122,-18850,-449,-4597.0,-2379,8.0,1,1,1,1,0,0,Managers,2.0,3,3,MONDAY,16,0,0,0,0,1,1,Other,,0.714279,0.540654,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,0.0,2.0,0.0,-1070.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1
8,100011,0,Cash loans,F,N,Y,0,112500.0,1019610.0,33826.5,913500.0,Children,Pensioner,Secondary / secondary special,Married,House / apartment,0.018634,-20099,365243,-7427.0,-3514,,1,0,0,1,0,0,,2.0,2,2,WEDNESDAY,14,0,0,0,0,0,0,XNA,0.587334,0.205747,0.751724,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,0.0,1.0,0.0,0.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,14588.55,449.685,-1189.0,-2147.0,12.0,1.0,1.0,14138.865,-11.0,-958.0,1
9,100012,0,Revolving loans,M,N,Y,0,135000.0,405000.0,20250.0,405000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.019689,-14469,-2019,-14437.0,-3992,,1,1,0,1,0,0,Laborers,1.0,2,2,THURSDAY,8,0,0,0,0,0,0,Electricity,,0.746644,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,0.0,2.0,0.0,-1673.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,3012.075,18720.0,23697.0,0.0,18720.0,1.0,12.0,1.0,-1673.0,365243.0,-1641.0,-1401.0,-1311.0,-1397.0,1.0,9.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,,,,,,,,,,,,,,,,,5242.86,5242.86,-428.0,-387.0,4.0,3.0,1.0,0.0,-1.0,41.0,1


df = df.drop(columns=['NAME_TYPE_SUITE_app_train','FLAG_MOBIL',
                'FLAG_EMP_PHONE','FLAG_PHONE','FLAG_WORK_PHONE',
                'FLAG_EMAIL','FLAG_CONT_MOBILE','FLAG_PHONE',
                'WEEKDAY_APPR_PROCESS_START_app_train',
                'HOUR_APPR_PROCESS_START_app_train','EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3',
                'APARTMENTS_AVG','BASEMENTAREA_AVG','YEARS_BEGINEXPLUATATION_AVG',
               'YEARS_BUILD_AVG','COMMONAREA_AVG','ELEVATORS_AVG',
               'ENTRANCES_AVG','FLOORSMAX_AVG','FLOORSMIN_AVG',
                'LANDAREA_AVG','LIVINGAPARTMENTS_AVG','LIVINGAREA_AVG',
               'NONLIVINGAPARTMENTS_AVG','NONLIVINGAREA_AVG','APARTMENTS_MODE',
                'BASEMENTAREA_MODE','YEARS_BEGINEXPLUATATION_MODE',
               'YEARS_BUILD_MODE','COMMONAREA_MODE','ELEVATORS_MODE',
               'ENTRANCES_MODE','FLOORSMAX_MODE','FLOORSMIN_MODE','LANDAREA_MODE',
                'LIVINGAPARTMENTS_MODE','LIVINGAREA_MODE','NONLIVINGAPARTMENTS_MODE',
               'NONLIVINGAREA_MODE','APARTMENTS_MEDI','BASEMENTAREA_MEDI',
                'YEARS_BEGINEXPLUATATION_MEDI','YEARS_BUILD_MEDI',
               'COMMONAREA_MEDI','ELEVATORS_MEDI','ENTRANCES_MEDI',
                'FLOORSMAX_MEDI','FLOORSMIN_MEDI','LANDAREA_MEDI',
                'LIVINGAPARTMENTS_MEDI','LIVINGAREA_MEDI',
                'NONLIVINGAPARTMENTS_MEDI','NONLIVINGAREA_MEDI',
                'FONDKAPREMONT_MODE','HOUSETYPE_MODE',
                'TOTALAREA_MODE','WALLSMATERIAL_MODE',
                'EMERGENCYSTATE_MODE','DAYS_FIRST_DRAWING',
               'DAYS_FIRST_DUE','DAYS_LAST_DUE','DAYS_LAST_DUE_1ST_VERSION',
               'DAYS_TERMINATION','NAME_TYPE_SUITE_prev_app',
               'WEEKDAY_APPR_PROCESS_START_prev_app','CREDIT_CURRENCY','AMT_REQ_CREDIT_BUREAU_HOUR',
                     'AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK',
                     'AMT_REQ_CREDIT_BUREAU_MON','AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR',
                     'FLAG_LAST_APPL_PER_CONTRACT'],axis=1)


'modewithoutnan' is a function that returns the mode of the column after eliminating NaN from the column values. This is because there are a significant amount of missing values due to lack of historical data on current applicants. 

def modewithoutnan(column):
    return df[df[column]!=np.nan][column].mode()[0]


Below, I have identified columns which has a large amount of NaN values and filled it with the mean, mode, or 0 values. I selected means for continuous variables, and mode for disctrict variables. I placed 0.0 with columns that were looking for previous credit history. 

df = df.fillna(value={'DAYS_EMPLOYED':df['DAYS_EMPLOYED'][df['DAYS_EMPLOYED']!=365243].mean(),
                'OBS_30_CNT_SOCIAL_CIRCLE':modewithoutnan('OBS_30_CNT_SOCIAL_CIRCLE'),
                'DEF_30_CNT_SOCIAL_CIRCLE':modewithoutnan('DEF_30_CNT_SOCIAL_CIRCLE'),
                'OBS_60_CNT_SOCIAL_CIRCLE': modewithoutnan('OBS_60_CNT_SOCIAL_CIRCLE'),
                'DEF_60_CNT_SOCIAL_CIRCLE': modewithoutnan('DEF_60_CNT_SOCIAL_CIRCLE'),
                'AMT_ANNUITY_prev_app':df['AMT_ANNUITY_prev_app'].mean(),
                'AMT_APPLICATION':df['AMT_APPLICATION'].mean(),
                'AMT_DOWN_PAYMENT':df['AMT_DOWN_PAYMENT'].mean(),
                'AMT_GOODS_PRICE_prev_app':df['AMT_GOODS_PRICE_prev_app'].mean(),
                'OWN_CAR_AGE':df['OWN_CAR_AGE'].mean(),
                'CHANNEL_TYPE':modewithoutnan('CHANNEL_TYPE'),
                 'CNT_PAYMENT':modewithoutnan('CNT_PAYMENT'),
                'CODE_REJECT_REASON':modewithoutnan('CODE_REJECT_REASON'),
                'DAYS_DECISION':round(df['DAYS_DECISION'].mean(),0),
                'HOUR_APPR_PROCESS_START_prev_app':df['HOUR_APPR_PROCESS_START_prev_app'].mean(),
                'NAME_CASH_LOAN_PURPOSE':modewithoutnan('NAME_CASH_LOAN_PURPOSE'),
                'NAME_CONTRACT_STATUS':df['NAME_CONTRACT_STATUS'].mean(),
                'NAME_CONTRACT_TYPE_prev_app':modewithoutnan('NAME_CONTRACT_TYPE_prev_app'),
                'NAME_GOODS_CATEGORY':modewithoutnan('NAME_GOODS_CATEGORY'),
                'NAME_PAYMENT_TYPE':modewithoutnan('NAME_PAYMENT_TYPE'),
                'NAME_PORTFOLIO':modewithoutnan('NAME_PORTFOLIO'),
                'NAME_PRODUCT_TYPE':modewithoutnan('NAME_PRODUCT_TYPE'),
                'NAME_SELLER_INDUSTRY':modewithoutnan('NAME_SELLER_INDUSTRY'),
                'NAME_YIELD_GROUP':modewithoutnan('NAME_YIELD_GROUP'),
                'NFLAG_INSURED_ON_APPROVAL':df['NFLAG_INSURED_ON_APPROVAL'].mean(),
                'NFLAG_LAST_APPL_IN_DAY':modewithoutnan('NFLAG_LAST_APPL_IN_DAY'),
                'PRODUCT_COMBINATION':modewithoutnan('PRODUCT_COMBINATION'),
                'RATE_DOWN_PAYMENT':df['RATE_DOWN_PAYMENT'].mean(),
                'RATE_INTEREST_PRIMARY':df['RATE_INTEREST_PRIMARY'].mean(),
                'RATE_INTEREST_PRIVILEGED':df['RATE_INTEREST_PRIVILEGED'].mean(),
                'SELLERPLACE_AREA':modewithoutnan('SELLERPLACE_AREA'),
                'SK_ID_PREV_curr2':0.0,'AMT_ANNUITY':df['AMT_ANNUITY'].mean(),
                'AMT_CREDIT_MAX_OVERDUE':round(df['AMT_CREDIT_MAX_OVERDUE'].mean(),0),
                'AMT_CREDIT_SUM':0.0,'AMT_CREDIT_SUM_DEBT':0,'AMT_CREDIT_SUM_LIMIT':0.0,
                'AMT_CREDIT_SUM_OVERDUE':0.0,'CNT_CREDIT_PROLONG':0.0,
                'CREDIT_ACTIVE':0.0,'CREDIT_DAY_OVERDUE':0.0,'CREDIT_TYPE':0.0,
                'DAYS_CREDIT':0.0,'DAYS_CREDIT_ENDDATE':0.0,'DAYS_CREDIT_UPDATE':0.0,
                'DAYS_ENDDATE_FACT':0.0,'SK_ID_BUREAU':0.0,'AMT_INSTALMENT':0.0,
                'AMT_PAYMENT':0.0,'DAYS_ENTRY_PAYMENT':0.0,'DAYS_INSTALMENT':0.0,
                'NUM_INSTALMENT_NUMBER':0.0,'NUM_INSTALMENT_VERSION':0.0,'SK_ID_PREV_instpay':0.0,
                'AMT_INSTAL_PAY_DIFF':0.0,'NUM_INSTAL_VERSION_NUM_DIFF':0.0,
                'DAY_INSTAL_ENTRY_DIFF':0.0,'AMT_CREDIT_prev_app':0.0,'NAME_CLIENT_TYPE':0.0,
                     'OCCUPATION_TYPE':'Other','AMT_GOODS_PRICE_app_train':df['AMT_GOODS_PRICE_app_train'].mean()})

In [36]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Columns: 185 entries, SK_ID_CURR to BANKED
dtypes: float64(127), int64(42), object(16)
memory usage: 14.2+ MB


There are a handful of columns that still have NaN values, so I have decided to drop them to simplify the dataset.

In [37]:
#df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Columns: 185 entries, SK_ID_CURR to BANKED
dtypes: float64(127), int64(42), object(16)
memory usage: 14.2+ MB


In [40]:
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV


df_dumb = pd.get_dummies(df.select_dtypes('object'),dummy_na=True)
type(df.select_dtypes(exclude=['object']))
df1 = pd.concat([df.select_dtypes(exclude=['object']),df_dumb],axis=1)

df1.fillna(value=0,inplace=True)
df1.head(100)
df1.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Columns: 320 entries, SK_ID_CURR to EMERGENCYSTATE_MODE_nan
dtypes: float64(127), int64(42), uint8(151)
memory usage: 14.4 MB


# Completed Data Wrangling

Decreased by overall dataset by 40%, while only removing 9 applicants from overall aggregated dataset.

In [42]:
df_test = df1.loc[:, df1.columns != 'TARGET']

y = df1.TARGET


# Saving File into Pickle File

In [44]:
df_test.head()

Unnamed: 0,SK_ID_CURR,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT_app_train,AMT_ANNUITY_app_train,AMT_GOODS_PRICE_app_train,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,HOUR_APPR_PROCESS_START_app_train,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,TOTALAREA_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,AMT_ANNUITY_prev_app,AMT_APPLICATION,AMT_CREDIT_prev_app,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE_prev_app,CHANNEL_TYPE,CNT_PAYMENT,CODE_REJECT_REASON,DAYS_DECISION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_TERMINATION,FLAG_LAST_APPL_PER_CONTRACT,HOUR_APPR_PROCESS_START_prev_app,NAME_CASH_LOAN_PURPOSE,NAME_CLIENT_TYPE,NAME_CONTRACT_STATUS,NAME_CONTRACT_TYPE_prev_app,NAME_GOODS_CATEGORY,NAME_PAYMENT_TYPE,NAME_PORTFOLIO,NAME_PRODUCT_TYPE,NAME_SELLER_INDUSTRY,NAME_TYPE_SUITE_prev_app,NAME_YIELD_GROUP,NFLAG_INSURED_ON_APPROVAL,NFLAG_LAST_APPL_IN_DAY,PRODUCT_COMBINATION,RATE_DOWN_PAYMENT,RATE_INTEREST_PRIMARY,RATE_INTEREST_PRIVILEGED,SELLERPLACE_AREA,SK_ID_PREV_curr2,WEEKDAY_APPR_PROCESS_START_prev_app,AMT_ANNUITY,AMT_CREDIT_MAX_OVERDUE,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CNT_CREDIT_PROLONG,CREDIT_ACTIVE,CREDIT_CURRENCY,CREDIT_DAY_OVERDUE,CREDIT_TYPE,DAYS_CREDIT,DAYS_CREDIT_ENDDATE,DAYS_CREDIT_UPDATE,DAYS_ENDDATE_FACT,SK_ID_BUREAU,AMT_INSTALMENT,AMT_PAYMENT,DAYS_ENTRY_PAYMENT,DAYS_INSTALMENT,NUM_INSTALMENT_NUMBER,NUM_INSTALMENT_VERSION,SK_ID_PREV_instpay,AMT_INSTAL_PAY_DIFF,NUM_INSTAL_VERSION_NUM_DIFF,DAY_INSTAL_ENTRY_DIFF,BANKED,NAME_CONTRACT_TYPE_app_train_Cash loans,NAME_CONTRACT_TYPE_app_train_Revolving loans,NAME_CONTRACT_TYPE_app_train_nan,CODE_GENDER_F,CODE_GENDER_M,CODE_GENDER_nan,FLAG_OWN_CAR_N,FLAG_OWN_CAR_Y,FLAG_OWN_CAR_nan,FLAG_OWN_REALTY_N,FLAG_OWN_REALTY_Y,FLAG_OWN_REALTY_nan,NAME_TYPE_SUITE_app_train_Children,NAME_TYPE_SUITE_app_train_Family,NAME_TYPE_SUITE_app_train_Group of people,NAME_TYPE_SUITE_app_train_Other_A,NAME_TYPE_SUITE_app_train_Other_B,"NAME_TYPE_SUITE_app_train_Spouse, partner",NAME_TYPE_SUITE_app_train_Unaccompanied,NAME_TYPE_SUITE_app_train_nan,NAME_INCOME_TYPE_Commercial associate,NAME_INCOME_TYPE_Pensioner,NAME_INCOME_TYPE_State servant,NAME_INCOME_TYPE_Unemployed,NAME_INCOME_TYPE_Working,NAME_INCOME_TYPE_nan,NAME_EDUCATION_TYPE_Academic degree,NAME_EDUCATION_TYPE_Higher education,NAME_EDUCATION_TYPE_Incomplete higher,NAME_EDUCATION_TYPE_Lower secondary,NAME_EDUCATION_TYPE_Secondary / secondary special,NAME_EDUCATION_TYPE_nan,NAME_FAMILY_STATUS_Civil marriage,NAME_FAMILY_STATUS_Married,NAME_FAMILY_STATUS_Separated,NAME_FAMILY_STATUS_Single / not married,NAME_FAMILY_STATUS_Widow,NAME_FAMILY_STATUS_nan,NAME_HOUSING_TYPE_Co-op apartment,NAME_HOUSING_TYPE_House / apartment,NAME_HOUSING_TYPE_Municipal apartment,NAME_HOUSING_TYPE_Office apartment,NAME_HOUSING_TYPE_Rented apartment,NAME_HOUSING_TYPE_With parents,NAME_HOUSING_TYPE_nan,OCCUPATION_TYPE_Accountants,OCCUPATION_TYPE_Cleaning staff,OCCUPATION_TYPE_Cooking staff,OCCUPATION_TYPE_Core staff,OCCUPATION_TYPE_Drivers,OCCUPATION_TYPE_HR staff,OCCUPATION_TYPE_High skill tech staff,OCCUPATION_TYPE_IT staff,OCCUPATION_TYPE_Laborers,OCCUPATION_TYPE_Low-skill Laborers,OCCUPATION_TYPE_Managers,OCCUPATION_TYPE_Medicine staff,OCCUPATION_TYPE_Private service staff,OCCUPATION_TYPE_Realty agents,OCCUPATION_TYPE_Sales staff,OCCUPATION_TYPE_Secretaries,OCCUPATION_TYPE_Security staff,OCCUPATION_TYPE_Waiters/barmen staff,OCCUPATION_TYPE_nan,WEEKDAY_APPR_PROCESS_START_app_train_FRIDAY,WEEKDAY_APPR_PROCESS_START_app_train_MONDAY,WEEKDAY_APPR_PROCESS_START_app_train_SATURDAY,WEEKDAY_APPR_PROCESS_START_app_train_SUNDAY,WEEKDAY_APPR_PROCESS_START_app_train_THURSDAY,WEEKDAY_APPR_PROCESS_START_app_train_TUESDAY,WEEKDAY_APPR_PROCESS_START_app_train_WEDNESDAY,WEEKDAY_APPR_PROCESS_START_app_train_nan,ORGANIZATION_TYPE_Advertising,ORGANIZATION_TYPE_Agriculture,ORGANIZATION_TYPE_Bank,ORGANIZATION_TYPE_Business Entity Type 1,ORGANIZATION_TYPE_Business Entity Type 2,ORGANIZATION_TYPE_Business Entity Type 3,ORGANIZATION_TYPE_Cleaning,ORGANIZATION_TYPE_Construction,ORGANIZATION_TYPE_Culture,ORGANIZATION_TYPE_Electricity,ORGANIZATION_TYPE_Emergency,ORGANIZATION_TYPE_Government,ORGANIZATION_TYPE_Hotel,ORGANIZATION_TYPE_Housing,ORGANIZATION_TYPE_Industry: type 1,ORGANIZATION_TYPE_Industry: type 10,ORGANIZATION_TYPE_Industry: type 11,ORGANIZATION_TYPE_Industry: type 12,ORGANIZATION_TYPE_Industry: type 13,ORGANIZATION_TYPE_Industry: type 2,ORGANIZATION_TYPE_Industry: type 3,ORGANIZATION_TYPE_Industry: type 4,ORGANIZATION_TYPE_Industry: type 5,ORGANIZATION_TYPE_Industry: type 6,ORGANIZATION_TYPE_Industry: type 7,ORGANIZATION_TYPE_Industry: type 8,ORGANIZATION_TYPE_Industry: type 9,ORGANIZATION_TYPE_Insurance,ORGANIZATION_TYPE_Kindergarten,ORGANIZATION_TYPE_Legal Services,ORGANIZATION_TYPE_Medicine,ORGANIZATION_TYPE_Military,ORGANIZATION_TYPE_Mobile,ORGANIZATION_TYPE_Other,ORGANIZATION_TYPE_Police,ORGANIZATION_TYPE_Postal,ORGANIZATION_TYPE_Realtor,ORGANIZATION_TYPE_Religion,ORGANIZATION_TYPE_Restaurant,ORGANIZATION_TYPE_School,ORGANIZATION_TYPE_Security,ORGANIZATION_TYPE_Security Ministries,ORGANIZATION_TYPE_Self-employed,ORGANIZATION_TYPE_Services,ORGANIZATION_TYPE_Telecom,ORGANIZATION_TYPE_Trade: type 1,ORGANIZATION_TYPE_Trade: type 2,ORGANIZATION_TYPE_Trade: type 3,ORGANIZATION_TYPE_Trade: type 4,ORGANIZATION_TYPE_Trade: type 5,ORGANIZATION_TYPE_Trade: type 6,ORGANIZATION_TYPE_Trade: type 7,ORGANIZATION_TYPE_Transport: type 1,ORGANIZATION_TYPE_Transport: type 2,ORGANIZATION_TYPE_Transport: type 3,ORGANIZATION_TYPE_Transport: type 4,ORGANIZATION_TYPE_University,ORGANIZATION_TYPE_XNA,ORGANIZATION_TYPE_nan,FONDKAPREMONT_MODE_not specified,FONDKAPREMONT_MODE_org spec account,FONDKAPREMONT_MODE_reg oper account,FONDKAPREMONT_MODE_reg oper spec account,FONDKAPREMONT_MODE_nan,HOUSETYPE_MODE_block of flats,HOUSETYPE_MODE_specific housing,HOUSETYPE_MODE_terraced house,HOUSETYPE_MODE_nan,WALLSMATERIAL_MODE_Block,WALLSMATERIAL_MODE_Mixed,WALLSMATERIAL_MODE_Monolithic,WALLSMATERIAL_MODE_Others,WALLSMATERIAL_MODE_Panel,"WALLSMATERIAL_MODE_Stone, brick",WALLSMATERIAL_MODE_Wooden,WALLSMATERIAL_MODE_nan,EMERGENCYSTATE_MODE_No,EMERGENCYSTATE_MODE_Yes,EMERGENCYSTATE_MODE_nan
0,100002,0,202500.0,406597.5,24700.5,351000.0,0.018801,-9461,-637,-3648.0,-2120,0.0,1,1,0,1,1,0,1.0,2,2,10,0,0,0,0,0,0,0.083037,0.262949,0.139376,0.0247,0.0369,0.9722,0.6192,0.0143,0.0,0.069,0.0833,0.125,0.0369,0.0202,0.019,0.0,0.0,0.0252,0.0383,0.9722,0.6341,0.0144,0.0,0.069,0.0833,0.125,0.0377,0.022,0.0198,0.0,0.0,0.025,0.0369,0.9722,0.6243,0.0144,0.0,0.069,0.0833,0.125,0.0375,0.0205,0.0193,0.0,0.0,0.0149,2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0
1,100003,0,270000.0,1293502.5,35698.5,1129500.0,0.003541,-16765,-1188,-1186.0,-291,0.0,1,1,0,1,1,0,2.0,1,1,11,0,0,0,0,0,0,0.311267,0.622246,0.0,0.0959,0.0529,0.9851,0.796,0.0605,0.08,0.0345,0.2917,0.3333,0.013,0.0773,0.0549,0.0039,0.0098,0.0924,0.0538,0.9851,0.804,0.0497,0.0806,0.0345,0.2917,0.3333,0.0128,0.079,0.0554,0.0,0.0,0.0968,0.0529,0.9851,0.7987,0.0608,0.08,0.0345,0.2917,0.3333,0.0132,0.0787,0.0558,0.0039,0.01,0.0714,1.0,0.0,1.0,0.0,-828.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,98356.995,98356.995,-690.0,-686.0,2.0,1.0,1.0,0.0,-1.0,4.0,1,1,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0
2,100004,0,67500.0,135000.0,6750.0,135000.0,0.010032,-19046,-225,-4260.0,-2531,26.0,1,1,1,1,1,0,1.0,2,2,9,0,0,0,0,0,0,0.0,0.555912,0.729567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-815.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
3,100006,0,135000.0,312682.5,29686.5,297000.0,0.008019,-19005,-3039,-9833.0,-2437,0.0,1,1,0,1,0,0,2.0,2,2,17,0,0,0,0,0,0,0.0,0.650442,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,2.0,0.0,-617.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
4,100007,0,121500.0,513000.0,21865.5,513000.0,0.028663,-19932,-3038,-4311.0,-3458,0.0,1,1,0,1,0,0,1.0,2,2,11,0,0,0,0,1,1,0.0,0.322738,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1106.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,16509.6,180000.0,180000.0,0.0,180000.0,1.0,18.0,1.0,-865.0,365243.0,-834.0,-354.0,-324.0,-347.0,1.0,14.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1


In [45]:

clf = LassoCV()
sfm = SelectFromModel(clf,threshold=0.00000000000000000000000001)
sfm.fit(df_test,y)

df_test.columns[sfm.get_support()]

Index(['AMT_INCOME_TOTAL', 'AMT_CREDIT_app_train', 'AMT_ANNUITY_app_train', 'AMT_GOODS_PRICE_app_train', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'DAYS_LAST_PHONE_CHANGE', 'AMT_APPLICATION', 'AMT_CREDIT_prev_app', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE_prev_app', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'AMT_ANNUITY', 'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT', 'AMT_INSTALMENT', 'AMT_INSTAL_PAY_DIFF'], dtype='object')

In [46]:
df.to_pickle('df_pickle.pickle')