
# **Data Mining: Home Credit Default Risk**

Authors: 李林 (3120220938), 杨洋 (3220211141), 敬甲男 (3220221052), 李翰杰 (3120220936)

GitHub repository: https://github.com/leealim/kaggle-Home-Credit-Default-Risk

---

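## Data Preprocessing: Data Analysis

There are eight tables in total; we analyze them one by one:
- application_{train|test}.csv: client loan applications
- bureau.csv/bureau_balance.csv: clients' historical borrowing records
- POS_CASH_balance.csv: clients' POS and cash loan history
- credit_card_balance.csv: snapshot history of clients' credit cards
- previous_application.csv: clients' previous application records
- installments_payments.csv: clients' installment repayment records for previous credits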

---


In [177]:
# Import the packages needed for this section and define the required values and functions

import pandas as pd
import matplotlib.pyplot as plt

app_tr_path="..\\data\\home-credit-default-risk\\application_train.csv"
app_te_path="..\\data\\home-credit-default-risk\\application_test.csv"
bur_path="..\\data\\home-credit-default-risk\\bureau.csv"
bur_bal_path="..\\data\\home-credit-default-risk\\bureau_balance.csv"
pos_path="..\\data\\home-credit-default-risk\\POS_CASH_balance.csv"
cre_path="..\\data\\home-credit-default-risk\\credit_card_balance.csv"
pre_path="..\\data\\home-credit-default-risk\\previous_application.csv"
ins_path="..\\data\\home-credit-default-risk\\installments_payments.csv"

hom_path="..\\data\\home-credit-default-risk\\HomeCredit_columns_description.csv"  # column description table
hom=pd.read_csv(hom_path)
app_tr=pd.read_csv(app_tr_path)

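def missing_values_table(df,table_name):
    # Summarize the columns of df that contain missing values (count and percentage),
    # and attach each column's description from the column description table hom.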
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table = mis_val_table.rename(
        columns = {0 : 'Missing Values',
        1 : '% of Total Values'})
    mis_val_table = mis_val_table.sort_values(
    '% of Total Values', ascending=False).round(1)
    miss_num=(mis_val_table["Missing Values"]!=0).sum()
    print("Total "+ str(miss_num) +" columns missing values")
    mis_val_table=mis_val_table.drop(index=mis_val_table[miss_num:].index)

    mis_val_table=mis_val_table.merge(hom,how="left",left_index=True,right_on='Row')
    mis_val_table=mis_val_table.drop(columns=['Unnamed: 0'])
    mis_val_table=mis_val_table.drop(index=mis_val_table.loc[mis_val_table["Table"]!=table_name].index)
    mis_val_table=mis_val_table.reindex(columns=["Row","Description","Special","Missing Values","% of Total Values"])
    mis_val_table = mis_val_table.reset_index(drop=True)

    return mis_val_table
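
To see what the first half of `missing_values_table` computes without the dataset or the column description file, here is a minimal sketch on a toy frame. The helper name `missing_summary` and the columns `a`, `b`, `c` are hypothetical, and the sketch leaves out the merge with `hom`.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    percents = (100 * counts / len(df)).round(1)
    summary = pd.DataFrame(
        {"Missing Values": counts, "% of Total Values": percents}
    )
    # Keep only columns with at least one missing value, most incomplete first.
    summary = summary[summary["Missing Values"] > 0]
    return summary.sort_values("% of Total Values", ascending=False)


# Toy frame with hypothetical column names.
toy = pd.DataFrame(
    {
        "a": [1.0, None, None, None],
        "b": [1.0, 2.0, None, 4.0],
        "c": [1, 2, 3, 4],
    }
)
summary = missing_summary(toy)
print(summary)
```

Column `c` has no missing values, so it does not appear in the summary.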



### 1. **application_{train|test}.csv**

In [178]:
# Inspect basic summary statistics of the training data

app_tr=pd.read_csv(app_tr_path)
app_tr.describe()

Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
count,307511.0,307511.0,307511.0,307511.0,307511.0,307499.0,307233.0,307511.0,307511.0,307511.0,...,307511.0,307511.0,307511.0,307511.0,265992.0,265992.0,265992.0,265992.0,265992.0,265992.0
mean,278180.518577,0.080729,0.417052,168797.9,599026.0,27108.573909,538396.2,0.020868,-16036.995067,63815.045904,...,0.00813,0.000595,0.000507,0.000335,0.006402,0.007,0.034362,0.267395,0.265474,1.899974
std,102790.175348,0.272419,0.722121,237123.1,402490.8,14493.737315,369446.5,0.013831,4363.988632,141275.766519,...,0.089798,0.024387,0.022518,0.018299,0.083849,0.110757,0.204685,0.916002,0.794056,1.869295
min,100002.0,0.0,0.0,25650.0,45000.0,1615.5,40500.0,0.00029,-25229.0,-17912.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,189145.5,0.0,0.0,112500.0,270000.0,16524.0,238500.0,0.010006,-19682.0,-2760.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,278202.0,0.0,0.0,147150.0,513531.0,24903.0,450000.0,0.01885,-15750.0,-1213.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,367142.5,0.0,1.0,202500.0,808650.0,34596.0,679500.0,0.028663,-12413.0,-289.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
max,456255.0,1.0,19.0,117000000.0,4050000.0,258025.5,4050000.0,0.072508,-7489.0,365243.0,...,1.0,1.0,1.0,1.0,4.0,9.0,8.0,27.0,261.0,25.0


In [179]:
# Missing-value analysis

t=missing_values_table(app_tr,"application_{train|test}.csv")
t


Total 67 columns missing values


Unnamed: 0,Row,Description,Special,Missing Values,% of Total Values
0,COMMONAREA_MEDI,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,214865,69.9
1,COMMONAREA_AVG,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,214865,69.9
2,COMMONAREA_MODE,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,214865,69.9
3,NONLIVINGAPARTMENTS_MODE,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,213514,69.4
4,NONLIVINGAPARTMENTS_AVG,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,213514,69.4
...,...,...,...,...,...
62,EXT_SOURCE_2,Normalized score from external data source,normalized,660,0.2
63,AMT_GOODS_PRICE,For consumer loans it is the price of the goods for which the loan is given,,278,0.1
64,AMT_ANNUITY,Loan annuity,,12,0.0
65,CNT_FAM_MEMBERS,How many family members does client have,,2,0.0


In [180]:
# Get the columns with a small share of missing values (below 1%)

t_small=t.loc[t["% of Total Values"]<1]
t_small


Unnamed: 0,Row,Description,Special,Missing Values,% of Total Values
57,NAME_TYPE_SUITE,Who was accompanying client when he was applying for the loan,,1292,0.4
58,OBS_30_CNT_SOCIAL_CIRCLE,How many observation of client's social surroundings with observable 30 DPD (days past due) default,,1021,0.3
59,DEF_30_CNT_SOCIAL_CIRCLE,How many observation of client's social surroundings defaulted on 30 DPD (days past due),,1021,0.3
60,OBS_60_CNT_SOCIAL_CIRCLE,How many observation of client's social surroundings with observable 60 DPD (days past due) default,,1021,0.3
61,DEF_60_CNT_SOCIAL_CIRCLE,How many observation of client's social surroundings defaulted on 60 (days past due) DPD,,1021,0.3
62,EXT_SOURCE_2,Normalized score from external data source,normalized,660,0.2
63,AMT_GOODS_PRICE,For consumer loans it is the price of the goods for which the loan is given,,278,0.1
64,AMT_ANNUITY,Loan annuity,,12,0.0
65,CNT_FAM_MEMBERS,How many family members does client have,,2,0.0
66,DAYS_LAST_PHONE_CHANGE,How many days before application did client change phone,,1,0.0


In [181]:
# Get the columns with a large share of missing values (above 1%)

t_large=t.loc[t["% of Total Values"]>1]
t_large

Unnamed: 0,Row,Description,Special,Missing Values,% of Total Values
0,COMMONAREA_MEDI,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,214865,69.9
1,COMMONAREA_AVG,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,214865,69.9
2,COMMONAREA_MODE,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,214865,69.9
3,NONLIVINGAPARTMENTS_MODE,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,213514,69.4
4,NONLIVINGAPARTMENTS_AVG,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,213514,69.4
5,NONLIVINGAPARTMENTS_MEDI,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,213514,69.4
6,FONDKAPREMONT_MODE,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,210295,68.4
7,LIVINGAPARTMENTS_MODE,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,210199,68.4
8,LIVINGAPARTMENTS_AVG,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,210199,68.4
9,LIVINGAPARTMENTS_MEDI,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,210199,68.4


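The share of missing values varies widely across columns. For columns where fewer than 1% of the values are missing, we drop the affected rows.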
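
As a minimal sketch of this rule, the snippet below selects the sparsely missing columns by a percentage threshold and drops the rows where any of them is null. The toy frame, its column names, and the 30% threshold are assumptions for illustration; the notebook applies 1% to the real data.

```python
import pandas as pd

# Toy frame with hypothetical column names.
toy = pd.DataFrame(
    {
        "rare_gap": [1.0, 2.0, None, 4.0, 5.0],
        "common_gap": [None, None, 3.0, None, 5.0],
    }
)

# Share of missing values per column, in percent.
missing_pct = 100 * toy.isnull().mean()

# Hypothetical threshold; chosen large because the toy frame has only 5 rows.
threshold = 30.0
sparse_cols = missing_pct[(missing_pct > 0) & (missing_pct < threshold)].index.tolist()

# Drop only the rows that are null in one of the sparsely missing columns.
cleaned = toy.dropna(subset=sparse_cols, how="any").reset_index(drop=True)
rows_dropped = len(toy) - len(cleaned)
print(sparse_cols, rows_dropped)
```

Columns above the threshold keep their gaps and are handled separately. In the notebook's own output, the training table goes from 307511 rows to 304531 after this step.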


In [182]:
# Drop the affected rows

app_tr.dropna(subset=t_small["Row"],
          axis=0, # axis=0 drops rows
          how='any', # how='any' drops a row if any of the subset columns is null
          inplace=True # inplace=True modifies the original DataFrame
          )
app_tr = app_tr.reset_index(drop=True)
app_tr

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
304526,456251,0,Cash loans,M,N,N,0,157500.0,254700.0,27558.0,...,0,0,0,0,,,,,,
304527,456252,0,Cash loans,F,N,Y,0,72000.0,269550.0,12001.5,...,0,0,0,0,,,,,,
304528,456253,0,Cash loans,F,N,Y,0,153000.0,677664.0,29979.0,...,0,0,0,0,1.0,0.0,0.0,1.0,0.0,1.0
304529,456254,1,Cash loans,F,N,Y,0,171000.0,370107.0,20205.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


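For the columns with a larger share of missing values, we examine and handle them one by one. First, we join summary statistics onto each feature.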
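
The join can be sketched on a toy frame as follows. The column names `area` and `wall_type` are hypothetical. The point of the sketch is that `describe()` summarizes only numeric columns by default, so non-numeric columns end up with empty statistics after a left join, which gives a simple way to spot them.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (hypothetical names).
toy = pd.DataFrame(
    {
        "area": [0.1, None, 0.3, None],
        "wall_type": ["brick", None, None, "panel"],
    }
)

# A small missing-value table keyed by column name, like the "Row" column above.
missing = pd.DataFrame(
    {
        "Row": list(toy.columns),
        "Missing Values": toy.isnull().sum().values,
    }
)

# describe() covers only numeric columns here, so wall_type has no statistics.
stats = toy.describe().T
joined = missing.merge(stats, how="left", left_on="Row", right_index=True)

# Rows whose statistics are all missing correspond to non-numeric columns.
categorical_rows = joined.loc[joined["mean"].isnull(), "Row"].tolist()
print(joined)
print(categorical_rows)
```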

In [183]:

t_large_describe=t_large.merge(app_tr.describe().T,how="left",left_on="Row",right_index=True)
pd.set_option('display.max_colwidth',400)
t_large_describe


Unnamed: 0,Row,Description,Special,Missing Values,% of Total Values,count,mean,std,min,25%,50%,75%,max
0,COMMONAREA_MEDI,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,214865,69.9,91661.0,0.044544,0.076043,0.0,0.0079,0.0208,0.0513,1.0
1,COMMONAREA_AVG,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,214865,69.9,91661.0,0.044564,0.075932,0.0,0.0078,0.0211,0.0514,1.0
2,COMMONAREA_MODE,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,214865,69.9,91661.0,0.04251,0.074343,0.0,0.0072,0.019,0.0489,1.0
3,NONLIVINGAPARTMENTS_MODE,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,213514,69.4,92987.0,0.008061,0.046265,0.0,0.0,0.0,0.0039,1.0
4,NONLIVINGAPARTMENTS_AVG,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,213514,69.4,92987.0,0.008795,0.047732,0.0,0.0,0.0,0.0039,1.0
5,NONLIVINGAPARTMENTS_MEDI,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,213514,69.4,92987.0,0.008637,0.047412,0.0,0.0,0.0,0.0039,1.0
6,FONDKAPREMONT_MODE,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,210295,68.4,,,,,,,,
7,LIVINGAPARTMENTS_MODE,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,210199,68.4,96272.0,0.105537,0.097673,0.0,0.0542,0.0762,0.1313,1.0
8,LIVINGAPARTMENTS_AVG,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,210199,68.4,96272.0,0.100662,0.092368,0.0,0.0504,0.0756,0.121,1.0
9,LIVINGAPARTMENTS_MEDI,"Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor",normalized,210199,68.4,96272.0,0.101845,0.093431,0.0,0.0513,0.0761,0.1231,1.0


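Some of the columns with many missing values are categorical. For these, we add one extra category for "missing" when converting them to numeric values.
In addition, many of the heavily missing columns describe the client's housing. These features can be simplified into a single flag that indicates whether the client has a property; note that having any one of these attributes is enough to indicate a property.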
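
Both ideas can be sketched on a toy frame. This is a sketch under stated assumptions, not the notebook's implementation: the column names and the flag name `has_housing_info` are hypothetical, `pd.get_dummies` is one possible encoder, and only the `"MyNull"` placeholder is taken from the notebook.

```python
import pandas as pd

# Toy frame with hypothetical column names.
toy = pd.DataFrame(
    {
        "wall_type": ["brick", None, "panel", None],
        "common_area": [0.05, None, None, None],
        "living_area": [0.20, None, 0.10, None],
    }
)

# 1) Treat "missing" as its own category, then one-hot encode.
toy["wall_type"] = toy["wall_type"].fillna("MyNull")
encoded = pd.get_dummies(toy["wall_type"], prefix="wall_type")

# 2) Collapse the housing columns into one flag:
#    1 if at least one housing attribute is present, 0 if all are missing.
housing_cols = ["common_area", "living_area"]
toy["has_housing_info"] = toy[housing_cols].notna().any(axis=1).astype(int)

print(encoded)
print(toy["has_housing_info"].tolist())
```

The third row has only `living_area` filled in and is still flagged, which matches the rule that a single attribute is enough.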



In [184]:
# Fill missing values in the categorical columns

temp=t_large_describe.drop(columns=["Special"]).isnull().T.any()
temp.loc[temp==True].index
rows=t_large_describe.loc[temp.loc[temp==True].index].Row

for col in rows: 
    app_tr[str(col)] = app_tr[str(col)].fillna(value="MyNull")
    
t=missing_values_table(app_tr,"application_{train|test}.csv")
app_tr[str(rows.iloc[0])]

Total 52 columns missing values


0         reg oper account
1         reg oper account
2                   MyNull
3                   MyNull
4                   MyNull
                ...       
304526    reg oper account
304527    reg oper account
304528    reg oper account
304529              MyNull
304530              MyNull
Name: FONDKAPREMONT_MODE, Length: 304531, dtype: object

In [185]:
# Rebuild the missing-value table

t_describe=t.merge(app_tr.describe().T,how="left",left_on="Row",right_index=True)

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
304526,456251,0,Cash loans,M,N,N,0,157500.0,254700.0,27558.0,...,0,0,0,0,,,,,,
304527,456252,0,Cash loans,F,N,Y,0,72000.0,269550.0,12001.5,...,0,0,0,0,,,,,,
304528,456253,0,Cash loans,F,N,Y,0,153000.0,677664.0,29979.0,...,0,0,0,0,1.0,0.0,0.0,1.0,0.0,1.0
304529,456254,1,Cash loans,F,N,Y,0,171000.0,370107.0,20205.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [186]:
t_house=t_describe.loc[t_describe["Description"]=="Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor"]
# Count the missing housing attributes per row, then drop the raw housing columns
temp=app_tr.loc[:,t_house["Row"].tolist()].isnull().sum(axis=1)
app_tr=app_tr.drop(columns=t_house["Row"].tolist())
# As written, MY_HOUSE_OWN is 0 when all housing attributes are present and 1 when at least one is missing
app_tr["MY_HOUSE_OWN"]=(temp!=0).astype("int64")
app_tr["MY_HOUSE_OWN"]

0         0
1         0
2         1
3         1
4         1
         ..
304526    0
304527    0
304528    1
304529    1
304530    1
Name: MY_HOUSE_OWN, Length: 304531, dtype: int64

In [187]:
app_tr

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,MY_HOUSE_OWN
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,,,,,,,1
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
304526,456251,0,Cash loans,M,N,N,0,157500.0,254700.0,27558.0,...,0,0,0,,,,,,,0
304527,456252,0,Cash loans,F,N,Y,0,72000.0,269550.0,12001.5,...,0,0,0,,,,,,,0
304528,456253,0,Cash loans,F,N,Y,0,153000.0,677664.0,29979.0,...,0,0,0,1.0,0.0,0.0,1.0,0.0,1.0,1
304529,456254,1,Cash loans,F,N,Y,0,171000.0,370107.0,20205.0,...,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,1


In [189]:
t=missing_values_table(app_tr,"application_{train|test}.csv")
t_describe=t.merge(app_tr.describe().T,how="left",left_on="Row",right_index=True)
t_describe

Total 9 columns missing values


Unnamed: 0,Row,Description,Special,Missing Values,% of Total Values,count,mean,std,min,25%,50%,75%,max
0,OWN_CAR_AGE,Age of client's car,,200912,66.0,103619.0,12.070682,11.935821,0.0,5.0,9.0,15.0,91.0
1,EXT_SOURCE_1,Normalized score from external data source,normalized,171652,56.4,132879.0,0.501986,0.211049,0.014568,0.333967,0.505819,0.674901,0.962693
2,EXT_SOURCE_3,Normalized score from external data source,normalized,60251,19.8,244280.0,0.510764,0.194843,0.000527,0.37065,0.535276,0.669057,0.89601
3,AMT_REQ_CREDIT_BUREAU_YEAR,Number of enquiries to Credit Bureau about the client one day year (excluding last 3 months before application),,41108,13.5,263423.0,1.905904,1.869645,0.0,0.0,1.0,3.0,25.0
4,AMT_REQ_CREDIT_BUREAU_QRT,Number of enquiries to Credit Bureau about the client 3 month before application (excluding one month before application),,41108,13.5,263423.0,0.266127,0.795735,0.0,0.0,0.0,0.0,261.0
5,AMT_REQ_CREDIT_BUREAU_MON,Number of enquiries to Credit Bureau about the client one month before application (excluding one week before application),,41108,13.5,263423.0,0.267782,0.91533,0.0,0.0,0.0,0.0,27.0
6,AMT_REQ_CREDIT_BUREAU_WEEK,Number of enquiries to Credit Bureau about the client one week before application (excluding one day before application),,41108,13.5,263423.0,0.034484,0.204615,0.0,0.0,0.0,0.0,8.0
7,AMT_REQ_CREDIT_BUREAU_DAY,Number of enquiries to Credit Bureau about the client one day before application (excluding one hour before application),,41108,13.5,263423.0,0.006981,0.110358,0.0,0.0,0.0,0.0,9.0
8,AMT_REQ_CREDIT_BUREAU_HOUR,Number of enquiries to Credit Bureau about the client one hour before application,,41108,13.5,263423.0,0.006385,0.083786,0.0,0.0,0.0,0.0,4.0
