
# **Data Mining: Home Credit Default Risk**

Authors: 李林 (3120220938), 杨洋 (3220211141), 敬甲男 (3220221052), 李翰杰 (3120220936)

GitHub repository: https://github.com/leealim/kaggle-Home-Credit-Default-Risk

---

## Data Preprocessing: Outlier Handling

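There are eight tables in total, processed one at a time:
- application_{train|test}.csv: customer loan applications
- bureau.csv/bureau_balance.csv: the customer's historical borrowing records
- POS_CASH_balance.csv: the customer's POS and cash loan history
- credit_card_balance.csv: snapshot history of the customer's credit cards
- previous_application.csv: the customer's previous application records
- installments_payments.csv: the customer's installment repayment records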

---


In [8]:
# Import the packages this section needs and define the shared paths and helper function

import pandas as pd
import numpy as np
import os
import math 

source_dir=".\\data\\miss_value_handling"
result_dir=".\\data\\outlier_handling"


app_tr_path = source_dir+"\\application_train.csv"
app_te_path = source_dir+"\\application_test.csv"
pos_path = source_dir+"\\POS_CASH_balance.csv"
cre_path = source_dir+"\\credit_card_balance.csv"
pre_path = source_dir+"\\previous_application.csv"
ins_path = source_dir+"\\installments_payments.csv"
hom_path = ".\\HomeCredit_columns_description.csv"  # column description table
hom = pd.read_csv(hom_path)

if not os.path.exists(result_dir):
    os.makedirs(result_dir)

def box_outlier(data,q1_,q3_):
    df = data.copy(deep=True)
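    for col in df.select_dtypes(exclude='object').columns:             # apply the box-plot rule to each numeric column
        Q1 = df[col].quantile(q=q1_)       # lower quantile (the lower quartile when q1_=0.25)
        Q3 = df[col].quantile(q=q3_)       # upper quantile (the upper quartile when q3_=0.75)
        low_whisker = Q1 - 1.5 * (Q3 - Q1)  # lower whisker
        up_whisker = Q3 + 1.5 * (Q3 - Q1)   # upper whisker
        
        # find the outliers, get their index labels, and set those values to NaN
        rule = (df[col] > up_whisker) | (df[col] < low_whisker)  
        out = df[col].index[rule]
        df.loc[out,col]=np.nan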
    return df
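
To make the behaviour of `box_outlier` concrete, here is a small sketch on made-up data (the frame `toy` and its column names are hypothetical). The function is restated so the snippet runs on its own. It illustrates three points: object columns are skipped, flagged values become NaN in the returned copy, and the input frame is left untouched.

```python
import numpy as np
import pandas as pd

def box_outlier(data, q1_, q3_):
    """Restated from the cell above: mask values outside the whiskers with NaN."""
    df = data.copy(deep=True)
    for col in df.select_dtypes(exclude='object').columns:
        Q1 = df[col].quantile(q=q1_)
        Q3 = df[col].quantile(q=q3_)
        low_whisker = Q1 - 1.5 * (Q3 - Q1)
        up_whisker = Q3 + 1.5 * (Q3 - Q1)
        rule = (df[col] > up_whisker) | (df[col] < low_whisker)
        df.loc[rule, col] = np.nan
    return df

# Hypothetical toy data: seven ordinary amounts and one extreme value.
toy = pd.DataFrame({
    "amount": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 500.0],
    "status": pd.Series(["ok"] * 8, dtype=object),
})

masked = box_outlier(toy, 0.25, 0.75)

# Rows with at least one masked value, the same pattern the notebook uses.
flagged_rows = masked.isnull().any(axis=1)
print(masked)
print("flagged row labels:", list(flagged_rows[flagged_rows].index))
```

With the classic quartiles the whiskers for `amount` are 8.5 and 14.5, so only the value 500 is masked.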


### 1. **application_{train|test}.csv**

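During processing, DAYS_EMPLOYED was found to contain anomalous values; they are filled with the mean of the non-anomalous values.  
In addition, outliers are handled with the box-plot rule. To avoid deleting a large share of the data (the standard quartiles would delete about two thirds of the rows), the quantiles used as Q1 and Q3 are widened (the values can be set as needed).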
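
The cells in this section do not show the fill step itself, so the following is only a sketch of what "fill with the mean of the non-anomalous values" could look like. It assumes the anomaly is the positive placeholder 365243 that appears in the output further down, and that valid values are zero or negative. The helper name `fill_days_employed` and the toy frame are hypothetical.

```python
import pandas as pd

def fill_days_employed(df, col="DAYS_EMPLOYED"):
    """Replace positive (anomalous) values with the mean of the valid ones.

    Hypothetical helper: assumes valid values are <= 0.
    """
    out = df.copy()
    out[col] = out[col].astype("float64")   # the mean is generally not an integer
    anomalous = out[col] > 0
    valid_mean = out.loc[~anomalous, col].mean()
    out.loc[anomalous, col] = valid_mean
    return out

# Made-up values; 365243 is the placeholder seen in the real column.
toy = pd.DataFrame({"DAYS_EMPLOYED": [-100, -300, 365243, -200, 365243]})
filled = fill_days_employed(toy)
print(filled["DAYS_EMPLOYED"].tolist())
```

Computing the mean only over the valid rows matters here: including the placeholder would pull the fill value far into positive territory.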

In [12]:
app_tr = pd.read_csv(app_tr_path)
app_tr.describe().T

| Column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 304531.0 | 278171.558800 | 102782.532925 | 100002.0 | 189138.5 | 278193.0 | 367136.0 | 456255.0 |
| TARGET | 304531.0 | 0.081000 | 0.272836 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| CNT_CHILDREN | 304531.0 | 0.417140 | 0.722308 | 0.0 | 0.0 | 0.0 | 1.0 | 19.0 |
| AMT_INCOME_TOTAL | 304531.0 | 168663.446314 | 237890.994045 | 25650.0 | 112500.0 | 147600.0 | 202500.0 | 117000000.0 |
| AMT_CREDIT | 304531.0 | 599559.238330 | 402145.313895 | 45000.0 | 270000.0 | 517266.0 | 808650.0 | 4050000.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| AMT_REQ_CREDIT_BUREAU_WEEK | 304531.0 | 0.029829 | 0.190669 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 |
| AMT_REQ_CREDIT_BUREAU_MON | 304531.0 | 0.231635 | 0.856216 | 0.0 | 0.0 | 0.0 | 0.0 | 27.0 |
| AMT_REQ_CREDIT_BUREAU_QRT | 304531.0 | 0.230203 | 0.745648 | 0.0 | 0.0 | 0.0 | 0.0 | 261.0 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 304531.0 | 1.648630 | 1.856842 | 0.0 | 0.0 | 1.0 | 3.0 | 25.0 |


In [13]:
# Evidence of the anomalous DAYS_EMPLOYED values

app_tr["DAYS_EMPLOYED"].loc[app_tr["DAYS_EMPLOYED"]>0]

8         365243
11        365243
23        365243
38        365243
43        365243
           ...  
304489    365243
304503    365243
304507    365243
304525    365243
304527    365243
Name: DAYS_EMPLOYED, Length: 54852, dtype: int64

In [14]:
# Box-plot rule (same call as in the re-check cell that follows)

temp=box_outlier(app_tr.iloc[:,2:],0.01,0.99).isnull().any(axis=1)
temp=temp.loc[temp==True]
temp

6         True
26        True
50        True
70        True
71        True
          ... 
304283    True
304324    True
304342    True
304379    True
304528    True
Length: 13910, dtype: bool

In [15]:
# Drop the flagged rows, then re-check

app_tr.drop(index=temp.index,inplace=True)
temp=box_outlier(app_tr.iloc[:,2:],0.01,0.99).isnull().any(axis=1)
temp=temp.loc[temp==True]
temp

614       True
1397      True
1399      True
1497      True
1641      True
          ... 
299346    True
299600    True
301945    True
303769    True
304152    True
Length: 484, dtype: bool
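
The re-check still flags rows even though the first batch was dropped. A likely explanation is that the quantiles are recomputed on the remaining rows: once the most extreme values are gone, the whiskers tighten and values that were previously inside can fall outside. A minimal sketch on made-up data, where `flag_outliers` is a hypothetical stand-in for the rule inside `box_outlier`:

```python
import pandas as pd

def flag_outliers(s, q_low=0.01, q_high=0.99):
    """Boolean mask of values outside the widened box-plot whiskers."""
    q1, q3 = s.quantile(q_low), s.quantile(q_high)
    spread = q3 - q1
    return (s < q1 - 1.5 * spread) | (s > q3 + 1.5 * spread)

# Made-up data: 1..100 plus a moderate and a very large extreme.
s = pd.Series([float(v) for v in range(1, 101)] + [250.0, 10_000.0])

first = flag_outliers(s)            # first pass on the full data
remaining = s[~first]               # drop what was flagged
second = flag_outliers(remaining)   # second pass on what is left

print("first pass flags: ", s[first].tolist())
print("second pass flags:", remaining[second].tolist())
```

In this toy case the first pass flags only 10000, because the presence of both extremes pushes the 0.99 quantile up; after 10000 is removed, the upper whisker drops to 247 and 250 is flagged. The notebook drops flagged rows once per table and does not iterate further.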

In [16]:
# Write the result

app_tr.to_csv(result_dir+"\\application_train.csv",index=False)
app_te = pd.read_csv(app_te_path)
app_te.to_csv(result_dir+"\\application_test.csv",index=False)


### 2. **previous_application.csv**

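Outliers are handled with the box-plot rule. To avoid deleting a large share of the data (the standard quartiles would delete about two thirds of the rows), the quantiles used as Q1 and Q3 are widened (the values can be set as needed).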
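
The effect of widening the quantiles can be illustrated on synthetic data. The sketch below does not use the Home Credit tables: it draws 20 independent right-skewed columns and measures the share of rows that have at least one flagged value. The function name `flagged_row_fraction` is hypothetical, and the printed percentages describe this synthetic sample only.

```python
import numpy as np
import pandas as pd

def flagged_row_fraction(df, q_low, q_high):
    """Share of rows with at least one value outside the whiskers."""
    q1 = df.quantile(q_low)     # per-column lower quantile
    q3 = df.quantile(q_high)    # per-column upper quantile
    spread = q3 - q1
    outside = (df < q1 - 1.5 * spread) | (df > q3 + 1.5 * spread)
    return float(outside.any(axis=1).mean())

# Synthetic data only: 5000 rows, 20 log-normal columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.lognormal(mean=0.0, sigma=1.0, size=(5000, 20)))

frac_classic = flagged_row_fraction(toy, 0.25, 0.75)
frac_widened = flagged_row_fraction(toy, 0.01, 0.99)
print(f"classic quartiles: {frac_classic:.1%} of rows flagged")
print(f"widened quantiles: {frac_widened:.1%} of rows flagged")
```

Because a row is dropped when any one of its columns is flagged, a modest per-column rate compounds quickly across many columns, which is why the classic quartiles remove so many rows from wide tables.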

In [17]:
pre = pd.read_csv(pre_path)
pre.describe().T

| Column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SK_ID_PREV | 1669867.0 | 1923089.0 | 532599.44231 | 1000001.0 | 1461858.0 | 1923117.0 | 2384284.0 | 2845382.0 |
| SK_ID_CURR | 1669867.0 | 278358.7 | 102815.028046 | 100001.0 | 189330.0 | 278721.0 | 367514.0 | 456255.0 |
| AMT_ANNUITY | 1669867.0 | 12401.83 | 14625.840301 | 0.0 | 2250.0 | 8254.035 | 16828.51 | 418058.145 |
| AMT_APPLICATION | 1669867.0 | 175270.3 | 292799.282301 | 0.0 | 18796.5 | 71055.0 | 180441.0 | 6905160.0 |
| AMT_CREDIT | 1669867.0 | 196154.7 | 318595.110482 | 0.0 | 24192.0 | 80550.0 | 216418.5 | 6905160.0 |
| AMT_DOWN_PAYMENT | 1669867.0 | 3105.797 | 14633.322193 | -0.9 | 0.0 | 0.0 | 0.0 | 3060045.0 |
| AMT_GOODS_PRICE | 1669867.0 | 175292.5 | 292818.625273 | 0.0 | 18801.0 | 71055.0 | 180495.0 | 6905160.0 |
| HOUR_APPR_PROCESS_START | 1669867.0 | 12.48412 | 3.334075 | 0.0 | 10.0 | 12.0 | 15.0 | 23.0 |
| NFLAG_LAST_APPL_IN_DAY | 1669867.0 | 0.9964668 | 0.059336 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| RATE_DOWN_PAYMENT | 1669867.0 | 0.0369301 | 0.083477 | -1.497876e-05 | 0.0 | 0.0 | 0.0 | 1.0 |


In [18]:
# Box-plot rule

temp=box_outlier(pre.iloc[:,2:],0.01,0.99).isnull().any(axis=1)
temp=temp.loc[temp==True]
temp

0          True
30         True
151        True
207        True
277        True
           ... 
1668302    True
1668459    True
1669246    True
1669492    True
1669576    True
Length: 16981, dtype: bool

In [19]:
# Drop the flagged rows, then re-check

pre.drop(index=temp.index,inplace=True)
temp=box_outlier(pre.iloc[:,2:],0.01,0.99).isnull().any(axis=1)
temp=temp.loc[temp==True]
temp

299        True
2816       True
3466       True
4651       True
6192       True
           ... 
1654491    True
1656703    True
1662950    True
1663978    True
1667040    True
Length: 405, dtype: bool

In [22]:
# Write the result

pre.to_csv(result_dir+"\\previous_application.csv",index=False)


### 3. **POS_CASH_balance.csv**

In [23]:
pos = pd.read_csv(pos_path)
pos.describe().T

| Column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SK_ID_PREV | 9975174.0 | 1903212.0 | 535847.900702 | 1000001.0 | 1434397.25 | 1896573.0 | 2368946.0 | 2843499.0 |
| SK_ID_CURR | 9975174.0 | 278407.9 | 102764.696703 | 100001.0 | 189552.0 | 278662.0 | 367433.0 | 456255.0 |
| MONTHS_BALANCE | 9975174.0 | -35.05663 | 26.080261 | -96.0 | -54.0 | -28.0 | -13.0 | -1.0 |
| CNT_INSTALMENT | 9975174.0 | 17.08974 | 11.995085 | 1.0 | 10.0 | 12.0 | 24.0 | 92.0 |
| CNT_INSTALMENT_FUTURE | 9975174.0 | 10.4838 | 11.109033 | 0.0 | 3.0 | 7.0 | 14.0 | 85.0 |
| SK_DPD | 9975174.0 | 11.6374 | 132.886777 | 0.0 | 0.0 | 0.0 | 0.0 | 4231.0 |
| SK_DPD_DEF | 9975174.0 | 0.6561863 | 32.805445 | 0.0 | 0.0 | 0.0 | 0.0 | 3595.0 |


In [24]:
# Box-plot rule

temp=box_outlier(pos.iloc[:,2:],0.01,0.99).isnull().any(axis=1)
temp=temp.loc[temp==True]
temp

194        True
246        True
285        True
290        True
352        True
           ... 
9975168    True
9975169    True
9975170    True
9975171    True
9975172    True
Length: 142780, dtype: bool

In [25]:
# Drop the flagged rows, then re-check

pos.drop(index=temp.index,inplace=True)
temp=box_outlier(pos.iloc[:,2:],0.01,0.99).isnull().any(axis=1)
temp=temp.loc[temp==True]
temp

67         True
68         True
148        True
169        True
231        True
           ... 
9975082    True
9975083    True
9975084    True
9975104    True
9975141    True
Length: 110148, dtype: bool

In [26]:
np.sum(pos.isnull(),axis = 0)

SK_ID_PREV               0
SK_ID_CURR               0
MONTHS_BALANCE           0
CNT_INSTALMENT           0
CNT_INSTALMENT_FUTURE    0
NAME_CONTRACT_STATUS     0
SK_DPD                   0
SK_DPD_DEF               0
dtype: int64

In [27]:
# Write the result

pos.to_csv(result_dir+"\\POS_CASH_balance.csv",index=False)


### 4. **credit_card_balance.csv**

In [28]:
cre= pd.read_csv(cre_path)
cre.describe().T

| Column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SK_ID_PREV | 3840312.0 | 1904504.0 | 536469.470563 | 1000018.0 | 1434385.0 | 1897122.0 | 2369328.0 | 2843496.0 |
| SK_ID_CURR | 3840312.0 | 278324.2 | 102704.475133 | 100006.0 | 189517.0 | 278396.0 | 367580.0 | 456250.0 |
| MONTHS_BALANCE | 3840312.0 | -34.52192 | 26.667751 | -96.0 | -55.0 | -28.0 | -11.0 | -1.0 |
| AMT_BALANCE | 3840312.0 | 58300.16 | 106307.031025 | -420250.185 | 0.0 | 0.0 | 89046.69 | 1505902.185 |
| AMT_CREDIT_LIMIT_ACTUAL | 3840312.0 | 153808.0 | 165145.699523 | 0.0 | 45000.0 | 112500.0 | 180000.0 | 1350000.0 |
| AMT_DRAWINGS_ATM_CURRENT | 3840312.0 | 4797.384 | 25430.70437 | -6827.31 | 0.0 | 0.0 | 0.0 | 2115000.0 |
| AMT_DRAWINGS_CURRENT | 3840312.0 | 7433.388 | 33846.077334 | -6211.62 | 0.0 | 0.0 | 0.0 | 2287098.315 |
| AMT_DRAWINGS_OTHER_CURRENT | 3840312.0 | 231.9048 | 7358.721299 | 0.0 | 0.0 | 0.0 | 0.0 | 1529847.0 |
| AMT_DRAWINGS_POS_CURRENT | 3840312.0 | 2389.15 | 18693.534956 | 0.0 | 0.0 | 0.0 | 0.0 | 2239274.16 |
| AMT_INST_MIN_REGULARITY | 3840312.0 | 3258.821 | 5457.655789 | 0.0 | 0.0 | 0.0 | 5625.0 | 202882.005 |


In [29]:
# Box-plot rule

temp=box_outlier(cre.iloc[:,2:],0.01,0.99).isnull().any(axis=1)
temp=temp.loc[temp==True]
temp

8          True
23         True
83         True
215        True
285        True
           ... 
3840038    True
3840065    True
3840222    True
3840234    True
3840309    True
Length: 67946, dtype: bool

In [31]:
# Drop the flagged rows, then re-check

cre.drop(index=temp.index,inplace=True)
temp=box_outlier(cre.iloc[:,2:],0.01,0.99).isnull().any(axis=1)
temp=temp.loc[temp==True]
temp

5          True
12         True
153        True
157        True
262        True
           ... 
3839928    True
3839959    True
3840212    True
3840216    True
3840245    True
Length: 50024, dtype: bool

In [32]:
np.sum(cre.isnull(),axis = 0)

SK_ID_PREV                    0
SK_ID_CURR                    0
MONTHS_BALANCE                0
AMT_BALANCE                   0
AMT_CREDIT_LIMIT_ACTUAL       0
AMT_DRAWINGS_ATM_CURRENT      0
AMT_DRAWINGS_CURRENT          0
AMT_DRAWINGS_OTHER_CURRENT    0
AMT_DRAWINGS_POS_CURRENT      0
AMT_INST_MIN_REGULARITY       0
AMT_PAYMENT_CURRENT           0
AMT_PAYMENT_TOTAL_CURRENT     0
AMT_RECEIVABLE_PRINCIPAL      0
AMT_RECIVABLE                 0
AMT_TOTAL_RECEIVABLE          0
CNT_DRAWINGS_ATM_CURRENT      0
CNT_DRAWINGS_CURRENT          0
CNT_DRAWINGS_OTHER_CURRENT    0
CNT_DRAWINGS_POS_CURRENT      0
CNT_INSTALMENT_MATURE_CUM     0
NAME_CONTRACT_STATUS          0
SK_DPD                        0
SK_DPD_DEF                    0
dtype: int64

In [33]:
# Write the result

cre.to_csv(result_dir+"\\credit_card_balance.csv",index=False)


### 5. **installments_payments.csv**

In [34]:
ins = pd.read_csv(ins_path)
ins.describe().T

| Column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SK_ID_PREV | 13602496.0 | 1903364.0 | 536206.564667 | 1000001.0 | 1434182.0 | 1896524.0 | 2369094.0 | 2843499.0 |
| SK_ID_CURR | 13602496.0 | 278444.1 | 102718.470692 | 100001.0 | 189639.0 | 278684.0 | 367530.0 | 456255.0 |
| NUM_INSTALMENT_VERSION | 13602496.0 | 0.8564952 | 1.031683 | 0.0 | 0.0 | 1.0 | 1.0 | 73.0 |
| NUM_INSTALMENT_NUMBER | 13602496.0 | 18.86637 | 26.66131 | 1.0 | 4.0 | 8.0 | 19.0 | 277.0 |
| DAYS_INSTALMENT | 13602496.0 | -1042.326 | 800.945622 | -2922.0 | -1654.0 | -818.0 | -361.0 | -1.0 |
| DAYS_ENTRY_PAYMENT | 13602496.0 | -1051.114 | 800.585883 | -4921.0 | -1662.0 | -827.0 | -370.0 | -1.0 |
| AMT_INSTALMENT | 13602496.0 | 17051.07 | 50568.662196 | 0.0 | 4227.66 | 8884.71 | 16710.21 | 3771487.845 |
| AMT_PAYMENT | 13602496.0 | 17238.22 | 54735.783981 | 0.0 | 3398.265 | 8125.515 | 16108.425 | 3771487.845 |


In [35]:
# Box-plot rule

temp=box_outlier(ins.iloc[:,2:],0.01,0.99).isnull().any(axis=1)
temp=temp.loc[temp==True]
temp

104         True
179         True
501         True
632         True
879         True
            ... 
13601886    True
13602015    True
13602016    True
13602125    True
13602243    True
Length: 82861, dtype: bool

In [37]:
# Drop the flagged rows, then re-check

ins.drop(index=temp.index,inplace=True)
temp=box_outlier(ins.iloc[:,2:],0.01,0.99).isnull().any(axis=1)
temp=temp.loc[temp==True]
temp

667         True
961         True
1231        True
1232        True
1647        True
            ... 
13600257    True
13601371    True
13601691    True
13601930    True
13602237    True
Length: 26234, dtype: bool

In [38]:
np.sum(ins.isnull(),axis = 0)

SK_ID_PREV                0
SK_ID_CURR                0
NUM_INSTALMENT_VERSION    0
NUM_INSTALMENT_NUMBER     0
DAYS_INSTALMENT           0
DAYS_ENTRY_PAYMENT        0
AMT_INSTALMENT            0
AMT_PAYMENT               0
dtype: int64

In [39]:
# Write the result

ins.to_csv(result_dir+"\\installments_payments.csv",index=False)
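
Every table above goes through the same sequence: read, flag with the box-plot rule, drop the flagged rows once, write. As a sketch only (not the authors' code), those steps could be folded into one helper. The names `drop_box_outliers` and `process_table` are hypothetical, `pathlib` stands in for the Windows-style `\\` joins, and `include="number"` is used here instead of `exclude='object'`, which is a deliberate deviation from the function defined earlier.

```python
import tempfile
from pathlib import Path

import pandas as pd

def drop_box_outliers(df, q_low=0.01, q_high=0.99, skip_cols=2):
    """Drop rows where any numeric column (after the first `skip_cols`
    ID columns) lies outside the widened box-plot whiskers. Single pass."""
    numeric = df.iloc[:, skip_cols:].select_dtypes(include="number")
    q1 = numeric.quantile(q_low)
    q3 = numeric.quantile(q_high)
    spread = q3 - q1
    outside = (numeric < q1 - 1.5 * spread) | (numeric > q3 + 1.5 * spread)
    return df.loc[~outside.any(axis=1)]

def process_table(name, source_dir, result_dir):
    """Read one CSV, drop flagged rows, write it under the same file name."""
    source_dir, result_dir = Path(source_dir), Path(result_dir)
    result_dir.mkdir(parents=True, exist_ok=True)
    df = pd.read_csv(source_dir / name)
    cleaned = drop_box_outliers(df)
    cleaned.to_csv(result_dir / name, index=False)
    return len(df), len(cleaned)

# Demo on a made-up table written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in"
    src.mkdir()
    pd.DataFrame({
        "SK_ID_PREV": range(102),
        "SK_ID_CURR": range(102),
        "AMT": [float(v) for v in range(1, 101)] + [250.0, 10_000.0],
    }).to_csv(src / "toy.csv", index=False)

    n_before, n_after = process_table("toy.csv", src, Path(tmp) / "out")
    print(n_before, "rows in,", n_after, "rows out")
```

Computing the row mask directly avoids materialising a NaN-masked copy of each table, which may matter for the larger files; whether that trade-off is worthwhile depends on the available memory.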