
# **Data Mining: Home Credit Default Risk**

Authors: 李林 (3120220938), 杨洋 (3220211141), 敬甲男 (3220221052), 李翰杰 (3120220936)

GitHub repository: https://github.com/leealim/kaggle-Home-Credit-Default-Risk

---

## Data Preprocessing: Outlier Handling

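The dataset consists of eight tables, which are processed one at a time:
- application_{train|test}.csv: client loan applications
- bureau.csv/bureau_balance.csv: the client's past loan records
- POS_CASH_balance.csv: the client's POS and cash loan history
- credit_card_balance.csv: snapshot history of the client's credit cards
- previous_application.csv: the client's previous applications
- installments_payments.csv: repayment history for the client's previous loans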

---


In [99]:
# Import the packages used in this section and define the shared paths and helper functions

import pandas as pd
import numpy as np
import os
import math 

source_dir="..\\data\\miss_value_handling"
app_tr_path = source_dir+"\\application_train.csv"
app_te_path = source_dir+"\\application_test.csv"
bur_path = source_dir+"\\bureau.csv"
bur_bal_path = source_dir+"\\bureau_balance.csv"
pos_path = source_dir+"\\POS_CASH_balance.csv"
cre_path = source_dir+"\\credit_card_balance.csv"
pre_path = source_dir+"\\previous_application.csv"
ins_path = source_dir+"\\installments_payments.csv"

hom_path = "..\\data\\home-credit-default-risk\\HomeCredit_columns_description.csv"  # column description table
hom = pd.read_csv(hom_path)

result_dir="..\\data\\outlier_handling"
if not os.path.exists(result_dir):
    os.makedirs(result_dir)

def box_outlier(data):
    df = data.copy(deep=True)
    for col in df.select_dtypes(exclude='object').columns:             # apply the box-plot rule to each numeric column
        Q1 = df[col].quantile(q=0.25)       # lower quartile
        Q3 = df[col].quantile(q=0.75)       # upper quartile
        low_whisker = Q1 - 1.5 * (Q3 - Q1)  # lower whisker
        up_whisker = Q3 + 1.5 * (Q3 - Q1)   # upper whisker

        # Find the outliers, get their index labels, and set those values to NaN
        rule = (df[col] > up_whisker) | (df[col] < low_whisker)  
        out = df[col].index[rule]
        df.loc[out,col]=np.nan
    return df
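
The box-plot rule used by `box_outlier` can be checked by hand on a tiny column. The sketch below is illustrative only: the helper name `iqr_bounds` and the toy values are assumptions, not part of the original notebook.

```python
import pandas as pd

def iqr_bounds(values):
    """Return the lower and upper box-plot whiskers (1.5 * IQR rule)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Toy data (invented): one obviously extreme amount and one non-numeric column.
toy = pd.DataFrame({
    "amount": [10.0, 11.0, 12.0, 12.0, 13.0, 500.0],
    "label": list("abcdef"),
})

low, high = iqr_bounds(toy["amount"])

# Keep values inside the whiskers; everything else becomes NaN,
# which mirrors what box_outlier does column by column.
masked = toy["amount"].where(toy["amount"].between(low, high))
print(low, high)
print(masked)
```

Here the quartiles are 11.25 and 12.75, so the whiskers sit at 9.0 and 15.0 and only the value 500 is masked.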


### 1. **application_{train|test}.csv**

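First, identify the numeric columns that are not already normalized and log-transform them.  
Along the way, DAYS_EMPLOYED turns out to contain anomalous values; these are replaced with the mean of the non-anomalous values.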

In [100]:
# Exclude normalized and categorical columns

app_tr = pd.read_csv(app_tr_path)
temp=app_tr.describe().T
outlier_possible=temp.loc[(temp["min"]<-0.1) |(temp["max"]>1.1)]
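
`describe()` summarizes only numeric columns by default, so categorical (object) columns drop out on their own; the min/max filter then keeps the columns whose range extends beyond [0, 1]. A minimal sketch on invented toy data (all column names here are hypothetical):

```python
import pandas as pd

# Toy frame (invented): two columns already in [0, 1], two that are not,
# and one categorical column.
toy = pd.DataFrame({
    "EXT_SCORE": [0.1, 0.5, 0.9],
    "FLAG": [0, 1, 0],
    "AMT": [45000.0, 120000.0, 90000.0],
    "DAYS": [-300, -1200, -50],
    "TYPE": ["a", "b", "c"],
})

summary = toy.describe().T  # one row per numeric column

# Same thresholds as in the cell above: anything outside roughly [0, 1]
# is treated as a candidate for the log transform.
candidates = summary.loc[(summary["min"] < -0.1) | (summary["max"] > 1.1)].index.tolist()
print(candidates)
```

Only `AMT` and `DAYS` are selected: `EXT_SCORE` and `FLAG` stay within [0, 1], and `TYPE` never appears in the summary.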

In [101]:
# Evidence of the anomalous DAYS_EMPLOYED values

app_tr["DAYS_EMPLOYED"].loc[app_tr["DAYS_EMPLOYED"]>0]

8         365243
11        365243
23        365243
38        365243
43        365243
           ...  
304489    365243
304503    365243
304507    365243
304525    365243
304527    365243
Name: DAYS_EMPLOYED, Length: 54852, dtype: int64
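
All 54852 positive entries (about 18% of the 304531 rows) carry the same value, 365243, which corresponds to roughly 1000 years and so is presumably a placeholder rather than a real employment length. The check can be sketched on invented toy data as follows:

```python
import pandas as pd

# Toy column (invented values): valid entries are negative day counts,
# the placeholder is the single positive value 365243.
days = pd.Series([-1200, 365243, -300, 365243, -50], name="DAYS_EMPLOYED")

positive = days[days > 0]
distinct_positive = positive.unique().tolist()  # which positive values occur
share = len(positive) / len(days)               # fraction of rows affected
years = 365243 / 365.25                         # placeholder expressed in years

print(distinct_positive, share, round(years, 2))
```

A single distinct positive value, as in the real column, is a strong hint that it encodes "not applicable" rather than a measurement.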

In [102]:
# Mean replacement and log transform

# Replace the positive placeholder with the mean of the valid (negative) values
mean_num = app_tr.loc[app_tr["DAYS_EMPLOYED"] < 0, "DAYS_EMPLOYED"].mean()
temp = app_tr["DAYS_EMPLOYED"].astype(float)
temp.loc[temp > 0] = mean_num
app_tr["DAYS_EMPLOYED"] = temp

for c in outlier_possible.index.tolist():
    if c != "SK_ID_CURR":
        # log10(|x| + 1): abs() handles the negative DAYS_* columns, +1 maps 0 to 0
        app_tr[c] = np.log10(app_tr[c].abs() + 1)
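
The two steps in this cell can be traced on a four-row toy column (values invented so that the logarithms come out round). This is a sketch of the same idea, not the notebook's exact code:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"DAYS_EMPLOYED": [-99, 365243, -999, 0]})

# Step 1: mean of the valid (negative) entries, here (-99 + -999) / 2 = -549.
valid_mean = toy.loc[toy["DAYS_EMPLOYED"] < 0, "DAYS_EMPLOYED"].mean()

# Replace the positive placeholder with that mean.
filled = toy["DAYS_EMPLOYED"].astype(float).mask(toy["DAYS_EMPLOYED"] > 0, valid_mean)

# Step 2: log10(|x| + 1). The absolute value removes the sign of the
# day counts, and the +1 keeps zero finite (log10(1) = 0).
logged = np.log10(filled.abs() + 1)
print(logged)
```

The transformed values are 2, log10(550), 3 and 0, so a spread of three orders of magnitude collapses into a range of about 3.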

In [103]:
# Inspect the result

temp=app_tr.describe().T
temp.loc[(temp["min"]<-0.1) |(temp["max"]>1.1)]

| Column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 304531.0 | 278171.5588 | 102782.532925 | 100002.0 | 189138.5 | 278193.0 | 367136.0 | 456255.0 |
| CNT_CHILDREN | 304531.0 | 0.109944 | 0.176316 | 0.0 | 0.0 | 0.0 | 0.30103 | 1.30103 |
| AMT_INCOME_TOTAL | 304531.0 | 5.171963 | 0.212054 | 4.409104 | 5.051156 | 5.169089 | 5.306427 | 8.068186 |
| AMT_CREDIT | 304531.0 | 5.676802 | 0.310566 | 4.653222 | 5.431365 | 5.713715 | 5.907761 | 6.607455 |
| AMT_ANNUITY | 304531.0 | 4.373182 | 0.236672 | 3.208576 | 4.21944 | 4.396896 | 4.539603 | 5.411664 |
| AMT_GOODS_PRICE | 304531.0 | 5.62881 | 0.310793 | 4.607466 | 5.37749 | 5.653213 | 5.83219 | 6.607455 |
| DAYS_BIRTH | 304531.0 | 4.188281 | 0.123494 | 3.874482 | 4.094087 | 4.197446 | 4.294091 | 4.401917 |
| DAYS_EMPLOYED | 304531.0 | 3.209153 | 0.423025 | 0.0 | 2.970812 | 3.346744 | 3.441381 | 4.253168 |
| DAYS_REGISTRATION | 304531.0 | 3.514358 | 0.51244 | 0.0 | 3.303844 | 3.653791 | 3.873844 | 4.392222 |
| DAYS_ID_PUBLISH | 304531.0 | 3.365914 | 0.4042 | 0.0 | 3.236033 | 3.512684 | 3.633468 | 3.857212 |


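After the transform the columns are on a fairly consistent scale, which completes outlier handling for the main table.  
Next, the same changes are applied to the test table.  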

In [104]:
app_te = pd.read_csv(app_te_path)

# Fill the placeholder with the mean computed on the training table
temp = app_te["DAYS_EMPLOYED"].astype(float)
temp.loc[temp > 0] = mean_num
app_te["DAYS_EMPLOYED"] = temp

for c in outlier_possible.index.tolist():
    if c != "SK_ID_CURR":
        # NOTE: the offset here is 1.1, while the training table uses 1, so the two
        # tables are not on exactly the same scale (the minima below are
        # log10(1.1) ≈ 0.041 rather than 0). Use 1 here if both must match.
        app_te[c] = np.log10(app_te[c].abs() + 1.1)

temp = app_te.describe().T
temp.loc[(temp["min"] < -0.1) | (temp["max"] > 1.1)]

| Column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 47772.0 | 277771.337457 | 103174.096797 | 100001.0 | 188557.75 | 277582.0 | 367549.5 | 456250.0 |
| CNT_CHILDREN | 47772.0 | 0.139997 | 0.162604 | 0.041393 | 0.041393 | 0.041393 | 0.322219 | 1.324282 |
| AMT_INCOME_TOTAL | 47772.0 | 5.200183 | 0.206901 | 4.431381 | 5.051157 | 5.197284 | 5.352185 | 6.644439 |
| AMT_CREDIT | 47772.0 | 5.611855 | 0.307746 | 4.653223 | 5.416043 | 5.653214 | 5.829304 | 6.351313 |
| AMT_ANNUITY | 47772.0 | 4.40618 | 0.240909 | 3.360991 | 4.254647 | 4.418564 | 4.572774 | 5.256663 |
| AMT_GOODS_PRICE | 47772.0 | 5.560668 | 0.310714 | 4.653223 | 5.352185 | 5.602604 | 5.799341 | 6.351313 |
| DAYS_BIRTH | 47772.0 | 4.189168 | 0.122095 | 3.865643 | 4.096705 | 4.198082 | 4.292635 | 4.401333 |
| DAYS_EMPLOYED | 47772.0 | 3.236904 | 0.40429 | 0.322219 | 3.019988 | 3.377864 | 3.464504 | 4.242146 |
| DAYS_REGISTRATION | 47772.0 | 3.51401 | 0.500885 | 0.041393 | 3.279462 | 3.652643 | 3.873108 | 4.375171 |
| DAYS_ID_PUBLISH | 47772.0 | 3.373396 | 0.400338 | 0.041393 | 3.232004 | 3.509216 | 3.648175 | 3.802712 |
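
The test table has to go through the same steps as the training table, with the fill value taken from the training data. One way to keep the two in step is to wrap the transform in a single function and call it with identical parameters. This is a sketch only: the function name `log_transform`, its arguments, and the toy frames are hypothetical and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd

def log_transform(df, columns, fill_value, offset=1.0):
    """Fill the DAYS_EMPLOYED placeholder, then apply log10(|x| + offset)."""
    out = df.copy()
    employed = out["DAYS_EMPLOYED"].astype(float)
    # Positive values are the placeholder; replace them with fill_value.
    out["DAYS_EMPLOYED"] = employed.mask(employed > 0, fill_value)
    for col in columns:
        out[col] = np.log10(out[col].abs() + offset)
    return out

# Toy frames (invented values).
train = pd.DataFrame({"DAYS_EMPLOYED": [-9, 365243, -99], "AMT": [9.0, 99.0, 999.0]})
test = pd.DataFrame({"DAYS_EMPLOYED": [365243, -9], "AMT": [99.0, 9.0]})
cols = ["DAYS_EMPLOYED", "AMT"]

# The fill value comes from the training data only and is reused for the test data.
fill = train.loc[train["DAYS_EMPLOYED"] < 0, "DAYS_EMPLOYED"].mean()

train_out = log_transform(train, cols, fill)
test_out = log_transform(test, cols, fill)
print(train_out)
print(test_out)
```

Because both calls share `fill` and `offset`, equal raw values in the two tables map to equal transformed values.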


In [105]:
# Write the results

app_tr.to_csv(result_dir+"\\application_train.csv")
app_te.to_csv(result_dir+"\\application_test.csv")
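
A note on the export: `to_csv` writes the row index as an extra first column unless `index=False` is passed, and that column comes back as `Unnamed: 0` when the file is read again. Whether later steps expect that column is not stated here, so the sketch below only illustrates the behavior, using an in-memory buffer and invented toy data:

```python
import io
import pandas as pd

df = pd.DataFrame({"SK_ID_CURR": [100002, 100003], "AMT": [5.1, 5.4]})

# Default export: the index is written as an unnamed first column.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# Export without the index: the columns round-trip unchanged.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))
print(list(without_index.columns))
```

If the extra column is unwanted downstream, passing `index=False` when writing, or `index_col=0` when reading, avoids it.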