
# **Data Mining: Home Credit Default Risk**

Authors: 李林 (3120220938), 杨洋 (3220211141), 敬甲男 (3220221052), 李翰杰 (3120220936)

GitHub repository: https://github.com/leealim/kaggle-Home-Credit-Default-Risk

---

## Data Preprocessing: Outlier Handling

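There are eight tables in total, processed one at a time:
- application_{train|test}.csv: client loan applications
- bureau.csv/bureau_balance.csv: the client's historical loan records
- POS_CASH_balance.csv: the client's POS and cash loan history
- credit_card_balance.csv: snapshot history of the client's credit cards
- previous_application.csv: the client's previous applications
- installments_payments.csv: the client's repayment history for previous loans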

---


In [90]:
# Import the packages needed in this section and define the shared paths and helper function

import pandas as pd
import numpy as np
import os
import math 

source_dir="..\\data\\miss_value_handling"
result_dir="..\\data\\outlier_handling"

app_tr_path = source_dir+"\\application_train.csv"
app_te_path = source_dir+"\\application_test.csv"
bur_path = source_dir+"\\bureau.csv"
bur_bal_path = source_dir+"\\bureau_balance.csv"
pos_path = source_dir+"\\POS_CASH_balance.csv"
cre_path = source_dir+"\\credit_card_balance.csv"
pre_path = source_dir+"\\previous_application.csv"
ins_path = source_dir+"\\installments_payments.csv"
hom_path = "..\\data\\home-credit-default-risk\\HomeCredit_columns_description.csv"  # column description table
hom = pd.read_csv(hom_path)

if not os.path.exists(result_dir):
    os.makedirs(result_dir)

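def box_outlier(data, q1_, q3_):
    """Return a copy of `data` with box-plot outliers in numeric columns set to NaN."""
    df = data.copy(deep=True)
    for col in df.select_dtypes(exclude='object').columns:  # apply the box-plot rule to each numeric column
        Q1 = df[col].quantile(q=q1_)        # lower quantile (the first quartile when q1_ = 0.25)
        Q3 = df[col].quantile(q=q3_)        # upper quantile (the third quartile when q3_ = 0.75)
        low_whisker = Q1 - 1.5 * (Q3 - Q1)  # lower whisker
        up_whisker = Q3 + 1.5 * (Q3 - Q1)   # upper whisker

        # Find the outliers, get their index labels, and set those cells to NaN
        rule = (df[col] > up_whisker) | (df[col] < low_whisker)
        out = df[col].index[rule]
        df.loc[out, col] = np.nan
    return df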
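
Written out, the rule that `box_outlier` applies to each numeric column is the following, where $Q_a$ denotes the column's empirical quantile at level $a$ (this notation is introduced here only for illustration):

$$
\begin{aligned}
\text{low\_whisker} &= Q_{q_1} - 1.5\,(Q_{q_3} - Q_{q_1}),\\
\text{up\_whisker} &= Q_{q_3} + 1.5\,(Q_{q_3} - Q_{q_1}).
\end{aligned}
$$

A value $x$ is set to NaN when $x < \text{low\_whisker}$ or $x > \text{up\_whisker}$. With $q_1 = 0.25$ and $q_3 = 0.75$ this is the usual box-plot rule. Moving $q_1$ toward 0 and $q_3$ toward 1 can only push the whiskers outward, so fewer values are flagged.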


### 1. **application_{train|test}.csv**

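During processing, anomalous values were found in DAYS_EMPLOYED; they are filled with the mean of the non-anomalous values. Outliers in the other columns are handled with the box-plot rule. With the usual quartiles this would remove about two thirds of the rows, so to avoid discarding that much data the Q1 and Q3 quantiles are relaxed (the values are configurable).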
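
To illustrate why relaxing the quantiles matters, here is a minimal sketch on synthetic, right-skewed data. The helper `flag_outlier_rows` and the toy frame are assumptions made for this example; they are not part of the notebook, and the percentages it prints do not describe the Home Credit data.

```python
import numpy as np
import pandas as pd

def flag_outlier_rows(df, q_low, q_high):
    """Boolean Series: True where a row has at least one value outside the whiskers."""
    lo = df.quantile(q_low)    # per-column lower quantile
    hi = df.quantile(q_high)   # per-column upper quantile
    spread = hi - lo
    outside = (df < lo - 1.5 * spread) | (df > hi + 1.5 * spread)
    return outside.any(axis=1)

# Synthetic skewed data: 10,000 rows, 20 log-normal columns (made up for illustration)
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.lognormal(mean=0.0, sigma=1.0, size=(10_000, 20)),
    columns=[f"x{i}" for i in range(20)],
)

standard = flag_outlier_rows(toy, 0.25, 0.75)  # usual quartiles
relaxed = flag_outlier_rows(toy, 0.01, 0.99)   # relaxed quantiles, as used in this section

print(f"standard quartiles flag {standard.mean():.1%} of rows")
print(f"relaxed quantiles flag {relaxed.mean():.1%} of rows")
```

Because a row is flagged as soon as any one of its columns has an outlier, the share of flagged rows grows quickly with the number of skewed columns. This is one plausible reason the standard quartiles remove so many rows of a wide table.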

In [91]:
app_tr = pd.read_csv(app_tr_path)
app_tr.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 304531.0 | 278171.558800 | 102782.532925 | 100002.0 | 189138.5 | 278193.0 | 367136.0 | 456255.0 |
| TARGET | 304531.0 | 0.081000 | 0.272836 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| CNT_CHILDREN | 304531.0 | 0.417140 | 0.722308 | 0.0 | 0.0 | 0.0 | 1.0 | 19.0 |
| AMT_INCOME_TOTAL | 304531.0 | 168663.446314 | 237890.994045 | 25650.0 | 112500.0 | 147600.0 | 202500.0 | 117000000.0 |
| AMT_CREDIT | 304531.0 | 599559.238330 | 402145.313895 | 45000.0 | 270000.0 | 517266.0 | 808650.0 | 4050000.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| AMT_REQ_CREDIT_BUREAU_WEEK | 304531.0 | 0.029829 | 0.190669 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 |
| AMT_REQ_CREDIT_BUREAU_MON | 304531.0 | 0.231635 | 0.856216 | 0.0 | 0.0 | 0.0 | 0.0 | 27.0 |
| AMT_REQ_CREDIT_BUREAU_QRT | 304531.0 | 0.230203 | 0.745648 | 0.0 | 0.0 | 0.0 | 0.0 | 261.0 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 304531.0 | 1.648630 | 1.856842 | 0.0 | 0.0 | 1.0 | 3.0 | 25.0 |


In [92]:
# Evidence of the DAYS_EMPLOYED anomaly: list the rows with a positive value

app_tr["DAYS_EMPLOYED"].loc[app_tr["DAYS_EMPLOYED"]>0]

8         365243
11        365243
23        365243
38        365243
43        365243
           ...  
304489    365243
304503    365243
304507    365243
304525    365243
304527    365243
Name: DAYS_EMPLOYED, Length: 54852, dtype: int64
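
The mean fill described for DAYS_EMPLOYED is not shown in the cells of this section. A minimal sketch of one way to do it follows, assuming (as in the filter above) that every positive value counts as anomalous. The toy frame and its numbers are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the DAYS_EMPLOYED column (values are made up, not taken from the dataset)
toy = pd.DataFrame({"DAYS_EMPLOYED": [-1200, 365243, -300, 365243, -4500]})

# Assumption: positive day counts are the anomalous entries
anomalous = toy["DAYS_EMPLOYED"] > 0

# Mean of the non-anomalous part only
mean_valid = toy.loc[~anomalous, "DAYS_EMPLOYED"].mean()

# Cast to float first so a fractional mean can be stored without dtype issues
toy["DAYS_EMPLOYED"] = toy["DAYS_EMPLOYED"].astype(float)
toy.loc[anomalous, "DAYS_EMPLOYED"] = mean_valid

print(toy)
```

Computing the mean before overwriting keeps the placeholder value out of the average. Otherwise a value as large as 365243 would dominate it.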

In [93]:
# Box-plot rule: find the rows with at least one outlier (the first two columns are skipped)

temp = box_outlier(app_tr.iloc[:, 2:], 0.01, 0.99).isnull().any(axis=1)
temp = temp.loc[temp]
temp

6         True
26        True
50        True
70        True
71        True
          ... 
304283    True
304324    True
304342    True
304379    True
304528    True
Length: 13910, dtype: bool

In [94]:
# Drop the flagged rows, then re-run the check on the remaining data

app_tr.drop(index=temp.index, inplace=True)
temp = box_outlier(app_tr.iloc[:, 2:], 0.01, 0.99).isnull().any(axis=1)
temp = temp.loc[temp]
temp

614       True
1397      True
1399      True
1497      True
1641      True
          ... 
299346    True
299600    True
301945    True
303769    True
304152    True
Length: 484, dtype: bool
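
One likely reason the second check still flags rows is that the quantiles, and therefore the whiskers, are recomputed on the reduced data, so values that were inside the bounds before can fall outside them afterwards. The sketch below reproduces this effect on a made-up series. The helper `outlier_mask` and the repeated-pass loop are illustrative assumptions, not something the notebook does.

```python
import pandas as pd

def outlier_mask(s, q_low=0.25, q_high=0.75):
    """True where a value lies outside the 1.5 * spread whiskers of `s`."""
    lo, hi = s.quantile(q_low), s.quantile(q_high)
    spread = hi - lo
    return (s < lo - 1.5 * spread) | (s > hi + 1.5 * spread)

# Made-up values: one extreme point hides several milder ones
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 30, 50, 60, 1000], dtype=float)

flagged_per_pass = []
for _ in range(10):              # hard cap so the loop always terminates
    mask = outlier_mask(s)
    if not mask.any():
        break
    flagged_per_pass.append(int(mask.sum()))
    s = s[~mask]                 # drop flagged values; the whiskers are recomputed next pass

print(flagged_per_pass)
print(s.tolist())
```

On this toy series the successive passes flag 1, 2, and 1 values before the rule stops firing. Whether to stop after a single pass, as this section does, or to iterate is a design choice: iterating removes more data with each pass.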

In [95]:
# Write the results; the test set is copied through without outlier removal

app_tr.to_csv(result_dir+"\\application_train.csv",index=False)
app_te = pd.read_csv(app_te_path)
app_te.to_csv(result_dir+"\\application_test.csv",index=False)


### 2. **previous_application.csv**

In [96]:
pre = pd.read_csv(pre_path)

In [97]:
# Save the result

pre.to_csv(result_dir+"\\previous_application.csv",index=False)