<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#特征选择-(判定贷款用户是否逾期)" data-toc-modified-id="特征选择-(判定贷款用户是否逾期)-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>特征选择 (判定贷款用户是否逾期)</a></span><ul class="toc-item"><li><span><a href="#IV值进行特征选择" data-toc-modified-id="IV值进行特征选择-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>IV值进行特征选择</a></span></li><li><span><a href="#随机森林挑选特征" data-toc-modified-id="随机森林挑选特征-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>随机森林挑选特征</a></span><ul class="toc-item"><li><span><a href="#平均不纯度减少-mean-decrease-impurity" data-toc-modified-id="平均不纯度减少-mean-decrease-impurity-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>平均不纯度减少 mean decrease impurity</a></span></li><li><span><a href="#平均精确率减少-Mean-decrease-accuracy" data-toc-modified-id="平均精确率减少-Mean-decrease-accuracy-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>平均精确率减少 Mean decrease accuracy</a></span></li></ul></li><li><span><a href="#综合挑选特征" data-toc-modified-id="综合挑选特征-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>综合挑选特征</a></span></li></ul></li><li><span><a href="#模型选择与模型评估" data-toc-modified-id="模型选择与模型评估-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>模型选择与模型评估</a></span></li><li><span><a href="#遇到的问题" data-toc-modified-id="遇到的问题-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>遇到的问题</a></span></li><li><span><a href="#Reference" data-toc-modified-id="Reference-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Reference</a></span></li></ul></div>

# 特征选择 (判定贷款用户是否逾期)

给定金融数据，预测贷款用户是否会逾期。
（status是标签：0表示未逾期，1表示逾期。）

**Task8（特征工程2 - 特征选择）** - 分别用IV值和随机森林挑选特征，再构建模型，进行模型评估

In [1]:
import pickle
import pandas as pd
from sklearn.model_selection import train_test_split

# 导入数据
data = pd.read_csv('data.csv')
data.drop_duplicates(inplace=True)

# 载入特征
with open('feature.pkl', 'rb') as f:
    X = pickle.load(f)

# 提取标签
y = data.status

# 划分训练集测试集
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=2333)

In [2]:
# 性能评估
from sklearn.metrics import accuracy_score, roc_auc_score

def model_metrics(clf, X_train, X_test, y_train, y_test):
    # 预测
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    
    y_train_proba = clf.predict_proba(X_train)[:,1]
    y_test_proba = clf.predict_proba(X_test)[:,1]
    
    # 准确率
    print('[准确率]', end = ' ')
    print('训练集：', '%.4f'%accuracy_score(y_train, y_train_pred), end = ' ')
    print('测试集：', '%.4f'%accuracy_score(y_test, y_test_pred))
    
    # auc取值：用roc_auc_score或auc
    print('[auc值]', end = ' ')
    print('训练集：', '%.4f'%roc_auc_score(y_train, y_train_proba), end = ' ')
    print('测试集：', '%.4f'%roc_auc_score(y_test, y_test_proba))

## IV值进行特征选择

stats.scoreatpercentile(x, 50)    # 得到x在50%处的数值

np.in1d(B,A)    # 在序列B中寻找与序列A相同的值，并返回一逻辑值（True,False）

In [3]:
import math
import numpy as np
from scipy import stats
from sklearn.utils.multiclass import type_of_target

def woe(X, y, event=1):  
    res_woe = []
    iv_dict = {}
    for feature in X.columns:
        x = X[feature].values
        # 1) 连续特征离散化
        if type_of_target(x) == 'continuous':
            x = discrete(x)
        # 2) 计算该特征的woe和iv
        # woe_dict, iv = woe_single_x(x, y, feature, event)
        woe_dict, iv = woe_single_x(x, y, feature, event)
        iv_dict[feature] = iv
        res_woe.append(woe_dict) 
        
    return iv_dict
        
def discrete(x):
    # 使用5等分离散化特征
    res = np.zeros(x.shape)
    for i in range(5):
        point1 = stats.scoreatpercentile(x, i * 20)
        point2 = stats.scoreatpercentile(x, (i + 1) * 20)
        x1 = x[np.where((x >= point1) & (x <= point2))]
        mask = np.in1d(x, x1)
        res[mask] = i + 1    # 将[i, i+1]块内的值标记成i+1
    return res

def woe_single_x(x, y, feature,event = 1):
    # event代表预测正例的标签
    event_total = sum(y == event)
    non_event_total = y.shape[-1] - event_total
    
    iv = 0
    woe_dict = {}
    for x1 in set(x):    # 遍历各个块
        y1 = y.reindex(np.where(x == x1)[0])
        event_count = sum(y1 == event)
        non_event_count = y1.shape[-1] - event_count
        rate_event = event_count / event_total    
        rate_non_event = non_event_count / non_event_total
        
        if rate_event == 0:
            rate_event = 0.0001
            # woei = -20
        elif rate_non_event == 0:
            rate_non_event = 0.0001
            # woei = 20
        woei = math.log(rate_event / rate_non_event)
        woe_dict[x1] = woei
        iv += (rate_event - rate_non_event) * woei
    return woe_dict, iv

处理上述特征时, 遇到了IV的极端情况, 响应数为0或未响应数为0。

为简单起见, 我们在代码中对极端值进行平滑处理。

In [4]:
import warnings
warnings.filterwarnings("ignore")

iv_dict = woe(X_train, y_train)

In [5]:
iv = sorted(iv_dict.items(), key = lambda x:x[1],reverse = True)
iv

[('historical_trans_amount', 2.6975301004625365),
 ('trans_amount_3_month', 2.5633548887586746),
 ('pawns_auctions_trusts_consume_last_6_month', 2.343990314630991),
 ('repayment_capability', 2.31685232254565),
 ('first_transaction_day', 2.10946672748192),
 ('abs', 2.048054369415617),
 ('consfin_avg_limit', 1.8005797778063934),
 ('consume_mini_time_last_1_month', 1.4570522032774857),
 ('loans_avg_limit', 1.3508993179510962),
 ('max_cumulative_consume_later_1_month', 1.2961861663340406),
 ('historical_trans_day', 1.0794587869439352),
 ('pawns_auctions_trusts_consume_last_1_month', 0.9637730486540506),
 ('consfin_credit_limit', 0.829726960824839),
 ('loans_score', 0.8035125155540374),
 ('loans_latest_day', 0.7177168342745962),
 ('avg_price_last_12_month', 0.6395438326722515),
 ('history_suc_fee', 0.6322293100618446),
 ('apply_score', 0.5592084043426475),
 ('latest_query_day', 0.5017485264222311),
 ('consfin_max_limit', 0.483273473979316),
 ('loans_long_time', 0.4592776814323623),
 ('take_

## 随机森林挑选特征

In [6]:
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

In [7]:
# 观察默认参数的性能
rf0 = RandomForestClassifier(oob_score=True, random_state=2333)
rf0.fit(X_train, y_train)
print('袋外分数：', rf0.oob_score_)
model_metrics(rf0, X_train, X_test, y_train, y_test)

袋外分数： 0.7342951608055305
[准确率] 训练集： 0.9805 测试集： 0.7744
[auc值] 训练集： 0.9996 测试集： 0.7289


In [None]:
# 网格法调参, 步骤省略...
param_test = {'n_estimators':range(20,200,20)}
# param_test = {'max_depth':range(3,14,2), 'min_samples_split':range(50,201,20)}
# param_test = {'min_samples_split':range(10,100,20), 'min_samples_leaf':range(10,60,10)}
# param_test = {'max_features':range(3,17,2)}
gsearch = GridSearchCV(estimator = RandomForestClassifier(n_estimators=120, max_depth=9, min_samples_split=50, 
                                                          min_samples_leaf=20, max_features = 9,random_state=2333), 
                       param_grid = param_test, scoring='roc_auc', cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_

In [9]:
rf = RandomForestClassifier(n_estimators=120, max_depth=9, min_samples_split=50,
                            min_samples_leaf=20, max_features = 9,oob_score=True, random_state=2333)
rf.fit(X_train, y_train)
print('袋外分数：', rf.oob_score_)
model_metrics(rf, X_train, X_test, y_train, y_test)

袋外分数： 0.7844905320108205
[准确率] 训练集： 0.8115 测试集： 0.7954
[auc值] 训练集： 0.8946 测试集： 0.7914


### 平均不纯度减少 mean decrease impurity

> 下面代码是基于Gini指数

In [10]:
rf.fit(X_train, y_train)
feature_impotance1 = sorted(zip(map(lambda x: '%.4f'%x, rf.feature_importances_), list(X_train.columns)), reverse=True)

In [11]:
feature_impotance1[:10]

[('0.1333', 'trans_fail_top_count_enum_last_1_month'),
 ('0.0818', 'loans_score'),
 ('0.0784', 'history_fail_fee'),
 ('0.0623', 'apply_score'),
 ('0.0580', 'latest_one_month_fail'),
 ('0.0424', 'loans_overdue_count'),
 ('0.0307', 'trans_fail_top_count_enum_last_12_month'),
 ('0.0237', 'trans_fail_top_count_enum_last_6_month'),
 ('0.0194', 'trans_day_last_12_month'),
 ('0.0184', 'max_cumulative_consume_later_1_month')]

### 平均精确率减少 Mean decrease accuracy

In [12]:
import numpy as np
from collections import defaultdict
from sklearn.model_selection import cross_val_score, ShuffleSplit

scores = defaultdict(list)
rs = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
for train_idx, test_idx in rs.split(X_train):
    x_train, x_test = X_train.values[train_idx], X_train.values[test_idx]
    Y_train, Y_test = y_train.values[train_idx], y_train.values[test_idx]
    r = rf.fit(x_train, Y_train)
    acc = accuracy_score(Y_test, rf.predict(x_test))
    for i in range(x_train.shape[1]):
        X_t = x_test.copy()
        np.random.shuffle(X_t[:, i])
        shuff_acc = accuracy_score(Y_test, rf.predict(X_t))
        scores[X_train.columns[i]].append((acc - shuff_acc) / acc)
        
feature_impotance2=sorted([('%.4f'%np.mean(score), feat) for feat, score in scores.items()], reverse=True)

In [13]:
feature_impotance2[:10]

[('0.0163', 'history_fail_fee'),
 ('0.0153', 'trans_fail_top_count_enum_last_1_month'),
 ('0.0120', 'loans_score'),
 ('0.0097', 'latest_one_month_fail'),
 ('0.0097', 'apply_score'),
 ('0.0062', 'loans_overdue_count'),
 ('0.0046', 'trans_fail_top_count_enum_last_12_month'),
 ('0.0041', 'trans_fail_top_count_enum_last_6_month'),
 ('0.0036', 'latest_one_month_suc'),
 ('0.0025', 'avg_price_last_12_month')]

## 综合挑选特征

In [15]:
feature_impotance1[50], feature_impotance2[40]

(('0.0051', 'consfin_product_count'), ('0.0003', 'loans_avg_limit'))

In [16]:
useless = []
for feature in X_train.columns:
    if feature in [t[1] for t in feature_impotance1[50:]] and feature in [t[1] for t in feature_impotance2[40:]]:
        useless.append(feature)
        print(feature, iv_dict[feature])

regional_mobility 0.20654433409120623
is_high_user 0.19615128275454694
avg_consume_less_12_valid_month 0.22239702810015521
reg_preference_for_trad 0.2177870321526657
first_transaction_time_month 0.21771183020755758
first_transaction_time_weekday 0.22040179517292707
latest_query_time_year 0.19785800765281902
latest_query_time_month 0.22281703262580477
latest_query_time_weekday 0.22341492175897784
loans_latest_time_year 0.19963733017168203
loans_latest_time_month 0.23536009610676906
loans_latest_time_weekday 0.21550477980307856
transd_mcc 0.2800425079015167
consume_top_time_last_6_month 0.349022605284868
max_consume_count_later_6_month 0.25742951507849315
railway_consume_count_last_12_month 0.2003683041664725
jewelry_consume_count_last_6_month 0.21790403970323896
query_org_count 0.284857522455637
query_finance_count 0.24719289152922377
query_cash_count 0.23553151956888524
query_sum_count 0.2913852759631131
latest_one_month_apply 0.2764096502289849
latest_six_month_apply 0.316751202587778

In [17]:
len(useless)

33

In [18]:
X_train.drop(useless, axis = 1, inplace = True)
X_test.drop(useless, axis = 1, inplace = True)

In [19]:
rf = RandomForestClassifier(n_estimators=120, max_depth=9, min_samples_split=50,
                            min_samples_leaf=20, max_features = 9,oob_score=True, random_state=2333)
rf.fit(X_train, y_train)
print('袋外分数：', rf.oob_score_)
model_metrics(rf, X_train, X_test, y_train, y_test)

袋外分数： 0.789600240456868
[准确率] 训练集： 0.8197 测试集： 0.7954
[auc值] 训练集： 0.8929 测试集： 0.7913


# 模型选择与模型评估

调参过程略, 参见"Finance3 - ModelAdjustPara.ipynb"

In [20]:
from sklearn.preprocessing import StandardScaler

# 特征归一化
std = StandardScaler()
X_train = std.fit_transform(X_train.values)
X_test = std.transform(X_test.values)

In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier
from mlxtend.classifier import StackingClassifier

In [105]:
lr = LogisticRegression(C = 0.1, penalty = 'l1')
svm_linear = svm.SVC(C = 0.01, kernel = 'linear', probability=True)
svm_poly =  svm.SVC(C = 0.01, kernel = 'poly', probability=True)
svm_rbf =  svm.SVC(gamma = 0.01, C =0.01 , probability=True)
svm_sigmoid =  svm.SVC(C = 0.01, kernel = 'sigmoid',probability=True)
dt = DecisionTreeClassifier(max_depth=5,min_samples_split=50,min_samples_leaf=60, max_features=9, random_state =2333)
xgb = XGBClassifier(learning_rate =0.1, n_estimators=80, max_depth=3, min_child_weight=5, 
                    gamma=0.2, subsample=0.8, colsample_bytree=0.8, reg_alpha=1e-5, 
                    objective= 'binary:logistic', nthread=4,scale_pos_weight=1, seed=27)
lgb = LGBMClassifier(learning_rate =0.1, n_estimators=100, max_depth=3, min_child_weight=11, 
                    gamma=0.1, subsample=0.5, colsample_bytree=0.9, reg_alpha=1e-5, 
                    nthread=4,scale_pos_weight=1, seed=27)

In [106]:
sclf = StackingClassifier(classifiers=[svm_linear, svm_poly, svm_rbf, svm_sigmoid, dt, xgb, lgb], 
                            meta_classifier=lr, use_probas=True,average_probas=False)

In [107]:
sclf.fit(X_train, y_train.values)
model_metrics(sclf, X_train, X_test, y_train, y_test)

[准确率] 训练集： 0.8563 测试集： 0.8017
[auc值] 训练集： 0.9061 测试集： 0.7875


# 遇到的问题

>1）求IV值遇到极端值时怎么处理?
将WOE标记为0/无穷或平滑处理，对IV值有较大大影响。已经无法从0.2—0.5的取值来删除特征了（除了可疑预测，其余都在0.2—0.5之间）。

>2）虽然已经求得IV值或feature_importance，但不知道是不是取值不合常规就一定要删除该特征。
>若一个特征一个特征删除后对比性能，进行验证的话，还要重新调参（很麻烦...）

# Reference

1）[结合Scikit-learn介绍几种常用的特征选择方法](https://blog.csdn.net/Bryan__/article/details/51607215)

2）[IV值的计算及使用](https://www.jianshu.com/p/cc4724a373f8)

3）[Information Value (IV) and Weight of Evidence (WOE) – A Case Study from Banking (Part 4)](http://ucanalytics.com/blogs/information-value-and-weight-of-evidencebanking-case/)

4）[计算IV值的代码](https://github.com/patrick201/information_value/blob/master/src/information_value.py)

5）[详细 - 数据挖掘模型中的IV和WOE详解](https://blog.csdn.net/kevin7658/article/details/50780391)