# Financial Credit Risk Control Models
## Post-Loan Collection Model
> The case data is post-loan data from a US P2P lending company.

### 1. Import the required modules

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from time import strptime,mktime
from datetime import datetime
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn import metrics  # sklearn.cross_validation was removed; sklearn.model_selection replaces it
%matplotlib inline

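### 2. Define the data transformation helpers
* The first function converts the recovery rate to its logit form
* The second function replaces missing values of categorical variables with 'Unknown'
* The third function fills missing values of continuous variables with a predefined replacement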

In [4]:
def LogitRR(x):
    '''
    :param x: recovery rate; some values exceed 1 and some are 0, so they are truncated first
    :return: the logit transform of the recovery rate
    '''
    if x >= 1:
        y = 0.9999
    elif x == 0:
        y = 0.0001
    else:
        y = x
    return np.log(y/(1-y))

def MakeupMissingCategorical(x):
    if str(x) == 'nan':
        return 'Unknown'
    else:
        return x

def MakeupMissingNumerical(x,replacement):
    if np.isnan(x):
        return replacement
    else:
        return x
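
The three helpers above work on one value at a time. As a minimal sketch, the same ideas can be written in vectorized form with NumPy and pandas. The names `logit_rr` and `eps` are hypothetical and do not appear in the notebook. Note one difference: `np.clip` also truncates values between 0 and `eps` and negative values, whereas `LogitRR` only handles `x >= 1` and `x == 0`.

```python
import numpy as np
import pandas as pd

def logit_rr(rates, eps=1e-4):
    """Clip rates into [eps, 1 - eps], then apply the logit transform."""
    clipped = np.clip(np.asarray(rates, dtype=float), eps, 1 - eps)
    return np.log(clipped / (1 - clipped))

# Toy recovery rates, including the edge cases 0 and a value above 1.
rates = pd.Series([0.0, 0.25, 0.5, 1.3])
logits = logit_rr(rates)

# Vectorized counterparts of the two missing-value helpers.
cats = pd.Series(['A', np.nan, 'B']).fillna('Unknown')
nums = pd.Series([1.0, np.nan, 3.0]).fillna(0)

print(logits)
print(list(cats), list(nums))
```

The logit maps a rate of 0.5 to 0 and stretches rates near 0 or 1 toward large negative or positive values, which is why the truncation step is needed before taking the logarithm.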

### 3. Load and inspect the data

In [6]:
foldOfData = '/home/hk/Documents/MLinLoan/'
mydata = pd.read_csv(foldOfData + "prosperLoanData_chargedoff.csv",header = 0)

In [7]:
mydata.head()

|  | ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0DEB351032566848818E832 | 496566 | 21:40.3 | NaN | 36 | Chargedoff | 2013/9/15 0:00 | 0.35643 | 0.3199 | 0.3099 | ... | -39.2 | 0.0 | 1031.89 | 1031.89 | 0.0 | 1.0 | 0 | 0 | 0.0 | 50 |
| 1 | 10F83416199324455232962 | 292860 | 04:51.6 | HR | 36 | Chargedoff | 2009/2/22 0:00 | 0.29743 | 0.2745 | 0.2645 | ... | -12.96 | 0.0 | 2293.09 | 2293.09 | 0.0 | 1.0 | 0 | 0 | 0.0 | 67 |
| 2 | 349134175738976507325F1 | 308345 | 33:07.7 | B | 36 | Chargedoff | 2008/11/23 0:00 | 0.15882 | 0.1445 | 0.1345 | ... | -4.13 | 0.0 | 2387.7 | 2387.7 | 0.0 | 1.0 | 0 | 0 | 0.0 | 85 |
| 3 | 30423527854712576FEBDB0 | 532077 | 39:41.2 | NaN | 36 | Chargedoff | 2012/5/18 0:00 | 0.35132 | 0.3149 | 0.3049 | ... | -9.95 | 0.0 | 5793.94 | 5793.94 | 0.0 | 1.0 | 0 | 0 | 0.0 | 2 |
| 4 | 08E134244479824305E38AA | 361026 | 32:23.2 | B | 36 | Chargedoff | 2010/12/10 0:00 | 0.15377 | 0.1395 | 0.1295 | ... | -58.18 | -17.0 | 1608.01 | 1608.05 | 85.0 | 1.0 | 0 | 0 | 0.0 | 92 |


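### 4. Variable preprocessing
> Recovery rate definition: recovery rate = amount recovered through collection / (outstanding principal + interest + collection costs)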

In [8]:
mydata['rec_rate'] = mydata.apply(lambda x: x.LP_NonPrincipalRecoverypayments /(x.AmountDelinquent-x.LP_CollectionFees), axis=1)
# Cap the recovery rate at 1
mydata['rec_rate'] = mydata['rec_rate'].map(lambda x: min(x,1))
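
The row-wise `apply` above can be slow on large frames and does not say what happens when the denominator is zero. The sketch below computes the same ratio with column arithmetic on made-up rows that reuse the notebook's column names. Mapping a zero denominator to `NaN` is an assumption of this sketch, not something the notebook specifies (the original code would turn a positive amount divided by zero into `inf`, which the cap then reduces to 1).

```python
import numpy as np
import pandas as pd

# Toy rows using the same column names as the notebook (values are made up).
loans = pd.DataFrame({
    'LP_NonPrincipalRecoverypayments': [85.0, 0.0, 500.0, 10.0],
    'AmountDelinquent': [1000.0, 250.0, 300.0, 0.0],
    'LP_CollectionFees': [-17.0, 0.0, 0.0, 0.0],
})

# Collection fees are stored as negative numbers, so subtracting adds the cost.
denominator = loans['AmountDelinquent'] - loans['LP_CollectionFees']

# Assumption: rows with a zero denominator become NaN instead of inf.
rate = loans['LP_NonPrincipalRecoverypayments'] / denominator.where(denominator != 0)

# Cap the recovery rate at 1, as in the notebook.
loans['rec_rate'] = rate.clip(upper=1)
print(loans['rec_rate'])
```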

> Data split: 60% training / 40% test

In [13]:
trainData, testData = train_test_split(mydata,test_size=0.4)
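
The split above draws a different random partition on every run, so the scores reported later cannot be reproduced exactly. As a minimal sketch on a toy frame, fixing `random_state` makes the split repeatable, and taking explicit copies is one common way to avoid pandas' `SettingWithCopyWarning` when new columns are added afterwards. The seed value `0` and the names `toy`, `train_part`, and `test_part` are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({'x': range(10), 'rec_rate': [i / 10 for i in range(10)]})

# random_state=0 is an arbitrary seed that makes the split repeatable.
train_part, test_part = train_test_split(toy, test_size=0.4, random_state=0)

# Explicit copies, so later column assignments do not target a slice of `toy`.
train_part, test_part = train_part.copy(), test_part.copy()

train_part['flag'] = 1  # adds a column to the copy only
print(len(train_part), len(test_part))
```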

> Separate the categorical and continuous variables

In [14]:
categoricalFeatures = ['CreditGrade','Term','BorrowerState','Occupation','EmploymentStatus','IsBorrowerHomeowner','CurrentlyInGroup','IncomeVerifiable']

numFeatures = ['BorrowerAPR','BorrowerRate','LenderYield','ProsperRating (numeric)','ProsperScore','ListingCategory (numeric)','EmploymentStatusDuration','CurrentCreditLines',
                'OpenCreditLines','TotalCreditLinespast7years','CreditScoreRangeLower','OpenRevolvingAccounts','OpenRevolvingMonthlyPayment','InquiriesLast6Months','TotalInquiries',
               'CurrentDelinquencies','DelinquenciesLast7Years','PublicRecordsLast10Years','PublicRecordsLast12Months','BankcardUtilization','TradesNeverDelinquent (percentage)',
               'TradesOpenedLast6Months','DebtToIncomeRatio','LoanFirstDefaultedCycleNumber','LoanMonthsSinceOrigination','PercentFunded','Recommendations','InvestmentFromFriendsCount',
               'Investors']

In [17]:
mydata[numFeatures].describe()

|  | BorrowerAPR | BorrowerRate | LenderYield | ProsperRating (numeric) | ProsperScore | ListingCategory (numeric) | EmploymentStatusDuration | CurrentCreditLines | OpenCreditLines | TotalCreditLinespast7years | ... | BankcardUtilization | TradesNeverDelinquent (percentage) | TradesOpenedLast6Months | DebtToIncomeRatio | LoanFirstDefaultedCycleNumber | LoanMonthsSinceOrigination | PercentFunded | Recommendations | InvestmentFromFriendsCount | Investors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3046.0 | 3046.0 | 3046.0 | 1126.0 | 1126.0 | 3046.0 | 3046.0 | 3046.0 | 3046.0 | 3046.0 | ... | 3046.0 | 3046.0 | 3046.0 | 2837.0 | 3046.0 | 3046.0 | 3046.0 | 3046.0 | 3046.0 | 3046.0 |
| mean | 0.278995 | 0.256069 | 0.244253 | 2.480462 | 4.760213 | 2.347997 | 79.839133 | 6.94025 | 6.135916 | 26.387065 | ... | 0.51285 | 0.654301 | 0.931714 | 0.295139 | 16.319764 | 58.279711 | 0.997769 | 0.13329 | 0.056139 | 67.149376 |
| std | 0.074141 | 0.067817 | 0.068459 | 1.31962 | 2.09746 | 3.444284 | 85.77427 | 4.786034 | 4.322763 | 14.431227 | ... | 0.433415 | 0.21484 | 1.346137 | 0.848312 | 8.864026 | 24.489518 | 0.022633 | 0.515725 | 0.282727 | 71.924813 |
| min | 0.02998 | 0.01 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.01 | 1.0 | 7.0 | 0.7013 | 0.0 | 0.0 | 1.0 |
| 25% | 0.223635 | 0.2049 | 0.19 | 1.0 | 3.0 | 0.0 | 19.0 | 3.0 | 3.0 | 16.0 | ... | 0.03 | 0.52 | 0.0 | 0.11 | 9.0 | 31.0 | 1.0 | 0.0 | 0.0 | 20.0 |
| 50% | 0.291035 | 0.26275 | 0.25 | 2.0 | 5.0 | 1.0 | 50.0 | 6.0 | 5.0 | 24.0 | ... | 0.53 | 0.68 | 1.0 | 0.18 | 14.0 | 70.0 | 1.0 | 0.0 | 0.0 | 45.0 |
| 75% | 0.35132 | 0.315225 | 0.305225 | 3.0 | 6.0 | 3.0 | 111.0 | 9.0 | 8.0 | 35.0 | ... | 0.9 | 0.82 | 1.0 | 0.28 | 22.0 | 79.0 | 1.0 | 0.0 | 0.0 | 87.0 |
| max | 0.42395 | 0.36 | 0.34 | 7.0 | 10.0 | 20.0 | 532.0 | 33.0 | 31.0 | 118.0 | ... | 2.68 | 1.0 | 14.0 | 10.01 | 41.0 | 85.0 | 1.0 | 16.0 | 4.0 | 635.0 |


> Encode the categorical variables: code = mean recovery rate of the corresponding category

In [20]:
encodedFeatures = []
encodedDict = {}
for var in categoricalFeatures:
    trainData[var] = trainData[var].map(MakeupMissingCategorical)
    avgTarget = trainData.groupby([var])['rec_rate'].mean()
    avgTarget = avgTarget.to_dict()
    newVar = var + '_encoded'
    newVarSeries = trainData[var].map(avgTarget)
    trainData[newVar] = newVarSeries
    encodedFeatures.append(newVar)
    encodedDict[var] = avgTarget
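
The loop above is a form of mean (target) encoding: each category is replaced by the average `rec_rate` observed for it in the training data. The toy sketch below shows the mechanics on a single made-up column, including how a category that never appeared in training can be given a fallback value. The frames `train` and `new` and the category `'C'` are hypothetical; using the mean of the encoded training column as the fallback is one possible choice.

```python
import pandas as pd

train = pd.DataFrame({
    'CreditGrade': ['A', 'A', 'B', 'B', 'Unknown'],
    'rec_rate': [0.10, 0.30, 0.00, 0.20, 0.50],
})
new = pd.DataFrame({'CreditGrade': ['A', 'B', 'C']})  # 'C' was never seen in training

# Category -> mean target, learned from the training rows only.
mapping = train.groupby('CreditGrade')['rec_rate'].mean().to_dict()
train['CreditGrade_encoded'] = train['CreditGrade'].map(mapping)

# Unseen categories map to NaN; fill them with the mean of the encoded training column.
fallback = train['CreditGrade_encoded'].mean()
new['CreditGrade_encoded'] = new['CreditGrade'].map(mapping).fillna(fallback)

print(mapping)
print(new)
```

Learning the mapping from the training rows only keeps information about held-out targets out of the features.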




In [22]:
trainData[encodedFeatures].head()

|  | CreditGrade_encoded | Term_encoded | BorrowerState_encoded | Occupation_encoded | EmploymentStatus_encoded | IsBorrowerHomeowner_encoded | CurrentlyInGroup_encoded | IncomeVerifiable_encoded |
|---|---|---|---|---|---|---|---|---|
| 2144 | 0.074754 | 0.072245 | 0.049353 | 0.053369 | 0.084543 | 0.06923 | 0.085929 | 0.071543 |
| 695 | 0.074754 | 0.072245 | 0.120636 | 0.067463 | 0.084543 | 0.071725 | 0.063975 | 0.071543 |
| 421 | 0.098356 | 0.072245 | 0.047522 | 0.084872 | 0.084543 | 0.071725 | 0.063975 | 0.071543 |
| 924 | 0.074754 | 0.072245 | 0.047522 | 0.073527 | 0.084543 | 0.06923 | 0.063975 | 0.071543 |
| 742 | 0.065102 | 0.072245 | 0.081897 | 0.099543 | 0.084543 | 0.06923 | 0.085929 | 0.071543 |


> Fill the missing values of the continuous variables

In [23]:
trainData['ProsperRating (numeric)'] = trainData['ProsperRating (numeric)'].map(lambda x: MakeupMissingNumerical(x,0))
trainData['ProsperScore'] = trainData['ProsperScore'].map(lambda x: MakeupMissingNumerical(x,0))

avgDebtToIncomeRatio = np.mean(trainData['DebtToIncomeRatio'])
trainData['DebtToIncomeRatio'] = trainData['DebtToIncomeRatio'].map(lambda x: MakeupMissingNumerical(x,avgDebtToIncomeRatio))



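### 5. Model development
Hyperparameter tuning
> The main hyperparameters to tune for a CART-based random forest are:
> 1. the number of trees
> 2. the maximum tree depth
> 3. the minimum number of samples required at an internal node and at a leaf node
> 4. the number of features considered at each split
>
> In addition, tuning uses the mean squared error (MSE) as the error function, evaluated with 5-fold cross-validation.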

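> Variable selection: add the encoded categorical (string) variables to the training features
> Use grid search to find the best number of trees (`n_estimators`):
1. Search range: {from: 10, to: 80 (exclusive), step: 5}
2. Fixed settings: `min_samples_split` = 50, `min_samples_leaf` = 10, `max_depth` = 8, `max_features` = sqrt(k), scoring = MSE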

In [24]:
numFeatures2 = numFeatures + encodedFeatures
X, y= trainData[numFeatures2],trainData['rec_rate']

param_test1 = {'n_estimators':range(10,80,5)}
gsearch1 = GridSearchCV(estimator = RandomForestRegressor(min_samples_split=50,min_samples_leaf=10,max_depth=8,max_features='sqrt' ,random_state=10),
                       param_grid = param_test1, scoring='neg_mean_squared_error',cv=5)
gsearch1.fit(X,y)
gsearch1.best_params_, gsearch1.best_score_
best_n_estimators = gsearch1.best_params_['n_estimators']

In [27]:
gsearch1.best_params_

{'n_estimators': 70}
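
One detail of the search above is easy to miss: with `scoring='neg_mean_squared_error'`, scikit-learn reports the negated MSE so that a larger score is always better, which means `best_score_` is zero or negative. The sketch below runs the same kind of search on synthetic data (not the Prosper data) and flips the sign to recover the MSE. The names `X_demo`, `y_demo`, `search`, and `best_mse`, and the small grid, are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the loan features; not the Prosper data.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

grid = {'n_estimators': [10, 20, 30]}
search = GridSearchCV(
    estimator=RandomForestRegressor(max_depth=4, random_state=10),
    param_grid=grid,
    scoring='neg_mean_squared_error',  # higher (closer to 0) is better
    cv=5,                              # 5-fold cross-validation
)
search.fit(X_demo, y_demo)

best_mse = -search.best_score_  # flip the sign to recover the mean squared error
print(search.best_params_, best_mse)
```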

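> Use grid search to find the best values of the remaining parameters:
2. maximum depth & minimum samples per split
3. minimum samples per split & minimum samples per leaf
4. maximum number of features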

In [28]:
param_test2 = {'max_depth':range(3,21), 'min_samples_split':range(10,100,10)}
gsearch2 = GridSearchCV(estimator = RandomForestRegressor(n_estimators=best_n_estimators, min_samples_leaf=10,max_features='sqrt' ,random_state=10,oob_score=True),
                       param_grid = param_test2, scoring='neg_mean_squared_error',cv=5)
gsearch2.fit(X,y)
gsearch2.best_params_, gsearch2.best_score_
best_max_depth = gsearch2.best_params_['max_depth']
best_min_sample_split = gsearch2.best_params_['min_samples_split']

param_test3 = {'min_samples_split':range(50,201,10), 'min_samples_leaf':range(1,20,2)}
gsearch3 = GridSearchCV(estimator = RandomForestRegressor(n_estimators=best_n_estimators, max_depth = best_max_depth,max_features='sqrt',random_state=10,oob_score=True),
                       param_grid = param_test3, scoring='neg_mean_squared_error',cv=5)
gsearch3.fit(X,y)
gsearch3.best_params_, gsearch3.best_score_
best_min_samples_leaf = gsearch3.best_params_['min_samples_leaf']
best_min_samples_split = gsearch3.best_params_['min_samples_split']


In [30]:
numOfFeatures = len(numFeatures2)
mostSelectedFeatures = numOfFeatures/2
param_test4 = {'max_features':range(3,numOfFeatures+1)}
gsearch4 = GridSearchCV(estimator = RandomForestRegressor(n_estimators=best_n_estimators, max_depth=best_max_depth,min_samples_leaf=best_min_samples_leaf,
                                                          min_samples_split=best_min_samples_split,random_state=10,oob_score=True),
                       param_grid = param_test4, scoring='neg_mean_squared_error',cv=5)
gsearch4.fit(X,y)
gsearch4.best_params_, gsearch4.best_score_
best_max_features = gsearch4.best_params_['max_features']

In [32]:
print(gsearch2.best_params_)
print(gsearch1.best_params_)
print(gsearch3.best_params_)
print(gsearch4.best_params_)

{'max_depth': 11, 'min_samples_split': 10}
{'n_estimators': 70}
{'min_samples_leaf': 1, 'min_samples_split': 60}
{'max_features': 10}


In [33]:
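print('Best max depth: %d' % best_max_depth)
print('Best number of trees: %d' % best_n_estimators)
print('Min samples per leaf: %d' % best_min_samples_leaf)
print('Min samples per split: %d' % best_min_samples_split)
print('Max features: %.2f' % best_max_features)

Best max depth: 11
Best number of trees: 70
Min samples per leaf: 1
Min samples per split: 60
Max features: 10.00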


> Fit the model with the best parameters found above

In [35]:
cls = RandomForestRegressor(n_estimators=best_n_estimators,
                            max_depth=best_max_depth,
                            min_samples_leaf=best_min_samples_leaf,
                            min_samples_split=best_min_samples_split,
                            max_features=best_max_features,
                            random_state=10,
                            oob_score=True)
cls.fit(X,y)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=11,
           max_features=10, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=60, min_weight_fraction_leaf=0.0,
           n_estimators=70, n_jobs=1, oob_score=True, random_state=10,
           verbose=0, warm_start=False)
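
The model above is fitted with `oob_score=True`, but the out-of-bag estimate is never read back. As a minimal sketch on synthetic data (not the Prosper data), the fitted forest exposes it through the `oob_score_` and `oob_prediction_` attributes. The names `X_demo`, `y_demo`, and `forest` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the loan features; not the Prosper data.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=70, oob_score=True, random_state=10)
forest.fit(X_demo, y_demo)

# For a regressor, oob_score_ is the R^2 computed on out-of-bag predictions.
print(round(forest.oob_score_, 3))
# One out-of-bag prediction per training row.
print(forest.oob_prediction_.shape)
```

Because each tree only predicts rows it did not see during bootstrapping, this score gives a rough validation signal without setting aside extra data.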

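> Output the predictions and compare them with the actual recovery rates
* Print the proportion of predictions that are higher than the actual value
* Print the mean absolute error (MAE) between predicted and actual values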

In [37]:
trainData['pred'] = cls.predict(trainData[numFeatures2])
trainData['less_rr'] = trainData.apply(lambda x: int(x.pred > x.rec_rate), axis=1)
print(np.mean(trainData['less_rr']))
err = trainData.apply(lambda x: np.abs(x.pred - x.rec_rate), axis=1)
print(np.mean(err))



0.8741105637657361
0.0961360763057952
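
The two numbers above are the share of over-predictions and the MAE. The sketch below computes both on made-up arrays without a row-wise `apply`, using `sklearn.metrics.mean_absolute_error`. The arrays `actual` and `predicted` are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up recovery rates and predictions, for illustration only.
actual = np.array([0.00, 0.00, 0.10, 0.50, 0.00])
predicted = np.array([0.05, 0.08, 0.07, 0.20, 0.02])

over_rate = np.mean(predicted > actual)       # share of predictions above the actual rate
mae = mean_absolute_error(actual, predicted)  # mean absolute error
print(over_rate, mae)
```

When many actual rates are exactly zero, any positive prediction counts as an over-prediction, which may partly explain a high over-prediction share alongside a modest MAE.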




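### 6. Model validation
* Print the proportion of predictions that are higher than the actual value
* Print the mean absolute error (MAE) between predicted and actual values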

In [38]:
# Encode the string variables in the test data the same way, reusing the mapping learned on the training data
for var in categoricalFeatures:
    testData[var] = testData[var].map(MakeupMissingCategorical)
    newVar = var + '_encoded'
    testData[newVar] = testData[var].map(encodedDict[var])
    avgnewVar = np.mean(trainData[newVar])
    testData[newVar] = testData[newVar].map(lambda x: MakeupMissingNumerical(x, avgnewVar))

# Fill the missing values of the continuous variables in the test data the same way
testData['ProsperRating (numeric)'] = testData['ProsperRating (numeric)'].map(lambda x: MakeupMissingNumerical(x,0))
testData['ProsperScore'] = testData['ProsperScore'].map(lambda x: MakeupMissingNumerical(x,0))
testData['DebtToIncomeRatio'] = testData['DebtToIncomeRatio'].map(lambda x: MakeupMissingNumerical(x,avgDebtToIncomeRatio))

testData['pred'] = cls.predict(testData[numFeatures2])
testData['less_rr'] = testData.apply(lambda x: int(x.pred > x.rec_rate), axis=1)
print(np.mean(testData['less_rr']))
err = testData.apply(lambda x: np.abs(x.pred - x.rec_rate), axis=1)
print(np.mean(err))


0.8679245283018868
0.1154566325452688


