# `rosaceae` usage example
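The dataset used in this example comes from the Kaggle competition [GiveMeSomeCredit](https://www.kaggle.com/c/GiveMeSomeCredit). That competition is also about assessing borrowers' credit with a credit scorecard; here the data only serves to demonstrate how to use `rosaceae`.
Three files are used:
+ cs-test.csv: test dataset
+ cs-training.csv: training dataset
+ Data Dictionary.xls: description of the variables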

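Only the first 1000 rows of `cs-training.csv` are used here to introduce `rosaceae` (they are read below from `cs-training-1000.csv`). The full data can be downloaded from the competition link above.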
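The 1000-row sample could be produced along the following lines. This is only a sketch: the helper `take_head` is hypothetical, and a tiny in-memory CSV stands in for the real `cs-training.csv`, whose first column is assumed to be an unnamed row id.

```python
import io

import pandas as pd

# Stand-in for cs-training.csv (assumption: the first column is an unnamed row id).
raw_csv = io.StringIO(
    ",SeriousDlqin2yrs,age\n"
    "1,1,45\n"
    "2,0,40\n"
    "3,0,38\n"
    "4,0,30\n"
)


def take_head(source, n_rows):
    """Read only the first n_rows data rows, using the first column as the index."""
    return pd.read_csv(source, index_col=0, nrows=n_rows)


subset = take_head(raw_csv, n_rows=3)
print(subset.shape)  # (3, 2)

# With the real file, something like this would write the sample:
# take_head('cs-training.csv', 1000).to_csv('cs-training-1000.csv')
```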

## Reading the data

In [1]:
import pandas as pd

In [6]:
train = pd.read_csv('cs-training-1000.csv', index_col=0)

In [7]:
train.head()

Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
1,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
2,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
3,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
4,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
5,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0


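In the training sample above, `SeriousDlqin2yrs` is the good/bad label. According to `Data Dictionary.xls`, it indicates whether the borrower has been 90 days past due: 1 means delinquent, 0 means not delinquent. The other columns are the borrower's attributes; see `Data Dictionary.xls` for what each of them means.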
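Before modelling it helps to check how unbalanced the label is. The sketch below uses only the five target values visible in the `train.head()` output; `bad_rate` is a hypothetical helper, not part of `rosaceae`.

```python
import pandas as pd

# Target column of the five rows shown by train.head().
sample = pd.DataFrame({"SeriousDlqin2yrs": [1, 0, 0, 0, 0]})


def bad_rate(df, target="SeriousDlqin2yrs"):
    """Share of rows labelled 1 (delinquent) in the target column."""
    return df[target].mean()


counts = sample["SeriousDlqin2yrs"].value_counts().to_dict()
print(counts)            # {0: 4, 1: 1}
print(bad_rate(sample))  # 0.2
```

On the full 1000-row sample, the verbose output of `woe_iv` further down reports `total_good: 943.0` and `total_bad: 57.0`, which is a bad rate of 5.7%.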

In [4]:
data_info = pd.read_excel('Data Dictionary.xls')

In [5]:
display(data_info)

Unnamed: 0,Variable Name,Description,Type
0,SeriousDlqin2yrs,Person experienced 90 days past due delinquenc...,Y/N
1,RevolvingUtilizationOfUnsecuredLines,Total balance on credit cards and personal lin...,percentage
2,age,Age of borrower in years,integer
3,NumberOfTime30-59DaysPastDueNotWorse,Number of times borrower has been 30-59 days p...,integer
4,DebtRatio,"Monthly debt payments, alimony,living costs di...",percentage
5,MonthlyIncome,Monthly income,real
6,NumberOfOpenCreditLinesAndLoans,Number of Open loans (installment like car loa...,integer
7,NumberOfTimes90DaysLate,Number of times borrower has been 90 days or m...,integer
8,NumberRealEstateLoansOrLines,Number of mortgage and real estate loans inclu...,integer
9,NumberOfTime60-89DaysPastDueNotWorse,Number of times borrower has been 60-89 days p...,integer


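These variables can be roughly grouped into the following categories:

+ **Basic attributes:** age
+ **Assets:** NumberOfOpenCreditLinesAndLoans (open loans and lines of credit), NumberRealEstateLoansOrLines (real estate mortgage loans or home equity lines of credit)
+ **Credit history:** NumberOfTimes90DaysLate (number of times 90 days or more past due), NumberOfTime60-89DaysPastDueNotWorse (number of times 60-89 days past due), NumberOfTime30-59DaysPastDueNotWorse (number of times 30-59 days past due)
+ **Repayment capacity:** MonthlyIncome (monthly income), DebtRatio (debt ratio)
+ **Other factors:** NumberOfDependents (number of dependents)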



## Simple data processing
This section only demonstrates how to use `rosaceae`, so no feature engineering is applied to the data.

In [8]:
train.shape

(1000, 11)

## Computing WOE and IV
WOE and IV are computed with the `woe_iv` function from `rosaceae.scorecard`. It can handle a single variable or several variables in one call and returns the results as a DataFrame.

In [18]:
from rosaceae.scorecard import woe_iv

In [12]:
from rosaceae.bins import bin_tree

In [13]:
b = bin_tree(train['age'],train['SeriousDlqin2yrs'])

In [14]:
b.keys()

dict_keys(['-inf:36.5', '36.5:51.5', '51.5:57.5', '57.5:63.5', '63.5:67.5', '67.5:inf'])

In [25]:
age_df = woe_iv(train, 'SeriousDlqin2yrs', vars=['age'], dt=[0], verbose=True)

Processing on age
total_good: 943.0	total_bad: 57.0

Variable	Bin	Good(%)	Bad(%)	WOE_i	IV_i
age	-inf:36.5	180.0	9.0	0.19088016967126192	0.15789473684210525	-0.18971725875508363	0.0062579058951975465
age	36.5:51.5	276.0	31.0	0.2926829268292683	0.543859649122807	0.6196013535669037	0.1556294371175749
age	51.5:57.5	148.0	3.0	0.1569459172852598	0.05263157894736842	-1.0925849702970982	0.11397227825446651
age	57.5:63.5	125.0	10.0	0.1325556733828208	0.17543859649122806	0.2802863704906518	0.012019498874085174
age	63.5:67.5	56.0	1	0.05938494167550371	0.017543859649122806	-1.2193366759362418	0.05101836587562293
age	67.5:inf	158.0	4.0	0.16755037115588547	0.07017543859649122	-0.8702856571081689	0.08474400716831605


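Note that the `63.5:67.5` bin of `age` actually contains 0 bad users. To avoid a division by zero in the later calculation, the code resets this count to 1.  
The bin label `63.5:67.5` means `x >= 63.5 and x < 67.5`, i.e. `[63.5, 67.5)`; `-inf` and `inf` denote negative and positive infinity.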
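The half-open interval convention can be made concrete with a small sketch. The helpers `parse_bin`, `in_bin` and `find_bin` are hypothetical and written only for illustration; they are not functions of `rosaceae`.

```python
def parse_bin(label):
    """Split a label such as '63.5:67.5' into float bounds (lower, upper)."""
    lower, upper = label.split(":")
    return float(lower), float(upper)  # float() understands '-inf' and 'inf'


def in_bin(x, label):
    """True when lower <= x < upper, i.e. x lies in [lower, upper)."""
    lower, upper = parse_bin(label)
    return lower <= x < upper


def find_bin(x, labels):
    """Return the first label whose interval contains x, or None."""
    for label in labels:
        if in_bin(x, label):
            return label
    return None


age_bins = ['-inf:36.5', '36.5:51.5', '51.5:57.5', '57.5:63.5', '63.5:67.5', '67.5:inf']
print(find_bin(45, age_bins))    # 36.5:51.5
print(find_bin(63.5, age_bins))  # 63.5:67.5 (the lower bound is inclusive)
```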

In [26]:
age_df

Unnamed: 0,Variable,Bin,Good,Bad,pnt_0,pnt_e,WOE,IV_i,IV
0,age,-inf:36.5,180,9,0.19088,0.157895,-0.189717,0.00625791,0.423641
1,age,36.5:51.5,276,31,0.292683,0.54386,0.619601,0.155629,0.423641
2,age,51.5:57.5,148,3,0.156946,0.0526316,-1.09258,0.113972,0.423641
3,age,57.5:63.5,125,10,0.132556,0.175439,0.280286,0.0120195,0.423641
4,age,63.5:67.5,56,1,0.0593849,0.0175439,-1.21934,0.0510184,0.423641
5,age,67.5:inf,158,4,0.16755,0.0701754,-0.870286,0.084744,0.423641
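The numbers in this table are consistent with `WOE = ln(Bad% / Good%)` and `IV_i = (Bad% - Good%) * WOE`, where the percentages are each bin's share of all good or all bad users. The sketch below recomputes them from the raw counts; the formulas are inferred from the output above rather than taken from the `rosaceae` source, and `woe_iv_from_counts` is a hypothetical helper.

```python
import math

# Raw counts for the six age bins; the 63.5:67.5 bin has no bad users.
good = [180, 276, 148, 125, 56, 158]
bad = [9, 31, 3, 10, 0, 4]


def woe_iv_from_counts(good, bad):
    """Return a list of per-bin (WOE, IV_i) pairs and the total IV."""
    total_good, total_bad = sum(good), sum(bad)  # 943 and 57, as in the verbose output
    rows = []
    for g, b in zip(good, bad):
        b = b if b > 0 else 1  # a zero bad count is replaced by 1 to keep the log finite
        pct_good = g / total_good
        pct_bad = b / total_bad
        woe = math.log(pct_bad / pct_good)
        rows.append((woe, (pct_bad - pct_good) * woe))
    return rows, sum(iv for _, iv in rows)


rows, total_iv = woe_iv_from_counts(good, bad)
print(round(rows[0][0], 6))  # -0.189717
print(round(total_iv, 6))    # 0.423641
```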


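To compute WOE and IV for all variables at once, pass a list of variable names. Also note the `dt` parameter, which states whether each variable is continuous or categorical: `0` marks a continuous variable and `1` a categorical one, and its entries correspond one-to-one to the variable list. All variables here are continuous, so a list containing only 0s is passed: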
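When a dataset mixes both kinds of variables, the `dt` list has to line up with the variable list position by position. A minimal sketch of building it, where `build_dt` is a hypothetical helper and `Gender` is a made-up categorical column that does not exist in this dataset:

```python
def build_dt(columns, categorical=()):
    """Return one flag per column: 1 if the column is categorical, else 0."""
    categorical = set(categorical)
    return [1 if col in categorical else 0 for col in columns]


columns = ['age', 'MonthlyIncome', 'Gender']  # 'Gender' is hypothetical
dt_flags = build_dt(columns, categorical=['Gender'])
print(dt_flags)  # [0, 0, 1]
```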

In [27]:
vars = train.columns[train.columns != 'SeriousDlqin2yrs']
dt = [0] * len(vars)
woe_iv_df = woe_iv(train, 'SeriousDlqin2yrs', vars=vars, dt=dt)

In [29]:
woe_iv_df

Unnamed: 0,Variable,Bin,Good,Bad,pnt_0,pnt_e,WOE,IV_i,IV
0,RevolvingUtilizationOfUnsecuredLines,-inf:0.2302175909280777,562,12,0.59597,0.210526,-1.04058,0.401085,1.054533
1,RevolvingUtilizationOfUnsecuredLines,0.2302175909280777:0.31378173828125,66,1,0.0699894,0.0175439,-1.38364,0.0725657,1.054533
2,RevolvingUtilizationOfUnsecuredLines,0.31378173828125:0.6355680227279663,140,8,0.148462,0.140351,-0.0561859,0.00045575,1.054533
3,RevolvingUtilizationOfUnsecuredLines,0.6355680227279663:inf,175,37,0.185578,0.649123,1.25215,0.580426,1.054533
4,age,-inf:36.5,180,9,0.19088,0.157895,-0.189717,0.00625791,0.423641
5,age,36.5:51.5,276,31,0.292683,0.54386,0.619601,0.155629,0.423641
6,age,51.5:57.5,148,3,0.156946,0.0526316,-1.09258,0.113972,0.423641
7,age,57.5:63.5,125,10,0.132556,0.175439,0.280286,0.0120195,0.423641
8,age,63.5:67.5,56,1,0.0593849,0.0175439,-1.21934,0.0510184,0.423641
9,age,67.5:inf,158,4,0.16755,0.0701754,-0.870286,0.084744,0.423641


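The table above records the IV of every variable and the WOE of each bin. Continuous variables are binned with a decision tree; the code lives in `rosaceae.bins.bin_tree`, which relies on the decision tree implementation in `sklearn`.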
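The general idea of decision-tree binning can be sketched as follows. This is not the actual `bin_tree` implementation: the function `tree_bin_labels` and the parameter values (`max_leaf_nodes=6`, `min_samples_leaf=0.05`) are assumptions chosen for illustration. A one-feature tree is fitted against the label, and its split thresholds become the bin edges.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def tree_bin_labels(x, y, max_leaf_nodes=6, min_samples_leaf=0.05):
    """Fit a one-feature tree and turn its split thresholds into 'lower:upper' labels."""
    features = np.asarray(x, dtype=float).reshape(-1, 1)
    clf = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes,
                                 min_samples_leaf=min_samples_leaf,
                                 random_state=0)
    clf.fit(features, np.asarray(y))
    tree = clf.tree_
    # Leaves have children_left == -1; only internal nodes carry a real threshold.
    cuts = sorted(set(tree.threshold[tree.children_left != -1].tolist()))
    edges = [float('-inf')] + cuts + [float('inf')]
    return ['{}:{}'.format(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]


# Toy data: the label flips from 0 to 1 between x = 49 and x = 50.
x = list(range(100))
y = [0] * 50 + [1] * 50
labels = tree_bin_labels(x, y)
print(labels)  # ['-inf:49.5', '49.5:inf']
```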

If the `vars` argument is omitted, every column of the input DataFrame except `SeriousDlqin2yrs` is analysed by default:

In [30]:
woe_iv_df2 = woe_iv(train, 'SeriousDlqin2yrs', dt=dt)

In [31]:
woe_iv_df2

Unnamed: 0,Variable,Bin,Good,Bad,pnt_0,pnt_e,WOE,IV_i,IV
0,RevolvingUtilizationOfUnsecuredLines,-inf:0.2302175909280777,562,12,0.59597,0.210526,-1.04058,0.401085,1.054533
1,RevolvingUtilizationOfUnsecuredLines,0.2302175909280777:0.31378173828125,66,1,0.0699894,0.0175439,-1.38364,0.0725657,1.054533
2,RevolvingUtilizationOfUnsecuredLines,0.31378173828125:0.6355680227279663,140,8,0.148462,0.140351,-0.0561859,0.00045575,1.054533
3,RevolvingUtilizationOfUnsecuredLines,0.6355680227279663:inf,175,37,0.185578,0.649123,1.25215,0.580426,1.054533
4,age,-inf:36.5,180,9,0.19088,0.157895,-0.189717,0.00625791,0.423641
5,age,36.5:51.5,276,31,0.292683,0.54386,0.619601,0.155629,0.423641
6,age,51.5:57.5,148,3,0.156946,0.0526316,-1.09258,0.113972,0.423641
7,age,57.5:63.5,125,10,0.132556,0.175439,0.280286,0.0120195,0.423641
8,age,63.5:67.5,56,1,0.0593849,0.0175439,-1.21934,0.0510184,0.423641
9,age,67.5:inf,158,4,0.16755,0.0701754,-0.870286,0.084744,0.423641
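Because the `IV` column repeats the variable-level value on every bin row, a one-row-per-variable summary is often easier to read. A sketch with pandas, using a few rows copied from the table above; `iv_summary` is a hypothetical helper, not part of `rosaceae`.

```python
import pandas as pd

# A few rows in the same long format as the result tables (values copied from them).
woe_iv_rows = pd.DataFrame({
    'Variable': ['RevolvingUtilizationOfUnsecuredLines'] * 2 + ['age'] * 2,
    'Bin': ['-inf:0.2302175909280777', '0.6355680227279663:inf',
            '-inf:36.5', '36.5:51.5'],
    'IV': [1.054533, 1.054533, 0.423641, 0.423641],
})


def iv_summary(df):
    """One IV value per variable, highest first."""
    return df.groupby('Variable')['IV'].first().sort_values(ascending=False)


summary = iv_summary(woe_iv_rows)
print(summary.index.tolist())  # ['RevolvingUtilizationOfUnsecuredLines', 'age']
```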
