# Installation

## Python 2
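Some packages have dropped Python 2 support in their latest releases, so installing `rosaceae` directly and letting pip resolve its dependencies produces version errors. Python 2 users should install compatible versions of the dependencies first:
```
pip install matplotlib==2.2.3
pip install seaborn pandas scikit-learn
pip install rosaceae
```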

## Python 3 
```
pip install rosaceae
```

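# `rosaceae` usage example
The dataset used here comes from the Kaggle competition [GiveMeSomeCredit](https://www.kaggle.com/c/GiveMeSomeCredit). That competition is also about assessing borrowers' credit with a scorecard; here the data only serves to demonstrate how to use `rosaceae`.
Three files are involved:
+ cs-test.csv: the test dataset
+ cs-training.csv: the training dataset
+ Data Dictionary.xls: descriptions of the variables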

The first 1000 rows of `cs-training.csv` are used to introduce `rosaceae`; the full data can be downloaded from the competition link above.

## Loading the data

In [2]:
import pandas as pd

In [3]:
train = pd.read_csv('cs-training-1000.csv', index_col=0)

In [61]:
train.head()

Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
1,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
2,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
3,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
4,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0


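In the sample data above, `SeriousDlqin2yrs` is the good/bad label. According to `Data Dictionary.xls`, it indicates whether the borrower became 90 days or more past due: 1 means delinquent, 0 means not delinquent. The other columns are borrower attributes; see `Data Dictionary.xls` for what each one means.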

In [5]:
data_info = pd.read_excel('Data Dictionary.xls')

In [6]:
display(data_info)

Unnamed: 0,Variable Name,Description,Type
0,SeriousDlqin2yrs,Person experienced 90 days past due delinquenc...,Y/N
1,RevolvingUtilizationOfUnsecuredLines,Total balance on credit cards and personal lin...,percentage
2,age,Age of borrower in years,integer
3,NumberOfTime30-59DaysPastDueNotWorse,Number of times borrower has been 30-59 days p...,integer
4,DebtRatio,"Monthly debt payments, alimony,living costs di...",percentage
5,MonthlyIncome,Monthly income,real
6,NumberOfOpenCreditLinesAndLoans,Number of Open loans (installment like car loa...,integer
7,NumberOfTimes90DaysLate,Number of times borrower has been 90 days or m...,integer
8,NumberRealEstateLoansOrLines,Number of mortgage and real estate loans inclu...,integer
9,NumberOfTime60-89DaysPastDueNotWorse,Number of times borrower has been 60-89 days p...,integer


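These variables can be roughly grouped as follows:

+ **Basic attributes:** age
+ **Assets:** NumberOfOpenCreditLinesAndLoans (open credit lines and loans), NumberRealEstateLoansOrLines (mortgage and real estate loans, including home equity lines of credit)
+ **Credit history:** NumberOfTimes90DaysLate (number of times 90 days or more past due), NumberOfTime60-89DaysPastDueNotWorse (number of times 60-89 days past due), NumberOfTime30-59DaysPastDueNotWorse (number of times 30-59 days past due)
+ **Repayment capacity:** MonthlyIncome (monthly income), DebtRatio (debt ratio)
+ **Other factors:** NumberOfDependents (number of dependents)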



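## Inspecting the data
This walkthrough only demonstrates how to use `rosaceae`; no feature engineering is applied to the data.  
The `summary` function computes basic statistics for each column and gives a quick overview of the data.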

In [8]:
from rosaceae.utils import summary

In [11]:
summ = summary(train)

In [12]:
summ

Unnamed: 0,Field,Type,Recs,Miss,Min,Q25,Q50,Avg,Q75,Max,StDv,Uniq,OutLo,OutHi
0,SeriousDlqin2yrs,int64,1000,0,0.0,0.0,0.0,0.057,0.0,1.0,0.232,2,0,57
1,RevolvingUtilizationOfUnsecuredLines,float64,1000,0,0.0,0.03236,0.1597,4.718,0.5334,2340.0,98.65,865,0,5
2,age,int64,1000,0,22.0,40.0,52.0,51.79,62.25,97.0,15.17,71,0,1
3,NumberOfTime30-59DaysPastDueNotWorse,int64,1000,0,0.0,0.0,0.0,0.266,0.0,10.0,0.7719,8,0,161
4,DebtRatio,float64,1000,0,0.0,0.1673,0.3604,353.8,0.7505,15466.0,1168.0,966,0,206
5,MonthlyIncome,float64,1000,181,0.0,3300.0,5217.0,6618.0,8332.0,208333.0,8818.0,493,0,35
6,NumberOfOpenCreditLinesAndLoans,int64,1000,0,0.0,5.0,8.0,8.523,11.0,31.0,5.169,30,0,30
7,NumberOfTimes90DaysLate,int64,1000,0,0.0,0.0,0.0,0.081,0.0,3.0,0.3723,4,0,55
8,NumberRealEstateLoansOrLines,int64,1000,0,0.0,0.0,1.0,0.974,2.0,8.0,1.019,8,0,2
9,NumberOfTime60-89DaysPastDueNotWorse,int64,1000,0,0.0,0.0,0.0,0.056,0.0,5.0,0.2949,4,0,45


In [13]:
summ.columns

Index(['Field', 'Type', 'Recs', 'Miss', 'Min', 'Q25', 'Q50', 'Avg', 'Q75',
       'Max', 'StDv', 'Uniq', 'OutLo', 'OutHi'],
      dtype='object')

In [7]:
train.shape

(1000, 11)

The meaning of each column in the result is described in the function's help text:

In [10]:
help(summary)

Help on function summary in module rosaceae.utils:

summary(data, verbose=False)
    Generates descriptive statistics for a dataset
    
    Generates descriptive statistics that summarize the central tendency,
    dispersion and shape of a dataset's distribution. When data is categorious 
    or datatime format, some statistics values will be replaced by NA. 
    
    Parameters
    ----------
    data : pandas data frame
    verbose : True or False
            Print verbose message, default is False.
    
    Returns
    -------
    A pandas data frame with statistics values for each column.
    Field: Field name.
    Type: object, numeric, integer, other.
    Recs: Number of records.
    Miss: Number of missing records.
    Min: Minimum value.
    Q25: First quartile. It splits off the lowest 25\% of data from the highest 75\%.
    Q50: Median or second quartile. It cuts data set in half.
    Avg: Average value.
    Q75: Third quartile. It splits off the lowest 75\% of data from the

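## Binning  

`rosaceae.bins` provides several common binning methods:

+ `bin_distance`: equal-width binning by a fixed step, for numeric variables  
+ `bin_frequency`: equal-frequency binning, for numeric variables  
+ `bin_tree`: decision-tree binning based on the decision tree classifier in `sklearn`, for numeric variables 
+ `bin_chi2`: chi-square binning, for numeric and categorical variables  
+ `bin_custom`: custom binning by user-defined intervals or groups, for numeric and categorical variables   

Each of these functions takes an `na_omit` parameter that controls how missing data is handled. With `na_omit=False`, missing values are kept and placed in a separate `Miss` bin; with `na_omit=True`, missing values are dropped together with their corresponding good/bad labels.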
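
The effect of `na_omit` can be illustrated with a small pure-Python sketch. The helper `split_missing` is hypothetical and is not part of `rosaceae`; it only mirrors the behavior described above, assuming missing values appear as `None` or `NaN`.

```python
import math

def split_missing(values, labels, na_omit=False):
    """Hypothetical helper: separate missing values from the rest."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    # Non-missing values keep their good/bad label; missing ones lose both.
    kept = [(v, y) for v, y in zip(values, labels) if not is_missing(v)]
    bins = {}
    if not na_omit:
        # Keep the positions of the missing values in a dedicated "Miss" bin.
        bins["Miss"] = [i for i, v in enumerate(values) if is_missing(v)]
    return kept, bins

income = [9120.0, float("nan"), 3042.0, None]
target = [1, 0, 0, 0]

kept, bins = split_missing(income, target, na_omit=False)
kept_omit, bins_omit = split_missing(income, target, na_omit=True)
print(kept, bins, bins_omit)
```

In both cases only the non-missing pairs go on to the actual binning step; the difference is whether a `Miss` bin is reported alongside the regular bins.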

In [14]:
from rosaceae.bins import bin_distance, bin_frequency, bin_tree, bin_chi2, bin_custom

### bin_distance

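Using monthly income (`MonthlyIncome`) as the test variable, start with equal-width binning: given the number of bins and the minimum and maximum of the data, the range is divided into bins of equal width.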

In [19]:
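# Split into 5 bins, with missing values in a bin of their own. Returns a dict mapping each bin interval to the indices of its rows.
# With na_omit=True, the missing values are dropped instead.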
mon_dis = bin_distance(train['MonthlyIncome'], 5, na_omit=False)

In [21]:
mon_dis.keys()

dict_keys(['[166666.4,inf)', '[-inf,41666.6)', 'Miss', '[83333.2,124999.79999999999)', '[41666.6,83333.2)', '[124999.79999999999,166666.4)'])

In [23]:
bin_distance(train['MonthlyIncome'], 5, na_omit=True, verbose=True).keys()

Distance: 41666.6


dict_keys(['[83333.2,124999.79999999999)', '[166666.4,inf)', '[41666.6,83333.2)', '[-inf,41666.6)', '[124999.79999999999,166666.4)'])
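
The interval labels above can be reproduced approximately with a short sketch. It assumes the step is `(max - min) / bins` and that the outermost bins are opened to `-inf` and `inf`; the function `equal_width_bins` is hypothetical and not the library's code.

```python
import math

def equal_width_bins(values, n_bins):
    """Hypothetical sketch of equal-width binning (missing values ignored)."""
    clean = [v for v in values if v is not None and not math.isnan(v)]
    lo, hi = min(clean), max(clean)
    step = (hi - lo) / n_bins
    # Inner cut points are evenly spaced; the outer edges are infinite.
    cuts = [-math.inf] + [lo + step * k for k in range(1, n_bins)] + [math.inf]
    labels = [f"[{a},{b})" for a, b in zip(cuts[:-1], cuts[1:])]
    return step, labels

# Same minimum (0) and maximum (208333) as MonthlyIncome in the sample data.
step, labels = equal_width_bins([0.0, 2600.0, 9120.0, float("nan"), 208333.0], 5)
print(step)
print(labels)
```

Because the width depends only on the minimum and maximum, a single extreme value such as 208333 pushes almost all records into the first bin, which is worth keeping in mind for skewed variables like income.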

### bin_frequency

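Equal-frequency binning sorts the data in ascending order and distributes it across the requested number of bins so that each bin holds roughly the same number of elements. If missing values are kept as a bin of their own, the remaining data is divided into the requested number of bins; if missing values are dropped, the data left after dropping is divided the same way. The last bin may differ in size from the others: when the count does not divide evenly, the leftover elements go into the last bin.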

In [32]:
mon_freq = bin_frequency(train['MonthlyIncome'], 5, na_omit=False, verbose=True)

for k in sorted(mon_freq.keys()):
    print(k, len(mon_freq[k]))

Step: 163
Freq1 163
Freq2 163
Freq3 163
Freq4 163
Freq5 167
Miss 181


In [34]:
mon_freq = bin_frequency(train['MonthlyIncome'], 5, na_omit=True, verbose=True)

for k in sorted(mon_freq.keys()):
    print(k, len(mon_freq[k]))

Step: 163
Freq1 163
Freq2 163
Freq3 163
Freq4 163
Freq5 167
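
The bin sizes above follow from integer division: 819 non-missing incomes split into 5 bins give a step of 163, and the last bin absorbs the remainder. A minimal sketch of that rule, using a hypothetical `equal_frequency_bins` function rather than the library's implementation:

```python
def equal_frequency_bins(values, n_bins):
    """Hypothetical sketch: equal-frequency bins, remainder goes to the last bin."""
    ordered = sorted(values)
    step = len(ordered) // n_bins
    bins = {}
    for k in range(n_bins):
        start = k * step
        # The last bin runs to the end so it also takes the leftover elements.
        end = (k + 1) * step if k < n_bins - 1 else len(ordered)
        bins[f"Freq{k + 1}"] = ordered[start:end]
    return step, bins

# 819 is the number of non-missing MonthlyIncome values in the sample.
step, bins = equal_frequency_bins(list(range(819)), 5)
sizes = [len(bins[k]) for k in sorted(bins)]
print(step, sizes)
```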


### bin_tree  

Decision-tree binning uses [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) from `sklearn`, with the following parameter settings:

```python
clf = DecisionTreeClassifier(random_state=0, 
                                criterion='entropy',
                                min_samples_split=0.2,
                                max_leaf_nodes=6, 
                                min_impurity_decrease=0.001,
                                **kwargs)
```

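Additional parameters can be passed through `**kwargs`. The decision tree partitions the data into leaf nodes, and the corresponding node information is recorded. For details on decision tree node information, see [](). If the share of data in a leaf node is below the threshold set by `min_samples_node`, that node is merged back into its parent.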

In [39]:
mon_tree = bin_tree(train['MonthlyIncome'], train['SeriousDlqin2yrs'], na_omit=True)

for k in mon_tree.keys():
    print(k, len(mon_tree[k]))

[6202.0:7480.0) 75
[-inf:1702.5) 62
[1702.5:6202.0) 428
[7480.0:inf) 254


In [40]:
mon_tree = bin_tree(train['MonthlyIncome'], train['SeriousDlqin2yrs'], na_omit=False)

for k in mon_tree.keys():
    print(k, len(mon_tree[k]))

[6202.0:7480.0) 75
[-inf:1702.5) 62
Miss 181
[1702.5:6202.0) 428
[7480.0:inf) 254
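
To give some intuition for where cut points such as `1702.5` come from, here is a simplified, pure-Python illustration of a single entropy-based split. It is not `sklearn`'s implementation and not what `bin_tree` runs internally; `best_split` is a hypothetical helper that tries the midpoint between each pair of adjacent distinct values and keeps the one with the largest information gain.

```python
import math

def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_split(values, labels):
    """Hypothetical helper: best single threshold by information gain."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    base = entropy(ys)
    best_thr, best_gain = None, 0.0
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold can separate identical values
        left, right = ys[:i], ys[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        gain = base - weighted
        if gain > best_gain:
            best_thr, best_gain = (xs[i - 1] + xs[i]) / 2, gain
    return best_thr, best_gain

# Toy data: low incomes are all "bad" (1), high incomes are all "good" (0).
values = [1000, 1500, 2000, 6000, 6500, 7000]
labels = [1, 1, 1, 0, 0, 0]
thr, gain = best_split(values, labels)
print(thr, gain)
```

A full tree repeats this search inside each resulting group until a stopping rule such as `max_leaf_nodes` is reached; the leaf boundaries then become the bin edges.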


### bin_custom

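Custom binning assigns data to user-supplied intervals or groups through the `groups` parameter. For categorical variables the input looks like `groups=['A', 'B', ('C', 'D')]`; to put several categories into one bin, group them in a tuple. For numeric variables the input looks like `groups=['(-inf:4]', '(4:10]', '(10:inf)']`, where the first and last intervals extend to negative and positive infinity. Missing data is controlled by `na_omit`: `True` drops missing values, `False` puts them in a group of their own.  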

In [67]:
age_cus = bin_custom(train['age'], groups=['(-inf:40]', '(40:50]', '[50:60)', '[60:inf)'], na_omit=False, verbose=True)

for k in age_cus.keys():
    print(k, len(age_cus[k]))

Keep NA data
Missing data: 0
* Handling (-inf:40]
* Handling (40:50]
* Handling [50:60)
* Handling [60:inf)
(40:50] 199
[60:inf) 301
(-inf:40] 275
[50:60) 246
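
The interval labels follow the usual bracket convention: `(` and `)` exclude the endpoint, `[` and `]` include it. The sketch below parses such a label and tests membership; `parse_interval` and `contains` are hypothetical helpers, not functions from `rosaceae`. It also shows why the four bin counts above add up to 1021 rather than 1000: `(40:50]` and `[50:60)` both include the value 50, so those rows appear to be counted twice.

```python
def parse_interval(label):
    """Hypothetical helper: split '(40:50]' into bounds and closedness flags."""
    left_closed = label[0] == "["
    right_closed = label[-1] == "]"
    low, high = (float(part) for part in label[1:-1].split(":"))
    return low, high, left_closed, right_closed

def contains(label, x):
    """Return True when x falls inside the interval described by label."""
    low, high, left_closed, right_closed = parse_interval(label)
    above = x >= low if left_closed else x > low
    below = x <= high if right_closed else x < high
    return above and below

groups = ['(-inf:40]', '(40:50]', '[50:60)', '[60:inf)']
matches_50 = [g for g in groups if contains(g, 50)]
matches_40 = [g for g in groups if contains(g, 40)]
print(matches_50, matches_40)
```

Using `(50:60)` or `[50:60)` together with `(40:50)` would make the intervals disjoint, so each row lands in exactly one bin.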


In [70]:
train['SeriousDlqin2yrs'].value_counts()

0    943
1     57
Name: SeriousDlqin2yrs, dtype: int64

In [69]:
label_cus = bin_custom(train['SeriousDlqin2yrs'], groups=['1','0'])

for k in label_cus.keys():
    print(k, len(label_cus[k]))

('0',) 943
('1',) 57


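## Computing WOE and IV
WOE and IV are computed with the `woe_iv` function in `rosaceae.scorecard`. It handles a single variable or several at once and returns the results as a DataFrame.  
The variables should be binned beforehand, with the binning results stored in a dict keyed by variable name.   
Alternatively, pass a list of variable names: numeric variables are then binned with the decision tree, and categorical variables are grouped by category.  

Suppose the binning information is stored in a dict named `bin_info`: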
```python

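bin_info = {'age': bins_result,  # result returned by one of the binning functions in rosaceae.bins
            'MonthlyIncome': ['(-inf:3000]', '(3000:5000]', '(5000:8500]', '[8500:inf)'],  # custom numeric intervals
            'NumberOfOpenCreditLinesAndLoans': [],  # an empty list means no binning info is provided;
                                                    # the default binning method is used instead
            # ...
           }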
```

For missing data, `na_omit` only takes effect for variables in `bin_info` that come without binning information. An empty list `[]` marks a variable as having none.

In [71]:
train['MonthlyIncome'].describe()

count       819.000000
mean       6617.976801
std        8818.220170
min           0.000000
25%        3300.000000
50%        5217.000000
75%        8332.500000
max      208333.000000
Name: MonthlyIncome, dtype: float64

In [41]:
from rosaceae.scorecard import woe_iv

In [42]:
# Given only a list of variable names, woe_iv bins each variable with the default method and then computes WOE and IV.
woe_iv(train, 'SeriousDlqin2yrs', var=['age', 'MonthlyIncome'], na_omit=False)

Unnamed: 0,Variable,Bin,Good,Bad,pnt_0,pnt_1,WOE,IV_i,IV
0,age,[-inf:36.5),180,9,0.19088,0.157895,-0.189717,0.00625791,0.423641
1,age,[36.5:51.5),276,31,0.292683,0.54386,0.619601,0.155629,0.423641
2,age,[51.5:57.5),148,3,0.156946,0.0526316,-1.09258,0.113972,0.423641
3,age,[57.5:63.5),125,10,0.132556,0.175439,0.280286,0.0120195,0.423641
4,age,[63.5:67.5),56,1,0.0593849,0.0175439,-1.21934,0.0510184,0.423641
5,age,[67.5:inf),158,4,0.16755,0.0701754,-0.870286,0.084744,0.423641
6,MonthlyIncome,[-inf:1702.5),61,1,0.0646872,0.0175439,-1.30486,0.0615154,0.244331
7,MonthlyIncome,[1702.5:6202.0),393,35,0.416755,0.614035,0.387553,0.0764566,0.244331
8,MonthlyIncome,[6202.0:7480.0),75,1,0.0795334,0.0175439,-1.51147,0.0936955,0.244331
9,MonthlyIncome,[7480.0:inf),242,12,0.256628,0.210526,-0.198016,0.00912883,0.244331


In [73]:
bin_info = {'age': bin_tree(train['age'], train['SeriousDlqin2yrs'], na_omit=False),
            'MonthlyIncome':['(-inf:3000]', '(3000:5000]', '(5000:8500]', '[8500:inf)', 'Miss'],
           'RevolvingUtilizationOfUnsecuredLines':[], # the remaining variables use the default binning
           'NumberOfTime30-59DaysPastDueNotWorse':[],
           'DebtRatio':[], 
           'NumberOfOpenCreditLinesAndLoans':[],
           'NumberOfTimes90DaysLate':[],
           'NumberRealEstateLoansOrLines':[],
           'NumberOfTime60-89DaysPastDueNotWorse':[],
           'NumberOfDependents':[]}

In [75]:
woe_df = woe_iv(train, 'SeriousDlqin2yrs', var=bin_info, na_omit=False)

In [76]:
woe_df

Unnamed: 0,Variable,Bin,Good,Bad,pnt_0,pnt_1,WOE,IV_i,IV
0,RevolvingUtilizationOfUnsecuredLines,[-inf:0.2302175909280777),562,12,0.59597,0.210526,-1.04058,0.401085,1.251684
1,RevolvingUtilizationOfUnsecuredLines,[0.2302175909280777:0.3137817233800888),66,1,0.0699894,0.0175439,-1.38364,0.0725657,1.251684
2,RevolvingUtilizationOfUnsecuredLines,[0.3137817233800888:0.6355679929256439),140,8,0.148462,0.140351,-0.0561859,0.00045575,1.251684
3,RevolvingUtilizationOfUnsecuredLines,[0.6355679929256439:1.003534197807312),168,28,0.178155,0.491228,1.01426,0.317536,1.251684
4,RevolvingUtilizationOfUnsecuredLines,[1.003534197807312:inf),7,9,0.00742312,0.157895,3.05733,0.460041,1.251684
5,NumberOfTime60-89DaysPastDueNotWorse,[-inf:0.5),913,42,0.968187,0.736842,-0.273051,0.0631689,0.551969
6,NumberOfTime60-89DaysPastDueNotWorse,[0.5:inf),30,15,0.0318134,0.263158,2.11287,0.4888,0.551969
7,DebtRatio,[-inf:0.12926051020622253),198,3,0.209968,0.0526316,-1.38364,0.217697,0.487792
8,DebtRatio,[0.12926051020622253:0.4607849717140198),382,21,0.40509,0.368421,-0.0948832,0.00347928,0.487792
9,DebtRatio,[0.4607849717140198:1.7644871473312378),167,23,0.177094,0.403509,0.823515,0.186456,0.487792


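The resulting `woe_df` is sorted by IV in descending order, and the bins within each variable are sorted by interval.   
Note that if a bin contains 0 bad users, the code resets that count to 1 to avoid a division by zero in the later calculations.   
The interval `[63.5:67.5)` means `x >= 63.5 and x < 67.5`, that is `[63.5, 67.5)`; `-inf` and `inf` denote negative and positive infinity.  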
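
The numbers in the `woe_iv` output for `age` are consistent with the common definitions `WOE = ln(pnt_1 / pnt_0)` and `IV_i = (pnt_1 - pnt_0) * WOE`, where `pnt_0` and `pnt_1` are the bin's shares of all good and all bad users. The sketch below recomputes them from the Good and Bad counts in that table. The function `woe_iv_table` is hypothetical, and the formulas are inferred from the printed values rather than taken from the library source.

```python
import math

def woe_iv_table(bins, total_good, total_bad):
    """Hypothetical sketch: per-bin WOE and IV_i, plus the variable's total IV."""
    rows, iv = [], 0.0
    for label, good, bad in bins:
        pnt_0 = good / total_good   # share of all good users in this bin
        pnt_1 = bad / total_bad     # share of all bad users in this bin
        woe = math.log(pnt_1 / pnt_0)
        iv_i = (pnt_1 - pnt_0) * woe
        rows.append((label, woe, iv_i))
        iv += iv_i
    return rows, iv

# Good/Bad counts copied from the `age` rows of the woe_iv output.
age_bins = [
    ("[-inf:36.5)", 180, 9),
    ("[36.5:51.5)", 276, 31),
    ("[51.5:57.5)", 148, 3),
    ("[57.5:63.5)", 125, 10),
    ("[63.5:67.5)", 56, 1),
    ("[67.5:inf)", 158, 4),
]
# Totals come from train['SeriousDlqin2yrs'].value_counts(): 943 good, 57 bad.
rows, iv = woe_iv_table(age_bins, total_good=943, total_bad=57)
for label, woe, iv_i in rows:
    print(label, round(woe, 6), round(iv_i, 6))
print("IV", round(iv, 6))
```

The Bad column sums to 58 while the data holds 57 bad users, which would be consistent with one empty bin having had its count reset to 1.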

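# Building the scorecard model  

After binning, each variable's raw values need to be replaced with the WOE of the bin they fall into, ready for the subsequent regression analysis.  

The `replaceWOE` function performs this replacement. It takes the WOE DataFrame produced by `woe_iv` and the original data, and only replaces variables that appear in the WOE DataFrame.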

In [77]:
from rosaceae.scorecard import replaceWOE

In [82]:
woe_data = replaceWOE(woe_df, train)

In [83]:
woe_data.head()

Unnamed: 0,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,3.057329,-0.189717,1.112236,0.823515,-0.100886,0.469028,-0.189717,1.051996,-0.273051,0.497732
1,3.057329,-0.189717,-0.421573,-1.38364,0.203325,-0.277669,-0.189717,-0.043865,-0.273051,0.497732
2,3.057329,-0.189717,1.112236,-1.38364,0.30651,-0.277669,0.758322,-0.043865,-0.273051,-0.475612
3,-1.38364,-0.189717,-0.421573,-1.38364,0.30651,-0.277669,-0.189717,-0.043865,-0.273051,-0.475612
4,3.057329,-0.189717,1.112236,-1.38364,-0.100886,-0.277669,-0.189717,-0.540961,-0.273051,-0.475612
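
Conceptually, the replacement is a lookup from a raw value to the WOE of its interval. The following sketch shows that idea for left-closed bins such as `[1702.5:6202.0)`, using the standard-library `bisect` module. The edges and WOE values are copied from the `MonthlyIncome` rows of the first `woe_iv` call; `to_woe` is a hypothetical helper, and the sketch is not meant to reproduce the `replaceWOE` output shown above.

```python
from bisect import bisect_right

# Inner bin edges and per-bin WOE for MonthlyIncome (decision-tree bins).
edges = [1702.5, 6202.0, 7480.0]
woes = [-1.30486, 0.387553, -1.51147, -0.198016]

def to_woe(x):
    """Hypothetical helper: map a raw value to the WOE of its [low:high) bin."""
    # bisect_right puts a value equal to an edge into the bin on its right,
    # which matches left-closed, right-open intervals.
    return woes[bisect_right(edges, x)]

incomes = [9120.0, 2600.0, 1702.5, 0.0]
replaced = [to_woe(x) for x in incomes]
print(replaced)
```

Missing values would need separate handling, for example a dedicated WOE for the `Miss` bin, before this lookup is applied.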
