# Let's Slay Monsters Together: Credit Scoring Exercise

---
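## Assignment Instructions

- Answering steps:
    - **Keep every step** of your work when answering; do not give only the final answer
    - Make a habit of commenting your code

- Solution hints:
    - To help you understand each question accurately and get the most out of the practice, this document provides hints on how to approach each problem
    - The hints are **for reference only**; original solutions are encouraged
    - To encourage you to think for yourself, the hint text is set in **white**; when needed, drag the mouse from just after the colon to reveal it

- Data used:
    - After loading the data, first **inspect it and understand its basic properties**; later questions will not repeat this reminder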

## Machine Learning for Credit Scoring


Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years. [Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)

Attribute Information:

|Variable Name	|	Description	|	Type|
|----|----|----|
|SeriousDlqin2yrs	|	Person experienced 90 days past due delinquency or worse 	|	Y/N|
|RevolvingUtilizationOfUnsecuredLines	|	Total balance on credit divided by the sum of credit limits	|	percentage|
|age	|	Age of borrower in years	|	integer|
|NumberOfTime30-59DaysPastDueNotWorse	|	Number of times borrower has been 30-59 days past due |	integer|
|DebtRatio	|	Monthly debt payments, alimony, and living costs divided by monthly gross income	|	percentage|
|MonthlyIncome	|	Monthly income	|	real|
|NumberOfOpenCreditLinesAndLoans	|	Number of open loans and lines of credit |	integer|
|NumberOfTimes90DaysLate	|	Number of times borrower has been 90 days or more past due.	|	integer|
|NumberRealEstateLoansOrLines	|	Number of mortgage and real estate loans	|	integer|
|NumberOfTime60-89DaysPastDueNotWorse	|	Number of times borrower has been 60-89 days past due |integer|
|NumberOfDependents	|	Number of dependents in family	|	integer|


----------
## Read the data into Pandas 

In [15]:
import pandas as pd
pd.set_option('display.max_columns', 500)
import numpy as np
import matplotlib.pyplot as plt
import zipfile
with zipfile.ZipFile('KaggleCredit2.csv.zip', 'r') as z:
    f = z.open('KaggleCredit2.csv')
    data = pd.read_csv(f, index_col=0)
data.head()

|Unnamed: 0|SeriousDlqin2yrs|RevolvingUtilizationOfUnsecuredLines|age|NumberOfTime30-59DaysPastDueNotWorse|DebtRatio|MonthlyIncome|NumberOfOpenCreditLinesAndLoans|NumberOfTimes90DaysLate|NumberRealEstateLoansOrLines|NumberOfTime60-89DaysPastDueNotWorse|NumberOfDependents|
|----|----|----|----|----|----|----|----|----|----|----|----|
|0|1|0.766127|45.0|2.0|0.802982|9120.0|13.0|0.0|6.0|0.0|2.0|
|1|0|0.957151|40.0|0.0|0.121876|2600.0|4.0|0.0|0.0|0.0|1.0|
|2|0|0.65818|38.0|1.0|0.085113|3042.0|2.0|1.0|0.0|0.0|0.0|
|3|0|0.23381|30.0|0.0|0.03605|3300.0|5.0|0.0|0.0|0.0|0.0|
|4|0|0.907239|49.0|1.0|0.024926|63588.0|7.0|0.0|1.0|0.0|0.0|


In [2]:
data.shape

(112915, 11)

------------
## Drop na

In [3]:
data.isnull().sum(axis=0)

SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

In [4]:
data.dropna(inplace=True)
data.shape

(108648, 11)

---------
## Create X and y

In [5]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

In [6]:
y.mean()

0.06742876076872101

---
## Exercise 1: Split the data into a training set and a test set
- Hint: <span style='color:white'>from sklearn.model_selection import train_test_split</span>

In [7]:
## your code here

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=0)

# Check the dimensions of each split
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((76053, 10), (32595, 10), (76053,), (32595,))
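
Only about 6.7% of the rows are positive (see `y.mean()` above), so a plain random split can leave the two parts with slightly different positive rates. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. It runs on synthetic data: `X_demo` and `y_demo` are made-up stand-ins, not the Kaggle data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 1000 rows, 7% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:70] = 1

# stratify=y_demo keeps the positive rate the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

In the notebook the equivalent call would pass `stratify=y` alongside the existing arguments; whether that changes the reported metrics noticeably has not been checked here.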

In [8]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()  # standardize each column to zero mean and unit variance
sc.fit(X_train)

In [9]:
sc.mean_

array([6.24891681e+00, 5.13431291e+01, 3.82141401e-01, 3.05889733e-01,
       6.96329249e+03, 8.67997318e+00, 2.15875771e-01, 1.01193904e+00,
       1.91761009e-01, 8.57415224e-01])

In [10]:
sc.scale_

array([2.73361312e+02, 1.44369533e+01, 3.57454381e+00, 2.22787954e-01,
       1.58054741e+04, 5.12814227e+00, 3.54201626e+00, 1.07201391e+00,
       3.52622942e+00, 1.15264795e+00])

In [11]:
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

In [13]:
# Check
np.mean(X_train, axis=0), np.std(X_train, axis=0)

(RevolvingUtilizationOfUnsecuredLines       6.248917
 age                                       51.343129
 NumberOfTime30-59DaysPastDueNotWorse       0.382141
 DebtRatio                                  0.305890
 MonthlyIncome                           6963.292493
 NumberOfOpenCreditLinesAndLoans            8.679973
 NumberOfTimes90DaysLate                    0.215876
 NumberRealEstateLoansOrLines               1.011939
 NumberOfTime60-89DaysPastDueNotWorse       0.191761
 NumberOfDependents                         0.857415
 dtype: float64,
 RevolvingUtilizationOfUnsecuredLines      273.361312
 age                                        14.436953
 NumberOfTime30-59DaysPastDueNotWorse        3.574544
 DebtRatio                                   0.222788
 MonthlyIncome                           15805.474141
 NumberOfOpenCreditLinesAndLoans             5.128142
 NumberOfTimes90DaysLate                     3.542016
 NumberRealEstateLoansOrLines                1.072014
 NumberOfTime60-89Day

In [14]:
# Check
np.mean(X_train_std, axis=0), np.std(X_train_std, axis=0)
# np.mean(X_test_std, axis=0), np.std(X_test_std, axis=0)

(array([-3.73709248e-18, -2.42350447e-16,  2.25159822e-17, -2.42537302e-16,
         2.57392244e-17,  2.70939205e-17,  1.08842818e-17, -3.19521407e-17,
        -8.40845808e-18,  1.21455506e-17]),
 array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))
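
The checks above can also be read as a formula: `transform` subtracts `sc.mean_` and divides by `sc.scale_`, both learned from the training set only. The following sketch verifies this on synthetic data (`X_train_demo` and `X_test_demo` are made-up arrays, not the credit data). `StandardScaler` uses the population standard deviation, which is why it agrees with `np.std` as used in the check cells.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): two features with mean 50, std 10.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

# Fit on the training data only, so no test-set statistics leak into the model.
sc_demo = StandardScaler().fit(X_train_demo)

# transform() is (x - mean_) / scale_, column by column.
manual = (X_test_demo - sc_demo.mean_) / sc_demo.scale_
same = np.allclose(manual, sc_demo.transform(X_test_demo))

# The training set has mean 0 after scaling; the test set is only close to 0.
train_means = sc_demo.transform(X_train_demo).mean(axis=0)
print(same, np.allclose(train_means, 0.0))
```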

----
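## Exercise 2: Classify with sklearn classifiers such as logistic regression, decision tree, SVM, and KNN
Look up the sklearn API to learn what each model parameter means, and try adjusting different parameters.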

### Logistic regression
- Hint: <span style='color:white'>from sklearn import linear_model</span>

In [19]:
## your code here

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=1000.0, max_iter=100, random_state=0)
lr.fit(X_train_std, y_train)


### Decision Tree
- Hint: <span style='color:white'>from sklearn.tree import DecisionTreeClassifier</span>

In [36]:
## your code here

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
tree.fit(X_train_std, y_train)

### Random Forest
- Hint: <span style='color:white'>from sklearn.ensemble import RandomForestClassifier</span>

In [41]:
## your code here

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(criterion='entropy', 
                                n_estimators=10, # The number of trees in the forest.
                                random_state=1,
                                n_jobs=2)
forest.fit(X_train_std, y_train)

### SVM
- Hint: <span style='color:white'>from sklearn.svm import SVC</span>

In [55]:
## your code here

from sklearn.svm import SVC

# probability=True enables predict_proba, which Exercise 5 relies on
svm = SVC(kernel='linear', C=0.1, random_state=0, probability=True)
svm.fit(X_train_std, y_train)

### KNN
- Hint: <span style='color:white'>from sklearn.neighbors import KNeighborsClassifier</span>

In [49]:
## your code here

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
knn.fit(X_train_std, y_train)
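
Exercise 2 also asks for experimenting with parameters. One common way to do that systematically is a cross-validated grid search. The sketch below is only an illustration: the data comes from `make_classification`, and the candidate values for `C` and the `roc_auc` scoring choice are assumptions, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (assumption): roughly 10% positives.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Hypothetical candidate values for the inverse regularization strength C.
param_grid = {"C": [0.01, 1.0, 100.0]}

# 3-fold cross-validation, scored by area under the ROC curve.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="roc_auc", cv=3
)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

The same pattern should carry over to the other classifiers by swapping the estimator and the parameter grid, for example `max_depth` for a decision tree or `n_neighbors` for KNN.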

---

## Exercise 3: Predict on the test set and compute the accuracy

### Logistic regression
- Hint: <span style='color:white'>y_pred_LR = clf_LR.predict(x_test)</span>

In [23]:
## your code here

y_pred_LR = lr.predict(X_test_std)

### Decision Tree
- Hint: <span style='color:white'>y_pred_tree = tree.predict(x_test)</span>

In [37]:
## your code here

y_pred_tree = tree.predict(X_test_std)

### Random Forest
- Hint: <span style='color:white'>y_pred_forest = forest.predict(x_test)</span>

In [42]:
## your code here

y_pred_forest = forest.predict(X_test_std)

### SVM
- Hint: <span style='color:white'>y_pred_SVC = clf_svc.predict(x_test)</span>

In [57]:
## your code here
y_pred_svm = svm.predict(X_test_std)

### KNN
- Hint: <span style='color:white'>y_pred_KNN = neigh.predict(x_test)</span>

In [50]:
## your code here
y_pred_knn = knn.predict(X_test_std)
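
The cells above produce predictions but stop short of the accuracy the exercise asks for. A minimal sketch of that step uses `accuracy_score` from `sklearn.metrics`. The label arrays below are toy values made up for illustration; in the notebook the call would be `accuracy_score(y_test, y_pred_LR)` and so on.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels (assumption): 8 negatives and 2 positives.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# Accuracy is the fraction of predictions that match the true labels.
acc = accuracy_score(y_true_demo, y_pred_demo)

# Baseline: always predict the majority class 0.
baseline = accuracy_score(y_true_demo, np.zeros_like(y_true_demo))

print(acc, baseline)
```

Both numbers are 0.8 here, which illustrates why accuracy alone says little on imbalanced data. In this notebook's test set, 30424 of the 32595 rows are negative, so always predicting 0 would already score about 0.93.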

---
## Exercise 4: Read the official sklearn documentation on evaluation metrics for classification, and evaluate this example

**Learning links on the confusion matrix**

- Blog:<br>
http://blog.csdn.net/vesper305/article/details/44927047<br>
- WiKi:<br>
http://en.wikipedia.org/wiki/Confusion_matrix<br>
- sklearn doc:<br>
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

In [26]:
## Logistic Regression

## your code here
from sklearn.metrics import classification_report, confusion_matrix
# Performance evaluation
classification_report_output = classification_report(y_test, y_pred_LR)
confusion_matrix_output = confusion_matrix(y_test, y_pred_LR)

# Classification report
print(classification_report_output)
# Confusion matrix
print(confusion_matrix_output)

              precision    recall  f1-score   support

           0       0.94      1.00      0.97     30424
           1       0.55      0.04      0.08      2171

    accuracy                           0.93     32595
   macro avg       0.74      0.52      0.52     32595
weighted avg       0.91      0.93      0.91     32595

[[30349    75]
 [ 2079    92]]
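
To connect the two outputs above, the class-1 row of the report can be recomputed by hand from the confusion matrix. The sketch uses the four counts printed for logistic regression and sklearn's layout (rows are true classes, columns are predicted classes). The variable names are chosen here for illustration.

```python
# Counts copied from the logistic regression confusion matrix above.
tn, fp = 30349, 75    # true class 0: predicted 0, predicted 1
fn, tp = 2079, 92     # true class 1: predicted 0, predicted 1

# Precision: of everything predicted positive, how much really is positive.
precision = tp / (tp + fp)
# Recall: of all real positives, how many the model found.
recall = tp / (tp + fn)
# Accuracy: share of all predictions that are correct.
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(round(precision, 2), round(recall, 2), round(accuracy, 2))
```

The rounded values are 0.55, 0.04 and 0.93, matching the class-1 precision, class-1 recall and overall accuracy in the report.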


In [38]:
## Decision Tree

## your code here
from sklearn.metrics import classification_report, confusion_matrix
# Performance evaluation
classification_report_output = classification_report(y_test, y_pred_tree)
confusion_matrix_output = confusion_matrix(y_test, y_pred_tree)

# Classification report
print(classification_report_output)
# Confusion matrix
print(confusion_matrix_output)

              precision    recall  f1-score   support

           0       0.94      0.99      0.97     30424
           1       0.54      0.17      0.26      2171

    accuracy                           0.94     32595
   macro avg       0.74      0.58      0.61     32595
weighted avg       0.92      0.94      0.92     32595

[[30104   320]
 [ 1798   373]]


In [43]:
## Random Forest

## your code here
from sklearn.metrics import classification_report, confusion_matrix
# Performance evaluation
classification_report_output = classification_report(y_test, y_pred_forest)
confusion_matrix_output = confusion_matrix(y_test, y_pred_forest)

# Classification report
print(classification_report_output)
# Confusion matrix
print(confusion_matrix_output)

              precision    recall  f1-score   support

           0       0.94      0.99      0.96     30424
           1       0.48      0.15      0.22      2171

    accuracy                           0.93     32595
   macro avg       0.71      0.57      0.59     32595
weighted avg       0.91      0.93      0.92     32595

[[30084   340]
 [ 1853   318]]


In [58]:
## SVM

## your code here
from sklearn.metrics import classification_report, confusion_matrix
# Performance evaluation
classification_report_output = classification_report(y_test, y_pred_svm)
confusion_matrix_output = confusion_matrix(y_test, y_pred_svm)

# Classification report
print(classification_report_output)
# Confusion matrix
print(confusion_matrix_output)

              precision    recall  f1-score   support

           0       0.93      1.00      0.97     30424
           1       0.53      0.01      0.02      2171

    accuracy                           0.93     32595
   macro avg       0.73      0.50      0.49     32595
weighted avg       0.91      0.93      0.90     32595

[[30406    18]
 [ 2151    20]]


In [51]:
## KNN

## your code here
from sklearn.metrics import classification_report, confusion_matrix
# Performance evaluation
classification_report_output = classification_report(y_test, y_pred_knn)
confusion_matrix_output = confusion_matrix(y_test, y_pred_knn)

# Classification report
print(classification_report_output)
# Confusion matrix
print(confusion_matrix_output)

              precision    recall  f1-score   support

           0       0.94      0.99      0.96     30424
           1       0.47      0.11      0.18      2171

    accuracy                           0.93     32595
   macro avg       0.70      0.55      0.57     32595
weighted avg       0.91      0.93      0.91     32595

[[30141   283]
 [ 1922   249]]


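## Exercise 5: Adjust the model's decision threshold

Banks usually have stricter requirements, because the consequences of fraud (or, in this dataset, serious delinquency) are usually severe, so we typically adjust the model's decision criterion.<br>

For example, logistic regression normally uses 0.5 as the probability decision boundary, but we can set the threshold lower to raise the model's "sensitivity". Try a threshold of 0.3 and look at the evaluation metrics again (mainly precision and recall).

- Hint: <span style='color:white'>Many sklearn classifiers provide predict_proba, which returns the estimated probabilities; compare them with the chosen threshold to decide the final class label</span>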

In [35]:
## Logistic Regression
## your code here

# Get predicted probabilities from the logistic regression model's predict_proba method
y_pred_proba = lr.predict_proba(X_test_std)

# Set the threshold to 0.3
threshold = 0.3

# Decide the final class by comparing P(y=1) with the threshold
y_pred_adjusted = (y_pred_proba[:, 1] > threshold).astype(int)

# Recompute the evaluation metrics
classification_report_adjusted = classification_report(y_test, y_pred_adjusted)
confusion_matrix_adjusted = confusion_matrix(y_test, y_pred_adjusted)

# Classification report
print(classification_report_adjusted)
# Confusion matrix
print(confusion_matrix_adjusted)

              precision    recall  f1-score   support

           0       0.94      0.99      0.97     30424
           1       0.51      0.11      0.17      2171

    accuracy                           0.93     32595
   macro avg       0.72      0.55      0.57     32595
weighted avg       0.91      0.93      0.91     32595

[[30201   223]
 [ 1942   229]]
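
The effect seen above (class-1 recall rising from 0.04 to 0.11 while precision drops from 0.55 to 0.51) is the usual trade-off when the threshold is lowered. The sketch below reproduces the mechanism on ten hand-made probabilities. The helper `precision_recall_at` and the toy arrays are hypothetical, introduced only for illustration.

```python
import numpy as np

def precision_recall_at(y_true, proba_pos, threshold):
    """Precision and recall of class 1 when predicting 1 for proba > threshold."""
    y_pred = (proba_pos > threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels and predicted P(y=1) values (assumption, not model output).
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba_demo = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.25, 0.45, 0.55, 0.90])

p50, r50 = precision_recall_at(y_true_demo, proba_demo, 0.5)
p30, r30 = precision_recall_at(y_true_demo, proba_demo, 0.3)
print(p50, r50)
print(p30, r30)
```

At 0.5 the toy example finds 2 of the 4 positives with 1 false alarm; at 0.3 it finds 3 of 4 but raises 3 false alarms, so recall goes up from 0.5 to 0.75 while precision falls from about 0.67 to 0.5.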


In [39]:
## Decision Tree
## your code here

# Get predicted probabilities from the decision tree's predict_proba method
y_pred_proba = tree.predict_proba(X_test_std)

# Set the threshold to 0.3
threshold = 0.3

# Decide the final class by comparing P(y=1) with the threshold
y_pred_adjusted = (y_pred_proba[:, 1] > threshold).astype(int)

# Recompute the evaluation metrics
classification_report_adjusted = classification_report(y_test, y_pred_adjusted)
confusion_matrix_adjusted = confusion_matrix(y_test, y_pred_adjusted)

# Classification report
print(classification_report_adjusted)
# Confusion matrix
print(confusion_matrix_adjusted)

              precision    recall  f1-score   support

           0       0.95      0.98      0.97     30424
           1       0.52      0.24      0.32      2171

    accuracy                           0.93     32595
   macro avg       0.73      0.61      0.64     32595
weighted avg       0.92      0.93      0.92     32595

[[29948   476]
 [ 1660   511]]


In [44]:
## Random Forest
## your code here

# Get predicted probabilities from the random forest's predict_proba method
y_pred_proba = forest.predict_proba(X_test_std)

# Set the threshold to 0.3
threshold = 0.3

# Decide the final class by comparing P(y=1) with the threshold
y_pred_adjusted = (y_pred_proba[:, 1] > threshold).astype(int)

# Recompute the evaluation metrics
classification_report_adjusted = classification_report(y_test, y_pred_adjusted)
confusion_matrix_adjusted = confusion_matrix(y_test, y_pred_adjusted)

# Classification report
print(classification_report_adjusted)
# Confusion matrix
print(confusion_matrix_adjusted)

              precision    recall  f1-score   support

           0       0.95      0.96      0.96     30424
           1       0.39      0.33      0.36      2171

    accuracy                           0.92     32595
   macro avg       0.67      0.65      0.66     32595
weighted avg       0.92      0.92      0.92     32595

[[29321  1103]
 [ 1460   711]]


In [59]:
## SVM
## your code here

# Get predicted probabilities from the SVM's predict_proba method
y_pred_proba = svm.predict_proba(X_test_std)

# Set the threshold to 0.3
threshold = 0.3

# Decide the final class by comparing P(y=1) with the threshold
y_pred_adjusted = (y_pred_proba[:, 1] > threshold).astype(int)

# Recompute the evaluation metrics
classification_report_adjusted = classification_report(y_test, y_pred_adjusted)
confusion_matrix_adjusted = confusion_matrix(y_test, y_pred_adjusted)

# Classification report
print(classification_report_adjusted)
# Confusion matrix
print(confusion_matrix_adjusted)

              precision    recall  f1-score   support

           0       0.93      1.00      0.97     30424
           1       0.53      0.01      0.02      2171

    accuracy                           0.93     32595
   macro avg       0.73      0.50      0.49     32595
weighted avg       0.91      0.93      0.90     32595

[[30406    18]
 [ 2151    20]]


In [54]:
## KNN
## your code here

# Get predicted probabilities from the KNN model's predict_proba method
y_pred_proba = knn.predict_proba(X_test_std)

# Set the threshold to 0.3
threshold = 0.3

# Decide the final class by comparing P(y=1) with the threshold
y_pred_adjusted = (y_pred_proba[:, 1] > threshold).astype(int)

# Recompute the evaluation metrics
classification_report_adjusted = classification_report(y_test, y_pred_adjusted)
confusion_matrix_adjusted = confusion_matrix(y_test, y_pred_adjusted)

# Classification report
print(classification_report_adjusted)
# Confusion matrix
print(confusion_matrix_adjusted)

              precision    recall  f1-score   support

           0       0.95      0.97      0.96     30424
           1       0.35      0.26      0.30      2171

    accuracy                           0.92     32595
   macro avg       0.65      0.61      0.63     32595
weighted avg       0.91      0.92      0.91     32595

[[29403  1021]
 [ 1614   557]]
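
The five threshold cells above differ only in the model they call, so the repeated steps could be folded into one helper and a loop. The sketch below is one possible shape for that, run on synthetic data with two of the model types. The helper name `recall_at_threshold`, the `models` dictionary and the data are assumptions made for illustration, not part of the original solution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def recall_at_threshold(model, X, y, threshold):
    """Class-1 recall when predicting 1 wherever P(y=1) > threshold."""
    proba_pos = model.predict_proba(X)[:, 1]
    return recall_score(y, (proba_pos > threshold).astype(int))

# Synthetic, imbalanced stand-in data (assumption): roughly 10% positives.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# For each model, store recall at the default 0.5 and at the stricter 0.3.
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = (
        recall_at_threshold(model, X_te, y_te, 0.5),
        recall_at_threshold(model, X_te, y_te, 0.3),
    )

print(results)
```

Lowering the threshold can only add predicted positives, so recall at 0.3 is never below recall at 0.5 for the same model; how much it rises depends on the model's probability estimates.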
