# Let's Fight Monsters Together: Credit Scoring Exercises

---
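## Assignment Instructions

- Answering steps:
    - When answering, **keep every step** of your work; do not give only the final answer
    - Make a habit of commenting your code

- Solution hints:
    - To help everyone understand the problems accurately and get the most out of the hands-on exercises, this document provides a hint for each problem
    - The hints are **for reference only**; original solutions are encouraged
    - To encourage you to think for yourself, the hint text is set in **white**; if needed, drag the mouse from just after the colon to reveal it

- Data used:
    - After loading the dataset, first **inspect the data and understand its basic properties**; later problems will not repeat this reminder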

## Machine Learning for Credit Scoring


Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years. [Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)

Attribute Information:

|Variable Name	|	Description	|	Type|
|----|----|----|
|SeriousDlqin2yrs	|	Person experienced 90 days past due delinquency or worse 	|	Y/N|
|RevolvingUtilizationOfUnsecuredLines	|	Total balance on credit divided by the sum of credit limits	|	percentage|
|age	|	Age of borrower in years	|	integer|
|NumberOfTime30-59DaysPastDueNotWorse	|	Number of times borrower has been 30-59 days past due |	integer|
|DebtRatio	|	Monthly debt payments, alimony, and living costs divided by monthly gross income	|	percentage|
|MonthlyIncome	|	Monthly income	|	real|
|NumberOfOpenCreditLinesAndLoans	|	Number of Open loans |	integer|
|NumberOfTimes90DaysLate	|	Number of times borrower has been 90 days or more past due.	|	integer|
|NumberRealEstateLoansOrLines	|	Number of mortgage and real estate loans	|	integer|
|NumberOfTime60-89DaysPastDueNotWorse	|	Number of times borrower has been 60-89 days past due |integer|
|NumberOfDependents	|	Number of dependents in family	|	integer|


----------
## Read the data into Pandas 

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)
import zipfile
with zipfile.ZipFile('KaggleCredit2.csv.zip', 'r') as z:
    f = z.open('KaggleCredit2.csv')
    data = pd.read_csv(f, index_col=0)
data.head()

Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,0.766127,45.0,2.0,0.802982,9120.0,13.0,0.0,6.0,0.0,2.0
1,0,0.957151,40.0,0.0,0.121876,2600.0,4.0,0.0,0.0,0.0,1.0
2,0,0.65818,38.0,1.0,0.085113,3042.0,2.0,1.0,0.0,0.0,0.0
3,0,0.23381,30.0,0.0,0.03605,3300.0,5.0,0.0,0.0,0.0,0.0
4,0,0.907239,49.0,1.0,0.024926,63588.0,7.0,0.0,1.0,0.0,0.0


In [2]:
data.shape

(112915, 11)

------------
## Drop na

In [3]:
data.isnull().sum(axis=0)

SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

In [4]:
data.dropna(inplace=True)
data.shape

(108648, 11)

---------
## Create X and y

In [5]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

In [26]:
X

Unnamed: 0,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,0.766127,45.0,2.0,0.802982,9120.0,13.0,0.0,6.0,0.0,2.0
1,0.957151,40.0,0.0,0.121876,2600.0,4.0,0.0,0.0,0.0,1.0
2,0.658180,38.0,1.0,0.085113,3042.0,2.0,1.0,0.0,0.0,0.0
3,0.233810,30.0,0.0,0.036050,3300.0,5.0,0.0,0.0,0.0,0.0
4,0.907239,49.0,1.0,0.024926,63588.0,7.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
112910,0.385742,50.0,0.0,0.404293,3400.0,7.0,0.0,0.0,0.0,0.0
112911,0.040674,74.0,0.0,0.225131,2100.0,4.0,0.0,1.0,0.0,0.0
112912,0.299745,44.0,0.0,0.716562,5584.0,4.0,0.0,1.0,0.0,2.0
112913,0.000000,30.0,0.0,0.000000,5716.0,4.0,0.0,0.0,0.0,0.0


---
## Exercise 1: Split the data into training and test sets
- Hint: <span style='color:white'>from sklearn.model_selection import train_test_split</span>

In [7]:
## your code here
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=0)

# Check the shapes of the resulting splits
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((76053, 10), (32595, 10), (76053,), (32595,))
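
Only a small share of borrowers in this dataset are delinquent, so a purely random split can leave the training and test sets with slightly different positive rates. The following is a minimal sketch, using synthetic labels rather than the Kaggle data, of how the `stratify` argument of `train_test_split` keeps the class ratio the same in both splits. The names `X_demo` and `y_demo` are illustrative and not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the credit data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:70] = 1  # 7% positives

# stratify=y_demo asks for the same positive rate in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print('train positive rate:', y_tr.mean())
print('test positive rate: ', y_te.mean())
```

Whether stratification changes the results on the real data is something to verify rather than assume; with more than 100,000 rows the unstratified split is probably already close.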

In [27]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
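sc = StandardScaler()  # standardizes each column to zero mean and unit variance
sc.fit(X_train)  # learn the column means and standard deviations from the training set only
# Apply the standardization to both sets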
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

In [9]:
# Check
np.mean(X_train_std, axis=0), np.std(X_train_std, axis=0)

(array([-3.73709248e-18, -2.42350447e-16,  2.25159822e-17, -2.42537302e-16,
         2.57392244e-17,  2.70939205e-17,  1.08842818e-17, -3.19521407e-17,
        -8.40845808e-18,  1.21455506e-17]),
 array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))
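
The check above confirms that the training columns end up with mean 0 and standard deviation 1. As a sketch of what `StandardScaler` does under the hood, the snippet below reproduces the transform by hand on synthetic data (the arrays `train_demo` and `test_demo` are made up for illustration). Note that the test set is scaled with the training statistics, so its own mean and standard deviation will generally be close to, but not exactly, 0 and 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X_train / X_test (illustrative only).
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
test_demo = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

scaler = StandardScaler().fit(train_demo)

# Manual equivalent: subtract the training mean and divide by the
# training standard deviation (population std, ddof=0).
mu = train_demo.mean(axis=0)
sigma = train_demo.std(axis=0)
manual_test = (test_demo - mu) / sigma

print(np.allclose(scaler.transform(test_demo), manual_test))
```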

----
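## Exercise 2: Classify with sklearn algorithms such as logistic regression, decision tree, SVM, and KNN
Consult the sklearn API to learn what the model parameters mean, and try adjusting them.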

### Logistic regression
- Hint: <span style='color:white'>from sklearn import linear_model</span>

In [11]:
## your code here
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

In [12]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=1000.0, max_iter=100, random_state=0)
lr.fit(X_train_std, y_train)
lr.predict_proba(X_test_std[0, :].reshape(1, -1))

array([[0.96849652, 0.03150348]])

In [13]:
[f'{i:.4f}' for i in _[0]]

['0.9685', '0.0315']

In [14]:
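# This function first subtracts the maximum from the input array (for numerical stability),
# then applies the exponential and normalizes the result so that all elements sum to 1.
# The output of softmax can therefore be interpreted as a probability distribution.
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)
# Check. Note: the binary model yields a single logit here, and softmax over a
# single value always returns 1; sigmoid(logit) is what recovers P(y=1).
softmax((X_test_std[:1] @ lr.coef_.T + lr.intercept_)[0])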

array([1.])
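
The check above returns `array([1.])` because a softmax over one number divides that number's exponential by itself. The sketch below, which uses a made-up logit `z` rather than the fitted model, shows the relationship between the two functions in the binary case: a softmax over the pair of logits `[0, z]` gives the same positive-class probability as `sigmoid(z)`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

z = -3.4  # hypothetical logit for the positive class

# One logit: softmax normalizes a single value by itself, so the result is 1.
single = softmax(np.array([z]))

# Two logits [0, z]: the second entry equals sigmoid(z).
pair = softmax(np.array([0.0, z]))

print(single)
print(pair, sigmoid(z))
```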

### Decision Tree
- Hint: <span style='color:white'>from sklearn.tree import DecisionTreeClassifier</span>

In [36]:
## your code here
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion='entropy', max_depth=1, random_state=0)
tree.fit(X_train_std, y_train)

y_pred_tree = tree.predict(X_test_std)
print('Misclassified samples: %d' % (y_test != y_pred_tree).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred_tree))




Misclassified samples: 2171
Accuracy: 0.93


### Random Forest
- Hint: <span style='color:white'>from sklearn.ensemble import RandomForestClassifier</span>

In [37]:
## your code here
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(criterion='entropy', 
                                n_estimators=10, # The number of trees in the forest.
                                random_state=1,
                                n_jobs=2)
forest.fit(X_train_std, y_train)

y_pred_forest = forest.predict(X_test_std)
print('Misclassified samples: %d' % (y_test != y_pred_forest).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred_forest))

Misclassified samples: 2193
Accuracy: 0.93


### SVM
- Hint: <span style='color:white'>from sklearn.svm import SVC</span>

In [38]:
## your code here
from sklearn.svm import SVC
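# With kernel='linear', the linear kernel is used: the kernel value is the inner product between the training samples and the sample being evaluated, computed directly by matrix multiplication.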
svm = SVC(kernel='linear', C=1.0, random_state=0)
svm.fit(X_train_std, y_train)

In [39]:
y_pred_svm = svm.predict(X_test_std)
print('SVM_linear:')
print('Misclassified samples: %d' % (y_test != y_pred_svm).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred_svm))

SVM_linear:
Misclassified samples: 2169
Accuracy: 0.93


In [40]:
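# With kernel='rbf', the radial basis function (RBF) kernel is used: the Euclidean distance between each training sample and the sample being evaluated is computed, and the Gaussian kernel value follows from the width parameter (sigma in the textbook form; sklearn exposes it as gamma).
svm = SVC(kernel='rbf', random_state=0, gamma=0.10, C=10.0) # Gaussian kernel; rbf: Radial Basis Function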
svm.fit(X_train_std, y_train)
y_pred_svm = svm.predict(X_test_std)
print('SVM_RBF_C=10.0:')
print('Misclassified samples: %d' % (y_test != y_pred_svm).sum())
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred_svm))

SVM_RBF_C=10.0:
Misclassified samples: 2084
Accuracy: 0.94


In [41]:
# Change the value of the parameter C
svm = SVC(kernel='rbf', random_state=0, gamma=0.10, C=1.0) 
svm.fit(X_train_std, y_train)
y_pred_svm = svm.predict(X_test_std)
print('SVM_RBF_C=1.0:')
print('Misclassified samples: %d' % (y_test != y_pred_svm).sum())
print('Accuracy: %.4f' % accuracy_score(y_test, y_pred_svm))

SVM_RBF_C=1.0:
Misclassified samples: 2128
Accuracy: 0.9347


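Overall, the differences are small: the three SVM settings misclassify between 2084 and 2169 of the 32,595 test samples.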

### KNN
- Hint: <span style='color:white'>from sklearn.neighbors import KNeighborsClassifier</span>

In [42]:
## your code here
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
knn.fit(X_train_std, y_train)

y_pred_knn = knn.predict(X_test_std)
print('KNN_N=5:')
print('Misclassified samples: %d' % (y_test != y_pred_knn).sum())
print('Accuracy: %.4f' % accuracy_score(y_test, y_pred_knn))

KNN_N=5:
Misclassified samples: 2205
Accuracy: 0.9324


In [43]:
knn = KNeighborsClassifier(n_neighbors=2, p=2, metric='minkowski')
knn.fit(X_train_std, y_train)

y_pred_knn = knn.predict(X_test_std)
print('KNN_N=2:')
print('Misclassified samples: %d' % (y_test != y_pred_knn).sum())
print('Accuracy: %.4f' % accuracy_score(y_test, y_pred_knn))

KNN_N=2:
Misclassified samples: 2216
Accuracy: 0.9320


In [44]:
knn = KNeighborsClassifier(n_neighbors=10, p=2, metric='minkowski')
knn.fit(X_train_std, y_train)

y_pred_knn = knn.predict(X_test_std)
print('KNN_N=10:')
print('Misclassified samples: %d' % (y_test != y_pred_knn).sum())
print('Accuracy: %.4f' % accuracy_score(y_test, y_pred_knn))

KNN_N=10:
Misclassified samples: 2131
Accuracy: 0.9346


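Overall, the differences are small: with n_neighbors set to 2, 5, and 10, the model misclassifies 2216, 2205, and 2131 test samples respectively.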
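
One reason every model lands near 93% accuracy may be class imbalance: a classifier that predicts "no delinquency" for everyone is already right for the large majority of rows. The depth-1 decision tree above misclassifies 2171 test samples, which equals the number of positive cases in the test set according to the confusion matrices later in this notebook, suggesting it may be predicting the majority class for every row. The sketch below illustrates the idea with sklearn's `DummyClassifier` on synthetic labels (the 7% positive rate and the names `X_demo` and `y_demo` are assumptions for illustration, not the real data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, imbalanced stand-in for the credit data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:70] = 1  # 7% positives

# Baseline that always predicts the most frequent class (here: 0).
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

base_acc = accuracy_score(y_demo, y_base)
base_rec = recall_score(y_demo, y_base)
print('Baseline accuracy: %.4f' % base_acc)
print('Baseline recall:   %.4f' % base_rec)
```

A model is only useful here if it beats this baseline on metrics such as recall, which is why accuracy alone is a weak yardstick for this task.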

---

## Exercise 3: Predict on the test set and compute the accuracy

### Logistic regression
- Hint: <span style='color:white'>y_pred_LR = clf_LR.predict(x_test)</span>

In [46]:
## your code here
lr = LogisticRegression(C=100.0**40, random_state=0, penalty='l2')  # C = 1e80, so the L2 penalty is effectively disabled
# Fit the model
lr.fit(X_train_std, y_train)

y_train_pred = lr.predict(X_train_std)  # predictions on the training set
y_test_pred = lr.predict(X_test_std)    # predictions on the test set

# Number of misclassified samples in the test set
print('Misclassified samples in test set: %d' % (y_test != y_test_pred).sum())

from sklearn.metrics import accuracy_score
print('Training accuracy: %.4f' % accuracy_score(y_train, y_train_pred))  # accuracy on the training set
print('Test accuracy: %.4f' % accuracy_score(y_test, y_test_pred))  # accuracy on the test set


Misclassified samples in test set: 2154
Training accuracy: 0.9331
Test accuracy: 0.9339


### Decision Tree
- Hint: <span style='color:white'>y_pred_tree = tree.predict(x_test)</span>

In [52]:
## your code here
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion='entropy', max_depth=1, random_state=0)
tree.fit(X_train_std, y_train)


y_test_pred = tree.predict(X_test_std)
y_train_pred = tree.predict(X_train_std)  

print('Misclassified samples: %d' % (y_test != y_test_pred).sum())
print('Training accuracy: %.4f' % accuracy_score(y_train, y_train_pred)) 
print('Accuracy: %.4f' % accuracy_score(y_test, y_test_pred))

Misclassified samples: 2171
Training accuracy: 0.9322
Accuracy: 0.9334


### Random Forest
- Hint: <span style='color:white'>y_pred_forest = forest.predict(x_test)</span>

In [51]:
## your code here
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(criterion='entropy', 
                                n_estimators=10, # The number of trees in the forest.
                                random_state=1,
                                n_jobs=2)
forest.fit(X_train_std, y_train)

y_test_pred = forest.predict(X_test_std)
y_train_pred = forest.predict(X_train_std)  

print('Misclassified samples: %d' % (y_test != y_test_pred).sum())
print('Training accuracy: %.4f' % accuracy_score(y_train, y_train_pred)) 
print('Accuracy: %.4f' % accuracy_score(y_test, y_test_pred))

Misclassified samples: 2193
Training accuracy: 0.9905
Accuracy: 0.9327


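The random forest fits the training set far better than the decision tree (training accuracy 0.9905 vs. 0.9322), but it does no better on the test set (2193 vs. 2171 misclassified samples), which points to overfitting.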

### SVM
- Hint: <span style='color:white'>y_pred_SVC = clf_svc.predict(x_test)</span>

In [53]:
## your code here
from sklearn.svm import SVC

svm = SVC(kernel='rbf', random_state=0, gamma=0.10, C=1.0)
svm.fit(X_train_std, y_train)

y_test_pred = svm.predict(X_test_std)
y_train_pred = svm.predict(X_train_std) 

print('Misclassified samples: %d' % (y_test != y_test_pred).sum())
print('Training accuracy: %.4f' % accuracy_score(y_train, y_train_pred)) 
print('Accuracy: %.4f' % accuracy_score(y_test, y_test_pred))

Misclassified samples: 2128
Training accuracy: 0.9343
Accuracy: 0.9347


### KNN
- Hint: <span style='color:white'>y_pred_KNN = neigh.predict(x_test)</span>

In [54]:
## your code here
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
knn.fit(X_train_std, y_train)

y_test_pred = knn.predict(X_test_std)
y_train_pred = knn.predict(X_train_std) 
print('KNN_N=5:')
print('Misclassified samples: %d' % (y_test != y_test_pred).sum())
print('Training accuracy: %.4f' % accuracy_score(y_train, y_train_pred)) 
print('Accuracy: %.4f' % accuracy_score(y_test, y_test_pred))

KNN_N=5:
Misclassified samples: 2205
Training accuracy: 0.9409
Accuracy: 0.9324


---
## Exercise 4: Read the official sklearn documentation on evaluation metrics for classification, and evaluate this example

**Learning links on the confusion matrix**

- Blog:<br>
http://blog.csdn.net/vesper305/article/details/44927047<br>
- WiKi:<br>
http://en.wikipedia.org/wiki/Confusion_matrix<br>
- sklearn doc:<br>
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

In [55]:
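## your code here
from sklearn.metrics import confusion_matrix
# Evaluate the logistic regression model from Exercise 3; the counts printed
# below (2154 errors in total) correspond to its test-set predictions.
y_test_pred = lr.predict(X_test_std)
confusion = confusion_matrix(y_test, y_test_pred)
print(confusion)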
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
print("TP:", TP)
print("TN:", TN)
print("FP:", FP)
print("FN:", FN)

[[30349    75]
 [ 2079    92]]
TP: 92
TN: 30349
FP: 75
FN: 2079


In [67]:
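print('Accuracy (recognition rate): ', (TP+TN) / float(TP+TN+FN+FP))  # share of samples classified correctly

print('Error rate (misclassification rate): ', (FP+FN) / float(TP+TN+FN+FP))  # share of samples misclassified

print('Sensitivity (recall):', TP / float(TP+FN))  # share of actual positives identified correctly

print('Specificity (true negative rate):', TN / float(TN+FP))  # share of actual negatives identified correctly

print('False positive rate:', FP / float(TN+FP))  # share of actual negatives wrongly flagged as positive

print('Precision: ', TP / float(TP+FP))  # share of predicted positives that are actually positive

Accuracy (recognition rate):  0.9339162448228255
Error rate (misclassification rate):  0.06608375517717441
Sensitivity (recall): 0.04237678489175495
Specificity (true negative rate): 0.9975348409150671
False positive rate: 0.0024651590849329476
Precision:  0.5508982035928144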
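
The hand-computed ratios can be cross-checked against sklearn's own metric functions. The sketch below rebuilds label vectors from the four counts in the confusion matrix printed above (so it does not need the fitted model) and confirms that `precision_score` and `recall_score` agree with the manual formulas. The variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Counts copied from the confusion matrix printed above.
tn, fp, fn, tp = 30349, 75, 2079, 92

# Rebuild label vectors that produce exactly these counts.
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

# For binary labels, ravel() flattens the 2x2 matrix as TN, FP, FN, TP.
tn2, fp2, fn2, tp2 = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(f'precision={precision:.4f}, recall={recall:.4f}')
```

The low recall is the number to watch: the model flags only a small fraction of the borrowers who actually become delinquent.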


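## Exercise 5: Adjust the model's decision criterion

Banks usually have stricter requirements, because the consequences of fraud are typically severe, so we generally adjust the model's decision criterion.<br>

For example, in logistic regression the default decision boundary on the predicted probability is 0.5, but we can set the threshold lower to make the model more "sensitive". Try setting the threshold to 0.3 and look at the evaluation metrics again (mainly accuracy and recall).

- Hint: <span style='color:white'>Many sklearn classifiers expose predict_proba, which returns the estimated probabilities; compare them against the chosen threshold to decide the final class</span>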

In [74]:
## your code here
from sklearn.metrics import precision_score, recall_score, f1_score
# Use the random forest as the example
forest = RandomForestClassifier(criterion='entropy', 
                                n_estimators=10, # The number of trees in the forest.
                                random_state=1,
                                n_jobs=2)
forest.fit(X_train_std, y_train)

In [78]:
y_test_pred = forest.predict(X_test_std)

conf_matrix = confusion_matrix(y_test, y_test_pred)  # confusion matrix
accuracy = accuracy_score(y_test, y_test_pred)  # accuracy
precision = precision_score(y_test, y_test_pred)  # precision
recall = recall_score(y_test, y_test_pred)  # recall
f1 = f1_score(y_test, y_test_pred)  # F1 score (computed but not printed below)

print("Confusion matrix:")
print(conf_matrix)
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")

Confusion matrix:
[[30084   340]
 [ 1853   318]]
Accuracy: 0.9327
Precision: 0.4833
Recall: 0.1465


In [91]:
# Try different thresholds
y_proba = forest.predict_proba(X_test_std)[:, 1]  # probability of the positive class
for i in range(1, 8):
    threshold = 0.25 + i / 10
    y_pred_adjusted = (y_proba > threshold).astype(int)
    print('threshold = ', threshold)
    # Recompute accuracy, precision, and recall at the adjusted threshold
    conf_matrix_adjusted = confusion_matrix(y_test, y_pred_adjusted)
    accuracy_adjusted = accuracy_score(y_test, y_pred_adjusted)
    recall_adjusted = recall_score(y_test, y_pred_adjusted)
    precision_adjusted = precision_score(y_test, y_pred_adjusted)
    print("Confusion matrix: ")
    print(conf_matrix_adjusted)
    print(f"Adjusted accuracy: {accuracy_adjusted:.4f}")
    print(f"Adjusted precision: {precision_adjusted:.4f}")
    print(f"Adjusted recall: {recall_adjusted:.4f}")

threshold =  0.35
Confusion matrix: 
[[29321  1103]
 [ 1460   711]]
Adjusted accuracy: 0.9214
Adjusted precision: 0.3920
Adjusted recall: 0.3275
threshold =  0.45
Confusion matrix: 
[[29786   638]
 [ 1687   484]]
Adjusted accuracy: 0.9287
Adjusted precision: 0.4314
Adjusted recall: 0.2229
threshold =  0.55
Confusion matrix: 
[[30084   340]
 [ 1853   318]]
Adjusted accuracy: 0.9327
Adjusted precision: 0.4833
Adjusted recall: 0.1465
threshold =  0.65
Confusion matrix: 
[[30264   160]
 [ 2003   168]]
Adjusted accuracy: 0.9336
Adjusted precision: 0.5122
Adjusted recall: 0.0774
threshold =  0.75
Confusion matrix: 
[[30363    61]
 [ 2094    77]]
Adjusted accuracy: 0.9339
Adjusted precision: 0.5580
Adjusted recall: 0.0355
threshold =  0.85
Confusion matrix: 
[[30402    22]
 [ 2144    27]]
Adjusted accuracy: 0.9335
Adjusted precision: 0.5510
Adjusted recall: 0.0124
threshold =  0.95
Confusion matrix: 
[[30419     5]
 [ 2162     9]]
Adjusted accuracy: 0.9335
Adjusted precision: 0.6429
Adjusted recall: 0.0041


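As the threshold increases, accuracy rises gradually and peaks at a threshold of 0.75 (0.9339), then dips slightly. Recall moves the other way, falling from 0.3275 at a threshold of 0.35 to 0.0041 at 0.95, so a lower threshold catches more delinquent borrowers at the cost of more false alarms.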
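
The exercise statement suggests trying a threshold of 0.3 with logistic regression. The following is a minimal sketch of that idea on synthetic data generated with `make_classification` (the dataset, the 10% positive rate, and the names `X_demo`, `y_demo`, and `clf` are assumptions for illustration, not the notebook's variables). It compares recall at the default 0.5 cutoff with recall at 0.3.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (illustrative only; not the Kaggle dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # estimated P(y = 1)

# Lowering the threshold can only add predicted positives,
# so recall cannot decrease.
recalls = {}
for threshold in (0.5, 0.3):
    y_hat = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    print(f'threshold={threshold}: recall={recalls[threshold]:.4f}')
```

To run the same comparison on the credit data, one would swap in `lr`, `X_test_std`, and `y_test` from the earlier cells; the resulting numbers are not shown here because they were not part of the original run.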