# Let's Level Up: Credit Scoring Exercise

---
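## Assignment instructions

- Answering steps:
    - **Keep every step** of your work when answering; do not give only the final answer
    - Make a habit of commenting your code

- Solution hints:
    - To help everyone understand the problems accurately and get the most out of the exercises, this document provides solution hints
    - The hints are **for reference only**; original approaches are encouraged
    - To encourage independent thinking, the hint text is set in **white**; when needed, drag the mouse from just after the colon to reveal it

- Data used:
    - After loading the dataset, first **inspect and understand its basic properties**; later questions will not repeat this reminder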

## Machine learning for credit scoring


Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years. [Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)

Attribute Information:

|Variable Name	|	Description	|	Type|
|----|----|----|
|SeriousDlqin2yrs	|	Person experienced 90 days past due delinquency or worse 	|	Y/N|
|RevolvingUtilizationOfUnsecuredLines	|	Total balance on credit divided by the sum of credit limits	|	percentage|
|age	|	Age of borrower in years	|	integer|
|NumberOfTime30-59DaysPastDueNotWorse	|	Number of times borrower has been 30-59 days past due |	integer|
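|DebtRatio	|	Monthly debt payments, alimony, and living costs divided by monthly gross income	|	percentage|
|MonthlyIncome	|	Monthly income	|	real|
|NumberOfOpenCreditLinesAndLoans	|	Number of open loans (installment, such as a car loan or mortgage) and lines of credit (such as credit cards) |	integer|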
|NumberOfTimes90DaysLate	|	Number of times borrower has been 90 days or more past due.	|	integer|
|NumberRealEstateLoansOrLines	|	Number of mortgage and real estate loans	|	integer|
|NumberOfTime60-89DaysPastDueNotWorse	|	Number of times borrower has been 60-89 days past due |integer|
|NumberOfDependents	|	Number of dependents in family	|	integer|


----------
## Read the data into Pandas 

In [19]:
import os
print(os.getcwd())  # print the current working directory

# with zipfile.ZipFile('KaggleCredit2.csv.zip', 'r') as z:
#     f = z.open('KaggleCredit2.csv')
#     data = pd.read_csv(f, index_col=0)


import numpy as np

/home/jovyan/ML-2023-12-22


In [2]:
import pandas as pd
pd.set_option('display.max_columns', 500)
import zipfile
with zipfile.ZipFile('KaggleCredit2.csv.zip', 'r') as z:
    f = z.open('KaggleCredit2.csv')
    data = pd.read_csv(f, index_col=0)
data.head()



Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,0.766127,45.0,2.0,0.802982,9120.0,13.0,0.0,6.0,0.0,2.0
1,0,0.957151,40.0,0.0,0.121876,2600.0,4.0,0.0,0.0,0.0,1.0
2,0,0.65818,38.0,1.0,0.085113,3042.0,2.0,1.0,0.0,0.0,0.0
3,0,0.23381,30.0,0.0,0.03605,3300.0,5.0,0.0,0.0,0.0,0.0
4,0,0.907239,49.0,1.0,0.024926,63588.0,7.0,0.0,1.0,0.0,0.0


In [3]:
data.shape

(112915, 11)

------------
## Drop na

In [4]:
data.isnull().sum(axis=0)

SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

In [5]:
data.dropna(inplace=True)
data.shape

(108648, 11)

---------
## Create X and y

In [6]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

In [7]:
y.mean()

0.06742876076872101
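
Since only about 6.7% of the labels are positive, a model that always predicts 0 would already reach roughly 93% accuracy, which is worth keeping in mind when reading the accuracy figures later on. A minimal sketch of that baseline, using a small made-up label vector (`y_toy` is an illustrative stand-in, not the Kaggle data):

```python
import numpy as np

# Hypothetical toy labels: 7 positives out of 100, roughly mirroring the ~6.7% positive rate
y_toy = np.array([1] * 7 + [0] * 93)

positive_rate = y_toy.mean()              # share of class 1
baseline_pred = np.zeros_like(y_toy)      # always predict "no serious delinquency"
baseline_accuracy = (baseline_pred == y_toy).mean()

print("positive rate:", positive_rate)
print("all-zero baseline accuracy:", baseline_accuracy)
```

On the real data the same calculation is simply `1 - y.mean()`.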

---
## Exercise 1: Split the data into training and test sets
- Hint: <span style='color:white'>from sklearn.model_selection import train_test_split</span>

In [8]:
## your code here
from sklearn.model_selection import train_test_split


# Use train_test_split to split the dataset into training and test sets; test_size is the fraction of the data held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print the shapes of the training and test sets
print("Training set X shape:", X_train.shape)
print("Test set X shape:", X_test.shape)
print("Training set y shape:", y_train.shape)
print("Test set y shape:", y_test.shape)


Training set X shape: (76053, 10)
Test set X shape: (32595, 10)
Training set y shape: (76053,)
Test set y shape: (32595,)
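
With a rare positive class, a plain random split can leave the two sets with slightly different positive rates. One possible variation, sketched here on synthetic data (`X_toy` and `y_toy` are hypothetical), is to pass `stratify=` so that both sets keep the same class proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))        # hypothetical features
y_toy = np.array([1] * 70 + [0] * 930)    # 7% positives, chosen for illustration

# stratify=y_toy keeps the positive rate the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

The notebook's own split does not use stratification; this is only an optional refinement.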


----
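## Exercise 2: Classify with sklearn classification algorithms such as logistic regression, decision tree, SVM, KNN, ...
Look up the sklearn API to learn what the model parameters mean, and try different settings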

In [9]:
# # You have to visualize it to actually see it
# from matplotlib.colors import ListedColormap
# import matplotlib.pyplot as plt
# import warnings

# import numpy as np



# def versiontuple(v):
#     return tuple(map(int, (v.split("."))))


# def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):

#     # setup marker generator and color map
#     markers = ('s', 'x', 'o', '^', 'v')
#     colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
#     cmap = ListedColormap(colors[:len(np.unique(y))])

#     # plot the decision surface
#     x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
#     x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
#     xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
#                            np.arange(x2_min, x2_max, resolution))
#     Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
#     Z = Z.reshape(xx1.shape)
#     plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
#     plt.xlim(xx1.min(), xx1.max())
#     plt.ylim(xx2.min(), xx2.max())

#     for idx, cl in enumerate(np.unique(y)):
#         plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
#                     alpha=0.8, c=np.array([cmap(idx)]),
#                     marker=markers[idx], label=cl)

#     # highlight test samples
#     if test_idx:
#         # plot all samples
#         if not versiontuple(np.__version__) >= versiontuple('1.9.0'):
#             X_test, y_test = X[list(test_idx), :], y[list(test_idx)]
#             warnings.warn('Please update to NumPy 1.9.0 or newer')
#         else:
#             X_test, y_test = X[test_idx, :], y[test_idx]

#         plt.scatter(X_test[:, 0],
#                     X_test[:, 1],
#                     alpha=0.15,
#                     linewidths=2,
#                     marker='^',
#                     edgecolors='black',
#                     facecolors='none',
#                     s=55, label='test set')

In [10]:
# from matplotlib.colors import ListedColormap
# import matplotlib.pyplot as plt
# from sklearn.linear_model import SGDRegressor
# import numpy as np

# # Assume X_train, X_test, y_train, y_test are already defined and have 10 features

# def batch_generator(X, y, batch_size=100):
#     num_samples = len(X)
#     indices = np.arange(num_samples)
#     np.random.shuffle(indices)

#     for i in range(0, num_samples, batch_size):
#         batch_indices = indices[i:i + batch_size]
#         yield X.iloc[batch_indices], y.iloc[batch_indices]

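# # Randomly sample the dataset, keeping only part of the data
# sample_size = 1000  # you can adjust the sample size
# random_indices = np.random.choice(len(X_train), sample_size, replace=False)

# # Create a stochastic gradient descent linear regression model
# regressor = SGDRegressor()

# # Train the model by loading the data batch by batch
# batch_size = 100  # you can adjust the batch size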

# for X_batch, y_batch in batch_generator(X_train.iloc[random_indices], y_train.iloc[random_indices], batch_size=batch_size):
#     regressor.partial_fit(X_batch, y_batch)

# # Plot the decision regions
# def plot_decision_regions(X, y, regressor, test_idx=None, resolution=0.02):
#     markers = ('s', 'x')
#     colors = ('red', 'blue')
#     cmap = ListedColormap(colors[:len(np.unique(y))])

#     # Pick two features to plot
#     feature1_idx = 0
#     feature2_idx = 1

#     x1_min, x1_max = X.iloc[:, feature1_idx].min() - 1, X.iloc[:, feature1_idx].max() + 1
#     x2_min, x2_max = X.iloc[:, feature2_idx].min() - 1, X.iloc[:, feature2_idx].max() + 1
#     xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
#                            np.arange(x2_min, x2_max, resolution))
#     Z = regressor.predict(np.array([xx1.ravel(), xx2.ravel()] + [0] * (X.shape[1] - 2)).T)
#     Z = Z.reshape(xx1.shape)
#     plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
#     plt.xlim(xx1.min(), xx1.max())
#     plt.ylim(xx2.min(), xx2.max())

#     for idx, cl in enumerate(np.unique(y)):
#         plt.scatter(x=X[y == cl].iloc[:, feature1_idx], y=X[y == cl].iloc[:, feature2_idx],
#                     alpha=0.8, c=np.array([cmap(idx)]),
#                     marker=markers[idx], label=cl)

#     # Highlight the test samples
#     if test_idx:
#         X_test, y_test = X.iloc[test_idx, :], y.iloc[test_idx]
#         plt.scatter(X_test.iloc[:, feature1_idx],
#                     X_test.iloc[:, feature2_idx],
#                     alpha=0.15,
#                     linewidths=2,
#                     marker='^',
#                     edgecolors='black',
#                     facecolors='none',
#                     s=55, label='test set')

# # Combine the sampled training set and the test set for visualization
# X_combined_sampled = pd.concat([X_train.iloc[random_indices], X_test], ignore_index=True)
# y_combined_sampled = pd.concat([y_train.iloc[random_indices], y_test], ignore_index=True)

# # Plot the decision regions with the function defined above
# plot_decision_regions(X_combined_sampled, y_combined_sampled, regressor=regressor)

# # Add labels and a legend
# plt.xlabel('Feature 1')
# plt.ylabel('Feature 2')
# plt.legend(loc='upper left')
# plt.title('SGDRegressor decision regions')

# # Show the figure
# plt.show()


### Logistic regression
- Hint: <span style='color:white'>from sklearn import linear_model</span>

In [17]:
## your code here

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

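# Create the Logistic Regression model, raising max_iter so the solver has more iterations to converge
model = LogisticRegression(random_state=42, max_iter=1000)

# Fit the model on the training set
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Compute the model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model accuracy:", accuracy)

# Print the classification report
print("Classification report:\n", classification_report(y_test, y_pred))



Model accuracy: 0.9344991563123178
Classification report: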
               precision    recall  f1-score   support

           0       0.94      1.00      0.97     30407
           1       0.63      0.06      0.11      2188

    accuracy                           0.93     32595
   macro avg       0.78      0.53      0.54     32595
weighted avg       0.92      0.93      0.91     32595



In [18]:
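import numpy as np  # needed here; without it this cell raises NameError

# Compute the element-wise difference between y_test and y_pred
difference = y_test - y_pred

# Fraction of non-zero entries in the difference, i.e. the misclassification rate
non_zero_rate_difference = np.sum(difference != 0) / len(difference)
print("Fraction of non-zero differences between y_test and y_pred:", non_zero_rate_difference)
1 - non_zero_rate_difference  # equals the accuracy computed above
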

### Decision Tree
- Hint: <span style='color:white'>from sklearn.tree import DecisionTreeClassifier</span>

In [13]:
## your code here

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Create the Decision Tree model
tree_model = DecisionTreeClassifier(random_state=42)

# Fit the model on the training set
tree_model.fit(X_train, y_train)

# Predict on the test set
y_pred_tree = tree_model.predict(X_test)

# Compute the model accuracy
accuracy_tree = accuracy_score(y_test, y_pred_tree)
print("Decision tree accuracy:", accuracy_tree)

# Print the classification report
print("Classification report:\n", classification_report(y_test, y_pred_tree))


Decision tree accuracy: 0.8936953520478601
Classification report:
               precision    recall  f1-score   support

           0       0.95      0.94      0.94     30407
           1       0.24      0.26      0.25      2188

    accuracy                           0.89     32595
   macro avg       0.59      0.60      0.59     32595
weighted avg       0.90      0.89      0.90     32595



### Random Forest
- Hint: <span style='color:white'>from sklearn.ensemble import RandomForestClassifier</span>

In [14]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Create the Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Fit the model on the training set
rf_model.fit(X_train, y_train)

# Predict on the test set
y_pred_rf = rf_model.predict(X_test)

# Compute the model accuracy
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("Random forest accuracy:", accuracy_rf)

# Print the classification report
print("Classification report:\n", classification_report(y_test, y_pred_rf))


Random forest accuracy: 0.9351741064580457
Classification report:
               precision    recall  f1-score   support

           0       0.94      0.99      0.97     30407
           1       0.55      0.17      0.27      2188

    accuracy                           0.94     32595
   macro avg       0.75      0.58      0.62     32595
weighted avg       0.92      0.94      0.92     32595



### SVM
- Hint: <span style='color:white'>from sklearn.svm import SVC</span>

In [16]:
## your code here
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

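# Create the support vector machine model
# svm_model = SVC(kernel='linear', random_state=42)  # linear-kernel variant
svm_model = SVC(random_state=42)


# Fit the model on the training set
svm_model.fit(X_train, y_train)

# Predict on the test set
y_pred_svm = svm_model.predict(X_test)

# Compute the model accuracy
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print("SVM accuracy:", accuracy_svm)

# Print the classification report
# zero_division=1 reports precision as 1.00 for a class that is never predicted (class 1 here)
print("Classification report:\n", classification_report(y_test, y_pred_svm, zero_division=1))

# Runtime: about 40 s; with standardization: 3 min; with standardization + linear kernel: 6 min


SVM accuracy: 0.9328731400521553
Classification report: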
               precision    recall  f1-score   support

           0       0.93      1.00      0.97     30407
           1       1.00      0.00      0.00      2188

    accuracy                           0.93     32595
   macro avg       0.97      0.50      0.48     32595
weighted avg       0.94      0.93      0.90     32595
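
The class-1 row above (precision 1.00, recall 0.00) is what the report shows when a model never predicts the positive class: precision is then 0/0, and `zero_division` decides which value is displayed. A small sketch with made-up labels (`y_true` and `y_pred` are illustrative only):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 0, 0, 1])   # one actual positive
y_pred = np.zeros(5, dtype=int)      # the model never predicts class 1

# Precision for class 1 is undefined (no positive predictions);
# zero_division only controls the value that gets reported.
precision_shown_as_one = precision_score(y_true, y_pred, zero_division=1)
precision_shown_as_zero = precision_score(y_true, y_pred, zero_division=0)
recall_class_one = recall_score(y_true, y_pred)

print(precision_shown_as_one, precision_shown_as_zero, recall_class_one)
```

Reading the recall column alongside precision avoids mistaking such a row for a good result.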



### KNN
- Hint: <span style='color:white'>from sklearn.neighbors import KNeighborsClassifier</span>

In [16]:
## your code here

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Create the KNN model
knn_model = KNeighborsClassifier()

# Fit the model on the training set
knn_model.fit(X_train, y_train)

# Predict on the test set
y_pred_knn = knn_model.predict(X_test)

# Compute the model accuracy
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print("KNN accuracy:", accuracy_knn)

# Print the classification report
print("Classification report:\n", classification_report(y_test, y_pred_knn))


KNN accuracy: 0.9315232397606995
Classification report:
               precision    recall  f1-score   support

           0       0.93      1.00      0.96     30407
           1       0.33      0.02      0.04      2188

    accuracy                           0.93     32595
   macro avg       0.63      0.51      0.50     32595
weighted avg       0.89      0.93      0.90     32595



---

## Exercise 3: Predict on the test set and compute the accuracy

### Logistic regression
- Hint: <span style='color:white'>y_pred_LR = clf_LR.predict(x_test)</span>

In [17]:
## your code here



# Use train_test_split to split the dataset into training and test sets; test_size is the fraction of the data held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print the shapes of the training and test sets
print("Training set X shape:", X_train.shape)
print("Test set X shape:", X_test.shape)
print("Training set y shape:", y_train.shape)
print("Test set y shape:", y_test.shape)

# Create the Logistic Regression model
clf_LR = LogisticRegression(random_state=42, max_iter=1000)

# Fit the model on the training set
clf_LR.fit(X_train, y_train)

# Predict on the test set
y_pred_LR = clf_LR.predict(X_test)

# Compute the model accuracy
accuracy_LR = accuracy_score(y_test, y_pred_LR)
print("Logistic Regression accuracy:", accuracy_LR)



Training set X shape: (76053, 10)
Test set X shape: (32595, 10)
Training set y shape: (76053,)
Test set y shape: (32595,)


Logistic Regression accuracy: 0.9344991563123178


### Decision Tree
- Hint: <span style='color:white'>y_pred_tree = tree.predict(x_test)</span>

In [18]:
## your code here

# Create the Decision Tree model
clf_tree = DecisionTreeClassifier(random_state=42)

# Fit the model on the training set
clf_tree.fit(X_train, y_train)

# Predict on the test set
y_pred_tree = clf_tree.predict(X_test)

# Compute the model accuracy
accuracy_tree = accuracy_score(y_test, y_pred_tree)
print("Decision Tree accuracy:", accuracy_tree)


Decision Tree accuracy: 0.8936953520478601


### Random Forest
- Hint: <span style='color:white'>y_pred_forest = forest.predict(x_test)</span>

In [19]:
## your code here

# Create the Random Forest model
clf_forest = RandomForestClassifier(random_state=42)

# Fit the model on the training set
clf_forest.fit(X_train, y_train)

# Predict on the test set
y_pred_forest = clf_forest.predict(X_test)

# Compute the model accuracy
accuracy_forest = accuracy_score(y_test, y_pred_forest)
print("Random Forest accuracy:", accuracy_forest)


Random Forest accuracy: 0.9351741064580457


### SVM
- Hint: <span style='color:white'>y_pred_SVC = clf_svc.predict(x_test)</span>

In [20]:
## your code here

# Create the SVM model
clf_svc = SVC(random_state=42)

# Fit the model on the training set
clf_svc.fit(X_train, y_train)

# Predict on the test set
y_pred_svc = clf_svc.predict(X_test)

# Compute the model accuracy
accuracy_svc = accuracy_score(y_test, y_pred_svc)
print("SVM accuracy:", accuracy_svc)


SVM accuracy: 0.9328731400521553


### KNN
- Hint: <span style='color:white'>y_pred_KNN = neigh.predict(x_test)</span>

In [21]:
## your code here

# Create the KNN model
neigh = KNeighborsClassifier()

# Fit the model on the training set
neigh.fit(X_train, y_train)

# Predict on the test set
y_pred_KNN = neigh.predict(X_test)

# Compute the model accuracy
accuracy_KNN = accuracy_score(y_test, y_pred_KNN)
print("KNN accuracy:", accuracy_KNN)


KNN accuracy: 0.9315232397606995
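
The five cells in this exercise all follow the same fit, predict, score pattern. As a possible shortcut, the sketch below loops over a dictionary of classifiers and collects their accuracies. It runs on synthetic data from `make_classification`; the names `models`, `scores`, `X_toy`, and `y_toy` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: 10 features, like the credit dataset)
X_toy, y_toy = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

models = {
    "logistic_regression": LogisticRegression(random_state=42, max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "knn": KNeighborsClassifier(),
}

scores = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)                                        # train on the training split
    scores[name] = accuracy_score(y_te, clf.predict(X_te))    # accuracy on the test split

for name, acc in scores.items():
    print(f"{name}: {acc:.4f}")
```

The same loop would accept `RandomForestClassifier` or `SVC` entries; they are left out here only to keep the sketch fast.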


---
## Exercise 4: Read the official sklearn documentation on evaluation metrics for classification, and evaluate this example

**Learning links on the confusion matrix**

- Blog:<br>
http://blog.csdn.net/vesper305/article/details/44927047<br>
- WiKi:<br>
http://en.wikipedia.org/wiki/Confusion_matrix<br>
- sklearn doc:<br>
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

Logistic Regression

In [21]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [23]:
## your code here
from sklearn.linear_model import LogisticRegression


# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create the Logistic Regression model
logreg_model = LogisticRegression(random_state=42)

# Fit the model on the training set
logreg_model.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred_logreg = logreg_model.predict(X_test_scaled)

# Build the confusion matrix
conf_matrix_logreg = confusion_matrix(y_test, y_pred_logreg)
print("Logistic Regression confusion matrix:\n", conf_matrix_logreg)

# Compute the accuracy
accuracy_logreg = accuracy_score(y_test, y_pred_logreg)
print("Logistic Regression accuracy:", accuracy_logreg)

# Print the classification report
print("Logistic Regression classification report:\n", classification_report(y_test, y_pred_logreg))


# Reading the output (sklearn puts actual classes in rows and predicted classes in columns)
# True Negative (TN): 30342 (negatives correctly predicted as negative)
# False Positive (FP): 65 (negatives wrongly predicted as positive)
# False Negative (FN): 2090 (positives wrongly predicted as negative)
# True Positive (TP): 98 (positives correctly predicted as positive)


#                     Predicted Negative     Predicted Positive
# Actual Negative          TN                    FP
# Actual Positive          FN                    TP

Logistic Regression confusion matrix:
 [[30342    65]
 [ 2090    98]]
Logistic Regression accuracy: 0.9338855652707471
Logistic Regression classification report:
               precision    recall  f1-score   support

           0       0.94      1.00      0.97     30407
           1       0.60      0.04      0.08      2188

    accuracy                           0.93     32595
   macro avg       0.77      0.52      0.52     32595
weighted avg       0.91      0.93      0.91     32595
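
For binary labels, `confusion_matrix` puts actual classes in rows and predicted classes in columns, so `.ravel()` returns the counts in the order TN, FP, FN, TP. The sketch below checks this on a tiny hand-made example (the label lists are illustrative) and derives precision and recall from the four counts:

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels: 3 true negatives, 1 false positive, 1 false negative, 1 true positive
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("precision:", precision, "recall:", recall)
```

Applying the same formulas to the matrix above gives a class-1 recall of 98 / (98 + 2090), which matches the 0.04 in the report.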



Decision Tree

In [24]:
# Decision tree
from sklearn.tree import DecisionTreeClassifier


# Create the decision tree model
tree_classifier = DecisionTreeClassifier(random_state=42)

# Fit the model on the training set
tree_classifier.fit(X_train, y_train)

# Predict on the test set
y_pred_tree = tree_classifier.predict(X_test)

# Build the confusion matrix
conf_matrix_tree = confusion_matrix(y_test, y_pred_tree)
print("Decision tree confusion matrix:\n", conf_matrix_tree)

# Compute the accuracy
accuracy_tree = accuracy_score(y_test, y_pred_tree)
print("Decision tree accuracy:", accuracy_tree)

# Print the classification report
print("Decision tree classification report:\n", classification_report(y_test, y_pred_tree))

Decision tree confusion matrix:
 [[28563  1844]
 [ 1621   567]]
Decision tree accuracy: 0.8936953520478601
Decision tree classification report:
               precision    recall  f1-score   support

           0       0.95      0.94      0.94     30407
           1       0.24      0.26      0.25      2188

    accuracy                           0.89     32595
   macro avg       0.59      0.60      0.59     32595
weighted avg       0.90      0.89      0.90     32595



Random Forest

In [25]:
# Random forest
from sklearn.ensemble import RandomForestClassifier


# Create the random forest model
forest_classifier = RandomForestClassifier(random_state=42)

# Fit the model on the training set
forest_classifier.fit(X_train, y_train)

# Predict on the test set
y_pred_forest = forest_classifier.predict(X_test)

# Build the confusion matrix
conf_matrix_forest = confusion_matrix(y_test, y_pred_forest)
print("Random forest confusion matrix:\n", conf_matrix_forest)

# Compute the accuracy
accuracy_forest = accuracy_score(y_test, y_pred_forest)
print("Random forest accuracy:", accuracy_forest)

# Print the classification report
print("Random forest classification report:\n", classification_report(y_test, y_pred_forest))

Random forest confusion matrix:
 [[30100   307]
 [ 1806   382]]
Random forest accuracy: 0.9351741064580457
Random forest classification report:
               precision    recall  f1-score   support

           0       0.94      0.99      0.97     30407
           1       0.55      0.17      0.27      2188

    accuracy                           0.94     32595
   macro avg       0.75      0.58      0.62     32595
weighted avg       0.92      0.94      0.92     32595



SVM

In [22]:
# SVM
from sklearn.svm import SVC

# Summary of runtimes: about 40 s; with standardization: 3 min; with standardization + linear kernel: 6 min.

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create the support vector machine model
svm_model = SVC(random_state=42)

# Fit the model on the training set
svm_model.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred_svm = svm_model.predict(X_test_scaled)

# Build the confusion matrix
conf_matrix_svm = confusion_matrix(y_test, y_pred_svm)
print("SVM confusion matrix:\n", conf_matrix_svm)

# Compute the accuracy
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print("SVM accuracy:", accuracy_svm)

# Print the classification report
print("SVM classification report:\n", classification_report(y_test, y_pred_svm))

SVM confusion matrix:
 [[30349    58]
 [ 2063   125]]
SVM accuracy: 0.9349286700414174
SVM classification report:
               precision    recall  f1-score   support

           0       0.94      1.00      0.97     30407
           1       0.68      0.06      0.11      2188

    accuracy                           0.93     32595
   macro avg       0.81      0.53      0.54     32595
weighted avg       0.92      0.93      0.91     32595



KNN

In [27]:
# KNN
from sklearn.neighbors import KNeighborsClassifier


# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create the k-nearest neighbors model
knn_classifier = KNeighborsClassifier()

# Fit the model on the training set
knn_classifier.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred_knn = knn_classifier.predict(X_test_scaled)

# Build the confusion matrix
conf_matrix_knn = confusion_matrix(y_test, y_pred_knn)
print("KNN confusion matrix:\n", conf_matrix_knn)

# Compute the accuracy
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print("KNN accuracy:", accuracy_knn)

# Print the classification report
print("KNN classification report:\n", classification_report(y_test, y_pred_knn))


KNN confusion matrix:
 [[30167   240]
 [ 1930   258]]
KNN accuracy: 0.933425371989569
KNN classification report:
               precision    recall  f1-score   support

           0       0.94      0.99      0.97     30407
           1       0.52      0.12      0.19      2188

    accuracy                           0.93     32595
   macro avg       0.73      0.56      0.58     32595
weighted avg       0.91      0.93      0.91     32595



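## Exercise 5: Adjust the model's decision threshold

Banks usually apply stricter requirements, because the consequences of fraud are typically severe, so the model's decision threshold is often adjusted.<br>

For example, logistic regression normally uses 0.5 as the probability cutoff, but we can lower the threshold to make the model more "sensitive". Try a threshold of 0.3 and look at the evaluation metrics again (mainly accuracy and recall).

- Hint: <span style='color:white'>Many sklearn classifiers expose predict_proba, which returns the estimated probabilities; compare them with the chosen threshold to decide the final class</span>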

In [29]:
## your code here

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix

# Assumes X_train, X_test, y_train, y_test are already defined

# Create the Logistic Regression model
model = LogisticRegression(random_state=42, max_iter=1000)

# Fit the model on the training set
model.fit(X_train, y_train)

# Predict on the test set with the default threshold of 0.5
y_pred_default = model.predict(X_test)

# Compute the metrics at the default threshold
accuracy_default = accuracy_score(y_test, y_pred_default)
recall_default = recall_score(y_test, y_pred_default)
precision_default = precision_score(y_test, y_pred_default)

# Print the metrics at the default threshold
print("Accuracy at the default threshold:", accuracy_default)
print("Recall at the default threshold:", recall_default)
print("Precision at the default threshold:", precision_default)

# Lower the threshold to 0.3
threshold = 0.3
y_pred_adjusted = (model.predict_proba(X_test)[:, 1] > threshold).astype(int)

# Compute the metrics at the adjusted threshold
accuracy_adjusted = accuracy_score(y_test, y_pred_adjusted)
recall_adjusted = recall_score(y_test, y_pred_adjusted)
precision_adjusted = precision_score(y_test, y_pred_adjusted)

# Print the metrics at the adjusted threshold
print(f"\nAccuracy with threshold {threshold}:", accuracy_adjusted)
print(f"Recall with threshold {threshold}:", recall_adjusted)
print(f"Precision with threshold {threshold}:", precision_adjusted)

# Print the confusion matrix
conf_matrix_adjusted = confusion_matrix(y_test, y_pred_adjusted)
print("\nConfusion matrix after adjusting the threshold:\n", conf_matrix_adjusted)




Accuracy at the default threshold: 0.9344991563123178
Recall at the default threshold: 0.058957952468007314
Precision at the default threshold: 0.6292682926829268

Accuracy with threshold 0.3: 0.933425371989569
Recall with threshold 0.3: 0.12522851919561243
Precision with threshold 0.3: 0.5169811320754717

Confusion matrix after adjusting the threshold:
 [[30151   256]
 [ 1914   274]]
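
Lowering the cutoff from 0.5 to 0.3 roughly doubled recall here while precision dropped. To see the trade-off more broadly, one option is to sweep several thresholds over the same predicted probabilities. The sketch below does this on synthetic data; the threshold grid and all variable names are illustrative choices, not part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: about 10% positives)
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_positive = clf.predict_proba(X_te)[:, 1]   # estimated P(class 1)

thresholds = (0.5, 0.3, 0.1)   # from strict to lenient
recalls = []
for threshold in thresholds:
    y_hat = (proba_positive > threshold).astype(int)
    recall = recall_score(y_te, y_hat)
    precision = precision_score(y_te, y_hat, zero_division=0)
    recalls.append(recall)
    print(f"threshold={threshold}: precision={precision:.3f}, recall={recall:.3f}")
```

Recall can only stay the same or rise as the threshold falls, because every sample flagged at a higher cutoff is still flagged at a lower one; precision usually moves the other way.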
