# Let's Fight Monsters Together: Credit Scoring Exercise

---
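## Assignment Instructions

- How to answer:
    - **Keep every step** of your work when answering; do not give only the final answer
    - Make a habit of commenting your code

- Solution hints:
    - To help you understand each problem precisely and get the most out of the exercises, this document provides solution hints
    - The hints are **for reference only**; original solutions are encouraged
    - To encourage you to think for yourself, the hints are set in **white** text; when needed, drag the mouse from just after the colon to reveal them

- Data used:
    - After loading the dataset, first **inspect the data and understand its basic properties**; later questions will not repeat this reminder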

## Machine Learning for Credit Scoring


Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years. [Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)

Attribute Information:

|Variable Name	|	Description	|	Type|
|----|----|----|
|SeriousDlqin2yrs	|	Person experienced 90 days past due delinquency or worse 	|	Y/N|
|RevolvingUtilizationOfUnsecuredLines	|	Total balance on credit divided by the sum of credit limits	|	percentage|
|age	|	Age of borrower in years	|	integer|
|NumberOfTime30-59DaysPastDueNotWorse	|	Number of times borrower has been 30-59 days past due |	integer|
|DebtRatio	|	Monthly debt payments, alimony, and living costs divided by monthly gross income	|	percentage|
|MonthlyIncome	|	Monthly income	|	real|
|NumberOfOpenCreditLinesAndLoans	|	Number of open loans (installment loans such as a car loan or mortgage) and lines of credit (e.g. credit cards) |	integer|
|NumberOfTimes90DaysLate	|	Number of times borrower has been 90 days or more past due.	|	integer|
|NumberRealEstateLoansOrLines	|	Number of mortgage and real estate loans	|	integer|
|NumberOfTime60-89DaysPastDueNotWorse	|	Number of times borrower has been 60-89 days past due |integer|
|NumberOfDependents	|	Number of dependents in family	|	integer|


----------
## Read the data into Pandas 

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 500)
import zipfile
with zipfile.ZipFile('KaggleCredit2.csv.zip', 'r') as z:
    f = z.open('KaggleCredit2.csv')
    data = pd.read_csv(f, index_col=0)
data.head()

Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,0.766127,45.0,2.0,0.802982,9120.0,13.0,0.0,6.0,0.0,2.0
1,0,0.957151,40.0,0.0,0.121876,2600.0,4.0,0.0,0.0,0.0,1.0
2,0,0.65818,38.0,1.0,0.085113,3042.0,2.0,1.0,0.0,0.0,0.0
3,0,0.23381,30.0,0.0,0.03605,3300.0,5.0,0.0,0.0,0.0,0.0
4,0,0.907239,49.0,1.0,0.024926,63588.0,7.0,0.0,1.0,0.0,0.0


In [2]:
data.shape

(112915, 11)

------------
## Drop na

In [2]:
data.isnull().sum(axis=0)

SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

In [3]:
data.dropna(inplace=True)
data.shape

(108648, 11)

---------
## Create X and y

In [4]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

In [7]:
y.mean()

0.06742876076872101
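
A positive rate of about 6.7% means the classes are heavily imbalanced, so any accuracy figure has to be read against a trivial baseline. The sketch below takes the value printed above and computes the accuracy of a hypothetical classifier that always predicts 0. The variable names are illustrative and not part of the original notebook.

```python
# Positive-class rate, copied from the y.mean() output above
positive_rate = 0.06742876076872101

# A classifier that always predicts 0 (no serious delinquency) is right
# for every negative example, so its accuracy equals 1 - positive_rate.
baseline_accuracy = 1 - positive_rate
print(f"Always-predict-0 accuracy: {baseline_accuracy:.4f}")
```

A model is only informative here if it does better than this baseline, or if it trades a little accuracy for a meaningfully higher recall on the positive class.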

---
## Exercise 1: Split the data into training and test sets
- Hint: <span style='color:white'>from sklearn.model_selection import train_test_split </span>

In [5]:
## import train_test_split
from sklearn.model_selection import train_test_split

In [6]:
## apply train_test_split to the dataset
## The features (X) and labels (y) were already created above
# train_test_split parameters:
# - X: Features
# - y: Labels
# - test_size: The proportion of the dataset to include in the test split (e.g., 0.2 for 20% testing, 0.3 for 30% testing)
# - random_state: Seed for the random number generator, ensures reproducibility

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Now you can use X_train and y_train for training your model
# and X_test and y_test for evaluating its performance

In [7]:
## Check that the training set makes up 80% of the data
X_train.shape[0]/X.shape[0]

0.7999963183859804
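
With a rare positive class, a purely random split can leave the train and test sets with slightly different positive rates. One common option, shown here as a sketch on synthetic data rather than as the notebook's method, is the `stratify` argument of `train_test_split`. The names `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data: 10,000 rows, 7% positives
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:700] = 1

# stratify=y_demo keeps the class proportions (nearly) equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```

On the real data the equivalent call would pass `stratify=y`. Doing so would change the split, and therefore the numbers reported in the rest of this notebook.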

----
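## Exercise 2: Classify with sklearn algorithms such as logistic regression, decision tree, SVM, and KNN
Look up the sklearn API to learn what each model parameter means, and try adjusting different parameters.

### Logistic regression
- Hint: <span style='color:white'>from sklearn import linear_model </span>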

In [8]:
## import LogisticRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [9]:
# Create a logistic regression model
# Increase max_iter based on the training result
# Try a solver other than the default one
logreg_model = LogisticRegression(random_state=42, max_iter=1000, solver='liblinear')

In [10]:
# Train the model on the training set
logreg_model.fit(X_train, y_train)

### Decision Tree
- Hint: <span style='color:white'>from sklearn.tree import DecisionTreeClassifier </span>

In [11]:
## import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [12]:
# Create a Decision Tree model
decision_tree_model = DecisionTreeClassifier(random_state=42)

# Train the model on the training set
decision_tree_model.fit(X_train, y_train)

### Random Forest
- Hint: <span style='color:white'>from sklearn.ensemble import RandomForestClassifier </span>

In [13]:
## import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [14]:
# Create a RandomForestClassifier model
random_forest_model = RandomForestClassifier(random_state=42)

# Train the model on the training set
random_forest_model.fit(X_train, y_train)

### SVM
- Hint: <span style='color:white'>from sklearn.svm import SVC </span>

In [15]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Create an SVM model
svm_model = SVC(random_state=42)

# Train the model on the training set
svm_model.fit(X_train, y_train)

### KNN
- Hint: <span style='color:white'>from sklearn.neighbors import KNeighborsClassifier </span>

In [16]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Create a KNN model
knn_model = KNeighborsClassifier(n_neighbors=5)  # You can adjust the number of neighbors (n_neighbors) as needed

# Train the model on the training set
knn_model.fit(X_train, y_train)
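
SVM and KNN both rely on distances between feature vectors, so a column on a large scale (such as `MonthlyIncome`) can dominate columns on a small scale (such as `DebtRatio`). A common remedy is to standardize the features inside a pipeline. The sketch below uses synthetic data from `make_classification`, not the Kaggle dataset, and the names `X_demo`, `y_demo`, and `scaled_knn` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the Kaggle dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.93, 0.07], random_state=42
)
# Inflate one feature's scale to mimic a column such as MonthlyIncome
X_demo[:, 0] *= 10_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training data only, then reused on the test data
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_tr, y_tr)

print("Test accuracy with scaling:", scaled_knn.score(X_te, y_te))
```

Putting the scaler in the pipeline keeps test-set statistics out of the fit. The same pattern should work with `SVC` in place of `KNeighborsClassifier`.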

---

## Exercise 3: Predict on the test set and compute the accuracy

### Logistic regression
- Hint: <span style='color:white'>y_pred_LR = clf_LR.predict(x_test) </span>

In [17]:
# Make predictions on the testing set
y_pred = logreg_model.predict(X_test)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print("Accuracy:", accuracy)

Accuracy: 0.9320294523699953


### Decision Tree
- Hint: <span style='color:white'>y_pred_tree = tree.predict(x_test) </span>

In [18]:
# Make predictions on the testing set
y_pred_decision_tree = decision_tree_model.predict(X_test)

# Evaluate the Accuracy of the Decision Tree model
accuracy_decision_tree = accuracy_score(y_test, y_pred_decision_tree)

# Print the results
print("Accuracy:", accuracy_decision_tree)

Accuracy: 0.8938794293603314


### Random Forest
- Hint: <span style='color:white'>y_pred_forest = forest.predict(x_test) </span>

In [19]:
# Make predictions on the testing set
y_pred_random_forest = random_forest_model.predict(X_test)

# Evaluate the accuracy of the RandomForestClassifier model
accuracy_random_forest = accuracy_score(y_test, y_pred_random_forest)

# Print the results
print("Accuracy:", accuracy_random_forest)

Accuracy: 0.9340082834790612


### SVM
- Hint: <span style='color:white'>y_pred_SVC = clf_svc.predict(x_test) </span>

In [20]:
# Make predictions on the testing set
y_pred_svm = svm_model.predict(X_test)

# Evaluate the accuracy of the SVM model
accuracy_svm = accuracy_score(y_test, y_pred_svm)

# Print the results
print("Accuracy:", accuracy_svm)

Accuracy: 0.9317073170731708


### KNN
- Hint: <span style='color:white'>y_pred_KNN = neigh.predict(x_test) </span>

In [21]:
# Make predictions on the testing set
y_pred_knn = knn_model.predict(X_test)

# Evaluate the accuracy of the KNN model
accuracy_knn = accuracy_score(y_test, y_pred_knn)

# Print the results
print("Accuracy:", accuracy_knn)

Accuracy: 0.9306948918545789


---
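## Exercise 4: Read the official sklearn documentation to learn about evaluation metrics for classification, and evaluate the models in this example

**Learning links on the confusion matrix**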

- Blog:<br>
http://blog.csdn.net/vesper305/article/details/44927047<br>
- WiKi:<br>
http://en.wikipedia.org/wiki/Confusion_matrix<br>
- sklearn doc:<br>
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

In [24]:
models = ["Logistic Regression", "Decision Tree", "Random Forest", "SVM", "KNN"]
# Test-set predictions computed in Exercise 3, in the same order as `models`
predictions = [y_pred, y_pred_decision_tree, y_pred_random_forest, y_pred_svm, y_pred_knn]

# Rows of each matrix are the true classes, columns are the predicted classes
for model_name, y_hat in zip(models, predictions):
    print(f"{model_name} Model:")
    print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_hat))

Logistic Regression Model:

Confusion Matrix:
 [[20235    11]
 [ 1466    18]]
Decision Tree Model:

Confusion Matrix:
 [[19024  1222]
 [ 1084   400]]
Random Forest Model:

Confusion Matrix:
 [[20032   214]
 [ 1220   264]]
SVM Model:

Confusion Matrix:
 [[20246     0]
 [ 1484     0]]
KNN Model:

Confusion Matrix:
 [[20195    51]
 [ 1455    29]]
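
The confusion matrices above show why accuracy alone is misleading here: the SVM reaches about 93.2% accuracy while flagging no positive case at all. As a worked example, the sketch below derives accuracy, precision, and recall from the Random Forest matrix printed above, using the sklearn convention that rows are true classes and columns are predicted classes. The variable names are illustrative.

```python
import numpy as np

# Random Forest confusion matrix as printed above:
# rows are true classes (0, 1), columns are predicted classes (0, 1)
cm = np.array([[20032, 214],
               [1220, 264]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # share of flagged borrowers who are true positives
recall = tp / (tp + fn)     # share of true positives that the model flags

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy matches the value reported for Random Forest in Exercise 3, while the recall shows that fewer than one in five positive cases is caught at the default threshold.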


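## Exercise 5: Adjust the model's decision threshold

Banks usually have stricter requirements, because the consequences of fraud are usually severe, so in practice the model's decision criterion is often adjusted.<br>

For example, logistic regression normally uses a probability decision boundary of 0.5, but we can lower the threshold to make the model more "sensitive". Try setting the threshold to 0.3, then look at the evaluation metrics again (mainly precision and recall).

- Hint: <span style='color:white'>Many sklearn classifiers provide predict_proba, which returns the estimated probabilities; compare them with the chosen threshold to decide the final result (the predicted class) </span>

Logistic regression with the probability threshold ranging from 0.1 to 1.0 in steps of 0.1: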

In [40]:
import numpy as np
for p in np.arange(0.1, 1.1, 0.1):
    predictions = (logreg_model.predict_proba(X_test)[:, 1] > p).astype(int)
    
    print(f"Logistic Regression Model (p = {p}):")
    print("\nConfusion Matrix:\n", confusion_matrix(y_test, predictions))

Logistic Regression Model (p = 0.1):

Confusion Matrix:
 [[15677  4569]
 [  890   594]]
Logistic Regression Model (p = 0.2):

Confusion Matrix:
 [[19837   409]
 [ 1389    95]]
Logistic Regression Model (p = 0.30000000000000004):

Confusion Matrix:
 [[20234    12]
 [ 1465    19]]
Logistic Regression Model (p = 0.4):

Confusion Matrix:
 [[20234    12]
 [ 1466    18]]
Logistic Regression Model (p = 0.5):

Confusion Matrix:
 [[20235    11]
 [ 1466    18]]
Logistic Regression Model (p = 0.6):

Confusion Matrix:
 [[20235    11]
 [ 1467    17]]
Logistic Regression Model (p = 0.7000000000000001):

Confusion Matrix:
 [[20235    11]
 [ 1470    14]]
Logistic Regression Model (p = 0.8):

Confusion Matrix:
 [[20239     7]
 [ 1475     9]]
Logistic Regression Model (p = 0.9):

Confusion Matrix:
 [[20243     3]
 [ 1484     0]]
Logistic Regression Model (p = 1.0):

Confusion Matrix:
 [[20246     0]
 [ 1484     0]]
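
The exercise asks for precision and recall at each threshold, which the confusion matrices only show indirectly. The sketch below reports both metrics directly for a few thresholds. It runs on synthetic data from `make_classification`, so its numbers are not comparable to the ones above, and `X_demo`, `y_demo`, `model`, and `proba` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the Kaggle dataset)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.93, 0.07], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Compute the probabilities once and reuse them for every threshold
proba = model.predict_proba(X_te)[:, 1]

recalls = []
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    pred = (proba > p).astype(int)
    # zero_division=0 avoids a warning when no sample is predicted positive
    prec = precision_score(y_te, pred, zero_division=0)
    rec = recall_score(y_te, pred, zero_division=0)
    recalls.append(rec)
    print(f"p={p:.1f} precision={prec:.3f} recall={rec:.3f}")
```

Raising the threshold can only shrink the set of predicted positives, so recall never increases as `p` grows. Precision usually moves the other way, but that is not guaranteed.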


Models that already expose `predict_proba` (all of the fitted models except `svm_model`, which was created without `probability=True`), evaluated at a threshold of 0.3:

In [42]:
p = 0.3
names = ["Logistic Regression", "Decision Tree", "Random Forest", "KNN"]
models = [logreg_model, decision_tree_model, random_forest_model, knn_model]

for name, model in zip(names, models):
    print(f"{name} Model:")
    print("\nConfusion Matrix:\n", confusion_matrix(y_test, (model.predict_proba(X_test)[:, 1] > p).astype(int)))

Logistic Regression Model:

Confusion Matrix:
 [[20234    12]
 [ 1465    19]]
Decision Tree Model:

Confusion Matrix:
 [[19024  1222]
 [ 1084   400]]
Random Forest Model:

Confusion Matrix:
 [[19503   743]
 [  928   556]]
KNN Model:

Confusion Matrix:
 [[19702   544]
 [ 1352   132]]


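SVM model, refitted with `probability=True` so that `predict_proba` is available (evaluated here at a threshold of 0.5):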

In [43]:
# Train the SVM model with probability estimates
svm_model = SVC(kernel='rbf', probability=True)
svm_model.fit(X_train, y_train)
p = 0.5
print("SVM Model:")
print("\nConfusion Matrix:\n", confusion_matrix(y_test, (svm_model.predict_proba(X_test)[:, 1] > p).astype(int)))

SVM Model:

Confusion Matrix:
 [[20246     0]
 [ 1484     0]]
