#### Logistic Regression for Credit Scoring
Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.

Credit scoring algorithms, which estimate the probability of default, are what banks use to decide whether a loan should be granted. This competition asks participants to improve on the state of the art in credit scoring by predicting the probability that a borrower will experience financial distress in the next two years.

In [2]:
import pandas as pd
import zipfile

pd.set_option('display.max_columns', 500)
with zipfile.ZipFile(r"../../data/logistic_regression_data/KaggleCredit2.csv.zip", 'r') as z:
    f = z.open('KaggleCredit2.csv')
    data = pd.read_csv(f, index_col=0)
data.head()

Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,0.766127,45.0,2.0,0.802982,9120.0,13.0,0.0,6.0,0.0,2.0
1,0,0.957151,40.0,0.0,0.121876,2600.0,4.0,0.0,0.0,0.0,1.0
2,0,0.65818,38.0,1.0,0.085113,3042.0,2.0,1.0,0.0,0.0,0.0
3,0,0.23381,30.0,0.0,0.03605,3300.0,5.0,0.0,0.0,0.0,0.0
4,0,0.907239,49.0,1.0,0.024926,63588.0,7.0,0.0,1.0,0.0,0.0


#### Dropping Null Data

In [3]:
data.shape

(112915, 11)

In [7]:
data.isnull().sum(axis=0)

SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

In [8]:
data.dropna(inplace=True)
data.shape

(108648, 11)
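
The row count falls by 112915 - 108648 = 4267, which equals the null count of each affected column, so the missing values in `age` and `NumberOfDependents` must sit on the same rows. A minimal sketch of how one might check this, using a toy frame with made-up values (the names `toy`, `rows_with_both_null`, and `cleaned` are illustrative, not from the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with missing entries in the same two columns.
toy = pd.DataFrame({
    "age": [45.0, np.nan, 38.0, np.nan, 49.0],
    "NumberOfDependents": [2.0, np.nan, 0.0, np.nan, 0.0],
    "DebtRatio": [0.80, 0.12, 0.09, 0.04, 0.02],
})

# Per-column count of missing values, as in data.isnull().sum(axis=0).
null_counts = toy.isnull().sum(axis=0)

# Rows with at least one missing value, and rows where both columns are missing.
rows_with_any_null = int(toy.isnull().any(axis=1).sum())
rows_with_both_null = int(
    toy[["age", "NumberOfDependents"]].isnull().all(axis=1).sum())

# dropna() without inplace=True returns a new frame and leaves `toy` unchanged.
cleaned = toy.dropna()
rows_removed = len(toy) - len(cleaned)

print(null_counts)
print(rows_with_any_null, rows_with_both_null, rows_removed)
```

When the three printed counts agree, every dropped row was missing both values, so no row was discarded for a single isolated gap.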

#### Creating Data 'X' and 'y'

In [9]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

In [11]:
from sklearn import model_selection
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2)
print(x_train.shape)


(86918, 10)


#### Train and Test

In [14]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(multi_class='ovr', solver='sag', class_weight='balanced')
lr.fit(x_train, y_train)



LogisticRegression(class_weight='balanced', multi_class='ovr', solver='sag')
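
The scikit-learn documentation notes that fast convergence of the `sag` solver is only guaranteed when features have approximately the same scale, and the columns here range from ratios near 1 to incomes in the thousands. One possible variant, shown only as a sketch on synthetic data, standardizes the features before fitting. The data set, the names `X_demo`, `y_demo`, and `scaled_lr`, and the `max_iter` value are assumptions for illustration; `multi_class` is left at its default because the target is binary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the credit data (not the Kaggle file).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.93, 0.07], random_state=0)
# Inflate one column to mimic an income-like feature on a much larger scale.
X_demo[:, 0] *= 10_000

# Standardize every feature, then fit a class-weighted logistic regression.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='sag', class_weight='balanced',
                       max_iter=1000, random_state=0))
scaled_lr.fit(X_demo, y_demo)

demo_proba = scaled_lr.predict_proba(X_demo)  # one column per class
demo_pred = scaled_lr.predict(X_demo)
print("Predicted class counts:", np.bincount(demo_pred, minlength=2))
```

Wrapping the scaler and the model in one pipeline means the scaling statistics are learned from the training data only and reapplied automatically at prediction time.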

In [15]:
from sklearn.metrics import accuracy_score, recall_score

# Predict once per split and reuse the labels for both metrics.
train_pred = lr.predict(x_train)
test_pred = lr.predict(x_test)

train_score = accuracy_score(y_train, train_pred)
test_score = accuracy_score(y_test, test_pred)  # same value as lr.score(x_test, y_test)
print("Accuracy on Train Data: ", train_score)
print("Accuracy on Test Data: ", test_score)

# Macro recall: unweighted mean of the recall for class 0 and class 1.
train_recall = recall_score(y_train, train_pred, average='macro')
test_recall = recall_score(y_test, test_pred, average='macro')
print("Recall on Train Data: ", train_recall)
print("Recall on Test Data: ", test_recall)

Accuracy on Train Data:  0.932706689063255
Accuracy on Test Data:  0.9319374137137598
Recall on Train Data:  0.49998766513303156
Recall on Test Data:  0.5
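
Reading the two metrics together: an accuracy of about 0.93 with a macro recall of 0.5 is the pattern a classifier produces when it predicts class 0 for essentially every sample. The sketch below reproduces that pattern on toy labels. The 93/7 class split and the helper names `accuracy` and `macro_recall` are illustrative assumptions, not values or functions from the notebook.

```python
# Toy labels: 93 non-defaulters (0) and 7 defaulters (1); counts are illustrative.
y_true = [0] * 93 + [1] * 7
# A degenerate classifier that predicts "no default" for everyone.
y_pred = [0] * len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall, as recall_score(average='macro') computes."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Recall for one class: correctly predicted members / all true members.
        n_true = sum(t == cls for t in y_true)
        n_hit = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        recalls.append(n_hit / n_true)
    return sum(recalls) / len(recalls)

toy_accuracy = accuracy(y_true, y_pred)
toy_macro_recall = macro_recall(y_true, y_pred)
print("Accuracy:", toy_accuracy)          # 0.93
print("Macro recall:", toy_macro_recall)  # 0.5
```

Recall is 1.0 for class 0 and 0.0 for class 1, so the macro average is 0.5 even though accuracy looks high. On imbalanced data, accuracy alone says little about how many defaulters are caught.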


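Banks usually have stricter requirements, because the consequences of fraud are usually severe, so in practice we adjust the model's decision criterion.
In logistic regression, for example, the default probability cutoff is 0.5, but we can lower the threshold to increase the model's "sensitivity". Try setting the threshold to 0.3 and look at the evaluation metrics again (mainly accuracy and recall).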

In [53]:
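y_pro = lr.predict_proba(x_test)
# Column 1 of predict_proba is P(SeriousDlqin2yrs = 1); flag a default when it reaches 0.3.
# Caution: a first-index rule such as list(p >= 0.3).index(1) returns class 0 whenever
# P(class 0) >= 0.3; the accuracy recorded below came from that rule.
y_prd2 = (y_pro[:, 1] >= 0.3).astype(int)
test_score = accuracy_score(y_test, y_prd2)
print(test_score)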


0.9319374137137598
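
To make the effect of the threshold concrete, here is a small sketch on hand-written probabilities. The array `proba`, the labels, and the helpers `predict_with_threshold` and `recall_positive` are hypothetical and not taken from the notebook. It also contrasts thresholding the positive-class column with a rule that picks the first class whose probability reaches the threshold, which is not equivalent.

```python
import numpy as np

# Hypothetical predict_proba output: column 0 is P(class 0), column 1 is P(class 1).
proba = np.array([
    [0.90, 0.10],
    [0.65, 0.35],
    [0.40, 0.60],
    [0.20, 0.80],
])
y_true = np.array([0, 1, 1, 1])

def predict_with_threshold(proba, threshold):
    """Label a sample 1 when its positive-class probability reaches the threshold."""
    return (proba[:, 1] >= threshold).astype(int)

def recall_positive(y_true, y_pred):
    """Share of true positives (class 1) that were predicted as 1."""
    positives = y_true == 1
    return float((y_pred[positives] == 1).mean())

pred_default = predict_with_threshold(proba, 0.5)  # [0, 0, 1, 1]
pred_lowered = predict_with_threshold(proba, 0.3)  # [0, 1, 1, 1]

# First-index rule: pick the first class whose probability is >= 0.3.
first_index = np.array([list(p >= 0.3).index(True) for p in proba])  # [0, 0, 0, 1]

print("recall at 0.5:", recall_positive(y_true, pred_default))
print("recall at 0.3:", recall_positive(y_true, pred_lowered))
print("recall with first-index rule:", recall_positive(y_true, first_index))
```

Lowering the threshold on the positive-class column moves the second sample from 0 to 1 and raises recall. The first-index rule instead favors class 0 whenever its probability is at least 0.3, so it catches fewer positives than the default cutoff does.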
