### Machine Learning for Credit Scoring
Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.

Credit scoring algorithms, which estimate the probability of default, are what banks use to decide whether a loan should be granted. This competition asks participants to improve on the state of the art in credit scoring by predicting the probability that a borrower will experience financial distress in the next two years.
#### [Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)



Attribute Information:

| Variable Name | Description | Type |
|------|------|------|
| SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse (the target) | Y/N |
| RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit divided by the sum of credit limits | percentage |
| age | Age of borrower in years | integer |
| NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due, but no worse | integer |
| DebtRatio | Monthly debt payments, alimony, and living costs divided by monthly gross income | percentage |
| MonthlyIncome | Monthly income | real |
| NumberOfOpenCreditLinesAndLoans | Number of open loans and lines of credit | integer |
| NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due | integer |
| NumberRealEstateLoansOrLines | Number of mortgage and real estate loans | integer |
| NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due, but no worse | integer |
| NumberOfDependents | Number of dependents in family | integer |

Read the data into Pandas

In [3]:
import pandas as pd
pd.set_option('display.max_columns', 500)
import zipfile
with zipfile.ZipFile('KaggleCredit2.csv.zip', 'r') as z:   # read the file inside the zip
    f = z.open('KaggleCredit2.csv')
    data = pd.read_csv(f, index_col=0)
data.head()

Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,1,0.766127,45.0,2.0,0.802982,9120.0,13.0,0.0,6.0,0.0,2.0
1,0,0.957151,40.0,0.0,0.121876,2600.0,4.0,0.0,0.0,0.0,1.0
2,0,0.65818,38.0,1.0,0.085113,3042.0,2.0,1.0,0.0,0.0,0.0
3,0,0.23381,30.0,0.0,0.03605,3300.0,5.0,0.0,0.0,0.0,0.0
4,0,0.907239,49.0,1.0,0.024926,63588.0,7.0,0.0,1.0,0.0,0.0
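
The explicit `zipfile` step above works, but pandas can usually read a zipped CSV directly when the archive holds a single file, inferring the compression from the `.zip` extension. The sketch below demonstrates this on a tiny throwaway archive; the file name `demo.csv.zip` and its contents are made up for illustration and are not part of the Kaggle data.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a small single-file zip archive in a temporary directory (illustrative data only).
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "demo.csv.zip")
    with zipfile.ZipFile(zip_path, "w") as z:
        z.writestr("demo.csv", ",a,b\n0,1,2.5\n1,3,4.5\n")

    # pandas infers zip compression from the extension; index_col=0 uses the first column as the index.
    demo = pd.read_csv(zip_path, index_col=0)

print(demo)
```

Under the same assumption (one CSV per archive), `pd.read_csv('KaggleCredit2.csv.zip', index_col=0)` would be the one-line equivalent of the cell above.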


In [4]:
data.shape

(112915, 11)

In [8]:
# count missing values per column
data.isnull().sum(axis=0)

SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

In [10]:
data.dropna(inplace=True)  # drop every row that contains a missing value
data.shape

(108648, 11)
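
Dropping rows removes 112915 - 108648 = 4267 records, about 3.8% of the data. As a point of comparison, the sketch below contrasts `dropna` with median imputation on a small made-up frame. The values are invented, and imputation is shown only as one possible alternative, not as what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the same two columns that have missing values above.
demo = pd.DataFrame({
    "age": [45.0, np.nan, 38.0, 30.0],
    "NumberOfDependents": [2.0, 1.0, np.nan, 0.0],
})

# Option 1: drop any row with a missing value (the approach used in this notebook).
dropped = demo.dropna()

# Option 2: fill each gap with that column's median, keeping every row.
filled = demo.fillna(demo.median())

print(len(demo), len(dropped), len(filled))
print(filled)
```

If imputation were used in a real pipeline, the medians would normally be computed on the training split only, so that nothing from the test set leaks into the model.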

In [13]:
# Create X and y
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

In [15]:
# mean of y: the share of borrowers with a serious delinquency
y.mean()

0.06742876076872101
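
A mean of about 0.067 says the classes are heavily imbalanced: only about 6.7% of borrowers are positives. It follows that a trivial rule that always predicts 0 would already be right about 1 - 0.0674 = 0.9326 of the time, which is a useful baseline to hold any accuracy figure against. The sketch below works through the same arithmetic on made-up labels.

```python
# Hypothetical toy labels: 1 = serious delinquency, 0 = none (7% positives).
labels = [0] * 93 + [1] * 7

positive_rate = sum(labels) / len(labels)

# A rule that always predicts 0 is correct exactly when the true label is 0.
baseline_accuracy = 1 - positive_rate

print(f"positive rate:     {positive_rate:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```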

### Exercise 1

In [16]:
# train, test split
from sklearn import model_selection

In [17]:
x_train, x_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)

In [18]:
x_test.shape

(21730, 10)
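
The split above is random and unseeded, so the rows in each part change on every run. With a rare positive class, it can also help to keep the class ratio the same in both parts. The sketch below shows one way to do both with the `stratify` and `random_state` arguments of `train_test_split`. The data is synthetic and the seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 3 features, roughly 7% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.07).astype(int)

# stratify keeps the positive rate (nearly) equal across the two parts;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```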

### Exercise 2

In [24]:
from sklearn.linear_model import LogisticRegression

In [25]:
lr = LogisticRegression(multi_class='ovr', solver='sag', class_weight='balanced')
lr.fit(x_train, y_train)
score = lr.score(x_train, y_train)  # mean accuracy on the training set



In [26]:
score

0.9322579902896984
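
The features here live on very different scales (`MonthlyIncome` is in the thousands, the ratios are mostly below 1), and the scikit-learn documentation notes that the `sag` solver converges quickly only when features have roughly the same scale. A minimal sketch of one common remedy, standardizing inside a pipeline, is shown below on synthetic data. The feature names, coefficients, and `max_iter` value are assumptions for illustration, and `multi_class` is left out because the target is binary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(0)
income = rng.normal(5000, 2000, size=2000)
ratio = rng.random(2000)
X_demo = np.column_stack([income, ratio])

# Synthetic binary target that depends on both features.
logit = -3 + 4 * ratio - (income - 5000) / 2000
y_demo = (rng.random(2000) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardize first, then fit logistic regression with the sag solver.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="sag", class_weight="balanced",
                       max_iter=1000, random_state=0),
)
model.fit(X_demo, y_demo)

train_acc = model.score(X_demo, y_demo)
print(train_acc)
```

Putting the scaler inside the pipeline means it is fitted only on whatever data `fit` receives, so the same object can later be applied to a held-out set without refitting.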

### Exercise 3

In [28]:
from sklearn.metrics import accuracy_score

In [30]:
train_score = accuracy_score(y_train, lr.predict(x_train))

In [31]:
test_score = lr.score(x_test, y_test)

In [32]:
print("train accuracy: ", train_score)
print("test accuracy: ", test_score)

train accuracy:  0.9322579902896984
test accuracy:  0.9337321675103544
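
Because the task is to predict a probability and positives are rare, accuracy alone says little about how well the model ranks risky borrowers. A threshold-free metric such as ROC AUC, computed from `predict_proba`, is often reported alongside it. The sketch below shows the calls on synthetic data. The data, coefficients, and seed are made up, and the printed values say nothing about the Kaggle dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: the first feature carries the signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 2))
logit = -3 + 1.5 * X_demo[:, 0]
y_demo = (rng.random(2000) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

clf = LogisticRegression().fit(X_tr, y_tr)

# Column 1 of predict_proba is the estimated probability of the positive class.
proba = clf.predict_proba(X_te)[:, 1]

acc = accuracy_score(y_te, clf.predict(X_te))
auc = roc_auc_score(y_te, proba)

print("test accuracy:", acc)
print("test ROC AUC: ", auc)
```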


### Exercise 4