## Will Your Customer Pay You? Using Machine Learning to Predict Default

http://mariofilho.com/sera-que-seu-cliente-vai-te-pagar-usando-machine-learning-para-prever-inadimplencia/

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline

### Loading the data

In [2]:
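# header=1: the second spreadsheet row holds the column names; 'ID' becomes the index
data = pd.read_excel('default of credit card clients.xls', header=1, index_col='ID')
# Shift the repayment-status columns by 2; this makes the codes non-negative
# for one-hot encoding, assuming the raw values start at -2
data[['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']] += 2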
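
As a minimal sketch of what the shift above does, the following uses a small hypothetical DataFrame (the values are made up, not taken from the dataset) and assumes the lowest raw code is -2. Adding 2 maps every code onto a non-negative integer without changing their order.

```python
import pandas as pd

# Hypothetical toy data standing in for two of the repayment-status columns.
toy = pd.DataFrame({'PAY_0': [-2, -1, 0, 2], 'PAY_2': [-1, 0, 1, 3]})

pay_cols = ['PAY_0', 'PAY_2']
shifted = toy.copy()
shifted[pay_cols] += 2  # same in-place shift as in the cell above

print(shifted)
```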

In [3]:
X = data.drop('default payment next month', axis=1)
y = data['default payment next month'].copy()

### Train/test split

In [4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)
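
To make the effect of `test_size=0.3` and a fixed `random_state` concrete, here is a small sketch on hypothetical random data (the arrays `X_toy` and `y_toy` are illustrative, not the credit card data). It checks the resulting sizes and that repeating the call with the same seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 features, binary target.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 3))
y_toy = rng.randint(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=2)
# Same arguments and seed again: the split should be identical.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.3, random_state=2)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```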

### Logistic Regression

In [5]:
# The random_state of LogisticRegression does not seem to take effect,
# so this model may give different results each time it is run.
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
kf = KFold(n_splits=5, random_state=0, shuffle=True)

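from sklearn.compose import make_column_transformer

# Positional indices of the categorical columns in X
categs = [1,2,3,5,6,7,8,9,10]
# OneHotEncoder(categorical_features=...) was removed in scikit-learn 0.22;
# a column transformer encodes the same columns and passes the rest through.
ohe = make_column_transformer((OneHotEncoder(handle_unknown='ignore'), categs), remainder='passthrough')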
model = make_pipeline(ohe, LogisticRegression(C=0.5, random_state=1, class_weight='balanced'))
cross_val_score(model, X_train, y_train, n_jobs=-1, cv=kf, scoring='roc_auc').mean()

0.70470591425353191
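
The number above is the mean ROC AUC over the five folds. As a sketch of what `cross_val_score(..., scoring='roc_auc').mean()` computes, the loop below reproduces it by hand on synthetic data from `make_classification` (an assumption for illustration; `max_iter=1000` is also added here only to avoid convergence warnings).

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training data.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
kf = KFold(n_splits=5, random_state=0, shuffle=True)
model = LogisticRegression(C=0.5, class_weight='balanced', max_iter=1000)

fold_aucs = []
for train_idx, valid_idx in kf.split(X_toy):
    # Fit a fresh copy on the training folds, score on the held-out fold.
    fold_model = clone(model).fit(X_toy[train_idx], y_toy[train_idx])
    proba = fold_model.predict_proba(X_toy[valid_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y_toy[valid_idx], proba))

manual_mean = np.mean(fold_aucs)
library_mean = cross_val_score(model, X_toy, y_toy, cv=kf, scoring='roc_auc').mean()
print(manual_mean, library_mean)
```

Because the `KFold` object has a fixed integer seed, both computations see the same folds, so the two means should agree up to floating-point noise.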

### Random Forest

In [6]:
from sklearn.ensemble import RandomForestClassifier
kf = KFold(n_splits=5, random_state=0, shuffle=True)

model = RandomForestClassifier(n_jobs=-1, n_estimators=500, max_features=6, min_samples_leaf=20, random_state=0, class_weight='balanced')
cross_val_score(model, X_train, y_train, n_jobs=1, cv=kf, scoring='roc_auc').mean()

0.77731156622232267
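
Both the logistic regression and the random forest above use `class_weight='balanced'`. As a sketch of what that setting means, scikit-learn documents the weights as `n_samples / (n_classes * np.bincount(y))`; the toy labels below are hypothetical and only illustrate the formula.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: three non-defaults, one default.
y_toy = np.array([0, 0, 0, 1])
classes = np.unique(y_toy)

# Formula from the scikit-learn documentation for 'balanced'.
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))
library = compute_class_weight(class_weight='balanced', classes=classes, y=y_toy)

print(manual, library)
```

The rarer class gets the larger weight, so errors on defaulting customers count more during training.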

### Gradient Boosted Trees

In [7]:
import xgboost as xgb

kf = KFold(n_splits=5, random_state=0, shuffle=True)

model = xgb.XGBClassifier(learning_rate=0.009, max_depth=6, min_child_weight=3,
                     subsample=0.15, colsample_bylevel=0.85, n_estimators=500)
cross_val_score(model, X_train, y_train, n_jobs=1, cv=kf, scoring='roc_auc').mean()

0.78069163130595398

### Final model on the test set

In [8]:
from sklearn.metrics import roc_auc_score

model = xgb.XGBClassifier(learning_rate=0.009, max_depth=6, min_child_weight=3,
                     subsample=0.15, colsample_bylevel=0.85, n_estimators=500)
model.fit(X_train, y_train)
p = model.predict_proba(X_test)
roc_auc_score(y_test, p[:, 1])

0.78902959234540626
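
The cell above passes `p[:, 1]` to `roc_auc_score` because `predict_proba` returns one column per class and the score needs the probability of the positive class. A small sketch with hypothetical probabilities (not model output) shows the computation: the AUC is the fraction of (positive, negative) pairs in which the positive example receives the higher score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
# Hypothetical predict_proba output: column 0 = P(class 0), column 1 = P(class 1).
p_toy = np.array([[0.90, 0.10],
                  [0.60, 0.40],
                  [0.65, 0.35],
                  [0.20, 0.80]])

auc = roc_auc_score(y_true, p_toy[:, 1])
print(auc)  # 0.75: 3 of the 4 positive/negative pairs are ranked correctly
```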