## Dicionário de dados


- '**receita_cliente**': Renda do cliente em R$
- '**anuidade_emprestimo**': Valor anual da taxa de juros do empréstimo em $
- '**anos_casa_propria**': Idade da propriedade do cliente em anos
- '**telefone_trab**': Acessibilidade do número de telefone comercial (1 indica Sim e 0 indica Não)
- '**avaliacao_cidade**': Classificação da cidade do cliente: 3 para excelente, 2 para bom e 1 para médio.
- '**score_1**': Pontuação originada de uma fonte externa. Este é um escore normalizado.
- '**score_2**': Pontuação originada de uma fonte externa. Este é um escore normalizado.
- '**score_3**': Pontuação originada de uma fonte externa. Este é um escore normalizado.
- '**score_social**': Quantidade de amigos/familiares do cliente que não cumpriram com pagamentos de empréstimos nos últimos 60 dias.
- '**troca_telefone**': Quantidade de dias antes do pedido de empréstimo em que o cliente mudou seu número de telefone.
- '**inadimplente**': 1 indica que o cliente não honrou com o pagamento do empréstimo, e 0 indica o contrário.


## Carregando os dados


In [3]:
import pandas as pd

dados = pd.read_csv(
    "https://3070-classificacao-otimizacao.s3.us-east-2.amazonaws.com/dados_inadimplencia.csv"
)
dados.head()

Unnamed: 0,receita_cliente,anuidade_emprestimo,anos_casa_propria,telefone_trab,avaliacao_cidade,score_1,score_2,score_3,score_social,troca_telefone,inadimplente
0,16855.246324,2997.0,12.157324,0,2.0,0.501213,0.003109,0.513171,0.117428,243.0,1
1,13500.0,2776.05,12.157324,0,2.0,0.501213,0.26973,0.513171,0.0979,617.0,0
2,11250.0,2722.188351,12.157324,0,3.0,0.701396,0.518625,0.700184,0.1186,9.0,0
3,27000.0,6750.0,3.0,0,2.0,0.501213,0.649571,0.513171,0.0474,300.0,0
4,22500.0,3097.8,12.157324,0,2.0,0.440744,0.509677,0.513171,0.0144,2913.0,1


In [4]:
dados.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14578 entries, 0 to 14577
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   receita_cliente      14578 non-null  float64
 1   anuidade_emprestimo  14578 non-null  float64
 2   anos_casa_propria    14578 non-null  float64
 3   telefone_trab        14578 non-null  int64  
 4   avaliacao_cidade     14578 non-null  float64
 5   score_1              14578 non-null  float64
 6   score_2              14578 non-null  float64
 7   score_3              14578 non-null  float64
 8   score_social         14578 non-null  float64
 9   troca_telefone       14578 non-null  float64
 10  inadimplente         14578 non-null  int64  
dtypes: float64(9), int64(2)
memory usage: 1.2 MB


In [5]:
dados["inadimplente"].value_counts(normalize=True) * 100

inadimplente
0    67.649883
1    32.350117
Name: proportion, dtype: float64

## Dividindo os dados em treino e teste


In [6]:
from sklearn.model_selection import train_test_split

SEED = 42

x = dados.drop("inadimplente", axis=1)
y = dados["inadimplente"]

x_treino, x_teste, y_treino, y_teste = train_test_split(
    x, y, test_size=0.33, stratify=y
)

## Criando os modelos


In [7]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

modelo_decision_tree = DecisionTreeClassifier(max_depth=3, random_state=SEED)
modelo_decision_tree.fit(x_treino, y_treino)
y_pred = modelo_decision_tree.predict(x_teste)
print(f"Recall Decision Tree: {recall_score(y_teste, y_pred)}")

Recall Decision Tree: 0.13881748071979436


In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

regression_pipeline = make_pipeline(StandardScaler(), LogisticRegression())
regression_pipeline.fit(x_treino, y_treino)
y_pred = regression_pipeline.predict(x_teste)
print(f"Recall Logistic Regression: {recall_score(y_teste, y_pred)}")

Recall Logistic Regression: 0.22043701799485863


## GridSearchCV


### Decision Tree


In [9]:
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid_dt = {
    "criterion": ["gini", "entropy"],
    "max_depth": np.linspace(6, 12, 4, dtype=int),
    "min_samples_split": np.linspace(5, 20, 4, dtype=int),
    "min_samples_leaf": np.linspace(5, 20, 4, dtype=int),
    "max_features": ["sqrt", "log2"],
    "splitter": ["best", "random"],
}


decision_tree = DecisionTreeClassifier(max_depth=3, random_state=SEED)
cv = StratifiedKFold(shuffle=True, random_state=SEED)

tree_search_cv = GridSearchCV(
    decision_tree, param_grid=param_grid_dt, cv=cv, scoring="recall", n_jobs=-1
)

tree_search_cv.fit(x_treino, y_treino)

In [10]:
tree_search_cv.best_params_

{'criterion': 'gini',
 'max_depth': 12,
 'max_features': 'sqrt',
 'min_samples_leaf': 15,
 'min_samples_split': 5,
 'splitter': 'best'}

In [11]:
df_tree_search_cv = pd.DataFrame(tree_search_cv.cv_results_)
df_tree_search_cv.head(3)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_criterion,param_max_depth,param_max_features,param_min_samples_leaf,param_min_samples_split,param_splitter,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.009628,0.002852,0.006375,0.007193,gini,6,sqrt,5,5,best,"{'criterion': 'gini', 'max_depth': 6, 'max_fea...",0.254747,0.286392,0.330696,0.153481,0.128165,0.230696,0.07766,135
1,0.003601,0.001195,0.00279,0.001414,gini,6,sqrt,5,5,random,"{'criterion': 'gini', 'max_depth': 6, 'max_fea...",0.080696,0.166139,0.151899,0.196203,0.15981,0.150949,0.038193,489
2,0.010566,0.000231,0.001914,0.000105,gini,6,sqrt,5,10,best,"{'criterion': 'gini', 'max_depth': 6, 'max_fea...",0.254747,0.286392,0.330696,0.153481,0.128165,0.230696,0.07766,135


In [12]:
df_tree_search_cv.loc[tree_search_cv.best_index_]

mean_fit_time                                                       0.014933
std_fit_time                                                        0.005219
mean_score_time                                                     0.002205
std_score_time                                                      0.001073
param_criterion                                                         gini
param_max_depth                                                           12
param_max_features                                                      sqrt
param_min_samples_leaf                                                    15
param_min_samples_split                                                    5
param_splitter                                                          best
params                     {'criterion': 'gini', 'max_depth': 12, 'max_fe...
split0_test_score                                                   0.313291
split1_test_score                                                   0.302215

### Análise visual


In [13]:
import plotly.express as px

px.scatter(
    df_tree_search_cv,
    x="param_max_depth",
    y="mean_test_score",
    title="Max Depth vs. Recall",
)

In [14]:
px.scatter(
    df_tree_search_cv,
    x="param_criterion",
    y="mean_test_score",
    title="Criterion vs. Recall",
)

In [15]:
px.scatter(
    df_tree_search_cv,
    x="param_max_features",
    y="mean_test_score",
    title="Max Features vs. Recall",
)

In [16]:
px.scatter(
    df_tree_search_cv,
    x="param_min_samples_split",
    y="mean_test_score",
    title="Min. Samples Split vs. Recall",
)

### Logistic Regression


In [17]:
max_iter = np.linspace(100, 300, 5, dtype=int)
c = [0.001, 0.01, 0.1, 1, 10]

param_grid_regression = [
    {
        "logisticregression__solver": ["newton-cg", "lbfgs"],
        "logisticregression__penalty": ["l2"],
        "logisticregression__max_iter": max_iter,
        "logisticregression__C": c,
    },
    {
        "logisticregression__solver": ["liblinear"],
        "logisticregression__penalty": ["l1", "l2"],
        "logisticregression__max_iter": max_iter,
        "logisticregression__C": c,
    },
]

regression_pipeline = make_pipeline(StandardScaler(), LogisticRegression())
regression_search_cv = GridSearchCV(
    regression_pipeline,
    param_grid=param_grid_regression,
    cv=cv,
    scoring="recall",
    n_jobs=-1,
)

regression_search_cv.fit(x_treino, y_treino)

In [18]:
regression_search_cv.best_params_

{'logisticregression__C': 0.001,
 'logisticregression__max_iter': 100,
 'logisticregression__penalty': 'l2',
 'logisticregression__solver': 'liblinear'}

In [19]:
df_regression_search_cv = pd.DataFrame(regression_search_cv.cv_results_)
df_regression_search_cv.head(3)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__max_iter,param_logisticregression__penalty,param_logisticregression__solver,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.006726,0.000554,0.001681,7.3e-05,0.001,100,l2,newton-cg,"{'logisticregression__C': 0.001, 'logisticregr...",0.087025,0.113924,0.107595,0.110759,0.074367,0.098734,0.015393,86
1,0.004048,0.000262,0.001613,4e-05,0.001,100,l2,lbfgs,"{'logisticregression__C': 0.001, 'logisticregr...",0.087025,0.113924,0.107595,0.110759,0.074367,0.098734,0.015393,86
2,0.005919,0.000108,0.001641,2.1e-05,0.001,150,l2,newton-cg,"{'logisticregression__C': 0.001, 'logisticregr...",0.087025,0.113924,0.107595,0.110759,0.074367,0.098734,0.015393,86


In [20]:
df_regression_search_cv.loc[regression_search_cv.best_index_]

mean_fit_time                                                                  0.008306
std_fit_time                                                                   0.005124
mean_score_time                                                                0.001611
std_score_time                                                                 0.000053
param_logisticregression__C                                                       0.001
param_logisticregression__max_iter                                                  100
param_logisticregression__penalty                                                    l2
param_logisticregression__solver                                              liblinear
params                                {'logisticregression__C': 0.001, 'logisticregr...
split0_test_score                                                              0.251582
split1_test_score                                                              0.245253
split2_test_score               

In [21]:
px.scatter(
    df_regression_search_cv,
    x="param_logisticregression__max_iter",
    y="mean_test_score",
    title="Max Iter. vs. Recall",
)

In [22]:
px.scatter(
    df_regression_search_cv,
    x="param_logisticregression__C",
    y="mean_test_score",
    title="C vs. Recall",
)

In [23]:
px.scatter(
    df_regression_search_cv,
    x="param_logisticregression__solver",
    y="mean_test_score",
    title="Solver vs. Recall",
)

In [24]:
px.scatter(
    df_regression_search_cv,
    x="param_logisticregression__penalty",
    y="mean_test_score",
    title="Penalty vs. Recall",
)

## Validação Cruzada Aninhada


### Decision Tree


In [25]:
from sklearn.model_selection import cross_val_score

inner_cv = StratifiedKFold(shuffle=True, random_state=SEED)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=SEED)
decision_tree = DecisionTreeClassifier(max_depth=3, random_state=SEED)

tree_nested_cv = GridSearchCV(
    decision_tree, param_grid=param_grid_dt, cv=inner_cv, scoring="recall", n_jobs=-1
)

tree_nested_cv_results = cross_val_score(
    tree_nested_cv, x_treino, y_treino, cv=outer_cv
)

In [26]:
print(f"Média do Recall: {tree_nested_cv_results.mean()}")

Média do Recall: 0.24713643077547778


### Logistic Regression


In [27]:
inner_cv = StratifiedKFold(shuffle=True, random_state=SEED)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=SEED)
regression_pipeline = make_pipeline(StandardScaler(), LogisticRegression())

regression_nested_cv = GridSearchCV(
    regression_pipeline,
    param_grid=param_grid_regression,
    cv=inner_cv,
    scoring="recall",
    n_jobs=-1,
)

regression_nested_cv_results = cross_val_score(
    regression_nested_cv, x_treino, y_treino, cv=outer_cv
)

print(f"Média do Recall: {regression_nested_cv_results.mean()}")

Média do Recall: 0.2632900907199874


## RandomizedSearchCV Aninhado


### Decision Tree


In [28]:
from sklearn.model_selection import RandomizedSearchCV

tree_nested_rcv = RandomizedSearchCV(
    decision_tree,
    n_iter=100,
    param_distributions=param_grid_dt,
    cv=inner_cv,
    scoring="recall",
    n_jobs=-1,
    random_state=SEED,
)

tree_nested_rcv_results = cross_val_score(
    tree_nested_rcv, x_treino, y_treino, cv=outer_cv
)
print(f"Média do Recall: {tree_nested_rcv_results.mean()}")

Média do Recall: 0.2797269690586157


In [29]:
tree_nested_rcv.fit(x_treino, y_treino)
tree_nested_rcv.best_params_

{'splitter': 'best',
 'min_samples_split': 5,
 'min_samples_leaf': 15,
 'max_features': 'sqrt',
 'max_depth': 12,
 'criterion': 'gini'}

### Logistic Regression


In [30]:
regression_nested_rcv = RandomizedSearchCV(
    regression_pipeline,
    n_iter=50,
    param_distributions=param_grid_regression,
    cv=inner_cv,
    scoring="recall",
    n_jobs=-1,
    random_state=SEED,
)

regression_nested_rcv_results = cross_val_score(
    regression_nested_rcv, x_treino, y_treino, cv=outer_cv
)
print(f"Média do Recall: {tree_nested_rcv_results.mean()}")

Média do Recall: 0.2797269690586157


In [31]:
regression_nested_rcv.fit(x_treino, y_treino)
regression_nested_rcv.best_params_

{'logisticregression__solver': 'liblinear',
 'logisticregression__penalty': 'l2',
 'logisticregression__max_iter': 150,
 'logisticregression__C': 0.001}

## Otimização Bayesiana

In [32]:
from skopt.space import Real, Integer, Categorical
from skopt import BayesSearchCV

space_dt = {
    'criterion': Categorical(['gini', 'entropy']),
    'max_depth': Integer(6, 12),
    'min_samples_split': Integer(5, 20),
    'min_samples_leaf': Integer(5, 20),
    'max_features': Categorical(['sqrt', 'log2']),
    'splitter': Categorical(['best', 'random'])
}

tree_nested_bcv = BayesSearchCV(
    decision_tree,
    n_iter=50,
    search_spaces=space_dt,
    cv=inner_cv,
    scoring="recall",
    n_jobs=-1,
    random_state=SEED,
)

tree_nested_bcv_results = cross_val_score(
    tree_nested_bcv, x_treino, y_treino, cv=outer_cv
)
print(f"Média do Recall: {tree_nested_bcv_results.mean()}")


The objective has been evaluated at point ['entropy', 12, 'log2', 5, 5, 'best'] before, using random point ['gini', 10, 'log2', 17, 18, 'best']


The objective has been evaluated at point ['gini', 12, 'log2', 5, 5, 'best'] before, using random point ['entropy', 7, 'log2', 12, 9, 'best']


The objective has been evaluated at point ['gini', 12, 'log2', 5, 5, 'best'] before, using random point ['gini', 8, 'log2', 10, 11, 'random']


The objective has been evaluated at point ['gini', 12, 'sqrt', 5, 5, 'best'] before, using random point ['gini', 10, 'sqrt', 17, 16, 'random']



Média do Recall: 0.25505182926646136


In [33]:
tree_nested_bcv.fit(x_treino, y_treino)
tree_nested_bcv.best_params_

OrderedDict([('criterion', 'gini'),
             ('max_depth', 12),
             ('max_features', 'sqrt'),
             ('min_samples_leaf', 15),
             ('min_samples_split', 6),
             ('splitter', 'best')])

### Logistic Regression

In [36]:
max_iter = Integer(100, 300)
c = Categorical([0.001, 0.01, 0.1, 1, 10])

space_lr = [
    {
        'logisticregression__solver': Categorical(['newton-cg', 'lbfgs']),
        'logisticregression__penalty': Categorical(['l2']),
        'logisticregression__max_iter': max_iter,
        'logisticregression__C': c
    },
    {
        'logisticregression__solver': Categorical(['liblinear']),
        'logisticregression__penalty': Categorical(['l1', 'l2']),
        'logisticregression__max_iter': max_iter,
        'logisticregression__C': c
    },
]

regression_nested_bcv = BayesSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression()),
    n_iter=25,
    search_spaces=space_lr,
    cv=inner_cv,
    scoring="recall",
    n_jobs=-1,
    random_state=SEED
)

regression_nested_bcv_results = cross_val_score(
    regression_nested_bcv, x_treino, y_treino, cv=outer_cv
)


The objective has been evaluated at point [0.1, 248, 'l2', 'liblinear'] before, using random point [0.001, 183, 'l1', 'liblinear']


The objective has been evaluated at point [0.001, 100, 'l2', 'liblinear'] before, using random point [0.01, 146, 'l1', 'liblinear']


The objective has been evaluated at point [0.001, 100, 'l2', 'liblinear'] before, using random point [0.1, 168, 'l2', 'liblinear']


The objective has been evaluated at point [0.001, 100, 'l2', 'liblinear'] before, using random point [1, 137, 'l1', 'liblinear']


The objective has been evaluated at point [0.001, 100, 'l2', 'liblinear'] before, using random point [10, 100, 'l1', 'liblinear']


The objective has been evaluated at point [0.001, 300, 'l2', 'liblinear'] before, using random point [0.01, 264, 'l1', 'liblinear']


The objective has been evaluated at point [0.001, 300, 'l2', 'liblinear'] before, using random point [10, 116, 'l2', 'liblinear']


The objective has been evaluated at point [0.001, 100, 'l2', 'liblinea

In [37]:
print(f"Média do Recall: {regression_nested_bcv_results.mean()}")

Média do Recall: 0.2632900907199874


In [38]:
regression_nested_bcv.fit(x_treino, y_treino)
regression_nested_bcv.best_params_

OrderedDict([('logisticregression__C', 0.001),
             ('logisticregression__max_iter', 100),
             ('logisticregression__penalty', 'l2'),
             ('logisticregression__solver', 'liblinear')])