
Dataset of people who do or do not have liver problems. Binary classification.

Obtained from https://openml.org/search?type=data&status=active&id=1480

In [2]:
from sklearn.datasets import fetch_openml
ilpd = fetch_openml(name='ilpd', version=1, parser='auto')
X, y = ilpd.data.to_numpy(), ilpd.target.to_numpy()

In [3]:
X[:,1] = (X[:,1] == 'Male').astype(int) #Encoding Male = 1, Female = 0
y = y.astype(int)-1 #Converting the original text labels to 0 and 1 (0 = no problem, 1 = has problem)
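
The two conversions above can be checked in isolation on a toy array. This is only a sketch: the values below are made up, and the raw labels `'1'` and `'2'` are an assumption based on the `-1` offset used in the cell.

```python
import numpy as np

# Toy stand-ins for the sex column and the raw class labels (illustrative values only)
sex = np.array(['Male', 'Female', 'Male'], dtype=object)
raw_target = np.array(['1', '2', '2'], dtype=object)  # assumed raw label format

# Element-wise comparison yields booleans; astype(int) maps True -> 1, False -> 0
sex_encoded = (sex == 'Male').astype(int)

# Text labels are parsed as integers, then shifted so they start at 0
target_encoded = raw_target.astype(int) - 1

print(sex_encoded)     # [1 0 1]
print(target_encoded)  # [0 1 1]
```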

In [4]:
#Splitting the data into training and test sets
from sklearn.model_selection import train_test_split

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)
Xtr.shape, ytr.shape, Xte.shape, yte.shape

((466, 10), (466,), (117, 10), (117,))
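
With imbalanced classes, a plain random split can leave the two sets with slightly different class ratios. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic data with the same number of rows (583) and illustrative class counts; `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 583 rows, 10 features, roughly 29% positives (illustrative counts)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(583, 10))
y_demo = np.array([1] * 167 + [0] * 416)

# stratify=y_demo keeps the class ratio nearly identical in both splits
Xtr_d, Xte_d, ytr_d, yte_d = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(Xtr_d.shape, Xte_d.shape)
print(ytr_d.mean(), yte_d.mean())
```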

In [5]:
#Using SMOTE to rebalance the training set, since the positive and negative classes are split roughly 28/72% in this dataset
#After balancing, the split is 50/50
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state = 42)
print(f"Original balance: {sum(ytr)/len(ytr)}")
Xtr, ytr = sm.fit_resample(Xtr, ytr.ravel())
print(f"Balance after SMOTE: {sum(ytr)/len(ytr)}")

Original balance: 0.2939914163090129
Balance after SMOTE: 0.5


In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, make_scorer, accuracy_score, roc_auc_score, recall_score
import numpy as np
from sklearn.model_selection import cross_validate
from scipy.stats import randint

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
params_LR = {
    'scaler__with_mean': [True, False],
    'clf__solver': ['liblinear', 'newton-cg', 'lbfgs'],
    'clf__C': [0.1, 0.5, 0.8, 1.0],
    'clf__tol': [1e-5, 1e-4, 1e-3, 1e-2],
    'clf__max_iter': [300, 600, 1000]
}

#Using recall because detecting that a person has the problem matters more than detecting that they do not
modelo_aninhado_LR = GridSearchCV(pipe, params_LR, verbose=1, scoring=make_scorer(recall_score))

scores_LR = cross_validate(modelo_aninhado_LR, Xtr, ytr, return_estimator=True,
                        scoring=make_scorer(recall_score))
print(f"Mean score: {np.mean(scores_LR['test_score'])}")
print("Best models:")
for modelo_LR, score in zip(scores_LR['estimator'], scores_LR['test_score']):
    print(modelo_LR.best_params_, score)
    print(f"Test-set metric: {recall_score(yte, modelo_LR.predict(Xte))}")

Fitting 5 folds for each of 288 candidates, totalling 1440 fits
Fitting 5 folds for each of 288 candidates, totalling 1440 fits
Fitting 5 folds for each of 288 candidates, totalling 1440 fits
Fitting 5 folds for each of 288 candidates, totalling 1440 fits
Fitting 5 folds for each of 288 candidates, totalling 1440 fits
Mean score: 0.8662937062937063
Best models:
{'clf__C': 0.1, 'clf__max_iter': 300, 'clf__solver': 'liblinear', 'clf__tol': 1e-05, 'scaler__with_mean': True} 0.8636363636363636
Test-set metric: 0.9
{'clf__C': 0.1, 'clf__max_iter': 300, 'clf__solver': 'liblinear', 'clf__tol': 0.01, 'scaler__with_mean': True} 0.8636363636363636
Test-set metric: 0.9
{'clf__C': 0.1, 'clf__max_iter': 300, 'clf__solver': 'liblinear', 'clf__tol': 1e-05, 'scaler__with_mean': True} 0.8939393939393939
Test-set metric: 0.9333333333333333
{'clf__C': 0.1, 'clf__max_iter': 300, 'clf__solver': 'liblinear', 'clf__tol': 1e-05, 'scaler__with_mean': True} 0.87692307692307
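
The candidate and fit counts in the log follow directly from the grid size. As a quick check, `ParameterGrid` enumerates the same combinations that `GridSearchCV` tries. The inner fold count of 5 is scikit-learn's default `cv`, and the log line repeats five times because `cross_validate` also defaults to 5 outer folds.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the logistic regression cell
params_LR = {
    'scaler__with_mean': [True, False],
    'clf__solver': ['liblinear', 'newton-cg', 'lbfgs'],
    'clf__C': [0.1, 0.5, 0.8, 1.0],
    'clf__tol': [1e-5, 1e-4, 1e-3, 1e-2],
    'clf__max_iter': [300, 600, 1000],
}

# Number of combinations: 2 * 3 * 4 * 4 * 3
n_candidates = len(ParameterGrid(params_LR))

inner_folds = 5  # GridSearchCV default when cv is not given
fits_per_search = n_candidates * inner_folds

print(n_candidates, fits_per_search)  # 288 1440
```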

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, make_scorer, accuracy_score, roc_auc_score, recall_score
import numpy as np
from sklearn.model_selection import cross_validate
from scipy.stats import randint

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier())
])
params_RF = {
    'scaler__with_mean': [True, False],
    'clf__criterion': ['gini', 'entropy', 'log_loss'],
    'clf__max_depth': [1, 5, 10, 15, 20],
    'clf__n_estimators': [50, 100, 200, 400]
}

modelo_aninhado_RF = GridSearchCV(pipe, params_RF, verbose=1, scoring=make_scorer(recall_score))

scores_RF = cross_validate(modelo_aninhado_RF, Xtr, ytr, return_estimator=True,
                        scoring=make_scorer(recall_score))
print(f"Mean score: {np.mean(scores_RF['test_score'])}")
print("Best models:")
for modelo_RF, score in zip(scores_RF['estimator'], scores_RF['test_score']):
    print(modelo_RF.best_params_, score)
    print(f"Test-set metric: {recall_score(yte, modelo_RF.predict(Xte))}")

Fitting 5 folds for each of 120 candidates, totalling 600 fits
Fitting 5 folds for each of 120 candidates, totalling 600 fits
Fitting 5 folds for each of 120 candidates, totalling 600 fits
Fitting 5 folds for each of 120 candidates, totalling 600 fits
Fitting 5 folds for each of 120 candidates, totalling 600 fits
Mean score: 0.8633566433566433
Best models:
{'clf__criterion': 'log_loss', 'clf__max_depth': 1, 'clf__n_estimators': 50, 'scaler__with_mean': False} 0.8333333333333334
Test-set metric: 0.9333333333333333
{'clf__criterion': 'entropy', 'clf__max_depth': 1, 'clf__n_estimators': 50, 'scaler__with_mean': False} 0.803030303030303
Test-set metric: 0.9333333333333333
{'clf__criterion': 'entropy', 'clf__max_depth': 1, 'clf__n_estimators': 50, 'scaler__with_mean': False} 0.8787878787878788
Test-set metric: 0.9666666666666667
{'clf__criterion': 'log_loss', 'clf__max_depth': 1, 'clf__n_estimators': 50, 'scaler__with_mean': True} 0.9076923076923077
Mét