# Grid Search CV

Code to determine the best input parameters for a logistic regression model using Grid Search CV.

Dataset obtained from Kaggle (https://www.kaggle.com/amanajmera1/framingham-heart-study-dataset)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
dados = pd.read_csv('../Regressao_Logistica/framingham.csv')

In [3]:
dados.head()

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0


In [4]:
dados = dados.dropna()

In [5]:
dados = dados.drop('education',axis=1)

Normalizing the data

In [6]:
from sklearn.preprocessing import MinMaxScaler
colunas = ['age','cigsPerDay','BPMeds','totChol','sysBP','diaBP','BMI','heartRate','glucose']
normalizador = MinMaxScaler(feature_range=(0,1)) 
dados[colunas] = normalizador.fit_transform(dados[colunas])
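As an illustration of what `MinMaxScaler(feature_range=(0,1))` computes for each column, here is a minimal pure-Python sketch of the min-max formula `(x - min) / (max - min)`. The helper `min_max` is hypothetical and only for illustration. It uses the five `age` values visible in `dados.head()`, whereas the notebook scales each full column, so the actual scaled values will differ.

```python
def min_max(valores):
    """Rescale a list of numbers to the [0, 1] range (assumes they are not all equal)."""
    minimo, maximo = min(valores), max(valores)
    return [(v - minimo) / (maximo - minimo) for v in valores]

# 'age' values from the five rows shown by dados.head() (illustration only)
idades = [39, 46, 48, 61, 46]
idades_normalizadas = min_max(idades)

print([round(v, 3) for v in idades_normalizadas])
# prints: [0.0, 0.318, 0.409, 1.0, 0.318]
```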

Balancing the classes

In [7]:
contagem = dados['TenYearCHD'].value_counts()
dados_1 = dados[dados['TenYearCHD']==1]
dados_0 = dados[dados['TenYearCHD']==0]
dados_0_novo = dados_0.sample(n=contagem[1],random_state=42)
dados = pd.concat([dados_0_novo,dados_1])
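The cell above undersamples the majority class (`TenYearCHD == 0`) down to the size of the minority class. A small sketch on a toy DataFrame (the data and the names `toy`, `majority`, `minority` and `balanced` are assumptions for illustration) shows the same steps and checks that both classes end up with the same number of rows.

```python
import pandas as pd

# Toy data (assumption): 8 negative rows and 3 positive rows
toy = pd.DataFrame({"TenYearCHD": [0] * 8 + [1] * 3, "age": range(11)})

counts = toy["TenYearCHD"].value_counts()
minority = toy[toy["TenYearCHD"] == 1]
# Draw as many majority rows as there are minority rows
majority = toy[toy["TenYearCHD"] == 0].sample(n=counts.loc[1], random_state=42)
balanced = pd.concat([majority, minority])

# Both classes should now have 3 rows each
print(balanced["TenYearCHD"].value_counts().to_dict())
```

Running the same `value_counts()` check on `dados` after the cell above would confirm the balance on the real data.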

Defining the X and Y variables

In [8]:
X = dados.drop('TenYearCHD',axis=1).values
Y = dados['TenYearCHD'].values

Splitting into training and test samples

In [9]:
from sklearn.model_selection import train_test_split
X_treino,X_teste,Y_treino,Y_teste=train_test_split(X,Y,test_size=0.25,random_state=0)

Creating the initial logistic regression model

In [10]:
from sklearn.linear_model import LogisticRegression
modelo = LogisticRegression()

In [11]:
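modelo.get_params  # without parentheses this only displays the bound method; call modelo.get_params() to list the parameters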

<bound method BaseEstimator.get_params of LogisticRegression()>

Implementing GridSearchCV to determine the best model parameters

In [12]:
from sklearn.model_selection import GridSearchCV

In [13]:
parametros = {"C" : [0.1,0.2,0.3,0.4,0.45,0.49,0.5,0.51,0.55,0.6,0.65,0.69,0.7,0.71,0.72,0.73,0.74,0.75,0.77,0.8,0.9,1.0],
             "solver" : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
            "penalty" :['l1','l2']}

In [14]:
melhor_modelo = GridSearchCV(modelo, parametros, n_jobs=-1, cv=2, refit=True)

In [15]:
melhor_modelo.fit(X_treino, Y_treino)

GridSearchCV(cv=2, estimator=LogisticRegression(), n_jobs=-1,
             param_grid={'C': [0.1, 0.2, 0.3, 0.4, 0.45, 0.49, 0.5, 0.51, 0.55,
                               0.6, 0.65, 0.69, 0.7, 0.71, 0.72, 0.73, 0.74,
                               0.75, 0.77, 0.8, 0.9, 1.0],
                         'penalty': ['l1', 'l2'],
                         'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag',
                                    'saga']})
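The grid above crosses every solver with every penalty, but according to the scikit-learn documentation the `newton-cg`, `lbfgs` and `sag` solvers do not support the `l1` penalty. That makes 3 × 22 = 66 of the 220 combinations unsupported, and those fits would be expected to fail during the search. One possible alternative, sketched below, is to pass `param_grid` as a list of dictionaries so that only compatible pairs are tried. The names `valores_c`, `parametros_compativeis` and `contar_combinacoes` are hypothetical, and the counting helper is there only to show the size of the reduced grid.

```python
from itertools import product

# Same C values as in the notebook
valores_c = [0.1, 0.2, 0.3, 0.4, 0.45, 0.49, 0.5, 0.51, 0.55, 0.6, 0.65,
             0.69, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.77, 0.8, 0.9, 1.0]

# One dictionary per group of solvers that share the same supported penalties
parametros_compativeis = [
    {"solver": ["newton-cg", "lbfgs", "sag"], "penalty": ["l2"], "C": valores_c},
    {"solver": ["liblinear", "saga"], "penalty": ["l1", "l2"], "C": valores_c},
]

def contar_combinacoes(grade):
    """Count the parameter combinations described by a list of grid dictionaries."""
    return sum(len(list(product(*bloco.values()))) for bloco in grade)

print(contar_combinacoes(parametros_compativeis))
# prints: 154
```

A list like `parametros_compativeis` can be passed to `GridSearchCV` in place of `parametros`.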

In [16]:
modelo_final = melhor_modelo.best_estimator_

In [17]:
modelo_final

LogisticRegression(penalty='l1', solver='saga')

Using the refined model to predict values

In [18]:
Y_previsto = modelo_final.predict(X_teste)

Generating the confusion matrix

In [19]:
from sklearn.metrics import confusion_matrix

In [20]:
cm=confusion_matrix(Y_teste,Y_previsto)
cm

array([[ 84,  37],
       [ 52, 106]])
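To read the matrix above, recall that `confusion_matrix` places the true classes on the rows and the predicted classes on the columns. The sketch below derives accuracy, precision and recall for class 1 directly from the four printed counts. The variable names are chosen here for illustration.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
cm = [[84, 37],
      [52, 106]]

(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision = tp / (tp + fp)                   # predicted positives that are correct
recall = tp / (tp + fn)                      # true positives that were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# prints: accuracy=0.681 precision=0.741 recall=0.671
```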

The refined model gave worse results than the initial model, indicating that it is not well suited to this type of problem.
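The notebook does not show the score of the initial model, so the comparison cannot be reproduced from the cells above. A minimal sketch of one way to make it explicit is given below. It uses synthetic data from `make_classification` as a stand-in (an assumption, as are the names `X_demo`, `y_demo`, `baseline` and `busca`) and tunes only `C` to keep the example short. Substituting the notebook's `X` and `Y` would be needed to check the claim on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): replace with X and Y from the notebook
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Baseline: logistic regression without any tuning
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc_baseline = baseline.score(X_te, y_te)

# Tuned: grid search over C only, evaluated on the same test split
busca = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 0.5, 1.0]}, cv=2)
busca.fit(X_tr, y_tr)
acc_tuned = busca.score(X_te, y_te)  # scored with the refitted best estimator

print("baseline accuracy:", acc_baseline)
print("tuned accuracy:   ", acc_tuned)
print("best parameters:  ", busca.best_params_)
```

Scoring both models on the same held-out split keeps the comparison fair. A difference of a few test samples may still be within noise for a test set of this size.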