# Cross-validation

Estimate the model's accuracy with cross-validation: the data is split into n folds, and the score is averaged over the n held-out folds.

Dataset obtained from Kaggle (https://www.kaggle.com/amanajmera1/framingham-heart-study-dataset)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
dados = pd.read_csv('../Regressao_Logistica/framingham.csv')

In [3]:
dados.head()

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0


In [4]:
dados = dados.dropna()

Dropping the education column

In [5]:
dados = dados.drop('education',axis=1)

Normalizing the numeric columns to the range [0, 1]

In [6]:
colunas = ['age','cigsPerDay','totChol','sysBP','diaBP','BMI','heartRate','glucose']

In [7]:
from sklearn.preprocessing import MinMaxScaler

In [8]:
normalizador = MinMaxScaler(feature_range=(0,1)) 

In [9]:
dados[colunas] = normalizador.fit_transform(dados[colunas])
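As a rough illustration of what min-max scaling with feature_range=(0,1) computes per column, the sketch below applies the formula (x - min) / (max - min) by hand. It uses only the five age values visible in the dados.head() output, so the results differ from the notebook, where the minimum and maximum come from the full column.

```python
import numpy as np

# Ages from the first five rows shown by dados.head() (illustration only)
age = np.array([39, 46, 48, 61, 46], dtype=float)

# Min-max scaling to [0, 1]: subtract the minimum, divide by the range
age_scaled = (age - age.min()) / (age.max() - age.min())

print(age_scaled)
```

The smallest value maps to 0 and the largest to 1; everything else lands proportionally in between.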

Balancing the sample (undersampling the majority class)

In [10]:
contagem = dados['TenYearCHD'].value_counts()

In [11]:
dados_1 = dados[dados['TenYearCHD']==1]
dados_0 = dados[dados['TenYearCHD']==0]

In [12]:
dados_0_novo = dados_0.sample(n=contagem[1],random_state=42)

In [13]:
dados = pd.concat([dados_0_novo,dados_1])
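The balancing steps above can be sketched on a small toy frame to make the effect visible. The frame `toy` and its column `x` are hypothetical; only the `TenYearCHD` column name and the sampling calls mirror the notebook.

```python
import pandas as pd

# Hypothetical toy frame: 8 negatives (0) and 3 positives (1)
toy = pd.DataFrame({
    "x": range(11),
    "TenYearCHD": [0] * 8 + [1] * 3,
})

counts = toy["TenYearCHD"].value_counts()
minority = toy[toy["TenYearCHD"] == 1]
majority = toy[toy["TenYearCHD"] == 0]

# Undersample the majority class down to the size of the minority class
majority_down = majority.sample(n=counts[1], random_state=42)
balanced = pd.concat([majority_down, minority])

balanced_counts = balanced["TenYearCHD"].value_counts()
print(balanced_counts)
```

After undersampling, both classes have the same number of rows, at the cost of discarding part of the majority class.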

Defining the X and Y variables

In [14]:
X = dados.drop('TenYearCHD',axis=1).values
Y = dados['TenYearCHD'].values

Creating the logistic regression model

In [15]:
from sklearn.linear_model import LogisticRegression

In [16]:
modelo = LogisticRegression()

Running cross-validation with 5 folds

In [17]:
from sklearn.model_selection import cross_val_score

In [18]:
modelo_cv = cross_val_score(modelo, X, Y, cv=5)

In [19]:
modelo_cv

array([0.632287  , 0.60538117, 0.65919283, 0.69506726, 0.63063063])
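To make it clearer where the five scores come from, here is a minimal sketch of the loop that cross_val_score performs for a classifier when cv=5: a stratified 5-fold split without shuffling, a fresh copy of the model fitted on four folds, and the score computed on the remaining fold. The data comes from make_classification as a synthetic stand-in for the Framingham data (an assumption), so the numbers will not match the notebook's.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the notebook's X and Y (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

model = LogisticRegression()
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    # Fit a fresh, unfitted copy of the model on the training folds
    fold_model = clone(model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Score (accuracy) on the held-out fold
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(manual_scores)
print(library_scores)
```

Both arrays should agree, since the manual loop uses the same splitter and the same default scorer.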

In [20]:
print("Accuracy: %0.2f (+/- %0.2f)" % (modelo_cv.mean(), modelo_cv.std() * 2))

Accuracy: 0.64 (+/- 0.06)
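The summary line reports the mean of the five fold scores plus or minus two standard deviations. As a quick check, the same figures can be reproduced from the printed array using only the standard library; pstdev is the population standard deviation, which is also what NumPy's std() computes by default.

```python
import statistics

# Fold scores copied from the cross_val_score output above
scores = [0.632287, 0.60538117, 0.65919283, 0.69506726, 0.63063063]

mean = statistics.mean(scores)
spread = 2 * statistics.pstdev(scores)  # two population standard deviations

summary = "%0.2f (+/- %0.2f)" % (mean, spread)
print(summary)
```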


When no scoring is specified, cross_val_score uses the default scorer of the chosen model, which is accuracy for LogisticRegression. We will now change the metric and observe the effect.
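A small check of the default-metric behaviour, assuming synthetic data from make_classification rather than the Framingham set: calling cross_val_score without scoring and with scoring='accuracy' should give the same fold scores for LogisticRegression.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the notebook's X and Y (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
model = LogisticRegression()

# No scoring given: falls back to the estimator's own score method
default_scores = cross_val_score(model, X_demo, y_demo, cv=5)
# Explicitly asking for accuracy
accuracy_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

same = np.allclose(default_scores, accuracy_scores)
print(same)
```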

In [21]:
modelo_cv_f1 = cross_val_score(modelo, X, Y, cv=5, scoring='f1_macro')

In [22]:
print("F1 (macro): %0.2f (+/- %0.2f)" % (modelo_cv_f1.mean(), modelo_cv_f1.std() * 2))

F1 (macro): 0.64 (+/- 0.06)


In [23]:
modelo_cv_prec = cross_val_score(modelo, X, Y, cv=5, scoring='precision')

In [24]:
print("Precision: %0.2f (+/- %0.2f)" % (modelo_cv_prec.mean(), modelo_cv_prec.std() * 2))

Precision: 0.64 (+/- 0.08)


In [25]:
modelo_cv_recall = cross_val_score(modelo, X, Y, cv=5, scoring='recall')

In [26]:
print("Recall: %0.2f (+/- %0.2f)" % (modelo_cv_recall.mean(), modelo_cv_recall.std() * 2))

Recall: 0.66 (+/- 0.07)
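The notebook runs cross_val_score once per metric, which refits the model each time. One possible alternative, not used in the original, is scikit-learn's cross_validate, which accepts a list of metrics and evaluates all of them on the same folds. The sketch below uses synthetic data from make_classification (an assumption), so its numbers are not comparable with the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the notebook's X and Y (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

metrics = ["accuracy", "f1_macro", "precision", "recall"]
results = cross_validate(LogisticRegression(), X_demo, y_demo, cv=5, scoring=metrics)

# Each metric is stored under the key "test_<metric name>"
for name in metrics:
    scores = results["test_" + name]
    print("%s: %0.2f (+/- %0.2f)" % (name, scores.mean(), scores.std() * 2))
```

Labelling each line with the metric name also avoids reporting F1, precision, or recall under an accuracy label.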
