Let's predict the likelihood of a person developing heart disease.<br>
Because this is a binary classification problem, we will train two machine learning models, a Random Forest and a Logistic Regression, and choose the better of the two.<br>
Because the dataset is already organized and cleaned, no further cleaning steps are needed.
#### Dataset obtained from Kaggle: https://www.kaggle.com/datasets/luyezhang/heart-2020-cleaned

Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

In [2]:
# Load the dataset
df = pd.read_csv('heart_2020_cleaned.csv')

In [3]:
# Show the first 5 records
df.head()

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,No,16.6,Yes,No,No,3.0,30.0,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes
1,No,20.34,No,No,Yes,0.0,0.0,No,Female,80 or older,White,No,Yes,Very good,7.0,No,No,No
2,No,26.58,Yes,No,No,20.0,30.0,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No
3,No,24.21,No,No,No,0.0,0.0,No,Female,75-79,White,No,No,Good,6.0,No,No,Yes
4,No,23.71,No,No,No,28.0,0.0,Yes,Female,40-44,White,No,Yes,Very good,8.0,No,No,No


In [4]:
# Statistical summary of the numeric columns
df.describe()

Unnamed: 0,BMI,PhysicalHealth,MentalHealth,SleepTime
count,319795.0,319795.0,319795.0,319795.0
mean,28.325399,3.37171,3.898366,7.097075
std,6.3561,7.95085,7.955235,1.436007
min,12.02,0.0,0.0,1.0
25%,24.03,0.0,0.0,6.0
50%,27.34,0.0,0.0,7.0
75%,31.42,2.0,3.0,8.0
max,94.85,30.0,30.0,24.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319795 entries, 0 to 319794
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   HeartDisease      319795 non-null  object 
 1   BMI               319795 non-null  float64
 2   Smoking           319795 non-null  object 
 3   AlcoholDrinking   319795 non-null  object 
 4   Stroke            319795 non-null  object 
 5   PhysicalHealth    319795 non-null  float64
 6   MentalHealth      319795 non-null  float64
 7   DiffWalking       319795 non-null  object 
 8   Sex               319795 non-null  object 
 9   AgeCategory       319795 non-null  object 
 10  Race              319795 non-null  object 
 11  Diabetic          319795 non-null  object 
 12  PhysicalActivity  319795 non-null  object 
 13  GenHealth         319795 non-null  object 
 14  SleepTime         319795 non-null  float64
 15  Asthma            319795 non-null  object 
 16  KidneyDisease     31

## Random Forest

#### Data Preparation

Creating the independent (predictor) variables and the dependent (class) variable

In [6]:
labelencoder = LabelEncoder()

In [7]:
classe = df['HeartDisease']
previsores = df.drop('HeartDisease', axis=1)

In [8]:
# Apply LabelEncoder to the columns treated as binary
# (mostly 'Yes'/'No'; 'Sex' holds 'Female'/'Male')
labelencoder = LabelEncoder()
binary_cols = ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Diabetic', 'Asthma', 'KidneyDisease', 'SkinCancer', 'Sex']
for col in binary_cols:
    previsores[col] = labelencoder.fit_transform(previsores[col])

# Apply LabelEncoder to the multi-valued columns 'AgeCategory', 'Race' and 'GenHealth'
categorical_cols = ['AgeCategory', 'Race', 'GenHealth']
for col in categorical_cols:
    previsores[col] = labelencoder.fit_transform(previsores[col])

# Apply one-hot encoding to the remaining categorical columns (here, 'PhysicalActivity')
previsores_encoded = pd.get_dummies(previsores, drop_first=True)
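
`LabelEncoder` assigns integer codes following the sorted order of the labels, which matters for an ordered column such as `GenHealth`. The sketch below uses toy values rather than the notebook's data. The labels `Excellent` and `Poor`, and the explicit `health_order` mapping, are assumptions introduced only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values; 'Excellent' and 'Poor' are assumed labels, not shown in the output above.
toy = ['Very good', 'Fair', 'Good', 'Excellent', 'Poor']

# LabelEncoder sorts the labels, so the codes follow alphabetical order.
encoder = LabelEncoder()
alphabetical_codes = encoder.fit_transform(toy)
label_to_code = {str(label): code for code, label in enumerate(encoder.classes_)}

# Hypothetical alternative: an explicit mapping that keeps the natural order.
health_order = {'Poor': 0, 'Fair': 1, 'Good': 2, 'Very good': 3, 'Excellent': 4}
ordinal_codes = [health_order[value] for value in toy]

print(label_to_code)
print(ordinal_codes)
```

Tree-based models can usually cope with an arbitrary code order. For a linear model such as Logistic Regression, an explicit order or one-hot encoding is often the safer choice.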

#### Splitting the data into training and test sets

20% of the records are held out for testing (`test_size=0.2`), and `random_state=0` makes the split reproducible.

In [9]:
X_treino, X_teste, y_treino, y_teste = train_test_split(previsores_encoded, classe, test_size=0.2, random_state=0)
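
The split above is purely random. When the target is imbalanced, passing `stratify` keeps the class proportions equal in both subsets. This is a minimal sketch on synthetic labels; the 10% positive rate and the array names are assumptions, not values taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 10% positives (assumed rate, for illustration only).
X = np.arange(1000).reshape(-1, 1)
y = np.array(['Yes'] * 100 + ['No'] * 900)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# With stratification, both subsets keep the same share of 'Yes' labels.
train_rate = np.mean(y_train == 'Yes')
test_rate = np.mean(y_test == 'Yes')
print(len(X_train), len(X_test), train_rate, test_rate)
```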

Creating the model, training it, and obtaining the predictions and the accuracy

#### Initialize the Random Forest model

In [10]:
rnd_forest = RandomForestClassifier(n_estimators=150)
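
The forest above is created without `random_state`, so repeated runs may produce slightly different results. The sketch below shows a seeded configuration on synthetic data. The seed value, the helper `fit_predict`, and the toy data from `make_classification` are assumptions added for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real predictors and class.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

def fit_predict(seed):
    """Train a seeded forest and return its predictions on X."""
    model = RandomForestClassifier(n_estimators=150, random_state=seed)
    return model.fit(X, y).predict(X)

# Two runs with the same seed build the same trees and agree on every row.
first_run = fit_predict(42)
second_run = fit_predict(42)
print((first_run == second_run).all())
```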

#### Train the model

In [11]:
rnd_forest.fit(X_treino, y_treino)

RandomForestClassifier(n_estimators=150)

In [13]:
previsoes = rnd_forest.predict(X_teste)
confusao = confusion_matrix(y_teste, previsoes)
confusao

array([[57383,  1129],
       [ 4821,   626]], dtype=int64)
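
To read the matrix: scikit-learn sorts the labels, so row and column 0 correspond to `'No'` and row and column 1 to `'Yes'`. The sketch below recomputes a few metrics from the printed values; treating `'Yes'` as the positive class is an assumption of this example.

```python
# Values copied from the confusion matrix above.
# Rows are actual classes, columns are predicted classes ('No' first, then 'Yes').
confusion = [[57383, 1129],
             [4821, 626]]

tn, fp = confusion[0]
fn, tp = confusion[1]

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
recall = tp / (tp + fn)       # share of actual 'Yes' cases that were found
precision = tp / (tp + fp)    # share of predicted 'Yes' cases that were correct

print(f'accuracy:  {accuracy:.4f}')
print(f'recall:    {recall:.4f}')
print(f'precision: {precision:.4f}')
```

Recall comes out at roughly 11.5% and precision at roughly 35.7%, so the model misses most of the actual heart disease cases even though overall accuracy is high.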

In [19]:
taxa_acerto = accuracy_score(y_teste, previsoes) * 100
print(f'Model accuracy: {taxa_acerto:.2f}%')

Model accuracy: 90.70%
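
Accuracy on an imbalanced target is easier to judge next to a trivial baseline. The sketch below uses only the counts from the confusion matrix printed earlier. The "always predict `'No'`" classifier is a hypothetical reference point, not something the notebook trains.

```python
# Test-set class counts, taken from the rows of the confusion matrix.
actual_no = 57383 + 1129   # row 0: actual 'No'
actual_yes = 4821 + 626    # row 1: actual 'Yes'
total = actual_no + actual_yes

# Hypothetical baseline: a classifier that always predicts 'No'.
baseline_accuracy = actual_no / total * 100
model_accuracy = (57383 + 626) / total * 100

print(f'Always-No baseline: {baseline_accuracy:.2f}%')
print(f'Random Forest:      {model_accuracy:.2f}%')
```

On these counts the baseline reaches about 91.48%, slightly above the forest's 90.70%. Accuracy alone is therefore not enough to choose between models here.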
