### Stroke Prediction Dataset - Cerebrovascular Accident (Stroke)
* https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

* Stroke is a medical emergency. A stroke occurs when blood flow to a part of the brain is interrupted or reduced, preventing brain tissue from getting oxygen and nutrients. Brain cells begin to die within minutes.

Main risk factors:

* Age: people aged 55 years and over

* Hypertension: systolic pressure of 140 mm Hg or more, or diastolic pressure of 90 mm Hg or more

* Hypercholesterolemia: blood cholesterol level of 200 milligrams per deciliter or more

* Smoking

* Diabetes

* Obesity: body mass index (BMI) of 30 or more
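
The numeric thresholds listed above can be written as simple checks. This is only a minimal sketch: the function and its parameter names (`systolic`, `diastolic`, `cholesterol`) are hypothetical and are not columns of the dataset used below, and each threshold is treated as "this value or more".

```python
def risk_flags(age, systolic, diastolic, cholesterol, bmi):
    """Return which threshold-based risk factors apply.

    Hypothetical helper for illustration: blood pressure in mm Hg,
    cholesterol in mg/dL, BMI in kg/m^2.
    """
    return {
        "age": age >= 55,
        "hypertension": systolic >= 140 or diastolic >= 90,
        "hypercholesterolemia": cholesterol >= 200,
        "obesity": bmi >= 30,
    }


# Made-up patient values, only to show the checks in action
flags = risk_flags(age=67, systolic=135, diastolic=92, cholesterol=180, bmi=36.6)
print(flags)
```

Smoking and diabetes are yes/no conditions, so they are left out of the threshold checks.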



In [63]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

In [64]:
# Load the dataset
df = pd.read_csv('/content/healthcare-dataset-stroke-data.csv')

In [65]:
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


### Cleaning and Normalization

In [66]:
# Data structure and column types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [67]:
df.isnull().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

In [68]:
df.shape

(5110, 12)

Cleaning the BMI attribute

In [69]:
media = df['bmi'].mean()

In [70]:
media

28.893236911794666

In [71]:
# Assign the result back; inplace=True on a column selection is deprecated in recent pandas
df['bmi'] = df['bmi'].fillna(media)

In [72]:
df.isnull().sum()

id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64
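
The mean imputation above can be checked on a small example. This is a minimal sketch with made-up values (the `toy` frame is illustrative, not part of the dataset). `mean()` skips missing values, and the filled column is assigned back rather than modified through `inplace=True`.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (illustrative numbers only)
toy = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, 34.4, 24.0]})

mean_bmi = toy["bmi"].mean()              # NaN is skipped when averaging
toy["bmi"] = toy["bmi"].fillna(mean_bmi)  # assign the filled column back

print(mean_bmi)
print(toy["bmi"].isnull().sum())
```

Imputing with the mean keeps the column average unchanged but shrinks its variance slightly, since every filled row sits exactly at the mean.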

Normalization with StandardScaler()

In [73]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                5110 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [74]:
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,28.893237,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [75]:
columns = ['age', 'avg_glucose_level', 'bmi']

In [76]:
scaler = StandardScaler()

In [77]:
df[columns] = scaler.fit_transform(df[columns])

In [78]:
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,1.051434,0,1,Yes,Private,Urban,2.706375,1.001234,formerly smoked,1
1,51676,Female,0.78607,0,0,Yes,Self-employed,Rural,2.121559,4.615554e-16,never smoked,1
2,31112,Male,1.62639,0,1,Yes,Private,Rural,-0.005028,0.4685773,never smoked,1
3,60182,Female,0.255342,0,0,Yes,Private,Urban,1.437358,0.7154182,smokes,1
4,1665,Female,1.582163,1,0,Yes,Self-employed,Rural,1.501184,-0.6357112,never smoked,1
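
`StandardScaler` replaces each value with `(x - mean) / std`, using the population standard deviation. The sketch below uses made-up numbers to reproduce that by hand. It also shows why the imputed `bmi` in row 1 above appears as roughly `4.6e-16`: a value equal to the column mean is mapped to zero, up to floating-point error.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column; 31.875 is the mean of the other four values,
# so it plays the role of a mean-imputed entry
values = np.array([[36.6], [31.875], [32.5], [34.4], [24.0]])

scaled = StandardScaler().fit_transform(values)

# Same computation by hand: subtract the mean, divide by the population std (ddof=0)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.ravel())
```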


#### Transforming the categorical attributes

In [79]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                5110 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [80]:
# smoking_status
df['smoking_status'].unique()

array(['formerly smoked', 'never smoked', 'smokes', 'Unknown'],
      dtype=object)

In [81]:
# Residence_type
df['Residence_type'].unique()

array(['Urban', 'Rural'], dtype=object)

In [82]:
# work_type
df['work_type'].unique()

array(['Private', 'Self-employed', 'Govt_job', 'children', 'Never_worked'],
      dtype=object)

In [83]:
# ever_married
df['ever_married'].unique()

array(['Yes', 'No'], dtype=object)

In [84]:
# gender
df['gender'].unique()

array(['Male', 'Female', 'Other'], dtype=object)

Transformation with LabelEncoder
  - Residence_type
  - gender
  - ever_married

In [85]:
labelEncoder = LabelEncoder()

In [86]:
df['Residence_type'] = labelEncoder.fit_transform(df['Residence_type'])
df['Residence_type'].unique()

array([1, 0])

In [87]:
df['gender'] = labelEncoder.fit_transform(df['gender'])
df['gender'].unique()

array([1, 0, 2])

In [88]:
df['ever_married'] = labelEncoder.fit_transform(df['ever_married'])
df['ever_married'].unique()

array([1, 0])

In [89]:
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,1,1.051434,0,1,1,Private,1,2.706375,1.001234,formerly smoked,1
1,51676,0,0.78607,0,0,1,Self-employed,0,2.121559,4.615554e-16,never smoked,1
2,31112,1,1.62639,0,1,1,Private,0,-0.005028,0.4685773,never smoked,1
3,60182,0,0.255342,0,0,1,Private,1,1.437358,0.7154182,smokes,1
4,1665,0,1.582163,1,0,1,Self-employed,0,1.501184,-0.6357112,never smoked,1
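
The integer codes above follow from how `LabelEncoder` works: it sorts the distinct labels and numbers them from zero, which is why `Female` becomes 0, `Male` 1 and `Other` 2. A minimal sketch on a made-up list:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["Male", "Female", "Other", "Female"])

print(list(encoder.classes_))                       # labels in sorted order
print(list(codes))                                  # position of each label in classes_
print(list(encoder.inverse_transform([0, 1, 2])))   # map codes back to labels
```

Because the notebook reuses one `labelEncoder` object for three columns, only the last fit (`ever_married`) can be inverted afterwards; a separate encoder per column would keep every mapping. The scikit-learn documentation describes `LabelEncoder` as intended for target labels, so for a three-valued feature such as `gender` the codes 0, 1, 2 imply an order that may not be meaningful.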


Transformation with get_dummies

In [90]:
df['smoking_status'].unique()

array(['formerly smoked', 'never smoked', 'smokes', 'Unknown'],
      dtype=object)

In [91]:
df = pd.get_dummies(data = df, columns=['smoking_status'])

In [92]:
df.columns

Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi', 'stroke',
       'smoking_status_Unknown', 'smoking_status_formerly smoked',
       'smoking_status_never smoked', 'smoking_status_smokes'],
      dtype='object')

In [93]:
df = pd.get_dummies(data = df, columns=['work_type'])

In [94]:
df.columns

Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'Residence_type', 'avg_glucose_level', 'bmi', 'stroke',
       'smoking_status_Unknown', 'smoking_status_formerly smoked',
       'smoking_status_never smoked', 'smoking_status_smokes',
       'work_type_Govt_job', 'work_type_Never_worked', 'work_type_Private',
       'work_type_Self-employed', 'work_type_children'],
      dtype='object')

In [95]:
df.shape

(5110, 19)
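
The column count follows from the encoding: 12 original columns, minus the 2 encoded ones, plus 4 `smoking_status` indicators and 5 `work_type` indicators, gives 19. The sketch below shows the same transformation on a made-up frame; `dtype=int` is an optional choice here to get 0/1 columns instead of booleans.

```python
import pandas as pd

# Toy frame with one categorical column (illustrative values only)
toy = pd.DataFrame({
    "work_type": ["Private", "Self-employed", "Govt_job", "Private"],
    "stroke": [1, 1, 0, 0],
})

# One indicator column per category; the original column is dropped
encoded = pd.get_dummies(toy, columns=["work_type"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```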

In [96]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 19 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              5110 non-null   int64  
 1   gender                          5110 non-null   int64  
 2   age                             5110 non-null   float64
 3   hypertension                    5110 non-null   int64  
 4   heart_disease                   5110 non-null   int64  
 5   ever_married                    5110 non-null   int64  
 6   Residence_type                  5110 non-null   int64  
 7   avg_glucose_level               5110 non-null   float64
 8   bmi                             5110 non-null   float64
 9   stroke                          5110 non-null   int64  
 10  smoking_status_Unknown          5110 non-null   bool   
 11  smoking_status_formerly smoked  5110 non-null   bool   
 12  smoking_status_never smoked     51

In [97]:
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,Residence_type,avg_glucose_level,bmi,stroke,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes,work_type_Govt_job,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children
0,9046,1,1.051434,0,1,1,1,2.706375,1.001234,1,False,True,False,False,False,False,True,False,False
1,51676,0,0.78607,0,0,1,0,2.121559,4.615554e-16,1,False,False,True,False,False,False,False,True,False
2,31112,1,1.62639,0,1,1,0,-0.005028,0.4685773,1,False,False,True,False,False,False,True,False,False
3,60182,0,0.255342,0,0,1,1,1.437358,0.7154182,1,False,False,False,True,False,False,True,False,False
4,1665,0,1.582163,1,0,1,0,1.501184,-0.6357112,1,False,False,True,False,False,False,False,True,False


### Sampling: Holdout and Class Balancing
* Train/test split (70% and 30%)

In [98]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [99]:
# Separate the features from the class label
X = df.drop(['stroke'], axis=1)
y = df['stroke']

In [100]:
X.shape, df.shape

((5110, 18), (5110, 19))

In [101]:
# Holdout sampling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [102]:
X_train.shape, X_test.shape

((3577, 18), (1533, 18))

In [103]:
DT = DecisionTreeClassifier()
DT.fit(X_train, y_train)
y_pred = DT.predict(X_test)


In [104]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      0.95      0.95      1455
           1       0.17      0.21      0.19        78

    accuracy                           0.91      1533
   macro avg       0.56      0.58      0.57      1533
weighted avg       0.92      0.91      0.91      1533



In [105]:
confusion_matrix(y_test,y_pred)

array([[1378,   77],
       [  62,   16]])
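
The figures in the classification report can be recomputed from this confusion matrix, where rows are the true classes and columns are the predictions. The sketch below is plain arithmetic on the four counts shown above. High accuracy alongside low balanced accuracy shows that the model mostly predicts the majority class.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class
cm = np.array([[1378, 77],
               [62, 16]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)          # of the predicted strokes, how many were real
recall_1 = tp / (tp + fn)             # of the real strokes, how many were found
recall_0 = tn / (tn + fp)
accuracy = (tp + tn) / cm.sum()
balanced_accuracy = (recall_0 + recall_1) / 2

print(round(precision_1, 2), round(recall_1, 2),
      round(accuracy, 2), round(balanced_accuracy, 2))
```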

In [106]:
y_test.value_counts()

stroke
0    1455
1      78
Name: count, dtype: int64
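
With so few positive cases, the number that lands in the test set changes from run to run unless the split is stratified and seeded. The sketch below uses synthetic labels with about 5% positives (not the real data) to show `stratify` keeping the class ratio in both partitions. The seed value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the dataset's imbalance (about 5% positives)
y_demo = np.array([1] * 50 + [0] * 950)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,
    stratify=y_demo,   # keep the class ratio in both partitions
    random_state=42,   # arbitrary seed, only for reproducibility
)

print(y_tr.mean(), y_te.mean())
```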

Class balancing
* https://medium.com/analytics-vidhya/undersampling-and-oversampling-an-old-and-a-new-approach-4f984a0e8392
* SMOTE approach

In [107]:
df['stroke'].value_counts()

stroke
0    4861
1     249
Name: count, dtype: int64

In [108]:
from imblearn.over_sampling import SMOTE

In [109]:
sm = SMOTE()
X_balanced, y_balanced = sm.fit_resample(X, y)

In [110]:
X_balanced.shape

(9722, 18)

In [111]:
y_balanced.value_counts()

stroke
1    4861
0    4861
Name: count, dtype: int64

In [112]:
X_train,X_test,y_train, y_test = train_test_split(X_balanced, y_balanced, test_size=0.3)

In [113]:
X_train.shape, X_test.shape

((6805, 18), (2917, 18))

In [114]:
y_train.value_counts(), y_test.value_counts()

(stroke
 1    3421
 0    3384
 Name: count, dtype: int64,
 stroke
 0    1477
 1    1440
 Name: count, dtype: int64)

In [115]:
DT = DecisionTreeClassifier()
DT.fit(X_train, y_train)
y_pred = DT.predict(X_test)


In [116]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.90      0.92      1477
           1       0.90      0.93      0.92      1440

    accuracy                           0.92      2917
   macro avg       0.92      0.92      0.92      2917
weighted avg       0.92      0.92      0.92      2917



In [117]:
confusion_matrix(y_test,y_pred)

array([[1333,  144],
       [ 102, 1338]])
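
In the cells above, SMOTE is applied to the full `X`, `y` before the split, so the test set contains synthetic rows and the training set can contain points interpolated from rows that end up in the test set. The scores above may therefore be optimistic. The sketch below shows the alternative order: split first, then oversample only the training partition. It runs on synthetic data, and `smote_like_oversample` is a hypothetical, simplified stand-in for SMOTE written with NumPy and scikit-learn, not the `imblearn` implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier


def smote_like_oversample(X, y, minority_label=1, k=5, seed=0):
    """Simplified SMOTE-style oversampling (hypothetical helper, not imblearn's SMOTE)."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    n_new = int((y != minority_label).sum() - len(X_min))
    if n_new <= 0:
        return X, y
    # k + 1 neighbours because each point is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)            # random minority rows
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]     # one of their k neighbours
    gap = rng.random((n_new, 1))                              # interpolation factor in [0, 1)
    X_new = X_min[base] + gap * (X_min[neigh] - X_min[base])
    return np.vstack([X, X_new]), np.concatenate([y, np.full(n_new, minority_label)])


X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=0
)

# 1) Split first, so the test set keeps only real rows and the real class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# 2) Balance the training partition only
X_tr_bal, y_tr_bal = smote_like_oversample(X_tr, y_tr)

model = DecisionTreeClassifier(random_state=0).fit(X_tr_bal, y_tr_bal)
score = balanced_accuracy_score(y_te, model.predict(X_te))
print(score)
```

With `imblearn`, the same order can be obtained by calling `fit_resample` on `X_train`, `y_train` after `train_test_split`.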

## Sampling: Cross-Validation

In [118]:
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    StratifiedKFold,
    cross_validate
)

### K-fold Cross-validation

In [119]:
DT = DecisionTreeClassifier()
kf = KFold(n_splits = 10, shuffle = True)
clf = cross_validate(
    DT,
    X,
    y,
    scoring = 'balanced_accuracy',
    cv = kf
)

print(f"{clf['test_score']}\nMean: {np.mean(clf['test_score'])}")

[0.53282828 0.5350119  0.58410549 0.56258503 0.51869128 0.5937187
 0.50646259 0.56934972 0.57176871 0.55416964]
Mean: 0.5528691336595701
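
`cross_validate` can also report several metrics in one run and, with a seeded splitter, return the same folds every time. This is a minimal sketch on synthetic data; the metric list and seed are illustrative choices, not the notebook's settings. Each requested metric is returned under a `test_<name>` key.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 10% positives), only for illustration
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)

cv_result = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X_demo,
    y_demo,
    scoring=["balanced_accuracy", "recall", "precision"],
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
)

for name in ("balanced_accuracy", "recall", "precision"):
    scores = cv_result[f"test_{name}"]
    print(f"{name}: mean={np.mean(scores):.3f} std={np.std(scores):.3f}")
```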


Balanced

In [120]:
DT = DecisionTreeClassifier()
kf = KFold(n_splits = 10, shuffle = True)
clf = cross_validate(
    DT,
    X_balanced,
    y_balanced,
    scoring = 'balanced_accuracy',
    cv = kf
)

print(f"{clf['test_score']}\nMean: {np.mean(clf['test_score'])}")

[0.91570546 0.92843129 0.9349311  0.92714813 0.93115793 0.92411431
 0.93630047 0.93054184 0.93080781 0.93465674]
Mean: 0.9293795082069328


### Stratified K-fold Cross-validation

In [121]:
DT = DecisionTreeClassifier()
skf = StratifiedKFold(n_splits = 10, shuffle = True)
clf = cross_validate(
    DT,
    X,
    y,
    scoring = 'balanced_accuracy',
    cv = skf
)

print(f"{clf['test_score']}\nMean: {np.mean(clf['test_score'])}")

[0.52707819 0.49839506 0.49736626 0.53222222 0.55633745 0.59222222
 0.53325103 0.59530864 0.57427984 0.54401951]
Mean: 0.5450480412536652


Balanced

In [122]:
DT = DecisionTreeClassifier()
skf = StratifiedKFold(n_splits = 10, shuffle = True)
clf = cross_validate(
    DT,
    X_balanced,
    y_balanced,
    scoring = 'balanced_accuracy',
    cv = skf
)

print(f"{clf['test_score']}\nMean: {np.mean(clf['test_score'])}")

[0.93319095 0.93423666 0.9218107  0.93209877 0.93004115 0.93312757
 0.93106996 0.92695473 0.9382716  0.92695473]
Mean: 0.9307756821389036
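
The difference between `KFold` and `StratifiedKFold` lies in how the rare class is spread across folds. The sketch below uses synthetic labels with about 5% positives (not the real data) and counts the positives in each test fold. The stratified splitter gives every fold the same share, while the plain splitter leaves it to chance.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y_demo = np.array([1] * 50 + [0] * 950)   # synthetic labels, about 5% positives
X_demo = np.zeros((len(y_demo), 1))       # features are irrelevant for the split itself


def positives_per_fold(splitter):
    """Count positive labels in the test part of every fold."""
    return [int(y_demo[test_idx].sum()) for _, test_idx in splitter.split(X_demo, y_demo)]


plain = positives_per_fold(KFold(n_splits=10, shuffle=True, random_state=0))
stratified = positives_per_fold(StratifiedKFold(n_splits=10, shuffle=True, random_state=0))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```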


### Leave-One-Out Cross-Validation

In [123]:
DT = DecisionTreeClassifier()
loo = LeaveOneOut()

clf = cross_validate(
    DT,
    X,
    y,
    scoring='accuracy',
    cv = loo
)

print(f"{clf['test_score']}\nMean: {np.mean(clf['test_score'])}")

Balanced

In [124]:
DT = DecisionTreeClassifier()
loo = LeaveOneOut()

clf = cross_validate(
    DT,
    X_balanced,
    y_balanced,
    scoring='accuracy',
    cv = loo
)

print(f"{clf['test_score']}\nMean: {np.mean(clf['test_score'])}")
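
Leave-one-out trains one model per row, so the two cells above imply 5,110 and 9,722 fits respectively. Each fold scores a single prediction, so every entry of `test_score` is either 0 or 1 and the mean is the overall accuracy. This is also why plain `accuracy` is used here: a one-row test fold contains only one class. A minimal sketch on a small synthetic dataset, with an arbitrary size and seed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset so the example finishes quickly
X_demo, y_demo = make_classification(n_samples=60, n_features=5, random_state=0)

loo = LeaveOneOut()
result = cross_validate(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, scoring="accuracy", cv=loo
)

print(loo.get_n_splits(X_demo))            # one fold per sample
print(np.unique(result["test_score"]))     # each fold scores a single prediction
print(np.mean(result["test_score"]))       # overall leave-one-out accuracy
```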