<h1> Stroke Predictions </h1>

1. Context

* According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

2. Data Collection
* Kaggle: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/data

<h3> Import Libraries </h3>

In [24]:
import warnings
warnings.filterwarnings('ignore')

# standard libraries
import pandas as pd

%matplotlib inline

# modeling
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

from imblearn.over_sampling import SMOTE

from catboost import CatBoostClassifier
import lightgbm as lgb

<h4> Import the csv into a Pandas DataFrame (df) </h4>

In [5]:
df = pd.read_csv('data/healthcare-dataset-stroke-data.csv')

<h4> Top 5 rows </h4>

In [6]:
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


<h4> Prep Data </h4>

<h5> Convert binary variables to categorical </h5>

In [9]:
df['hypertension'] = df['hypertension'].astype(str)
df['heart_disease'] = df['heart_disease'].astype(str)

<h5> Define categorical and numeric variables </h5>

In [52]:
categorical_features = ['gender', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
numeric_features = ['age', 'avg_glucose_level', 'bmi']

<h5> Separate input and target variables </h5>

In [10]:
X = df.drop(columns=['id', 'stroke'])

In [11]:
y = df['stroke']

<h5> Column transformers </h5>

In [43]:
numeric_transformer = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=2, weights='uniform')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('OneHotEncoder', OneHotEncoder())
])

In [53]:
preprocessor = ColumnTransformer(
   remainder = 'passthrough',
   transformers=[
       ('categorical', categorical_transformer, categorical_features),
       ('numeric', numeric_transformer, numeric_features)
])

In [54]:
X_transformed = preprocessor.fit_transform(X)
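
Because `preprocessor.fit_transform(X)` runs on all rows, the imputer and scaler statistics also reflect rows that later land in the test set. One common alternative is to split first and fit the transformers on the training rows only. The sketch below illustrates that ordering on toy numeric data; `X_demo`, `y_demo`, and the other `*_demo`-style names are hypothetical stand-ins, not part of this notebook, and the metrics reported further down were not produced this way.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for age, avg_glucose_level, and bmi.
rng = np.random.default_rng(865)
X_demo = rng.normal(loc=[50.0, 100.0, 28.0], scale=[20.0, 40.0, 6.0], size=(200, 3))
X_demo[rng.random(200) < 0.05, 2] = np.nan   # a few missing values in the third column
y_demo = np.array([1] * 20 + [0] * 180)      # imbalanced toy target

# Split first so test rows never influence the imputer or the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=865, stratify=y_demo
)

numeric_pipe = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=2, weights='uniform')),
    ('scaler', StandardScaler()),
])

X_tr_t = numeric_pipe.fit_transform(X_tr)  # statistics learned from training rows only
X_te_t = numeric_pipe.transform(X_te)      # test rows reuse those statistics

print(X_tr_t.shape, X_te_t.shape)
```

The same pattern should apply to the full `ColumnTransformer`: call `fit_transform` on the training split and `transform` on the test split.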

<h5> Split data into train and test </h5>

In [55]:
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.2, random_state=865)
X_train.shape, X_test.shape

((4088, 23), (1022, 23))
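
The 23 columns in the output above can be accounted for by counting one-hot levels. The sketch below assumes the number of distinct categories per column; these counts are not printed anywhere in this notebook and are an assumption about the Kaggle dataset. Running `df[categorical_features].nunique()` would confirm them.

```python
# Assumed number of distinct categories per column (an assumption, not notebook output).
assumed_levels = {
    'gender': 3,
    'hypertension': 2,
    'heart_disease': 2,
    'ever_married': 2,
    'work_type': 5,
    'Residence_type': 2,
    'smoking_status': 4,
}
numeric_cols_demo = ['age', 'avg_glucose_level', 'bmi']

n_onehot = sum(assumed_levels.values())        # one output column per category level
n_total = n_onehot + len(numeric_cols_demo)    # each numeric feature stays a single column

print(n_onehot, n_total)
```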

<h5> Rebalance data using SMOTE </h5>

In [None]:
smote = SMOTE(random_state=865)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
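
SMOTE is applied to the training split only, so the test set keeps its original class balance. To illustrate the idea behind it, the sketch below creates one synthetic minority point by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified, standard-library illustration, not imbalanced-learn's implementation; `smote_like_sample` and `minority_demo` are hypothetical names.

```python
import random

def smote_like_sample(minority, k=2, seed=865):
    """Return one synthetic point on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # Rank the remaining minority points by squared Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Hypothetical minority-class points with two features each.
minority_demo = [[1.0, 1.0], [1.5, 1.2], [0.8, 1.4], [1.2, 0.9]]
synthetic = smote_like_sample(minority_demo)
print(synthetic)
```

Because each synthetic point is a convex combination of two existing minority points, it always lies within the range already covered by the minority class.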

<h5> Function to evaluate models </h5>

In [20]:
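def evaluate_models(actual, predicted):
    """Return accuracy, precision, recall, and AUC for binary class labels."""
    acc = accuracy_score(actual, predicted)
    precision = precision_score(actual, predicted)
    recall = recall_score(actual, predicted)
    # predicted holds hard 0/1 labels, so this AUC equals balanced accuracy
    auc = roc_auc_score(actual, predicted)
    return acc, precision, recall, auc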
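
When `roc_auc_score` receives hard 0/1 predictions instead of scores, the result is the mean of the true-positive and true-negative rates, i.e. balanced accuracy. On a class-balanced set this coincides with plain accuracy, which is consistent with the training AUC and accuracy values matching in the output of the training loop. The sketch below checks the equivalence with a small pairwise AUC written from scratch; `auc_from_scores`, `balanced_accuracy`, and the `*_demo` lists are hypothetical names and data.

```python
def balanced_accuracy(actual, predicted):
    """Mean of the true-positive rate and the true-negative rate."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    n_pos = sum(actual)
    n_neg = len(actual) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)

def auc_from_scores(actual, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities.
actual_demo = [0, 0, 0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8, 0.45, 0.9]
labels_demo = [1 if s >= 0.5 else 0 for s in scores_demo]  # hard predictions at 0.5

print(auc_from_scores(actual_demo, labels_demo))    # AUC computed from hard labels
print(balanced_accuracy(actual_demo, labels_demo))  # same value
print(auc_from_scores(actual_demo, scores_demo))    # threshold-free AUC from scores
```

For a threshold-free AUC with scikit-learn, one option is to pass `model.predict_proba(X)[:, 1]` as the second argument of `roc_auc_score`. Note that `SVC()` does not expose `predict_proba` unless it is built with `probability=True`; its `decision_function` output can be used instead.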

<h4> Training </h4>

In [68]:
models = {
    'Logistic Regression': LogisticRegression(),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'SVM': SVC(),
    'Cat Boosting': CatBoostClassifier(silent=True),
    'Light GBM': lgb.LGBMClassifier()
}

model_list = []
recall_list = []

for name, model in models.items():
    model.fit(X_train_smote, y_train_smote)

    # Predictions
    y_train_pred = model.predict(X_train_smote)
    y_test_pred = model.predict(X_test)

    # Evaluate
    model_train_accuracy, model_train_precision, model_train_recall, model_train_auc = evaluate_models(y_train_smote, y_train_pred)
    model_test_accuracy, model_test_precision, model_test_recall, model_test_auc = evaluate_models(y_test, y_test_pred)

    print(name)
    model_list.append(name)

    print('Training set metrics')
    print('Accuracy: {:.4f}'.format(model_train_accuracy))
    print('Precision: {:.4f}'.format(model_train_precision))
    print('Recall: {:.4f}'.format(model_train_recall))
    print('AUC: {:.4f}'.format(model_train_auc))

    print('*'*20)

    print('Test set metrics')
    print('Accuracy: {:.4f}'.format(model_test_accuracy))
    print('Precision: {:.4f}'.format(model_test_precision))
    print('Recall: {:.4f}'.format(model_test_recall))
    print('AUC: {:.4f}'.format(model_test_auc))

    recall_list.append(model_test_recall)

    print('*'*40)
    print('\n')


Logistic Regression
Training set metrics
Accuracy: 0.7865
Precision: 0.7626
Recall: 0.8321
AUC: 0.7865
********************
Test set metrics
Accuracy: 0.7456
Precision: 0.1577
Recall: 0.8393
AUC: 0.7897
****************************************
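
The Logistic Regression test metrics above show high recall alongside low precision, a typical pattern when the positive class is rare in the test set. The sketch below uses confusion-matrix counts that are consistent with the printed values. The counts are inferred from those metrics and the 1022 test rows; the notebook does not print them, so treat them as an assumption.

```python
# Counts inferred from the printed test metrics (an assumption, not notebook output).
n_test = 1022
tp, fn = 47, 9                 # 56 actual stroke cases in the test split
fp = 251                       # non-stroke cases flagged as stroke
tn = n_test - tp - fn - fp     # remaining non-stroke cases, correctly rejected

accuracy = (tp + tn) / n_test
precision = tp / (tp + fp)     # few of the many flagged patients actually had a stroke
recall = tp / (tp + fn)        # most actual stroke cases are caught
balanced = 0.5 * (recall + tn / (tn + fp))  # what roc_auc_score returns for hard labels

print('Accuracy: {:.4f}'.format(accuracy))
print('Precision: {:.4f}'.format(precision))
print('Recall: {:.4f}'.format(recall))
print('AUC: {:.4f}'.format(balanced))
```

With only about 5% positives in the test rows, even a moderate false-positive rate produces far more false alarms than true detections, which pulls precision down while recall stays high.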


KNN
Training set metrics
Accuracy: 0.9407
Precision: 0.8976
Recall: 0.9949
AUC: 0.9407
********************
Test set metrics
Accuracy: 0.8170
Precision: 0.1124
Recall: 0.3393
AUC: 0.5920
****************************************


Decision Tree
Training set metrics
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
AUC: 1.0000
********************
Test set metrics
Accuracy: 0.8777
Precision: 0.1290
Recall: 0.2143
AUC: 0.5652
****************************************


Random Forest
Training set metrics
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
AUC: 1.0000
********************
Test set metrics
Accuracy: 0.9139
Precision: 0.0556
Recall: 0.0357
AUC: 0.5003
****************************************


AdaBoost
Training set metrics
Accuracy: 0.87