<a href="https://colab.research.google.com/github/jurajhunak/Stroke_from_collab/blob/main/Stroke_classification_Kaggle.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Data source: https://www.kaggle.com/datasets/prosperchuks/health-dataset?select=stroke_data.csv

# Imports

## Installations

In [None]:
!pip install pandas-profiling==3.4.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [1]:
!pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp38-none-manylinux1_x86_64.whl (76.6 MB)
Installing collected packages: catboost
Successfully installed catboost-1.1.1


In [None]:
!pip install lightgbm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Libraries

In [2]:
import pandas as pd
from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [3]:
# Algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.neighbors import KNeighborsClassifier

In [4]:
# metrics
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score, confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score

In [5]:
#Model tuning
from sklearn.model_selection import GridSearchCV

In [92]:
# Saving model
from joblib import dump, load

# Dataset

In [6]:
data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/SDA Data Science Colab Notebooks/Kaggle/Diabetes, stroke, hypertension/archive/stroke_data.csv")

In [7]:
data.head()

Unnamed: 0,sex,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,1.0,63.0,0,1,1,4,1,228.69,36.6,1,1
1,1.0,42.0,0,1,1,4,0,105.92,32.5,0,1
2,0.0,61.0,0,0,1,4,1,171.23,34.4,1,1
3,1.0,41.0,1,0,1,3,0,174.12,24.0,0,1
4,1.0,85.0,0,0,1,4,1,186.21,29.0,1,1


In [8]:
data.shape

(40910, 11)

## Data overview, pandas profiling

As we can see, this is a well-prepared dataset that needs little or no preprocessing. Later, when we want to improve the performance of the chosen model, we might apply feature scaling (normalization, standardisation or similar). Let's run pandas profiling to visualise the dataset and understand it better.
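As a minimal sketch of what such feature scaling could look like, the block below standardises a synthetic dataset and wraps the scaler in a pipeline with a distance-based model. The data from `make_classification` and the names `X_demo`, `y_demo` and `scaled_knn` are assumptions for illustration only; nothing here is applied to the stroke dataset in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the stroke dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Standardisation: each column is shifted to mean 0 and scaled to unit variance.
X_scaled = StandardScaler().fit_transform(X_demo)

# In a pipeline, the scaler is fitted only on the data passed to fit(),
# so test data never influences the scaling statistics.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(X_demo, y_demo)
```

Scaling mostly matters for distance-based models such as k-nearest neighbours; tree-based models are largely insensitive to it.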

In [9]:
profile = ProfileReport(data)

TypeError: ignored

In [None]:
profile.to_file("Stroke_Classification_Dataset_fromKaggle.html")


From this report, we can see that this dataset is quite balanced and has almost no missing values. There are a few exceptions in two columns:
- sex: 3 missing values among 40,910 rows; I suggest deleting these rows
- age: 23 zeros and a minimum value of -9, which is obvious nonsense. I suggest deleting all rows with an age of 0 or below
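The same two checks can also be done directly in pandas, without the profiling report. The sketch below uses a tiny hand-made frame (`toy`, a hypothetical stand-in with invented values) rather than the real `data`, so the counts are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same two columns; values are made up.
toy = pd.DataFrame({
    "sex": [1.0, 0.0, np.nan, 1.0],
    "age": [63.0, -9.0, 36.0, 0.0],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()

# Count rows whose age is zero or negative.
non_positive_age = int((toy["age"] <= 0).sum())

# Keep only rows with a known sex and a positive age.
cleaned = toy[toy["sex"].notna() & (toy["age"] > 0)]
```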

These two columns are visualised and cleaned below:

## Data preprocessing

### Age

In [10]:
plt.hist(data.age)

(array([ 145., 1341., 3737., 5385., 6445., 7698., 6644., 5625., 3230.,
         660.]),
 array([ -9. ,   2.2,  13.4,  24.6,  35.8,  47. ,  58.2,  69.4,  80.6,
         91.8, 103. ]),
 <a list of 10 Patch objects>)

In [11]:
data.age.min()

-9.0

In [12]:
data[data['age'] <= 0].shape

(81, 11)

In [13]:
# Drop the rows with a non-positive age
data = data.drop(data.loc[data['age']<=0].index)

Now we have only 40,829 rows:

In [14]:
data.shape

(40829, 11)

### Sex

In [16]:
plt.hist(data.sex)

(array([18172.,     0.,     0.,     0.,     0.,     0.,     0.,     0.,
            0., 22654.]),
 array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
 <a list of 10 Patch objects>)

In [17]:
data.sex.unique()

array([ 1.,  0., nan])

In [18]:
#display missing values
data[(data.sex != 0) & (data.sex != 1)]

Unnamed: 0,sex,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
22478,,39.0,0,0,1,4,1,70.56,28.6,1,0
28908,,36.0,0,0,1,4,1,70.56,28.6,1,0
35184,,77.0,0,0,1,4,1,70.56,28.6,1,0


In [19]:
# delete missing values
data = data.drop(data[(data.sex != 0) & (data.sex != 1)].index)

In [20]:
data.shape

(40826, 11)

The dataset now has 3 fewer rows (the rows with missing sex values were deleted).

# Target and validation metrics

This is a binary classification problem: we want to build a model that predicts whether a patient suffered a stroke. For this purpose, we will use the AUC-ROC score and the confusion matrix as our validation metrics and focus on improving them.
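A minimal sketch of both metrics on synthetic data follows. The data and the names `X_demo`, `clf` and `proba` are assumptions for illustration. One point worth noting: `roc_auc_score` is usually fed predicted probabilities; when it is given hard 0/1 labels instead, the result reduces to the balanced accuracy, which on a balanced dataset is close to plain accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the stroke dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.75, random_state=0)

clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)
labels = clf.predict(X_te)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

# AUC-ROC computed from probabilities (ranking quality over all thresholds).
auc_from_proba = roc_auc_score(y_te, proba)

# AUC-ROC computed from hard labels equals the balanced accuracy.
auc_from_labels = roc_auc_score(y_te, labels)
balanced_acc = balanced_accuracy_score(y_te, labels)

# Confusion matrix unpacked in scikit-learn's order: TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_te, labels).ravel()
```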

# Baseline model

I will loop over several models to find the best-performing one.

## X, y definition

In [21]:
X = data.drop(['stroke'], axis=1)
y = data.stroke

## Train, test split

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.75, random_state=42)

## Models

In [23]:
model = [
    DecisionTreeClassifier(max_depth=3, criterion='entropy'),
    DecisionTreeClassifier(max_depth=3, criterion='gini'),
    RandomForestClassifier(n_estimators=3, criterion='entropy'),
    AdaBoostClassifier(n_estimators=3),
    GradientBoostingClassifier(n_estimators=3, max_depth=3),
    XGBClassifier(n_estimators=3, max_depth=3),
    CatBoostClassifier(silent=True),
    LGBMClassifier(n_estimators = 5, ),
    KNeighborsClassifier(),
]

### Training

In [24]:
# Do some preparation for the loop
col = []
algorithms = pd.DataFrame(columns = col)
idx = 0

#Train and score algorithms
for a in model:
    
    a.fit(X_train, y_train)
    y_pred = a.predict(X_test)
    acc_train = accuracy_score(y_train, a.predict(X_train)) # Other way: a.score(X_train, y_train)
    acc_test = accuracy_score(y_test, y_pred) # Other way: a.score(X_test, y_test)
    f1 = f1_score(y_test, y_pred)
    cv = cross_val_score(a, X_test, y_test).mean()
    auc = roc_auc_score(y_test, y_pred)
    
    Alg = a.__class__.__name__
    
    algorithms.loc[idx, 'Algorithm'] = Alg
    algorithms.loc[idx, 'Accuracy [train]'] = round(acc_train * 100, 2)
    algorithms.loc[idx, 'Accuracy [test]'] = round(acc_test * 100, 2)
    algorithms.loc[idx, 'F1 Score'] = round(f1 * 100, 2)
    algorithms.loc[idx, 'CV Score'] = round(cv * 100, 2)
    algorithms.loc[idx, 'AUC Score'] = round(auc * 100, 2)

    idx+=1

### Overview

#### Scoring metrics

In [25]:
algorithms.sort_values(by='AUC Score', ascending=False)

Unnamed: 0,Algorithm,Accuracy [train],Accuracy [test],F1 Score,CV Score,AUC Score
6,CatBoostClassifier,99.91,99.81,99.81,97.67,99.81
2,RandomForestClassifier,99.76,98.44,98.46,94.61,98.44
7,LGBMClassifier,82.21,81.61,82.61,81.24,81.61
8,KNeighborsClassifier,87.43,81.09,83.36,69.48,81.09
4,GradientBoostingClassifier,69.64,69.25,65.82,68.8,69.25
5,XGBClassifier,69.64,69.25,65.82,68.37,69.25
3,AdaBoostClassifier,68.35,67.86,65.03,67.86,67.86
0,DecisionTreeClassifier,66.88,66.44,61.44,66.1,66.45
1,DecisionTreeClassifier,66.58,66.03,61.15,66.33,66.04


From these metrics, we can conclude that CatBoost and RandomForest are the best algorithms for our task, with considerably better performance than the other models.

However, we should be wary of overfitting, since the accuracy scores are very high on both train and test data. On the other hand, the train and test accuracies are close to each other, which reduces the risk of overfitting.
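One way to look at the overfitting question is to compare the train/test accuracy gap with a cross-validation score computed on the training split. The sketch below does this for an unrestricted decision tree on synthetic data; the data and all names are assumptions for illustration, and the numbers it produces say nothing about the stroke models above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the stroke dataset).
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.75, random_state=1)

# A tree with no depth limit can memorise the training data.
tree = DecisionTreeClassifier(random_state=1)

# 5-fold cross-validation on the training split only.
cv_scores = cross_val_score(tree, X_tr, y_tr, cv=5)

tree.fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)

# A large positive gap suggests the model has overfitted.
gap = train_acc - test_acc
print(f"train={train_acc:.3f} test={test_acc:.3f} cv_mean={cv_scores.mean():.3f} gap={gap:.3f}")
```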

#### Confusion matrix

In [26]:
# Number of actual positive cases (stroke == 1) in the test set
actual_positives = y_test[y_test==1]
actual_positives.shape

(5105,)

In [27]:
print("Matrix displayed as follows:\nTN FP\nFN TP")
for b in model:

  name = b.__class__.__name__

  b.fit(X_train, y_train)
  y_pred_cmx = b.predict(X_test)
  cmx = confusion_matrix(y_test, y_pred_cmx)

  print(f"Algorithm: {name}\n", cmx)

Matrix displayed as follows:
TN FP
FN TP
Algorithm: DecisionTreeClassifier
 [[4053 1049]
 [2376 2729]]
Algorithm: DecisionTreeClassifier
 [[4011 1091]
 [2376 2729]]
Algorithm: RandomForestClassifier
 [[4980  122]
 [  30 5075]]
Algorithm: AdaBoostClassifier
 [[3875 1227]
 [2054 3051]]
Algorithm: GradientBoostingClassifier
 [[4045 1057]
 [2082 3023]]
Algorithm: XGBClassifier
 [[4045 1057]
 [2082 3023]]
Algorithm: CatBoostClassifier
 [[5083   19]
 [   0 5105]]
Algorithm: LGBMClassifier
 [[3872 1230]
 [ 647 4458]]
Algorithm: KNeighborsClassifier
 [[3441 1661]
 [ 269 4836]]


Looking at the confusion matrices, CatBoost performs very well, with only 19 false positives and 0 false negatives.

RandomForest also has satisfactory results, with 122 false positives and 30 false negatives.

As this is a medical problem, the great news is that our best-performing models have low false negative rates. False negatives are a big problem here, because we might overlook patients who in reality need medical treatment or preventive steps.
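The false negative rate, FN / (FN + TP), can be computed directly from the matrices printed above. The sketch below copies three of them by hand; the helper `false_negative_rate` is a hypothetical name introduced here, not part of scikit-learn.

```python
import numpy as np

# Confusion matrices copied from the output above, laid out as [[TN, FP], [FN, TP]].
matrices = {
    "CatBoostClassifier": np.array([[5083, 19], [0, 5105]]),
    "RandomForestClassifier": np.array([[4980, 122], [30, 5075]]),
    "KNeighborsClassifier": np.array([[3441, 1661], [269, 4836]]),
}

def false_negative_rate(cmx):
    """Share of actual positives that the model missed: FN / (FN + TP)."""
    tn, fp, fn, tp = cmx.ravel()
    return fn / (fn + tp)

fnr = {name: false_negative_rate(cmx) for name, cmx in matrices.items()}
for name, rate in fnr.items():
    print(f"{name}: FNR = {rate:.4f}")
```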

# Feature Engineering

Because this dataset is well balanced and well prepared, I will not apply any feature engineering, feature selection or other data processing techniques (standard scaling, normalization, ...).

Another reason for this decision is that both CatBoost and RandomForest already perform well on the original input data.

# Model tuning

Now we will try grid search with cross-validation (GridSearchCV) for CatBoost and RandomForest.
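By default, `GridSearchCV` ranks candidates with the estimator's own `score` method, which is accuracy for classifiers. Since AUC-ROC is the target metric here, one option is to pass `scoring="roc_auc"`. The sketch below shows this on synthetic data with a deliberately tiny grid; the data, the grid values and the names `demo_grid` and `search` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the stroke dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# A deliberately small, hypothetical grid: 2 x 2 = 4 candidates.
demo_grid = {"n_estimators": [5, 10], "min_samples_leaf": [1, 3]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=demo_grid,
    cv=3,
    scoring="roc_auc",  # rank candidates by AUC-ROC instead of accuracy
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```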

## Catboost Algorithm

In [None]:
param_grid = {
            'iterations':[100,500,1000],
            'learning_rate':[0.01,0.1,0.5],
            'depth':[3,6,10]
            }

In [None]:
grid_catboost = GridSearchCV(estimator=CatBoostClassifier(), param_grid=param_grid, cv=5, n_jobs=-1, verbose=1)

In [None]:
grid_catboost.fit(X_train, y_train, eval_set=(X_test, y_test), verbose=False)

Fitting 5 folds for each of 27 candidates, totalling 135 fits


GridSearchCV(cv=5,
             estimator=<catboost.core.CatBoostClassifier object at 0x7f1587ed4af0>,
             n_jobs=-1,
             param_grid={'depth': [3, 6, 10], 'iterations': [100, 500, 1000],
                         'learning_rate': [0.01, 0.1, 0.5]},
             verbose=1)

In [None]:
grid_catboost.best_params_

{'depth': 10, 'iterations': 500, 'learning_rate': 0.5}

## RandomForest Algorithm

In [None]:
param_grid_ranfor = {
    'n_estimators' : [1,3,5],
    'criterion' : ["gini", "entropy"],
    'min_samples_split' : [2, 3, 5],
    'min_samples_leaf' : [1, 2, 3]
}

In [None]:
grid_randomforest = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid_ranfor, cv=5, n_jobs=-1, verbose=1)

In [None]:
RandomForestClassifier().get_params().keys()

dict_keys(['bootstrap', 'ccp_alpha', 'class_weight', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'max_samples', 'min_impurity_decrease', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_estimators', 'n_jobs', 'oob_score', 'random_state', 'verbose', 'warm_start'])

In [None]:
grid_randomforest.fit(X_train, y_train)

Fitting 5 folds for each of 54 candidates, totalling 270 fits


GridSearchCV(cv=5, estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'min_samples_leaf': [1, 2, 3],
                         'min_samples_split': [2, 3, 5],
                         'n_estimators': [1, 3, 5]},
             verbose=1)

In [None]:
grid_randomforest.best_params_

{'criterion': 'gini',
 'min_samples_leaf': 1,
 'min_samples_split': 5,
 'n_estimators': 5}

In [None]:
result_gridsearchcv_randomforest = pd.DataFrame(grid_randomforest.cv_results_)

In [None]:
# result_gridsearchcv_randomforest

# Applying models with best parameters

Now I will fit the models with the optimal parameters found by GridSearchCV.

However, especially with CatBoost, it is questionable whether there is any need to improve the model at all (it already has very good AUC-ROC and confusion matrix results).
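Instead of retyping the best parameters by hand, they can be unpacked from `best_params_`, or the already refitted `best_estimator_` can be used directly (with the default `refit=True`). The sketch below shows that both routes give the same model on synthetic data; the data, the grid and all names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the stroke dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4], "criterion": ["gini", "entropy"]},
    cv=3,
)
search.fit(X_demo, y_demo)

# Route 1: the estimator that GridSearchCV already refitted on all the data.
refit_model = search.best_estimator_

# Route 2: build a fresh model by unpacking the best parameters.
manual_model = DecisionTreeClassifier(random_state=0, **search.best_params_)
manual_model.fit(X_demo, y_demo)

same_predictions = bool((refit_model.predict(X_demo) == manual_model.predict(X_demo)).all())
```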

## Catboost after GridSearchCV

Fitting

In [65]:
model_cb = CatBoostClassifier(depth=10, iterations=500, learning_rate=0.5)
model_cb.fit(X_train, y_train, silent=True)

<catboost.core.CatBoostClassifier at 0x7f9e90814ee0>

Prediction

In [66]:
y_pred_cb = model_cb.predict(X_test)

Validation metrics

In [85]:
acc_train_cb = accuracy_score(y_train, model_cb.predict(X_train)) 
acc_test_cb = accuracy_score(y_test, y_pred_cb)
f1_cb = f1_score(y_test, y_pred_cb)
cv_cb = cross_val_score(model_cb, X_test, y_test).mean()
auc_cb = roc_auc_score(y_test, y_pred_cb)

results_cb= pd.DataFrame()

results_cb.loc[0, 'Acc train'] = round(acc_train_cb,2)
results_cb.loc[0, 'Acc test'] = round(acc_test_cb, 2)
results_cb.loc[0, 'f1 score'] = round(f1_cb, 2)
results_cb.loc[0, 'CV score'] = round(cv_cb,2)
results_cb.loc[0, 'AUC-ROC'] = auc_cb


 


In [86]:
results_cb

Unnamed: 0,Acc train,Acc test,f1 score,CV score,AUC-ROC
0,1.0,1.0,1.0,0.99,1.0


## RandomForest after GridSearchCV

Best parameters found by GridSearchCV:

{'criterion': 'gini',
 'min_samples_leaf': 1,
 'min_samples_split': 5,
 'n_estimators': 5}

Fitting

In [87]:
model_rf = RandomForestClassifier(criterion="gini", min_samples_leaf=1, min_samples_split=5, n_estimators=5)
model_rf.fit(X_train, y_train)

RandomForestClassifier(min_samples_split=5, n_estimators=5)

Prediction

In [88]:
y_pred_rf = model_rf.predict(X_test)

Validation

In [89]:
acc_train_rf = accuracy_score(y_train, model_rf.predict(X_train)) 
acc_test_rf = accuracy_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)
cv_rf = cross_val_score(model_rf, X_test, y_test).mean()
auc_rf = roc_auc_score(y_test, y_pred_rf)

results_rf= pd.DataFrame()

results_rf.loc[0, 'Acc train'] = round(acc_train_rf,2)
results_rf.loc[0, 'Acc test'] = round(acc_test_rf, 2)
results_rf.loc[0, 'f1 score'] = round(f1_rf, 2)
results_rf.loc[0, 'CV score'] = round(cv_rf,2)
results_rf.loc[0, 'AUC-ROC'] = round(auc_rf, 2)

In [90]:
results_rf

Unnamed: 0,Acc train,Acc test,f1 score,CV score,AUC-ROC
0,1.0,0.99,0.99,0.96,0.99


When we compare this with the original results before GridSearchCV (below), we see an improvement in all metrics. In my opinion, though, tuning would be more meaningful with a more complicated dataset (e.g. imbalanced data, too many categorical variables, etc.).

In [91]:
algorithms.sort_values(by='AUC Score', ascending=False)

Unnamed: 0,Algorithm,Accuracy [train],Accuracy [test],F1 Score,CV Score,AUC Score
6,CatBoostClassifier,99.91,99.81,99.81,97.67,99.81
2,RandomForestClassifier,99.76,98.44,98.46,94.61,98.44
7,LGBMClassifier,82.21,81.61,82.61,81.24,81.61
8,KNeighborsClassifier,87.43,81.09,83.36,69.48,81.09
4,GradientBoostingClassifier,69.64,69.25,65.82,68.8,69.25
5,XGBClassifier,69.64,69.25,65.82,68.37,69.25
3,AdaBoostClassifier,68.35,67.86,65.03,67.86,67.86
0,DecisionTreeClassifier,66.88,66.44,61.44,66.1,66.45
1,DecisionTreeClassifier,66.58,66.03,61.15,66.33,66.04


# Saving model

In [93]:
dump(model_cb, 'Stroke-Prediction-with-Catboost.pkl')

['Stroke-Prediction-with-Catboost.pkl']

In [94]:
dump(model_rf, 'Stroke-Prediction-with-RandomForest.pkl')

['Stroke-Prediction-with-RandomForest.pkl']
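A saved model can be restored with `load`, which was imported alongside `dump`. The sketch below does a round trip with a small model trained on synthetic data and checks that the restored model predicts the same labels; the data, the temporary file name `demo-model.pkl` and the variable names are assumptions for illustration.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and model (assumptions; not the stroke models).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
demo_model = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_demo, y_demo)

# Save to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo-model.pkl")
    dump(demo_model, path)
    restored = load(path)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X_demo) == demo_model.predict(X_demo)).all())
```

Files written by `dump` are pickle-based, so they should only be loaded from trusted sources and ideally with the same library versions that created them.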