Hi everyone! In this kernel, we will classify stroke cases with machine learning models. The outline of the project:

* [Introduction](#1)
* [Background and Motivation](#2)
* [Packages & Libraries](#3)
* [First Look on the Data and Basic EDA](#4)
* [Data Visualization](#5)
* [Encoding, Missing Value Imputation and Oversampling](#6)
    * [Encoding](#7)
    * [Missing Value Imputation](#8)
    * [Oversampling with Synthetic Minority Oversampling Technique (SMOTE)](#9)
    * [Data Visualization After Oversampling](#10)
    * [Splitting Data and Feature Scaling](#11)
* [ML Models & Hyperparameter Tuning](#12)
    * [Model Preparation](#13)
    * [Hyperparameter Tuning](#14)
    * [Determined Models with Tuned Parameters](#15)

<a id = "1"></a>
### Introduction
<img src="https://www.cdc.gov/stroke/images/Stroke-Medical-Illustration.jpg?_=77303?noicon">

<a id = "2"></a>
### Background and Motivation

Background: A stroke, sometimes called a brain attack, occurs when something blocks blood supply to part of the brain or when a blood vessel in the brain bursts. Brain cells begin to die in minutes. A stroke is a medical emergency, and prompt treatment is crucial. A stroke can cause lasting brain damage, long-term disability, or even death. Early action can reduce brain damage and other complications.

Motivation: Our objective is to understand which factors are associated with stroke and to see whether we can successfully detect stroke from the features in the given data using ML techniques.

<a id = "3"></a>
### Packages & Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, roc_curve, precision_recall_curve, auc, confusion_matrix, roc_auc_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.impute import KNNImputer
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

import warnings
warnings.filterwarnings("ignore")

<a id = "4"></a>
### First Look on the Data and Basic EDA

In [None]:
data = pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")
df = data.copy()
df.head()

In [None]:
# drop unnecessary columns
df.drop(["id"], axis = 1, inplace = True)
df.head()

In [None]:
# data info: columns with data types
df.info()

In [None]:
# The bmi column seems to have null values. Let's check it out!
df.isnull().sum()

How many samples of each target class ("stroke" and "not stroke") are in the dataset?

In [None]:
# use a dedicated name so the raw `data` DataFrame loaded above is not overwritten
stroke_counts = df["stroke"].value_counts()
labels = stroke_counts.index

palette_color = sns.color_palette('bright')
plt.pie(stroke_counts, labels=labels, colors=palette_color, autopct='%.0f%%')
plt.title("The percentage of each class (1: stroke, 0: non-stroke)");

#### Issues:
There are 201 null BMI values in the dataset. Additionally, the data is imbalanced. We need to address both issues to get better results.
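
To see why the imbalance matters, here is a minimal sketch with made-up counts (950 negatives and 50 positives, chosen only for illustration and not taken from this dataset). A model that always predicts "no stroke" reaches a high accuracy while missing every stroke case:

```python
import numpy as np

# Hypothetical label vector: 950 non-stroke (0) and 50 stroke (1) samples.
# These counts are illustrative assumptions, not the real dataset counts.
y_true = np.array([0] * 950 + [1] * 50)

# A trivial "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
# Recall on the stroke class: share of true stroke cases that were detected.
recall_stroke = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print("Accuracy:", accuracy)            # 0.95
print("Stroke recall:", recall_stroke)  # 0.0
```

This is why accuracy alone is a weak guide on imbalanced data, and why per-class metrics are worth checking as well.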

<a id = "5"></a>
### Data Visualization

In [None]:
# a short look at the counts of each categorical feature, grouped by the stroke variable
sns.set_theme(style = 'darkgrid')
for i in df.columns[:-1]:  # exclude stroke column
    if (df[i].dtype == 'object') or (df[i].dtype == 'int64'):
        sns.countplot(data = df, x = i, hue = 'stroke')
        plt.title('The number of the samples with {} based on stroke'.format(i))
        plt.show()

In [None]:
# a short look into numeric variables like bmi, avg_glucose_level and age
sns.set_theme(style = 'darkgrid')
for i in df.columns[:-1]:  # exclude stroke column
    if df[i].dtype == 'float64':
        sns.displot(data = df, x = i, hue = 'stroke', kde = True)
        plt.title('The distribution of the {} based on stroke'.format(i))
        plt.show()

In [None]:
# Correlation heatmap
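# correlate numeric columns only; the object (string) columns cannot be correlated directly
sns.heatmap(df.select_dtypes(include = 'number').corr(), annot = True, cmap = 'crest')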

In [None]:
# Violin plot of avg_glucose_level by work_type, split by stroke
plt.figure(figsize = (12, 7))
sns.violinplot(data = df, x = "work_type", y="avg_glucose_level", hue = 'stroke', inner = 'stick')
plt.show()

<a id = "6"></a>
### Encoding, Missing Value Imputation and Oversampling

<a id = "7"></a>
#### Encoding

In [None]:
# First we need to transform our columns to be encoded to numpy arrays
gender = df.iloc[:,0:1].values
ever_married = df.iloc[:,4].values  
work_type = df.iloc[:,5:6].values
residence_type = df.iloc[:,6].values 
smoking_status = df.iloc[:,9:10].values

In [None]:
# Other variables
age = df[["age"]]
hypertension = df[["hypertension"]]
heart_disease = df[["heart_disease"]]
avg_glucose_level = df[["avg_glucose_level"]]
bmi = df[["bmi"]]
stroke = df[["stroke"]]

In [None]:
unique, counts = np.unique(ever_married, return_counts = True)
print(np.asarray((unique, counts)).T)

In [None]:
# Label Encoding for the ever_married and residence_type columns, which have two labels each
# (LabelEncoder is already imported in the Packages & Libraries cell)
le = LabelEncoder()
ever_married = le.fit_transform(ever_married)
ever_married = pd.DataFrame(ever_married, columns = ["ever_married"])
print("Labels 0, 1:",le.classes_)
residence_type = le.fit_transform(residence_type)
residence_type = pd.DataFrame(residence_type, columns = ["residence_type"])
print("Labels 0, 1:",le.classes_)

In [None]:
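# One Hot Encoding for gender, work_type and smoking status columns
# Note: scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`; use that name on newer versions
ohe = OneHotEncoder(dtype = np.int64, sparse = False)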
gender = ohe.fit_transform(gender)
gender = pd.DataFrame(gender, columns = ['female', 'male', 'other'])
print("Gender dummies respectively 0, 1, 2:", ohe.categories_)
work_type = ohe.fit_transform(work_type)
work_type = pd.DataFrame(work_type, columns = ['govt_job', 'never_worked', 'private', 'self-employed', 'children'])
print("Work type dummies respectively 0, 1, 2, 3, 4:", ohe.categories_)
smoking_status = ohe.fit_transform(smoking_status)
smoking_status = pd.DataFrame(smoking_status, columns = ['unknown', 'formerly_smoked', 'never_smoked', 'smokes'])
print("Smoking status dummies respectively 0, 1, 2, 3:", ohe.categories_)
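
The hard-coded column names above only line up with the encoder output if they follow the order of `ohe.categories_`, which scikit-learn sorts. Here is a small numpy sketch of that ordering, assuming the raw `smoking_status` values are the four strings listed below (an assumption about the CSV, so compare it with the printed `ohe.categories_`). Uppercase letters sort before lowercase ones, which is why `Unknown` comes first:

```python
import numpy as np

# Assumed raw values of the smoking_status column (illustrative sample).
smoking_raw = np.array(["never smoked", "Unknown", "smokes",
                        "formerly smoked", "never smoked"])

# Sorted unique categories: uppercase "U" sorts before lowercase letters.
categories = np.unique(smoking_raw)
print(categories)

# One-hot matrix: one column per category, in sorted order.
one_hot = (smoking_raw[:, None] == categories[None, :]).astype(np.int64)
print(one_hot)
```

If the printed categories ever differ from the hand-written names, the safer route is to build the column names from `ohe.categories_` instead of typing them in.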

In [None]:
frames = [gender, age, hypertension, heart_disease, ever_married, work_type, residence_type, avg_glucose_level, bmi, smoking_status, stroke]
df_en = pd.concat(frames, axis = 1)
df_en.head(3)

<a id = "8"></a>
#### Missing Value Imputation

In [None]:
# Fill missing values with SimpleImputer (mean strategy); only the bmi column has nulls
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'mean')
df_en = pd.DataFrame(imputer.fit_transform(df_en), columns = df_en.columns)
df_en.head()

In [None]:
int_vars = ["female","male","other","age","hypertension","heart_disease","ever_married","govt_job","never_worked","private","self-employed", 
            "children", "residence_type","unknown","formerly_smoked","never_smoked","smokes"]

df1 = df_en[int_vars].astype(np.int64)
num_vars = ["avg_glucose_level","bmi"]
df2 = df_en[num_vars]
label = ["stroke"]
df3 = df_en[label]
df_en = pd.concat([df1, df2, df3], axis = 1)
df_en

In [None]:
df_en.isnull().sum()

<a id = "9"></a>
#### Oversampling with Synthetic Minority Oversampling Technique (SMOTE)

Imbalanced classification involves developing predictive models on classification datasets that have a severe class imbalance.

The challenge of working with imbalanced datasets is that most machine learning techniques will ignore, and in turn have poor performance on, the minority class, although typically it is performance on the minority class that is most important.

One approach to addressing imbalanced datasets is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, although these examples don’t add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the **Synthetic Minority Oversampling Technique,** or **SMOTE** for short.

Source: [https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/)
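
The core idea can be sketched in a few lines of numpy. This is a simplified illustration, not the `imblearn` implementation: the helper name `synthesize_point` and the toy minority samples are made up for this example. A synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class samples with two features (made-up values).
minority = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [1.2, 2.2]])

def synthesize_point(samples, index, k, rng):
    """Create one synthetic sample near samples[index] (simplified SMOTE step)."""
    base = samples[index]
    # Euclidean distance from the base sample to every minority sample.
    distances = np.linalg.norm(samples - base, axis=1)
    # Indices of the k nearest neighbours, skipping the base sample itself.
    neighbours = np.argsort(distances)[1:k + 1]
    neighbour = samples[rng.choice(neighbours)]
    # Interpolate at a random position between base and the chosen neighbour.
    gap = rng.random()
    return base + gap * (neighbour - base)

new_point = synthesize_point(minority, index=0, k=2, rng=rng)
print(new_point)
```

Because the new point is interpolated rather than copied, it adds variation around existing minority samples instead of exact duplicates.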

In [None]:
from imblearn.over_sampling import SMOTE
X, y = df_en.iloc[:, 0:-1], df_en.iloc[:, -1:]

print("Before Oversampling, the counts of label 1: ", y.value_counts()[1])
print("Before Oversampling, the counts of label 0: ", y.value_counts()[0])

oversample = SMOTE(random_state = 42)  # fixed seed so the synthetic samples are reproducible
X, y = oversample.fit_resample(X, y)

print("After Oversampling, the counts of label 1: ", y.value_counts()[1])
print("After Oversampling, the counts of label 0: ", y.value_counts()[0])
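
One caveat with the cell above: it resamples the full dataset before any train/validation split, so synthetic points derived from a row can end up in a different split than the row itself, and the validation set no longer reflects the real class ratio. A common alternative is to split first and oversample only the training part. The sketch below shows just that ordering on toy data; `random_oversample` is a hypothetical stand-in that duplicates minority rows, and in practice it would be replaced by `SMOTE().fit_resample`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 samples, 10 of them positive (made-up values).
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 10 + [0] * 90)

# 1) Split first: a simple deterministic 80/20 split that keeps the class ratio.
pos_idx = np.flatnonzero(y_toy == 1)
neg_idx = np.flatnonzero(y_toy == 0)
train_idx = np.concatenate([pos_idx[:8], neg_idx[:72]])
valid_idx = np.concatenate([pos_idx[8:], neg_idx[72:]])
X_tr, y_tr = X_toy[train_idx], y_toy[train_idx]
X_va, y_va = X_toy[valid_idx], y_toy[valid_idx]

def random_oversample(X, y, rng):
    """Hypothetical stand-in for SMOTE: duplicate minority rows until balanced."""
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    extra = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx))
    keep = np.concatenate([majority_idx, minority_idx, extra])
    return X[keep], y[keep]

# 2) Oversample the training split only; the validation split stays untouched.
X_tr_bal, y_tr_bal = random_oversample(X_tr, y_tr, rng)

print("Train class counts after oversampling:", np.bincount(y_tr_bal))  # [72 72]
print("Validation class counts (unchanged):", np.bincount(y_va))        # [18  2]
```

With this ordering the validation scores are measured on data with the original class balance, so they are usually lower but closer to what the model would see in practice.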

<a id = "10"></a>
#### Data Visualization After Oversampling

In [None]:
# visualization after SMOTE
df_as = pd.concat([X, y], axis = 1)

# a short look at the counts of each categorical feature, grouped by the stroke variable
sns.set_theme(style = 'darkgrid')
for i in df_as.columns[:-1]:  # exclude stroke column
    if (df_as[i].dtype == 'object') or (df_as[i].dtype == 'int64'):
        sns.countplot(data = df_as, x = i, hue = 'stroke')
        plt.title('The number of the samples with {} based on stroke'.format(i))
        plt.show()

The countplots show that the number of stroke (label 1) samples has increased **after SMOTE**.

In [None]:
sns.set_theme(style = 'darkgrid')
for i in df_as.columns[:-1]:  # exclude stroke column
    if df_as[i].dtype == 'float64':
        sns.displot(data = df_as, x = i, hue = 'stroke', kde = True)
        plt.title('The distribution of the {} based on stroke'.format(i))
        plt.show()

<a id = "11"></a>
#### Splitting Data and Feature Scaling

In [None]:
# Splitting data into training and validation sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Standardizing features to zero mean and unit variance
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
# Reuse the statistics learned on the training set; do not refit on the validation set
x_test = scaler.transform(x_test)
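
As a sanity check on what the scaler does, here is a numpy sketch of standardization with toy numbers. The statistics are computed on the training rows only and then reused for the validation rows, which mirrors `fit_transform` on the training set followed by `transform` on the validation set. Note that the result has zero mean and unit variance per feature; it is not confined to a fixed range such as [-1, 1].

```python
import numpy as np

# Toy training and validation features (made-up values, one column).
train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
valid = np.array([[3.0], [10.0]])

# "fit": learn mean and standard deviation from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the same training statistics to both sets.
train_scaled = (train - mean) / std
valid_scaled = (valid - mean) / std

print(train_scaled.ravel())
print(valid_scaled.ravel())  # the second value lies far outside [-1, 1]
```

Refitting the scaler on the validation rows would instead centre them on their own mean, so identical raw values would be encoded differently in the two sets.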

<a id = "12"></a>
### ML Models & Hyperparameter Tuning

<a id = "13"></a>
#### Model Preparation

In [None]:
# Models to be used for ML
models = [('Logistic Regression', LogisticRegression()),
          ('Decision Tree Classifier', DecisionTreeClassifier()),
          ('Random Forest', RandomForestClassifier()),
          ('Linear Discriminant Analyzer', LinearDiscriminantAnalysis()),
          ('Ada Boost', AdaBoostClassifier()),
          ('KNN', KNeighborsClassifier()),
          ('Support Vector Classifier', SVC(probability = True)),
          ('XG Boost', XGBClassifier()),
          ('Cat Boost', CatBoostClassifier(logging_level = 'Silent'))]

import scikitplot as skplt

models_score = []
for name, model in models:
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)  # predict once and reuse below
    models_score.append([name, accuracy_score(y_test, y_pred)])
    
    print("Model: ",name)
    print('Validation Accuracy: ', accuracy_score(y_test, y_pred))
    print('Training Accuracy: ', accuracy_score(y_train, model.predict(x_train)))
    
    plt.figure()
    cf_matrix = confusion_matrix(y_test, y_pred)
    plt.title('Confusion Matrix: {}'.format(name))
    sns.heatmap(cf_matrix, annot = True, fmt = 'g', cmap = sns.cubehelix_palette(as_cmap=True))
    plt.show()
    
    skplt.metrics.plot_roc(y_test, model.predict_proba(x_test))
    plt.title('ROC Curves: {}'.format(name))
    plt.show()

In [None]:
# Barplot to show the validation accuracy of each algorithm
plt.figure(figsize = (12, 6))
sns.barplot(x = np.array(models_score)[:, 0], y=np.array(models_score)[:, 1].astype('float64'))
plt.xticks(rotation = 45);

<a id = "14"></a>
#### Hyperparameter Tuning

In [None]:
# Models with hyperparameters to be tuned
grid_models = [(LogisticRegression(),[{'C' : [0.3, 0.7, 1], 'random_state' : [42]}]),
               (DecisionTreeClassifier(),[{'criterion' : ['gini','entropy'], 'random_state' : [42]}]),
               (RandomForestClassifier(),[{'n_estimators' : [100, 200, 300], 'criterion' : ['gini','entropy'], 'random_state' : [42]}]),
               (LinearDiscriminantAnalysis(),[{'solver' : ['svd', 'lsqr', 'eigen']}]),
               (AdaBoostClassifier(),[{'n_estimators' : [50, 100, 150], 'random_state' : [42]}]),
               (KNeighborsClassifier(),[{'n_neighbors' : [4, 6, 8, 10], 'metric' : ['euclidean', 'manhattan', 'chebyshev','minkowski']}]),
               (SVC(),[{'C' : [0.3, 0.7, 1], 'kernel' : ['rbf','linear','poly'], 'random_state' : [42]}]),
               (XGBClassifier(),[{'max_depth' : [3, 5, 7], 'min_child_weight' : [1, 3, 5]}]),
               (CatBoostClassifier(),[{'n_estimators' : [100, 200, 300], 'max_depth' : [3,5,7]}])]

In [None]:
for model, param_grid  in grid_models:
    cv = GridSearchCV(estimator = model, param_grid = param_grid, scoring = 'accuracy', cv = 5)
    cv.fit(x_train, y_train)
    best_accuracy = cv.best_score_
    best_params = cv.best_params_
    print('{}: \nBest Accuracy: {:.2f}%'.format(model, best_accuracy*100))
    print('Best Parameters: ',best_params)
    print('******************************')

The grid search output above (hidden by default) shows that the **Random Forest, XGBoost** and **Cat Boost** algorithms give the best scores after tuning.
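
For a sense of how much work the search does, the sketch below enumerates one of the grids by hand. Grid search tries every combination of the listed values (a Cartesian product), and with `cv = 5` each combination is fitted five times. The Random Forest grid is copied from the tuning cell; the loop only illustrates the counting and is not how `GridSearchCV` is implemented.

```python
from itertools import product

# Random Forest grid copied from the tuning cell.
param_grid = {'n_estimators': [100, 200, 300],
              'criterion': ['gini', 'entropy'],
              'random_state': [42]}
n_folds = 5

# Every combination of parameter values is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

for candidate in candidates:
    print(candidate)

print("Candidates:", len(candidates))                           # 6
print("Model fits with 5-fold CV:", len(candidates) * n_folds)  # 30
```

The count grows multiplicatively with each added parameter, which is worth keeping in mind for the slower models such as SVC and Cat Boost.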

<a id = "15"></a>
#### Determined Models with Tuned Parameters

In [None]:
# The best three models with the best chosen parameters after tuning
models = [('Random Forest', RandomForestClassifier(criterion = 'gini', n_estimators = 100, random_state = 42)),
          ('XG Boost', XGBClassifier(max_depth = 7, min_child_weight = 1)),
          ('Cat Boost', CatBoostClassifier(max_depth = 7, n_estimators = 300, logging_level = 'Silent'))]

import scikitplot as skplt

models_score = []
for name, model in models:
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)  # predict once and reuse below
    models_score.append([name, accuracy_score(y_test, y_pred)])
    
    print("Model: ",name)
    print('Validation Accuracy: ', accuracy_score(y_test, y_pred))
    print('Training Accuracy: ', accuracy_score(y_train, model.predict(x_train)))
    
    plt.figure()
    cf_matrix = confusion_matrix(y_test, y_pred)
    plt.title('Confusion Matrix: Tuned {}'.format(name))
    sns.heatmap(cf_matrix, annot = True, fmt = 'g', cmap = sns.cubehelix_palette(as_cmap=True))
    plt.show()
    
    skplt.metrics.plot_roc(y_test, model.predict_proba(x_test))
    plt.title('ROC Curves: Tuned {}'.format(name))
    plt.show()
    
    importance = model.feature_importances_
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %s, Score: %.5f' % (X.columns[i], v))
    # plot feature importance, labelled with the feature names from X
    plt.figure(figsize = (12, 5))
    plt.bar(range(len(importance)), importance)
    plt.title("{} Classification Feature Importance".format(name))
    plt.xticks(range(len(importance)), X.columns, rotation = 90)
    plt.show()
    print('')
    print('#######################################################')
    print('#######################################################')
    print('')

**Thank you!
Please upvote if you liked this work...**