In [1]:
# Data manipulation
import numpy as np
import pandas as pd

# Misc
import os

# SMOTE
from imblearn.over_sampling import SMOTENC

df = pd.read_csv(os.path.join("..", "data", "train.csv"))

In [2]:
df.head()

| Unnamed: 0 | gender | age | hypertension | heart_disease | ever_married | work_type | avg_glucose_level | bmi | smoking_status | urban_residence | is_obese | has_disease | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 36.0 | False | False | True | Private | 68.48 | 24.3 | never smoked | True | False | False | 0 |
| 1 | Female | 27.0 | False | False | False | Private | 104.21 | 35.7 | never smoked | False | True | True | 0 |
| 2 | Female | 40.0 | False | False | True | Private | 72.99 | 46.4 | Unknown | True | True | True | 0 |
| 3 | Female | 44.0 | False | False | True | Private | 124.06 | 20.8 | never smoked | True | False | False | 0 |
| 4 | Female | 81.0 | False | False | True | pensioner | 95.84 | 21.5 | never smoked | True | False | False | 1 |


## SMOTE

Because of the class imbalance we saw during data exploration (roughly 1:20 stroke to no-stroke), we cannot simply apply machine learning methods to the raw data. They would likely ignore the stroke samples and predict "no stroke" almost every time. The resulting accuracy would look great, but it would be no better than that of a model that says "no" all the time (we won't use accuracy as a metric anyway, but more on this later).
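To make the point about accuracy concrete, here is a minimal sketch on made-up labels with a 1:20 class ratio (the arrays are illustrative, not the actual training data). A "classifier" that always answers "no stroke" reaches about 95% accuracy while detecting no strokes at all.

```python
import numpy as np

# Illustrative labels only: 1 stroke for every 20 non-strokes (not the real data).
y_true = np.array([1] * 50 + [0] * 1000)

# A "classifier" that always predicts the majority class ("no stroke").
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
recall = (y_pred[y_true == 1] == 1).mean()  # fraction of strokes detected

print(f"accuracy: {accuracy:.3f}")
print(f"stroke recall: {recall:.3f}")
```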

To address the class imbalance, we apply SMOTE to our training data. SMOTE is a technique for artificially sampling new data points from the minority class ("stroke" in this case). It randomly chooses a "stroke" sample, selects one of its nearest neighbors that also belongs to the "stroke" class, and samples a point on the line segment between the two. The sampled point is a new "stroke" sample we can use.
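The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration for continuous features only, not the imbalanced-learn implementation. The function name `smote_sample` and the toy data are hypothetical, and the sketch assumes the minority samples contain no duplicates.

```python
import numpy as np

def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point from an (n_samples, n_features) minority array."""
    rng = np.random.default_rng() if rng is None else rng
    # 1. Pick a random minority sample.
    i = rng.integers(len(minority))
    # 2. Find its k nearest minority neighbors.
    #    Position 0 of the sorted order is the sample itself (distance 0), so skip it.
    distances = np.linalg.norm(minority - minority[i], axis=1)
    neighbors = np.argsort(distances)[1 : k + 1]
    j = rng.choice(neighbors)
    # 3. Sample a point on the segment between the sample and the chosen neighbor.
    gap = rng.random()
    return minority[i] + gap * (minority[j] - minority[i])

rng = np.random.default_rng(42)
minority = rng.normal(size=(20, 2))  # made-up 2D "stroke" samples
new_point = smote_sample(minority, k=5, rng=rng)
print(new_point)
```

Because the new point is a convex combination of two existing minority samples, it always lies between them in feature space.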

For reference: https://arxiv.org/pdf/1106.1813.pdf

Note that in this case we use SMOTENC (SMOTE for Nominal and Continuous features), which can handle categorical values as well.

In [3]:
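smote = SMOTENC(
    # Boolean mask over the feature columns only; "stroke" is dropped before resampling
    categorical_features=(df.drop("stroke", axis=1).dtypes == "object").to_numpy(),
    sampling_strategy=0.1,  # The class ratio of minority to majority after resampling
    random_state=42,
)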
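The `categorical_features` mask flags columns purely by dtype. The sketch below uses a made-up frame (not the training data) to show which columns a `dtypes == "object"` test picks up. Under this test, boolean columns such as `hypertension` are not flagged as categorical.

```python
import pandas as pd

# Made-up frame with one string, one float, one boolean and one target column.
toy = pd.DataFrame({
    "gender": pd.Series(["Female", "Male"], dtype="object"),
    "age": [36.0, 27.0],
    "hypertension": [False, True],
    "stroke": [0, 1],
})

# Build the mask on the features only, so it has one entry per feature column.
features = toy.drop("stroke", axis=1)
mask = features.dtypes == "object"
categorical_columns = mask[mask].index.tolist()

print(mask)
print(categorical_columns)
```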

In [4]:
df_aug, df_aug_labels = smote.fit_resample(df.drop("stroke", axis=1), df.stroke)
df_aug["stroke"] = df_aug_labels

In [5]:
print("Old stroke count: ", df.stroke.sum())
print("New stroke count: ", df_aug.stroke.sum())

Old stroke count:  199
New stroke count:  388
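The new count can be sanity-checked against `sampling_strategy`. The sketch assumes that, for a float `sampling_strategy`, the over-sampler tops up the minority class to the given ratio times the majority count and truncates the number of synthetic samples to an integer. The function `expected_minority_count` and the counts passed to it are hypothetical. Under that assumption, a new count of 388 at a ratio of 0.1 implies a majority class of about 3,880 samples.

```python
def expected_minority_count(n_majority, n_minority, ratio):
    """Minority count after over-sampling to roughly ratio * n_majority."""
    # Assumption: the number of synthetic samples is truncated to an integer.
    n_new = int(n_majority * ratio - n_minority)
    return n_minority + n_new

# Illustrative counts, not taken from train.csv.
print(expected_minority_count(n_majority=1000, n_minority=50, ratio=0.1))
```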


As you can see, SMOTENC roughly doubled the number of stroke samples in the data. Before we move on, let's redo the feature engineering: synthetic samples are generated from neighboring points, so their derived features may no longer match the columns they were derived from.

#### Feature correction

In [6]:
df_aug.loc[df_aug.age >= 67, "work_type"] = "pensioner"
df_aug["is_obese"] = df_aug.bmi > 25
df_aug["has_disease"] = df_aug.heart_disease | df_aug.hypertension | df_aug.is_obese
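Since the same three corrections are needed every time a resampled set is produced, they could be wrapped in a small helper. This is only a sketch: the name `recompute_derived_features` and the toy rows are hypothetical, and the rules are copied from the cell above.

```python
import pandas as pd

def recompute_derived_features(frame):
    """Return a copy with work_type, is_obese and has_disease re-derived."""
    out = frame.copy()
    out.loc[out.age >= 67, "work_type"] = "pensioner"
    out["is_obese"] = out.bmi > 25
    out["has_disease"] = out.heart_disease | out.hypertension | out.is_obese
    return out

# Made-up rows standing in for synthetic samples with stale derived features.
toy = pd.DataFrame({
    "age": [70.0, 30.0],
    "work_type": ["Private", "Private"],
    "bmi": [22.0, 31.0],
    "heart_disease": [False, False],
    "hypertension": [False, False],
    "is_obese": [True, False],
    "has_disease": [True, False],
})
fixed = recompute_derived_features(toy)
print(fixed[["work_type", "is_obese", "has_disease"]])
```

Working on a copy leaves the input frame untouched, which makes it easy to compare the values before and after the correction.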

## Save results

In [7]:
df_aug.to_csv(os.path.join("..", "data", "train_augmented_10.csv"), index=False)

## Experimental

To try a few things later on, we produce additional training sets with SMOTE at different class ratios.

In [8]:
for class_ratio in [0.2, 0.25, 0.4, 0.5]:
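    smote = SMOTENC(
        # Boolean mask over the feature columns only; "stroke" is dropped before resampling
        categorical_features=(df.drop("stroke", axis=1).dtypes == "object").to_numpy(),
        sampling_strategy=class_ratio,  # The class ratio of minority to majority after resampling
        random_state=42,
    )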

    df_aug, df_aug_labels = smote.fit_resample(df.drop("stroke", axis=1), df.stroke)
    df_aug["stroke"] = df_aug_labels

    df_aug.loc[df_aug.age >= 67, "work_type"] = "pensioner"
    df_aug["is_obese"] = df_aug.bmi > 25
    df_aug["has_disease"] = df_aug.heart_disease | df_aug.hypertension | df_aug.is_obese

    df_aug.to_csv(os.path.join("..", "data", f"train_augmented_{int(class_ratio*100)}.csv"), index=False)