## 1. Importing Packages

In [8]:
import pandas as pd

## 2. Feature Engineering and Selection

In [9]:
df_cleaned = pd.read_csv("../data/stroke_dataset_cleaned.csv")
df_modelling = df_cleaned.drop(["id", "Unnamed: 0", "Residence_type", "gender", "ever_married"], axis = 1)
df_modelling.head()

| | age | hypertension | heart_disease | work_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|
| 0 | 23 | 0 | 0 | Private | 92.26 | 17.1 | Unknown | 0 |
| 1 | 82 | 0 | 0 | Self-employed | 82.21 | 26.0 | Never smoked | 0 |
| 2 | 26 | 0 | 0 | Private | 103.61 | 31.4 | Never smoked | 0 |
| 3 | 13 | 0 | 0 | Children | 94.12 | 20.1 | Never smoked | 0 |
| 4 | 7 | 0 | 0 | Children | 87.94 | 28.1 | Unknown | 0 |


The following columns were dropped for the remainder of this analysis:

`id`: A unique patient identifier that carries no statistical meaning for the analysis and serves only to weed out duplicate patient records.

`Residence_type`: The Chi-square test of this feature against the target returned a p-value of 1.000, far above the 0.05 significance level, so there is no evidence that residence type is associated with stroke onset.

`gender` & `ever_married`: Both features returned p-values above the significance level in the Chi-square test against the target (0.078052 for `gender` and 0.172338 for `ever_married`), indicating that they are weak predictors of stroke. These features are also highly collinear with other, stronger predictors, which would interfere with model performance.
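For reference, a Chi-square test of independence of the kind cited above can be sketched as follows. This is a minimal illustration only, assuming `scipy` is installed. The toy DataFrame and its values are invented for demonstration and are not taken from the stroke dataset, and the notebook's actual test code may differ.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data (hypothetical values, not from the stroke dataset)
toy = pd.DataFrame({
    "Residence_type": ["Urban", "Rural", "Urban", "Rural",
                       "Urban", "Rural", "Urban", "Rural"],
    "stroke": [0, 0, 1, 1, 0, 0, 1, 1],
})

# Contingency table of feature categories against the target
contingency = pd.crosstab(toy["Residence_type"], toy["stroke"])

# Returns the test statistic, p-value, degrees of freedom, and expected counts
chi2, p_value, dof, expected = chi2_contingency(contingency)

alpha = 0.05  # significance level used in the text
print(f"chi2={chi2:.4f}, p={p_value:.4f}, dof={dof}")
if p_value > alpha:
    print("No evidence of association with the target")
else:
    print("Feature appears associated with the target")
```

A p-value above `alpha` means the test fails to reject independence between the feature and the target, which is the criterion applied to the dropped columns.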

The `smoking_status` and `bmi` columns were kept despite the results of the target-feature analysis. The Chi-square test of `smoking_status` against `stroke` returned a high p-value (0.511786), and the boxplot showed no clear difference in distribution or median values. However, the literature suggests that BMI and smoking history are in fact strong predictors of stroke, so both features are retained for the modelling process.

In [10]:
# Encode smoking status as an ordinal risk score
smoking_scoring = {
    "Unknown": 0,
    "Never smoked": 0,
    "Formerly smoked": 0.5,
    "Smokes": 1,
}

df_modelling["smoking_status"] = df_modelling["smoking_status"].replace(smoking_scoring)

# One-hot encode work type and drop the original column
work_type_encoded = pd.get_dummies(df_modelling["work_type"], prefix="work")
df_modelling = pd.concat([df_modelling.drop(["work_type"], axis=1), work_type_encoded], axis=1)

# Save the modelling dataset (the index is written as an extra column unless index=False is passed)
df_modelling.to_csv("../data/stroke_dataset_modelling.csv")

# Preview the encoded frame; placed last so the notebook displays it
df_modelling.head()


The `smoking_status` column was ordinally encoded to reflect potential health risk: "Unknown" and "Never smoked" were encoded as 0, "Formerly smoked" as 0.5 to capture any residual effects of past smoking, and "Smokes" as 1 to reflect active risk.
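The scoring scheme can be checked on a small example. The sketch below uses `Series.map` on toy data; the `"Vapes"` label is hypothetical and only illustrates a category that is missing from the dictionary. Unlike `replace`, `map` turns such categories into `NaN`, which makes unexpected labels easy to spot before modelling.

```python
import pandas as pd

smoking_scoring = {
    "Unknown": 0,
    "Never smoked": 0,
    "Formerly smoked": 0.5,
    "Smokes": 1,
}

# Toy series; "Vapes" is a hypothetical category absent from the mapping
status = pd.Series(["Never smoked", "Smokes", "Formerly smoked", "Unknown", "Vapes"])

# Map each label to its risk score; unmapped labels become NaN
encoded = status.map(smoking_scoring)

# Collect the labels that were not covered by the mapping
unmapped = status[encoded.isna()].unique().tolist()

print(encoded.tolist())
print("Unmapped categories:", unmapped)
```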

For the `work_type` column, one-hot encoding was used to avoid introducing ordinality. Dummy variables were created with the `work_` prefix, and the original column was dropped.
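As a small illustration of the resulting columns, the sketch below one-hot encodes a toy frame. The rows are hypothetical and reuse `work_type` labels from the preview above. Passing `dtype=int` is an optional choice made here to get 0/1 columns, since recent pandas versions produce boolean dummies by default.

```python
import pandas as pd

# Toy frame (hypothetical rows, not the full dataset)
toy = pd.DataFrame({
    "age": [23, 82, 13],
    "work_type": ["Private", "Self-employed", "Children"],
})

# One dummy column per category, named "work_<category>"
dummies = pd.get_dummies(toy["work_type"], prefix="work", dtype=int)

# Replace the original categorical column with its dummy columns
encoded = pd.concat([toy.drop(columns=["work_type"]), dummies], axis=1)

print(encoded.columns.tolist())
```

Each row has exactly one dummy column set to 1, so no artificial ordering between work types is implied.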