TODO: Add Title, TOC, Intro, Goal

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
f"pandas version: {pd.__version__}"

'pandas version: 2.1.4'

In [3]:
# Load the data
df = pd.read_csv("../data/interim/healthcare-stroke-data-cleaned.csv", index_col='id')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5109 entries, 9046 to 44679
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             5109 non-null   object 
 1   age                5109 non-null   float64
 2   hypertension       5109 non-null   int64  
 3   heart_disease      5109 non-null   int64  
 4   ever_married       5109 non-null   object 
 5   work_type          5109 non-null   object 
 6   Residence_type     5109 non-null   object 
 7   avg_glucose_level  5109 non-null   float64
 8   bmi                5109 non-null   float64
 9   smoking_status     5109 non-null   object 
 10  stroke             5109 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 479.0+ KB


In [5]:
df.shape

(5109, 11)

We start by building a clean DataFrame that contains only the numeric columns. The categorical columns will be transformed and added to it.

In [6]:
df_clean = df.select_dtypes("number").copy()
df_clean.head()

| id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke |
|---|---|---|---|---|---|---|
| 9046 | 67.0 | 0 | 1 | 228.69 | 36.6 | 1 |
| 51676 | 61.0 | 0 | 0 | 202.21 | 30.5 | 1 |
| 31112 | 80.0 | 0 | 1 | 105.92 | 32.5 | 1 |
| 60182 | 49.0 | 0 | 0 | 171.23 | 34.4 | 1 |
| 1665 | 79.0 | 1 | 0 | 174.12 | 24.0 | 1 |


We will encode the remaining columns in a numeric format:

In [7]:
# todo list - we need to transform all of these!
df_categorical = df.select_dtypes("object").copy()
df_categorical.head()

| id | gender | ever_married | work_type | Residence_type | smoking_status |
|---|---|---|---|---|---|
| 9046 | Male | Yes | Private | Urban | formerly smoked |
| 51676 | Female | Yes | Self-employed | Rural | never smoked |
| 31112 | Male | Yes | Private | Rural | never smoked |
| 60182 | Female | Yes | Private | Urban | smokes |
| 1665 | Female | Yes | Self-employed | Rural | never smoked |


In [8]:
df_categorical.nunique().sort_values()

gender            2
ever_married      2
Residence_type    2
smoking_status    4
work_type         5
dtype: int64

### Mapping `gender`

We will binary encode this column, mapping Male as 0 and Female as 1.

In [9]:
df_categorical["gender"].value_counts()

gender
Female    2994
Male      2115
Name: count, dtype: int64

In [10]:
df_clean['gender'] = np.where(df_categorical["gender"] == "Female", 1, 0)
df_clean.head()

| id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke | gender |
|---|---|---|---|---|---|---|---|
| 9046 | 67.0 | 0 | 1 | 228.69 | 36.6 | 1 | 0 |
| 51676 | 61.0 | 0 | 0 | 202.21 | 30.5 | 1 | 1 |
| 31112 | 80.0 | 0 | 1 | 105.92 | 32.5 | 1 | 0 |
| 60182 | 49.0 | 0 | 0 | 171.23 | 34.4 | 1 | 1 |
| 1665 | 79.0 | 1 | 0 | 174.12 | 24.0 | 1 | 1 |


In [11]:
# check the value counts match the original categorical
df_clean['gender'].value_counts()

gender
1    2994
0    2115
Name: count, dtype: int64

We can confirm that the `gender` column is transformed correctly and added to the clean table.
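One caveat with `np.where` is that every value other than "Female" is sent to 0, so an unexpected label would be encoded as Male without any warning. The following is a minimal sketch of a stricter alternative, not the method used in this notebook. It assumes a small toy Series (the values and the `gender_map` name are illustrative only) and relies on `Series.map` turning unmapped labels into `NaN`, which is easy to check for.

```python
import pandas as pd

# Toy data for illustration only; "Other" stands in for an unexpected label.
gender = pd.Series(["Male", "Female", "Other", "Female"], name="gender")

# Hypothetical explicit mapping: any label not listed here becomes NaN.
gender_map = {"Male": 0, "Female": 1}
encoded = gender.map(gender_map)

# Collect the labels that the mapping did not cover.
unmapped = gender[encoded.isna()].unique().tolist()
print(unmapped)
```

If `unmapped` is empty, the column can safely be cast to an integer type with `encoded.astype(int)`.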

### Mapping `ever_married`

We will binary encode this column, mapping No as 0 and Yes as 1.

In [12]:
df_categorical['ever_married'].value_counts()

ever_married
Yes    3353
No     1756
Name: count, dtype: int64

In [13]:
df_clean['ever_married'] = np.where(df_categorical["ever_married"] == "Yes", 1, 0)
df_clean.head()

| id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke | gender | ever_married |
|---|---|---|---|---|---|---|---|---|
| 9046 | 67.0 | 0 | 1 | 228.69 | 36.6 | 1 | 0 | 1 |
| 51676 | 61.0 | 0 | 0 | 202.21 | 30.5 | 1 | 1 | 1 |
| 31112 | 80.0 | 0 | 1 | 105.92 | 32.5 | 1 | 0 | 1 |
| 60182 | 49.0 | 0 | 0 | 171.23 | 34.4 | 1 | 1 | 1 |
| 1665 | 79.0 | 1 | 0 | 174.12 | 24.0 | 1 | 1 | 1 |


In [14]:
# check the value counts match the original categorical
df_clean['ever_married'].value_counts()

ever_married
1    3353
0    1756
Name: count, dtype: int64

We can confirm that the `ever_married` column is transformed correctly and added to the clean table.

### Mapping `Residence_type`

We will binary encode this column, mapping Rural as 0 and Urban as 1.

In [15]:
df_categorical['Residence_type'].value_counts()

Residence_type
Urban    2596
Rural    2513
Name: count, dtype: int64

In [16]:
df_clean['residence_type'] = np.where(df_categorical["Residence_type"] == "Urban", 1, 0)
df_clean.head()

| id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke | gender | ever_married | residence_type |
|---|---|---|---|---|---|---|---|---|---|
| 9046 | 67.0 | 0 | 1 | 228.69 | 36.6 | 1 | 0 | 1 | 1 |
| 51676 | 61.0 | 0 | 0 | 202.21 | 30.5 | 1 | 1 | 1 | 0 |
| 31112 | 80.0 | 0 | 1 | 105.92 | 32.5 | 1 | 0 | 1 | 0 |
| 60182 | 49.0 | 0 | 0 | 171.23 | 34.4 | 1 | 1 | 1 | 1 |
| 1665 | 79.0 | 1 | 0 | 174.12 | 24.0 | 1 | 1 | 1 | 0 |


In [17]:
# check the value counts match the original categorical
df_clean['residence_type'].value_counts()

residence_type
1    2596
0    2513
Name: count, dtype: int64

We can confirm that the `Residence_type` column is transformed correctly and added to the clean table under the lowercase name `residence_type`.

### Mapping `smoking_status` and `work_type` using Dummy variables

In [18]:
df_categorical['smoking_status'].value_counts()

smoking_status
never smoked       1892
Unknown            1544
formerly smoked     884
smokes              789
Name: count, dtype: int64

**Decision Point / Assumption:** Our EDA in the previous notebook showed that stroke outcomes are similar for "smokes" and "formerly smoked". Because our dataset is small, we will merge "formerly smoked" into "smokes" and treat them as a single category in the analysis.
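The kind of comparison behind this decision can be sketched as a stroke rate per smoking category. This is only an illustration on made-up toy data; the EDA notebook may have used a different method, and the numbers below are not the dataset's.

```python
import pandas as pd

# Toy data for illustration only; these are not the notebook's values.
toy = pd.DataFrame({
    "smoking_status": ["smokes", "smokes", "formerly smoked", "formerly smoked",
                       "never smoked", "never smoked", "Unknown", "Unknown"],
    "stroke": [1, 0, 1, 0, 0, 0, 0, 0],
})

# The mean of a 0/1 target is the share of positive cases in each category;
# the count shows how many rows each rate is based on.
rate_by_status = toy.groupby("smoking_status")["stroke"].agg(["mean", "count"])
print(rate_by_status)
```

Categories with similar rates and few rows each are reasonable candidates for merging.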

In [19]:
df_categorical['smoking_status'] = df_categorical['smoking_status'].replace({'formerly smoked': 'smokes'})

In [20]:
dummies = pd.get_dummies(df_categorical[['smoking_status', 'work_type']], dtype=int).drop(columns=['smoking_status_Unknown', 'work_type_children'])
dummies.head()

| id | smoking_status_never smoked | smoking_status_smokes | work_type_Govt_job | work_type_Never_worked | work_type_Private | work_type_Self-employed |
|---|---|---|---|---|---|---|
| 9046 | 0 | 1 | 0 | 0 | 1 | 0 |
| 51676 | 1 | 0 | 0 | 0 | 0 | 1 |
| 31112 | 1 | 0 | 0 | 0 | 1 | 0 |
| 60182 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1665 | 1 | 0 | 0 | 0 | 0 | 1 |
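The cell above drops one dummy column per variable (`smoking_status_Unknown` and `work_type_children`). The notebook does not state the reason, but the usual motivation is to keep one level as a reference category so the dummy columns are not redundant with each other. A minimal sketch on toy data (the three rows of values below are illustrative only):

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"work_type": ["Private", "children", "Govt_job", "Private"]})

# One dummy column per level of work_type.
full = pd.get_dummies(toy, columns=["work_type"], dtype=int)

# With every level kept, each row has exactly one 1, so the columns
# always sum to 1 and any one of them is implied by the others.
row_sums = full.sum(axis=1)

# Dropping one level makes it the reference category:
# a row of all zeros now means "children".
reduced = full.drop(columns=["work_type_children"])
print(reduced)
```

Choosing the dropped level by name, as the notebook does, makes the reference category explicit, whereas `drop_first=True` would drop whichever level sorts first.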


In [21]:
# Fixing the column names before concatenating with the clean dataset
dummies.rename(columns={"smoking_status_never smoked": "smoking_status_never_smoked",
                        "work_type_Govt_job": "work_type_govt_job",
                        "work_type_Never_worked": "work_type_never_worked",
                        "work_type_Private": "work_type_private",
                        "work_type_Self-employed": "work_type_self_employed"}, inplace=True)
dummies.head()

| id | smoking_status_never_smoked | smoking_status_smokes | work_type_govt_job | work_type_never_worked | work_type_private | work_type_self_employed |
|---|---|---|---|---|---|---|
| 9046 | 0 | 1 | 0 | 0 | 1 | 0 |
| 51676 | 1 | 0 | 0 | 0 | 0 | 1 |
| 31112 | 1 | 0 | 0 | 0 | 1 | 0 |
| 60182 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1665 | 1 | 0 | 0 | 0 | 0 | 1 |
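The manual rename above lowercases each name and replaces spaces and hyphens with underscores. As a sketch of an alternative, the same rule can be applied to all column names at once with the string methods on a pandas `Index`. The `cols` variable below is a stand-in for `dummies.columns`, using four of the names from the output above.

```python
import pandas as pd

# Stand-in for dummies.columns, using names from the get_dummies output.
cols = pd.Index([
    "smoking_status_never smoked",
    "smoking_status_smokes",
    "work_type_Govt_job",
    "work_type_Self-employed",
])

# Lowercase everything, then replace each space or hyphen with an underscore.
cleaned = cols.str.lower().str.replace(r"[ -]", "_", regex=True)
print(list(cleaned))
```

Assigning the result back with `dummies.columns = cleaned` would avoid listing every column by hand, at the cost of being less explicit about which names change.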


Let's put the dummies together with the already clean data:

In [22]:
print(df_clean.shape)
print(dummies.shape)

(5109, 9)
(5109, 6)
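Matching row counts are a useful first check, but `pd.concat` with `axis=1` aligns rows on the index rather than on position. A minimal sketch of why the index matters, using toy frames (the `left`, `right`, and `shifted` names and their values are illustrative only):

```python
import pandas as pd

# Toy frames for illustration only; both share the same "id" index.
left = pd.DataFrame({"age": [67.0, 61.0]}, index=pd.Index([9046, 51676], name="id"))
right = pd.DataFrame({"smokes": [1, 0]}, index=pd.Index([9046, 51676], name="id"))

# Hypothetical frame with the same shape but one different id.
shifted = right.set_axis([9046, 99999], axis=0)

# True only when both indexes hold the same labels in the same order.
same_index = left.index.equals(right.index)

# Identical indexes: rows line up one to one.
aligned = pd.concat([left, right], axis=1)

# Different indexes: the result is the union of labels, padded with NaN.
misaligned = pd.concat([left, shifted], axis=1)
print(same_index, aligned.shape, misaligned.shape)
```

In this notebook both frames are derived from `df`, so their indexes should match; `df_clean.index.equals(dummies.index)` would confirm that directly.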


In [23]:
df_clean = pd.concat([df_clean, dummies], axis=1)
df_clean.head()

| id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke | gender | ever_married | residence_type | smoking_status_never_smoked | smoking_status_smokes | work_type_govt_job | work_type_never_worked | work_type_private | work_type_self_employed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9046 | 67.0 | 0 | 1 | 228.69 | 36.6 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 51676 | 61.0 | 0 | 0 | 202.21 | 30.5 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 31112 | 80.0 | 0 | 1 | 105.92 | 32.5 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 60182 | 49.0 | 0 | 0 | 171.23 | 34.4 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1665 | 79.0 | 1 | 0 | 174.12 | 24.0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |


In [24]:
df_clean.shape

(5109, 15)

In [25]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5109 entries, 9046 to 44679
Data columns (total 15 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   age                          5109 non-null   float64
 1   hypertension                 5109 non-null   int64  
 2   heart_disease                5109 non-null   int64  
 3   avg_glucose_level            5109 non-null   float64
 4   bmi                          5109 non-null   float64
 5   stroke                       5109 non-null   int64  
 6   gender                       5109 non-null   int32  
 7   ever_married                 5109 non-null   int32  
 8   residence_type               5109 non-null   int32  
 9   smoking_status_never_smoked  5109 non-null   int32  
 10  smoking_status_smokes        5109 non-null   int32  
 11  work_type_govt_job           5109 non-null   int32  
 12  work_type_never_worked       5109 non-null   int32  
 13  work_type_private  

All columns are now numeric, and there are no null or duplicate values. The data is ready to be used in a model, so let's save it.

In [26]:
df_clean.to_csv("../data/interim/healthcare-stroke-data-preprocessed.csv")
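The claims above (all numeric, no nulls, no duplicates) can also be checked in code before saving. The following is a minimal sketch on a toy frame, not a cell from this notebook; the `toy` frame and the check names are illustrative, and "no duplicates" is interpreted here as unique `id` values, which is an assumption. The CSV round trip uses an in-memory buffer instead of a file.

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame for illustration only, indexed by "id" like df_clean.
toy = pd.DataFrame(
    {"age": [67.0, 61.0], "gender": [0, 1], "stroke": [1, 1]},
    index=pd.Index([9046, 51676], name="id"),
)

# Every column has a numeric dtype.
all_numeric = all(is_numeric_dtype(dtype) for dtype in toy.dtypes)
# No missing values anywhere in the frame.
no_nulls = not toy.isna().any().any()
# No id appears more than once (assumed meaning of "no duplicates").
unique_ids = toy.index.is_unique

# Round trip through CSV text; index_col restores "id" as the index.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="id")
print(all_numeric, no_nulls, unique_ids, restored.shape)
```

Note that CSV does not store dtypes, so columns shown as `int32` in `df_clean.info()` will typically be read back as `int64`.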