# Stroke Prediction Analysis for Government Awareness Campaign

## Introduction
In this notebook, we use the healthcare stroke dataset from Kaggle [more info about dataset here] to identify the factors that increase a person's likelihood of experiencing a stroke. We will fit [xyz models] to carry out an inference analysis on the dataset.

## Import libraries

In [3]:
import pandas as pd
import numpy as np
np.random.seed(0)
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer

## Importing and previewing the dataset

In [4]:
df = pd.read_csv("healthcare-dataset-stroke-data.csv")
df.head()

,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


Excluding the `id` column, there are 10 predictors and one outcome variable, `stroke` (1 for yes, 0 for no).

The 10 predictors are:

- `gender`: Female, Male 

- `age`: Continuous

- `hypertension`: 0, 1

- `heart_disease`: 0, 1

- `ever_married`: Yes, No

- `work_type`: Private, Self-employed, children, Govt_job, Never_worked

- `Residence_type`: Urban, Rural

- `avg_glucose_level`: Continuous

- `bmi`: Continuous, including some NaN values

- `smoking_status`: never smoked, Unknown, formerly smoked, smokes


We'll start by addressing the NaN values in the bmi column. To decide how to address these values, we'll first have a closer look at the bmi column. This will help us understand if we should drop the null values, replace them with the mean, or use another method. 
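
Before focusing on `bmi`, it can help to confirm which columns contain missing values at all. The following is a minimal sketch on a toy DataFrame (the `toy` frame and its values are illustrative assumptions, not rows from the dataset); the same two calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the stroke data (illustrative values only)
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, np.nan],
    "stroke": [1, 1, 0, 0],
})

# Count and share of missing values in each column
missing_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share_missing": toy.isna().mean(),
})
print(missing_summary)
```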

In [5]:
df[df["bmi"].isna()]["stroke"].value_counts(), df[df["bmi"].isna()]["stroke"].value_counts(normalize=True)

(stroke
 0    161
 1     40
 Name: count, dtype: int64,
 stroke
 0    0.800995
 1    0.199005
 Name: proportion, dtype: float64)

In [6]:
df[df["bmi"].notna()]["stroke"].value_counts(), df[df["bmi"].notna()]["stroke"].value_counts(normalize=True)

(stroke
 0    4700
 1     209
 Name: count, dtype: int64,
 stroke
 0    0.957425
 1    0.042575
 Name: proportion, dtype: float64)

From the above, we can see that patients with no BMI data are roughly 4.7 times more likely to have had a stroke (19.9% vs 4.3%). Moreover, our dataset contains only 249 stroke cases, and 40 of them are in rows with missing BMI, so dropping these rows would discard about 16% of the positive cases.
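
The comparison above can be reproduced directly from the printed counts. This is a small arithmetic sketch; the dictionary and function names are illustrative, and the counts are copied from the two `value_counts()` outputs.

```python
# Counts copied from the two value_counts() outputs above
missing_bmi = {"stroke": 40, "no_stroke": 161}
known_bmi = {"stroke": 209, "no_stroke": 4700}


def stroke_rate(counts):
    """Share of patients in a group who had a stroke."""
    return counts["stroke"] / (counts["stroke"] + counts["no_stroke"])


rate_missing = stroke_rate(missing_bmi)
rate_known = stroke_rate(known_bmi)

# How many times higher the stroke rate is when BMI is missing
rate_ratio = rate_missing / rate_known

# Share of all stroke cases that would be lost by dropping rows with missing BMI
share_of_strokes_lost = missing_bmi["stroke"] / (
    missing_bmi["stroke"] + known_bmi["stroke"]
)

print(f"stroke rate, BMI missing: {rate_missing:.3f}")
print(f"stroke rate, BMI known:   {rate_known:.3f}")
print(f"ratio of rates:           {rate_ratio:.2f}")
print(f"share of strokes lost:    {share_of_strokes_lost:.3f}")
```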

Let's take a basic look at the importance of BMI as a stroke predictor. 

In [7]:
# Use string aggregator names; passing np.mean / np.std is deprecated in recent pandas
df.groupby("stroke")["bmi"].agg(["mean", "std"])

stroke,mean,std
0,28.823064,7.908287
1,30.471292,6.329452


In [8]:
df["bmi"].min(), df["bmi"].max()

(10.3, 97.6)

The mean and standard deviation of BMI differ only slightly between the stroke and non-stroke groups (means of about 30.5 vs 28.8). Given this, replacing the NaN values with a mean is likely safe; below, each missing value is filled with the mean BMI of its stroke class.
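
One way to put "little difference" on a common scale is a standardized mean difference (Cohen's d). This is not part of the original analysis; it is a sketch that reuses the summary statistics printed above and assumes the group sizes are the non-missing BMI counts from the earlier `value_counts()` output.

```python
import math

# Summary statistics copied from the groupby output above.
# Group sizes are the counts of rows with a known BMI in each class.
n0, mean0, std0 = 4700, 28.823064, 7.908287  # stroke == 0
n1, mean1, std1 = 209, 30.471292, 6.329452   # stroke == 1

# Pooled variance weights each group's variance by its degrees of freedom
pooled_var = ((n0 - 1) * std0**2 + (n1 - 1) * std1**2) / (n0 + n1 - 2)

# Difference in means expressed in pooled standard deviations
cohens_d = (mean1 - mean0) / math.sqrt(pooled_var)
print(round(cohens_d, 2))
```

A value near 0.2 is conventionally read as a small effect, which is consistent with treating the two groups' BMI distributions as similar.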

In [17]:
# calculate mean of bmi when stroke equals 1
mean_1 = df.loc[df['stroke'] == 1, 'bmi'].mean()
# calculate mean of bmi when stroke equals 0
mean_0 = df.loc[df['stroke'] == 0, 'bmi'].mean()

# replace null values for bmi when stroke equals 1 with mean_1
df.loc[(df['stroke'] == 1) & (df['bmi'].isna()), 'bmi'] = mean_1
# replace null values for bmi when stroke equals 0 with mean_0 
df.loc[(df['stroke'] == 0) & (df['bmi'].isna()), 'bmi'] = mean_0

In [19]:
df[df["bmi"].isna()]

,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke


As shown above, there are no more missing values in the `bmi` column. Each missing value was replaced with the mean BMI of the rows that share its `stroke` value.
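
Because this fill uses the `stroke` column, it relies on the outcome, which would not be known for a new patient at prediction time. A possible alternative, sketched here on a toy DataFrame (the `toy` frame and its values are illustrative assumptions), is to fill with the overall mean using the `SimpleImputer` imported at the top of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the stroke data (illustrative values only)
toy = pd.DataFrame({
    "bmi": [36.6, np.nan, 32.5, 24.0, np.nan],
    "stroke": [1, 1, 1, 0, 0],
})

# Fill missing BMI with the overall column mean, ignoring the outcome column
imputer = SimpleImputer(strategy="mean")
toy["bmi"] = imputer.fit_transform(toy[["bmi"]]).ravel()

print(toy)
```

In a modelling workflow, the imputer would typically be fitted on the training split only and then applied to the test split, so that test rows do not influence the fill value.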

In [24]:
df.isnull().values.any()

False

No more missing values exist in the data. 