## Stroke Prediction Dataset Proposal - Group 16

#### Introduction:

Every year, 6.5 million people die from a stroke. A stroke occurs when an artery supplying the brain is blocked or when a blood vessel in the brain bursts, leading to brain damage. Such damage can result in long-term impairment or even death.

Using the Stroke Prediction Dataset from Kaggle, we intend to build a classification model that predicts whether a person will have a stroke based on several relevant variables. These variables include gender, age, hypertension, heart disease, average glucose level, BMI, smoking status and more.


#### Preliminary exploratory data analysis:


In [1]:
import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [2]:
# reading in data
url = "https://raw.githubusercontent.com/margshen/dsci_group_16/5b515efa41fcc3085d53cb27fa09d25698036ca0/data/stroke_data.csv"
stroke_data = pd.read_csv(url)

stroke_data

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


In [3]:
# wrangling stroke column: recode 1/0 as "yes"/"no" so it is treated as a class label
stroke_data["stroke"] = stroke_data["stroke"].replace({1: "yes", 0: "no"})
stroke_data

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,yes
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,yes
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,yes
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,yes
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,yes
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,no
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,no
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,no
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,no


In [4]:
# splitting data into a training set (75%) and a testing set (25%)
stroke_training, stroke_testing = train_test_split(
    stroke_data,
    test_size = 0.25,
    random_state = 1
)


In [5]:
# number of observations in each stroke class
stroke_count = stroke_training["stroke"].value_counts()

print(stroke_count)

no     3658
yes     174
Name: stroke, dtype: int64
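
The counts above show a heavily imbalanced target. As a rough check, the short calculation below uses only the two counts printed above to work out the share of stroke cases and the accuracy a classifier would reach by always predicting "no". It is a baseline sketch for reading later accuracy figures, not part of the modelling itself.

```python
# Class counts copied from the value_counts() output above.
n_no, n_yes = 3658, 174
n_total = n_no + n_yes

# Share of stroke cases in the training set.
stroke_share = n_yes / n_total

# Accuracy of a trivial classifier that always predicts "no".
majority_baseline = n_no / n_total

print(f"training rows: {n_total}")
print(f"stroke share: {stroke_share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Because a classifier can score about 95% accuracy without identifying a single stroke, precision and recall on the "yes" class are the more informative measures here.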


In [6]:
# number of missing values in each column
missing_per_column = stroke_training.isna().sum()

print(missing_per_column)

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  153
smoking_status         0
stroke                 0
dtype: int64


In [7]:
# means of potential predictor values
mean_value_of_columns = stroke_training[["age", "avg_glucose_level", "bmi"]].mean()
mean_value_of_columns

age                   43.280021
avg_glucose_level    105.521550
bmi                   28.963930
dtype: float64

In [8]:
# scatter plot of age vs. average glucose level, coloured by stroke class
plot = alt.Chart(stroke_training).mark_circle(opacity = 0.4).encode(
    x = alt.X("age", title = "Age"),
    y = alt.Y("avg_glucose_level", title = "Average Glucose Level"),
    color = alt.Color("stroke", title = "Stroke Classification")
)

plot

#### Methods:
First, we will select the variables (columns) that are relevant to our prediction
- To do this, we will graph each variable against stroke using bar graphs to see whether there is an association between the two. Variables with little or no association with stroke will be excluded as predictors.
<br>
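
As a minimal sketch of the quantity such a bar graph would show, the snippet below computes the proportion of stroke cases within each category of one variable. The data frame is a tiny made-up sample: its column names mirror the dataset, but the values are illustrative only and not taken from it.

```python
import pandas as pd

# Tiny made-up sample for illustration only; the column names mirror the
# dataset, but the values are NOT taken from it.
sample = pd.DataFrame({
    "smoking_status": ["smokes", "smokes", "never smoked", "never smoked",
                       "formerly smoked", "formerly smoked"],
    "stroke": ["yes", "no", "no", "no", "yes", "no"],
})

# Proportion of "yes" within each category: the height of each bar in a
# bar graph of a categorical variable vs. stroke.
stroke_rate = (
    sample.assign(had_stroke=sample["stroke"].eq("yes"))
    .groupby("smoking_status")["had_stroke"]
    .mean()
    .rename("stroke_rate")
    .reset_index()
)

print(stroke_rate)
```

Applied to `stroke_training`, a table like this could be passed to `alt.Chart(...).mark_bar()` with the category on one axis and `stroke_rate` on the other. Comparing proportions rather than raw counts matters because the categories differ in size.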

Before classification, we will standardize the numeric predictors so that each contributes equally to the distance calculation. The analysis will then proceed as follows:
1. Split the data into a training set (75%) and a testing set (25%)
2. Use 5-fold or 10-fold cross-validation on the training data to select the best k value for the classifier
3. Plot the estimated accuracy vs. the number of neighbours to pick the best k value
4. Evaluate the classifier on the testing set, then refine it according to its accuracy, precision and recall
5. Use the K-nearest neighbours (KNN) classifier to predict unknown observations
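
The steps above can be sketched with scikit-learn as follows. This is one possible arrangement under stated assumptions, not the final analysis: it runs on synthetic stand-in data rather than the stroke dataset, uses only two predictors, and the candidate k values (odd numbers from 1 to 29) and the choice of 5 folds are arbitrary. Placing the scaler inside the pipeline means it is re-fit on each cross-validation training fold.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the stroke dataset): two numeric predictors
# and a label that depends loosely on age.
rng = np.random.default_rng(1)
n = 400
demo = pd.DataFrame({
    "age": rng.uniform(20, 85, n),
    "avg_glucose_level": rng.uniform(60, 250, n),
})
demo["stroke"] = np.where(demo["age"] + rng.normal(0, 10, n) > 60, "yes", "no")

# Step 1: 75% / 25% split.
train, test = train_test_split(demo, test_size=0.25, random_state=1)
predictors = ["age", "avg_glucose_level"]
X_train, y_train = train[predictors], train["stroke"]
X_test, y_test = test[predictors], test["stroke"]

# Standardize, then classify with K-nearest neighbours.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Steps 2-3: 5-fold cross-validation over candidate k values (assumed grid).
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 31, 2))}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# Table of estimated accuracy per k, ready to plot as accuracy vs. k.
cv_results = pd.DataFrame(grid.cv_results_)[
    ["param_kneighborsclassifier__n_neighbors", "mean_test_score"]
]
best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]

# Step 4: evaluate the refit best model on the testing set.
y_pred = grid.predict(X_test)
accuracy = grid.score(X_test, y_test)
precision = precision_score(y_test, y_pred, pos_label="yes")
recall = recall_score(y_test, y_pred, pos_label="yes")
cm = confusion_matrix(y_test, y_pred, labels=["no", "yes"])

# Step 5: predict a new, made-up observation.
new_obs = pd.DataFrame({"age": [70.0], "avg_glucose_level": [150.0]})
new_pred = grid.predict(new_obs)

print("best k:", best_k)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
print(cm)
```

The real dataset would need extra handling that this sketch leaves out, such as the missing `bmi` values reported earlier, since `KNeighborsClassifier` does not accept missing values.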

Visualizations:
1. Use a confusion matrix display to show the performance of our classifier
2. Plot a sample 2D plot of our classification results:
>- Use the 2 predictors most strongly associated with whether a person has a stroke as the “x” and “y” variables on the plot
>- Colour the points by stroke class
> > - Known data points should be a hollow circle
>- Perform classification using our tuned classifier on a set of unknown data points
>- Layer the set of unknown data points on top of our plot, colouring each by its predicted label
> > - Predicted data points should be a solid circle


#### Expected outcomes and significance:
Through an analysis of this dataset, we hope to see which factors are most strongly associated with whether someone has a stroke. These findings could help healthcare professionals monitor at-risk patients and reduce their chance of having a stroke. Future questions arising from this work would likely focus on individual variables. For example, if stroke patients were disproportionately self-employed (a category of the work_type variable), further analysis could examine how self-employment, or particular types of self-employment, relates to stroke risk. Overall, an analysis of this dataset may provide some insight into what affects the chance of having a stroke and which groups of individuals are more prone to it.