# Stroke Predictor - DSCI 100 Group Project Proposal

## Introduction

The risk of stroke can often be estimated from factors such as blood glucose level and smoking status. Telling a person that they fall into a higher-risk category gives them a chance to reduce those risk factors in the hope of preventing a stroke. The question our project asks is: given a large data set of factors associated with stroke, can we build a classifier that predicts whether a new individual is likely to have a stroke?
The data set is provided on Kaggle (Stroke Prediction Dataset | Kaggle). It lists background clinical features for each patient, such as disease history and age, all potentially useful in predicting a stroke.

## Preliminary exploratory data analysis

In [None]:
### LOADING LIBRARY

### Run this cell before continuing.
library(plyr)
library(dplyr)
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)
options(repr.matrix.max.rows = 6)


### Read dataset from web into R

In [None]:
url <- "https://gist.githubusercontent.com/aishwarya8615/d2107f828d3f904839cbcb7eaa85bd04/raw/cec0340503d82d270821e03254993b6dede60afb/healthcare-dataset-stroke-data.csv"
stroke_data <- read_csv(url)

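## Modify variable types: stroke becomes a factor (the class label).
## bmi is read as text because missing values are coded "N/A";
## as.numeric() converts those entries to NA (with a coercion warning).
stroke_data_final <- stroke_data |>
    mutate(stroke = as.factor(stroke),
           bmi = as.numeric(bmi))
stroke_data_final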

To see how various factors relate to developing a stroke, we will plot three scatter plots: age vs. average glucose level, age vs. BMI, and BMI vs. average glucose level. These three variables were chosen because they are continuous measurements and are closely related to the development of stroke.

### Splitting the dataset into training and testing datasets, stratified by stroke history

In [None]:
set.seed(1)

stroke_split <- initial_split(stroke_data_final, prop = 0.75, strata = stroke)
stroke_train <- training(stroke_split)
stroke_test <- testing(stroke_split) 

### Summary

The tables below show the minimum and maximum of each continuous feature for people who have had a stroke and for people who have not.

In [None]:
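# Keep only the continuous predictors plus the class label
continuous_factors <- select(stroke_train, bmi, avg_glucose_level, age, stroke)

# Split the rows by stroke history (1 = stroke, 0 = no stroke)
factors_with_stroke <- filter(continuous_factors, stroke == 1)
factors_without_stroke <- filter(continuous_factors, stroke == 0)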


min_max <- list(
  min = ~min(.x, na.rm = TRUE), 
  max = ~max(.x, na.rm = TRUE)
)

avg <- list(
  avg = ~mean(.x, na.rm = TRUE)
)



factors_with_stroke |> summarise(across(where(is.numeric), min_max))
factors_without_stroke |> summarise(across(where(is.numeric), min_max))




The tables below show the average of each continuous feature for people who have had a stroke and for people who have not.

In [None]:
factors_with_stroke |> summarise(across(where(is.numeric), avg))
factors_without_stroke |> summarise(across(where(is.numeric), avg))

The table below shows how many missing values are present in each column. This is important because we need to exclude any examples with incomplete data.

In [None]:
# Count missing values per column, separately for each stroke class
stroke_bmi <- sum(is.na(factors_with_stroke$bmi))
stroke_avg_glucose <- sum(is.na(factors_with_stroke$avg_glucose_level))
stroke_age <- sum(is.na(factors_with_stroke$age))
no_stroke_bmi <- sum(is.na(factors_without_stroke$bmi))
no_stroke_avg_glucose <- sum(is.na(factors_without_stroke$avg_glucose_level))
no_stroke_age <- sum(is.na(factors_without_stroke$age))

# "With stroke: Age" must use stroke_age (not stroke_bmi)
missing_data <- tibble("With stroke: Bmi" = stroke_bmi, "With stroke: Avg Glucose" = stroke_avg_glucose, "With stroke: Age" = stroke_age,
            "Without stroke: Bmi" = no_stroke_bmi, "Without stroke: Avg Glucose" = no_stroke_avg_glucose, "Without stroke: Age" = no_stroke_age)
missing_data
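
The per-column counts above can also be computed in one grouped summary. The sketch below is only an illustration: `toy` is a made-up tibble standing in for `stroke_train` (its values are invented, not taken from the data set), and `missing_by_class` is a hypothetical name.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the training data; all values are invented for illustration
toy <- tibble(
  stroke = factor(c(0, 0, 0, 1, 1)),
  bmi = c(22.5, NA, 30.1, NA, 27.8),
  avg_glucose_level = c(85.2, 110.4, 95.0, 210.3, 180.6),
  age = c(34, 51, 46, 72, 68)
)

# Count missing values in every non-grouping column, one row per stroke class
missing_by_class <- toy |>
  group_by(stroke) |>
  summarise(across(everything(), ~ sum(is.na(.x))))

missing_by_class
```

With the real data, replacing `toy` by `continuous_factors` (or whatever the selected training columns are called) should give the same counts as the table above, with one row per class instead of one column per class and variable.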

### Data Visualization

In [None]:
# Display size for the combined figure (repr reads only width and height;
# repr.plot.length is not a recognized option)
options(repr.plot.width = 15, repr.plot.height = 20)

# Points are colored by stroke history; non-stroke points (level 0) are
# drawn faintly so that stroke points stay visible. Giving color and alpha
# the same legend title merges them into a single legend.
age_vs_glucose <- ggplot(stroke_train, aes(x = age, y = avg_glucose_level, color = stroke)) +
    geom_point(aes(alpha = stroke)) +
    scale_alpha_manual(values = c(0.1, 1)) +
    scale_color_manual(values = c("red", "blue")) +
    labs(x = "Age", y = "Average Blood Glucose Level (mg/dL)",
         color = "Stroke History", alpha = "Stroke History") +
    theme(text = element_text(size = 15))

age_vs_bmi <- ggplot(stroke_train, aes(x = age, y = bmi, color = stroke)) +
    geom_point(aes(alpha = stroke)) +
    scale_alpha_manual(values = c(0.1, 1)) +
    scale_color_manual(values = c("red", "blue")) +
    labs(x = "Age", y = "Body Mass Index (kg/m^2)",
         color = "Stroke History", alpha = "Stroke History") +
    theme(text = element_text(size = 15))

avg_glucose_vs_bmi <- ggplot(stroke_train, aes(x = avg_glucose_level, y = bmi, color = stroke)) +
    geom_point(aes(alpha = stroke)) +
    scale_alpha_manual(values = c(0.1, 1)) +
    scale_color_manual(values = c("red", "blue")) +
    labs(x = "Average Blood Glucose Level (mg/dL)", y = "Body Mass Index (kg/m^2)",
         color = "Stroke History", alpha = "Stroke History") +
    theme(text = element_text(size = 15))

# Stack the three scatter plots in a single column
plot_grid(age_vs_glucose, age_vs_bmi, avg_glucose_vs_bmi, ncol = 1)

## Methods

We use classification because we are more interested in predicting potential strokes than in quantifying the dependency between stroke and other variables (regression). While the latter helps identify causes of stroke (and how strongly each contributes), classification provides a useful warning: it allows patients to be more aware of their lifestyle and doctors to be more proactive in diagnosis and treatment.

To build our model, we split the initial data so that 75% is used for training and the remaining 25% for testing. We also set a seed so that the code is reproducible. We will then build the model using the optimal K value found by applying cross-validation to the training data.
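
The tuning step could look roughly like the sketch below. It assumes that K refers to the number of neighbors in a K-nearest neighbors classifier fit with tidymodels and the `kknn` engine, which the proposal does not state explicitly. The data here (`toy_train`) are simulated purely so the example runs on its own; the fold count, the grid of K values, and the standardization steps are likewise illustrative assumptions rather than the project's final choices.

```r
library(tidymodels)

set.seed(1)

# Simulated stand-in for the training data (NOT the real stroke data)
n_obs <- 200
toy_train <- tibble(
  age = runif(n_obs, 20, 85),
  bmi = rnorm(n_obs, 28, 5),
  avg_glucose_level = rnorm(n_obs, 105, 30)
) |>
  mutate(stroke = factor(if_else(age + rnorm(n_obs, 0, 10) > 60, 1, 0)))

# Recipe: predict stroke from the three continuous predictors.
# Scaling and centering are an assumption; K-NN is distance-based,
# so predictors on different scales would otherwise be weighted unevenly.
knn_recipe <- recipe(stroke ~ age + bmi + avg_glucose_level, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Model specification with the number of neighbors left open for tuning
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation on the training data, stratified by class
folds <- vfold_cv(toy_train, v = 5, strata = stroke)

# Candidate K values to compare (an arbitrary illustrative grid)
k_grid <- tibble(neighbors = seq(1, 21, by = 4))

# Fit every K on every fold and keep the cross-validated accuracy
cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# K with the highest mean cross-validated accuracy
best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

best_k
```

On the real data, `toy_train` would be replaced by the training split, and the final model would be refit on the full training set with `neighbors = best_k` before being evaluated once on the test set.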

Three columns from the dataset will be used in this study: age, BMI, and average blood glucose level. These three variables were chosen not only because they are important factors contributing to an individual’s risk of developing a stroke, but also because they are continuous, which makes it easier to relate them to stroke. To visualize the relationships among the three factors and with stroke, we will produce three scatter plots: age vs. BMI, age vs. average blood glucose level, and BMI vs. average blood glucose level. In all three plots, the data points will be colored by stroke history (has had a stroke or not).

## Expected outcomes and significance

We expect the model to identify more potential stroke cases among elderly people than among young people. We also expect BMI and average blood glucose level to contribute to the risk of stroke, albeit less strongly; quantifying how strongly each factor contributes is a regression problem that is outside our scope.

By developing this stroke predictor, we hope to use a set of conditions that could contribute to the development of stroke to predict whether an individual will develop their first stroke in the future. We will also examine the accuracy of this predictor to guide future improvements. In a real-life setting, this predictor could be used for stroke prevention among high-risk patients and to promote lifestyle changes that reduce the risk of stroke. This predictor only addresses an individual's first stroke, yet nearly one in four strokes in the US occurs in people who have had a previous stroke (CDC, 2022). Future work could therefore predict the chance of recurrent strokes based on a more extensive range of health conditions, and it could raise new questions about additional risk factors that would improve the model.
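
The accuracy check mentioned in this section could be sketched as follows, assuming predictions are collected in a tibble with the true label in `stroke` and the predicted label in `.pred_class` (the column name parsnip uses for class predictions). The labels in `toy_predictions` are invented for illustration and are not results from our model.

```r
library(tidymodels)

# Hypothetical test-set predictions; labels are made up for illustration
toy_predictions <- tibble(
  stroke      = factor(c(0, 0, 0, 0, 0, 0, 1, 1, 0, 0), levels = c(0, 1)),
  .pred_class = factor(c(0, 0, 0, 0, 0, 0, 1, 0, 0, 1), levels = c(0, 1))
)

# Overall accuracy: fraction of predictions that match the true label
toy_accuracy <- toy_predictions |>
  accuracy(truth = stroke, estimate = .pred_class)

# Confusion matrix: shows which kinds of mistakes are being made
toy_confusion <- toy_predictions |>
  conf_mat(truth = stroke, estimate = .pred_class)

toy_accuracy
toy_confusion
```

Looking at the confusion matrix alongside accuracy may be worthwhile here, since a missed stroke case and a false alarm have very different consequences for a patient.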

**References:**
Centers for Disease Control and Prevention (2022). Stroke Facts. https://www.cdc.gov/stroke/facts.htm
