# DSCI 100 Project

## Introduction


***Background***

According to the World Health Organization (WHO), stroke is the second leading cause of death in the world, accounting for approximately 11% of total deaths. Our project explores a dataset of key characteristics of patients who have and have not suffered a stroke. We use data analysis techniques such as regression and clustering to investigate this information and provide insights that can guide future research into this issue.

***Predictive Question***

Our analysis centers on one key question: Which factors are the strongest predictors of a stroke?


***Data***

Each row (observation) in our dataset represents a patient, and each column represents a health-related characteristic hypothesized to be a stroke predictor. The dataset includes both categorical and numerical variables.

The variables represented in each column are as follows: 
- **id** (patient id)
- **gender** (male/female)
- **age** 
- **hypertension** (1/0) → Interpret as, "Have they had hypertension? 1 is Yes, 0 is No."
- **heart_disease** (1/0) → Interpret as, "Have they had heart disease? 1 is Yes, 0 is No."
- **ever_married** (Yes/No) → Interpret as, "Have they ever been married?"
- **work_type** (Private/Self-employed)
- **Residence_type** (Urban/Rural)
- **avg_glucose_level**
- **bmi** → Body Mass Index (BMI): body mass in kilograms divided by the square of height in metres.
- **smoking_status** (formerly smoked/never smoked)
- **stroke** (1/0) → Interpret as, "Have they had a stroke? 1 is Yes, 0 is No."

The dataset we are using is downloaded from Kaggle:
https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

## Methods and results


***Importing Packages***

In [None]:
install.packages("plotly")

library(tidyverse)
library(repr)
library(infer)
library(cowplot)
library(GGally)
library(broom)
library(RColorBrewer)
library(plotly)

***Loading Data***

In [None]:
stroke <- read_csv('data/stroke_data.csv')

stroke |>
    head(10)

***Wrangling***


In [None]:
# as.numeric() on a factor returns the level index, which starts at 1.
# We subtract 1 so the codes start at 0, consistent with the existing 0/1 columns.

stroke_numeric <- stroke |>
    select(-id) |>
    mutate(ever_married = as.numeric(as_factor(ever_married)) - 1) |>
    mutate(residence_type = as.numeric(as_factor(Residence_type)) - 1) |>
    select(-Residence_type) |>
    mutate(work_type = as.numeric(as_factor(work_type)) - 1) |>
    mutate(gender = as.numeric(as_factor(gender)) - 1) |>
    mutate(smoking_status = as.numeric(as_factor(smoking_status)) - 1) |>
    filter(bmi != "N/A") |>
    mutate(bmi = as.numeric(bmi)) |>
    mutate(stroke = as_factor(stroke)) |>
    relocate(residence_type, .after = work_type)

stroke_numeric |>
    head(10)
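
The encoding step above can be illustrated on a small made-up vector. This is only a sketch: `married` is a hypothetical toy column, not the stroke data. It shows that `as_factor()` assigns levels in order of first appearance, so which category becomes 0 depends on the order of the rows in the file.

```r
library(forcats)

# Hypothetical toy column, not taken from the stroke data
married <- c("Yes", "No", "Yes", "No")

# as_factor() orders levels by first appearance, so "Yes" is level 1 here
married_factor <- as_factor(married)
levels(married_factor)

# Subtracting 1 shifts the integer codes from 1/2 to 0/1
married_code <- as.numeric(married_factor) - 1
married_code
```

In this toy case "Yes" maps to 0 and "No" maps to 1, so it is worth checking the factor levels before interpreting the encoded columns.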

In [None]:
temp_stroke <- stroke_numeric |>
    select(stroke)

stroke_scaled <- stroke_numeric |>
    select(-stroke) |>
    # scale() returns a one-column matrix, so convert back to a plain numeric vector
    mutate(across(everything(), ~ as.numeric(scale(.x)))) |>
    bind_cols(temp_stroke)

stroke_scaled |>
    head(10)
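
As a quick illustration of what the standardization above does, the sketch below compares `scale()` with the manual formula (value minus mean, divided by standard deviation). The vector `x` is a hypothetical example, not a column from the dataset.

```r
# Hypothetical toy values, not taken from the stroke data
x <- c(2, 4, 6)

# Manual standardization: subtract the mean, divide by the sample standard deviation
manual <- (x - mean(x)) / sd(x)

# scale() does the same, but returns a one-column matrix with attributes,
# so we convert it back to a plain numeric vector
scaled <- as.numeric(scale(x))

manual
scaled
```

After this step every variable has mean 0 and standard deviation 1, so no single variable dominates the distance calculations used by K-means.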

***Summary***



In [None]:
set.seed(2020)

stroke_ks <- tibble(k = 1:25)

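# kmeans() needs numeric input, so we drop the factor column `stroke`,
# matching the data used for the final clustering
elbow_stats <- stroke_ks |>
    rowwise() |>
    mutate(stroke_clusts = list(kmeans(select(stroke_scaled, -stroke), k, nstart = 25))) |>
    mutate(glanced = list(glance(stroke_clusts))) |>
    select(-stroke_clusts) |>
    unnest(glanced)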

elbow_stats
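
To make the `tot.withinss` column easier to read, here is a minimal sketch on made-up data. The matrix `toy` and the helper names `wss` and `total_ss` are hypothetical and not part of our analysis. It shows that with a single cluster the total within-cluster sum of squares equals the total sum of squares around the overall mean, and that it drops sharply once K matches the number of well-separated groups.

```r
set.seed(1)

# Hypothetical toy data: two well-separated groups in two dimensions
toy <- rbind(
    matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
    matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for K = 1 to 4
wss <- sapply(1:4, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# With K = 1 the only center is the overall mean, so this should match wss[1]
total_ss <- sum(scale(toy, center = TRUE, scale = FALSE)^2)

wss
total_ss
```

The "elbow" is the value of K after which further increases give only small reductions in this quantity.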

***Visualization***


In [None]:
set.seed(2020)
options(repr.plot.height = 8, repr.plot.width = 8)

stroke_elbow_plot <- elbow_stats |>
    ggplot(aes(x = k, y = tot.withinss)) +
        geom_point() +
        geom_line() +
        labs(x = "K value", y = "Total Within-Cluster Sum of Squares") +
        scale_x_continuous(breaks = seq(0, 25, 1))

stroke_elbow_plot

In [None]:
set.seed(2021)
final_stroke_clusters <- kmeans(stroke_scaled |> select(-stroke), 7, nstart=4)
final_stroke_augment <- augment(final_stroke_clusters, stroke_scaled)
final_stroke_clusters$centers

In [None]:
final_stroke_plot_age_v_bmi <- final_stroke_augment |>
    ggplot(aes(x = age, y = bmi, color = .cluster)) +
        geom_point() +
        labs(x = "Age (standardized)", y = "BMI (standardized)", title = "Age vs BMI", color = "Cluster")

final_stroke_plot_age_v_bmi

In [None]:
final_stroke_plot_age_v_gluc <- final_stroke_augment |>
    ggplot(aes(x = age, y = avg_glucose_level, color = .cluster)) +
        geom_point() +
        labs(x = "Age (standardized)", y = "Average Glucose Level (standardized)", title = "Age vs Average Glucose Level", color = "Cluster")

final_stroke_plot_age_v_gluc

In [None]:
options(repr.plot.height = 8, repr.plot.width = 15)
final_stroke_augment |>
    select(-stroke) |>
    pivot_longer(cols = -.cluster, names_to = 'category', values_to = 'value')  |> 
    ggplot(aes(value, fill = .cluster)) +
        geom_density(alpha = 0.4, colour = 'white') +
        facet_wrap(facets = vars(category), scales = 'free') +
        labs(fill = "Cluster") +
        theme_minimal() +
        theme(text = element_text(size = 20))

***Analysis***

Based on the density plot, the variables that appear to drive the clustering are

## Discussion

***Findings***


***Significance***

## References


- fedesoriano. Stroke Prediction Dataset. Kaggle. https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset