# DSCI 100 Project

## Introduction


***Background***

According to the World Health Organization (WHO), stroke is the second leading cause of death worldwide, accounting for approximately 11% of all deaths. Our project explores a dataset of key characteristics of patients who have and have not suffered a stroke. We will use data analysis techniques such as regression and clustering to investigate this information and provide insights that may guide future research into this issue.

***Predictive Question***

Our analysis centers on one key question: which factors are the strongest predictors of a stroke?


***Data***

Each row (observation) represents a patient, and each column represents a health-related characteristic hypothesized to be a stroke predictor. The dataset includes both categorical and numerical variables.

The variables represented in each column are as follows: 
- **id** (patient id)
- **gender** (male/female)
- **age** 
- **hypertension** (1/0) → Interpret as, "Have they had hypertension? 1 is Yes, 0 is No."
- **heart_disease** (1/0) → Interpret as, "Have they had heart disease? 1 is Yes, 0 is No."
- **ever_married** (Yes/No) → Interpret as, "Have they been married?"
- **work_type** (e.g., Private/Self-employed)
- **Residence_type** (Urban/Rural)
- **avg_glucose_level**
- **bmi** → BMI stands for "Body Mass Index" and is the ratio of a person's body mass to the square of their height.
- **smoking_status** (e.g., formerly smoked/never smoked)
- **stroke** (1/0) → Interpret as, "Have they had a stroke? 1 is Yes, 0 is No."
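
The BMI definition in the list above can be written out explicitly. With mass $m$ in kilograms and height $h$ in metres (the conventional units, assumed here because the dataset itself does not state them), the formula and a worked example with illustrative values are:

$$
\mathrm{BMI} = \frac{m}{h^{2}}, \qquad \text{e.g. } \frac{70\ \text{kg}}{(1.75\ \text{m})^{2}} = \frac{70}{3.0625} \approx 22.9\ \text{kg/m}^{2}
$$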

## Methods and results


***Importing Packages***

In [1]:
library(tidyverse)
library(repr)
library(infer)
library(cowplot)
library(GGally)
library(broom)
library(RColorBrewer)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2



***Loading Data***

***Wrangling***


In [18]:
# Factor codes from as.numeric() start at 1, so subtract 1 to get
# 0-based codes, consistent with the existing 0/1 columns

stroke_numeric <- stroke |>
    select(-id) |>
    mutate(stroke = as_factor(stroke)) |>
    mutate(ever_married = as.numeric(as_factor(ever_married)) - 1) |>
    mutate(Residence_type = as.numeric(as_factor(Residence_type)) - 1) |>
    mutate(work_type = as.numeric(as_factor(work_type)) - 1) |>
    mutate(gender = as.numeric(as_factor(gender)) - 1) |>
    mutate(smoking_status = as.numeric(as_factor(smoking_status)) - 1) |>
    filter(bmi != "N/A") |>
    mutate(bmi = as.numeric(bmi))

stroke_numeric |>
    head(10)

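| gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 0 | 67 | 0 | 1 | 0 | 0 | 0 | 228.69 | 36.6 | 0 | 1 |
| 0 | 80 | 0 | 1 | 0 | 0 | 1 | 105.92 | 32.5 | 1 | 1 |
| 1 | 49 | 0 | 0 | 0 | 0 | 0 | 171.23 | 34.4 | 2 | 1 |
| 1 | 79 | 1 | 0 | 0 | 1 | 1 | 174.12 | 24.0 | 1 | 1 |
| 0 | 81 | 0 | 0 | 0 | 0 | 0 | 186.21 | 29.0 | 0 | 1 |
| 0 | 74 | 1 | 1 | 0 | 0 | 1 | 70.09 | 27.4 | 1 | 1 |
| 1 | 69 | 0 | 0 | 1 | 0 | 0 | 94.39 | 22.8 | 1 | 1 |
| 1 | 78 | 0 | 0 | 0 | 0 | 0 | 58.57 | 24.2 | 3 | 1 |
| 1 | 81 | 1 | 0 | 0 | 0 | 1 | 80.43 | 29.7 | 1 | 1 |
| 1 | 61 | 0 | 1 | 0 | 2 | 1 | 120.46 | 36.8 | 2 | 1 |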
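
The comment in the wrangling cell depends on how `as_factor()` assigns codes. For a character vector, it creates levels in order of first appearance, so code 0 is whichever value occurs first in the column, not necessarily "No" or the alphabetically first label. A minimal sketch on a made-up vector (the values and the names `ever_married_toy` and `code_lookup` are illustrative, not taken from the dataset or the original analysis):

```r
library(forcats)

# Toy stand-in for a Yes/No column; not real data from the stroke dataset.
ever_married_toy <- c("Yes", "No", "Yes", "No")

# as_factor() orders levels by first appearance: "Yes" first, then "No".
married_factor <- as_factor(ever_married_toy)
married_levels <- levels(married_factor)

# Factor codes are 1-based, so subtracting 1 gives 0-based codes.
married_codes <- as.numeric(married_factor) - 1

# Lookup table from each 0-based code back to its original label.
code_lookup <- setNames(married_levels, seq_along(married_levels) - 1)
print(code_lookup)
```

Keeping a lookup like this for each encoded column may make the numeric table easier to interpret later in the analysis.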


In [39]:
temp_stroke <- stroke_numeric |>
    select(stroke)

stroke_scaled <- stroke_numeric |>
    select(-stroke) |>
    mutate(across(everything(), scale)) |>
    bind_cols(temp_stroke)

stroke_scaled |>
    head(10)

| gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl[,1]>` | `<dbl[,1]>` | `<dbl[,1]>` | `<dbl[,1]>` | `<dbl[,1]>` | `<dbl[,1]>` | `<dbl[,1]>` | `<dbl[,1]>` | `<dbl[,1]>` | `<dbl[,1]>` | `<fct>` |
| -1.1998205 | 1.070029 | -0.3180343 | 4.3815219 | -0.72941 | -0.7495475 | -0.9855395 | 2.77741545 | 0.98124492 | -1.4528566 | 1 |
| -1.1998205 | 1.6463949 | -0.3180343 | 4.3815219 | -0.72941 | -0.7495475 | 1.014466 | 0.01384039 | 0.45922236 | -0.5355482 | 1 |
| 0.8318866 | 0.2719838 | -0.3180343 | -0.2281847 | -0.72941 | -0.7495475 | -0.9855395 | 1.48398039 | 0.70113526 | 0.3817603 | 1 |
| 0.8318866 | 1.6020591 | 3.1436741 | -0.2281847 | -0.72941 | 0.1400871 | 1.014466 | 1.54903481 | -0.62301952 | -0.5355482 | 1 |
| -1.1998205 | 1.6907307 | -0.3180343 | -0.2281847 | -0.72941 | -0.7495475 | -0.9855395 | 1.82118292 | 0.01359335 | -1.4528566 | 1 |
| -1.1998205 | 1.3803799 | 3.1436741 | 4.3815219 | -0.72941 | -0.7495475 | 1.014466 | -0.79269943 | -0.19012277 | -0.5355482 | 1 |
| 0.8318866 | 1.1587006 | -0.3180343 | -0.2281847 | 1.370692 | -0.7495475 | -0.9855395 | -0.24570201 | -0.77580661 | -0.5355482 | 1 |
| 0.8318866 | 1.5577232 | -0.3180343 | -0.2281847 | -0.72941 | -0.7495475 | -0.9855395 | -1.05201673 | -0.59755501 | 1.2990687 | 1 |
| 0.8318866 | 1.6907307 | 3.1436741 | -0.2281847 | -0.72941 | -0.7495475 | 1.014466 | -0.55994415 | 0.10271915 | -0.5355482 | 1 |
| 0.8318866 | 0.8040139 | -0.3180343 | 4.3815219 | -0.72941 | 1.0297216 | 1.014466 | 0.34113844 | 1.00670944 | 0.3817603 | 1 |
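
The `<dbl[,1]>` column type in the scaled output arises because `scale()` returns a one-column matrix rather than a plain vector. If plain numeric columns are preferred, one option (a sketch, not part of the original analysis) is to wrap the call in `as.numeric()`. The tibble `toy` below is made up for illustration:

```r
library(dplyr)

# Made-up values for illustration only; not rows from the stroke dataset.
toy <- tibble(
    age = c(20, 40, 60, 80),
    bmi = c(22.0, 27.5, 31.0, 24.5)
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1;
# as.numeric() drops the matrix dimensions so each column is a plain double vector.
toy_scaled <- toy |>
    mutate(across(everything(), ~ as.numeric(scale(.x))))

print(toy_scaled)
```

The scaled values are the same either way; only the column type changes, which can matter for functions that expect ordinary vectors.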


***Summary***



***Visualization***


In [37]:
options(repr.plot.height = 8, repr.plot.width = 15)

***Analysis***

## Discussion

***Findings***


***Significance***

## References


-citations for readings and data used to complete the project-