<a href="https://colab.research.google.com/github/matthewdheilmanvanderbilt/CS5262_MachineLearning/blob/main/stoke_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Background
This project will be used to fulfill the first set of assignments for the Vanderbilt CS5265-50 (Foundations of Machine Learning) class.  The project consists of delivering a machine learning application.  The building of the project will be done over the course of seven (7) weeks. 

The purpose of the project is to:
1.   Set up the environment for the machine learning assignment
1.   Learn how to set up and use Google Colab
1.   Apply literate coding techniques in solving an analytical problem
1.   Learn to use pandas to solve a data exploration problem

The author of this project is Matt Heilman.  The professor is Dana Zheng.  The professor has asked the author to select a dataset that meets the following guidelines:
*   Tabular/structured/2D (i.e., rows and columns)
*   $\geq$ 5000 samples/records/rows, but no more than 50k rows
*   $\geq$ 7 features/variables/columns, but no more than 25 columns
*   Independence between data points (rows in your data)
*   Identify a dataset that you can use for a binary classification task directly or after transformation
*   Diverse data types for your columns:
    *   Continuous variables (e.g., measurements such as weight, temperature, etc.)
    *   Count variables (e.g., age, number of people in a classroom, etc.)
    *   Categorical variables (e.g., gender, type of Operating Systems, degrees of severity, etc.)
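
The row and column limits in the guidelines above can be checked mechanically once a candidate dataset is loaded. The following is a minimal sketch, not part of the assignment: the function name `meets_size_guidelines` and the stand-in frame `demo` are hypothetical, and only the numeric limits come from the list above.

```python
import pandas as pd


def meets_size_guidelines(df: pd.DataFrame) -> bool:
    """Return True if the frame satisfies the row and column limits in the guidelines."""
    n_rows, n_cols = df.shape
    return 5000 <= n_rows <= 50_000 and 7 <= n_cols <= 25


# Hypothetical stand-in frame (all zeros) with 5,110 rows and 12 columns.
demo = pd.DataFrame(0, index=range(5110), columns=[f"col_{i}" for i in range(12)])
print(meets_size_guidelines(demo))
```

The remaining guidelines (independence between rows, diversity of data types) need human judgment and are not covered by this check.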

## Introduction
Stroke is a leading cause of death in the United States (1).  It is the second leading cause of death worldwide (2).  Strokes are a major cause of serious disability for adults (1).  Every year, more than 795,000 people in the US have a stroke (3).  Stroke-related costs in the US were nearly \\$56.6 billion between 2018 and 2019 (3).  The US has the highest average per-patient cost of stroke treatment, at \\$59,900 (5).

The CDC estimates that 80% of strokes are preventable (4).  High blood pressure is the single most important treatable risk factor (4).  Other preventive measures include lifestyle changes and medicine (6).

## Literature Review


1.   The CDC website was consulted for high level statistics on strokes.
2.   Publicly available journal articles from NIH were also used. 

All references/citations are listed in the References section (below).


## Research Gap
The author has not identified any gaps in the research on strokes, their causes, their prevention, or their costs.  A significant number of peer-reviewed journal articles are available.

## Challenges
The author of this project has no machine learning education or experience.  This project, and the author's knowledge of machine learning, will evolve over the next few weeks.  At this point, the author is too ignorant to know what research gaps exist and this section will be updated before the final deliverable.

# Project Description

## Project Topic (Problem Formulation)
This project will attempt to predict whether a person is at risk of a stroke.  Most strokes are preventable, and the monetary cost of preventing a stroke pales in comparison to the cost of treatment after a person suffers one.  If all strokes could be eliminated in the US, the potential savings would be about $56 billion per year.

## Methodology

The following methodology will be used:
1.   Problem Definition - Define the problem to be solved.
1.   Data Exploration - Exploratory Data Analysis and an investigation of what data is available and when it's available.
1.   Feature Engineering - Take raw data and make alterations (combining, finding interactions, recasting it, etc.)
1.   Modeling - Create the actual model.
1.   Assessment - Determine if the model is worth the investment/cost.
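
As an illustration of the Feature Engineering step above, the sketch below recasts text categories as 0/1 indicator columns and builds one interaction feature. It is a minimal example under stated assumptions: the column names match the stroke dataset, but the rows are made-up values and the `age_x_hypertension` feature is a hypothetical choice, not a decision the project has made.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the stroke dataset.
raw = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "age": [67.0, 61.0, 80.0],
    "hypertension": [0, 0, 1],
    "ever_married": ["Yes", "No", "Yes"],
})

# Recasting: turn each text category into its own 0/1 indicator column.
features = pd.get_dummies(raw, columns=["gender", "ever_married"], dtype=int)

# Interaction: hypothetical feature combining age and hypertension.
features["age_x_hypertension"] = features["age"] * features["hypertension"]

print(features.columns.tolist())
```

Indicator columns let models that expect numeric input use the categorical fields, at the cost of adding one column per category.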


The following data science process will be used:
1.   Collection
1.   Cleaning
1.   Exploratory Data Analysis
1.   Model Building
1.   Deployment

I have not seen the future assignments and I'm not sure if we will be working all the way through #5 Deployment. 

## Data Source

The dataset comes from the Stroke Prediction Dataset on Kaggle (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset).  It contains 5,110 rows and 12 columns, including the `id` column and the `stroke` target.  The source of the data is listed as "confidential".

While the source of the data cannot be confirmed and would not be good for a real-life model, it will be very suitable for this class.  The dataset has a diverse set of data including: continuous variables, count variables, categorical variables, and binary variables.  There is some missing data that will allow me to practice imputing values. 
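
One common way to impute a numeric column is to fill missing entries with the column median. The sketch below is an assumption about how this could be done, not the approach the project has settled on. It uses the first five `bmi` values from the dataset preview in the Load Data section; the frame name `toy` and the helper column `bmi_was_missing` are hypothetical.

```python
import numpy as np
import pandas as pd

# First five bmi values from the dataset preview; the second one is missing.
toy = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, 34.4, 24.0]})

# Record which rows were missing before filling, so that information is kept.
toy["bmi_was_missing"] = toy["bmi"].isna()

# Median imputation: replace each missing value with the median of the observed values.
median_bmi = toy["bmi"].median()
toy["bmi"] = toy["bmi"].fillna(median_bmi)

print(toy)
```

The median is less sensitive to extreme values than the mean, which may matter for a column like `bmi` that has a long upper tail.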

## Anticipated Outcome(s)
The author anticipates the model will reflect the research done by the CDC, specifically that high blood pressure will be the leading indicator of stroke risk.

The author anticipates he will be able to develop a model that shows which risk factors carry higher weightings for stroke risk.

# Performance Metrics
I attempted to read ahead (Week 4) to find performance metrics that I could use.  The lectures were a little too advanced for me (area under the curve, model scores, F1, RMSE, confusion matrix, regression metrics, etc.).

I found a page from neptune.ai that discussed metrics.  At this point it looks like I will be calculating True Positives, True Negatives, False Positives, and False Negatives.  I can use these counts to build a confusion matrix and compute precision, recall, and F1-score.
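
The four counts and the metrics derived from them can be sketched in plain Python. This is a minimal illustration using the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = harmonic mean of the two). The function names and the label lists `y_true` and `y_pred` are hypothetical and do not come from a trained model.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical labels: 1 = stroke, 0 = no stroke.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts, metrics)
```

Because only about 5% of rows in this dataset have `stroke = 1`, accuracy alone can look high for a model that never predicts a stroke, which is one reason precision and recall are worth tracking.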

There appears to be some way of calculating the value of the model.  Professor Spencer-Smith lectured that value is the best metric to use (rather than F1, AUC, or RMSE).  For this initial assignment, a positive ROI would be a good goal.
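
One hypothetical way to put a number on value is to compare the cost of intervening on everyone the model flags against the treatment cost avoided for people who were flagged correctly. The sketch below is an assumption, not the method from the lecture. The function `model_value`, the `intervention_cost` of 1,000 USD, the `prevention_rate` of 0.8, and the counts are all invented for illustration; only the 59,900 USD per-patient treatment cost comes from the Introduction.

```python
def model_value(tp, fp, avoided_cost, intervention_cost, prevention_rate):
    """Estimate net value and ROI of acting on the model's positive predictions."""
    # Everyone the model flags (true and false positives) receives the intervention.
    spend = (tp + fp) * intervention_cost
    # Only correctly flagged people can yield an avoided stroke.
    savings = tp * prevention_rate * avoided_cost
    net = savings - spend
    roi = net / spend if spend else 0.0
    return net, roi


# Hypothetical inputs: 30 true positives and 50 false positives.
net, roi = model_value(
    tp=30,
    fp=50,
    avoided_cost=59_900,       # per-patient treatment cost cited in the Introduction
    intervention_cost=1_000,   # invented figure
    prevention_rate=0.8,       # invented figure
)
print(f"net value: {net:,.0f} USD, ROI: {roi:.2f}")
```

Under this framing, false positives cost money without producing savings, so a model with low precision can have a negative ROI even when recall is high.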

The value can be looked at from various stakeholders:


*   Payers: Payers and insurance companies will want to significantly reduce the number of strokes.  If they can identify members (patients) with stroke risks, they can educate and incentivize those members to reduce their stroke risk.
*   People (patients/members): Strokes can be fatal or debilitating.  People find value in staying healthy.  
*   Providers: Providers make money from strokes.  They may be less likely to find value in preventing strokes as it will affect their revenue in a negative way.



# Load Libraries
Load the standard libraries for loading data and performing the exploratory data analysis

In [24]:
# tables and visualization
import pandas as pd
from IPython.display import display  # explicit import; display() is used when loading the data

# Load Data
Load the data using pandas and read it as a pandas dataframe

In [28]:
responses = pd.read_csv('https://raw.githubusercontent.com/matthewdheilmanvanderbilt/CS5262_MachineLearning/main/healthcare-dataset-stroke-data.csv?token=GHSAT0AAAAAACCTI2NUPBJFTL6W6JFFO5ASZDFRPGA')
display(responses.head())
responses.info()
responses.describe()

|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


|   | id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke |
|---|---|---|---|---|---|---|---|
| count | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 4909.0 | 5110.0 |
| mean | 36517.829354 | 43.226614 | 0.097456 | 0.054012 | 106.147677 | 28.893237 | 0.048728 |
| std | 21161.721625 | 22.612647 | 0.296607 | 0.226063 | 45.28356 | 7.854067 | 0.21532 |
| min | 67.0 | 0.08 | 0.0 | 0.0 | 55.12 | 10.3 | 0.0 |
| 25% | 17741.25 | 25.0 | 0.0 | 0.0 | 77.245 | 23.5 | 0.0 |
| 50% | 36932.0 | 45.0 | 0.0 | 0.0 | 91.885 | 28.1 | 0.0 |
| 75% | 54682.0 | 61.0 | 0.0 | 0.0 | 114.09 | 33.1 | 0.0 |
| max | 72940.0 | 82.0 | 1.0 | 1.0 | 271.74 | 97.6 | 1.0 |


# Data cleaning and EDA
Performing some simple exploratory data analysis.

In [13]:
responses.isna().sum()



id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

In [14]:
responses.hypertension.unique()

array([0, 1])

In [19]:
# Drop missing values first so the sort is not affected by NaN
bmi = responses.bmi.dropna().unique()
print(sorted(bmi))


# References


1.   https://www.cdc.gov/stroke/index.htm
1.   https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death
1.   https://www.cdc.gov/stroke/facts.htm
1.   https://www.cdc.gov/vitalsigns/stroke/
1.   Strilciuc S, Grad DA, Radu C, Chira D, Stan A, Ungureanu M, Gheorghe A, Muresanu FD. The economic burden of stroke: a systematic review of cost of illness studies. J Med Life. 2021 Sep-Oct;14(5):606-619. doi: 10.25122/jml-2021-0361. PMID: 35027963; PMCID: PMC8742896.
1.   https://www.cdc.gov/stroke/prevention.htm


