# Classification Analysis of Patient Parameters on Previous Stroke Predictions Based on KNN Algorithm

In [None]:
library(tidyverse)
library(repr)
library(dplyr)
library(tidymodels)
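# Install helper packages only when missing; themis provides step_upsample()
if (!requireNamespace("generics", quietly = TRUE)) install.packages("generics")
if (!requireNamespace("themis", quietly = TRUE)) install.packages("themis")
library(themis)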
library(ggplot2)

set.seed(1768)
options(repr.matrix.max.rows = 6)
options(repr.plot.width = 8, repr.plot.height = 10)

## Introduction

According to the World Health Organization (WHO), strokes are the second leading cause of death globally, responsible for approximately 11% of total deaths. A stroke occurs when a blood vessel carrying oxygen and nutrients to the brain is either blocked by a clot or ruptures. This results in a lack of blood and oxygen flow to the brain and causes brain cells to die.

Given a collection of patient parameters and measurements, can we predict whether a patient has previously had a stroke? The factors of interest here are the reported modifiable risk factors, as opposed to sex, age, and ethnicity, which are classified as unmodifiable and are largely inherent to the patient. Modifiable risk factors are usually environmental influences that might raise the risk of stroke in certain people. These modifiable risk factors will make up our predictors of interest.

By building our predictive model as a classifier, we can categorize patients using historical data on stroke instances and their associated factors. The purpose of this analysis is to build a classifier that can be easily retrained with new data and that provides a broad starting point for future prediction models. The model aims to determine the best subset of attribute parameters for categorizing previous stroke instances (stroke or no stroke). Using the K-nearest neighbors (KNN) algorithm, it will predict previous stroke instances from the given attribute parameters. Furthermore, the accuracy of the classifier will be assessed using a confusion matrix.

Classifying previous stroke instances from patient parameters serves to identify and confirm which measurements are valuable when documenting strokes. The parameters investigated here can support further research into stronger relationships that might help predict future stroke instances or improve medical attention.

The dataset obtained is used to predict whether a patient is likely to have had a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient. In total, there are 5110 observations with respect to 12 attribute parameters.

Dataset Attribute Information
- 1) id: unique identifier
- 2) gender: "Male", "Female" or "Other"
- 3) age: age of the patient
- 4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
- 5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
- 6) ever_married: "No" or "Yes"
- 7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
- 8) Residence_type: "Rural" or "Urban"
- 9) avg_glucose_level: average glucose level in blood
- 10) bmi: body mass index
- 11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
- 12) stroke: 1 if the patient had a stroke or 0 if not
*Note: "Unknown" in smoking_status means that the information is unavailable for this patient

## Methods and Results

We will conduct a classification analysis of stroke using our predictors of interest. The choice of predictors is based on exploratory data analysis and supported by research articles. Furthermore, we will use forward selection (Eforymson 1966; Draper and Smith 1966) to gain insight into the best subset of predictors.
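
To give a sense of the cost, forward selection with m candidate predictors evaluates m + (m - 1) + ... + 1 candidate models, compared with 2^m - 1 for an exhaustive search over every non-empty subset. The following is a minimal sketch of that arithmetic for the five candidate predictors used in this analysis; the variable names are illustrative only.

```r
# Number of candidate predictors considered in this analysis
m <- 5

# Forward selection: m models in round 1, m - 1 in round 2, ..., 1 in round m
n_forward <- sum(m:1)

# Exhaustive search: every non-empty subset of the m predictors
n_exhaustive <- 2^m - 1

c(forward = n_forward, exhaustive = n_exhaustive)
```

Each candidate model is itself tuned with cross-validation, so the saving grows quickly as predictors are added.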

Accuracy will be evaluated by splitting the data into training and testing sets and then measuring the classifier's performance on the testing set. Random seeds are set so that the analysis is reproducible. The workflow consists of cross-validation, selection of the K-nearest neighbors parameter K, retraining, and model evaluation. Scatter plots will be used to visualize the data and the accuracy estimates, and a confusion matrix will be used to assess the classifier's predictions.
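
As a reminder of what the K-nearest neighbors classifier does, the sketch below classifies a new observation by majority vote among its K closest training points. It is a toy illustration in base R, not the `kknn` engine used in this analysis; the data and the helper `knn_predict` are hypothetical.

```r
# Toy training data: two standardized predictors and a known class label
train_x <- matrix(c(-1.0, -0.8,
                    -0.9, -1.1,
                    -1.2, -0.7,
                     1.0,  0.9,
                     1.1,  1.2),
                  ncol = 2, byrow = TRUE)
train_y <- factor(c("0", "0", "0", "1", "1"))

# Classify one new observation by majority vote among its k nearest neighbours
knn_predict <- function(new_x, train_x, train_y, k) {
    # Euclidean distance from the new point to every training point
    dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
    # Labels of the k closest training points
    nearest <- train_y[order(dists)[1:k]]
    # Most frequent label among those neighbours
    names(which.max(table(nearest)))
}

knn_predict(c(-1.0, -1.0), train_x, train_y, k = 3)
```

Because the vote depends on distances, predictors on different scales need to be standardized first, which is why the recipes in this analysis scale and center the predictors.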

Furthermore, the class imbalance in the stroke dataset may cause poor performance in K-nearest neighbors classification. The dataset contains far more cases of no stroke than cases of stroke, and because KNN predicts by majority vote among neighbors, the majority class tends to dominate. Although correcting for this imbalance properly requires knowledge beyond the scope of this analysis, we will still attempt to correct it by upsampling. Upsampling increases the number of cases in the minority class to match the most prevalent class (no stroke), which can lead to overfitting of the model. For the same reason, cross-validation will not work as desired with upsampling, which should be noted for future experimental research and analysis.
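
To make the idea of upsampling concrete, the sketch below balances a toy data set in base R by resampling minority rows with replacement. It only illustrates the concept; the analysis itself uses `themis::step_upsample()`, and the toy data frame and its values are made up.

```r
set.seed(1768)

# Toy imbalanced data: 95 "no stroke" rows (0) and 5 "stroke" rows (1)
toy <- data.frame(bmi = rnorm(100, mean = 28, sd = 5),
                  stroke = factor(c(rep("0", 95), rep("1", 5))))

# Resample the minority class with replacement until it matches the majority
n_major <- max(table(toy$stroke))
minority_rows <- which(toy$stroke == "1")
extra_rows <- sample(minority_rows,
                     size = n_major - length(minority_rows),
                     replace = TRUE)
toy_upsampled <- toy[c(seq_len(nrow(toy)), extra_rows), ]

# Both classes now have the same number of rows
table(toy_upsampled$stroke)

# The balanced set still contains only 5 distinct stroke observations
length(unique(toy_upsampled$bmi[toy_upsampled$stroke == "1"]))
```

The balanced set contains no new information about stroke patients, only repeated copies, which is the source of the overfitting concern.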

Hypertension is the most important modifiable risk factor for stroke, with a strong and continuous relationship between blood pressure and stroke risk (Hägg-Holmberg et al., 2019). Even among those who are not necessarily hypertensive, the higher the blood pressure, the higher the risk of stroke. Blood pressure, regardless of hypertension, rises with increasing age, thereby increasing the lifetime risk of developing this condition. Hence, hypertension (modifiable) is heavily correlated with age (unmodifiable).

Body weight and obesity are risk factors for stroke. Obesity is related to other stroke risk factors such as hypertension and diabetes. Recent data found that 76% of the effect of BMI on stroke risk was mediated by blood pressure, cholesterol, and glucose levels, with blood pressure alone accounting for 65% of the weight-related risk (Boehme et al., 2017).

Stroke risk was nearly doubled in patients with impaired glucose tolerance (140.4 to 198 mg/dL) compared with those with normal glucose levels (80 to 126 mg/dL), and nearly tripled in diabetic patients (glucose ≥ 198 mg/dL) (Boehme et al., 2017). Patients with low glucose levels (< 80 mg/dL) had a 50% increased stroke risk compared with those with normal glucose levels.

Smoking remains a major risk factor for stroke, nearly doubling the risk and contributing to 15% of all stroke deaths per year (Boehme et al., 2017). Smoking cessation rapidly reduces the risk of stroke, with added risk nearly disappearing 2 to 4 years after smoking cessation.

Based on these supplementary research articles, we will use the modifiable risk factors as predictors for determining whether a patient previously had a stroke.


In [None]:
# Load stroke dataset
raw_data_url <- "https://raw.githubusercontent.com/jordanjzhao/dsci-project-proposal/main/data/healthcare-dataset-stroke-data.csv"

# Forward Selection

# Candidate predictors -> bmi, average glucose level, hypertension, age, heart disease
stroke_data_fs <- read.csv(raw_data_url, na.strings = "N/A") %>%
    mutate(stroke = as_factor(stroke))

stroke_data_clean_fs <- na.omit(stroke_data_fs)

# Select data -> Forward Selection
stroke_data_select_fs <- stroke_data_clean_fs %>%
    select(stroke, bmi, avg_glucose_level, hypertension, age, heart_disease)

names <- colnames(stroke_data_select_fs %>% select(-stroke))

# Create training and testing set -> Forward Selection
stroke_split_fs <- initial_split(stroke_data_select_fs, prop = 0.75, strata = stroke)
stroke_train_fs <- training(stroke_split_fs)
stroke_test_fs <- testing(stroke_split_fs)

# Forward Selection in R
# Formula
fs_formula <- paste("stroke", "~", paste(names, collapse="+"))

# Code
fs_accuracies <- tibble(size = integer(),
                        model_string = character(),
                       accuracy = numeric())

knn_spec_fs <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
    set_engine("kknn") %>%
    set_mode("classification")

fs_vfold <- vfold_cv(stroke_data_select_fs, v = 5, strata = stroke)

n_total <- length(names) 

selected <- c()

for (i in 1:n_total) {
    accs <- list()
    models <- list()
    for (j in 1:length(names)) {
        preds_new <- c(selected, names[[j]])
        model_string <- paste("stroke", "~", paste(preds_new, collapse="+"))
        
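        # Scale, center, and upsample, then bake the training set to obtain
        # the balanced data used as the template for the dummy recipe below
        fs_recipe <- recipe(as.formula(model_string), data = stroke_data_select_fs) %>%
                            step_scale(all_predictors()) %>%
                            step_center(all_predictors()) %>%
                            themis::step_upsample(stroke, over_ratio = 1, skip = FALSE) %>%
                            prep() %>%
                            bake(stroke_train_fs)

        # Dummy recipe workaround: it carries no preprocessing steps, so
        # tune_grid() below fits on the folds in fs_vfold as they are
        fs_dummy_recipe <- recipe(stroke ~., data = fs_recipe)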
        
        acc <- workflow() %>%
            add_recipe(fs_dummy_recipe) %>%
            add_model(knn_spec_fs) %>%
            tune_grid(resamples = fs_vfold, grid = 10) %>%
            collect_metrics() %>%
            filter(.metric == "accuracy") %>%
            summarize(mx = max(mean))
        acc <- acc$mx %>%
            unlist()
        
        accs[[j]] <- acc
        models[[j]] <- model_string
    }
    jstar <- which.max(unlist(accs))
    fs_accuracies <- fs_accuracies %>%
        add_row(size = i,
                model_string = models[[jstar]],
                accuracy = accs[[jstar]])
    selected <- c(selected, names[[jstar]])
    names <- names[-jstar]
}

Table 1. Stroke data set Preview

In [None]:
head(stroke_data_clean_fs)

Table 2. Forward Selection Procedure on Stroke Prediction Dataset

In [None]:
fs_accuracies

In [None]:
raw_data_url <- "https://raw.githubusercontent.com/jordanjzhao/dsci-project-proposal/main/data/healthcare-dataset-stroke-data.csv"
stroke_data <- read.csv(raw_data_url, na.strings = "N/A") %>%
    mutate(stroke = as_factor(stroke))

# Reload and wrangle data for Cross Validation
stroke_data_clean <- na.omit(stroke_data)

# Select data
stroke_data_select <- stroke_data_clean %>%
    select(stroke, bmi, avg_glucose_level)

# Create training and testing set
stroke_split <- initial_split(stroke_data_select, prop = 0.75, strata = stroke)
stroke_train <- training(stroke_split)
stroke_test <- testing(stroke_split)

# Cross Validation to choose best k set-up
stroke_recipe <- recipe(stroke ~ ., data = stroke_train) %>%
    step_scale(all_predictors()) %>%
    step_center(all_predictors()) %>%
    themis::step_upsample(stroke, over_ratio = 1, skip = FALSE) %>%
    prep() %>%
    bake(stroke_train)

# Plot of training set dataset sample (scaled/balanced)
sb_bmi_avg_gluc_plot <- stroke_recipe %>%
    ggplot(aes(x = bmi, y = avg_glucose_level, color = stroke)) +
    geom_point(alpha = 0.6) +
    ggtitle("Body Mass Index Vs. Average Glucose Levels on Previous Stroke") +
    labs(x = "Body Mass Index (standardized)", y = "Average Glucose Level (standardized)", color = "Previous Stroke Indication") +
    scale_color_manual(labels = c(0, 1),
                       values = c("steelblue2", "orange2")) +
    theme(text = element_text(size = 12))

# Balanced predictors confirmation
upsampled_stroke <- stroke_recipe %>%
    group_by(stroke) %>%
    summarize(n = n())

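# Dummy recipe workaround for cross-validation: built from the baked data and
# containing no steps, so no scaling or upsampling is repeated inside each fold
dummy_recipe <- recipe(stroke ~., data = stroke_recipe)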

# 5-fold cross-validation on training set
stroke_vfold <- vfold_cv(stroke_train, v = 5, strata = stroke)

# KNN classifier
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
    set_engine("kknn") %>%
    set_mode("classification")

# kvals to process
k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

# Create workflow analysis with recipe and model specs 
stroke_knn_results <- workflow() %>%
    add_recipe(dummy_recipe) %>%
    add_model(knn_tune) %>%
    tune_grid(resamples = stroke_vfold, grid = k_vals) %>%
    collect_metrics()

# filter metrics for accuracy estimate
stroke_knn_accuracies <- stroke_knn_results %>%
    filter(.metric == "accuracy")

# plot k vs accuracy estimate for k selection
cross_val_plot <- stroke_knn_accuracies %>%
    ggplot(aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate") +
    theme(text = element_text(size = 12))

# Rebuild model with best chosen K

# Using best chosen k = 11
best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 11) %>%
    set_engine("kknn") %>%
    set_mode("classification")

best_fit <- workflow() %>%
    add_recipe(dummy_recipe) %>%
    add_model(best_spec) %>%
    fit(data = stroke_train) 

# Use Final Model to Predict on Test Dataset
stroke_predictions <- predict(best_fit, stroke_test) %>%
    bind_cols(stroke_test)

stroke_metrics <- stroke_predictions %>%
    metrics(truth = stroke, estimate = .pred_class)

# Create Confusion Matrix to check 
stroke_conf_mat <- stroke_predictions %>%
    conf_mat(truth = stroke, estimate = .pred_class)

Table 3. Wrangled Data for Cross Validation

In [None]:
head(stroke_data_clean)

Table 4. Training and Testing Sets

In [None]:
glimpse(stroke_train)

In [None]:
glimpse(stroke_test)

Table 5. Cross Validation Recipe

In [None]:
stroke_recipe

In [None]:
dummy_recipe

In [None]:
sb_bmi_avg_gluc_plot

Figure 1. Scaled and balanced plot of Body Mass Index Vs. Average Glucose Levels sample set to predict previous strokes

In [None]:
cross_val_plot

Figure 2. Cross Validation K Vs Accuracy Estimate for best K (Choosing K = 11)

In [None]:
best_spec

In [None]:
best_fit

Table 6. Final Model Prediction on Stroke Testing Dataset Accuracy Score

In [None]:
stroke_metrics

Table 7. Confusion Matrix Classifier Predictions Vs Truth Label

In [None]:
stroke_conf_mat

In [None]:
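# Build the plotting data from the confusion matrix object instead of hard-coding counts
df <- as.data.frame(stroke_conf_mat$table) %>%
    mutate(TClass = factor(Truth, levels = c("0", "1"), labels = c("No Stroke", "Stroke")),
           PClass = factor(Prediction, levels = c("0", "1"), labels = c("No Stroke", "Stroke")),
           Y = Freq)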

library(ggplot2)
ggplot(data =  df, mapping = aes(x = TClass, y = PClass)) +
    ggtitle("Truth vs. Prediction Model of Confusion Matrix for KNN Classifier") +
    labs(x = "Truth", y = "Prediction") +
    geom_tile(aes(fill = Y), colour = "white") +
    geom_text(aes(label = sprintf("%1.0f", Y)), vjust = 1) +
    scale_fill_gradient(low = "blue", high = "red") +
    theme_bw() + theme(legend.position = "none") + 
    theme(text = element_text(size = 12))

Figure 3. Confusion Matrix for the KNN Classifier shown as a tile plot

## Discussion

From the methods and results above, K-nearest neighbors classification was not a strong predictor of patients' previous strokes, for several reasons.

Table 2 illustrates the selection of a good subset of predictors using forward selection. The highest accuracy estimate, 0.9578318, occurs with the predictors bmi and heart disease. The lowest accuracy estimate, 0.9570171, occurs with all five predictors: bmi, heart disease, hypertension, age, and average glucose level. These values are only estimates of the true accuracy. Because the dataset does not indicate the type of heart disease, average glucose level was used as a predictor in its place.

From Table 6, the K-nearest neighbors classifier tuned with the best K (K = 11) has an estimated accuracy of 0.9453953. This appears to be a very high accuracy for the classifier; however, because upsampling was used together with cross-validation, the classifier is likely to overfit, and its predictions are pulled toward the majority class.
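
One way to reduce this problem is to leave the recipe unprepped and let the workflow apply upsampling inside each resample, so that replicated rows never reach the held-out portion of a fold. The sketch below shows that arrangement on synthetic data. It is an illustration under assumptions, not the procedure used for the results reported here: the `toy_*` objects and their values are made up, and it relies on the default `skip = TRUE` of `step_upsample()`.

```r
library(tidymodels)
library(themis)

set.seed(1768)

# Synthetic stand-in for the training set (hypothetical values, not the stroke data)
toy_train <- tibble(bmi = rnorm(300, mean = 28, sd = 5),
                    avg_glucose_level = rnorm(300, mean = 105, sd = 40),
                    stroke = factor(c(rep("0", 270), rep("1", 30))))

# Keep the preprocessing steps unprepped; step_upsample() defaults to skip = TRUE,
# so it is applied when the recipe is fitted but not when new data are predicted
toy_recipe <- recipe(stroke ~ ., data = toy_train) %>%
    step_scale(all_predictors()) %>%
    step_center(all_predictors()) %>%
    step_upsample(stroke, over_ratio = 1)

toy_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
    set_engine("kknn") %>%
    set_mode("classification")

toy_folds <- vfold_cv(toy_train, v = 5, strata = stroke)

# tune_grid() re-estimates the recipe on the analysis portion of every fold
toy_results <- workflow() %>%
    add_recipe(toy_recipe) %>%
    add_model(toy_spec) %>%
    tune_grid(resamples = toy_folds, grid = tibble(neighbors = c(1, 11, 21))) %>%
    collect_metrics()

toy_results %>%
    filter(.metric == "accuracy")
```

With this arrangement the accuracy estimate for each K is computed on held-out rows that keep their original class proportions.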
         
From Table 7, the confusion matrix shows 1160 observations correctly predicted as no stroke and no correct predictions of stroke. The classifier did make mistakes, however: it predicted no stroke for 67 patients who truly had a stroke. When predicting strokes, false negatives are especially harmful, as they can result in patients not receiving the medical attention they require. Because no-stroke cases form the large majority, a high overall accuracy can be reached without identifying a single stroke case, so the classifier must be improved before it can usefully predict previous strokes.
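
The gap between overall accuracy and usefulness can be checked directly from these counts. The sketch below recomputes accuracy, sensitivity, and specificity from the two counts quoted above; the number of false positives is not stated in the text and is assumed to be zero here.

```r
# Counts quoted in the discussion of Table 7
true_neg  <- 1160  # no stroke, predicted no stroke
false_neg <- 67    # stroke, predicted no stroke
true_pos  <- 0     # stroke, predicted stroke
false_pos <- 0     # no stroke, predicted stroke (assumed to be zero)

total <- true_neg + false_neg + true_pos + false_pos

accuracy    <- (true_neg + true_pos) / total
sensitivity <- true_pos / (true_pos + false_neg)   # share of stroke cases detected
specificity <- true_neg / (true_neg + false_pos)   # share of no-stroke cases detected

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 4)
```

Under that assumption the accuracy agrees with the value in Table 6, while the sensitivity is zero: a classifier that always predicts no stroke would score the same.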

The analysis is broadly in line with our earlier expectations: we had been confident that there would be a positive relationship between the attribute parameters and previous stroke, and this was reflected in the forward selection results. However, as mentioned previously, there was little certainty in classifying previous strokes because of the heavy upsampling required.

These findings lend some support to the use of bmi, glucose levels, and hypertension as attribute parameters for predicting stroke. However, further research is needed to cross-check the accuracy estimates once the stroke dataset is properly balanced.

Once proper balancing is implemented, additional parameters could be used to strengthen the prediction of previous strokes in patients. Future research could then investigate whether future stroke instances can be predicted, so that patients could be alerted to seek medical attention.

## Works Cited

Boehme, Amelia K., et al. “Stroke Risk Factors, Genetics, and Prevention.” Circulation Research, 3 Feb. 2017, https://www.ahajournals.org/doi/full/10.1161/CIRCRESAHA.116.308398. 

Draper, Norman, and Harry Smith. Applied Regression Analysis. Wiley. 1966.

Eforymson, M. “Stepwise Regression—a Backward and Forward Look.” In Eastern Regional Meetings of the Institute of Mathematical Statistics. 1966.

Hägg-Holmberg, Stefanie, et al. “The Role of Blood Pressure in Risk of Ischemic and Hemorrhagic Stroke in Type 1 Diabetes - Cardiovascular Diabetology.” BioMed Central, BioMed Central, 9 July 2019, https://cardiab.biomedcentral.com/articles/10.1186/s12933-019-0891-4. 

Timbers, Tiffany-Anne, et al. Data Science: An Introduction. CRC Press, 2022. 

Dataset source: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset
Author credentials: fedesoriano