# KNN Classification Analysis of Stroke Prediction 

In [6]:
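library(tidyverse)   # attaches dplyr, readr, and forcats, among others
library(repr)
library(tidymodels)
# Install themis only if it is missing, then load it (used for upsampling).
if (!requireNamespace("themis", quietly = TRUE)) install.packages("themis")
library(themis)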

set.seed(2023)
options(repr.matrix.max.rows = 8) 

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done



## Introduction

"insert introduction"

## Preliminary Data Analysis

Classification will be performed using the K-nearest neighbors (KNN) algorithm. The target variable is the "stroke" column. The predictor variables will be chosen based on summary statistics and visualizations.

First, the data set was uploaded to Google Drive and read into R from its download URL.


In [7]:
brainstroke_data <- read_csv("https://drive.google.com/uc?export=download&id=1yBiO_qBE9_YBvnEyPe2bazH5ZCOBb1d6")

brainstroke_data

Rows: 4981 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): gender, ever_married, work_type, Residence_type, smoking_status
dbl (6): age, hypertension, heart_disease, avg_glucose_level, bmi, stroke

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<dbl>
Male,67,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
Male,80,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
Female,49,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
Female,79,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
Male,40,0,0,Yes,Private,Urban,191.15,31.1,smokes,0
Female,45,1,0,Yes,Govt_job,Rural,95.02,31.8,smokes,0
Male,40,0,0,Yes,Private,Rural,83.94,30.0,smokes,0
Female,80,1,0,Yes,Private,Urban,83.75,29.1,never smoked,0


The original data set contains variables, such as marriage status (ever_married), that intuitively will not be useful in our analysis, so ever_married was removed. Categorical variables stored as "chr", along with the 0/1 indicator columns (hypertension, heart_disease, and stroke), were converted to "fct" so that they are treated as categories in later steps. The bmi variable was renamed to body_mass_index to make clear what the acronym stands for.


In [10]:
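brainstroke_data_v1 <- brainstroke_data |>
    # Drop marriage status by name rather than by column position.
    select(-ever_married) |>
    # Convert categorical and 0/1 indicator columns to factors.
    mutate(across(c(gender, hypertension:Residence_type, smoking_status:stroke),
                  as_factor)) |>
    # Spell out the acronym.
    rename(body_mass_index = bmi)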

brainstroke_data_v1

gender,age,hypertension,heart_disease,work_type,Residence_type,avg_glucose_level,body_mass_index,smoking_status,stroke
<fct>,<dbl>,<fct>,<fct>,<fct>,<fct>,<dbl>,<dbl>,<fct>,<fct>
Male,67,0,1,Private,Urban,228.69,36.6,formerly smoked,1
Male,80,0,1,Private,Rural,105.92,32.5,never smoked,1
Female,49,0,0,Private,Urban,171.23,34.4,smokes,1
Female,79,1,0,Self-employed,Rural,174.12,24.0,never smoked,1
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
Male,40,0,0,Private,Urban,191.15,31.1,smokes,0
Female,45,1,0,Govt_job,Rural,95.02,31.8,smokes,0
Male,40,0,0,Private,Rural,83.94,30.0,smokes,0
Female,80,1,0,Private,Urban,83.75,29.1,never smoked,0


Next, we checked for missing data so that any incomplete observations could be excluded from the analysis. The data set contains no missing values in any column.

In [11]:
not_available <- brainstroke_data_v1 |>
    summarize(across(everything(), ~ sum(is.na(.))))
not_available

gender,age,hypertension,heart_disease,work_type,Residence_type,avg_glucose_level,body_mass_index,smoking_status,stroke
<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
0,0,0,0,0,0,0,0,0,0


The first summary calculates the proportion of each class label in the stroke variable to check whether the classes are balanced. As shown in the table, 4733 observations did not have a stroke versus 248 that did. The ratio of stroke to non-stroke observations is about 1:19, which is extremely imbalanced. This is very likely to cause problems in KNN classification, because the nearest neighbors of almost any point will be dominated by the majority class.

In [15]:
total_obs <- nrow(brainstroke_data_v1)

stroke_proportions <- brainstroke_data_v1 |>
    group_by(stroke) |>
    summarize(stroke_count = n(),
              percentage = round((stroke_count / total_obs) * 100, 2))

stroke_proportions

stroke,stroke_count,percentage
<fct>,<int>,<dbl>
0,4733,95.02
1,248,4.98
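
To make the imbalance concrete, the following minimal sketch computes the accuracy of a classifier that always predicts the majority class. The counts are copied from the table above; the names `stroke_counts` and `majority_baseline` are hypothetical and not part of the original analysis.

```r
library(tibble)
library(dplyr)

# Class counts copied from the stroke_proportions table above.
stroke_counts <- tibble(
    stroke = factor(c("0", "1")),
    stroke_count = c(4733, 248)
)

# Accuracy of always predicting the most common class
# (a hypothetical baseline, used here only for illustration).
majority_baseline <- stroke_counts |>
    summarize(accuracy = max(stroke_count) / sum(stroke_count)) |>
    pull(accuracy)

round(majority_baseline * 100, 2)
```

A model that never predicts a stroke would therefore already be correct on about 95% of observations, so accuracy alone is likely to be a misleading metric for this data set.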


To address this problem, we oversampled the rare class, which is the group that had a stroke, so that the classes are more balanced. However, oversampling works by duplicating existing stroke observations, so the model may overfit to them. Cross-validation estimates may also be optimistic when copies of the same observation land in both the training and validation folds. This limitation should be noted, and future research would benefit from a more balanced data set.

In [16]:
#input code for upsample
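
The upsampling cell above is still a placeholder. One possible way to fill it, shown here only as a sketch, is `step_upsample()` from the themis package loaded at the top of the notebook. The toy data frame `toy_data` and the choice of `over_ratio = 1` are assumptions made for illustration; in the notebook, `brainstroke_data_v1` would take the place of the toy data.

```r
library(tidymodels)
library(themis)

set.seed(2023)

# Hypothetical toy data mimicking the roughly 95:5 class imbalance;
# in the notebook, brainstroke_data_v1 would be used instead.
toy_data <- tibble(
    age = runif(100, min = 20, max = 80),
    avg_glucose_level = runif(100, min = 60, max = 250),
    stroke = factor(c(rep("0", 95), rep("1", 5)))
)

# step_upsample() resamples the minority class with replacement;
# over_ratio = 1 (an assumed setting) brings it up to the majority class size.
upsample_recipe <- recipe(stroke ~ ., data = toy_data) |>
    step_upsample(stroke, over_ratio = 1) |>
    prep()

# bake(new_data = NULL) returns the processed (upsampled) training data.
upsampled_data <- bake(upsample_recipe, new_data = NULL)

upsampled_counts <- upsampled_data |>
    count(stroke)
upsampled_counts
```

With this setting both classes end up with the same number of rows. To limit the optimism in cross-validation described above, it may be preferable to apply the upsampling step only to the training portion of the data, so that duplicated stroke observations do not appear in the data used for evaluation.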