## **Diagnosis of Presence of Heart Disease Based on Age, Cholesterol, and Resting Blood Pressure**

**Names:** Srijan Sanghera, Sharon Feng, Annie Wang, Mairin Leitch

### **Introduction**
Coronary artery disease (CAD) poses a significant health challenge. It is marked by the narrowing of the coronary arteries and requires precise diagnostic tools for timely intervention and treatment (Detrano et al., 1989). Traditionally, invasive procedures like angiography have been used for diagnosis, but they carry inherent risks such as arterial injury, stroke, and radiation exposure (Mayo Clinic, 2021). Given the need for safer alternatives, there is increasing interest in non-invasive diagnostic methods that use patient demographics and basic clinical information to diagnose CAD. This data analysis addresses that need by evaluating how well the presence of heart disease can be diagnosed from three basic parameters: age, cholesterol level, and resting blood pressure, using the dataset from the Hungarian Institute of Cardiology.

**Research Question:** Can we predict whether a new patient has heart disease based on their age, cholesterol, and resting blood pressure?

The dataset used for this report comes from the Hungarian Institute of Cardiology in Budapest. The data consists solely of numerical values. The full database has 76 columns, but only 14 of them are included in the processed data file used here (Detrano et al., 1989). These 14 columns are titled ‘age’, ‘sex’, ‘cp’, ‘trestbps’, ‘chol’, ‘fbs’, ‘restecg’, ‘thalach’, ‘exang’, ‘oldpeak’, ‘slope’, ‘ca’, ‘thal’, and ‘num’ (Detrano et al., 1989).


### **Method and Exploratory Analysis**

**Importing libraries:**
- Used library() to load the tidyverse, tidymodels, and repr packages.
- Used install.packages() to install “kknn”; “yardstick” is loaded as part of tidymodels.


In [None]:
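#load packages:
library(tidyverse)
library(tidymodels)
library(repr)
#install kknn (the engine used by nearest_neighbor()) only if it is missing:
if (!requireNamespace("kknn", quietly = TRUE)) install.packages("kknn")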

**Wrangling Data:**
- Loaded the "processed.hungarian.data" dataset and renamed its columns.
- Cleaned and wrangled the data: selected the predictive variables, filtered out missing data, and corrected data types.
- Split the data into training (75%) and testing (25%) sets.
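
As an aside on the missing-data step, the sketch below shows one possible alternative to filtering on the "?" strings: passing `na = "?"` to the reader so that the affected columns are parsed as numeric right away. This is not the approach used in this report. The four rows in `toy_csv` are hypothetical values in the same 14-column layout, not rows copied from the dataset, and `toy_heart` is an illustrative name.

```r
library(tidyverse)

# Hypothetical four-row excerpt in the 14-column layout ("?" marks a missing value)
toy_csv <- I(paste(
  "40,1,2,140,289,0,0,172,0,0,?,?,?,0",
  "49,0,3,160,180,0,0,156,0,1,2,?,?,1",
  "37,1,2,?,283,0,1,98,0,0,?,?,?,0",
  "54,1,3,150,?,0,0,122,0,0,?,?,?,1",
  sep = "\n"
))

toy_heart <- read_csv(toy_csv, col_names = FALSE, na = "?", show_col_types = FALSE) |>
  # keep and rename only the columns used as predictors and class
  select(age = X1, trestbps = X4, chol = X5, num_predicted = X14) |>
  # "?" was read as NA, so rows with missing predictors can be dropped directly
  drop_na(chol, trestbps) |>
  mutate(num_predicted = fct_recode(as.factor(num_predicted),
                                    "present" = "1", "absent" = "0"))

toy_heart
```

With this variant, `chol` and `trestbps` are already numeric after reading, so no `as.numeric()` conversion is needed.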



In [None]:
#loading data in:
URL <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data"
heart <- read_delim(URL, delim = ",", col_names = FALSE) |>
        rename(age = X1, sex = X2, cp = X3, trestbps = X4, chol = X5, fbs = X6, restecg = X7, thalach = X8, exang = X9,
               oldpeak = X10, slope = X11, ca = X12, thal = X13, num_predicted = X14)

#Cleaning and wrangling data:
heart_data <- heart |>
    select(age, chol, trestbps, num_predicted) |>
    filter(chol != "?", trestbps != "?") |>
    mutate(num_predicted = as.factor(num_predicted),
           num_predicted = fct_recode(num_predicted, "present" = "1", "absent" = "0"),
           chol = as.numeric(chol),
           trestbps = as.numeric(trestbps))
print("Table 1. Heart Disease Predictors and Class of Interest")

head(heart_data)
#split data into training and testing: 
set.seed(1111) 
heart_split <- initial_split(heart_data, prop = 0.75, strata = num_predicted)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)


**Preliminary Exploratory Analysis (summary):**
- Use dplyr to summarize training data based on disease presence or absence.
- Compute counts and descriptive statistics for age, cholesterol, and blood pressure.
- Handle missing data: calculate and display the number of removed missing values.


In [None]:
#Summarize:
disease_vs_healthy <- heart_train |>
    group_by(num_predicted) |>
    summarize(count = n(), 
              min_age = min(age), max_age = max(age), mean_age = mean(age),
              min_chol = min(chol), max_chol = max(chol), mean_chol = mean(chol),
              min_trestbps = min(trestbps), max_trestbps = max(trestbps), mean_trestbps = mean(trestbps))
print("Table 2. Summary of the Training Data Based on Absence or Presence of Disease")
disease_vs_healthy

NA_summary <- heart |>
    group_by(num_predicted) |>
    summarize(total_age_NA_deleted = sum(age == "?"),
              total_chol_NA_deleted = sum(chol == "?"),
              total_trestbps_NA_deleted = sum(trestbps == "?"))
print("Table 3. Total Number of NA's Removed From Each Predictor Column (0 = Absence of Disease, 1 = Presence of Disease)")
NA_summary

**Visualization of Training Data:**
- Create scatter plots with color-coded points based on heart disease presence.
- Age vs. Serum Cholesterol (Figure 1)
- Age vs. Resting Blood Pressure (Figure 2)
- Serum Cholesterol vs. Resting Blood Pressure (Figure 3)


In [None]:
#visualising training data
options(repr.plot.width = 9, repr.plot.height = 5) 


# Visualizing Age vs. Serum Cholesterol
age_vs_cholesterol <- heart_train |>
    ggplot(aes(x = age, y = chol, color = num_predicted)) +
    geom_point(alpha = 0.5) +
    labs(x = "Age (years)", y = "Serum cholesterol (mg/dl)", color = "Presence/absence of heart disease") +
    ggtitle("Age vs. Serum Cholesterol") +
    theme(text = element_text(size=15))
age_vs_cholesterol
print("Figure 1. Scatter Plot of Predictors Age vs. Serum Cholesterol. Presence and absence of disease are color coded")

# Visualizing Age vs. Resting Blood Pressure
age_vs_trestbps <- heart_train |>
    ggplot(aes(x = age, y = trestbps, color = num_predicted)) +
    geom_point() +
    labs(x = "Age (years)", y = "Resting blood pressure (mm Hg)", color = "Presence/absence of heart disease") +
    ggtitle("Age vs. Resting Blood Pressure") +
    theme(text = element_text(size=15))
age_vs_trestbps
print("Figure 2. Scatter Plot of Predictors Age vs. Resting Blood Pressure. Presence and absence of disease are color coded")

# Visualizing Serum Cholesterol (mg/dl) vs. Resting Blood Pressure
chol_vs_trestbps <- heart_train |>
    ggplot(aes(x = chol, y = trestbps, color = num_predicted)) +
    geom_point(alpha = 0.5) +
    labs(x = "Serum cholesterol (mg/dl)", y = "Resting blood pressure (mm Hg)", color = "Presence/absence of heart disease") +
    ggtitle("Serum Cholesterol (mg/dl) vs. Resting Blood Pressure") +
    theme(text = element_text(size=15))
chol_vs_trestbps
print("Figure 3. Scatter Plot of Predictors Serum Cholesterol vs. Resting Blood Pressure.")
print("          Presence and absence of disease are color coded")


**K-Selection for k-NN Model:**
- Create a recipe for feature scaling and centering.
- Define a k-NN model specification with tuning parameters.
- Employ 10-fold cross-validation with stratification.
- Specify k values from 1 to 30.
- Develop a workflow integrating the recipe and model.
- Tune hyperparameters, filter for accuracy metric, and visualize k-selection.


In [None]:
#Data analysis (Model)

#recipe:
heart_recipe <- recipe(num_predicted ~ age + chol + trestbps, data = heart_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

#Selecting K:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
      set_engine("kknn") |>
      set_mode("classification")

set.seed(444)
heart_vfold <- vfold_cv(heart_train, v = 10, strata = num_predicted)
k_vals <- tibble(neighbors = seq(from = 1, to = 30))

workflow_for_k_selection <- workflow() |>
    add_recipe(heart_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = heart_vfold, grid = k_vals) |>
    collect_metrics() |>
    filter(.metric == "accuracy")
print("Table 4. Accuracy of The Classifier Across 10 folds Using K values from 1 to 30") 
workflow_for_k_selection

k_plot <- ggplot(workflow_for_k_selection, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors (K)", y = "Accuracy estimate (mean)") +
    ggtitle("Neighbors vs. Estimated Accuracy of Model") +
    theme(text = element_text(size=15))
print("Figure 4. Estimated Accuracy Plot For the Classifier Based on K values 1 to 30") 
print("          Highest accuracy (~65.2%) resulted from K = 17 and 18")
k_plot


**Heart Disease Classification Model:**
- Define a k-NN model specification with a fixed neighbor value (k = 18).
- Construct a workflow incorporating the recipe and the specified model.
- Fit the model to the training data.
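
The fixed value k = 18 is read off the accuracy table and plot from the K-selection step. As a minimal sketch of making that choice in code, the example below uses a small hypothetical table (`toy_metrics`, with made-up accuracies) shaped like the filtered `collect_metrics()` output and keeps every K that ties for the highest mean accuracy.

```r
library(dplyr)

# Hypothetical accuracy table; the column names match the filtered
# collect_metrics() output, but the values are invented for illustration
toy_metrics <- tibble(
  neighbors = 1:5,
  mean = c(0.58, 0.60, 0.63, 0.63, 0.61)
)

# Keep all rows that share the highest mean accuracy (ties are retained)
best_k <- toy_metrics |>
  slice_max(mean, n = 1, with_ties = TRUE)

best_k
```

When several K values tie, as K = 17 and 18 do in this report, the final choice among them is a judgment call.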


In [None]:
#Heart disease classification model: 
heart_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 18) |>
      set_engine("kknn") |>
      set_mode("classification")

heart_workflow <- workflow() |>
    add_recipe(heart_recipe) |>
    add_model(heart_spec) |>
    fit(data = heart_train)

**Model Testing:**
- Evaluate the trained model using the testing dataset.
- Generate predictions and combine them with the original testing dataset.
- Assess model accuracy and construct a confusion matrix.
- Summarize accuracy in Table 5 and display the confusion matrix in Table 6.
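
To make the link between the confusion matrix and the accuracy estimate concrete, here is a small sketch on six hypothetical predictions (`toy_results`; the labels are invented, not taken from the test set). It also computes sensitivity and specificity, which this report does not use, with "present" treated as the positive class. Because "absent" is the first factor level after recoding, `event_level = "second"` is needed for that.

```r
library(tidymodels)

# Hypothetical truth/prediction pairs with the same factor levels as the report
lvls <- c("absent", "present")
toy_results <- tibble(
  num_predicted = factor(c("present", "present", "absent", "absent", "absent", "present"), levels = lvls),
  .pred_class   = factor(c("present", "absent", "absent", "absent", "present", "present"), levels = lvls)
)

# Confusion matrix: rows are predictions, columns are the true classes
toy_mat <- toy_results |>
  conf_mat(truth = num_predicted, estimate = .pred_class)

# Accuracy = correct predictions / all predictions
toy_acc <- toy_results |>
  accuracy(truth = num_predicted, estimate = .pred_class)

# Sensitivity and specificity with "present" (the second level) as the event
toy_sens <- toy_results |>
  sensitivity(truth = num_predicted, estimate = .pred_class, event_level = "second")
toy_spec <- toy_results |>
  specificity(truth = num_predicted, estimate = .pred_class, event_level = "second")

toy_mat
bind_rows(toy_acc, toy_sens, toy_spec)
```

In a diagnostic setting, sensitivity (the share of diseased patients the model catches) can matter more than overall accuracy, so it may be worth reporting alongside Table 5.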


In [None]:
#Testing model with testing data:
heart_model_test <- predict(heart_workflow, heart_test) |>
    bind_cols(heart_test)

heart_model_accuracy <- heart_model_test |>
    metrics(truth = num_predicted, estimate = .pred_class) |>
        filter(.metric == "accuracy")
print("Table 5. Estimated Accuracy of The Classifier Based on The Testing Data Set")
heart_model_accuracy 

heart_model_mat <- heart_model_test |>
    conf_mat(truth = num_predicted, estimate = .pred_class)
print("Table 6. Confusion matrix Based on The Testing Data Set Where Truth = Actual Class of The Test Set")
print("         and Prediction = Predicted Class of The Test Set")
heart_model_mat 

#accuracy is ~65%

### **Methods**

We used the "processed.hungarian.data" file from the Heart Disease Database. Our predictive variables are age (years), chol (serum cholesterol in mg/dl), and trestbps (resting blood pressure in mm Hg), which we use to predict whether heart disease is present (1) or absent (0) in Hungarian patients. We imported the dataset into a Jupyter Notebook from the website and preprocessed it to select the relevant variables, handle missing data, and adjust data types. We then split the data into 75% training and 25% testing sets. Preliminary exploratory analysis included summary statistics and visualization using scatter plots: Age vs. Serum Cholesterol, Age vs. Resting Blood Pressure, and Serum Cholesterol vs. Resting Blood Pressure.

We then identify the best "k" value for the k-nearest neighbors classifier by maximizing cross-validated accuracy on the training data, using a recipe, 10-fold cross-validation, the k-nearest neighbors algorithm, and a workflow. Next, we use the fitted model to predict the class (1 or 0) of each observation in the testing data and analyze our findings in this report.

We plan to create scatter plots to assess potential overfitting or underfitting, color coding each plot's background according to the model's predictions. We will repeat this for each pair of variables (age vs. blood pressure, age vs. cholesterol, cholesterol vs. blood pressure). By comparing the plots, we can judge which variables are more useful in predicting the presence of heart disease, and identify the strengths and limitations of our model.
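
A minimal sketch of how such a background-colored plot could be built is shown below. It makes several assumptions: `toy_train` is synthetic data standing in for `heart_train`, the model is refit on that toy data, and the third predictor (`trestbps`) is held at its training mean while the other two vary over a grid. Holding one predictor fixed is only one way to draw a 2D slice of a three-predictor model. It is not necessarily how the final figures should be produced.

```r
library(tidyverse)
library(tidymodels)

# Synthetic stand-in for heart_train (values are randomly generated, not real data)
set.seed(1)
toy_train <- tibble(
  age = round(runif(120, 30, 65)),
  chol = round(rnorm(120, 250, 60)),
  trestbps = round(rnorm(120, 132, 17)),
  num_predicted = factor(sample(c("absent", "present"), 120, replace = TRUE))
)

# Same recipe and model structure as in the report, fit on the toy data
toy_recipe <- recipe(num_predicted ~ age + chol + trestbps, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

toy_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 18) |>
  set_engine("kknn") |>
  set_mode("classification")

toy_fit <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(toy_spec) |>
  fit(data = toy_train)

# 50 x 50 grid over age and chol; trestbps is held at its training mean (assumption)
grid <- expand_grid(
  age = seq(min(toy_train$age), max(toy_train$age), length.out = 50),
  chol = seq(min(toy_train$chol), max(toy_train$chol), length.out = 50),
  trestbps = mean(toy_train$trestbps)
)

# Predict a class for every grid point
grid_pred <- predict(toy_fit, grid) |>
  bind_cols(grid)

# Faint grid points form the colored background; training points are drawn on top
region_plot <- ggplot() +
  geom_point(data = grid_pred, aes(x = age, y = chol, color = .pred_class),
             alpha = 0.08, size = 4) +
  geom_point(data = toy_train, aes(x = age, y = chol, color = num_predicted),
             alpha = 0.7) +
  labs(x = "Age (years)", y = "Serum cholesterol (mg/dl)",
       color = "Presence/absence of heart disease") +
  theme(text = element_text(size = 15))

region_plot
```

Very jagged regions that hug individual points would suggest overfitting, while a background that is almost entirely one color would suggest underfitting.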


### **Expected outcomes and significance**
From this data analysis of heart disease in Hungary, conclusions can be drawn about how likely a person is to be diagnosed with heart disease based on their resting blood pressure, serum cholesterol level, and age. We expect to find a positive relationship between our predictor variables (blood pressure, cholesterol level, age) and the presence of heart disease, and a relatively accurate prediction using this classification model.

These findings can help doctors in Hungary diagnose heart disease without using invasive procedures. Doctors can also use the conclusions of this study to place more emphasis on the most relevant parameters during check-ups. Focusing on region-specific data (Hungary) can help doctors improve the health of the populations they treat.

Our analysis can prompt the exploration of new questions. For example, how can we use this data to diagnose specific heart conditions (e.g., atherosclerosis, angina, coronary heart disease) rather than just the general presence of a disease? Are other factors more accurate predictors of heart disease than the ones we examine? More specific research questions can be pursued as a result of this analysis.


##### **References**:
Detrano, Robert C., et al. “International application of a new probability algorithm for the diagnosis of coronary artery disease.” The American Journal of Cardiology 64.5 (1989): 304-310.