# Data Challenge Project Work CO2

## 1. Introduction

###### Ka Men Ho, Luana Aido da Silva, Michèle Pfister

The project explores the performance of machine learning (ML) algorithms for the prediction and diagnosis of heart disease. Heart disease remains one of the leading causes of mortality worldwide. In this data challenge project, we aim to understand how early detection strategies can be improved using these tools, which have the potential to enhance patient outcomes and reduce disease-related mortality (Alshenawy, 2024).

The underlying biological mechanisms of heart attacks and strokes involve obstruction of blood flow to the heart or brain due to arterial plaque accumulation or thrombus formation. A major clinical challenge is that symptoms of heart disease are often nonspecific: they overlap with those of other conditions or are misattributed to normal aging, which complicates both prevention and accurate diagnosis (Quah et al., 2014).

Machine learning has become an increasingly important tool in healthcare, enhancing clinical decision-making in disease prediction and diagnosis. Traditional approaches relied largely on practitioners’ interpretation of a patient’s medical history, reported symptoms, and physical examination findings (Karthick et al., 2022).

The dataset used in this project was obtained from the University of California Irvine (UCI) Machine Learning Repository and is widely employed for heart disease prediction tasks. Patient outcomes were determined using cardiac catheterization, considered the clinical gold standard, where individuals exhibiting more than 50% narrowing of a coronary artery were classified as having heart disease.
The dataset comprises 270 patient records and includes 13 independent predictive variables. Detailed descriptions of these attributes are provided in the UCI repository documentation (University of California, Irvine, n.d.).

An updated version of the heart disease dataset includes 303 consecutive patients referred for coronary angiography at the Cleveland Clinic in Cleveland, Ohio, between May 1981 and September 1984. This cohort was used to develop the Cleveland algorithm, a computerized diagnostic model whose regression coefficients were later validated using independent patient populations from Budapest, Long Beach, and Switzerland.
The Cleveland cohort had a mean age of 54 years, consisted of 68% men, and showed a disease prevalence of 46%. The model was derived from 13 clinical and test-related variables, with age, sex, chest pain type, and systolic blood pressure identified as key predictors. Chest pain was categorized as typical anginal, atypical anginal, nonanginal, or asymptomatic, and inclusion of age, sex, and chest pain type was required for clinically relevant disease probability estimation.

Because complete joint distributions of clinical variables were rarely available, the original model assumed independence among predictors. However, previous research has shown that ignoring interdependencies between symptoms can result in overconfident predictions and inaccurate disease probability estimates (Detrano et al., 1989). 
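The effect of the independence assumption can be illustrated with a small worked example. The sketch below uses the odds form of Bayes' rule and the 46% prevalence reported for the Cleveland cohort as the prior. The likelihood ratio of 3 and the helper `posterior_from_lrs` are hypothetical and chosen only for illustration; they are not taken from Detrano et al. (1989).

```python
def posterior_from_lrs(prior, likelihood_ratios):
    """Bayes update in odds form: posterior odds = prior odds * product of LRs."""
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)


prior = 0.46        # disease prevalence reported for the Cleveland cohort
lr_positive = 3.0   # hypothetical likelihood ratio of one positive finding

# Two findings that are perfectly correlated carry the evidence of only one.
p_correct = posterior_from_lrs(prior, [lr_positive])

# Treating them as independent counts the same evidence twice.
p_naive = posterior_from_lrs(prior, [lr_positive, lr_positive])

print(f"one finding: {p_correct:.3f}, double-counted: {p_naive:.3f}")
```

Under these assumptions, the double-counted estimate is noticeably higher than the estimate that uses the evidence once, which is the kind of overconfidence described above.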

To mitigate the overconfidence that can arise from assuming independence among clinical variables, the study by Kathleen et al. (2016) employs an ensemble learning approach using the Adaptive Boosting (AdaBoost) algorithm. AdaBoost is a meta-learning method that combines multiple weak classifiers into a single, more robust predictive model. Over 100 boosting rounds, the algorithm adaptively increases the weight of observations that were misclassified in previous rounds, encouraging subsequent classifiers to focus on complex or interacting symptom patterns that are difficult to capture with a single model. The final prediction is produced by a weighted majority vote of all component classifiers, resulting in a classifier that is less prone to overconfident assumptions and better aligned with the true diagnostic outcome.
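A minimal sketch of this boosting setup with scikit-learn is shown here. It is not the implementation used by Kathleen et al. (2016): the data are synthetic stand-ins generated with `make_classification`, and the weak learner is scikit-learn's default (a depth-1 decision tree), which is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the heart disease dataset): 270 rows, 13 features.
X, y = make_classification(
    n_samples=270, n_features=13, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=81, stratify=y, random_state=0
)

# Up to 100 boosting rounds; the default weak learner is a decision stump.
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_train, y_train)

# Each fitted weak learner contributes to the final weighted vote.
n_rounds = len(ada.estimators_)
test_accuracy = ada.score(X_test, y_test)
print(f"weak learners fitted: {n_rounds}, test accuracy: {test_accuracy:.2f}")
```

The attribute `estimator_weights_` of the fitted model holds the weight each weak learner receives in the vote.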

Alshenawy (2024) evaluated several machine learning algorithms, both individually and in ensemble settings, to identify reliable approaches for heart disease diagnosis. The models analyzed included Support Vector Machines, Random Forest, Decision Trees, Naïve Bayes, and Logistic Regression as a baseline. The data were divided into training (189 observations) and testing (81 observations) sets, and model performance was assessed using multiple metrics, including accuracy, sensitivity, specificity, and AUC.
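The four metrics named above can be computed as in the following sketch. The helper `evaluate_binary`, the 0.5 decision threshold, and the toy labels and scores are assumptions made for illustration; they are not taken from Alshenawy (2024).

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score


def evaluate_binary(y_true, y_score, threshold=0.5):
    """Return accuracy, sensitivity, specificity and AUC for binary labels."""
    y_pred = [int(score >= threshold) for score in y_score]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": tp / (tp + fn),   # share of diseased patients detected
        "specificity": tn / (tn + fp),   # share of healthy patients cleared
        "auc": roc_auc_score(y_true, y_score),
    }


# Toy example: 1 = disease present, 0 = disease absent (made-up values).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
metrics = evaluate_binary(y_true, y_score)
print(metrics)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarises the ranking quality of the scores across all thresholds.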

Among the individual models, Random Forest achieved the highest overall performance, while ensemble approaches using bagging produced comparable but not superior results. Building on this framework, this data challenge project applies a similar comparative evaluation of multiple machine learning models, with particular emphasis on ensemble methods, to assess their effectiveness in heart disease prediction.


### Table of contents

- 1. Introduction
- 2. Exploratory data analysis
  - 2.1 Importing libraries and loading the data
- 3. Preprocessing
  - 3.1 Cleaning the data
  - 3.2 Handling missing values
  - 3.3 Converting text labels to numbers (feature encoding)
- 4. Modelling
  - 4.1 Simple baseline model
  - 4.2 Two sophisticated model approaches
  - 4.3 Experiment and testing the model
- 5. Results
- 6. Discussion


## 2. Exploratory data analysis

Analyse your data. Visualise and explain the data features you deem to be relevant for
the project.

### 2.1 Importing libraries and loading the data

In [15]:
# Import the libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
import numpy as np

# Load the dataset from the CSV file
heart_data = pd.read_csv("Heart_Disease_Prediction.csv")

# Preview the first rows, then report the shape as (rows, columns)
heart_data.head()
heart_data.shape

(270, 15)
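As a possible starting point for the analysis requested in this section, the sketch below summarises the class balance and the per-class feature means. The helper `summarise_by_target` is hypothetical, the label strings `"Presence"` and `"Absence"` are assumed, and the small frame only mimics the column names; it does not contain real patient data.

```python
import pandas as pd


def summarise_by_target(df, target="Heart Disease"):
    """Return class proportions and per-class means of the numeric features."""
    class_share = df[target].value_counts(normalize=True)
    group_means = df.groupby(target).mean(numeric_only=True)
    return class_share, group_means


# Tiny made-up frame that only mimics the column names (not real data).
demo = pd.DataFrame({
    "Age": [63, 45, 58, 51],
    "Max HR": [120, 170, 130, 160],
    "Heart Disease": ["Presence", "Absence", "Presence", "Absence"],
})

class_share, group_means = summarise_by_target(demo)
print(class_share)
print(group_means)
```

Applied to `heart_data`, the same call would show whether the two classes are balanced and which features differ most between them, which can guide the choice of plots.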

## 3. Preprocessing

Explain what kind of preprocessing, feature encoding you are applying

### 3.1 Cleaning the data

In [7]:
# Drop the 'index' column
heart_data = heart_data.drop(columns=['index'])




In [9]:
heart_data.dtypes

Age                          int64
Sex                          int64
Chest pain type              int64
BP                           int64
Cholesterol                  int64
FBS over 120                 int64
EKG results                  int64
Max HR                       int64
Exercise angina              int64
ST depression              float64
Slope of ST                  int64
Number of vessels fluro      int64
Thallium                     int64
Heart Disease               object
dtype: object

In [11]:
heart_data.describe()

| Statistic | Age | Sex | Chest pain type | BP | Cholesterol | FBS over 120 | EKG results | Max HR | Exercise angina | ST depression | Slope of ST | Number of vessels fluro | Thallium |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 |
| mean | 54.433333 | 0.677778 | 3.174074 | 131.344444 | 249.659259 | 0.148148 | 1.022222 | 149.677778 | 0.32963 | 1.05 | 1.585185 | 0.67037 | 4.696296 |
| std | 9.109067 | 0.468195 | 0.95009 | 17.861608 | 51.686237 | 0.355906 | 0.997891 | 23.165717 | 0.470952 | 1.14521 | 0.61439 | 0.943896 | 1.940659 |
| min | 29.0 | 0.0 | 1.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 |
| 25% | 48.0 | 0.0 | 3.0 | 120.0 | 213.0 | 0.0 | 0.0 | 133.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 |
| 50% | 55.0 | 1.0 | 3.0 | 130.0 | 245.0 | 0.0 | 2.0 | 153.5 | 0.0 | 0.8 | 2.0 | 0.0 | 3.0 |
| 75% | 61.0 | 1.0 | 4.0 | 140.0 | 280.0 | 0.0 | 2.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 7.0 |
| max | 77.0 | 1.0 | 4.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 3.0 | 3.0 | 7.0 |


In [13]:
heart_data.isna().sum()

Age                        0
Sex                        0
Chest pain type            0
BP                         0
Cholesterol                0
FBS over 120               0
EKG results                0
Max HR                     0
Exercise angina            0
ST depression              0
Slope of ST                0
Number of vessels fluro    0
Thallium                   0
Heart Disease              0
dtype: int64

### 3.3 Converting text labels to numbers (feature encoding)

In [None]:
##Target encoding for absence and presence of heart disease

## Feature encoding for binary categorical features and continuous features

#### Encoding

Features that need to be encoded:

- Sex: binary category
- Chest pain type: unordered categories
- FBS over 120: binary
- EKG results: categories
- Exercise angina: binary
- Slope of ST: categories
- Thallium: categories

Features that should not be one-hot encoded:

- Age: continuous
- BP: continuous
- Cholesterol: continuous
- Max HR: continuous
- ST depression: continuous
- Number of vessels fluro: ordered count (0–3)
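The encoding plan above could be sketched as follows. The helper `encode` is hypothetical, and the label strings `"Absence"` and `"Presence"` are an assumption that should be checked against `heart_data["Heart Disease"].unique()`. The binary columns are already stored as 0/1 integers, so only the multi-category columns are one-hot encoded. The demo frame is made up and only mimics the column names.

```python
import pandas as pd

# Multi-category columns from the list above that get one-hot encoded.
NOMINAL_COLUMNS = ["Chest pain type", "EKG results", "Slope of ST", "Thallium"]


def encode(df):
    """Map the target to 0/1 and one-hot encode the multi-category features."""
    out = df.copy()
    # Assumed label strings; verify them on the real data first.
    out["Heart Disease"] = out["Heart Disease"].map({"Absence": 0, "Presence": 1})
    # Binary 0/1 columns and numeric columns are left unchanged.
    return pd.get_dummies(out, columns=NOMINAL_COLUMNS, dtype=int)


# Tiny made-up frame (not real patient data).
demo = pd.DataFrame({
    "Age": [63, 45, 58],
    "Sex": [1, 0, 1],
    "Chest pain type": [4, 3, 4],
    "EKG results": [2, 0, 2],
    "Slope of ST": [2, 1, 2],
    "Thallium": [7, 3, 7],
    "Heart Disease": ["Presence", "Absence", "Presence"],
})

encoded = encode(demo)
print(encoded.columns.tolist())
```

For linear models, passing `drop_first=True` to `pd.get_dummies` avoids perfectly collinear dummy columns; tree-based models do not need this.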




## 4. Modelling

### 4.1 Simple baseline model

Baseline model: logistic regression.

It is interpretable, is a standard method in medical research, and has coefficients that are easy to explain.
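A minimal sketch of such a baseline, and of how its coefficients can be read, is given here. The data are synthetic stand-ins from `make_classification`, the feature names are placeholders, and the standardisation step is an assumption added so that the coefficients are comparable across features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the heart disease dataset).
X, y = make_classification(
    n_samples=270, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

# Standardising first puts all coefficients on a comparable scale.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X, y)

# exp(coefficient) is the odds ratio per one standard deviation of the feature.
coefficients = baseline.named_steps["logisticregression"].coef_[0]
odds_ratios = np.exp(coefficients)
for name, coef, ratio in zip(feature_names, coefficients, odds_ratios):
    print(f"{name}: coefficient={coef:+.2f}, odds ratio={ratio:.2f}")
```

An odds ratio above 1 means that higher values of the feature raise the predicted odds of the positive class, and a value below 1 means they lower it.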

### 4.2 Two sophisticated model approaches

Model 1: random forest. It captures feature interactions and non-linear splits and is robust on small datasets.

Model 2: gradient boosting or an SVM with an RBF kernel.
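One way to compare these candidates is stratified cross-validation, sketched below. The data are synthetic stand-ins, and the hyperparameters, the 5-fold setup, and the AUC scoring are assumptions rather than settings chosen in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the heart disease dataset): 270 rows, 13 features.
X, y = make_classification(
    n_samples=270, n_features=13, n_informative=6, random_state=0
)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    # The RBF kernel is distance-based, so features are standardised first.
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
for name, auc in cv_auc.items():
    print(f"{name}: mean cross-validated AUC = {auc:.3f}")
```

With only 270 records, cross-validation gives a more stable comparison than a single train/test split, at the cost of fitting each model several times.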


#### Sources

Alshenawy, F. (2024). Using Machine Learning Algorithms to improve heart disease diagnoses. المجلة العلمية للدراسات والبحوث المالية والتجارية, 5(1), 417–442. https://doi.org/10.21608/cfdj.2024.324103

Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J.-J., Sandhu, S., Guppy, K. H., Lee, S., & Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. The American Journal of Cardiology, 64(5), 304–310. https://doi.org/10.1016/0002-9149(89)90524-9

Kathleen, H., H., J., & J., G. (2016). Diagnosing Coronary Heart Disease using Ensemble Machine Learning. International Journal of Advanced Computer Science and Applications, 7(10). https://doi.org/10.14569/IJACSA.2016.071004

Karthick, K., Aruna, S. K., & Manikandan, R. (2022). Development and evaluation of the bootstrap resampling technique based statistical prediction model for Covid-19 real time data: A data driven approach. Journal of Interdisciplinary Mathematics, 25(3), 615–627. https://doi.org/10.1080/09720502.2021.2012890

Quah, J. L. J., Yap, S., Cheah, S. O., Ng, Y. Y., Goh, E. S., Doctor, N., Leong, B. S.-H., Tiah, L., Chia, M. Y. C., & Ong, M. E. H. (2014). Knowledge of Signs and Symptoms of Heart Attack and Stroke among Singapore Residents. BioMed Research International, 2014, 1–8. https://doi.org/10.1155/2014/572425

University of California, Irvine. (n.d.). Heart disease data set. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Heart+Disease