# Machine Learning Project (ML-2025): Heart Disease Risk Prediction

Data & AI (DIA 4)

Authors: Nassim LOUDIYI, Paul-Adrien LU-YEN-TUNG

Objective: Predict the presence of heart disease using classification models, with SHAP for interpretability.

## 0. Introduction

This project aims to build a supervised machine learning model capable of predicting the presence of heart disease based on structured clinical features (age, blood pressure, cholesterol, ECG results, chest pain type, etc.).

The goal is to follow the complete ML pipeline taught in the Data & AI major:
- data exploration and quality checks  
- preprocessing (encoding, scaling, pipelines)  
- baseline models  
- standard and advanced algorithms (RF, XGBoost, CatBoost, etc.)  
- hyperparameter tuning  
- ensemble learning  
- model evaluation and comparison  
- explainability using SHAP  

The final objective is to identify the most accurate and reliable model while ensuring interpretability, which is essential for medical decision-support applications.

## 0.1 Dataset Description

### 0.1.1 Target Variable

The target represents the presence of diagnosed heart disease:

| Value | Meaning |
|-------|---------|
| `0` | No heart disease |
| `1` | Presence of heart disease |

This formulation makes the problem a **binary supervised classification task**.
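
Before any modeling, it can help to confirm that the target really contains only `0` and `1` and to look at the class balance. The snippet below is a minimal sketch using only the standard library; the function name `summarize_binary_target` and the example labels are hypothetical and are not taken from the project's data.

```python
from collections import Counter


def summarize_binary_target(labels):
    """Return class counts and the positive-class rate for 0/1 labels.

    Raises ValueError if the labels are empty or contain any value
    other than 0 or 1.
    """
    counts = Counter(labels)
    unexpected = set(counts) - {0, 1}
    if unexpected:
        raise ValueError(f"Unexpected target values: {sorted(unexpected)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("No labels provided")
    return {
        "n_negative": counts[0],
        "n_positive": counts[1],
        "positive_rate": counts[1] / total,
    }


# Made-up labels for illustration only (not real patient data).
example_labels = [0, 1, 1, 0, 1]
summary = summarize_binary_target(example_labels)
print(summary)
```

A positive rate far from 0.5 would suggest looking beyond plain accuracy (for example recall or ROC AUC) when comparing models.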

---

### 0.1.2 Feature Dictionary

The dataset contains 12 clinical features commonly used in cardiology to assess heart disease risk, plus the target.
They include demographic information (age, sex), physiological measurements (blood pressure, cholesterol),
electrocardiogram results, and exercise-related symptoms. Clinicians routinely use these indicators
because they reflect key cardiovascular mechanisms, which makes them relevant predictors for machine learning models.

The table below lists the 12 input features and the target, with their medical meaning and data type.

| Feature | Description | Type |
|---------|-------------|------|
| **`age`** | Patient age in years | Continuous |
| **`gender`** | Sex (1 = Male, 0 = Female) | Binary categorical |
| **`chestpain`** | Chest pain type: 0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic | Multi-class categorical |
| **`restingBP`** | Resting blood pressure (mm Hg) | Continuous |
| **`serumcholestrol`** | Serum cholesterol concentration (mg/dl) | Continuous |
| **`fastingbloodsugar`** | Fasting blood sugar > 120 mg/dl (1 = True, 0 = False) | Binary |
| **`restingelectro`** | Resting electrocardiogram results: 0 = normal, 1 = ST-T abnormality, 2 = left ventricular hypertrophy | Multi-class categorical |
| **`maxheartrate`** | Maximum heart rate achieved during exercise | Continuous |
| **`exerciseangia`** | Exercise-induced angina (1 = Yes, 0 = No) | Binary |
| **`oldpeak`** | ST depression induced by exercise relative to rest | Continuous |
| **`slope`** | Slope of the ST segment: 1 = upsloping, 2 = flat, 3 = downsloping | Multi-class categorical |
| **`noofmajorvessels`** | Number of major blood vessels (0–3) visualized using fluoroscopy | Discrete categorical |
| **`target`** | Diagnosis result (0 = No heart disease, 1 = Heart disease) | Target |
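
The types in the table above map naturally onto a preprocessing step. The following is a minimal sketch, assuming scikit-learn and pandas are available: continuous columns are standardized, multi-class columns are one-hot encoded, and binary columns are passed through unchanged. The grouping constants, the `build_preprocessor` helper, and the two demo rows are illustrative assumptions (the rows are made up, not real patient data). Treating `noofmajorvessels` as categorical rather than ordinal is one possible choice, not necessarily the one used later in the project.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column groups taken from the feature dictionary (names kept as in the dataset).
CONTINUOUS = ["age", "restingBP", "serumcholestrol", "maxheartrate", "oldpeak"]
BINARY = ["gender", "fastingbloodsugar", "exerciseangia"]
MULTICLASS = ["chestpain", "restingelectro", "slope", "noofmajorvessels"]
TARGET = "target"


def build_preprocessor() -> ColumnTransformer:
    """Scale continuous columns, one-hot encode multi-class ones,
    and leave the 0/1 binary columns untouched."""
    return ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), CONTINUOUS),
            ("cat", OneHotEncoder(handle_unknown="ignore"), MULTICLASS),
            ("bin", "passthrough", BINARY),
        ]
    )


# Two made-up rows for illustration only (not real patient data).
demo = pd.DataFrame(
    {
        "age": [53, 40],
        "gender": [1, 0],
        "chestpain": [2, 0],
        "restingBP": [171, 94],
        "serumcholestrol": [240, 229],
        "fastingbloodsugar": [0, 1],
        "restingelectro": [1, 0],
        "maxheartrate": [147, 115],
        "exerciseangia": [0, 1],
        "oldpeak": [5.3, 3.7],
        "slope": [2, 1],
        "noofmajorvessels": [3, 0],
        "target": [1, 0],
    }
)

features = demo[CONTINUOUS + BINARY + MULTICLASS]
X = build_preprocessor().fit_transform(features)
print(X.shape)
```

Keeping the column groups in named lists means the same preprocessor can be reused inside a `Pipeline` with any of the classifiers compared later, so that scaling and encoding are fitted on training folds only.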
