# Title: Heart Disease Risk Prediction Using Machine Learning

#### Group Member Names :

 Kelvin Ikrokoto

 Clinton Avornu


### INTRODUCTION:

Cardiovascular diseases are one of the leading causes of death globally. Early prediction of heart disease can significantly reduce mortality through timely diagnosis and intervention. Machine Learning (ML) provides a powerful approach to detect patterns in clinical data and predict health risks.

This project focuses on predicting heart disease using multiple ML models. The work includes reproducing results from a selected research paper and implementing a new contribution model to evaluate improvements in accuracy and performance.
*********************************************************************************************************************
#### AIM :

To reproduce the methodology of a heart disease prediction research paper and introduce a new machine learning model (Random Forest) as a significant contribution to improve predictive performance.
*********************************************************************************************************************
#### Github Repo:
https://github.com/kelvinikros/AIDI1002_FinalProject_HeartDiseaseProject
*********************************************************************************************************************
#### DESCRIPTION OF PAPER:

The selected research paper, “Heart Disease Prediction using Machine Learning” (2023), examines the use of machine learning models such as Logistic Regression, Decision Trees, SVM, and KNN to predict heart disease based on clinical attributes including age, cholesterol levels, chest pain type, resting blood pressure, maximum heart rate, and more.

The paper highlights challenges in early diagnosis and demonstrates that ML models can support clinical decision-making. It also suggests that ensemble methods or neural networks may improve prediction accuracy, which aligns with the contribution implemented in this project.
*********************************************************************************************************************
#### PROBLEM STATEMENT :

To develop a machine learning model that accurately predicts whether a patient is at risk of heart disease (target = 1) or not at risk (target = 0) using structured clinical data.
*********************************************************************************************************************
#### CONTEXT OF THE PROBLEM:
Early detection of heart disease allows for better treatment planning and reduces the likelihood of severe outcomes such as heart attacks. However, manual assessment based on symptoms and lab values can be inconsistent. Machine learning models can analyze complex patterns in clinical features and assist healthcare providers in risk assessment.
*********************************************************************************************************************
#### SOLUTION:

This project implements three machine learning models:
1. Decision Tree (baseline)
2. MLP Neural Network (reproduced model)
3. Random Forest (contribution model)

The models are evaluated using accuracy, confusion matrices, and ROC curves. The Random Forest model serves as the project's contribution and achieves slightly higher accuracy than the Decision Tree baseline.
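The comparison described above can be sketched end to end as follows. This is a minimal illustration, not the project's actual run: it uses a synthetic dataset from `make_classification` as a stand-in for the clinical table (the 13-feature shape is an assumption), and it wraps the MLP in a scaling pipeline so that all three models can share one loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the clinical data (assumption: 13 features, binary target).
X, y = make_classification(
    n_samples=600, n_features=13, n_informative=6, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    # The MLP is sensitive to feature scale, so it gets its own scaler.
    "MLP": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42),
    ),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # probability of class 1
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_prob),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }

for name, r in results.items():
    print(f"{name}: accuracy={r['accuracy']:.3f}, AUC={r['auc']:.3f}")
```

The numbers printed by this sketch come from synthetic data and are not comparable to the accuracies reported for the heart disease dataset.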


# Background
*********************************************************************************************************************


### Reference
Heart Disease Prediction using Machine Learning (2023)
### Explanation
The paper explores classical ML models for predicting heart disease and analyzes feature importance, preprocessing steps, and model performance.
### Dataset/Input
The study uses clinical data including chest pain type, cholesterol level, resting blood pressure, fasting blood sugar, age, and ECG results.
### Weakness
The paper does not include ensemble models such as Random Forest which could offer better performance and robustness. Limited hyperparameter tuning is applied.




*********************************************************************************************************************






# Implement paper code :
*********************************************************************************************************************

## 1. DATA LOADING CODE
import numpy as np

import pandas as pd

### Visualization
import matplotlib.pyplot as plt

import seaborn as sns

### Machine learning tools
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from sklearn.metrics import (

    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_curve,
    auc
    
)

from sklearn.tree import DecisionTreeClassifier

from sklearn.neural_network import MLPClassifier

from sklearn.ensemble import RandomForestClassifier  # contribution model

### Make charts look good

sns.set(style="whitegrid")

plt.rcParams["figure.figsize"] = (8, 5)
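The code that follows uses a DataFrame named `df`, but the step that creates it is not shown in this section. A minimal sketch of that step might look like the block below. The file name `heart.csv` is hypothetical (the real file lives in the dataset folder of the referenced repository), and the `load_heart_data` helper is an illustrative name, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Hypothetical file name: point this at the CSV in your copy of the dataset.
DATA_PATH = "heart.csv"


def load_heart_data(path):
    """Read the CSV and run basic sanity checks before any modeling."""
    data = pd.read_csv(path)
    if "target" not in data.columns:
        raise ValueError("expected a 'target' column with 0/1 labels")
    if not set(data["target"].unique()) <= {0, 1}:
        raise ValueError("'target' should contain only 0 and 1")
    return data


# Load only if the file is present, so the sketch does not fail on import.
if Path(DATA_PATH).exists():
    df = load_heart_data(DATA_PATH)
    print("Loaded", df.shape[0], "rows and", df.shape[1], "columns")
```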

## 2. EDA CHARTS

print("Shape of dataset:", df.shape)

print("\nData types:\n", df.dtypes)

print("\nMissing values per column:\n", df.isnull().sum())

### Target distribution

print("\nTarget Value Counts:")

print(df["target"].value_counts())

### ---- BAR CHART: Target Distribution ----
plt.figure(figsize=(6,4))

sns.countplot(x="target", data=df, palette="viridis")

plt.title("Distribution of Heart Disease (Target)")

plt.xlabel("Target (0 = No Disease, 1 = Disease)")

plt.ylabel("Count")

plt.show()

numeric_cols = ["age", "trestbps", "chol", "thalachh", "oldpeak"]

df[numeric_cols].hist(bins=20, figsize=(12, 7), color="#4C72B0")

plt.suptitle("Distributions of Numeric Features", fontsize=16)

plt.show()


plt.figure(figsize=(10, 8))

corr = df.corr()

sns.heatmap(

    corr,
    cmap="coolwarm",
    linewidths=0.5,
    annot=False
    
)

plt.title("Correlation Heatmap of Features")

plt.show()

## 3. Separate features and target
X = df.drop("target", axis=1)

y = df["target"]

### Train-test split (80% training, 20% testing)

X_train, X_test, y_train, y_test = train_test_split(

    X, y, test_size=0.2, random_state=42, stratify=y
)
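The `stratify=y` argument keeps the class ratio the same in the training and test sets. A small sketch with toy labels (a 70/30 ratio chosen only for illustration, not the project's data) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 70/30 class ratio (illustrative only).
y = np.array([0] * 70 + [1] * 30)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratification, each subset keeps the 30% positive rate of the full set.
print("Positive rate, full: ", y.mean())
print("Positive rate, train:", y_train.mean())
print("Positive rate, test: ", y_test.mean())
```

Without `stratify`, a random split of a small dataset can leave the test set with a noticeably different class ratio, which makes accuracy harder to interpret.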

## 4. Scale the data (used by the MLP; the tree-based models are trained on the unscaled features)
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)

X_test_scaled = scaler.transform(X_test)

print("Training data shape:", X_train.shape)

print("Test data shape:", X_test.shape)
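`StandardScaler` replaces each value x with z = (x - mean) / std, where the mean and standard deviation are learned from the training rows only and then reused for the test rows. The worked example below uses made-up age and cholesterol values to make the arithmetic visible; it is a sketch, not the project's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training rows: [age, cholesterol] (made-up values).
X_train = np.array([[50.0, 200.0], [60.0, 240.0], [70.0, 280.0]])
X_test = np.array([[65.0, 300.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from train
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

print("Train means:", scaler.mean_)    # per-column mean of the training rows
print("Train stds: ", scaler.scale_)   # per-column population std (ddof=0)
print("Scaled test row:", X_test_scaled)

# Manual check of the formula z = (x - mean) / std for the test row.
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print("Matches manual computation:", np.allclose(manual, X_test_scaled))
```

Fitting the scaler on the training set only, as the section does, keeps information about the test rows out of the preprocessing step.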

## 5. DECISION TREE CODE
dt = DecisionTreeClassifier(random_state=42)

dt.fit(X_train, y_train)

### Predictions
y_pred_dt = dt.predict(X_test)

### Evaluation
dt_accuracy = accuracy_score(y_test, y_pred_dt)

print("Decision Tree Accuracy:", dt_accuracy)

print("\nClassification Report:")

print(classification_report(y_test, y_pred_dt))

### Confusion Matrix
cm = confusion_matrix(y_test, y_pred_dt)

plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")

plt.title("Decision Tree - Confusion Matrix")

plt.xlabel("Predicted")

plt.ylabel("Actual")

plt.show()
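For a clinical task it is often useful to read sensitivity and specificity directly off the confusion matrix. The sketch below uses ten made-up labels and predictions to show how the four cells map to those metrics; the variable names are illustrative.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions (1 = disease, 0 = no disease).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # share of diseased patients that were caught
specificity = tn / (tn + fp)  # share of healthy patients correctly cleared
precision = tp / (tp + fp)    # share of positive predictions that were right

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Sensitivity={sensitivity:.3f}, Specificity={specificity:.3f}, "
      f"Precision={precision:.3f}")
```

In a screening setting, a false negative (a missed case) is usually the more costly error, so sensitivity deserves attention alongside accuracy.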


### Calculate ROC curve
y_prob_dt = dt.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_prob_dt)

roc_auc = auc(fpr, tpr)

### Plot the curve
plt.figure(figsize=(6,5))

plt.plot(fpr, tpr, label="AUC = {:.3f}".format(roc_auc))

plt.plot([0,1], [0,1], linestyle="--", color="gray")

plt.xlabel("False Positive Rate")

plt.ylabel("True Positive Rate")

plt.title("Decision Tree - ROC Curve")

plt.legend()

plt.show()
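The AUC reported in the plot legend is the area under the (FPR, TPR) curve, which equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A tiny worked example with four made-up scores makes this concrete:

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Four made-up cases: two negatives and two positives with predicted scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# Of the 4 (positive, negative) pairs, 3 are ranked correctly, so AUC = 0.75.
print("AUC from curve:", roc_auc)                               # 0.75
print("AUC from roc_auc_score:", roc_auc_score(y_true, y_score))  # 0.75
```

One caveat for the Decision Tree above: a fully grown tree tends to output probabilities of exactly 0 or 1, so its ROC curve may consist of very few points.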



## 6. MLP (NEURAL NETWORK) CODE
mlp = MLPClassifier(

    hidden_layer_sizes=(64, 32),
    activation='relu',
    solver='adam',
    max_iter=500,
    random_state=42
    
)

mlp.fit(X_train_scaled, y_train)

### Predictions

y_pred_mlp = mlp.predict(X_test_scaled)

### Evaluation
mlp_accuracy = accuracy_score(y_test, y_pred_mlp)

print("MLP Accuracy:", mlp_accuracy)

print("\nClassification Report:")

print(classification_report(y_test, y_pred_mlp))

### Confusion Matrix
cm = confusion_matrix(y_test, y_pred_mlp)

plt.figure(figsize=(5,4))

sns.heatmap(cm, annot=True, fmt="d", cmap="Greens")

plt.title("MLP Classifier - Confusion Matrix")

plt.xlabel("Predicted")

plt.ylabel("Actual")

plt.show()

### ROC Curve
y_prob_mlp = mlp.predict_proba(X_test_scaled)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_prob_mlp)

roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6,5))

plt.plot(fpr, tpr, label="AUC = {:.3f}".format(roc_auc))

plt.plot([0,1], [0,1], linestyle="--", color="gray")

plt.xlabel("False Positive Rate")

plt.ylabel("True Positive Rate")

plt.title("MLP - ROC Curve")

plt.legend()

plt.show()

*********************************************************************************************************************
### Contribution Code :
rf = RandomForestClassifier(

    n_estimators=200,
    max_depth=None,
    random_state=42  
)

rf.fit(X_train, y_train)

### Predictions

y_pred_rf = rf.predict(X_test)

### Evaluation

rf_accuracy = accuracy_score(y_test, y_pred_rf)

print("Random Forest Accuracy:", rf_accuracy)

print("\nClassification Report:")

print(classification_report(y_test, y_pred_rf))
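The Random Forest can be given the same confusion matrix and AUC treatment as the other two models, and its `feature_importances_` attribute offers a quick view of which inputs drive the predictions. The sketch below assumes a synthetic dataset with placeholder column names (`feature_0` and so on); on the real data the column names would be the clinical attributes.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with placeholder column names (assumption).
X_arr, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_arr.shape[1])]
X = pd.DataFrame(X_arr, columns=feature_names)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Same evaluation as for the Decision Tree and MLP.
cm_rf = confusion_matrix(y_test, rf.predict(X_test))
auc_rf = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
print("Confusion matrix:\n", cm_rf)
print(f"AUC: {auc_rf:.3f}")

# Impurity-based importances; they sum to 1 across all features.
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking rather than a causal statement.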

### Results :
• Decision Tree Accuracy: 0.976190  
• MLP (Neural Network) Accuracy: 0.965608  
• Random Forest Accuracy: 0.978836

The Random Forest model achieved the highest accuracy of the three models (0.9788), ahead of the Decision Tree (0.9762) and the MLP (0.9656). The margin over the Decision Tree is small, about 0.26 percentage points on a single train-test split, so it should be read as a slight improvement rather than proof of better generalization. The gain is consistent with the variance reduction that comes from averaging many trees. The MLP, trained on standardized features, also performed competitively.

Overall, the results suggest that an ensemble method such as Random Forest can match or slightly exceed both a single decision tree and a neural network on this structured tabular medical dataset.
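Because the reported accuracies come from one train-test split, a cross-validated comparison is one way to check whether the ranking of the models is stable. The following sketch uses stratified 5-fold cross-validation on a synthetic stand-in dataset; the fold count and the smaller forest size are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the clinical data (assumption).
X, y = make_classification(
    n_samples=500, n_features=13, n_informative=6, random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_scores = {}
for name, model in candidates.items():
    # One accuracy value per fold; the spread shows how stable the score is.
    cv_scores[name] = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean={cv_scores[name].mean():.3f}, "
          f"std={cv_scores[name].std():.3f}")
```

If the difference between two models is smaller than the fold-to-fold standard deviation, the ranking from a single split should be treated with caution.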
*******************************************************************************************************************************


#### Observations :
• Age, cholesterol, and maximum heart rate showed noticeable correlations with heart disease.  
• The Random Forest model scored slightly higher on the test set than the single Decision Tree.  
• Scaling significantly improved MLP performance.  
• The dataset was balanced, resulting in stable evaluation metrics.
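Observations about class balance and feature correlations like those above can be checked with two short pandas calls. The sketch below runs them on six made-up rows so that it is self-contained; on the project data the same calls would be applied to `df`.

```python
import pandas as pd

# Six made-up rows (illustrative values only, not taken from the dataset).
df = pd.DataFrame({
    "age": [63, 45, 58, 39, 70, 52],
    "chol": [233, 210, 280, 190, 300, 240],
    "thalachh": [150, 172, 140, 180, 120, 160],
    "target": [1, 0, 1, 0, 1, 0],
})

# Share of each class; values near 0.5 indicate a balanced binary target.
class_share = df["target"].value_counts(normalize=True)
print(class_share)

# Pearson correlation of each feature with the target, strongest first.
target_corr = (
    df.corr()["target"]
    .drop("target")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```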
*******************************************************************************************************************************


### Conclusion and Future Direction :
The project reproduced the approach of the selected research paper and extended it with a Random Forest model as its contribution. Random Forest achieved the highest test accuracy (0.9788), a small gain over the Decision Tree baseline (0.9762).

Future work may include:
• Hyperparameter tuning  
• Feature engineering  
• Using deep learning architectures  
• Testing the model on larger real-world datasets
*******************************************************************************************************************************
#### Learnings :
We learned how to reproduce research work using GitHub repositories, implement multiple ML models, compare model performance, and evaluate results using metrics and visualizations. This project improved our understanding of ensemble methods and preprocessing techniques.
*******************************************************************************************************************************
#### Results Discussion :
The results show that the Random Forest ensemble scored slightly higher than the single Decision Tree (0.9788 vs. 0.9762), while the MLP came in a little lower (0.9656). This is consistent with the hypothesis that ensemble methods generalize better, although the margin is small and comes from a single train-test split.
*******************************************************************************************************************************
#### Limitations :
• Dataset size is relatively small compared to real clinical datasets.  
• Hyperparameters were not fully optimized.  
• Only structured tabular data was used — adding imaging or ECG waveforms could improve performance.
*******************************************************************************************************************************
#### Future Extension :
Future extensions include hyperparameter tuning using GridSearchCV, applying gradient boosting models like XGBoost, and integrating deep learning architectures for multi-modal medical analysis.
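As a starting point for the tuning step mentioned above, a `GridSearchCV` run over a few Random Forest settings could look like the sketch below. The parameter grid, the 3-fold setting, and the synthetic dataset are all assumptions chosen to keep the example small; a real search would use the project data and a wider grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the clinical data (assumption).
X, y = make_classification(
    n_samples=400, n_features=13, n_informative=6, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Small illustrative grid: 2 x 2 x 2 = 8 candidate settings.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 3],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)  # tuning uses the training data only

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {search.score(X_test, y_test):.3f}")
```

Keeping the test set out of the search, as done here, means the final test accuracy remains an honest estimate for the selected settings.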

# References:

[1] Heart Disease Prediction using Machine Learning (2023).  
PDF: https://repository.kaust.edu.sa/bitstreams/6a5b958c-dc87-4a05-a8ec-fc7c5e0e0ae0/download  

[2] GitHub Repository — Heart Attack Prediction Using Machine Learning.  
https://github.com/nfarhaan/Heart-Attack-Prediction-Using-Machine-Learning  

[3] UCI Heart Disease Dataset (Merged).  
https://github.com/nfarhaan/Heart-Attack-Prediction-Using-Machine-Learning/tree/main/dataset
