# Heart Disease Prediction – Final Report

## 1. Introduction
Heart disease is one of the major health problems affecting people worldwide.  
Early detection can help prevent serious medical complications.  
In this project, we use machine learning techniques to predict whether a patient has heart disease based on different medical parameters.

The main goals of this project are:
- To study the dataset
- To perform data cleaning and visualization
- To train machine learning models
- To predict heart disease with good accuracy

---

## 2. Dataset Overview

### - Dataset Shape

#### A. values.csv
- Shape: **(180 rows × 14 columns)**
- Meaning:
  - 180 patients
  - 14 columns (patient ID + 13 clinical features)

#### B. labels.csv
- Shape: **(180 rows × 2 columns)**
- Meaning:
  - 180 patients
  - 2 label columns (patient ID + diagnosis)

#### C. Merged Dataset
After merging:
- 14 + 2 = 16 columns  
- 1 duplicate ID column removed  
- Final shape: **(180 rows × 15 columns)**

**Meaning:**
- 180 patients  
- 15 final columns (patient ID + 13 clinical features + target label)
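
The column arithmetic above can be sketched with pandas. This is a minimal illustration on tiny stand-in frames, not the project's actual code: the column names (`patient_id`, `heart_disease_present`, and the feature names) are assumptions, and the real inputs are `values.csv` and `labels.csv`.

```python
import pandas as pd

# Tiny stand-in frames; the real files are values.csv and labels.csv.
# All column names here are assumptions for illustration.
values = pd.DataFrame({
    "patient_id": ["a1", "b2", "c3"],
    "age": [45, 54, 61],
    "max_heart_rate_achieved": [170, 158, 145],
})
labels = pd.DataFrame({
    "patient_id": ["a1", "b2", "c3"],
    "heart_disease_present": [0, 0, 1],
})

# Merging on the shared key keeps a single ID column, so the column
# count is (3 + 2) - 1 = 4 here, mirroring (14 + 2) - 1 = 15 in the report.
merged = values.merge(labels, on="patient_id", how="inner", validate="one_to_one")
print(merged.shape)
```

Passing `validate="one_to_one"` makes pandas raise an error if a patient ID appears more than once in either file, which guards against silently duplicated rows.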
  
**The dataset contains important clinical features that influence a patient’s heart condition.**
Below are the key attributes:

- **Age:** Patient’s age in years  
- **Sex (Gender):** Male or Female  
- **Chest Pain Type:** Indicates the nature of chest pain (4 categories)  
- **Resting Blood Pressure:** Blood pressure measured in mmHg  
- **Cholesterol Level:** Serum cholesterol in mg/dl  
- **Fasting Blood Sugar:** Whether fasting sugar > 120 mg/dl  
- **Resting ECG Results:** Heartbeat electrical activity  
- **Max Heart Rate Achieved:** Maximum heart rate during exercise  
- **Exercise-Induced Angina:** Pain during exercise (yes/no)  
- **Oldpeak:** ST depression value  
- **Slope:** Slope of ST segment  
- **Major Vessels (ca):** Number of vessels colored by fluoroscopy  
- **Thal Result:** Thallium stress test result (blood flow to the heart)  
- **Target:**  
  - 0 = No Heart Disease  
  - 1 = Heart Disease Present  

### - Dataset Condition
- No missing values  
- Clean, structured, ready for analysis  
- Balanced distribution of target values  

---
## 3. Data Cleaning
Steps performed during data cleaning:
- Checked for missing values
- Removed duplicates
- Converted data types where needed
- Identified outliers using boxplots and treated them
- Standardized category values
- Ensured the dataset is clean and ready for modeling

After cleaning, the dataset became stable and suitable for visualization and machine learning.
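
A minimal sketch of the duplicate removal and boxplot-style outlier check listed above, assuming pandas and a hypothetical cholesterol column name. Clipping to the whisker bounds is only one possible treatment; the report does not state which one was used.

```python
import pandas as pd

# Toy data with one duplicated row and one extreme cholesterol value.
df = pd.DataFrame({
    "age": [45, 54, 54, 61, 50],
    "serum_cholesterol_mg_per_dl": [210, 230, 230, 564, 250],
})

# Remove exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# Flag outliers with the 1.5 * IQR rule that a boxplot uses for its whiskers.
col = "serum_cholesterol_mg_per_dl"
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outlier_mask = (df[col] < lower) | (df[col] > upper)

# One possible treatment (an assumption): clip to the whisker bounds
# instead of dropping rows, which matters for a dataset of only 180 rows.
df[col] = df[col].clip(lower=lower, upper=upper)
```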

---
## 4. Data Preprocessing

Before applying machine learning models, several preprocessing steps were performed:

### - Step 1: Checking Missing Values
The dataset was checked thoroughly, and **no null values** were found.  
With no gaps to fill, no imputation was needed, which avoids introducing bias from estimated values.
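
A null check of this kind could look like the sketch below. The frame and its column names are illustrative stand-ins, not the project's data.

```python
import pandas as pd

# Stand-in frame with no missing entries (column names are assumptions).
df = pd.DataFrame({
    "age": [45, 54, 61],
    "resting_blood_pressure": [120, 130, 140],
})

# Count nulls per column, then in total.
missing_per_column = df.isnull().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
```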

### - Step 2: Feature–Target Separation
- **Input Features (X):** All medical parameters  
- **Target (y):** The “target” column representing disease status  

### - Step 3: Train–Test Split
The dataset was divided into two parts:
- **Training Set:** Used to train the ML models  
- **Testing Set:** Used to evaluate performance  

Evaluating on data the models never saw during training helps detect overfitting and gives a more reliable performance estimate.
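
Steps 2 and 3 can be sketched with scikit-learn as follows. The 80/20 ratio, the stratification, the random seed, and the column names are assumptions; the report does not state the settings that were used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with two features and a binary target (names are assumptions).
df = pd.DataFrame({
    "age": list(range(40, 60)),
    "max_heart_rate_achieved": list(range(180, 140, -2)),
    "heart_disease_present": [0, 1] * 10,
})

# Step 2: separate input features (X) from the target (y).
X = df.drop(columns=["heart_disease_present"])
y = df["heart_disease_present"]

# Step 3: hold out 20% for testing; stratify keeps the class ratio
# the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```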

### - Step 4: Scaling (Where Required)
Models such as Logistic Regression and SVM are sensitive to feature scale.  
**StandardScaler** was used to standardize features (zero mean, unit variance) for these models.
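
A short sketch of how `StandardScaler` is typically applied: the scaler is fitted on the training data only and then reused on the test data, so no test-set statistics leak into training. The arrays are toy values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices, e.g. blood pressure and cholesterol columns.
X_train = np.array([[120.0, 200.0], [140.0, 240.0], [160.0, 280.0]])
X_test = np.array([[150.0, 260.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
```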

---
## 5. Exploratory Data Analysis (EDA)

### 5.1 Target Variable Distribution
Shows how many patients have heart disease.
- Helps check dataset balance  
- Important for model training  
- Gives basic understanding of prediction class
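
Class balance can be checked numerically as well as visually. The sketch below uses a made-up label series; the name `heart_disease_present` is an assumption.

```python
import pandas as pd

# Made-up labels for illustration only.
y = pd.Series([0, 1, 0, 0, 1, 1, 0, 1, 0, 0], name="heart_disease_present")

class_counts = y.value_counts().sort_index()               # absolute counts per class
class_share = y.value_counts(normalize=True).sort_index()  # fraction per class
print(class_counts)
print(class_share)
```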

### 5.2 Age Distribution
Shows how age is spread across patients.
- Age is a major risk factor  
- Helps understand which age groups dominate the dataset
- Majority of patients are between **40–60 years**.  
- Risk increases with age, especially above 50.


### 5.3 Gender vs Heart Disease
Compares heart disease rate for males and females.
- Useful for medical insights  
- Identifies gender-related risk patterns

### 5.4 Cholesterol Boxplot
Shows cholesterol spread and outliers.
- Helps detect abnormal values  
- Important for data quality
- High cholesterol levels are commonly seen in patients with heart disease.  

### 5.5 Correlation Heatmap
Shows the relationships between numerical features.
- Helps in feature selection  
- Identifies strong and weak predictors  
- Shows multicollinearity

### 5.6 Scatter Plot: Age vs Max Heart Rate
Shows how age impacts maximum heart rate.
- Useful clinical pattern  
- Younger people usually have higher heart rate capacity

### 5.7 Chest Pain Type vs Target
- **Asymptomatic chest pain** has the highest number of heart disease cases.  
- This suggests that chest pain type is an important predictive feature.

### 5.8 Resting Blood Pressure
- Patients with higher resting blood pressure show increased risk.

### 5.9 Gender Comparison
- Males exhibit more heart disease cases compared to females.

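### 5.10 Correlation Heatmap: Key Findings
The strongest correlations with the target were:
- **Oldpeak:** positively correlated with disease (higher values, higher chance)  
- **Thal:** positively correlated with disease  
- **Max heart rate:** negatively correlated with disease (higher values, lower chance)  

These insights helped in selecting strong features for the model.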
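
Correlations with the target can be read directly from the correlation matrix. In the sketch below the values are invented purely to mirror the directions reported above, and the column names are assumptions.

```python
import pandas as pd

# Invented values that mimic the reported directions; not project data.
df = pd.DataFrame({
    "oldpeak": [0.0, 0.2, 1.5, 2.3, 0.1, 3.0],
    "max_heart_rate_achieved": [172, 165, 140, 128, 170, 120],
    "heart_disease_present": [0, 0, 1, 1, 0, 1],
})

# Pearson correlation of every feature with the target, sorted ascending.
corr_with_target = (
    df.corr()["heart_disease_present"]
    .drop("heart_disease_present")
    .sort_values()
)
print(corr_with_target)
```

Pearson correlation only captures linear association, so a categorical feature such as thal usually needs to be encoded numerically before it can appear in a heatmap like this.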

EDA clarified which variables are significant and how they behave, improving model decisions.

---
## 6. Machine Learning Models Used

Multiple models were trained to compare performance and select the most accurate one:

1. **Logistic Regression**   
2. **Support Vector Machine (SVM)**  
3. **Gradient Boosting**  
4. **Random Forest**

Each model was evaluated on the same train-test split so that the comparison is fair.
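
A comparison of this kind might be set up as in the sketch below. It runs on synthetic data from `make_classification` sized like the dataset (180 rows, 13 features); the hyperparameters, the seed, and the use of pipelines for scaling are assumptions, not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the real features come from the merged dataset.
X, y = make_classification(n_samples=180, n_features=13, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scale-sensitive models get a StandardScaler step; tree ensembles do not need one.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Fit every model on the same training part and score it on the same test part.
auc_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]
    auc_by_model[name] = roc_auc_score(y_test, scores)

for name, auc_value in sorted(auc_by_model.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: AUC = {auc_value:.3f}")
```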

---

Among these, **Random Forest** performed the best because:
- Works well with mixed data  
- Handles non-linear relationships  
- Reduces overfitting using multiple trees  
- Gives better accuracy and stability  

---
## 7. Model Evaluation

To measure how well each model performs, these metrics were used:

### - Accuracy
The proportion of all predictions that were correct.

### - Precision
Out of all predicted positive cases, how many were correct.  
Important to avoid false positives.

### - Recall
Out of all actual positive cases, how many were correctly identified.  
Very important in medical diagnosis.

### - F1 Score
Balance between precision and recall.

### - ROC–AUC Score
Measures the model’s ability to distinguish between classes.  
Higher AUC = better performance.

These metrics give a complete understanding of model quality.
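
The five metrics above can be computed with `sklearn.metrics`. The labels and scores below are a toy example chosen so the numbers are easy to verify by hand; the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy ground truth and predicted probabilities for the positive class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]

# Threshold at 0.5 to get class predictions: TP=3, FN=1, FP=1, TN=5.
y_pred = [int(s >= 0.5) for s in y_score]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),    # (3 + 5) / 10 = 0.80
    "precision": precision_score(y_true, y_pred),  # 3 / (3 + 1) = 0.75
    "recall": recall_score(y_true, y_pred),        # 3 / (3 + 1) = 0.75
    "f1": f1_score(y_true, y_pred),                # harmonic mean = 0.75
    "roc_auc": roc_auc_score(y_true, y_score),     # uses scores, not thresholded labels
}
print(metrics)
```

Note that ROC–AUC is computed from the predicted probabilities, while the other four metrics depend on the chosen threshold.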

---
**The best model (Random Forest) showed:**
- High accuracy  
- Good sensitivity (recall)  
- Strong AUC score  

This makes it a reliable model for predicting heart disease.

---
## 8. Model Performance Summary

After evaluating all models, the performance was analyzed.  
**Random Forest achieved the strongest metrics overall.**

### Random Forest Results:
- **Accuracy:** 0.8611  
- **Precision:** 0.7857  
- **Recall:** 0.9375  
- **F1 Score:** 0.8571  
- **AUC Score:** 0.9625  
- **Cross-Validation AUC:** 0.8659  

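These results show that Random Forest performs well on the held-out test set. The cross-validation AUC (0.8659) is lower than the test AUC (0.9625), so the cross-validated figure is the more conservative estimate of how the model generalizes.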
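
A cross-validated AUC can be obtained as sketched below. The data is synthetic, and the fold count, shuffling, and hyperparameters are assumptions, since the report does not state how its cross-validation was configured.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data sized like the dataset (180 rows, 13 features).
X, y = make_classification(n_samples=180, n_features=13, n_informative=6, random_state=0)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)

# One AUC value per fold; the mean and spread summarize stability.
cv_auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
mean_auc, std_auc = cv_auc.mean(), cv_auc.std()
print(f"CV AUC: {mean_auc:.4f} +/- {std_auc:.4f}")
```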

---
## 9. ROC Curve Analysis

The ROC curve allows a visual comparison of all models.

### - Interpretation:
- Higher curve = better performance  
- Random Forest’s ROC curve covers the largest area  
- AUC of **0.96+** indicates excellent class separation  
- Supports Random Forest as the most dependable model for heart disease prediction  

This indicates strong diagnostic capability on the test data.
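
The points of a ROC curve and the area under it can be computed as in this small sketch. The labels and scores are toy values, unrelated to the project's results.

```python
from sklearn.metrics import auc, roc_curve

# Toy labels and predicted scores for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each threshold yields one (false positive rate, true positive rate) point.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve via the trapezoidal rule.
roc_auc = auc(fpr, tpr)
print(f"AUC = {roc_auc:.2f}")
```

Plotting `fpr` against `tpr` for each model on the same axes gives the visual comparison described above.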

---
## 10. Why Random Forest Was Selected as the Best Model

Random Forest outperformed all other models because:

- It achieved the **highest AUC score (0.96+)**  
- Very high **recall**, meaning fewer missed disease cases  
- Works well even when data relationships are nonlinear  
- Less risk of overfitting than a single decision tree  
- Strong performance across cross-validation folds  
- Provides feature importance scores and captures complex feature interactions  

For medical predictions, recall and AUC are crucial, and Random Forest scores well on both.
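
Feature importances from a fitted Random Forest can be inspected as sketched below. The data is synthetic and the feature names are placeholders, so the ranking says nothing about the clinical features in this report.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 3 informative features and 2 noise features.
X, y = make_classification(
    n_samples=180, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances; they are non-negative and sum to 1.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```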

---
## 11. Final Prediction System

The finalized model is a **Random Forest Classifier**.  
It predicts:

- **1 → Heart Disease Present**  
- **0 → Heart Disease Not Present**

Doctors can use this model to analyze patient risk based on clinical data, enabling faster diagnosis and preventive treatment.
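
Scoring a single patient with a fitted classifier might look like the sketch below. The model is trained on synthetic data and the "patient" is just the first synthetic row, so this shows only the call pattern, not the project's deployed system.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 13 clinical features.
X, y = make_classification(n_samples=180, n_features=13, n_informative=6, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# A single patient must be passed as a 2D array with one row.
new_patient = X[:1]

predicted_class = int(model.predict(new_patient)[0])                 # 1 = disease, 0 = no disease
disease_probability = float(model.predict_proba(new_patient)[0, 1])  # probability of class 1
print(predicted_class, round(disease_probability, 3))
```

Reporting the probability alongside the 0/1 label lets a reader see how confident the model is, rather than only which side of the threshold the patient falls on.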

---
## 12. Conclusion

**This project successfully:**
- Cleaned and prepared the dataset  
- Used EDA to reveal important insights such as the age distribution, the impact of chest pain type, cholesterol patterns, and feature correlations  
- Trained and evaluated multiple ML models  
- Compared the models fairly using performance metrics and ROC curves  
- Identified **Random Forest** as the best-performing model overall  
- Demonstrated strong prediction capability for heart disease  

The resulting model can assist healthcare professionals in early diagnosis and decision support.

---
## 13. Summary
This project provides a complete pipeline:
- Data Cleaning  
- EDA  
- Visualization  
- Machine Learning  
- Conclusion  

It gives valuable medical insights and supports early detection of heart disease.

#  End of Detailed Report