# 📓 Heart Disease Prediction

## Problem Statement


### Business / Healthcare Objective

Cardiovascular diseases (CVDs) are the leading cause of death globally, accounting for roughly 31% of all deaths.
Early identification of high-risk patients can significantly reduce mortality through preventive care and timely intervention.

### Project Goals

* Perform Exploratory Data Analysis (EDA) to understand patient health indicators.
* Build machine learning models to predict whether a person has heart disease.
* Provide actionable recommendations to hospitals to help prevent life-threatening events.

## Dataset Overview

* Domain: Healthcare
* Source: Heart Disease Prediction Dataset
* Target variable: `target`
    * 0 → No heart disease
    * 1 → Heart disease present

## 🔹 1. Import Libraries

In [None]:
# Data handling
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# ML preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, classification_report, 
    confusion_matrix, roc_auc_score, roc_curve
)

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

## Dataset Characteristics


| Attribute      | Description                  |
| -------------- | ---------------------------- |
| Total Rows     | ~300+ patients               |
| Total Columns  | 14                           |
| Identifier     | `patient_id` (to be dropped) |
| Data Types     | Numerical + Categorical      |
| Missing Values | Minimal / None               |
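
The characteristics in this table are straightforward to confirm once the data is loaded. The sketch below shows one way to do that; `summarize_dataset` is a hypothetical helper that is not part of the original notebook, and the three-row frame is a made-up stand-in for the real `heart.csv`.

```python
import pandas as pd


def summarize_dataset(df, id_col="patient_id"):
    """Return basic shape, dtype, and missing-value facts about `df`."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_ids": int(df[id_col].duplicated().sum()),
        "non_numeric_columns": df.select_dtypes(exclude="number").columns.tolist(),
    }


# Hypothetical three-row stand-in for the real data (one age is missing on purpose)
demo = pd.DataFrame({
    "patient_id": ["a1", "b2", "c3"],
    "age": [45, 61, None],
    "thal": ["normal", "reversible_defect", "normal"],
    "target": [0, 1, 0],
})
summary = summarize_dataset(demo)
print(summary)
```

Running the same helper on the loaded `df` before `patient_id` is dropped would show whether the row count, column count, and missing-value claims hold for the actual file.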


## 🔹 2. Load Dataset

In [None]:
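df = pd.read_csv("heart.csv")  # change filename if needed
print(df.shape)  # (rows, columns); a bare expression mid-cell is not displayed
df.info()        # column dtypes and non-null counts
df.head()        # last expression in the cell, so it is rendered as the output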

## 🔹 3. Data Cleaning

### Drop Identifier Column

In [None]:
df.drop(columns=['patient_id'], inplace=True)

## Exploratory Data Analysis (EDA)
### Key Insights

* Age: Risk rises noticeably after age 45
* Sex: Males show higher heart disease prevalence
* Chest pain type: Strong predictor
* Blocked vessels (`num_major_vessels`): Most influential feature
* Exercise-induced angina: Highly correlated with disease

### Correlation Highlights

* Positive correlation with `target`:
    * `oldpeak_eq_st_depression`
    * `num_major_vessels`
* Negative correlation with `target`:
    * `max_heart_rate_achieved`
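
A ranking like the one above can be read directly from the correlation matrix. The sketch below uses made-up values chosen only to show the direction of each relationship; it does not reproduce the notebook's numbers, and the `thal` labels are hypothetical.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the sign of each relationship
demo = pd.DataFrame({
    "oldpeak_eq_st_depression": [0.0, 0.2, 0.5, 1.5, 2.0, 2.6],
    "max_heart_rate_achieved": [170, 165, 160, 130, 125, 120],
    "thal": ["normal"] * 3 + ["reversible_defect"] * 3,  # text column, skipped below
    "target": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every numeric feature with the target, largest first
target_corr = (
    demo.corr(numeric_only=True)["target"]
    .drop("target")
    .sort_values(ascending=False)
)
print(target_corr)
```

On the real data, the same expression applied to `df` gives the full ranking of numeric features against `target`.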

## 🔹 4. Exploratory Data Analysis (EDA)

### Target Distribution

In [None]:
sns.countplot(x='target', data=df)
plt.title("Target Variable Distribution")
plt.show()

### Age vs Heart Disease

In [None]:
plt.figure(figsize=(8,4))
sns.histplot(df[df['target']==1]['age'], kde=True, color='red', label='Disease')
sns.histplot(df[df['target']==0]['age'], kde=True, color='green', label='No Disease')
plt.legend()
plt.title("Age Distribution")
plt.show()

### Correlation Heatmap

In [None]:
plt.figure(figsize=(14,10))
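# numeric_only=True skips non-numeric columns (e.g. `thal`, if it is stored as text),
# which would otherwise raise an error in recent pandas versions
sns.heatmap(df.corr(numeric_only=True), annot=False, cmap='coolwarm')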
plt.title("Correlation Heatmap")
plt.show()

## Feature Description

| Feature                              | Meaning                                              |
| ------------------------------------ | ---------------------------------------------------- |
| age                                  | Age of patient (years)                               |
| sex                                  | 0 = Female, 1 = Male                                 |
| chest_pain_type                      | Type of chest pain (1–4)                             |
| resting_blood_pressure               | Resting blood pressure (mm Hg)                       |
| serum_cholesterol_mg_per_dl          | Serum cholesterol (mg/dl)                            |
| fasting_blood_sugar_gt_120_mg_per_dl | Fasting blood sugar > 120 mg/dl (diabetes indicator) |
| resting_ekg_results                  | Resting ECG results                                  |
| max_heart_rate_achieved              | Maximum heart rate achieved                          |
| exercise_induced_angina              | Exercise-induced chest pain (angina)                 |
| oldpeak_eq_st_depression             | ST depression induced by exercise relative to rest   |
| slope_of_peak_exercise_st_segment    | Slope of the peak exercise ST segment                |
| num_major_vessels                    | Number of blocked major vessels                      |
| thal                                 | Blood flow status                                    |
| target                               | Heart disease (output): 0 = No, 1 = Yes              |


## 🔹 5. Feature Engineering

### One-Hot Encoding

In [None]:
df = pd.get_dummies(df, columns=['thal'], drop_first=True)
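
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up frame. The three category labels are hypothetical; the real `thal` values may differ.

```python
import pandas as pd

# Hypothetical category labels; the real `thal` values may differ
demo = pd.DataFrame({
    "age": [52, 61, 47],
    "thal": ["normal", "fixed_defect", "reversible_defect"],
})

# One indicator column per category, minus the first category in sorted order
encoded = pd.get_dummies(demo, columns=["thal"], drop_first=True)
print(encoded.columns.tolist())
```

The category that sorts first (`fixed_defect` here) gets no column of its own and is represented by zeros in the remaining indicators. This avoids perfectly collinear columns, which matters mainly for linear models such as Logistic Regression.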

## Data Preprocessing

### Steps Performed

1. Dropped `patient_id`
    * No predictive value

2. Handled categorical variables
    * One-hot encoding for `thal`

3. Train-test split
    * 80% train, 20% test, stratified on `target`

4. Feature scaling
    * `StandardScaler`, fitted on the training split only, for the scale-sensitive models (Logistic Regression, KNN)
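
The same steps can also be bundled into a single scikit-learn `Pipeline`, so that encoding and scaling are always fitted on training rows only. This is an alternative arrangement, not what the notebook does, and the eight-row frame and its `thal` labels are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature dataset: two numeric columns and one categorical column
demo = pd.DataFrame({
    "age": [41, 58, 63, 49, 55, 67, 44, 60],
    "max_heart_rate_achieved": [172, 140, 128, 160, 150, 120, 168, 132],
    "thal": ["normal", "reversible_defect", "fixed_defect", "normal",
             "normal", "reversible_defect", "normal", "fixed_defect"],
    "target": [0, 1, 1, 0, 0, 1, 0, 1],
})
X_demo = demo.drop(columns="target")
y_demo = demo["target"]

preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(drop="first"), ["thal"]),
    ("numeric", StandardScaler(), ["age", "max_heart_rate_achieved"]),
])

# Every step is fitted only on the data passed to fit(), so scaling statistics
# never come from rows the model is later evaluated on
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])
pipeline.fit(X_demo, y_demo)

# Two one-hot columns (three categories minus the dropped one) plus two scaled columns
n_features = pipeline.named_steps["preprocess"].transform(X_demo).shape[1]
print(n_features)
```

A pipeline like this can be passed to cross-validation utilities directly, which re-fit the preprocessing inside each fold.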

## 🔹 6. Train-Test Split

In [None]:
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

## 🔹 7. Feature Scaling

In [None]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Model Evaluation Metrics

* Accuracy
* Precision
* Recall
* F1-Score
* ROC-AUC Score
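
The first four metrics follow directly from the confusion-matrix counts. The sketch below works them out in plain Python; the counts are hypothetical and the function assumes no denominator is zero. ROC-AUC is left out because it needs predicted probabilities rather than counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of the patients flagged, how many are truly ill
    recall = tp / (tp + fn)      # of the truly ill patients, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a 60-patient test set
metrics = classification_metrics(tp=25, fp=5, fn=3, tn=27)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a screening setting, recall is often the metric to watch, since a false negative corresponds to a patient with heart disease who is not flagged.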

## Model Comparison Report

| Model               | Accuracy | ROC-AUC  | Remarks              |
| ------------------- | -------- | -------- | -------------------- |
| Logistic Regression | ~85%     | 0.88     | Interpretable        |
| KNN                 | ~82%     | 0.84     | Sensitive to scaling |
| Decision Tree       | ~79%     | 0.81     | Overfitting risk     |
| Random Forest       | ~88%     | 0.91     | Balanced             |
| **XGBoost**         | **~90%** | **0.93** | **Best performer**  |
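
With an 80/20 split of roughly 300 rows, the test set holds only about 60 patients, so gaps of a few points between models may be noise. Stratified k-fold cross-validation is one common way to get a steadier comparison. The sketch below runs on synthetic data from `make_classification` (an assumption, standing in for the encoded feature matrix) and compares only two of the models to stay short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix (300 rows, 13 features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

models = {
    # Scaling lives inside the pipeline so it is re-fitted within each fold
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
    for name, model in models.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean ROC-AUC {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the fold-to-fold standard deviation alongside the mean makes it easier to judge whether one model is meaningfully ahead of another.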


### Recommended Production Model

***XGBoost Classifier***

Why?

* Handles non-linear relationships
* Highest ROC-AUC among the models compared (0.93)
* Robust to noise
* Performs well on healthcare tabular data

## 🔹 8. Model Training & Evaluation Logic

### Helper Function (Reusable)

In [None]:
def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit `model` on the training split and print its test-set metrics."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)               # hard 0/1 predictions
    y_prob = model.predict_proba(X_test)[:, 1]   # probability of class 1 (disease)

    print(model.__class__.__name__)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("ROC-AUC:", roc_auc_score(y_test, y_prob))
    print(classification_report(y_test, y_pred))
    print("-" * 60)

## 🔹 9. Train Multiple Models

### Logistic Regression

In [None]:
lr = LogisticRegression()
evaluate_model(lr, X_train_scaled, X_test_scaled, y_train, y_test)

### KNN

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)
evaluate_model(knn, X_train_scaled, X_test_scaled, y_train, y_test)

### Decision Tree

In [None]:
dt = DecisionTreeClassifier(random_state=42)
evaluate_model(dt, X_train, X_test, y_train, y_test)

### Random Forest

In [None]:
rf = RandomForestClassifier(
    n_estimators=200,
    random_state=42
)
evaluate_model(rf, X_train, X_test, y_train, y_test)

### XGBoost (Best Model)

In [None]:
xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss',
    random_state=42  # row/column subsampling is random; fix the seed for reproducible results
)

evaluate_model(xgb, X_train, X_test, y_train, y_test)

## 🔹 10. Feature Importance (XGBoost)

In [None]:
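importances = pd.Series(
    xgb.feature_importances_,
    index=X.columns
).sort_values(ascending=False)

print(importances.head(10))  # a bare expression mid-cell would not be displayed

# Re-sort ascending so the most important feature is drawn at the top of the chart
importances.head(10).sort_values().plot(kind='barh')
plt.title("Top 10 Important Features")
plt.show()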
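
Tree-based `feature_importances_` are derived from the trees built during training, so a cross-check on held-out data can be useful. One model-agnostic option is scikit-learn's `permutation_importance`. The sketch below is an illustration on synthetic data, with a `RandomForestClassifier` standing in for the XGBoost model; the column names `signal` and `noise` are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first column determines the label, the second is noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and record the drop in ROC-AUC
result = permutation_importance(
    model, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=42
)
perm_importance = dict(zip(["signal", "noise"], result.importances_mean))
print(perm_importance)
```

A feature that ranks high in both the built-in importances and the permutation check is a more trustworthy candidate for the "most influential" label than one that appears in only one of them.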

## 🔹 11. Confusion Matrix (Best Model)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(
    xgb, X_test, y_test, cmap='Blues'
)
plt.show()
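
The confusion matrix above reflects the default 0.5 probability cut-off used by `predict`. For screening, a lower cut-off may be worth considering, since it trades some precision for fewer missed cases. The sketch below shows that trade-off in plain Python; the labels and probabilities are hypothetical, and `recall_precision_at` is an illustrative helper, not part of the notebook.

```python
def recall_precision_at(threshold, y_true, y_prob):
    """Return (recall, precision) when scores >= threshold are flagged as disease."""
    flagged = [p >= threshold for p in y_prob]
    tp = sum(1 for f, y in zip(flagged, y_true) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, y_true) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, y_true) if not f and y == 1)
    return tp / (tp + fn), tp / (tp + fp)


# Hypothetical labels and predicted probabilities for ten patients
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.92, 0.81, 0.64, 0.45, 0.35, 0.55, 0.40, 0.22, 0.15, 0.05]

default_cut = recall_precision_at(0.5, y_true, y_prob)
lower_cut = recall_precision_at(0.3, y_true, y_prob)
print("threshold 0.5 (recall, precision):", default_cut)
print("threshold 0.3 (recall, precision):", lower_cut)
```

In this toy example, lowering the cut-off from 0.5 to 0.3 raises recall from 3/5 to 5/5 while precision slips from 3/4 to 5/7. Any real threshold would need to be chosen on validation data together with clinical input.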

## 🔹 12. Final Conclusion

> **XGBoost achieved the highest ROC-AUC and accuracy.**
>
> **The model can be deployed in hospitals to detect high-risk patients early.**

## 🔹 13. Hospital Recommendations

* Integrate the model with outpatient department (OPD) systems
* Prioritize high-risk patients for follow-up
* Offer preventive diagnostics
* Recommend lifestyle interventions
* Retrain the model periodically