
# **Essential Machine Learning for Medicine - Day 3**
### **Using the Heart Disease UCI Dataset for Regression, Classification, and Clustering**
In this notebook, we will:
1. Load and clean an open-source dataset (the processed Cleveland subset of the Heart Disease UCI dataset).
2. Perform exploratory data analysis (EDA) and visualization.
3. Train a regression model.
4. Train a classification model.
5. Train a clustering model.


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
df = pd.read_csv(url, names=columns, na_values="?")

# Handle missing values
df = df.dropna()

# Convert target to binary (1: disease, 0: no disease); the raw column grades severity from 0 to 4
df["target"] = (df["target"] > 0).astype(int)

# Preview the first five rows
df.head()
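
To see what the loading cell does without downloading anything, the sketch below runs the same three steps (`na_values="?"`, `dropna()`, binarizing `target`) on three made-up rows. The values and the reduced four-column layout are illustrative assumptions, not rows from the real file.

```python
import io
import pandas as pd

# Three made-up rows (assumption: same comma-separated layout, fewer columns).
# "?" marks a missing value, as in the processed Cleveland file.
sample = io.StringIO(
    "63.0,1.0,233.0,0\n"
    "67.0,1.0,?,2\n"
    "41.0,0.0,204.0,4\n"
)
demo = pd.read_csv(sample, names=["age", "sex", "chol", "target"], na_values="?")

rows_before = len(demo)
demo = demo.dropna().copy()   # removes the row whose chol was "?"
rows_after = len(demo)

# Any severity grade above 0 counts as disease
demo["target"] = (demo["target"] > 0).astype(int)

print(f"Rows dropped: {rows_before - rows_after}")
print(demo)
```

Printing the number of dropped rows is a cheap habit: on a small clinical dataset it is worth knowing how many patients the cleaning step removed.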


## **Exploratory Data Analysis and Visualization**

In [None]:

# Summary statistics
df.describe()


In [None]:

# Verify that no missing values remain after dropna()
df.isnull().sum()


In [None]:

plt.figure(figsize=(10,6))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation Heatmap")
plt.show()
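
A heatmap of 14 columns can be hard to read at a glance. As a complement, the sketch below ranks features by the absolute value of their correlation with `target`. It uses a synthetic table with hypothetical column names (`strong`, `weak`, `noise`) so that it runs on its own; in the notebook the same expression could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): one feature tracks the label closely,
# one weakly, and one not at all.
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "strong": target + rng.normal(0, 0.3, size=n),
    "weak": target + rng.normal(0, 3.0, size=n),
    "noise": rng.normal(0, 1.0, size=n),
    "target": target,
})

# Correlation of every feature with the label, strongest first
ranking = (
    toy.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

Pearson correlation only captures linear association, so a low value here does not rule out a useful non-linear relationship.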


In [None]:

df.hist(figsize=(12,10), bins=20)
plt.suptitle("Feature Distributions", fontsize=16)
plt.show()


## **Train a Regression Model - Predicting Cholesterol Levels**

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Define features and target for regression
X_reg = df.drop(columns=["chol", "target"])
y_reg = df["chol"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Train the model
reg_model = LinearRegression()
reg_model.fit(X_train, y_train)

# Predictions
y_pred = reg_model.predict(X_test)

# Evaluation
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f"Mean Absolute Error: {mae:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
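
MAE and MSE are easier to judge next to a reference point. The sketch below compares a linear model with a baseline that always predicts the training mean, and reports RMSE (same units as the target) and R². The data are synthetic and the coefficients are arbitrary assumptions; in the notebook the same comparison could be run on `X_train`, `X_test`, `y_train`, and `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (assumption): only the first feature matters
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 240 + 15 * X[:, 0] + rng.normal(0, 10, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: always predict the mean of the training targets
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

rmse_baseline = np.sqrt(mean_squared_error(y_te, baseline.predict(X_te)))
rmse_model = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
r2_model = r2_score(y_te, model.predict(X_te))

print(f"Baseline RMSE: {rmse_baseline:.2f}")
print(f"Model RMSE:    {rmse_model:.2f}")
print(f"Model R^2:     {r2_model:.2f}")
```

If the model's RMSE is close to the baseline's, the features carry little linear information about the target, whatever the absolute error looks like.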


In [None]:

plt.figure(figsize=(8,6))
plt.scatter(y_test, y_pred, alpha=0.7, color='blue', label="Predictions")
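# Draw the y = x reference line between the smallest and largest actual values
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, color='red', label="Ideal Fit")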
plt.xlabel("Actual Cholesterol Levels")
plt.ylabel("Predicted Cholesterol Levels")
plt.title("Actual vs Predicted Cholesterol Levels")
plt.legend()
plt.show()


## **Train a Classification Model - Predicting Heart Disease Presence**

In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define features and target for classification
X_cls = df.drop(columns=["target"])
y_cls = df["target"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_cls, y_cls, test_size=0.2, random_state=42)

# Train the model
cls_model = LogisticRegression(max_iter=1000)
cls_model.fit(X_train, y_train)

# Predictions
y_pred_cls = cls_model.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, y_pred_cls)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", classification_report(y_test, y_pred_cls))

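With a dataset of roughly 300 rows, a 20% hold-out contains only about 60 patients, so a single accuracy figure can vary noticeably with the split. The sketch below shows one common alternative: a pipeline that standardizes the features and is scored with 5-fold cross-validation. It runs on synthetic data from `make_classification` (an assumption made so the example is self-contained); in the notebook, `X_cls` and `y_cls` could take the place of `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 13 features, like the heart disease table
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=42
)

# Scaling lives inside the pipeline, so each fold fits the scaler
# on its own training part only (no leakage from the validation part)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"Accuracy per fold: {scores.round(2)}")
print(f"Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```

The spread across folds gives a rough sense of how much the reported accuracy depends on which patients land in the test set.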

In [None]:

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_cls)
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=["No Disease", "Disease"], yticklabels=["No Disease", "Disease"])
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix for Heart Disease Prediction")
plt.show()
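
In a medical setting the confusion matrix is usually summarized as sensitivity (the share of diseased patients that were detected) and specificity (the share of healthy patients correctly cleared). The sketch below derives both from a small hand-made set of labels; the arrays are illustrative assumptions, and in the notebook `y_test` and `y_pred_cls` could be used instead.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels (assumption): 1 = disease, 0 = no disease
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

sensitivity = tp / (tp + fn)   # recall for the disease class
specificity = tn / (tn + fp)   # recall for the no-disease class

print(f"Sensitivity: {sensitivity:.2f}")   # prints 0.80
print(f"Specificity: {specificity:.2f}")   # prints 0.80
```

Which of the two matters more depends on the clinical use: a screening test typically favors sensitivity, while a confirmatory test favors specificity.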


## **Train a Clustering Model - Patient Risk Clustering**

In [None]:

from sklearn.cluster import KMeans

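# Define features for clustering; exclude the label and any cluster column
# left over from a previous run of this cell
X_clust = df.drop(columns=["target", "cluster"], errors="ignore")

# Train K-Means model with 3 clusters.
# Note: the features are unscaled, so large-range columns such as chol dominate the distances.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df["cluster"] = kmeans.fit_predict(X_clust)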

# Display cluster counts
df["cluster"].value_counts()
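
K-Means relies on Euclidean distance, so features measured on large scales can dominate the result, and the choice of three clusters is itself an assumption. The sketch below standardizes the features first and compares several values of `k` using the silhouette score. It runs on synthetic blobs (an assumption made so the example is self-contained); in the notebook, `X_clust` could take the place of `X`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): three groups in four dimensions
X, _ = make_blobs(
    n_samples=300, centers=3, n_features=4, cluster_std=1.0, random_state=42
)
X[:, 0] *= 100  # mimic one feature recorded on a much larger scale

# Standardize so that every feature contributes comparably to distances
X_scaled = StandardScaler().fit_transform(X)

# Silhouette score for each candidate number of clusters (higher is better)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print(f"Best k by silhouette: {best_k}")
```

The silhouette score is only one heuristic; on real clinical data the scores for different `k` are often close, and domain knowledge should inform the final choice.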


In [None]:

plt.figure(figsize=(8,6))
sns.scatterplot(x=df["age"], y=df["thalach"], hue=df["cluster"], palette="viridis")
plt.xlabel("Age")
plt.ylabel("Max Heart Rate (thalach)")
plt.title("Patient Clusters based on Age and Max Heart Rate")
plt.show()
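
One way to check whether the clusters say anything about risk is to compare the disease rate in each cluster. The sketch below does this on a tiny hand-made table with hypothetical `cluster` and `target` values; in the notebook the same `groupby` could be applied to `df` once the `cluster` column exists.

```python
import pandas as pd

# Hand-made example (assumption): 8 patients, 3 clusters
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1, 2, 2],
    "target":  [0, 0, 1, 1, 1, 1, 0, 1],
})

# Mean of a 0/1 label is the share of patients with disease;
# count shows how many patients each rate is based on
rates = toy.groupby("cluster")["target"].agg(["mean", "count"])
rates = rates.rename(columns={"mean": "disease_rate", "count": "n_patients"})
print(rates)
```

If the disease rate is similar in every cluster, the grouping reflects structure in the features but not risk in the sense of the `target` column.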



## **Summary**
We applied three machine learning models to the Heart Disease UCI dataset:

- **Regression**: Predicted cholesterol levels from the other patient attributes with linear regression.
- **Classification**: Predicted the presence of heart disease with logistic regression.
- **Clustering**: Grouped patients into three K-Means clusters based on clinical features.

Each step followed the same pattern: prepare the data, fit a model, and inspect the results with metrics or plots.

This notebook provides a hands-on introduction to fundamental machine learning techniques in medicine.
