# Task 3 — Heart Disease Prediction (Classification)

This notebook trains a classifier to predict risk of heart disease from tabular health data.

**Dataset:** The widely used Kaggle/UCI-style `heart.csv` (303 rows). Place the file next to this notebook.

**How to run (locally):**
1. Download `heart.csv` (e.g., from Kaggle's *Heart Disease UCI* dataset).
2. Put `heart.csv` in the same folder as this notebook.
3. Install requirements: `pip install scikit-learn matplotlib pandas numpy`
4. Run all cells.

**Evaluation:** Accuracy, ROC-AUC, confusion matrix, and ROC curve, plus feature importances for interpretation.


In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

pd.set_option('display.max_columns', None)

In [None]:
# Load data (expects 'heart.csv' in the same directory)
df = pd.read_csv('heart.csv')
print('Shape:', df.shape)
df.head()

In [None]:
# Basic EDA
df.info()  # prints its summary directly and returns None, so it is not wrapped in display()
display(df.describe())
print('Missing values per column:')
print(df.isna().sum())

In [None]:
# Train/test split (stratified to preserve the class ratio), then fit and score two models
target_col = 'target'  # adjust if your file uses a different name
X = df.drop(columns=[target_col])
y = df[target_col]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize all features for Logistic Regression; the scaler is fit on the training split only to avoid leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model 1: Logistic Regression
logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train_scaled, y_train)
proba_lr = logreg.predict_proba(X_test_scaled)[:,1]
pred_lr = (proba_lr >= 0.5).astype(int)
acc_lr = accuracy_score(y_test, pred_lr)
roc_lr = roc_auc_score(y_test, proba_lr)
print(f'Logistic Regression -> Accuracy: {acc_lr:.3f}, ROC-AUC: {roc_lr:.3f}')

# Model 2: Decision Tree
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)
proba_tree = tree.predict_proba(X_test)[:,1]
pred_tree = (proba_tree >= 0.5).astype(int)
acc_tree = accuracy_score(y_test, pred_tree)
roc_tree = roc_auc_score(y_test, proba_tree)
print(f'Decision Tree -> Accuracy: {acc_tree:.3f}, ROC-AUC: {roc_tree:.3f}')

In [None]:
# Confusion matrix for the model with the higher test ROC-AUC (ties go to logistic regression)
best_proba = proba_lr if roc_lr >= roc_tree else proba_tree
best_pred = pred_lr if roc_lr >= roc_tree else pred_tree
cm = confusion_matrix(y_test, best_pred)
print('Confusion Matrix:\n', cm)
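
The raw confusion matrix is easier to read once it is turned into rates. The sketch below assumes scikit-learn's binary layout `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predictions). The helper name `rates_from_confusion_matrix` and the example counts are made up for illustration and are not results from `heart.csv`.

```python
import numpy as np


def rates_from_confusion_matrix(cm):
    """Derive sensitivity, specificity, and precision from a 2x2 confusion matrix.

    Assumes the layout [[TN, FP], [FN, TP]] and that no denominator is zero.
    """
    tn, fp, fn, tp = np.asarray(cm).ravel()
    return {
        'sensitivity': float(tp / (tp + fn)),  # share of actual positives that were caught
        'specificity': float(tn / (tn + fp)),  # share of actual negatives that were cleared
        'precision': float(tp / (tp + fp)),    # share of positive predictions that were right
    }


# Made-up counts for illustration; in the notebook, pass `cm` instead.
example_cm = [[20, 8], [4, 29]]
for name, value in rates_from_confusion_matrix(example_cm).items():
    print(f'{name}: {value:.3f}')
```

For a screening-style task such as this one, sensitivity is often the rate to watch, since a false negative means a missed case.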

In [None]:
# ROC Curve (best model)
fpr, tpr, thr = roc_curve(y_test, best_proba)
plt.figure()
plt.plot(fpr, tpr, label=f'ROC (AUC = {max(roc_lr, roc_tree):.3f})')
plt.plot([0, 1], [0, 1], '--', label='Chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
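
The arrays returned by `roc_curve` can also be used to look at alternatives to the fixed 0.5 cutoff. One common heuristic, not used in this notebook, is Youden's J statistic (J = TPR - FPR). The sketch below is a minimal version; the helper name `youden_threshold` and the ROC points are hypothetical.

```python
import numpy as np


def youden_threshold(fpr, tpr, thresholds):
    """Return (threshold, J) at the ROC point that maximizes J = TPR - FPR."""
    j = np.asarray(tpr) - np.asarray(fpr)
    best = int(np.argmax(j))  # first maximum if several points tie
    return float(thresholds[best]), float(j[best])


# Hand-written ROC points for illustration; in the notebook, pass fpr, tpr, thr.
fpr_demo = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr_demo = [0.0, 0.5, 0.5, 1.0, 1.0]
thr_demo = [np.inf, 0.8, 0.4, 0.35, 0.1]

best_thr, best_j = youden_threshold(fpr_demo, tpr_demo, thr_demo)
print(f'threshold = {best_thr}, J = {best_j}')
```

A threshold picked this way on the test split is likely to look better than it would on new data, so it is safer to choose it on a validation split.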

In [None]:
# Feature importance: logistic regression coefficients (fit on standardized features) and decision tree importances
feat_names = X.columns
coef_series = pd.Series(logreg.coef_[0], index=feat_names).sort_values(key=abs, ascending=False)
imp_series = pd.Series(tree.feature_importances_, index=feat_names).sort_values(ascending=False)

print('Top Logistic Regression coefficients (sorted by absolute value):\n', coef_series.head(10))
print('\nDecision Tree importances:\n', imp_series.head(10))
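
Logistic regression coefficients are log-odds, so exponentiating them gives odds ratios, which are often easier to interpret. Because the model here is fit on standardized features, each ratio corresponds to a one-standard-deviation increase in that feature. The sketch below uses a hypothetical helper `odds_ratios` and made-up coefficients, not values fitted on `heart.csv`.

```python
import numpy as np


def odds_ratios(coefs):
    """Map each feature name to exp(coefficient), i.e. its odds ratio."""
    return {name: float(np.exp(c)) for name, c in coefs.items()}


# Made-up coefficients for illustration; in the notebook, pass coef_series.to_dict().
example_coefs = {'feature_a': 0.69, 'feature_b': -0.40, 'feature_c': 0.0}
for name, ratio in odds_ratios(example_coefs).items():
    print(f'{name}: odds ratio per +1 SD = {ratio:.2f}')
```

A ratio above 1 means the feature raises the predicted odds of the positive class, a ratio below 1 lowers them, and a ratio of exactly 1 means no effect.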

## Insights
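- Compare the test ROC-AUC of logistic regression and the decision tree and prefer the higher one. With 303 rows and a 20% test split, the test set holds only about 61 rows, so small gaps may not be meaningful.
- Review the largest coefficients (by absolute value) and tree importances to see which features drive the predictions. The coefficients refer to standardized features, so their magnitudes are comparable across features.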
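
Since a single split of a small dataset gives a noisy comparison, one option is to compare the two models with stratified k-fold cross-validation. The sketch below is one possible setup, not part of the original notebook: `compare_models_cv` is a hypothetical helper, and the data comes from `make_classification` as a synthetic stand-in (303 rows, an arbitrary 13 features) so the block runs without `heart.csv`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def compare_models_cv(X, y, n_splits=5, seed=42):
    """Return {model name: (mean ROC-AUC, std)} from stratified k-fold CV."""
    models = {
        # The scaler sits inside the pipeline so it is refit on each training fold.
        'Logistic Regression': make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=200)
        ),
        'Decision Tree': DecisionTreeClassifier(max_depth=4, random_state=seed),
    }
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = {}
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
        results[name] = (scores.mean(), scores.std())
    return results


# Synthetic stand-in data; in the notebook, pass X and y instead.
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)
for name, (mean_auc, std_auc) in compare_models_cv(X_demo, y_demo).items():
    print(f'{name}: ROC-AUC = {mean_auc:.3f} +/- {std_auc:.3f}')
```

If the gap between the mean scores is smaller than the fold-to-fold spread, the two models are probably not distinguishable on this amount of data.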
