# Heart Disease Prediction

## Objective
Predict whether a person is at risk of heart disease based on their health data.

This notebook includes:
- Dataset loading and preprocessing
- Data visualization and exploration
- Training models (Logistic Regression & Decision Tree)
- Evaluation metrics (Accuracy, Confusion Matrix, ROC Curve)
- Feature importance analysis

In [None]:
%matplotlib inline
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc


In [None]:
df = pd.read_csv("heart.csv")
df.head()


In [None]:
df.isnull().sum()


In [None]:
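# Separate the features from the target column
X = df.drop('HeartDisease', axis=1)
y = df['HeartDisease']

# One-hot encode the categorical columns; drop_first removes one redundant dummy per column
categorical_cols = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)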
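
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The column names come from the notebook, but the three rows are made-up values for illustration, not data from `heart.csv`.

```python
import pandas as pd

# Toy rows with hypothetical values; only the column names match the notebook.
toy = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "ExerciseAngina": ["N", "Y", "N"],
})

# Each categorical column becomes indicator columns; the first category
# (alphabetically) is dropped, so a two-level column yields a single dummy.
encoded = pd.get_dummies(toy, columns=["Sex", "ExerciseAngina"], drop_first=True)
encoded_columns = list(encoded.columns)
print(encoded_columns)
```

A two-level column such as `Sex` collapses to one indicator (`Sex_M`), with the dropped level (`F`) acting as the baseline.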


In [None]:
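# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then reuse its statistics on the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)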
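
One optional way to keep the "fit on train, transform on test" discipline automatic is to bundle the scaler and the model in a scikit-learn pipeline. The sketch below is an alternative, not what this notebook does; it runs on synthetic data from `make_classification`, and the names `X_demo`, `y_demo`, and `pipe` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data (assumption: 6 numeric features).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# fit() learns the scaling statistics from the training rows only;
# score() applies the same scaling to the test rows before predicting.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
demo_accuracy = pipe.score(X_te, y_te)
print(f"Pipeline accuracy on synthetic data: {demo_accuracy:.2f}")
```

With a pipeline, the scaler can never be fitted on test rows by accident, and the same object can be passed to cross-validation helpers.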


In [None]:
# Logistic Regression
log_model = LogisticRegression()
log_model.fit(X_train, y_train)
y_pred_log = log_model.predict(X_test)

# Decision Tree
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
y_pred_tree = tree.predict(X_test)


In [None]:
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log))
print(classification_report(y_test, y_pred_log))

print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_tree))
print(classification_report(y_test, y_pred_tree))


In [None]:
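# Rows are the actual classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred_log)
sns.heatmap(cm, annot=True, cmap="Blues", fmt="d")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - Logistic Regression")
plt.show()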
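
The four cells of the confusion matrix are enough to recompute the headline metrics by hand. The sketch below assumes scikit-learn's binary layout `[[TN, FP], [FN, TP]]`. The helper `metrics_from_confusion` and the example counts are hypothetical, not results from this dataset.

```python
def metrics_from_confusion(cm):
    """Derive summary metrics from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),    # of predicted positives, how many are real
        "recall": tp / (tp + fn),       # of real positives, how many were found
        "specificity": tn / (tn + fp),  # of real negatives, how many were cleared
    }

# Hypothetical counts for illustration only; not results from heart.csv.
example_cm = [[50, 10], [5, 35]]
example_metrics = metrics_from_confusion(example_cm)
print(example_metrics)
```

In the notebook, passing `cm` from the cell above should work the same way, provided both classes appear in the test predictions so that no denominator is zero.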


In [None]:
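# Predicted probability of the positive class (heart disease)
y_prob = log_model.predict_proba(X_test)[:,1]
# Sweep the decision threshold to trace false-positive vs. true-positive rates
fpr, tpr, th = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label="AUC = %0.2f" % roc_auc)
plt.plot([0,1],[0,1],'r--')  # chance-level baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Logistic Regression")
plt.legend()
plt.show()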
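
To see what `roc_curve` computes, it can help to run it on four hand-made scores. The labels and scores below are illustrative values, not model output.

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Four illustrative examples: two negatives, two positives.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Each threshold produces one (FPR, TPR) point on the curve.
toy_fpr, toy_tpr, toy_thresholds = roc_curve(y_true, scores)
for f, t, th in zip(toy_fpr, toy_tpr, toy_thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")

# AUC equals the fraction of (positive, negative) pairs ranked correctly:
# 3 of the 4 pairs here, since only (0.35 vs 0.4) is out of order.
toy_auc = roc_auc_score(y_true, scores)
print("AUC:", toy_auc)
```

The pairwise reading of AUC is why a value of 0.5 corresponds to the dashed chance line in the plot above.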


In [None]:
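# Use absolute coefficient size as a rough importance score; the values are
# comparable across features because the inputs were standardized.
importance = abs(log_model.coef_[0])
feature_importance = pd.Series(importance, index=X.columns)
feature_importance.sort_values().plot(kind='barh', figsize=(8,6))
plt.title("Feature Importance - Logistic Regression")
plt.show()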
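
The decision tree exposes its own importance scores through `feature_importances_`, which could be compared against the coefficient ranking above. The sketch below uses synthetic data with placeholder feature names, and `max_depth=4` is an arbitrary choice for the demo.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the feature names are placeholders.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(5)]

demo_tree = DecisionTreeClassifier(max_depth=4, random_state=42)
demo_tree.fit(X_syn, y_syn)

# Impurity-based importances: non-negative and normalized to sum to 1.
tree_importance = pd.Series(
    demo_tree.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(tree_importance)
```

In this notebook the equivalent call would be `pd.Series(tree.feature_importances_, index=X.columns)`, since `tree` was fitted on columns in the same order as `X`.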


## Conclusion

- Logistic Regression and Decision Tree models were trained to predict heart disease.
- Key features affecting risk: Age, ChestPainType, MaxHR, ExerciseAngina.
- Logistic Regression achieved ~XX% accuracy, Decision Tree achieved ~XX%.
- Both models predict whether a person is at risk from health data; the test-set metrics above show how reliable those predictions are.
