# Logistic Regression for Heart Disease Prediction

Dataset Source: Kaggle (Neurocipher – Heart Disease Dataset)

This notebook implements a machine learning pipeline, from data loading and cleaning through model training and evaluation, using logistic regression for binary classification.

## Background: AI, ML, DL, and Data Science

- **AI**: Systems that perform tasks normally associated with human intelligence.
- **ML**: A subset of AI in which algorithms learn patterns from data.
- **DL**: A subset of ML that uses deep neural networks.
- **Data Science**: Extracting insights from data using ML and statistics.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

## Data Retrieval and Collection

In [None]:
df = pd.read_csv('heartdisease.csv')
df.head()

df.shape, df.columns

## Data Cleaning

In [None]:
df.isnull().sum()

# Remove rows with invalid cholesterol values (zero or negative readings)
df = df[df['Cholesterol'] > 0]

df['HeartDisease'].unique()
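
Before dropping rows, it can help to check how many the filter removes and whether it shifts the class balance. The sketch below does this on a small hypothetical frame; the values are made up for illustration and are not taken from `heartdisease.csv`.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the Kaggle dataset).
toy = pd.DataFrame({
    'Cholesterol': [289, 0, 180, 0, 214],
    'HeartDisease': [0, 1, 0, 1, 1],
})

# Flag the rows that the `Cholesterol > 0` filter would drop.
invalid_mask = toy['Cholesterol'] <= 0
n_invalid = int(invalid_mask.sum())

cleaned = toy[~invalid_mask]

# Share of positive cases before and after filtering.
positive_rate_before = toy['HeartDisease'].mean()
positive_rate_after = cleaned['HeartDisease'].mean()

print('Rows removed:', n_invalid)
print('Positive rate before:', positive_rate_before)
print('Positive rate after:', positive_rate_after)
```

If the positive rate changes noticeably after filtering, the removed rows were not evenly spread across the two classes, which is worth keeping in mind when reading the metrics later on.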

## Task 1: Single Feature Logistic Regression (Cholesterol)

In [None]:
X1 = df[['Cholesterol']]
y = df['HeartDisease']

X1_train, X1_test, y_train, y_test = train_test_split(
    X1, y, test_size=0.2, random_state=42
)

model1 = LogisticRegression()
model1.fit(X1_train, y_train)

In [None]:
y1_pred = model1.predict(X1_test)

print('Accuracy:', accuracy_score(y_test, y1_pred))
print('Precision:', precision_score(y_test, y1_pred))
print('Recall:', recall_score(y_test, y1_pred))
print('F1 Score:', f1_score(y_test, y1_pred))

print('\nConfusion Matrix:\n', confusion_matrix(y_test, y1_pred))
print('\nClassification Report:\n', classification_report(y_test, y1_pred))
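
The four scores printed above all derive from the confusion matrix. As a minimal sketch, the following recomputes them by hand on hypothetical labels (invented for illustration, not model output from this notebook) and checks them against scikit-learn.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels for illustration only.
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# For binary labels {0, 1}, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

In a screening setting, recall is often the score to watch, because a false negative corresponds to a missed case of heart disease.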

### Sigmoid Curve

In [None]:
# Build the grid as a DataFrame so the column name matches the one used in training
X_range = pd.DataFrame({
    'Cholesterol': np.linspace(df['Cholesterol'].min(), df['Cholesterol'].max(), 300)
})
probs = model1.predict_proba(X_range)[:, 1]

plt.figure(figsize=(7,5))
plt.scatter(df['Cholesterol'], df['HeartDisease'], alpha=0.3)
plt.plot(X_range['Cholesterol'], probs)
plt.xlabel('Cholesterol')
plt.ylabel('Probability of Heart Disease')
plt.title('Sigmoid Curve (Task 1)')
plt.show()
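
The curve plotted above is the sigmoid function applied to a linear score, p = 1 / (1 + exp(-(b0 + b1 * x))). The sketch below checks this relationship on synthetic single-feature data: the data, the variable names, and the helper `sigmoid` are assumptions made for illustration and are not part of the notebook's dataset or results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic single-feature data for illustration only (not the heart disease data).
rng = np.random.default_rng(0)
x_demo = rng.normal(loc=240, scale=50, size=200).reshape(-1, 1)
# Hypothetical labels: larger x makes class 1 more likely.
label_demo = (x_demo[:, 0] + rng.normal(scale=40, size=200) > 240).astype(int)

demo_model = LogisticRegression(max_iter=1000).fit(x_demo, label_demo)

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

b0 = demo_model.intercept_[0]   # fitted intercept
b1 = demo_model.coef_[0, 0]     # fitted slope for the single feature

# Probability of class 1 computed by hand and by scikit-learn.
manual_probs = sigmoid(b0 + b1 * x_demo[:, 0])
sklearn_probs = demo_model.predict_proba(x_demo)[:, 1]

print(np.allclose(manual_probs, sklearn_probs))
```

When the fitted slope is small relative to the feature's range, only a nearly linear segment of the S-shape is visible over the observed values.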

## Task 2: Multi-Feature Logistic Regression

In [None]:
X2 = df.drop('HeartDisease', axis=1)
y = df['HeartDisease']

cat_cols = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
num_cols = [c for c in X2.columns if c not in cat_cols]

preprocess = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(drop='first'), cat_cols)
    ]
)

X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2, y, test_size=0.2, random_state=42
)

model2 = Pipeline(steps=[
    ('preprocess', preprocess),
    ('classifier', LogisticRegression(max_iter=1000))
])

model2.fit(X2_train, y2_train)

In [None]:
y2_pred = model2.predict(X2_test)

print('Accuracy:', accuracy_score(y2_test, y2_pred))
print('Precision:', precision_score(y2_test, y2_pred))
print('Recall:', recall_score(y2_test, y2_pred))
print('F1 Score:', f1_score(y2_test, y2_pred))

print('\nConfusion Matrix:\n', confusion_matrix(y2_test, y2_pred))
print('\nClassification Report:\n', classification_report(y2_test, y2_pred))

## Conclusion

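- The multi-feature model (Task 2) performs better than the single-feature model (Task 1) on the test-set metrics reported above.
- Logistic regression models the probability of the positive class; `predict` turns that probability into a class label using a 0.5 threshold.
- Using more clinical features than cholesterol alone improves prediction accuracy on this dataset.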
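
The link between predicted probabilities and class labels can be illustrated with a short sketch. It uses synthetic data and hypothetical variable names, and the 0.3 threshold is an arbitrary example value rather than a recommendation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data for illustration only (not the heart disease data).
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1]
         + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)

# Probability of the positive class for each row.
positive_probs = clf.predict_proba(X_toy)[:, 1]

# predict() corresponds to thresholding that probability at 0.5.
default_labels = clf.predict(X_toy)
manual_labels = (positive_probs > 0.5).astype(int)

# A lower threshold flags more rows as positive, trading precision for recall.
sensitive_labels = (positive_probs >= 0.3).astype(int)

print(np.array_equal(default_labels, manual_labels))
print(default_labels.sum(), sensitive_labels.sum())
```

Choosing a threshold other than 0.5 is a design decision that depends on the relative cost of false positives and false negatives.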