Steps:

1. Load the dataset.
2. Preprocess the data.
3. Train a logistic regression model.
4. Apply a classification threshold and assign risk levels.
5. Evaluate the model and display predictions with risk levels.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

Step 1: Load the dataset

In [5]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", 
           "oldpeak", "slope", "ca", "thal", "target"]

# Load data
data = pd.read_csv(url, header=None, names=columns, na_values="?")

# Drop rows with missing values
data = data.dropna()

# Binarize the target: 0 = no disease, 1-4 = disease present (mapped to 1)
data['target'] = (data['target'] > 0).astype(int)

Step 2: Preprocess the data

In [8]:
X = data.drop('target', axis=1)
y = data['target']

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
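
Note that the cell above fits the scaler on all rows before splitting, so test-set statistics influence the scaling. A common alternative is to split first and compute the mean and standard deviation from the training rows only. The sketch below illustrates that idea in plain Python; the helper names (`fit_standardizer`, `standardize`) and the age values are hypothetical and not part of the notebook.

```python
from statistics import mean, pstdev

def fit_standardizer(column):
    """Return (mean, std) computed from the given (training) values only."""
    mu = mean(column)
    sigma = pstdev(column)  # population std (ddof=0), the convention StandardScaler uses
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for constant columns
    return mu, sigma

def standardize(column, mu, sigma):
    """Scale values with statistics that were fitted elsewhere."""
    return [(value - mu) / sigma for value in column]

# Hypothetical ages, already split into train and test
train_age = [50.0, 60.0, 70.0]
test_age = [65.0, 40.0]

mu, sigma = fit_standardizer(train_age)               # fit on training data only
train_scaled = standardize(train_age, mu, sigma)
test_scaled = standardize(test_age, mu, sigma)        # reuse the training statistics
```

With scikit-learn the same pattern would be to call `train_test_split` on the unscaled `X`, then `scaler.fit_transform(X_train)` followed by `scaler.transform(X_test)`. Doing so may shift the reported numbers slightly.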

Step 3: Train Logistic Regression model

In [11]:
model = LogisticRegression()
model.fit(X_train, y_train)
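
For intuition about what the fitted model produces, binary logistic regression turns a weighted sum of the (scaled) features into a probability with the sigmoid function. The sketch below shows that calculation by hand; the weights, intercept, and feature values are made up for illustration and are not the coefficients learned in this notebook.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, intercept):
    """Probability of the positive class for one sample."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical two-feature model (not the fitted coefficients)
weights = [0.8, -0.5]
intercept = -0.2
patient = [1.0, 0.4]  # hypothetical standardized feature values

prob = predict_probability(patient, weights, intercept)
print(f"Predicted probability: {prob:.4f}")
```

In the notebook, the learned values are available as `model.coef_` and `model.intercept_` after `fit`.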

Step 4: Predictions and threshold for risk levels

In [14]:
threshold = 0.5
y_pred_prob = model.predict_proba(X_test)[:, 1]
y_pred = (y_pred_prob >= threshold).astype(int)

# Risk level based on probability
def risk_level(prob):
    """Map a predicted probability to a risk band."""
    if prob < 0.4:
        return "Low Risk"
    elif prob < 0.7:  # the first branch already guarantees prob >= 0.4
        return "Moderate Risk"
    else:
        return "High Risk"

# Prepare final output
results = pd.DataFrame({
    'Probability': y_pred_prob,
    'Heart Disease': ["Found" if pred else "Not Found" for pred in y_pred],
    'Risk Level': [risk_level(prob) for prob in y_pred_prob]
})
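
The classification threshold (0.5) and the risk bands (0.4 and 0.7) are independent cut-offs, so the two labels do not always move together. The standalone sketch below reapplies both rules to four probabilities taken from the notebook's printed results; the `label` helper is a hypothetical name introduced here for illustration.

```python
THRESHOLD = 0.5  # same classification threshold as in the notebook

def risk_level(prob):
    """Same bands as the notebook: <0.4 low, <0.7 moderate, otherwise high."""
    if prob < 0.4:
        return "Low Risk"
    if prob < 0.7:
        return "Moderate Risk"
    return "High Risk"

def label(prob, threshold=THRESHOLD):
    """Hypothetical helper: binary label from a probability."""
    return "Found" if prob >= threshold else "Not Found"

# Probabilities copied from the notebook's printed predictions
probabilities = [0.253681, 0.400738, 0.609405, 0.784103]

for p in probabilities:
    print(f"{p:.6f}  {label(p):<9}  {risk_level(p)}")
```

A probability of 0.400738 is labelled "Not Found" yet falls in the Moderate Risk band, while 0.609405 is "Found" in the same band. This is because 0.5 lies inside the 0.4 to 0.7 range.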

Step 5: Evaluate and display results

In [30]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nSample Predictions:")
print(results)

Accuracy: 0.8666666666666667

Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.89      0.89        36
           1       0.83      0.83      0.83        24

    accuracy                           0.87        60
   macro avg       0.86      0.86      0.86        60
weighted avg       0.87      0.87      0.87        60


Sample Predictions:
    Probability Heart Disease     Risk Level
0      0.018711     Not Found       Low Risk
1      0.253681     Not Found       Low Risk
2      0.011895     Not Found       Low Risk
3      0.991753         Found      High Risk
4      0.049528     Not Found       Low Risk
5      0.400738     Not Found  Moderate Risk
6      0.121717     Not Found       Low Risk
7      0.609405         Found  Moderate Risk
8      0.784103         Found      High Risk
9      0.169022     Not Found       Low Risk
10     0.622870         Found  Moderate Risk
11     0.120420     Not Found       Low Risk
12     0.0124

Explanation:

1. Thresholds:
Probability < 0.4: Low Risk
0.4 ≤ Probability < 0.7: Moderate Risk
Probability ≥ 0.7: High Risk

2. Output:
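The model reports heart disease as "Found" when the predicted probability is at least the 0.5 classification threshold, and "Not Found" otherwise.
It also assigns a risk level from the same probability. Because 0.5 lies inside the Moderate Risk band, a Moderate Risk patient can be either "Found" (e.g. 0.609405) or "Not Found" (e.g. 0.400738).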

3. Metrics:
On the 60-patient test set the model reaches an accuracy of about 0.87. The classification report adds per-class precision, recall, and F1-score.
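
The reported metrics can be cross-checked by hand. The notebook does not print a confusion matrix, so the four counts below are inferred: they are the integer counts consistent with the report's supports (36 and 24) and recalls (0.89 and 0.83). Treat this as a sanity check, not as output from the model.

```python
# Confusion-matrix counts inferred from the classification report (not printed by the notebook)
tn, fp = 32, 4   # class 0: 36 patients, 32 correctly predicted
fn, tp = 4, 20   # class 1: 24 patients, 20 correctly predicted

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

precision_1 = tp / (tp + fp)   # of predicted "disease", how many were right
recall_1 = tp / (tp + fn)      # of actual "disease", how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"accuracy    = {accuracy:.2f}")
print(f"class 0: precision = {precision_0:.2f}, recall = {recall_0:.2f}")
print(f"class 1: precision = {precision_1:.2f}, recall = {recall_1:.2f}, f1 = {f1_1:.2f}")
```

These values round to the figures in the report, including the overall accuracy of 52/60.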

Print all test-set predictions with their risk levels

In [28]:
print("\nComplete Dataset with Predictions and Risk Levels:")
print(results.to_string(index=False))


Complete Dataset with Predictions and Risk Levels:
 Probability Heart Disease    Risk Level
    0.018711     Not Found      Low Risk
    0.253681     Not Found      Low Risk
    0.011895     Not Found      Low Risk
    0.991753         Found     High Risk
    0.049528     Not Found      Low Risk
    0.400738     Not Found Moderate Risk
    0.121717     Not Found      Low Risk
    0.609405         Found Moderate Risk
    0.784103         Found     High Risk
    0.169022     Not Found      Low Risk
    0.622870         Found Moderate Risk
    0.120420     Not Found      Low Risk
    0.012432     Not Found      Low Risk
    0.230716     Not Found      Low Risk
    0.403695     Not Found Moderate Risk
    0.008973     Not Found      Low Risk
    0.230135     Not Found      Low Risk
    0.476621     Not Found Moderate Risk
    0.493323     Not Found Moderate Risk
    0.067746     Not Found      Low Risk
    0.829276         Found     High Risk
    0.738757         Found     High Risk
    0