## CVD Prediction - Heart Failure Prediction Dataset (Source: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction/data)
Model Training and Evaluation

In [1]:
#load preprocessed data 
import pandas as pd
train_df = pd.read_csv("./data_subsets/train_50_50.csv")

X_test = pd.read_csv("./data_splits/X_test.csv")
y_test = pd.read_csv("./data_splits/y_test.csv")

#check out the data
train_df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,61,1,3,146.0,241.0,0,0,148.0,1,3.0,0,1
1,39,1,1,130.0,215.0,0,0,120.0,0,0.0,2,0
2,60,0,0,150.0,240.0,0,0,171.0,0,0.9,2,0
3,49,1,3,128.0,212.0,0,0,96.0,1,0.0,1,1
4,50,0,2,140.0,288.0,0,0,140.0,1,0.0,1,1


In [2]:
TARGET = "HeartDisease"
SENSITIVE = "Sex"   # 1 = Male, 0 = Female

categorical_cols = ['Sex','ChestPainType','FastingBS','RestingECG','ExerciseAngina','ST_Slope']
continuous_cols  = ['Age','RestingBP','Cholesterol','MaxHR','Oldpeak']

In [3]:
# Split train into X / y; Sex stays in X as a feature (the test-set Sex column is exported later for fairness evaluation)
X_train = train_df.drop(columns=[TARGET])
y_train = train_df[TARGET]

In [4]:
# scale numeric features only, fit on train, transform test
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_num_scaled = pd.DataFrame(
    scaler.fit_transform(X_train[continuous_cols]),
    columns=continuous_cols, index=X_train.index
)
X_test_num_scaled = pd.DataFrame(
    scaler.transform(X_test[continuous_cols]),
    columns=continuous_cols, index=X_test.index
)
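The scaler is fitted on the training data only, and the same mean and standard deviation are then applied to the test data. A minimal sketch of that idea in plain Python, assuming population standard deviation (which is what `StandardScaler` uses); the helper names and the test values are illustrative and not part of the notebook:

```python
from statistics import fmean, pstdev

def fit_standardizer(train_values):
    """Learn mean and standard deviation from the training values only."""
    mean = fmean(train_values)
    std = pstdev(train_values)  # population std (ddof=0)
    if std == 0:
        std = 1.0  # constant feature: leave values centred but unscaled
    return mean, std

def transform(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in values]

train_max_hr = [148.0, 120.0, 171.0, 96.0, 140.0]  # MaxHR values from train_df.head()
test_max_hr = [135.0, 160.0]                       # hypothetical test values

mean, std = fit_standardizer(train_max_hr)
train_scaled = transform(train_max_hr, mean, std)
test_scaled = transform(test_max_hr, mean, std)    # no refit on test data

print(round(mean, 2), round(std, 2))
print([round(v, 3) for v in test_scaled])
```

Because the test values never influence `mean` or `std`, no information from the test set leaks into preprocessing.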

In [5]:
#one-hot encode categoricals; numeric are kept as is 
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown="ignore", drop="if_binary", sparse_output=False)
ohe.fit(X_train[categorical_cols])

X_train_cat = pd.DataFrame(
    ohe.transform(X_train[categorical_cols]),
    columns=ohe.get_feature_names_out(categorical_cols),
    index=X_train.index
)
X_test_cat = pd.DataFrame(
    ohe.transform(X_test[categorical_cols]),
    columns=ohe.get_feature_names_out(categorical_cols),
    index=X_test.index
)

In [6]:
# Assemble final matrices
X_train_ready = pd.concat([X_train_cat, X_train_num_scaled], axis=1)
X_test_ready  = pd.concat([X_test_cat,  X_test_num_scaled],  axis=1)

print("Final feature shapes:", X_train_ready.shape, X_test_ready.shape)

Final feature shapes: (600, 18) (184, 18)
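The 18 columns can be sanity-checked by hand. With `drop="if_binary"`, a two-level feature yields one column and any other feature yields one column per level. The level counts below are an assumption based on the public dataset description, not values read from the CSV files; they are consistent with the reported shape:

```python
# Assumed number of distinct levels per categorical column (not read from the data).
assumed_levels = {
    "Sex": 2, "ChestPainType": 4, "FastingBS": 2,
    "RestingECG": 3, "ExerciseAngina": 2, "ST_Slope": 3,
}
continuous_cols = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]

def ohe_width(n_levels):
    """Columns produced for one feature when binary features drop one level."""
    return 1 if n_levels == 2 else n_levels

n_cat = sum(ohe_width(n) for n in assumed_levels.values())
n_total = n_cat + len(continuous_cols)
print(n_cat, n_total)  # 13 18
```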


### Traditional ML Models - Baseline: K-Nearest Neighbors (KNN) & Decision Tree (DT)

In [7]:
#import required libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)

# helper: print accuracy, precision, recall, F1, the classification report and the confusion matrix
def evaluate_model(y_true, y_pred, model_name):
    print(f"=== {model_name} Evaluation ===")
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred, average='binary'))
    print("Recall   :", recall_score(y_true, y_pred, average='binary'))
    print("F1 Score :", f1_score(y_true, y_pred, average='binary'))
    print("\nClassification Report:\n", classification_report(y_true, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
    print("\n" + "="*40 + "\n")

In [8]:
# KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_ready, y_train)
y_pred_knn = knn.predict(X_test_ready)
y_prob_knn = knn.predict_proba(X_test_ready)[:, 1]  
evaluate_model(y_test, y_pred_knn, "KNN")

# Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train_ready, y_train)
y_pred_dt = dt.predict(X_test_ready)
y_prob_dt = dt.predict_proba(X_test_ready)[:, 1]  
evaluate_model(y_test, y_pred_dt, "Decision Tree")

=== KNN Evaluation ===
Accuracy : 0.8043478260869565
Precision: 0.8586956521739131
Recall   : 0.7745098039215687
F1 Score : 0.8144329896907216

Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.84      0.79        82
           1       0.86      0.77      0.81       102

    accuracy                           0.80       184
   macro avg       0.80      0.81      0.80       184
weighted avg       0.81      0.80      0.80       184

Confusion Matrix:
 [[69 13]
 [23 79]]


=== Decision Tree Evaluation ===
Accuracy : 0.7663043478260869
Precision: 0.8172043010752689
Recall   : 0.7450980392156863
F1 Score : 0.7794871794871795

Classification Report:
               precision    recall  f1-score   support

           0       0.71      0.79      0.75        82
           1       0.82      0.75      0.78       102

    accuracy                           0.77       184
   macro avg       0.77      0.77      0.77       184
weighted avg       0.77      0.77      0.77       184

Confusion Matrix:
 [[65 17]
 [26 76]]

## KNN Model Evaluation

### Overall Metrics
- **Accuracy**: 80.4%  
- **Precision**: 85.9%  
- **Recall**: 77.5%  
- **F1 Score**: 81.4%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 69           | 13           |
| **Actual: 1** | 23           | 79           |

### Interpretation
- The model shows **strong precision (86%)**, meaning most predicted CVD cases are correct.  
- **Recall (77.5%)** indicates that some CVD patients are still missed (**23 false negatives**).  
- Non-CVD cases are identified with a specificity of 84% (69 of 82), with **13 false positives**.  
- Overall, this KNN model achieves **balanced performance**, leaning slightly toward precision while still maintaining reasonable recall.  
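
All of the headline metrics follow directly from the four confusion-matrix counts. A small sketch (the helper name `metrics_from_confusion` is illustrative, not part of the notebook) that reproduces the baseline KNN numbers and adds specificity:

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive binary-classification metrics for class 1 from raw counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,                  # sensitivity
        "specificity": tn / (tn + fp),     # recall of class 0
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts from the baseline KNN confusion matrix: [[69 13], [23 79]]
knn = metrics_from_confusion(tn=69, fp=13, fn=23, tp=79)
for name, value in knn.items():
    print(f"{name:<12}{value:.3f}")
```

The same helper can be applied to any of the other confusion matrices in this notebook to check the reported percentages.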

---

## Decision Tree Model Evaluation

### Overall Metrics
- **Accuracy**: 76.6%  
- **Precision**: 81.7%  
- **Recall**: 74.5%  
- **F1 Score**: 77.9%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 65           | 17           |
| **Actual: 1** | 26           | 76           |

### Interpretation
- The Decision Tree achieves **precision of 82%**, showing that positive (CVD) predictions are mostly correct.  
- **Recall is lower (74.5%)**, with **26 CVD cases missed** (false negatives).  
- Non-CVD patients are correctly identified 79% of the time, though **17 false positives** occurred.  
- Overall, this model delivers **moderate performance**; its accuracy and recall are both lower than the baseline KNN's (80.4% and 77.5%), which limits its sensitivity.  

---

### KNN Improvement
The next cell improves the KNN model with a **grid search** over three hyperparameters (`n_neighbors`, `weights`, and `metric`) and keeps the configuration with the best cross-validated F1 score. **Decision threshold tuning** to boost recall, which is critical in medical prediction tasks, is a natural follow-up, but it is not performed in this cell: predictions use the default 0.5 threshold.
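
Threshold tuning itself does not appear in the cells here. As a rough illustration of what it could look like, the sketch below sweeps a few candidate thresholds and keeps the highest one that still meets a recall target. The function names, the data, and the recall target are all hypothetical; in practice the threshold should be chosen on a validation split, not on the test set:

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall for class 1 when predicting 1 if prob >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def pick_threshold(y_true, y_prob, min_recall, candidates):
    """Highest candidate threshold whose recall still meets min_recall."""
    feasible = [t for t in candidates
                if precision_recall_at(y_true, y_prob, t)[1] >= min_recall]
    return max(feasible) if feasible else None

# Hypothetical validation labels and predicted probabilities (not notebook data).
y_val = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
p_val = [0.9, 0.8, 0.45, 0.35, 0.4, 0.2, 0.6, 0.7, 0.1, 0.3]
candidates = [0.3, 0.4, 0.5, 0.6]

best_t = pick_threshold(y_val, p_val, min_recall=0.8, candidates=candidates)
print(best_t, precision_recall_at(y_val, p_val, best_t))
```

Lowering the threshold from 0.5 to 0.4 in this toy example raises recall from 0.6 to 0.8 while precision drops from 0.75 to about 0.67, which is the trade-off threshold tuning exposes.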

In [9]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score, precision_score, f1_score
import numpy as np

# 1) Hyperparameter tuning for KNN 
param_grid = {
    "n_neighbors": list(range(1, 31)),
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan", "minkowski"],  # "minkowski" uses the default p=2, so it duplicates "euclidean"
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid=param_grid,
    cv=cv,
    scoring="f1",        
    n_jobs=-1,
    verbose=0,
    refit=True
)

# Fit 
grid.fit(X_train_ready, y_train)

print("Best KNN params:", grid.best_params_)
print("Best CV F1:", grid.best_score_)

best_knn = grid.best_estimator_

# 2) Evaluate best KNN on TEST 
y_pred_knn_best = best_knn.predict(X_test_ready)
y_prob_knn_best = best_knn.predict_proba(X_test_ready)[:, 1]  
evaluate_model(y_test, y_pred_knn_best, "KNN (best params)")

Best KNN params: {'metric': 'manhattan', 'n_neighbors': 5, 'weights': 'distance'}
Best CV F1: 0.905773457143949
=== KNN (best params) Evaluation ===
Accuracy : 0.8206521739130435
Precision: 0.8791208791208791
Recall   : 0.7843137254901961
F1 Score : 0.8290155440414507

Classification Report:
               precision    recall  f1-score   support

           0       0.76      0.87      0.81        82
           1       0.88      0.78      0.83       102

    accuracy                           0.82       184
   macro avg       0.82      0.83      0.82       184
weighted avg       0.83      0.82      0.82       184

Confusion Matrix:
 [[71 11]
 [22 80]]




## KNN (Best Params) Model Evaluation

### Best Parameters
- **Metric**: Manhattan  
- **Neighbors**: 5  
- **Weights**: Distance  
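
To make these settings concrete: with Manhattan distance and distance weighting, each of the `k` nearest neighbours votes with weight `1 / distance`, so close neighbours count more than distant ones. A minimal plain-Python sketch on hypothetical 2-D points (the function names are illustrative, and this is not how scikit-learn implements the search internally):

```python
def manhattan(a, b):
    """L1 distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict_proba(query, X, y, k):
    """P(class 1) from the k nearest neighbours, each vote weighted by 1/distance."""
    neighbours = sorted((manhattan(query, x), label) for x, label in zip(X, y))[:k]
    # Neighbours at distance 0 decide the vote on their own
    exact = [label for d, label in neighbours if d == 0]
    if exact:
        return sum(exact) / len(exact)
    weights = [(1.0 / d, label) for d, label in neighbours]
    total = sum(w for w, _ in weights)
    return sum(w for w, label in weights if label == 1) / total

# Hypothetical training points and labels.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
y = [1, 0, 0, 1]
query = (0.0, 0.5)

p_weighted = knn_predict_proba(query, X, y, k=3)
print(round(p_weighted, 3))  # 0.6
```

With uniform weights the same three neighbours would vote 1 to 2 for class 0; distance weighting flips the decision because the single class-1 neighbour is much closer.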

### Overall Metrics
- **Accuracy**: 82.1%  
- **Precision**: 87.9%  
- **Recall**: 78.4%  
- **F1 Score**: 82.9%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 71           | 11           |
| **Actual: 1** | 22           | 80           |

### Interpretation
- The tuned KNN achieves **the best balance so far**, improving accuracy (82.1%) and F1 score (82.9%).  
- **Precision is high (87.9%)**, meaning positive (CVD) predictions are very reliable.  
- **Recall (78.4%)** indicates that most CVD cases are detected, though **22 patients are still missed** (false negatives).  
- For non-CVD cases, performance is also strong, with **87% correctly identified** and only **11 false positives**.  
- Overall, this tuned KNN provides a **well-balanced trade-off** between precision and recall and is more effective than the baseline configuration, although the gain is small (accuracy +1.6 percentage points, one fewer false negative).  

---

### Further KNN Improvement - Implementing PCA 

In [10]:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import numpy as np

# 1) PCA + KNN pipeline (on one-hot encoded + scaled features)
pca_knn = Pipeline([
    ('pca', PCA(n_components=0.95, random_state=42)),  # keep enough components to explain at least 95% of the variance
    ('knn', KNeighborsClassifier(
        # n_neighbors=15 is set by hand here; the grid search above selected 5
        n_neighbors=15, metric='manhattan', weights='distance'
    ))
])

pca_knn.fit(X_train_ready, y_train)

# Inspect PCA details
n_comp = pca_knn.named_steps['pca'].n_components_
expl_var = pca_knn.named_steps['pca'].explained_variance_ratio_.sum()
print(f"PCA components: {n_comp} | Explained variance retained: {expl_var:.3f}")

# 2) Evaluate 
y_pred_pca_knn = pca_knn.predict(X_test_ready)
probs_pca_knn = pca_knn.predict_proba(X_test_ready)[:, 1]
  
evaluate_model(y_test, y_pred_pca_knn, "PCA + KNN")

PCA components: 12 | Explained variance retained: 0.969
=== PCA + KNN Evaluation ===
Accuracy : 0.875
Precision: 0.9247311827956989
Recall   : 0.8431372549019608
F1 Score : 0.882051282051282

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.91      0.87        82
           1       0.92      0.84      0.88       102

    accuracy                           0.88       184
   macro avg       0.87      0.88      0.87       184
weighted avg       0.88      0.88      0.88       184

Confusion Matrix:
 [[75  7]
 [16 86]]




## KNN Model Comparison (Baseline vs Tuned vs PCA)

| Model                          | Accuracy | Precision | Recall | F1 Score | Key Notes |
|--------------------------------|----------|-----------|--------|----------|-----------|
| **Baseline KNN**               | 0.804    | 0.859     | 0.775  | 0.814    | Solid starting point; 23 false negatives and 13 false positives. |
| **Tuned KNN (best params)**    | 0.821    | 0.879     | 0.784  | 0.829    | Improved precision and slightly higher recall; 22 false negatives and 11 false positives. |
| **Tuned KNN + PCA (12 comps)** | 0.875    | 0.925     | 0.843  | 0.882    | Best performer: highest accuracy, precision, and recall; only 16 false negatives and 7 false positives. Uses `n_neighbors=15` rather than the grid-searched 5. |

---

### Interpretation
- **Baseline KNN** provides balanced performance but misses 23 CVD cases, limiting sensitivity.  
- **Tuned KNN** improves precision and recall slightly (one fewer false negative, two fewer false positives).  
- **Tuned KNN with PCA** clearly outperforms both. Note that this pipeline uses `n_neighbors=15`, not the grid-searched value of 5, so the gain cannot be attributed to PCA alone:  
  - **Accuracy rises to 87.5%** (+7.1 percentage points over baseline).  
  - **Precision (92.5%) and recall (84.3%)** both increase, with fewer false negatives (16) and false positives (7).  
  - This makes it the **strongest KNN configuration** on this test set for detecting CVD.  

➡️ **Conclusion**: The **Tuned KNN with PCA** is the best variant, combining the highest sensitivity of the three with the highest precision.  
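
The 12 components reported for the PCA step follow from `n_components=0.95`: scikit-learn keeps the smallest number of leading components whose cumulative explained-variance ratio exceeds that fraction (here 0.969 with 12 components). A minimal sketch of the selection rule, using hypothetical ratios rather than the notebook's values:

```python
from itertools import accumulate

def components_for_variance(explained_variance_ratio, target):
    """Smallest number of leading components whose cumulative ratio exceeds target."""
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative > target:
            return k, cumulative
    # Target not reachable: keep everything
    return len(explained_variance_ratio), sum(explained_variance_ratio)

# Hypothetical ratios, sorted in decreasing order as PCA reports them.
ratios = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]
k, retained = components_for_variance(ratios, target=0.95)
print(k, round(retained, 2))  # 5 0.96
```

The retained variance usually overshoots the target slightly, because components can only be added whole.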

---

In [11]:
#saving best performing KNN Model for fairness evaluation
import joblib, pandas as pd, numpy as np

# Ensure y_test is a Series 
if isinstance(y_test, pd.DataFrame):
    y_test = y_test.squeeze("columns")

# Save model
model_filename = "pca_knn.pkl"
joblib.dump(pca_knn, model_filename)

# Ensure 1D arrays for y_true
y_true = y_test.to_numpy() if hasattr(y_test, "to_numpy") else np.asarray(y_test)

# Predictions from PCA+KNN
y_pred_knn = pca_knn.predict(X_test_ready)
y_prob_knn = pca_knn.predict_proba(X_test_ready)[:, 1]

# Optional gender column if present
if isinstance(X_test, pd.DataFrame) and "Sex" in X_test.columns:
    gender_vals = X_test["Sex"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results
results = pd.DataFrame({
    "Sex": gender_vals,
    "y_true": y_true,
    "y_prob": y_prob_knn,
    "y_pred": y_pred_knn
})

preds_filename = "HeartFailureData_50_50_PCA_KNN_predictions.csv"
results.to_csv(preds_filename, index=False)


print(f"Saved PCA+KNN model → {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved PCA+KNN model → pca_knn.pkl
Saved predictions → HeartFailureData_50_50_PCA_KNN_predictions.csv


### Improvement - Decision Tree (DT)

In [12]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import (
    classification_report, confusion_matrix, accuracy_score,
    precision_score, recall_score, f1_score
)

# 1) Base model
dt = DecisionTreeClassifier(random_state=42)

# 2) Hyperparameter grid 
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, 9, None],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 6, 10],
}

# 3) Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 4) Grid search 
grid_dt = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=cv,
    scoring="f1",      
    n_jobs=-1,
    verbose=0
)

grid_dt.fit(X_train_ready, y_train)

print("Best Decision Tree params:", grid_dt.best_params_)
print("Best CV F1:", grid_dt.best_score_)

# 5) Train & evaluate best DT
tuned_dt = grid_dt.best_estimator_
y_pred_dt_best = tuned_dt.predict(X_test_ready)
y_prob_dt_best = tuned_dt.predict_proba(X_test_ready)[:, 1] 

evaluate_model(y_test, y_pred_dt_best, "Tuned Decision Tree (best params)")

Best Decision Tree params: {'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best CV F1: 0.8758095173030324
=== Tuned Decision Tree (best params) Evaluation ===
Accuracy : 0.7336956521739131
Precision: 0.7912087912087912
Recall   : 0.7058823529411765
F1 Score : 0.7461139896373057

Classification Report:
               precision    recall  f1-score   support

           0       0.68      0.77      0.72        82
           1       0.79      0.71      0.75       102

    accuracy                           0.73       184
   macro avg       0.73      0.74      0.73       184
weighted avg       0.74      0.73      0.73       184

Confusion Matrix:
 [[63 19]
 [30 72]]




## Decision Tree (Best Params) Model Evaluation

### Overall Metrics
- **Accuracy**: 73.4%  
- **Precision**: 79.1%  
- **Recall**: 70.6%  
- **F1 Score**: 74.6%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 63           | 19           |
| **Actual: 1** | 30           | 72           |

### Interpretation
- The tuned Decision Tree achieves **moderate precision (79%)**, meaning most predicted CVD cases are correct.  
- **Recall is weaker (71%)**, with **30 CVD cases missed** (false negatives), which is a notable limitation in medical applications.  
- Non-CVD cases are recognized with fair accuracy (77%), though **19 healthy patients** were misclassified as CVD (false positives).  
- Both accuracy (73%) and F1 score (75%) are **lower than for the baseline Decision Tree** (77% and 78%). The selected tree is fully grown (`max_depth=None`, `min_samples_leaf=1`), and the gap between the CV F1 (0.876) and the test F1 (0.746) suggests that this parameter set overfits the training data.  

➡️ Overall, this tuned Decision Tree provides **limited performance**, with recall too low for reliable use in CVD detection. It is less suitable than the recall-focused alternatives.  

---

In [13]:
# Alternative DT tuning: simpler trees + class balancing + cost-complexity pruning
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import numpy as np

# class_weight="balanced" reweights the classes; the grid below limits depth and leaf size to favor simpler trees
base_dt = DecisionTreeClassifier(random_state=42, class_weight="balanced")

param_grid_simple = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 4, 5, 6, 7],
    "min_samples_split": [5, 10, 20],
    "min_samples_leaf": [2, 4, 6],
    "min_impurity_decrease": [0.0, 1e-4, 1e-3],  # tiny regularization
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid_simple = GridSearchCV(
    estimator=base_dt,
    param_grid=param_grid_simple,
    cv=cv,
    scoring="recall",    # selects on recall (sensitivity) for class 1, not a balanced metric
    n_jobs=-1,
    verbose=0,
    refit=True
)
grid_simple.fit(X_train_ready, y_train)

print("Stage A — Best simple DT params:", grid_simple.best_params_)
print("Stage A — Best CV Recall:", grid_simple.best_score_)
simple_dt = grid_simple.best_estimator_

# cost-complexity pruning
path = simple_dt.cost_complexity_pruning_path(X_train_ready, y_train)
ccp_alphas = path.ccp_alphas

unique_alphas = np.unique(np.round(ccp_alphas, 6))
candidate_alphas = np.linspace(unique_alphas.min(), unique_alphas.max(), num=min(20, len(unique_alphas)))
candidate_alphas = np.unique(np.concatenate([candidate_alphas, [0.0]]))  

# scoring="recall", so each score is the mean cross-validated recall for that alpha
cv_scores = []
for alpha in candidate_alphas:
    dt_alpha = DecisionTreeClassifier(
        random_state=42,
        class_weight="balanced",
        criterion=simple_dt.criterion,
        max_depth=simple_dt.max_depth,
        min_samples_split=simple_dt.min_samples_split,
        min_samples_leaf=simple_dt.min_samples_leaf,
        min_impurity_decrease=simple_dt.min_impurity_decrease,
        ccp_alpha=alpha
    )
    recall_cv = cross_val_score(dt_alpha, X_train_ready, y_train, cv=cv, scoring="recall", n_jobs=-1).mean()
    cv_scores.append((alpha, recall_cv))

best_alpha, best_cv_recall = sorted(cv_scores, key=lambda x: x[1], reverse=True)[0]
print(f"Stage B — Best ccp_alpha: {best_alpha:.6f} | CV Recall: {best_cv_recall:.4f}")

best_dt = DecisionTreeClassifier(
    random_state=42,
    class_weight="balanced",
    criterion=simple_dt.criterion,
    max_depth=simple_dt.max_depth,
    min_samples_split=simple_dt.min_samples_split,
    min_samples_leaf=simple_dt.min_samples_leaf,
    min_impurity_decrease=simple_dt.min_impurity_decrease,
    ccp_alpha=best_alpha
).fit(X_train_ready, y_train)

# Evaluate on test set 
y_pred_dt = best_dt.predict(X_test_ready)
y_prob_dt = best_dt.predict_proba(X_test_ready)[:, 1]  

evaluate_model(y_test, y_pred_dt, "Alternative Tuned & Pruned Decision Tree")

Stage A — Best simple DT params: {'criterion': 'entropy', 'max_depth': 6, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 4, 'min_samples_split': 5}
Stage A — Best CV Recall: 0.8800000000000001
Stage B — Best ccp_alpha: 0.000000 | CV Recall: 0.8800
=== Alternative Tuned & Pruned Decision Tree Evaluation ===
Accuracy : 0.8206521739130435
Precision: 0.896551724137931
Recall   : 0.7647058823529411
F1 Score : 0.8253968253968254

Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.89      0.82        82
           1       0.90      0.76      0.83       102

    accuracy                           0.82       184
   macro avg       0.82      0.83      0.82       184
weighted avg       0.83      0.82      0.82       184

Confusion Matrix:
 [[73  9]
 [24 78]]




## Alternative Tuned & Pruned Decision Tree Evaluation

### Best Parameters
- **Criterion**: Entropy  
- **Max Depth**: 6  
- **Min Samples Split**: 5  
- **Min Samples Leaf**: 4  
- **Min Impurity Decrease**: 0.0  
- **Pruning (ccp_alpha)**: 0.0  
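
For reference, the two `criterion` options in the grid measure node impurity differently. A short sketch of both formulas in plain Python; the class proportions are hypothetical and only serve to show how the two measures behave:

```python
from math import log2

def gini(proportions):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)), skipping empty classes."""
    return -sum(p * log2(p) for p in proportions if p > 0)

balanced_node = [0.5, 0.5]   # hypothetical node with an even class mix
skewed_node = [0.9, 0.1]     # hypothetical node dominated by one class

for name, node in [("balanced", balanced_node), ("skewed", skewed_node)]:
    print(f"{name:<9} gini={gini(node):.3f} entropy={entropy(node):.3f}")
```

Both measures are largest for an even class mix and fall to zero for a pure node; a split is preferred when it lowers the weighted impurity of the child nodes the most.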

### Overall Metrics
- **Accuracy**: 82.1%  
- **Precision**: 89.7%  
- **Recall**: 76.5%  
- **F1 Score**: 82.5%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 73           | 9            |
| **Actual: 1** | 24           | 78           |

### Interpretation
- The model achieves **very strong precision (89.7%)**, showing that most predicted CVD cases are correct.  
- **Recall is moderate (76.5%)**, meaning **24 CVD patients are missed** (false negatives).  
- Non-CVD patients are well detected (89% recall), with only **9 false positives**, indicating strong reliability in ruling out healthy cases.  
- Both accuracy (82.1%) and F1 score (82.5%) show this model is **one of the better Decision Tree configurations**, offering a good balance.  
- The selected `ccp_alpha` is 0.0, so no cost-complexity pruning is actually applied; the tree is kept simple and interpretable by `max_depth=6` and `min_samples_leaf=4` instead.  

➡️ Overall, this **tuned & pruned Decision Tree** provides a **robust balance** between precision and recall, making it a reliable and interpretable option for CVD detection, though it still sacrifices some sensitivity compared to recall-optimized variants.  
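
Cost-complexity pruning scores each candidate subtree as its total impurity plus `ccp_alpha` times its number of leaves, and keeps the subtree with the lowest score. The sketch below illustrates that rule on three hypothetical subtrees (the numbers are made up, and the notebook selects `ccp_alpha` by cross-validated recall rather than by this training-side score):

```python
def best_subtree(subtrees, alpha):
    """Pick the subtree minimising impurity + alpha * number_of_leaves."""
    return min(subtrees, key=lambda t: t["impurity"] + alpha * t["leaves"])

# Hypothetical nested subtrees of one fitted tree (not taken from the notebook).
subtrees = [
    {"name": "full", "leaves": 20, "impurity": 0.10},
    {"name": "medium", "leaves": 8, "impurity": 0.16},
    {"name": "stump", "leaves": 2, "impurity": 0.30},
]

for alpha in (0.0, 0.01, 0.05):
    print(alpha, best_subtree(subtrees, alpha)["name"])
```

At `alpha = 0.0` the leaf penalty vanishes and the full tree always wins, which is why a selected `ccp_alpha` of 0.0 means the tree is left unpruned.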

---

In [14]:
# Alternative DT tuning focused on higher recall
# Changes vs previous cell:
#  - No probability calibration: predict() uses raw tree probabilities at the 0.5 threshold
#  - Tune class_weight (heavier positive weights allowed)
#  - Allow somewhat deeper trees, still regularized via min_samples_* and a tiny impurity decrease
#  - Restrict pruning candidates to the smallest ccp_alphas to limit recall loss

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, cross_val_score
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import numpy as np

# Simpler-but-expressive trees + tuned class weights
base_dt = DecisionTreeClassifier(random_state=42)

param_grid_simple = {
    "criterion": ["gini", "entropy"],                  # add "log_loss" if your sklearn supports it
    "max_depth": [4, 5, 6, 7, 8, 9, 10],               # a bit deeper to help recall
    "min_samples_split": [5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 6],
    "min_impurity_decrease": [0.0, 1e-4, 1e-3],
    "class_weight": ["balanced", {0:1,1:2}, {0:1,1:3}, {0:1,1:4}],  # stronger push toward positives
}

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

grid_simple = GridSearchCV(
    estimator=base_dt,
    param_grid=param_grid_simple,
    cv=cv,
    scoring="recall",      # prioritize sensitivity for class 1
    n_jobs=-1,
    verbose=0,
    refit=True
)
grid_simple.fit(X_train_ready, y_train)

best_params = grid_simple.best_params_
print("Stage A — Best DT params:", best_params)
print("Stage A — Best CV Recall:", round(grid_simple.best_score_, 4))

# Train a zero-pruned model with best params to get the pruning path
dt0 = DecisionTreeClassifier(random_state=42, **best_params, ccp_alpha=0.0).fit(X_train_ready, y_train)


# Stage B — Gentle cost-complexity pruning (favor small alphas)
path = dt0.cost_complexity_pruning_path(X_train_ready, y_train)
ccp_alphas = path.ccp_alphas

# ccp_alphas is sorted in increasing order, so this keeps 0.0 plus the (up to) 30 smallest values;
# for a shallow tree the path may have fewer than 30 entries, in which case all of them are tried
small_slice = ccp_alphas[:30]
candidate_alphas = np.unique(np.r_[0.0, small_slice])

cv_scores = []
for alpha in candidate_alphas:
    dt_alpha = DecisionTreeClassifier(random_state=42, **best_params, ccp_alpha=alpha)
    rec = cross_val_score(dt_alpha, X_train_ready, y_train, cv=cv, scoring="recall", n_jobs=-1).mean()
    cv_scores.append((alpha, rec))

best_alpha, best_cv_recall = max(cv_scores, key=lambda x: x[1])
print(f"Stage B — Best ccp_alpha: {best_alpha:.6f} | CV Recall: {best_cv_recall:.4f}")

alt_best_dt = DecisionTreeClassifier(random_state=42, **best_params, ccp_alpha=best_alpha).fit(X_train_ready, y_train)


# Evaluation
y_pred = alt_best_dt.predict(X_test_ready)               
y_prob = alt_best_dt.predict_proba(X_test_ready)[:, 1]   

evaluate_model(y_test, y_pred, "Alternative Tuned & Pruned Decision Tree (Recall-focused)")

Stage A — Best DT params: {'class_weight': {0: 1, 1: 3}, 'criterion': 'entropy', 'max_depth': 4, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 6, 'min_samples_split': 5}
Stage A — Best CV Recall: 0.9633
Stage B — Best ccp_alpha: 0.035320 | CV Recall: 0.9650
=== Alternative Tuned & Pruned Decision Tree (Recall-focused) Evaluation ===
Accuracy : 0.7880434782608695
Precision: 0.7404580152671756
Recall   : 0.9509803921568627
F1 Score : 0.8326180257510729

Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.59      0.71        82
           1       0.74      0.95      0.83       102

    accuracy                           0.79       184
   macro avg       0.82      0.77      0.77       184
weighted avg       0.81      0.79      0.78       184

Confusion Matrix:
 [[48 34]
 [ 5 97]]




## Alternative Tuned & Pruned Decision Tree (Recall-focused) Evaluation

### Best Parameters
- **Criterion**: Entropy  
- **Max Depth**: 4  
- **Min Samples Split**: 5  
- **Min Samples Leaf**: 6  
- **Class Weights**: {0: 1, 1: 3} (favoring CVD cases)  
- **Min Impurity Decrease**: 0.0  
- **Pruning (ccp_alpha)**: 0.0353  

### Overall Metrics
- **Accuracy**: 78.8%  
- **Precision**: 74.0%  
- **Recall**: 95.1%  
- **F1 Score**: 83.3%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 48           | 34           |
| **Actual: 1** | 5            | 97           |

### Interpretation
- The model achieves **very high recall (95.1%)**, meaning nearly all CVD cases are detected (**only 5 false negatives**).  
- **Precision is lower (74%)**, indicating a higher number of false alarms (**34 non-CVD patients misclassified as CVD**).  
- Accuracy (78.8%) is moderate, reflecting the trade-off between strong sensitivity and reduced specificity (48 of 82 non-CVD cases, 58.5%).  
- The weighted class setting biases the model toward detecting positives, which explains the high sensitivity but lower precision.  
- This configuration is **recall-focused**, making it highly suitable when the main priority is to avoid missed diagnoses, even at the cost of more false positives.  

➡️ Overall, this **recall-optimized DT** is a good choice for medical screening, as it minimizes the risk of undetected CVD cases. However, it does so by accepting a substantial increase in false positives, which may lead to additional follow-up checks for healthy patients.  
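
One way to see why `class_weight={0: 1, 1: 3}` pushes predictions toward CVD: a tree leaf predicts the class with the larger weighted sample count, so every CVD sample counts three times. The sketch below shows this for a single hypothetical leaf (the helper name and the counts are illustrative; class weights also influence which splits are chosen, which this sketch does not cover):

```python
def leaf_proba(n_neg, n_pos, class_weight):
    """P(class 1) in a leaf, computed from class-weighted sample counts."""
    w_neg = n_neg * class_weight[0]
    w_pos = n_pos * class_weight[1]
    return w_pos / (w_neg + w_pos)

# Hypothetical leaf holding 6 non-CVD and 3 CVD training samples.
unweighted = leaf_proba(6, 3, {0: 1, 1: 1})
weighted = leaf_proba(6, 3, {0: 1, 1: 3})

print(round(unweighted, 3), "->", int(unweighted >= 0.5))  # 0.333 -> 0
print(round(weighted, 3), "->", int(weighted >= 0.5))      # 0.6 -> 1
```

The same leaf flips from predicting non-CVD to predicting CVD once positives are weighted 3:1, which raises recall and lowers precision.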

---

## Decision Tree Model Comparison

| Model                                | Accuracy | Precision | Recall | F1 Score | Key Notes |
|--------------------------------------|----------|-----------|--------|----------|-----------|
| **Baseline DT**                      | 0.766    | 0.817     | 0.745  | 0.779    | Balanced but modest; 26 false negatives, 17 false positives. |
| **Tuned DT (best params)**           | 0.734    | 0.791     | 0.706  | 0.746    | Weakest variant; recall drops further (30 missed CVD cases). |
| **Alt. Tuned & Pruned DT (Balanced)**| 0.821    | 0.897     | 0.765  | 0.825    | Strongest balanced model; high precision, fewer false positives (9), but 24 false negatives. |
| **Alt. Tuned & Pruned DT (Recall-focused)** | 0.788 | 0.740 | 0.951  | 0.833    | Best recall (95%); only 5 missed CVD cases, but many false positives (34). |

---

### Interpretation
- **Baseline DT** provides moderate, balanced performance but is outperformed by the pruned alternatives.  
- **Tuned DT (best params)** performs the weakest overall, with reduced accuracy and recall, making it the least suitable.  
- **Alt. Tuned & Pruned DT (Balanced)** offers the **best trade-off**: highest accuracy (82.1%), very high precision (89.7%), and stable recall (76.5%). False positives are kept low (9), avoiding excessive misclassification of healthy patients.  
- **Alt. Tuned & Pruned DT (Recall-focused)** minimizes missed CVD cases (95% recall) but at the cost of precision (74%) and a surge in false positives (34). This may burden healthcare systems with unnecessary follow-ups and cause anxiety for many healthy patients.  
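
How the two pruned variants compare depends on how costly a missed CVD case is relative to a false alarm. The sketch below totals the errors from the confusion matrices above under a few cost ratios. The cost values are purely hypothetical placeholders; realistic costs would need clinical input:

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total misclassification cost for given error counts and unit costs."""
    return fp * cost_fp + fn * cost_fn

# False positives / false negatives from the confusion matrices above.
models = {
    "Balanced DT": {"fp": 9, "fn": 24},
    "Recall-focused DT": {"fp": 34, "fn": 5},
}

def cheapest(cost_fp, cost_fn):
    """Name of the model with the lower total cost under these unit costs."""
    return min(
        models,
        key=lambda m: total_cost(**models[m], cost_fp=cost_fp, cost_fn=cost_fn),
    )

for cost_fn in (1, 2, 5):  # hypothetical cost of a missed case, false alarm = 1
    print(f"FN cost {cost_fn}x FP cost -> {cheapest(cost_fp=1, cost_fn=cost_fn)}")
```

With equal costs the balanced tree has the lower total (33 vs 39 errors); the two break even when a missed case costs about 1.3 times a false alarm (25/19).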

---

### Conclusion
The **Alt. Tuned & Pruned DT (Balanced)** is the preferred model.  
It achieves the **highest accuracy (82.1%) and precision (89.7%)** among the Decision Tree variants while maintaining a reasonable recall (76.5%). It also keeps the number of false alarms **low (only 9)**, so it offers a **reliable detection rate** without overwhelming the system or patients with misclassifications. The price is 24 missed CVD cases, compared with 5 for the recall-focused variant.  

➡️ This makes the **balanced DT** the most practical and trustworthy option for CVD detection.  

---

In [15]:
#saving best performing DT Model for fairness evaluation

# Ensure y_test is a Series 
if isinstance(y_test, pd.DataFrame):
    y_test = y_test.squeeze("columns")

# Save model (best_dt is the balanced variant from cell 13, not the recall-focused alt_best_dt)
model_filename = "alt_tuned_pruned_DT.pkl"
joblib.dump(best_dt, model_filename)

# Ensure 1D arrays for y_true
y_true = y_test.to_numpy() if hasattr(y_test, "to_numpy") else np.asarray(y_test)

# Predictions from Decision Tree 
y_pred_dt = best_dt.predict(X_test_ready)
y_prob_dt = best_dt.predict_proba(X_test_ready)[:, 1]

# Optional gender column if present
if isinstance(X_test, pd.DataFrame) and "Sex" in X_test.columns:
    gender_vals = X_test["Sex"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results
results = pd.DataFrame({
    "Sex": gender_vals,
    "y_true": y_true,
    "y_prob": y_prob_dt,
    "y_pred": y_pred_dt
})

preds_filename = "HeartFailureData_50_50_AltTunedDT_predictions.csv"
results.to_csv(preds_filename, index=False)

print(f"Saved Alternative DT tuning → {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved Alternative DT tuning → alt_tuned_pruned_DT.pkl
Saved predictions → HeartFailureData_50_50_AltTunedDT_predictions.csv


### Ensemble Model - Random Forest (RF)

In [16]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest
rf = RandomForestClassifier(random_state=42)

# Train the model
rf.fit(X_train_ready, y_train)

# Predict on test set
y_pred_rf = rf.predict(X_test_ready)
evaluate_model(y_test, y_pred_rf, "Random Forest")

=== Random Forest Evaluation ===
Accuracy : 0.8315217391304348
Precision: 0.8585858585858586
Recall   : 0.8333333333333334
F1 Score : 0.845771144278607

Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.83      0.81        82
           1       0.86      0.83      0.85       102

    accuracy                           0.83       184
   macro avg       0.83      0.83      0.83       184
weighted avg       0.83      0.83      0.83       184

Confusion Matrix:
 [[68 14]
 [17 85]]




## Random Forest Model Evaluation

### Overall Metrics
- **Accuracy**: 83.2%  
- **Precision**: 85.9%  
- **Recall**: 83.3%  
- **F1 Score**: 84.6%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 68           | 14           |
| **Actual: 1** | 17           | 85           |

### Interpretation
- The Random Forest model performs **evenly across all metrics**, each falling between 83% and 86%.  
- **Precision (85.9%)** indicates that most predicted CVD cases are correct, while **recall (83.3%)** shows that the majority of true CVD cases are detected.  
- **17 CVD cases are missed** (false negatives), while **14 healthy patients are misclassified as CVD** (false positives).  
- Both classes are classified about equally well:  
  - Non-CVD detection: 68 out of 82 correctly classified (83%).  
  - CVD detection: 85 out of 102 correctly classified (83%).  
- Sensitivity (83.3%) and specificity (82.9%) are nearly equal, and precision (85.9%) is close to both, so the model does not favor one class at the expense of the other.  

➡️ Overall, this Random Forest configuration provides a **robust and well-balanced trade-off**, achieving high accuracy while keeping both false positives and false negatives at manageable levels.  
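
The headline metrics follow directly from the four confusion-matrix counts. As a quick cross-check, the sketch below recomputes them in plain Python; the variable names are illustrative and are not part of the notebook's `evaluate_model` helper.

```python
# Confusion-matrix counts for the baseline Random Forest (from the output above)
tn, fp = 68, 14   # actual 0: correctly rejected / false alarms
fn, tp = 17, 85   # actual 1: missed cases / correctly detected

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # also called sensitivity
specificity = tn / (tn + fp)       # recall of the negative class
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f} f1={f1:.3f}")
```

The same arithmetic applies to every confusion matrix in this notebook, which makes it easy to verify the summary tables by hand.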

---

### Improvements - Random Forest (RF)

In [17]:
# Random Forest: hyperparameter tuning

# 1) GridSearchCV over impactful RF params
rf = RandomForestClassifier(random_state=42)

param_grid = {
    "n_estimators": [200, 400, 600],
    "max_depth": [None, 8, 12, 16],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", 0.8],  # 0.8 = 80% of features
    "class_weight": [None, "balanced"]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=cv,
    scoring="recall",          
    n_jobs=-1,
    verbose=1,
    refit=True
)

grid.fit(X_train_ready, y_train)
best_rf = grid.best_estimator_
print("Best RF params:", grid.best_params_)
print("Best CV recall:", grid.best_score_)  # scoring="recall", so best_score_ is the mean CV recall

# 2) Evaluate best RF 
y_pred_tuned_rf = best_rf.predict(X_test_ready)
y_prob_tuned_rf = best_rf.predict_proba(X_test_ready)[:, 1]
evaluate_model(y_test, y_pred_tuned_rf, "Tuned Random Forest")

Fitting 5 folds for each of 648 candidates, totalling 3240 fits
Best RF params: {'class_weight': None, 'max_depth': None, 'max_features': 0.8, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 200}
Best CV recall: 0.9166666666666667
=== Tuned Random Forest Evaluation ===
Accuracy : 0.8206521739130435
Precision: 0.8556701030927835
Recall   : 0.8137254901960784
F1 Score : 0.8341708542713567

Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.83      0.80        82
           1       0.86      0.81      0.83       102

    accuracy                           0.82       184
   macro avg       0.82      0.82      0.82       184
weighted avg       0.82      0.82      0.82       184

Confusion Matrix:
 [[68 14]
 [19 83]]
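
The fit count reported by the search can be reproduced by hand: it is the product of the number of values per hyperparameter, multiplied by the number of folds. The sketch below copies the value counts from `param_grid`; the dictionary name `grid_sizes` is only illustrative.

```python
from math import prod

# Number of values tried per hyperparameter (copied from param_grid above)
grid_sizes = {
    "n_estimators": 3,
    "max_depth": 4,
    "min_samples_split": 3,
    "min_samples_leaf": 3,
    "max_features": 3,
    "class_weight": 2,
}
n_splits = 5  # folds of the StratifiedKFold splitter

n_candidates = prod(grid_sizes.values())   # distinct parameter combinations
total_fits = n_candidates * n_splits       # cross-validation fits
print(n_candidates, total_fits)
```

Because `refit=True`, one further fit of the best combination on the full training set follows the cross-validation fits; it is not included in the count printed by the search.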




## Random Forest Model Comparison

| Model              | Accuracy | Precision | Recall | F1 Score | Key Notes |
|--------------------|----------|-----------|--------|----------|-----------|
| **Baseline RF**    | 0.832    | 0.859     | 0.833  | 0.846    | Strong and balanced; 17 false negatives, 14 false positives. |
| **Tuned RF**       | 0.821    | 0.856     | 0.814  | 0.834    | Slightly lower performance; 19 false negatives, 14 false positives. |

---

### Interpretation
- **Baseline RF** shows **very balanced performance** with high precision (85.9%) and recall (83.3%), achieving solid overall accuracy (83.2%).  
  - Correctly identifies **85 of 102 CVD cases**, missing 17.  
  - Correctly classifies **68 of 82 non-CVD cases**, with 14 false alarms.  

- **Tuned RF** delivers **similar but slightly weaker results**: accuracy falls to 82.1%, recall drops slightly to 81.4%, and F1 decreases to 83.4%.  
  - It misses a few more CVD cases (**19 false negatives**) while keeping false positives unchanged (14).  
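
To gauge whether two additional missed cases are a meaningful difference, one rough check is a confidence interval around each recall value. The sketch below uses the Wilson score interval with an assumed 95% level (`z = 1.96`); the helper `wilson_interval` is hypothetical and not part of the notebook. Both models are scored on the same 102 positive test cases, so a paired test on per-sample predictions (for example McNemar's test) would be the more rigorous choice.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Approximate Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Detected CVD cases out of 102 positives in the test set
base_lo, base_hi = wilson_interval(85, 102)    # baseline RF
tuned_lo, tuned_hi = wilson_interval(83, 102)  # tuned RF

print(f"baseline recall interval: {base_lo:.3f} to {base_hi:.3f}")
print(f"tuned recall interval:    {tuned_lo:.3f} to {tuned_hi:.3f}")
```

The two intervals overlap almost entirely, which is consistent with treating the baseline and tuned models as practically equivalent on a test set of this size.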

---

### Conclusion
The **Baseline Random Forest** is the stronger option here, offering the **best balance of sensitivity and precision**.  
Tuning did not improve test performance: the grid search reached a mean cross-validated recall of 0.917 on the training data, yet test recall fell from 83.3% to 81.4% and accuracy from 83.2% to 82.1%. The gap amounts to two additional missed CVD cases out of 102, which suggests that the **default Random Forest is already well-suited** for this dataset and that further tuning brings little benefit.  

---

In [18]:
#saving best performing RF Model for fairness evaluation

# Ensure y_test is a Series 
if isinstance(y_test, pd.DataFrame):
    y_test = y_test.squeeze("columns")

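# Ensure 1D arrays for y_true
y_true = y_test.to_numpy() if hasattr(y_test, "to_numpy") else np.asarray(y_test)

# Refit the default-parameter RF before saving: `rf` was re-created in the tuning
# cell and GridSearchCV only fits clones, so this instance is still unfitted
rf.fit(X_train_ready, y_train)

# Save the fitted model
model_filename = "rf.pkl"
joblib.dump(rf, model_filename)

# Predictions from RF
y_pred_rf = rf.predict(X_test_ready)
y_prob_rf = rf.predict_proba(X_test_ready)[:, 1]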

# Optional gender column if present
if isinstance(X_test, pd.DataFrame) and "Sex" in X_test.columns:
    gender_vals = X_test["Sex"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results
results = pd.DataFrame({
    "Sex": gender_vals,
    "y_true": y_true,
    "y_prob": y_prob_rf,
    "y_pred": y_pred_rf
})

preds_filename = "HeartFailureData_50_50_RF_predictions.csv"
results.to_csv(preds_filename, index=False)

print(f"Saved RF → {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved RF → rf.pkl
Saved predictions → HeartFailureData_50_50_RF_predictions.csv
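
Since the predictions file is saved for the fairness evaluation, a first check is recall per `Sex` group. The sketch below is a minimal standard-library version that assumes the column layout written above (`Sex`, `y_true`, `y_prob`, `y_pred`); the rows in `toy_csv` are made up for illustration, and `recall_by_group` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the same column layout as the saved predictions file
toy_csv = """Sex,y_true,y_prob,y_pred
1,1,0.91,1
1,1,0.40,0
1,0,0.20,0
0,1,0.75,1
0,1,0.30,0
0,1,0.45,0
0,0,0.10,0
"""

def recall_by_group(rows, group_col="Sex"):
    """Recall (TP / (TP + FN)) computed separately for each group value."""
    tp = defaultdict(int)
    fn = defaultdict(int)
    for row in rows:
        if int(row["y_true"]) == 1:          # only actual positives count
            if int(row["y_pred"]) == 1:
                tp[row[group_col]] += 1
            else:
                fn[row[group_col]] += 1
    groups = sorted(set(tp) | set(fn))
    return {g: tp[g] / (tp[g] + fn[g]) for g in groups}

rows = list(csv.DictReader(io.StringIO(toy_csv)))
group_recall = recall_by_group(rows)
print(group_recall)
```

To run this on the real output, replace `io.StringIO(toy_csv)` with `open(preds_filename, newline="")`. Group keys are strings because `csv.DictReader` does not convert types.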


### Deep Learning - Multi-layer Perceptron

In [19]:
#import required library 
from sklearn.neural_network import MLPClassifier

In [20]:
# Initialize MLP model
mlp = MLPClassifier(
    hidden_layer_sizes=(100,),   # one hidden layer with 100 neurons
    activation='relu',           # or 'tanh'
    solver='adam',               # optimizer
    max_iter=1000,                # increase if convergence warning appears
    random_state=42
)

# Train the model
mlp.fit(X_train_ready, y_train)

# Predict
y_pred_mlp = mlp.predict(X_test_ready)
y_prob = mlp.predict_proba(X_test_ready)[:, 1]

evaluate_model(y_test, y_pred_mlp, "Multilayer Perceptron (MLP)")

=== Multilayer Perceptron (MLP) Evaluation ===
Accuracy : 0.7880434782608695
Precision: 0.8620689655172413
Recall   : 0.7352941176470589
F1 Score : 0.7936507936507936

Classification Report:
               precision    recall  f1-score   support

           0       0.72      0.85      0.78        82
           1       0.86      0.74      0.79       102

    accuracy                           0.79       184
   macro avg       0.79      0.79      0.79       184
weighted avg       0.80      0.79      0.79       184

Confusion Matrix:
 [[70 12]
 [27 75]]




## Multilayer Perceptron (MLP) Model Evaluation

### Overall Metrics
- **Accuracy**: 78.8%  
- **Precision**: 86.2%  
- **Recall**: 73.5%  
- **F1 Score**: 79.4%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 70           | 12           |
| **Actual: 1** | 27           | 75           |

### Interpretation
- The model achieves **high precision (86%)**, meaning most positive (CVD) predictions are correct.  
- **Recall is lower (73.5%)**, with **27 CVD cases missed** (false negatives), showing reduced sensitivity.  
- Non-CVD cases are well identified (**85% recall**), with **12 false positives**.  
- Overall performance (accuracy 78.8%, F1 79.4%) indicates the model leans more toward **precision** than recall, making it conservative in detecting positives.  

➡️ While this MLP is reliable when predicting CVD, it sacrifices sensitivity and may fail to capture a significant portion of true cases, which limits its usefulness in medical screening.  
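
One common way to trade precision for recall without retraining is to lower the decision threshold applied to the predicted probabilities (such as the `y_prob` values computed above). The sketch below illustrates the effect on made-up labels and probabilities; `precision_recall_at` is a hypothetical helper, and in practice the threshold should be chosen on validation data rather than on the test set.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when predicting 1 for probabilities >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and probabilities, for illustration only
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_prob = [0.90, 0.70, 0.45, 0.35, 0.60, 0.40, 0.35, 0.10]

p_default, r_default = precision_recall_at(toy_true, toy_prob, 0.5)
p_lower, r_lower = precision_recall_at(toy_true, toy_prob, 0.3)

print(f"threshold 0.5: precision={p_default:.2f} recall={r_default:.2f}")
print(f"threshold 0.3: precision={p_lower:.2f} recall={r_lower:.2f}")
```

In this toy example the lower threshold recovers all positives at the cost of more false alarms, which is the direction a screening setting would usually prefer.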

---

### Improvements - MLP

In [21]:
#Adam + Early Stopping 
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix

adammlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),   # slightly smaller/deeper can help
    activation='relu',
    solver='adam',
    learning_rate_init=1e-3,       # Adam step size (same as the scikit-learn default)
    alpha=1e-3,                    # L2 regularization to reduce overfitting
    batch_size=32,
    max_iter=1000,                 # increased max_iter
    early_stopping=True,           # use a validation split internally
    validation_fraction=0.15,
    n_iter_no_change=25,          
    tol=1e-4,
    random_state=42
)

adammlp.fit(X_train_ready, y_train)  
y_pred_mlp = adammlp.predict(X_test_ready)                     
y_prob_mlp = adammlp.predict_proba(X_test_ready)[:, 1]         

evaluate_model(y_test, y_pred_mlp, "(Adam + EarlyStopping)")

=== (Adam + EarlyStopping) Evaluation ===
Accuracy : 0.8478260869565217
Precision: 0.9021739130434783
Recall   : 0.8137254901960784
F1 Score : 0.8556701030927835

Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.89      0.84        82
           1       0.90      0.81      0.86       102

    accuracy                           0.85       184
   macro avg       0.85      0.85      0.85       184
weighted avg       0.85      0.85      0.85       184

Confusion Matrix:
 [[73  9]
 [19 83]]




## Multilayer Perceptron (MLP – Adam + EarlyStopping) Evaluation

### Overall Metrics
- **Accuracy**: 84.8%  
- **Precision**: 90.2%  
- **Recall**: 81.4%  
- **F1 Score**: 85.6%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 73           | 9            |
| **Actual: 1** | 19           | 83           |

### Interpretation
- This model achieves **the strongest performance among MLP variants tested so far**, with accuracy of **84.8%** and an F1 score of **85.6%**.  
- **Precision is very high (90.2%)**, meaning that when the model predicts CVD, it is usually correct.  
- **Recall (81.4%)** shows that most CVD cases are detected, though **19 patients were missed** (false negatives).  
- Non-CVD patients are also well recognized (89% correctly identified), with only **9 false positives**.  
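- The baseline MLP also uses Adam, so the gain most likely comes from the added **early stopping**, the stronger L2 regularization (`alpha=1e-3`), and the smaller two-layer architecture, which together help limit overfitting and give a better balance between sensitivity and precision.  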

➡️ Overall, this MLP configuration is **robust and reliable**, offering a balanced trade-off: it reduces false alarms while still detecting the majority of true CVD cases. This makes it a strong candidate for practical use.  
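
The early-stopping settings used above (`n_iter_no_change`, `tol`) describe a patience rule: training stops once the validation score has failed to improve by at least `tol` for a number of consecutive epochs. The sketch below is a simplified illustration of that idea, not scikit-learn's internal code; the function name and the validation scores are made up.

```python
def stopping_epoch(val_scores, patience=25, tol=1e-4):
    """Return (epoch at which training stops, best validation score seen)."""
    best = float("-inf")
    stale = 0  # consecutive epochs without an improvement of at least tol
    for epoch, score in enumerate(val_scores, start=1):
        if score < best + tol:
            stale += 1
        else:
            stale = 0
        if score > best:
            best = score
        if stale >= patience:
            return epoch, best
    return len(val_scores), best  # patience never exhausted

# Made-up validation accuracies per epoch, with a short patience for illustration
toy_scores = [0.70, 0.75, 0.80, 0.80, 0.79, 0.80, 0.78]
stop_at, best_score = stopping_epoch(toy_scores, patience=3)
print(stop_at, best_score)
```

With a patience of 3, the toy run stops at epoch 6 because epochs 4 to 6 never beat the best score of 0.80 by the required margin.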

---

In [22]:
# LBFGS solver - converges fast & well on small datasets
# LBFGS ignores batch_size, early_stopping, learning_rate. It optimizes the full-batch loss.
mlp_lbfgs = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    activation='tanh',         # tanh + lbfgs often works nicely on tabular data
    solver='lbfgs',            # quasi-Newton optimizer
    alpha=1e-3,
    max_iter=1000,
    random_state=42
)

mlp_lbfgs.fit(X_train_ready, y_train)
y_pred_lbfgs = mlp_lbfgs.predict(X_test_ready)
y_prob_lbfgs = mlp_lbfgs.predict_proba(X_test_ready)[:, 1]  
evaluate_model(y_test, y_pred_lbfgs, "MLP (LBFGS) ")

=== MLP (LBFGS)  Evaluation ===
Accuracy : 0.7934782608695652
Precision: 0.8478260869565217
Recall   : 0.7647058823529411
F1 Score : 0.8041237113402062

Classification Report:
               precision    recall  f1-score   support

           0       0.74      0.83      0.78        82
           1       0.85      0.76      0.80       102

    accuracy                           0.79       184
   macro avg       0.79      0.80      0.79       184
weighted avg       0.80      0.79      0.79       184

Confusion Matrix:
 [[68 14]
 [24 78]]




## Multilayer Perceptron (MLP – LBFGS) Evaluation

### Overall Metrics
- **Accuracy**: 79.3%  
- **Precision**: 84.8%  
- **Recall**: 76.5%  
- **F1 Score**: 80.4%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 68           | 14           |
| **Actual: 1** | 24           | 78           |

### Interpretation
- The model achieves **good precision (84.8%)**, so most predicted CVD cases are correct.  
- **Recall is weaker (76.5%)**, resulting in **24 missed CVD cases** (false negatives).  
- For non-CVD patients, 68 out of 82 are classified correctly (83%), while **14 false positives** occur.  
- With accuracy of 79.3% and F1 score of 80.4%, this version shows **modest performance** compared to other MLP configurations.  
- The LBFGS optimizer does not provide clear advantages here, as both sensitivity and accuracy remain limited.  

➡️ Overall, the **MLP (LBFGS)** performs reliably but is **less competitive** than the Adam + EarlyStopping variant, which delivers higher accuracy and recall.  

---

In [23]:
#  Improved MLP pipeline: recall-first tuning  
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import (
    f1_score, recall_score, fbeta_score, make_scorer
)


# 1) Recall-first search (Adam + early_stopping)
base_mlp = MLPClassifier(
    solver="adam",
    early_stopping=True,          # uses internal 15% validation
    validation_fraction=0.15,
    n_iter_no_change=20,
    max_iter=2000,                # allow convergence
    random_state=42
)

param_dist = {
    "hidden_layer_sizes": [(64,), (128,), (64, 32), (128, 64)],
    "activation": ["relu", "tanh"],
    "alpha": [1e-5, 1e-4, 3e-4, 1e-3],
    "learning_rate_init": [1e-3, 5e-4, 3e-4, 1e-4],
    "batch_size": [16, 32, 64],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# multi-metric scoring; refit on recall-oriented F-beta
scoring = {
    "f1": make_scorer(f1_score),
    "recall": make_scorer(recall_score),
    "fbeta2": make_scorer(fbeta_score, beta=2)  # emphasize recall
}

rs = RandomizedSearchCV(
    estimator=base_mlp,
    param_distributions=param_dist,
    n_iter=30,
    scoring=scoring,
    refit="fbeta2",
    cv=cv,
    n_jobs=-1,
    verbose=1,
    random_state=42
)

rs.fit(X_train_ready, y_train)
recallfirst_best_mlp = rs.best_estimator_
print("Best MLP params:", rs.best_params_)

# 2) Evaluation
y_prob = recallfirst_best_mlp.predict_proba(X_test_ready)[:, 1]
y_pred_best_mlp = recallfirst_best_mlp.predict(X_test_ready)
evaluate_model(y_test, y_pred_best_mlp, model_name="Best MLP")

Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best MLP params: {'learning_rate_init': 0.001, 'hidden_layer_sizes': (64, 32), 'batch_size': 16, 'alpha': 0.001, 'activation': 'relu'}
=== Best MLP Evaluation ===
Accuracy : 0.8532608695652174
Precision: 0.9120879120879121
Recall   : 0.8137254901960784
F1 Score : 0.8601036269430051

Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.90      0.85        82
           1       0.91      0.81      0.86       102

    accuracy                           0.85       184
   macro avg       0.85      0.86      0.85       184
weighted avg       0.86      0.85      0.85       184

Confusion Matrix:
 [[74  8]
 [19 83]]




## Multilayer Perceptron (MLP) Model Comparison

| Model                      | Accuracy | Precision | Recall | F1 Score | Key Notes |
|-----------------------------|----------|-----------|--------|----------|-----------|
| **Baseline MLP**            | 0.788    | 0.862     | 0.735  | 0.794    | Decent baseline; 27 false negatives, 12 false positives. |
| **MLP (Adam + EarlyStopping)** | 0.848 | 0.902     | 0.814  | 0.856    | Strong variant; fewer false negatives (19) and false positives (9). |
| **MLP (LBFGS)**             | 0.793    | 0.848     | 0.765  | 0.804    | Modest performance; 24 false negatives, 14 false positives. |
| **Best Tuned MLP**          | 0.853    | 0.912     | 0.814  | 0.860    | Best performer; highest accuracy and precision; 19 false negatives (same as Adam + EarlyStopping) and 8 false positives. |

---

### Interpretation
- **Baseline MLP** provides a reasonable starting point but struggles with recall (73.5%), missing many CVD cases.  
- **MLP (Adam + EarlyStopping)** improves on the baseline across all metrics; early stopping and stronger regularization likely contribute by limiting overfitting.  
- **MLP (LBFGS)** performs similarly to the baseline (accuracy 79.3% vs. 78.8%), so switching the optimizer alone brings no clear improvement on this dataset.  
- **Best Tuned MLP** delivers the **strongest results**, with the highest accuracy (85.3%) and precision (91.2%). Its recall (81.4%) matches the Adam + EarlyStopping model: both miss 19 CVD cases, and the tuned model produces one fewer false positive (8 vs. 9).  

---

### Conclusion
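The **Best Tuned MLP** is the most effective configuration, offering the **highest accuracy and precision**. Its recall (81.4%) equals that of the Adam + EarlyStopping model, so the recall-oriented (F2) search did not raise sensitivity beyond the manually configured variant; the selected hyperparameters differ from that variant mainly in batch size (16 vs. 32) and early-stopping patience (20 vs. 25).  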
The **Adam + EarlyStopping** model follows closely and is also a strong choice, while the **Baseline** and **LBFGS** variants underperform in comparison.  
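
The randomized search refits on F2, an F-beta score with `beta=2` that weights recall more heavily than precision. As a sketch of how that metric ranks the four variants, the code below computes F2 from the test-set confusion-matrix counts reported above; `fbeta_from_counts` is a hypothetical helper, and the search itself scores on cross-validation folds of the training data, not on these test counts.

```python
def fbeta_from_counts(tp, fp, fn, beta=2.0):
    """F-beta from confusion-matrix counts; beta > 1 weights recall more."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# (tp, fp, fn) taken from the test-set confusion matrices above
variants = {
    "Baseline MLP": (75, 12, 27),
    "Adam + EarlyStopping": (83, 9, 19),
    "LBFGS": (78, 14, 24),
    "Best Tuned MLP": (83, 8, 19),
}

f2 = {name: fbeta_from_counts(*counts) for name, counts in variants.items()}
for name, score in f2.items():
    print(f"{name:22s} F2 = {score:.3f}")
```

On this metric the tuned model and the Adam + EarlyStopping model are nearly tied, because they share the same number of false negatives and differ by a single false positive.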

---

In [24]:
#saving best performing MLP Model for fairness evaluation

# Ensure y_test is a Series 
if isinstance(y_test, pd.DataFrame):
    y_test = y_test.squeeze("columns")

# Save model
model_filename = "RecallFirstTunedMLP.pkl"
joblib.dump(recallfirst_best_mlp, model_filename)

# Ensure 1D arrays for y_true
y_true = y_test.to_numpy() if hasattr(y_test, "to_numpy") else np.asarray(y_test)

# Predictions from MLP
y_prob = recallfirst_best_mlp.predict_proba(X_test_ready)[:, 1]
y_pred_best_mlp = recallfirst_best_mlp.predict(X_test_ready)

# Optional gender column if present
if isinstance(X_test, pd.DataFrame) and "Sex" in X_test.columns:
    gender_vals = X_test["Sex"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

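# Build and save results (use the tuned model's predictions computed above,
# not y_prob_mlp / y_pred_mlp, which belong to the Adam + EarlyStopping model)
results = pd.DataFrame({
    "Sex": gender_vals,
    "y_true": y_true,
    "y_prob": y_prob,
    "y_pred": y_pred_best_mlp
})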

preds_filename = "HeartFailureData_50_50_RecallFirstTunedMLP_predictions.csv"
results.to_csv(preds_filename, index=False)

print(f"Saved Recall First Tuned MLP→ {model_filename}")
print(f"Saved predictions → {preds_filename}")


Saved Recall First Tuned MLP→ RecallFirstTunedMLP.pkl
Saved predictions → HeartFailureData_50_50_RecallFirstTunedMLP_predictions.csv
