## CVD Prediction - Heart Failure Prediction Dataset (Source: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction/data)
Model Training and Evaluation

In [1]:
#load preprocessed data 
import pandas as pd
train_df = pd.read_csv("./data_subsets/train_75M_25F.csv")

X_test = pd.read_csv("./data_splits/X_test.csv")
y_test = pd.read_csv("./data_splits/y_test.csv")

#check out the data
train_df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,61,1,3,146.0,241.0,0,0,148.0,1,3.0,0,1
1,52,1,1,120.0,284.0,0,0,118.0,0,0.0,2,0
2,48,0,3,150.0,227.0,0,0,130.0,1,1.0,1,0
3,49,1,3,128.0,212.0,0,0,96.0,1,0.0,1,1
4,56,1,3,120.0,236.0,0,1,148.0,0,0.0,1,1


In [2]:
TARGET = "HeartDisease"
SENSITIVE = "Sex"   # 1 = Male, 0 = Female

categorical_cols = ['Sex','ChestPainType','FastingBS','RestingECG','ExerciseAngina','ST_Slope']
continuous_cols  = ['Age','RestingBP','Cholesterol','MaxHR','Oldpeak']

In [3]:
# Split train into X / y and keep sensitive feature for fairness evaluation
X_train = train_df.drop(columns=[TARGET])
y_train = train_df[TARGET]

In [4]:
# scale numeric features only, fit on train, transform test
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_num_scaled = pd.DataFrame(
    scaler.fit_transform(X_train[continuous_cols]),
    columns=continuous_cols, index=X_train.index
)
X_test_num_scaled = pd.DataFrame(
    scaler.transform(X_test[continuous_cols]),
    columns=continuous_cols, index=X_test.index
)
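The cell above fits the scaler on the training split and reuses the learned mean and standard deviation for the test split, so no test-set statistics leak into preprocessing. A minimal pure-Python sketch of that idea follows. The training values are the five `MaxHR` entries shown in `train_df.head()`; the test values and the helper names `fit_standardizer` and `transform` are hypothetical.

```python
from statistics import fmean, pstdev

def fit_standardizer(values):
    """Learn mean and standard deviation from training values only."""
    mean = fmean(values)
    std = pstdev(values)  # population std (ddof=0), the convention StandardScaler uses
    if std == 0:
        std = 1.0  # avoid division by zero for constant columns
    return mean, std

def transform(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in values]

train_max_hr = [148.0, 118.0, 130.0, 96.0, 148.0]  # from train_df.head()
test_max_hr = [120.0, 160.0]                       # hypothetical test values

mean, std = fit_standardizer(train_max_hr)
train_scaled = transform(train_max_hr, mean, std)
test_scaled = transform(test_max_hr, mean, std)    # reuses the train statistics

print(f"train mean={mean:.1f}, train std={std:.2f}")
print("scaled test values:", [round(v, 3) for v in test_scaled])
```

The scaled training column has mean 0 and unit variance by construction; the test column generally does not, which is expected.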

In [5]:
#one-hot encode categoricals; numeric are kept as is 
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown="ignore", drop="if_binary", sparse_output=False)
ohe.fit(X_train[categorical_cols])

X_train_cat = pd.DataFrame(
    ohe.transform(X_train[categorical_cols]),
    columns=ohe.get_feature_names_out(categorical_cols),
    index=X_train.index
)
X_test_cat = pd.DataFrame(
    ohe.transform(X_test[categorical_cols]),
    columns=ohe.get_feature_names_out(categorical_cols),
    index=X_test.index
)

In [6]:
# Assemble final matrices
X_train_ready = pd.concat([X_train_cat, X_train_num_scaled], axis=1)
X_test_ready  = pd.concat([X_test_cat,  X_test_num_scaled],  axis=1)

print("Final feature shapes:", X_train_ready.shape, X_test_ready.shape)

Final feature shapes: (600, 18) (184, 18)
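The 18 columns reported above can be reproduced by hand. The sketch below assumes the category counts given in the public dataset description (four chest pain types, three resting ECG classes, three ST slope classes); those counts are an assumption, not read from the files used here.

```python
# Distinct categories per column (assumed from the dataset description)
n_categories = {
    "Sex": 2,
    "ChestPainType": 4,
    "FastingBS": 2,
    "RestingECG": 3,
    "ExerciseAngina": 2,
    "ST_Slope": 3,
}
continuous_cols = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]

def encoded_width(k):
    # drop="if_binary": a two-category column collapses to a single 0/1 column,
    # every other column gets one indicator column per category
    return 1 if k == 2 else k

n_onehot = sum(encoded_width(k) for k in n_categories.values())
n_total = n_onehot + len(continuous_cols)
print(f"one-hot columns: {n_onehot}, total features: {n_total}")
```

Under these assumptions the count comes to 13 one-hot columns plus 5 scaled numeric columns, matching the printed shape.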


### Traditional ML Models - Baseline: K-Nearest Neighbors (KNN) & Decision Tree (DT)

In [7]:
#import required libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)

# Helper that prints the headline metrics, the per-class report, and the confusion matrix
def evaluate_model(y_true, y_pred, model_name):
    print(f"=== {model_name} Evaluation ===")
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred, average='binary'))
    print("Recall   :", recall_score(y_true, y_pred, average='binary'))
    print("F1 Score :", f1_score(y_true, y_pred, average='binary'))
    print("\nClassification Report:\n", classification_report(y_true, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
    print("\n" + "="*40 + "\n")

In [8]:
# KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_ready, y_train)
y_pred_knn = knn.predict(X_test_ready)
y_prob_knn = knn.predict_proba(X_test_ready)[:, 1]  
evaluate_model(y_test, y_pred_knn, "KNN")

# Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train_ready, y_train)
y_pred_dt = dt.predict(X_test_ready)
y_prob_dt = dt.predict_proba(X_test_ready)[:, 1]   
evaluate_model(y_test, y_pred_dt, "Decision Tree")

=== KNN Evaluation ===
Accuracy : 0.8369565217391305
Precision: 0.875
Recall   : 0.8235294117647058
F1 Score : 0.8484848484848485

Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.85      0.82        82
           1       0.88      0.82      0.85       102

    accuracy                           0.84       184
   macro avg       0.84      0.84      0.84       184
weighted avg       0.84      0.84      0.84       184

Confusion Matrix:
 [[70 12]
 [18 84]]


=== Decision Tree Evaluation ===
Accuracy : 0.8152173913043478
Precision: 0.8469387755102041
Recall   : 0.8137254901960784
F1 Score : 0.83

Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.82      0.80        82
           1       0.85      0.81      0.83       102

    accuracy                           0.82       184
   macro avg       0.81      0.82      0.81       184
weighted avg       0.82      0.82      0.8

### KNN Evaluation

---

### Performance Metrics (Test Set)

- **Accuracy**: 83.7%  
- **Precision**: 87.5%  
- **Recall**: **82.4%**  
- **F1 Score**: 84.8%  

#### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 70           | 12           |
| **Actual: 1** | 18           | 84           |

- **False Negatives (missed CVD cases)**: **18**  
- **True Positives (correct CVD detections)**: **84**  
- **Recall (class 1 / CVD)** = 84 / (84 + 18) = **82.4%**
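All four headline metrics follow from the confusion matrix counts. As a quick check, this sketch recomputes them for the baseline KNN; the helper name `metrics_from_confusion` is introduced here for illustration only.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive accuracy, precision, recall and F1 from binary confusion counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts from the KNN confusion matrix above
knn_metrics = metrics_from_confusion(tn=70, fp=12, fn=18, tp=84)
for name, value in knn_metrics.items():
    print(f"{name:9s}: {value:.3f}")
```

The same function can be applied to any of the confusion matrices reported later in this notebook.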

---

### Interpretation

- **Precision** is strong (87.5%), indicating most positive predictions are correct.  
- **Recall** at 82.4% shows the model captures most positive cases while missing some (18 FN).  
- Errors lean toward false negatives: **FN (18)** exceed **FP (12)**, so missed CVD cases are the more common mistake.

---

### Decision Tree Evaluation

---

### Performance Metrics (Test Set)

- **Accuracy**: 81.5%  
- **Precision**: 84.7%  
- **Recall**: **81.4%**  
- **F1 Score**: 83.0%  

#### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 67           | 15           |
| **Actual: 1** | 19           | 83           |

- **False Negatives (missed CVD cases)**: **19**  
- **True Positives (correct CVD detections)**: **83**  
- **Recall (class 1 / CVD)** = 83 / (83 + 19) = **81.4%**

---

### Interpretation

- **Precision** above 80% suggests positive flags are generally reliable.  
- **Recall** at 81.4% indicates solid sensitivity with **19 missed positives (FN)**.  
- Errors are split between **FP (15)** and **FN (19)**, with slightly more missed cases than false alarms.

---

### KNN Improvement
The code improves the KNN model by running a **grid search** over key hyperparameters (`n_neighbors`, `weights`, and `metric`) and selecting the configuration with the best cross-validated F1 score. It also computes class-1 probabilities for the selected model (`y_prob_knn_best`), which is the input needed for **decision threshold tuning** to boost recall, a critical property in medical prediction tasks.
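To get a feel for the size of this search, the sketch below enumerates the grid used in the next cell with `itertools.product` and counts the model fits that 5-fold cross-validation implies. It only counts combinations and does not train anything.

```python
from itertools import product

param_grid = {
    "n_neighbors": list(range(1, 31)),
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan", "minkowski"],
}
n_splits = 5  # folds used by StratifiedKFold in the next cell

# Every combination of one value per hyperparameter is one candidate
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]
n_fits = len(candidates) * n_splits

# "minkowski" with the default p=2 is the same distance as "euclidean",
# so those candidates duplicate the euclidean ones
distinct = [c for c in candidates if c["metric"] != "minkowski"]

print(f"candidates: {len(candidates)}, CV fits: {n_fits}, distinct settings: {len(distinct)}")
```

Dropping `"minkowski"` from the grid would therefore cut the search by a third without changing which configurations are actually explored.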

In [9]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score, precision_score, f1_score
import numpy as np

# 1) Hyperparameter tuning for KNN 
param_grid = {
    "n_neighbors": list(range(1, 31)),
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan", "minkowski"],  # minkowski with p=2 is euclidean
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid=param_grid,
    cv=cv,
    scoring="f1",        
    n_jobs=-1,
    verbose=0,
    refit=True
)

# Fit 
grid.fit(X_train_ready, y_train)

print("Best KNN params:", grid.best_params_)
print("Best CV F1:", grid.best_score_)

best_knn = grid.best_estimator_

# 2) Evaluate best KNN on TEST 
y_pred_knn_best = best_knn.predict(X_test_ready)
y_prob_knn_best = best_knn.predict_proba(X_test_ready)[:, 1]  
evaluate_model(y_test, y_pred_knn_best, "KNN (best params)")

Best KNN params: {'metric': 'manhattan', 'n_neighbors': 25, 'weights': 'distance'}
Best CV F1: 0.8909542505768512
=== KNN (best params) Evaluation ===
Accuracy : 0.8641304347826086
Precision: 0.9139784946236559
Recall   : 0.8333333333333334
F1 Score : 0.8717948717948718

Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.90      0.86        82
           1       0.91      0.83      0.87       102

    accuracy                           0.86       184
   macro avg       0.86      0.87      0.86       184
weighted avg       0.87      0.86      0.86       184

Confusion Matrix:
 [[74  8]
 [17 85]]




### Tuned KNN (best params) Evaluation

---

### Performance Metrics (Test Set)

- **Accuracy**: 86.4%  
- **Precision**: 91.4%  
- **Recall**: **83.3%**  
- **F1 Score**: 87.2%  

#### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 74           | 8            |
| **Actual: 1** | 17           | 85           |

- **False Negatives (missed CVD cases)**: **17**  
- **True Positives (correct CVD detections)**: **85**  
- **Recall (class 1 / CVD)** = 85 / (85 + 17) = **83.3%**

---

### Interpretation

- The model is **precision-leaning**: high precision (91.4%) indicates most positive predictions are correct.  
- **Recall at 83.3%** shows it captures most positives, though **some cases are missed (17 FN)**.  
- Errors are distributed as a modest number of **false positives (8)** and **false negatives (17)**.
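Since the classifier's default prediction corresponds roughly to a 0.5 cutoff on the class-1 probability, recall can be traded against precision by lowering that cutoff. The sketch below illustrates the mechanism on hypothetical labels and probabilities; none of the numbers come from this notebook, and the helper names are illustrative.

```python
def predict_at(probs, threshold):
    """Label a case positive when its class-1 probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels and class-1 probabilities, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_prob = [0.92, 0.81, 0.47, 0.35, 0.42, 0.10, 0.55, 0.66, 0.20, 0.58]

results = {}
for threshold in (0.5, 0.4, 0.3):
    precision, recall = precision_recall(y_true, predict_at(y_prob, threshold))
    results[threshold] = (precision, recall)
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

In practice the threshold would be chosen on validation folds of the training data rather than on the test set, so that the reported test metrics stay unbiased.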

---

### Further KNN Improvement - Implementing PCA 
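The pipeline in the next cell passes a fraction (`n_components=0.95`) to PCA, which keeps the smallest number of leading components whose cumulative explained-variance ratio exceeds that fraction. A minimal sketch of this selection rule, using hypothetical variance ratios and an illustrative helper name:

```python
from itertools import accumulate

def n_components_for(ratios, target):
    """Smallest number of leading components whose cumulative ratio exceeds target."""
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative > target:
            return k
    return len(ratios)

# Hypothetical explained-variance ratios, sorted in decreasing order
ratios = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]

k = n_components_for(ratios, 0.95)
print(f"components kept: {k}, variance retained: {sum(ratios[:k]):.2f}")
```

Because components are added whole, the retained variance usually overshoots the target, which is why the run below reports 0.967 rather than exactly 0.95.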

In [10]:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import numpy as np

#1) PCA + KNN pipeline (on one-hot encoded + scaled features)
pca_knn = Pipeline([
    ('pca', PCA(n_components=0.95, random_state=42)),  # keep 95% variance
    ('knn', KNeighborsClassifier(
        n_neighbors=15, metric='manhattan', weights='distance'
    ))
])

pca_knn.fit(X_train_ready, y_train)

# Inspect PCA details
n_comp = pca_knn.named_steps['pca'].n_components_
expl_var = pca_knn.named_steps['pca'].explained_variance_ratio_.sum()
print(f"PCA components: {n_comp} | Explained variance retained: {expl_var:.3f}")

# 2) Evaluate 
y_pred_pca_knn = pca_knn.predict(X_test_ready)
probs_pca_knn = pca_knn.predict_proba(X_test_ready)[:, 1]
  
evaluate_model(y_test, y_pred_pca_knn, "KNN + PCA")

PCA components: 12 | Explained variance retained: 0.967
=== KNN + PCA Evaluation ===
Accuracy : 0.8858695652173914
Precision: 0.9090909090909091
Recall   : 0.8823529411764706
F1 Score : 0.8955223880597015

Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.89      0.87        82
           1       0.91      0.88      0.90       102

    accuracy                           0.89       184
   macro avg       0.88      0.89      0.88       184
weighted avg       0.89      0.89      0.89       184

Confusion Matrix:
 [[73  9]
 [12 90]]




### KNN + PCA Evaluation

---

### Performance Metrics (Test Set)

- **Accuracy**: 88.6%  
- **Precision**: 90.9%  
- **Recall**: **88.2%**  
- **F1 Score**: 89.6%  

#### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 73           | 9            |
| **Actual: 1** | 12           | 90           |

- **False Negatives (missed CVD cases)**: **12**  
- **True Positives (correct CVD detections)**: **90**  
- **Recall (class 1 / CVD)** = 90 / (90 + 12) = **88.2%**

---

### Interpretation

- The model shows **strong, balanced performance** with **slightly higher precision** than recall, indicating confident positive predictions while capturing most positive cases.  
- Errors are split between **false negatives (12)** and **false positives (9)**, consistent with a balanced operating point.

---

**Conclusion**: Appropriate for scenarios requiring a **good balance of sensitivity and precision**.  

---

## KNN Model Comparison

| Model                         | Accuracy | Precision | Recall | F1 Score | Key Notes |
|-------------------------------|----------|-----------|--------|----------|-----------|
| **Baseline KNN**              | 0.837    | 0.875     | 0.824  | 0.848    | Decent balance, but more false negatives (18) |
| **KNN (Best Params)**         | 0.864    | 0.914     | 0.833  | 0.872    | Higher precision, fewer false positives (8), stable recall |
| **KNN + PCA**                 | 0.886    | 0.909     | 0.882  | 0.896    | Best overall: higher recall, fewer false negatives (12), strong balance; uses `n_neighbors=15`, not the grid-search optimum of 25 |

---

### Interpretation
- **Baseline KNN**: Good starting point but recall is limited.  
- **Tuned KNN**: Improves accuracy and precision, reduces false positives.  
- **KNN with PCA**: Highest accuracy and F1, with the best trade-off between recall and precision.  

➡️ **KNN with PCA** is the best-performing version on the test set, leading in accuracy, recall, and F1.

---

In [11]:
#saving best performing KNN Model for fairness evaluation
import joblib, pandas as pd, numpy as np

# Ensure y_test is a Series 
if isinstance(y_test, pd.DataFrame):
    y_test = y_test.squeeze("columns")

# Save model
model_filename = "pca_knn.pkl"
joblib.dump(pca_knn, model_filename)

# Ensure 1D arrays for y_true
y_true = y_test.to_numpy() if hasattr(y_test, "to_numpy") else np.asarray(y_test)

# Predictions from PCA+KNN
y_pred_knn = pca_knn.predict(X_test_ready)
y_prob_knn = pca_knn.predict_proba(X_test_ready)[:, 1]

# Optional gender column if present
if isinstance(X_test, pd.DataFrame) and "Sex" in X_test.columns:
    gender_vals = X_test["Sex"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results
results = pd.DataFrame({
    "Sex": gender_vals,
    "y_true": y_true,
    "y_prob": y_prob_knn,
    "y_pred": y_pred_knn
})

preds_filename = "HeartFailureData_75M25F_PCA_KNN_predictions.csv"
results.to_csv(preds_filename, index=False)


print(f"Saved PCA+KNN model → {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved PCA+KNN model → pca_knn.pkl
Saved predictions → HeartFailureData_75M25F_PCA_KNN_predictions.csv
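The saved predictions file has the columns `Sex`, `y_true`, `y_prob`, and `y_pred`. As a rough sketch of how a downstream fairness check might use it, the code below computes recall separately for each `Sex` group. The rows are hypothetical and embedded as a string so the example runs on its own.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the same layout as the saved predictions file
sample_csv = """Sex,y_true,y_prob,y_pred
1,1,0.91,1
1,1,0.35,0
1,0,0.20,0
1,1,0.77,1
0,1,0.44,0
0,1,0.83,1
0,0,0.62,1
0,1,0.30,0
"""

# Count true positives and false negatives per group (actual positives only)
counts = defaultdict(lambda: {"tp": 0, "fn": 0})
for row in csv.DictReader(io.StringIO(sample_csv)):
    if int(row["y_true"]) == 1:
        key = "tp" if int(row["y_pred"]) == 1 else "fn"
        counts[int(row["Sex"])][key] += 1

recall_by_sex = {sex: c["tp"] / (c["tp"] + c["fn"]) for sex, c in counts.items()}

labels = {1: "Male", 0: "Female"}  # encoding used in this notebook
for sex, recall in sorted(recall_by_sex.items()):
    print(f"{labels[sex]:6s} recall = {recall:.2f}")
```

With the real file, `io.StringIO(sample_csv)` would be replaced by an open handle on the saved CSV; a gap in recall between the two groups is one common signal examined in fairness evaluation.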


### Improvement - Decision Tree (DT)

In [12]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import (
    classification_report, confusion_matrix, accuracy_score,
    precision_score, recall_score, f1_score
)

# 1) Base model
dt = DecisionTreeClassifier(random_state=42)

# 2) Hyperparameter grid 
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, 9, None],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 6, 10],
}

# 3) Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 4) Grid search 
grid_dt = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=cv,
    scoring="f1",      
    n_jobs=-1,
    verbose=0
)

grid_dt.fit(X_train_ready, y_train)

print("Best Decision Tree params:", grid_dt.best_params_)
print("Best CV F1:", grid_dt.best_score_)

# 5) Train & evaluate best DT
tuned_dt = grid_dt.best_estimator_
y_pred_dt_best = tuned_dt.predict(X_test_ready)
y_prob_dt_best = tuned_dt.predict_proba(X_test_ready)[:, 1] 

evaluate_model(y_test, y_pred_dt_best, "Tuned Decision Tree (best params)")

Best Decision Tree params: {'criterion': 'entropy', 'max_depth': 9, 'min_samples_leaf': 2, 'min_samples_split': 2}
Best CV F1: 0.8593494246061409
=== Tuned Decision Tree (best params) Evaluation ===
Accuracy : 0.8097826086956522
Precision: 0.819047619047619
Recall   : 0.8431372549019608
F1 Score : 0.8309178743961353

Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.77      0.78        82
           1       0.82      0.84      0.83       102

    accuracy                           0.81       184
   macro avg       0.81      0.81      0.81       184
weighted avg       0.81      0.81      0.81       184

Confusion Matrix:
 [[63 19]
 [16 86]]




### Tuned Decision Tree (best params) Evaluation

---

### Performance Metrics (Test Set)

- **Accuracy**: 81.0%  
- **Precision**: 81.9%  
- **Recall**: **84.3%**  
- **F1 Score**: 83.1%  

#### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 63           | 19           |
| **Actual: 1** | 16           | 86           |

- **False Negatives (missed CVD cases)**: **16**  
- **True Positives (correct CVD detections)**: **86**  
- **Recall (class 1 / CVD)** = 86 / (86 + 16) = **84.3%**

---

### Interpretation

- The model shows **solid sensitivity** for class 1, correctly identifying most positive cases (TP=86) while missing some (FN=16).  
- **Precision above 80%** indicates a good proportion of predicted positives are correct, limiting unnecessary follow-ups.  
- Overall, the metrics suggest a **slightly recall-leaning** operating point with balanced performance.

---

**Conclusion**: Appropriate when **catching positive cases** is important while keeping precision reasonable.  

---

In [13]:
# Alternative DT tuning: simpler trees + class balancing + cost-complexity pruning
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import numpy as np

# class_weight="balanced" reweights the classes; the restricted grid below favors simpler trees
base_dt = DecisionTreeClassifier(random_state=42, class_weight="balanced")

param_grid_simple = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 4, 5, 6, 7],
    "min_samples_split": [5, 10, 20],
    "min_samples_leaf": [2, 4, 6],
    "min_impurity_decrease": [0.0, 1e-4, 1e-3],  # tiny regularization
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid_simple = GridSearchCV(
    estimator=base_dt,
    param_grid=param_grid_simple,
    cv=cv,
    scoring="recall",    # rank candidates by class-1 recall (sensitivity)
    n_jobs=-1,
    verbose=0,
    refit=True
)
grid_simple.fit(X_train_ready, y_train)

print("Stage A — Best simple DT params:", grid_simple.best_params_)
print("Stage A — Best CV Recall:", grid_simple.best_score_)
simple_dt = grid_simple.best_estimator_

# cost-complexity pruning
path = simple_dt.cost_complexity_pruning_path(X_train_ready, y_train)
ccp_alphas = path.ccp_alphas

unique_alphas = np.unique(np.round(ccp_alphas, 6))
candidate_alphas = np.linspace(unique_alphas.min(), unique_alphas.max(), num=min(20, len(unique_alphas)))
candidate_alphas = np.unique(np.concatenate([candidate_alphas, [0.0]]))  

cv_scores = []
for alpha in candidate_alphas:
    dt_alpha = DecisionTreeClassifier(
        random_state=42,
        class_weight="balanced",
        criterion=simple_dt.criterion,
        max_depth=simple_dt.max_depth,
        min_samples_split=simple_dt.min_samples_split,
        min_samples_leaf=simple_dt.min_samples_leaf,
        min_impurity_decrease=simple_dt.min_impurity_decrease,
        ccp_alpha=alpha
    )
    # mean cross-validated recall for this alpha (scoring="recall", not F1)
    recall_cv = cross_val_score(dt_alpha, X_train_ready, y_train, cv=cv, scoring="recall", n_jobs=-1).mean()
    cv_scores.append((alpha, recall_cv))

best_alpha, best_cv_recall = sorted(cv_scores, key=lambda x: x[1], reverse=True)[0]
print(f"Stage B — Best ccp_alpha: {best_alpha:.6f} | CV Recall: {best_cv_recall:.4f}")

best_dt = DecisionTreeClassifier(
    random_state=42,
    class_weight="balanced",
    criterion=simple_dt.criterion,
    max_depth=simple_dt.max_depth,
    min_samples_split=simple_dt.min_samples_split,
    min_samples_leaf=simple_dt.min_samples_leaf,
    min_impurity_decrease=simple_dt.min_impurity_decrease,
    ccp_alpha=best_alpha
).fit(X_train_ready, y_train)

# Evaluate on test set 
y_pred_dt = best_dt.predict(X_test_ready)
y_prob_dt = best_dt.predict_proba(X_test_ready)[:, 1]  

evaluate_model(y_test, y_pred_dt, "Alternative Tuned & Pruned Decision Tree")

Stage A — Best simple DT params: {'criterion': 'gini', 'max_depth': 4, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 6, 'min_samples_split': 20}
Stage A — Best CV Recall: 0.8633333333333333
Stage B — Best ccp_alpha: 0.000000 | CV Recall: 0.8633
=== Alternative Tuned & Pruned Decision Tree Evaluation ===
Accuracy : 0.8043478260869565
Precision: 0.851063829787234
Recall   : 0.7843137254901961
F1 Score : 0.8163265306122449

Classification Report:
               precision    recall  f1-score   support

           0       0.76      0.83      0.79        82
           1       0.85      0.78      0.82       102

    accuracy                           0.80       184
   macro avg       0.80      0.81      0.80       184
weighted avg       0.81      0.80      0.80       184

Confusion Matrix:
 [[68 14]
 [22 80]]




### Decision Tree (Tuned & Pruned) Evaluation

---

### Performance Metrics (Test Set)

- **Accuracy**: 80.4%  
- **Precision**: 85.1%  
- **Recall**: **78.4%**  
- **F1 Score**: 81.6%  

#### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 68           | 14           |
| **Actual: 1** | 22           | 80           |

- **False Negatives (missed CVD cases)**: **22**  
- **True Positives (correct CVD detections)**: **80**  
- **Recall (class 1 / CVD)** = 80 / (80 + 22) = **78.4%**

---

### Interpretation

- This DT is **precision-leaning**: higher precision (85.1%) than recall (78.4%) means it’s more conservative with positive predictions, at the cost of **more missed positives (22 FN)**.  
- Class 0 is identified reasonably well (TN=68), but false positives (14) remain. Overall, performance is **moderate**.
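In the pruning stage of the cell above, the alpha with the highest cross-validated score is chosen, and ties go to the first candidate in ascending order, that is, the smallest alpha. This may be why Stage B reports `ccp_alpha = 0.0` with the same score as Stage A. The sketch below contrasts that rule with a hypothetical alternative that prefers the most heavily pruned tree among tied scores; the score pairs are invented for illustration.

```python
# Hypothetical (ccp_alpha, mean CV score) pairs in ascending alpha order
cv_scores = [(0.0, 0.8633), (0.002, 0.8633), (0.005, 0.8633), (0.010, 0.8510)]

# Selection as in the cell above: highest score, first entry wins on ties
# (Python's sort is stable, also with reverse=True)
first_best = sorted(cv_scores, key=lambda x: x[1], reverse=True)[0]

# Hypothetical alternative: among tied scores, prefer the largest alpha,
# i.e. the simplest tree that still reaches the best score
simplest_best = max(cv_scores, key=lambda x: (x[1], x[0]))

print("first-best rule :", first_best)
print("simplest-best   :", simplest_best)
```

Whether the simpler tree generalizes better would still need to be checked; the sketch only shows how the tie-breaking rule changes which alpha is picked.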

---

In [14]:
# Alternative DT tuning focused on higher recall
# Changes vs previous:
#  - Remove calibration (predict uses raw tree probs at 0.5)
#  - Tune class_weight (heavier positive weights allowed)
#  - Broaden depth a bit but keep regularization via min_samples_* and tiny impurity decrease
#  - Prune only with very small ccp_alphas to avoid killing recall

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, cross_val_score
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import numpy as np

# Simpler-but-expressive trees + tuned class weights
base_dt = DecisionTreeClassifier(random_state=42)

param_grid_simple = {
    "criterion": ["gini", "entropy"],                  # add "log_loss" if your sklearn supports it
    "max_depth": [4, 5, 6, 7, 8, 9, 10],               # a bit deeper to help recall
    "min_samples_split": [5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 6],
    "min_impurity_decrease": [0.0, 1e-4, 1e-3],
    "class_weight": ["balanced", {0:1,1:2}, {0:1,1:3}, {0:1,1:4}],  # stronger push toward positives
}

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

grid_simple = GridSearchCV(
    estimator=base_dt,
    param_grid=param_grid_simple,
    cv=cv,
    scoring="recall",      # prioritize sensitivity for class 1
    n_jobs=-1,
    verbose=0,
    refit=True
)
grid_simple.fit(X_train_ready, y_train)

best_params = grid_simple.best_params_
print("Stage A — Best DT params:", best_params)
print("Stage A — Best CV Recall:", round(grid_simple.best_score_, 4))

# Train a zero-pruned model with best params to get the pruning path
dt0 = DecisionTreeClassifier(random_state=42, **best_params, ccp_alpha=0.0).fit(X_train_ready, y_train)


# Stage B — Gentle cost-complexity pruning (favor small alphas)
path = dt0.cost_complexity_pruning_path(X_train_ready, y_train)
ccp_alphas = path.ccp_alphas

# Focus on tiny alphas only + 0.0 to avoid big recall loss
small_slice = ccp_alphas[: min(30, len(ccp_alphas))]  # ccp_alphas is in increasing order, so these are the smallest values
candidate_alphas = np.unique(np.r_[0.0, small_slice])

cv_scores = []
for alpha in candidate_alphas:
    dt_alpha = DecisionTreeClassifier(random_state=42, **best_params, ccp_alpha=alpha)
    rec = cross_val_score(dt_alpha, X_train_ready, y_train, cv=cv, scoring="recall", n_jobs=-1).mean()
    cv_scores.append((alpha, rec))

best_alpha, best_cv_recall = max(cv_scores, key=lambda x: x[1])
print(f"Stage B — Best ccp_alpha: {best_alpha:.6f} | CV Recall: {best_cv_recall:.4f}")

alt_best_dt = DecisionTreeClassifier(random_state=42, **best_params, ccp_alpha=best_alpha).fit(X_train_ready, y_train)


# Evaluation
y_pred = alt_best_dt.predict(X_test_ready)               
y_prob = alt_best_dt.predict_proba(X_test_ready)[:, 1]   

evaluate_model(y_test, y_pred, "Alternative Tuned & Pruned Decision Tree")

Stage A — Best DT params: {'class_weight': {0: 1, 1: 4}, 'criterion': 'entropy', 'max_depth': 4, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 6, 'min_samples_split': 5}
Stage A — Best CV Recall: 0.97
Stage B — Best ccp_alpha: 0.006006 | CV Recall: 0.9717
=== Alternative Tuned & Pruned Decision Tree Evaluation ===
Accuracy : 0.7880434782608695
Precision: 0.7692307692307693
Recall   : 0.8823529411764706
F1 Score : 0.821917808219178

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.67      0.74        82
           1       0.77      0.88      0.82       102

    accuracy                           0.79       184
   macro avg       0.80      0.78      0.78       184
weighted avg       0.79      0.79      0.78       184

Confusion Matrix:
 [[55 27]
 [12 90]]




## Decision Tree Model Comparison

| Model                                      | Accuracy | Precision | Recall | F1 Score | Key Notes |
|--------------------------------------------|----------|-----------|--------|----------|-----------|
| **Baseline Decision Tree**                  | 0.815    | 0.847     | 0.814  | 0.830    | Balanced but moderate errors (15 FP, 19 FN). |
| **Tuned Decision Tree (best params)**       | 0.810    | 0.819     | 0.843  | 0.831    | Recall improves slightly, but precision drops; more false positives (19). |
| **Alt. Tuned & Pruned DT (Recall-focused)** | 0.788    | 0.769     | 0.882  | 0.822    | Highest recall, but lowest precision; many false positives (27), fewer false negatives (12). |
| **Alt. Tuned & Pruned DT (Balanced)**       | 0.804    | 0.851     | 0.784  | 0.816    | Strong precision, fewer false positives (14), but recall is lower, meaning more missed cases (22). |

---

### Interpretation
- **Baseline DT** provides the most stable and balanced performance (Acc 81.5%, Prec 84.7%, Rec 81.4%).  
- **Tuned DT** raises recall slightly (+2.9 percentage points) but loses 2.8 points of precision, so the net gain is limited.  
- **Alt. DT (Recall-focused)** achieves the **highest recall (88.2%)**, but the trade-off in **accuracy (78.8%)** and **precision (76.9%)** is too steep, generating many false positives (27).  
- **Alt. DT (Balanced)** improves precision to 85.1% with fewer false alarms, but its recall (78.4%) drops too much, missing more true cases (22).  
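The trade-offs described above are easier to compare as differences from the baseline in percentage points. This sketch recomputes them from the values in the comparison table; the shortened model names are used only as dictionary keys.

```python
# Test-set metrics copied from the Decision Tree comparison table
dt_models = {
    "Baseline DT": {"accuracy": 0.815, "precision": 0.847, "recall": 0.814, "f1": 0.830},
    "Tuned DT": {"accuracy": 0.810, "precision": 0.819, "recall": 0.843, "f1": 0.831},
    "Alt. DT (Recall-focused)": {"accuracy": 0.788, "precision": 0.769, "recall": 0.882, "f1": 0.822},
    "Alt. DT (Balanced)": {"accuracy": 0.804, "precision": 0.851, "recall": 0.784, "f1": 0.816},
}

baseline = dt_models["Baseline DT"]

# Difference from the baseline in percentage points, rounded to one decimal
deltas = {
    name: {metric: round((values[metric] - baseline[metric]) * 100, 1) for metric in values}
    for name, values in dt_models.items()
    if name != "Baseline DT"
}

for name, delta in deltas.items():
    summary = ", ".join(f"{metric} {value:+.1f}" for metric, value in delta.items())
    print(f"{name}: {summary}")
```

Each variant buys its gain in one metric with a loss in another, which is the pattern the interpretation above describes.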

---

➡️ **Conclusion**: The **Recall-focused DT** is not recommended despite its sensitivity: it gains 6.8 points of recall over the baseline but produces 27 false positives. Among the DT variants, the **Baseline DT** is the most balanced, while **KNN with PCA** (Acc 88.6%, Rec 88.2%) matches or exceeds every DT variant in recall and beats them all in accuracy. For fairness evaluation, the **Tuned Decision Tree** is chosen: it **improves recall** (81.4% → 84.3%), which is critical in CVD prediction, while accuracy (81.5% → 81.0%) and precision (84.7% → 81.9%) decrease only slightly compared to the baseline.

---

In [15]:
#saving best performing DT Model for fairness evaluation

# Ensure y_test is a Series 
if isinstance(y_test, pd.DataFrame):
    y_test = y_test.squeeze("columns")

# Save model
model_filename = "tuned_dt.tpkl"
joblib.dump(tuned_dt, model_filename)

# Ensure 1D arrays for y_true
y_true = y_test.to_numpy() if hasattr(y_test, "to_numpy") else np.asarray(y_test)

# Predictions from Decision Tree 
y_pred_dt = y_pred_dt_best
y_prob_dt = y_prob_dt_best

# Optional gender column if present
if isinstance(X_test, pd.DataFrame) and "Sex" in X_test.columns:
    gender_vals = X_test["Sex"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results
results = pd.DataFrame({
    "Sex": gender_vals,
    "y_true": y_true,
    "y_prob": y_prob_dt,
    "y_pred": y_pred_dt
})

preds_filename = "HeartFailureData_75M25F_TunedDT_predictions.csv"
results.to_csv(preds_filename, index=False)

print(f"Saved tuned Decision Tree → {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved tuned Decision Tree → tuned_dt.tpkl
Saved predictions → HeartFailureData_75M25F_TunedDT_predictions.csv


### Ensemble Model - Random Forest (RF)

In [16]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest
rf = RandomForestClassifier(random_state=42)

# Train the model
rf.fit(X_train_ready, y_train)

# Predict on test set
y_pred_rf = rf.predict(X_test_ready)
y_prob_rf = rf.predict_proba(X_test_ready)[:, 1]  
evaluate_model(y_test, y_pred_rf, "Random Forest")

=== Random Forest Evaluation ===
Accuracy : 0.8804347826086957
Precision: 0.8703703703703703
Recall   : 0.9215686274509803
F1 Score : 0.8952380952380953

Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.83      0.86        82
           1       0.87      0.92      0.90       102

    accuracy                           0.88       184
   macro avg       0.88      0.88      0.88       184
weighted avg       0.88      0.88      0.88       184

Confusion Matrix:
 [[68 14]
 [ 8 94]]




### Random Forest Evaluation

---

### Performance Metrics 

- **Accuracy**: 88.0%  
- **Precision**: 87.0%  
- **Recall**: **92.2%**  
- **F1 Score**: 89.5%  

#### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 68           | 14           |
| **Actual: 1** | 8            | 94           |

- **False Negatives (missed CVD cases)**: **8**  
- **True Positives (correct CVD detections)**: **94**  
- **Recall (class 1 / CVD)** = 94 / (94 + 8) = **92.2%**

---

### Interpretation

- The Random Forest is **recall-oriented for CVD (class 1)**: it catches most positives (only 8 FN).  
- Trade-off: **more false positives** on class 0 (14 FP), consistent with strong recall and slightly lower precision.

---

### Conclusion

- This Random Forest is a **strong baseline** with **high sensitivity** and solid precision.  

---


### Improvement - Random Forest (RF)

In [17]:
# Random Forest: hyperparameter tuning

# 1) GridSearchCV over impactful RF params
rf = RandomForestClassifier(random_state=42)

param_grid = {
    "n_estimators": [200, 400, 600],
    "max_depth": [None, 8, 12, 16],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", 0.8],  # 0.8 = 80% of features
    "class_weight": [None, "balanced"]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=cv,
    scoring="recall",          
    n_jobs=-1,
    verbose=1,
    refit=True
)

grid.fit(X_train_ready, y_train)
best_rf = grid.best_estimator_
print("Best RF params:", grid.best_params_)
print("Best CV Recall:", grid.best_score_)  # scoring="recall", so best_score_ is recall

# 2) Evaluate best RF 
y_pred = best_rf.predict(X_test_ready)
y_prob = best_rf.predict_proba(X_test_ready)[:, 1]
evaluate_model(y_test, y_pred, "Tuned Random Forest")

Fitting 5 folds for each of 648 candidates, totalling 3240 fits
Best RF params: {'class_weight': None, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 200}
Best CV Recall: 0.9133333333333333
=== Tuned Random Forest Evaluation ===
Accuracy : 0.842391304347826
Precision: 0.8348623853211009
Recall   : 0.8921568627450981
F1 Score : 0.8625592417061612

Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.78      0.82        82
           1       0.83      0.89      0.86       102

    accuracy                           0.84       184
   macro avg       0.84      0.84      0.84       184
weighted avg       0.84      0.84      0.84       184

Confusion Matrix:
 [[64 18]
 [11 91]]




### Tuned Random Forest Evaluation

---

### Performance Metrics

- **Accuracy**: 84.2%  
- **Precision**: 83.5%  
- **Recall**: **89.2%**  
- **F1 Score**: 86.3%  

#### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 64           | 18           |
| **Actual: 1** | 11           | 91           |

- **False Negatives (missed CVD cases)**: **11**  
- **True Positives (correct CVD detections)**: **91**  
- **Recall (class 1 / CVD)** = 91 / (91 + 11) = **89.2%**

---

➡️ **Interpretation**: The tuned Random Forest still identifies most CVD cases (**recall** 89.2%, 11 false negatives), which is the priority in CVD prediction. However, it trails the baseline's 92.2% recall and produces more false positives (18 versus 14). It remains a reasonable candidate for fairness evaluation, but it does not improve on the baseline for any metric reported here.


---

## Random Forest Model Comparison

| Model                      | Accuracy | Precision | Recall | F1 Score | Key Notes |
|-----------------------------|----------|-----------|--------|----------|-----------|
| **Baseline RF**             | 0.880    | 0.870     | 0.922  | 0.895    | Strong performance; high recall (92.2%) with balanced precision; only 8 false negatives. |
| **Tuned RF (best params)**  | 0.842    | 0.835     | 0.892  | 0.863    | Recall remains strong (89.2%) but accuracy and precision drop; more false positives (18). |

---

### Interpretation
- **Baseline RF** achieves **the strongest overall balance**: high accuracy (88.0%) and excellent recall (92.2%), minimizing missed CVD cases.  
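- **Tuned RF** keeps recall fairly high but loses accuracy (-3.8 percentage points) and precision (-3.5 percentage points), making it less reliable.  
- Unlike DT, tuning **did not improve test performance**: the grid search selected its parameters on cross-validated recall (0.913), and that gain did not carry over to the held-out test set (recall 0.892 versus 0.922 for the baseline).  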
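
One way to judge how much to trust a grid-search winner is to look at the fold-to-fold spread of the top candidates in `cv_results_`. The sketch below is an illustration only: it uses synthetic data from `make_classification` and a deliberately tiny grid (both assumptions, chosen so it runs quickly); in this notebook the fitted `grid` object would be inspected instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (assumption); the notebook would use X_train_ready / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

demo_grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [25, 50], "min_samples_split": [2, 10]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="recall",
)
demo_grid.fit(X_demo, y_demo)

# Rank candidates by mean CV recall and show the standard deviation across folds
cv_table = (
    pd.DataFrame(demo_grid.cv_results_)[
        ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(cv_table)
```

If the gap between the best candidates is smaller than their standard deviation across folds, the ranking is largely noise, and a drop on the test set is not surprising.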

---

### Ranking
1. **Baseline RF** – Best choice: excellent recall, solid accuracy, and reliable F1.  
2. **Tuned RF** – Still performs well, but worse than baseline due to extra false positives.  

➡️ **Conclusion**: Random Forest is one of the **top-performing models** overall, comparable to **KNN with PCA**. Both achieve high recall and accuracy, which is crucial for CVD detection.  

---

In [18]:
#saving best performing RF Model for fairness evaluation
import joblib
import numpy as np

# Ensure y_test is a Series 
if isinstance(y_test, pd.DataFrame):
    y_test = y_test.squeeze("columns")

# Ensure 1D arrays for y_true
y_true = y_test.to_numpy() if hasattr(y_test, "to_numpy") else np.asarray(y_test)

# `rf` was re-created (unfitted) in the tuning cell, and GridSearchCV only fits clones,
# so fit it BEFORE saving; otherwise an unfitted model is written to disk
rf.fit(X_train_ready, y_train)

# Save the fitted model
model_filename = "baseline_rf.pkl"
joblib.dump(rf, model_filename)

# Predictions from Random Forest
y_pred_rf = rf.predict(X_test_ready)
y_prob_rf = rf.predict_proba(X_test_ready)[:, 1]  

# Optional gender column if present
if isinstance(X_test, pd.DataFrame) and "Sex" in X_test.columns:
    gender_vals = X_test["Sex"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results
results = pd.DataFrame({
    "Sex": gender_vals,
    "y_true": y_true,
    "y_prob": y_prob_rf,
    "y_pred": y_pred_rf
})

preds_filename = "HeartFailureData_75M25F_BaselineRF_predictions.csv"
results.to_csv(preds_filename, index=False)

print(f"Saved Baseline RF → {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved Baseline RF → baseline_rf.pkl
Saved predictions → HeartFailureData_75M25F_BaselineRF_predictions.csv
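
A pickle stores whatever state the estimator has at dump time, so it is worth confirming that the saved file holds a fitted model that reproduces the original predictions. The following is a minimal round-trip sketch on synthetic data; the file name `demo_rf.pkl` and the demo variables are placeholders, not files produced by this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.validation import check_is_fitted

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_rf.pkl")
    joblib.dump(model, path)
    reloaded = joblib.load(path)

check_is_fitted(reloaded)  # raises NotFittedError if the saved model was never fitted
same_predictions = np.array_equal(model.predict(X_demo), reloaded.predict(X_demo))
print("Reloaded model reproduces predictions:", same_predictions)
```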


### Deep Learning - Multi-layer Perceptron

In [19]:
#import required library 
from sklearn.neural_network import MLPClassifier

In [20]:
# Initialize MLP model
mlp = MLPClassifier(
    hidden_layer_sizes=(100,),   # one hidden layer with 100 neurons
    activation='relu',           # or 'tanh'
    solver='adam',               # optimizer
    max_iter=1000,                # increase if convergence warning appears
    random_state=42
)

# Train the model
mlp.fit(X_train_ready, y_train)

# Predict
y_pred_mlp = mlp.predict(X_test_ready)
y_prob = mlp.predict_proba(X_test_ready)[:, 1]

evaluate_model(y_test, y_pred_mlp, "Multilayer Perceptron (MLP)")

=== Multilayer Perceptron (MLP) Evaluation ===
Accuracy : 0.8043478260869565
Precision: 0.8235294117647058
Recall   : 0.8235294117647058
F1 Score : 0.8235294117647058

Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.78      0.78        82
           1       0.82      0.82      0.82       102

    accuracy                           0.80       184
   macro avg       0.80      0.80      0.80       184
weighted avg       0.80      0.80      0.80       184

Confusion Matrix:
 [[64 18]
 [18 84]]




## Multilayer Perceptron (MLP) Evaluation

---

### Performance Metrics (Test Set)

- **Accuracy**: 80.4%  
- **Precision**: 82.4%  
- **Recall**: **82.4%**  
- **F1 Score**: 82.4%  

#### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 64           | 18           |
| **Actual: 1** | 18           | 84           |

- **False Negatives (missed CVD cases)**: **18**  
- **True Positives (correct CVD detections)**: **84**  
- **Recall (class 1 / CVD)** = 84 / (84 + 18) = **82.4%**

---

### Interpretation

- The model operates at a **balanced point**: precision and recall are identical (82.4%) because the number of false positives equals the number of false negatives.  
- The error profile shows **18 false negatives** and **18 false positives**, so missed CVD cases and false alarms are equally frequent.

---

### Improvements - MLP

In [21]:
#Adam + Early Stopping 
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix

adammlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),   # slightly smaller/deeper can help
    activation='relu',
    solver='adam',
    learning_rate_init=1e-3,       # smaller step can stabilize
    alpha=1e-3,                    # L2 regularization to reduce overfitting
    batch_size=32,
    max_iter=1000,                 # increased max_iter
    early_stopping=True,           # use a validation split internally
    validation_fraction=0.15,
    n_iter_no_change=25,          
    tol=1e-4,
    random_state=42
)

adammlp.fit(X_train_ready, y_train)  
y_pred_mlp = adammlp.predict(X_test_ready)                     
y_prob_mlp = adammlp.predict_proba(X_test_ready)[:, 1]         

evaluate_model(y_test, y_pred_mlp, "(Adam + EarlyStopping)")

=== (Adam + EarlyStopping) Evaluation ===
Accuracy : 0.8586956521739131
Precision: 0.8877551020408163
Recall   : 0.8529411764705882
F1 Score : 0.87

Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.87      0.85        82
           1       0.89      0.85      0.87       102

    accuracy                           0.86       184
   macro avg       0.86      0.86      0.86       184
weighted avg       0.86      0.86      0.86       184

Confusion Matrix:
 [[71 11]
 [15 87]]




### MLP (Adam + EarlyStopping) Evaluation

---

### Performance Metrics (Test Set)

- **Accuracy**: 85.9%  
- **Precision**: 88.8%  
- **Recall**: **85.3%**  
- **F1 Score**: 87.0%  

#### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 71           | 11           |
| **Actual: 1** | 15           | 87           |

- **False Negatives (missed CVD cases)**: **15**  
- **True Positives (correct CVD detections)**: **87**  
- **Recall (class 1 / CVD)** = 87 / (87 + 15) = **85.3%**

---

### Interpretation

- The model operates at a **slightly precision-leaning** point: positive predictions are generally reliable while still capturing most positives.  
- The error profile shows **non-trivial** errors on both sides, with more false negatives (15) than false positives (11): the model gains precision at the cost of some missed CVD cases.
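
Because `predict` applies a fixed 0.5 cut-off, the balance between false negatives and false positives can also be shifted after training by thresholding the predicted probabilities. The sketch below illustrates the idea on synthetic data with a `LogisticRegression` stand-in (both assumptions, chosen to keep it fast); in this notebook the same loop could be applied to `y_prob_mlp`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and model (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

# Lowering the threshold labels more cases positive: recall can only stay or rise
threshold_report = {}
for threshold in (0.5, 0.4, 0.3):
    preds = (probs >= threshold).astype(int)
    threshold_report[threshold] = {
        "precision": precision_score(y_te, preds, zero_division=0),
        "recall": recall_score(y_te, preds),
    }
    print(
        f"threshold={threshold:.1f}  "
        f"precision={threshold_report[threshold]['precision']:.3f}  "
        f"recall={threshold_report[threshold]['recall']:.3f}"
    )
```

In practice the threshold should be chosen on validation data rather than on the test set, so that the reported test metrics stay unbiased.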

---

In [22]:
# LBFGS solver - converges fast & well on small datasets
# LBFGS ignores batch_size, early_stopping, learning_rate. It optimizes the full-batch loss.
mlp_lbfgs = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    activation='tanh',         # tanh + lbfgs often works nicely on tabular data
    solver='lbfgs',            # quasi-Newton optimizer
    alpha=1e-3,
    max_iter=1000,
    random_state=42
)

mlp_lbfgs.fit(X_train_ready, y_train)
y_pred_lbfgs = mlp_lbfgs.predict(X_test_ready)
y_prob_lbfgs = mlp_lbfgs.predict_proba(X_test_ready)[:, 1]  
evaluate_model(y_test, y_pred_lbfgs, "MLP (LBFGS) ")

=== MLP (LBFGS)  Evaluation ===
Accuracy : 0.8097826086956522
Precision: 0.8316831683168316
Recall   : 0.8235294117647058
F1 Score : 0.8275862068965517

Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.79      0.79        82
           1       0.83      0.82      0.83       102

    accuracy                           0.81       184
   macro avg       0.81      0.81      0.81       184
weighted avg       0.81      0.81      0.81       184

Confusion Matrix:
 [[65 17]
 [18 84]]




### MLP (LBFGS) Evaluation

---

### Performance Metrics (Test Set)

- **Accuracy**: 81.0%  
- **Precision**: 83.2%  
- **Recall**: **82.4%**  
- **F1 Score**: 82.8%  

#### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 65           | 17           |
| **Actual: 1** | 18           | 84           |

- **False Negatives (missed CVD cases)**: **18**  
- **True Positives (correct CVD detections)**: **84**  
- **Recall (class 1 / CVD)** = 84 / (84 + 18) = **82.4%**

---

### Interpretation

- The model operates at a **balanced point**, with precision (83.2%) and recall (82.4%) both in the low 80s.  
- Errors are relatively even: **17 false positives** and **18 false negatives**, indicating a symmetric trade-off between missed positives and false alarms.

---

### Further Improvement MLP 

In [23]:
# Improved MLP pipeline: recall-first tuning  
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import (
    f1_score, recall_score, fbeta_score, make_scorer
)


# 1) Recall-first search (Adam + early_stopping)
base_mlp = MLPClassifier(
    solver="adam",
    early_stopping=True,          # uses internal 15% validation
    validation_fraction=0.15,
    n_iter_no_change=20,
    max_iter=2000,                
    random_state=42
)

param_dist = {
    "hidden_layer_sizes": [(64,), (128,), (64, 32), (128, 64)],
    "activation": ["relu", "tanh"],
    "alpha": [1e-5, 1e-4, 3e-4, 1e-3],
    "learning_rate_init": [1e-3, 5e-4, 3e-4, 1e-4],
    "batch_size": [16, 32, 64],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# multi-metric scoring; refit on recall-oriented F-beta
scoring = {
    "f1": make_scorer(f1_score),
    "recall": make_scorer(recall_score),
    "fbeta2": make_scorer(fbeta_score, beta=2)  # emphasize recall
}

rs = RandomizedSearchCV(
    estimator=base_mlp,
    param_distributions=param_dist,
    n_iter=30,
    scoring=scoring,
    refit="fbeta2",
    cv=cv,
    n_jobs=-1,
    verbose=1,
    random_state=42
)

rs.fit(X_train_ready, y_train)
recall_best_mlp = rs.best_estimator_
print("Best MLP params:", rs.best_params_)

# 2) Evaluation
y_prob = recall_best_mlp.predict_proba(X_test_ready)[:, 1]
y_pred_best_mlp = recall_best_mlp.predict(X_test_ready)
evaluate_model(y_test, y_pred_best_mlp, model_name="Best MLP")

Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best MLP params: {'learning_rate_init': 0.0001, 'hidden_layer_sizes': (128,), 'batch_size': 16, 'alpha': 0.0001, 'activation': 'relu'}
=== Best MLP Evaluation ===
Accuracy : 0.8478260869565217
Precision: 0.87
Recall   : 0.8529411764705882
F1 Score : 0.8613861386138614

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.84      0.83        82
           1       0.87      0.85      0.86       102

    accuracy                           0.85       184
   macro avg       0.85      0.85      0.85       184
weighted avg       0.85      0.85      0.85       184

Confusion Matrix:
 [[69 13]
 [15 87]]
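
The search above refits on F-beta with beta = 2, which weights recall more heavily than precision. As a rough sketch of what that score means for this result, the F1 and F2 values can be computed directly from the confusion-matrix counts and cross-checked against scikit-learn; the helper `fbeta_from_counts` and the rebuilt label vectors are illustrative, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import fbeta_score


def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta from counts: (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP)."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Counts from the "Best MLP" confusion matrix above
tn, fp, fn, tp = 69, 13, 15, 87
f1_counts = fbeta_from_counts(tp, fp, fn, beta=1)
f2_counts = fbeta_from_counts(tp, fp, fn, beta=2)

# Rebuild label vectors with the same counts to cross-check against scikit-learn
y_true_demo = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred_demo = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)
f2_sklearn = fbeta_score(y_true_demo, y_pred_demo, beta=2)

print(f"F1 from counts: {f1_counts:.4f}")
print(f"F2 from counts: {f2_counts:.4f}  (scikit-learn: {f2_sklearn:.4f})")
```

With beta = 2, each false negative counts four times as much as a false positive in the denominator, which is why this score suits a recall-first search.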




## Multilayer Perceptron (MLP) Model Comparison

| Model                     | Accuracy | Precision | Recall | F1 Score | Key Notes |
|----------------------------|----------|-----------|--------|----------|-----------|
| **Baseline MLP**           | 0.804    | 0.824     | 0.824  | 0.824    | Moderate performance; balanced but weaker than other models (18 FP, 18 FN). |
| **MLP (Adam + EarlyStopping)** | 0.859 | 0.888     | 0.853  | 0.870    | Strongest variant; higher accuracy and F1, reduced errors (11 FP, 15 FN). |
| **MLP (LBFGS)**            | 0.810    | 0.832     | 0.824  | 0.828    | Similar to baseline; small improvements but still modest overall. |
| **Best MLP (Randomized Search)** | 0.848 | 0.870     | 0.853  | 0.861    | Tuned via `RandomizedSearchCV`; close to the EarlyStopping variant, but slightly lower accuracy. |

---

### Interpretation
- **Baseline MLP** starts off with modest performance (80-82% across all metrics), weaker than the tree-based and KNN models.  
- **MLP with Adam + EarlyStopping** performs **best overall** (Acc 85.9%, F1 0.87). This is consistent with the added L2 regularization and early stopping reducing overfitting, although overfitting was not measured directly here.  
- **MLP with LBFGS** brings little benefit compared to the baseline. Because this variant also changes the architecture and activation, the result cannot be attributed to the optimizer alone.  
- **Best MLP (Randomized Search)** is competitive (Acc 84.8%, F1 0.861) and confirms that tuning improves on the baseline MLP, but it still slightly underperforms the **Adam + EarlyStopping** variant.  
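
Whether early stopping actually curbs overfitting can be checked by comparing train and test accuracy and by looking at the attributes `MLPClassifier` records during training. The sketch below is a hedged illustration on synthetic data (an assumption); the hyperparameters mirror the Adam + EarlyStopping cell, and in this notebook the fitted `adammlp` could be inspected the same way.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=11, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

demo_mlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    alpha=1e-3,
    batch_size=32,
    early_stopping=True,        # holds out part of the training data internally
    validation_fraction=0.15,
    n_iter_no_change=25,
    max_iter=1000,
    random_state=42,
)
demo_mlp.fit(X_tr, y_tr)

train_acc = demo_mlp.score(X_tr, y_tr)
test_acc = demo_mlp.score(X_te, y_te)
print("Epochs run            :", demo_mlp.n_iter_)
print("Best validation score :", round(demo_mlp.best_validation_score_, 3))
print("Train / test accuracy :", round(train_acc, 3), "/", round(test_acc, 3))
print("Generalization gap    :", round(train_acc - test_acc, 3))
```

A small gap between train and test accuracy, together with training stopping well before `max_iter`, would support the overfitting explanation; a single seed is only weak evidence, so repeating this over several `random_state` values is advisable.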

---

➡️ **Conclusion**: MLP can be tuned to perform reasonably well, but even its best variant does not surpass **Random Forest** or **KNN with PCA**, which both achieve higher recall and accuracy. **MLP (Adam + EarlyStopping)** is the variant carried forward to the fairness evaluation.

---

In [24]:
#saving best performing MLP Model for fairness evaluation

# Ensure y_test is a Series 
if isinstance(y_test, pd.DataFrame):
    y_test = y_test.squeeze("columns")

# Save model
model_filename = "adamMLP.pkl"
joblib.dump(adammlp, model_filename)

# Ensure 1D arrays for y_true
y_true = y_test.to_numpy() if hasattr(y_test, "to_numpy") else np.asarray(y_test)

# Predictions from MLP
y_pred_mlp = adammlp.predict(X_test_ready)                     
y_prob_mlp = adammlp.predict_proba(X_test_ready)[:, 1]   

# Optional gender column if present
if isinstance(X_test, pd.DataFrame) and "Sex" in X_test.columns:
    gender_vals = X_test["Sex"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results
results = pd.DataFrame({
    "Sex": gender_vals,
    "y_true": y_true,
    "y_prob": y_prob_mlp,
    "y_pred": y_pred_mlp
})

preds_filename = "HeartFailureData_75M25F_AdamMLP_predictions.csv"
results.to_csv(preds_filename, index=False)

print(f"Saved Adam MLP→ {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved Adam MLP→ adamMLP.pkl
Saved predictions → HeartFailureData_75M25F_AdamMLP_predictions.csv
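
The saved predictions keep the `Sex` column so that metrics can later be broken down by group. As a minimal sketch of that downstream step, the example below computes recall separately for each value of `Sex`; the small hand-made table is a stand-in (an assumption), not the notebook's actual predictions, and the real CSV would be loaded with `pd.read_csv(preds_filename)`.

```python
import pandas as pd
from sklearn.metrics import recall_score

# Tiny hand-made stand-in with the same columns as the saved CSV (assumption)
results_demo = pd.DataFrame({
    "Sex":    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],   # 1 = Male, 0 = Female
    "y_true": [1, 1, 1, 0, 0, 1, 1, 1, 0, 0],
    "y_pred": [1, 1, 0, 0, 1, 1, 1, 0, 0, 0],
})

# Recall per group: share of actual CVD cases detected within each sex
recall_by_sex = {
    sex: recall_score(group["y_true"], group["y_pred"])
    for sex, group in results_demo.groupby("Sex")
}
recall_gap = recall_by_sex[1] - recall_by_sex[0]

for sex, value in recall_by_sex.items():
    print(f"Sex={sex}: recall={value:.2f}")
print(f"Recall gap (male - female): {recall_gap:.2f}")
```

A gap in recall between groups means CVD cases are missed more often in one group than in the other, which is the kind of disparity a fairness evaluation on a 75% male / 25% female training split would look for.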
