## CVD Prediction - Heart Failure Prediction Dataset (Source: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction/data)
Model Training and Evaluation

In [1]:
#load preprocessed data 
import pandas as pd
train_df = pd.read_csv("./data_subsets/train_25M_75F.csv")

X_test = pd.read_csv("./data_splits/X_test.csv")
y_test = pd.read_csv("./data_splits/y_test.csv")

#check out the data
train_df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,33,0,3,100.0,246.0,0,0,150.0,1,1.0,1,1
1,48,0,1,120.0,284.0,0,0,120.0,0,0.0,2,0
2,49,0,3,130.0,269.0,0,0,163.0,0,0.0,2,0
3,62,0,3,140.0,268.0,0,2,160.0,0,3.6,0,1
4,38,0,3,105.0,236.0,1,0,166.0,0,2.8,2,1


In [2]:
TARGET = "HeartDisease"
SENSITIVE = "Sex"   # 1 = Male, 0 = Female

categorical_cols = ['Sex','ChestPainType','FastingBS','RestingECG','ExerciseAngina','ST_Slope']
continuous_cols  = ['Age','RestingBP','Cholesterol','MaxHR','Oldpeak']

In [3]:
# Split train into X / y and keep sensitive feature for fairness evaluation
X_train = train_df.drop(columns=[TARGET])
y_train = train_df[TARGET]

In [4]:
# scale numeric features only, fit on train, transform test
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_num_scaled = pd.DataFrame(
    scaler.fit_transform(X_train[continuous_cols]),
    columns=continuous_cols, index=X_train.index
)
X_test_num_scaled = pd.DataFrame(
    scaler.transform(X_test[continuous_cols]),
    columns=continuous_cols, index=X_test.index
)

In [None]:
# One-hot encode categoricals; numeric columns are handled separately (scaled above)
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown="ignore", drop="if_binary", sparse_output=False)
ohe.fit(X_train[categorical_cols])

X_train_cat = pd.DataFrame(
    ohe.transform(X_train[categorical_cols]),
    columns=ohe.get_feature_names_out(categorical_cols),
    index=X_train.index
)
X_test_cat = pd.DataFrame(
    ohe.transform(X_test[categorical_cols]),
    columns=ohe.get_feature_names_out(categorical_cols),
    index=X_test.index
)

In [6]:
# Assemble final matrices
X_train_ready = pd.concat([X_train_cat, X_train_num_scaled], axis=1)
X_test_ready  = pd.concat([X_test_cat,  X_test_num_scaled],  axis=1)

print("Final feature shapes:", X_train_ready.shape, X_test_ready.shape)


Final feature shapes: (600, 18) (184, 18)


### Traditional ML Models - Baseline: K-Nearest Neighbors (KNN) & Decision Tree (DT)

In [7]:
#import required libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)

#define a function for model evaluation
def evaluate_model(y_true, y_pred, model_name):
    print(f"=== {model_name} Evaluation ===")
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred, average='binary'))
    print("Recall   :", recall_score(y_true, y_pred, average='binary'))
    print("F1 Score :", f1_score(y_true, y_pred, average='binary'))
    print("\nClassification Report:\n", classification_report(y_true, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
    print("\n" + "="*40 + "\n")

In [15]:
# KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_ready, y_train)
y_pred_knn = knn.predict(X_test_ready)
y_prob_knn = knn.predict_proba(X_test_ready)[:, 1]  
evaluate_model(y_test, y_pred_knn, "KNN")

# Decision Tree
baseline_dt = DecisionTreeClassifier(random_state=42)
baseline_dt.fit(X_train_ready, y_train)
y_pred_dt = baseline_dt.predict(X_test_ready)
y_prob_dt = baseline_dt.predict_proba(X_test_ready)[:, 1]   
evaluate_model(y_test, y_pred_dt, "Decision Tree")

=== KNN Evaluation ===
Accuracy : 0.782608695652174
Precision: 0.8690476190476191
Recall   : 0.7156862745098039
F1 Score : 0.7849462365591398

Classification Report:
               precision    recall  f1-score   support

           0       0.71      0.87      0.78        82
           1       0.87      0.72      0.78       102

    accuracy                           0.78       184
   macro avg       0.79      0.79      0.78       184
weighted avg       0.80      0.78      0.78       184

Confusion Matrix:
 [[71 11]
 [29 73]]


=== Decision Tree Evaluation ===
Accuracy : 0.7934782608695652
Precision: 0.82
Recall   : 0.803921568627451
F1 Score : 0.8118811881188119

Classification Report:
               precision    recall  f1-score   support

           0       0.76      0.78      0.77        82
           1       0.82      0.80      0.81       102

    accuracy                           0.79       184
   macro avg       0.79      0.79      0.79       184
weighted avg       0.79      0.

## KNN Model Evaluation

### Overall Metrics
- **Accuracy**: 78.3%  
- **Precision**: 86.9%  
- **Recall**: 71.6%  
- **F1 Score**: 78.5%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 71           | 11           |
| **Actual: 1** | 29           | 73           |

### Interpretation
- The model achieves **high precision (87%)**, meaning its positive predictions are usually correct.  
- However, **recall is lower (72%)**, with **29 missed CVD cases (false negatives)**.  
- The confusion matrix confirms the model is more conservative: fewer false alarms but more missed cases.  
- Overall, this KNN model is **precision-focused**, which may limit its usefulness in medical screening where recall (sensitivity) is critical.  
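
The four headline metrics follow directly from the confusion-matrix counts. The short sketch below recomputes them from the baseline KNN counts reported above; the variable names are chosen here for illustration.

```python
# Confusion-matrix counts reported for the baseline KNN
# (rows = actual class, columns = predicted class)
tn, fp = 71, 11
fn, tp = 29, 73

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that are detected
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```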

---

## Decision Tree Model Evaluation

### Overall Metrics
- **Accuracy**: 79.3%  
- **Precision**: 82.0%  
- **Recall**: 80.4%  
- **F1 Score**: 81.2%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 64           | 18           |
| **Actual: 1** | 20           | 82           |

### Interpretation
- The model achieves a **balanced trade-off** between precision (82%) and recall (80%).  
- **20 CVD cases were missed** (false negatives), while **18 healthy cases were incorrectly flagged** (false positives).  
- The Decision Tree's errors are split almost evenly between the two classes: it detects most CVD cases while keeping false alarms at a moderate level.  

---

### KNN Improvement
The code improves the KNN model with a **grid search** over key hyperparameters (`n_neighbors`, `weights`, and `metric`), selecting the configuration with the best cross-validated F1 score. **Decision threshold tuning** to boost recall, which is critical in medical prediction tasks, is a natural next step but is not part of the cell below. 

In [16]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score, precision_score, f1_score
import numpy as np

# 1) Hyperparameter tuning for KNN 
param_grid = {
    "n_neighbors": list(range(1, 31)),
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan", "minkowski"],  # minkowski with p=2 is euclidean
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid=param_grid,
    cv=cv,
    scoring="f1",        
    n_jobs=-1,
    verbose=0,
    refit=True
)

# Fit 
grid.fit(X_train_ready, y_train)

print("Best KNN params:", grid.best_params_)
print("Best CV F1:", grid.best_score_)

best_knn = grid.best_estimator_

# 2) Evaluate best KNN on TEST 
y_pred_knn_best = best_knn.predict(X_test_ready)
y_prob_knn_best = best_knn.predict_proba(X_test_ready)[:, 1]  
evaluate_model(y_test, y_pred_knn_best, "KNN (best params)")

Best KNN params: {'metric': 'manhattan', 'n_neighbors': 11, 'weights': 'distance'}
Best CV F1: 0.9505663280906125
=== KNN (best params) Evaluation ===
Accuracy : 0.8043478260869565
Precision: 0.8666666666666667
Recall   : 0.7647058823529411
F1 Score : 0.8125

Classification Report:
               precision    recall  f1-score   support

           0       0.74      0.85      0.80        82
           1       0.87      0.76      0.81       102

    accuracy                           0.80       184
   macro avg       0.81      0.81      0.80       184
weighted avg       0.81      0.80      0.80       184

Confusion Matrix:
 [[70 12]
 [24 78]]




## KNN (Best Params) Model Evaluation

### Overall Metrics
- **Accuracy**: 80.4%  
- **Precision**: 86.7%  
- **Recall**: 76.5%  
- **F1 Score**: 81.3%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 70           | 12           |
| **Actual: 1** | 24           | 78           |

### Interpretation
- The model achieves **strong precision (87%)**, indicating that predicted CVD cases are usually correct.  
- **Recall is moderate (76%)**, with **24 CVD cases missed (false negatives)**.  
- For non-CVD patients, performance is solid (85% correctly identified), though some misclassifications remain (**12 false positives**).  
- Overall, this KNN configuration provides **reliable precision** but still sacrifices sensitivity, making it more conservative in detecting CVD cases.  
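
One way to recover sensitivity without retraining is to lower the probability cutoff applied to `predict_proba` output, as the threshold-tuning idea mentioned earlier suggests. The sketch below uses invented labels and probabilities, not the notebook's predictions, and the helper name `recall_precision_at` is hypothetical. In practice the cutoff would be chosen on validation data rather than on the test set.

```python
import numpy as np

# Hypothetical labels and predicted probabilities (invented for illustration)
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_prob_demo = np.array([0.9, 0.8, 0.45, 0.35, 0.4, 0.1, 0.6, 0.55, 0.2, 0.3])

def recall_precision_at(threshold):
    """Return (recall, precision) for class 1 when predicting 1 if prob >= threshold."""
    y_hat = (y_prob_demo >= threshold).astype(int)
    tp = int(((y_hat == 1) & (y_true_demo == 1)).sum())
    fp = int(((y_hat == 1) & (y_true_demo == 0)).sum())
    fn = int(((y_hat == 0) & (y_true_demo == 1)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return tp / (tp + fn), precision

# Lowering the cutoff can only add predicted positives, so recall never decreases
for t in (0.5, 0.4, 0.3):
    rec, prec = recall_precision_at(t)
    print(f"threshold={t:.1f} recall={rec:.2f} precision={prec:.2f}")
```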

---

### Further KNN Improvement - Implementing PCA 

In [17]:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# 1) PCA + KNN pipeline (on one-hot encoded + scaled features)
# Note: n_neighbors=15 is set by hand here; the grid search above selected n_neighbors=11
pca_knn = Pipeline([
    ('pca', PCA(n_components=0.95, random_state=42)),  # keep 95% variance
    ('knn', KNeighborsClassifier(
        n_neighbors=15, metric='manhattan', weights='distance'
    ))
])

pca_knn.fit(X_train_ready, y_train)

# Inspect PCA details
n_comp = pca_knn.named_steps['pca'].n_components_
expl_var = pca_knn.named_steps['pca'].explained_variance_ratio_.sum()
print(f"PCA components: {n_comp} | Explained variance retained: {expl_var:.3f}")

# 2) Evaluate 
y_pred_pca_knn = pca_knn.predict(X_test_ready)
probs_pca_knn = pca_knn.predict_proba(X_test_ready)[:, 1]
  
evaluate_model(y_test, y_pred_pca_knn, "PCA + KNN")

PCA components: 11 | Explained variance retained: 0.952
=== PCA + KNN Evaluation ===
Accuracy : 0.8315217391304348
Precision: 0.9080459770114943
Recall   : 0.7745098039215687
F1 Score : 0.8359788359788359

Classification Report:
               precision    recall  f1-score   support

           0       0.76      0.90      0.83        82
           1       0.91      0.77      0.84       102

    accuracy                           0.83       184
   macro avg       0.84      0.84      0.83       184
weighted avg       0.84      0.83      0.83       184

Confusion Matrix:
 [[74  8]
 [23 79]]
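
Passing a float such as `n_components=0.95` asks `PCA` to keep the smallest number of components whose cumulative explained-variance ratio reaches that fraction, which is how 11 components were selected above. The sketch below checks this on synthetic data; the array `X_demo` and its column scales are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 8 features with decreasing variance
X_demo = rng.normal(size=(200, 8)) * np.array([5, 4, 3, 2, 1, 0.5, 0.2, 0.1])

# Let PCA pick the number of components for 95% variance
pca_95 = PCA(n_components=0.95).fit(X_demo)

# Reproduce the selection by hand from a full decomposition
cum_ratio = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)
k_manual = int(np.searchsorted(cum_ratio, 0.95) + 1)

print("chosen by PCA:", pca_95.n_components_, "| chosen by hand:", k_manual)
print("cumulative ratios:", cum_ratio.round(3))
```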




## KNN Model Comparison (Baseline vs Tuned vs PCA)

| Model                     | Accuracy | Precision | Recall | F1 Score | Key Notes |
|----------------------------|----------|-----------|--------|----------|-----------|
| **Baseline KNN**           | 0.783    | 0.869     | 0.716  | 0.785    | Strong precision, but recall is low (29 missed CVD cases). |
| **Tuned KNN**              | 0.804    | 0.867     | 0.765  | 0.813    | Small gains in recall (+4.9 points) and F1; still misses 24 CVD cases. |
| **Tuned KNN + PCA (11 comps)** | 0.832 | 0.908     | 0.775  | 0.836    | Best overall: higher accuracy (+4.9 points vs baseline), improved precision (91%), fewer false positives (8). |

---

### Interpretation
- **Baseline KNN** provides decent precision (87%) but underperforms in recall, missing many true CVD cases.  
- **Tuned KNN** improves recall and F1 modestly, reducing false negatives from 29 → 24, but gains are limited.  
- **Tuned KNN with PCA** scores highest on all four metrics, with accuracy of 83.2% and precision of 91%, while recall improves only slightly (77.5%, 23 missed cases). False positives drop to **8**, so healthy patients are flagged less often.  

➡️ **Conclusion**: Adding PCA to the KNN pipeline gives the **best results among the three KNN variants** on every reported metric, although recall still trails precision by a wide margin.  

---
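
The PCA step and the KNN hyperparameters can also be tuned together by running the grid search over the whole pipeline. The following is a rough sketch on synthetic data, not the configuration used in this notebook; the grid values and the names `pipe`, `param_grid_demo`, and `search` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in data; X_train_ready / y_train would take its place in the notebook
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=0
)

pipe = Pipeline([
    ("pca", PCA(random_state=0)),
    ("knn", KNeighborsClassifier(metric="manhattan", weights="distance")),
])

# Step-name prefixes ("pca__", "knn__") route each parameter to its pipeline step
param_grid_demo = {
    "pca__n_components": [0.90, 0.95],
    "knn__n_neighbors": [5, 11, 15],
}

search = GridSearchCV(
    pipe,
    param_grid_demo,
    scoring="f1",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Because PCA is refit inside every fold, the cross-validated score reflects the full pipeline rather than a projection learned on all training rows.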

In [18]:
#saving best performing KNN Model for fairness evaluation
import joblib, pandas as pd, numpy as np

# Ensure y_test is a Series 
if isinstance(y_test, pd.DataFrame):
    y_test = y_test.squeeze("columns")

# Save model
model_filename = "pca_knn.pkl"
joblib.dump(pca_knn, model_filename)

# Ensure 1D arrays for y_true
y_true = y_test.to_numpy() if hasattr(y_test, "to_numpy") else np.asarray(y_test)

# Predictions from PCA+KNN
y_pred_knn = pca_knn.predict(X_test_ready)
y_prob_knn = pca_knn.predict_proba(X_test_ready)[:, 1]

# Optional gender column if present
if isinstance(X_test, pd.DataFrame) and "Sex" in X_test.columns:
    gender_vals = X_test["Sex"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results
results = pd.DataFrame({
    "Sex": gender_vals,
    "y_true": y_true,
    "y_prob": y_prob_knn,
    "y_pred": y_pred_knn
})

preds_filename = "HeartFailureData_75F25M_PCA_KNN_predictions.csv"
results.to_csv(preds_filename, index=False)


print(f"Saved PCA+KNN model → {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved PCA+KNN model → pca_knn.pkl
Saved predictions → HeartFailureData_75F25M_PCA_KNN_predictions.csv
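
The saved predictions file keeps `Sex`, `y_true`, and `y_pred` side by side so that metrics can later be broken down by group. As a hedged illustration of what such a breakdown might look like, the sketch below computes recall per `Sex` value on a small invented frame with the same column layout; the helper `group_recall` is hypothetical and the numbers are not results from this notebook.

```python
import pandas as pd

# Hypothetical rows in the same layout as the saved predictions CSV
preds = pd.DataFrame({
    "Sex":    [0, 0, 0, 0, 1, 1, 1, 1],
    "y_true": [1, 1, 0, 0, 1, 1, 1, 0],
    "y_pred": [1, 0, 0, 0, 1, 1, 0, 1],
})

def group_recall(df):
    """Recall (true positive rate) for class 1 within one group."""
    positives = df[df["y_true"] == 1]
    return float((positives["y_pred"] == 1).mean())

# One recall value per value of the sensitive attribute
recall_by_sex = {int(sex): group_recall(g) for sex, g in preds.groupby("Sex")}
print(recall_by_sex)
```

A large gap between the two values would indicate that the model misses CVD cases more often in one group than in the other.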


### Improvement - Decision Tree (DT)

In [19]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import (
    classification_report, confusion_matrix, accuracy_score,
    precision_score, recall_score, f1_score
)

# 1) Base model
dt = DecisionTreeClassifier(random_state=42)

# 2) Hyperparameter grid 
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, 9, None],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 6, 10],
}

# 3) Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 4) Grid search 
grid_dt = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=cv,
    scoring="f1",      
    n_jobs=-1,
    verbose=0
)

grid_dt.fit(X_train_ready, y_train)

print("Best Decision Tree params:", grid_dt.best_params_)
print("Best CV F1:", grid_dt.best_score_)

# 5) Train & evaluate best DT
tuned_dt = grid_dt.best_estimator_
y_pred_dt_best = tuned_dt.predict(X_test_ready)
y_prob_dt_best = tuned_dt.predict_proba(X_test_ready)[:, 1] 

evaluate_model(y_test, y_pred_dt_best, "Tuned Decision Tree (best params)")

Best Decision Tree params: {'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5}
Best CV F1: 0.9336502933840469
=== Tuned Decision Tree (best params) Evaluation ===
Accuracy : 0.7989130434782609
Precision: 0.8735632183908046
Recall   : 0.7450980392156863
F1 Score : 0.8042328042328042

Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.87      0.79        82
           1       0.87      0.75      0.80       102

    accuracy                           0.80       184
   macro avg       0.80      0.81      0.80       184
weighted avg       0.81      0.80      0.80       184

Confusion Matrix:
 [[71 11]
 [26 76]]




### Tuned Decision Tree (best params) Evaluation

---

### Performance Metrics (Test Set)

- **Accuracy**: 79.9%  
- **Precision**: 87.4%  
- **Recall**: **74.5%**  
- **F1 Score**: 80.4%  

#### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 71           | 11           |
| **Actual: 1** | 26           | 76           |

- **False Negatives (missed CVD cases)**: **26**  
- **True Positives (correct CVD detections)**: **76**  
- **Recall (class 1 / CVD)** = 76 / (76 + 26) = **74.5%**


---

In [20]:
# Alternative DT tuning: simpler trees + class balancing + cost-complexity pruning
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import numpy as np

# bias toward simpler trees with class_weight="balanced" 
base_dt = DecisionTreeClassifier(random_state=42, class_weight="balanced")

param_grid_simple = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 4, 5, 6, 7],
    "min_samples_split": [5, 10, 20],
    "min_samples_leaf": [2, 4, 6],
    "min_impurity_decrease": [0.0, 1e-4, 1e-3],  # tiny regularization
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid_simple = GridSearchCV(
    estimator=base_dt,
    param_grid=param_grid_simple,
    cv=cv,
    scoring="recall",    # optimize recall (sensitivity) for class 1
    n_jobs=-1,
    verbose=0,
    refit=True
)
grid_simple.fit(X_train_ready, y_train)

# The grid search is scored on recall, so best_score_ is the mean CV recall
print("Stage A — Best simple DT params:", grid_simple.best_params_)
print("Stage A — Best CV Recall:", grid_simple.best_score_)
simple_dt = grid_simple.best_estimator_

# cost-complexity pruning
path = simple_dt.cost_complexity_pruning_path(X_train_ready, y_train)
ccp_alphas = path.ccp_alphas

unique_alphas = np.unique(np.round(ccp_alphas, 6))
candidate_alphas = np.linspace(unique_alphas.min(), unique_alphas.max(), num=min(20, len(unique_alphas)))
candidate_alphas = np.unique(np.concatenate([candidate_alphas, [0.0]]))  

cv_scores = []
for alpha in candidate_alphas:
    dt_alpha = DecisionTreeClassifier(
        random_state=42,
        class_weight="balanced",
        criterion=simple_dt.criterion,
        max_depth=simple_dt.max_depth,
        min_samples_split=simple_dt.min_samples_split,
        min_samples_leaf=simple_dt.min_samples_leaf,
        min_impurity_decrease=simple_dt.min_impurity_decrease,
        ccp_alpha=alpha
    )
    recall_cv = cross_val_score(dt_alpha, X_train_ready, y_train, cv=cv, scoring="recall", n_jobs=-1).mean()
    cv_scores.append((alpha, recall_cv))

best_alpha, best_cv_recall = sorted(cv_scores, key=lambda x: x[1], reverse=True)[0]
print(f"Stage B — Best ccp_alpha: {best_alpha:.6f} | CV Recall: {best_cv_recall:.4f}")

best_dt = DecisionTreeClassifier(
    random_state=42,
    class_weight="balanced",
    criterion=simple_dt.criterion,
    max_depth=simple_dt.max_depth,
    min_samples_split=simple_dt.min_samples_split,
    min_samples_leaf=simple_dt.min_samples_leaf,
    min_impurity_decrease=simple_dt.min_impurity_decrease,
    ccp_alpha=best_alpha
).fit(X_train_ready, y_train)

# Evaluate on test set 
y_pred_dt = best_dt.predict(X_test_ready)
y_prob_dt = best_dt.predict_proba(X_test_ready)[:, 1]  

evaluate_model(y_test, y_pred_dt, "Alternative Tuned & Pruned Decision Tree")

Stage A — Best simple DT params: {'criterion': 'entropy', 'max_depth': 7, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 2, 'min_samples_split': 5}
Stage A — Best CV Recall: 0.9433333333333334
Stage B — Best ccp_alpha: 0.000000 | CV Recall: 0.9433
=== Alternative Tuned & Pruned Decision Tree Evaluation ===
Accuracy : 0.7663043478260869
Precision: 0.8641975308641975
Recall   : 0.6862745098039216
F1 Score : 0.7650273224043715

Classification Report:
               precision    recall  f1-score   support

           0       0.69      0.87      0.77        82
           1       0.86      0.69      0.77       102

    accuracy                           0.77       184
   macro avg       0.78      0.78      0.77       184
weighted avg       0.79      0.77      0.77       184

Confusion Matrix:
 [[71 11]
 [32 70]]




### Alternative Tuned & Pruned Decision Tree Evaluation

---

### Performance Metrics (Test Set)

- **Accuracy**: 76.6%  
- **Precision**: 86.4%  
- **Recall**: **68.6%**  
- **F1 Score**: 76.5%  

#### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 71           | 11           |
| **Actual: 1** | 32           | 70           |

- **False Negatives (missed CVD cases)**: **32**  
- **True Positives (correct CVD detections)**: **70**  
- **Recall (class 1 / CVD)** = 70 / (70 + 32) = **68.6%**

---

### Interpretation

- The tree is **precision-leaning** for class 1 (CVD): relatively few false alarms (**11 FP**), but **32 FN** lowers sensitivity (**recall 68.6%**).
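
The `class_weight="balanced"` option used for this tree weights each class by `n_samples / (n_classes * class_count)`, so the rarer class receives the larger weight. The sketch below checks that formula against scikit-learn's helper on an invented label vector; the counts are hypothetical and unrelated to the training split.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label vector: 6 negatives, 2 positives
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1])

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)

# Same quantity by hand: n_samples / (n_classes * count_per_class)
manual = len(y_demo) / (2 * np.bincount(y_demo))

print(dict(zip([0, 1], weights.round(3))))
```

When the two classes are close to evenly represented, these weights are both near 1 and the option has little effect, which is one reason explicit weights such as `{0: 1, 1: 4}` can shift recall more strongly.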


---

In [21]:
# Alternative DT tuning focused on higher recall
# Changes vs previous:
#  - No probability calibration (predict uses raw tree probabilities at the 0.5 threshold)
#  - Tune class_weight (heavier positive weights allowed)
#  - Broaden depth a bit but keep regularization via min_samples_* and tiny impurity decrease
#  - Prune only with very small ccp_alphas to avoid killing recall

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, cross_val_score
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import numpy as np

# Simpler-but-expressive trees + tuned class weights
base_dt = DecisionTreeClassifier(random_state=42)

param_grid_simple = {
    "criterion": ["gini", "entropy"],                  # add "log_loss" if your sklearn supports it
    "max_depth": [4, 5, 6, 7, 8, 9, 10],               # a bit deeper to help recall
    "min_samples_split": [5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 6],
    "min_impurity_decrease": [0.0, 1e-4, 1e-3],
    "class_weight": ["balanced", {0:1,1:2}, {0:1,1:3}, {0:1,1:4}],  # stronger push toward positives
}

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

grid_simple = GridSearchCV(
    estimator=base_dt,
    param_grid=param_grid_simple,
    cv=cv,
    scoring="recall",      # prioritize sensitivity for class 1
    n_jobs=-1,
    verbose=0,
    refit=True
)
grid_simple.fit(X_train_ready, y_train)

best_params = grid_simple.best_params_
print("Stage A — Best DT params:", best_params)
print("Stage A — Best CV Recall:", round(grid_simple.best_score_, 4))

# Train a zero-pruned model with best params to get the pruning path
dt0 = DecisionTreeClassifier(random_state=42, **best_params, ccp_alpha=0.0).fit(X_train_ready, y_train)


# Stage B — Gentle cost-complexity pruning (favor small alphas)
path = dt0.cost_complexity_pruning_path(X_train_ready, y_train)
ccp_alphas = path.ccp_alphas

# Focus on tiny alphas only + 0.0 to avoid big recall loss
# ccp_alphas is sorted ascending, so the first 30 values are the smallest;
# clip at 0 because floating-point error can yield tiny negative alphas
small_slice = np.clip(ccp_alphas[:30], 0.0, None)
candidate_alphas = np.unique(np.r_[0.0, small_slice])

cv_scores = []
for alpha in candidate_alphas:
    dt_alpha = DecisionTreeClassifier(random_state=42, **best_params, ccp_alpha=alpha)
    rec = cross_val_score(dt_alpha, X_train_ready, y_train, cv=cv, scoring="recall", n_jobs=-1).mean()
    cv_scores.append((alpha, rec))

best_alpha, best_cv_recall = max(cv_scores, key=lambda x: x[1])
print(f"Stage B — Best ccp_alpha: {best_alpha:.6f} | CV Recall: {best_cv_recall:.4f}")

alt_best_dt = DecisionTreeClassifier(random_state=42, **best_params, ccp_alpha=best_alpha).fit(X_train_ready, y_train)


# Evaluation
y_pred = alt_best_dt.predict(X_test_ready)               
y_prob = alt_best_dt.predict_proba(X_test_ready)[:, 1]   

evaluate_model(y_test, y_pred, "Alternative Tuned & Pruned Decision Tree")

Stage A — Best DT params: {'class_weight': {0: 1, 1: 4}, 'criterion': 'entropy', 'max_depth': 4, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 4, 'min_samples_split': 5}
Stage A — Best CV Recall: 0.9733
Stage B — Best ccp_alpha: 0.000000 | CV Recall: 0.9733
=== Alternative Tuned & Pruned Decision Tree Evaluation ===
Accuracy : 0.8260869565217391
Precision: 0.7966101694915254
Recall   : 0.9215686274509803
F1 Score : 0.8545454545454545

Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.71      0.78        82
           1       0.80      0.92      0.85       102

    accuracy                           0.83       184
   macro avg       0.84      0.81      0.82       184
weighted avg       0.83      0.83      0.82       184

Confusion Matrix:
 [[58 24]
 [ 8 94]]




## Decision Tree Model Comparison

| Model                                | Accuracy | Precision | Recall | F1 Score | Key Notes |
|--------------------------------------|----------|-----------|--------|----------|-----------|
| **Baseline DT**                      | 0.793    | 0.820     | 0.804  | 0.812    | Balanced overall; 20 false negatives and 18 false positives. |
| **Tuned DT (best params)**           | 0.799    | 0.874     | 0.745  | 0.804    | Precision improves, but recall drops (26 false negatives). |
| **Alt. Tuned & Pruned DT (balanced weights)** | 0.766 | 0.864     | 0.686  | 0.765    | Grid search scored on recall with `class_weight="balanced"`; strong precision, but recall is lowest (32 missed positives). |
| **Alt. Tuned & Pruned DT (Recall-focused)** | 0.826 | 0.797 | 0.922  | 0.855    | Highest recall; only 8 false negatives, but more false positives (24). |

---

### Interpretation
- **Baseline DT** offers a **well-balanced trade-off**, with solid precision and recall (~80% each).  
- **Tuned DT** prioritizes **precision (87%)**, but recall suffers (74.5%), leading to more missed CVD cases.  
- **Alt. Tuned & Pruned DT (balanced weights)** maintains high precision (86%) but shows the **weakest recall (68.6%)**, missing 32 true CVD patients, even though its grid search optimized cross-validated recall. This is problematic in medical contexts.  
- **Alt. Tuned & Pruned DT (Recall-focused)** achieves the **best recall (92%)** and also the **highest accuracy among DT variants (82.6%)**, minimizing missed CVD cases (**only 8 false negatives**). The trade-off is more false positives (24, versus 11 to 18 for the other variants).  

---

### Conclusion
The **Alt. Tuned & Pruned DT (Recall-focused)** model was selected as the preferred Decision Tree variant since it provides the **highest accuracy and recall**. This combination ensures strong overall performance while minimizing the number of missed CVD cases, which is especially important in a medical context.  

---


In [22]:
#saving best performing DT Model for fairness evaluation

# Ensure y_test is a Series 
if isinstance(y_test, pd.DataFrame):
    y_test = y_test.squeeze("columns")

# Save model
model_filename = "alt_tuned_pruned_DT.pkl"
joblib.dump(alt_best_dt, model_filename)

# Ensure 1D arrays for y_true
y_true = y_test.to_numpy() if hasattr(y_test, "to_numpy") else np.asarray(y_test)

# Predictions from Decision Tree 
y_pred_dt = alt_best_dt.predict(X_test_ready)
y_prob_dt = alt_best_dt.predict_proba(X_test_ready)[:, 1]

# Optional gender column if present
if isinstance(X_test, pd.DataFrame) and "Sex" in X_test.columns:
    gender_vals = X_test["Sex"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results
results = pd.DataFrame({
    "Sex": gender_vals,
    "y_true": y_true,
    "y_prob": y_prob_dt,
    "y_pred": y_pred_dt
})

preds_filename = "HeartFailureData_75F25M_AltTunedDT_predictions.csv"
results.to_csv(preds_filename, index=False)

print(f"Saved Alternative DT tuning → {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved Alternative DT tuning → alt_tuned_pruned_DT.pkl
Saved predictions → HeartFailureData_75F25M_AltTunedDT_predictions.csv


### Ensemble Model - Random Forest (RF)

In [23]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest
rf = RandomForestClassifier(random_state=42)

# Train the model
rf.fit(X_train_ready, y_train)

# Predict on test set
y_pred_rf = rf.predict(X_test_ready)
y_prob_rf = rf.predict_proba(X_test_ready)[:, 1]  
evaluate_model(y_test, y_pred_rf, "Random Forest")

=== Random Forest Evaluation ===
Accuracy : 0.842391304347826
Precision: 0.9101123595505618
Recall   : 0.7941176470588235
F1 Score : 0.8481675392670157

Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.90      0.84        82
           1       0.91      0.79      0.85       102

    accuracy                           0.84       184
   macro avg       0.84      0.85      0.84       184
weighted avg       0.85      0.84      0.84       184

Confusion Matrix:
 [[74  8]
 [21 81]]




### Random Forest Evaluation

---

### Performance Metrics (Test Set)

- **Accuracy**: 84.2%  
- **Precision**: 91.0%  
- **Recall**: **79.4%**  
- **F1 Score**: 84.8%  

#### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 74           | 8            |
| **Actual: 1** | 21           | 81           |

- **False Negatives (missed CVD cases)**: **21**  
- **True Positives (correct CVD detections)**: **81**  
- **Recall (class 1 / CVD)** = 81 / (81 + 21) = **79.4%**

---

### Interpretation

- The model is **precision-oriented** for class 1: most positive flags are correct (**91.0%** precision), with only **8 FP**.
- **Recall at 79.4%** means **21 positive cases were missed**, which is a concern if the cost of missing CVD is high.
- Overall performance is strong (**F1 84.8%**, **Accuracy 84.2%**), but the **precision–recall trade-off** is tilted toward fewer false alarms.

---

### Improvement Random Forest (RF)

In [24]:
# Random Forest: hyperparameter tuning

# 1) GridSearchCV over impactful RF params
rf = RandomForestClassifier(random_state=42)

param_grid = {
    "n_estimators": [200, 400, 600],
    "max_depth": [None, 8, 12, 16],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", 0.8],  # 0.8 = 80% of features
    "class_weight": [None, "balanced"]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=cv,
    scoring="recall",          
    n_jobs=-1,
    verbose=1,
    refit=True
)

grid.fit(X_train_ready, y_train)
best_rf = grid.best_estimator_
print("Best RF params:", grid.best_params_)
print("Best CV Recall:", grid.best_score_)  # scoring="recall", so this is mean CV recall

# 2) Evaluate best RF 
y_pred = best_rf.predict(X_test_ready)
y_prob = best_rf.predict_proba(X_test_ready)[:, 1]
evaluate_model(y_test, y_pred, "Tuned Random Forest")

Fitting 5 folds for each of 648 candidates, totalling 3240 fits
Best RF params: {'class_weight': None, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 400}
Best CV Recall: 0.9566666666666667
=== Tuned Random Forest Evaluation ===
Accuracy : 0.842391304347826
Precision: 0.9010989010989011
Recall   : 0.803921568627451
F1 Score : 0.8497409326424871

Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.89      0.83        82
           1       0.90      0.80      0.85       102

    accuracy                           0.84       184
   macro avg       0.84      0.85      0.84       184
weighted avg       0.85      0.84      0.84       184

Confusion Matrix:
 [[73  9]
 [20 82]]




## Random Forest Model Comparison

| Model                | Accuracy | Precision | Recall | F1 Score | Key Notes |
|-----------------------|----------|-----------|--------|----------|-----------|
| **Baseline RF**       | 0.842    | 0.910     | 0.794  | 0.848    | Strong precision (91%), fewer false positives (8), but 21 CVD cases missed. |
| **Tuned RF**          | 0.842    | 0.901     | 0.804  | 0.850    | Slightly higher recall (+1 point): 1 more CVD case detected, at the cost of 1 extra false positive. |

---

### Interpretation
- **Baseline RF** is more **precision-oriented**, reducing false alarms but missing slightly more CVD cases.  
- **Tuned RF** provides a **better recall-precision balance**, sacrificing a little precision to catch more positives.  
- Performance differences are **minimal**, but the **tuned RF** may be favored in a medical context, where **recall (sensitivity)** is often prioritized.  

---

In [25]:
#saving best performing RF Model for fairness evaluation

# Ensure y_test is a Series 
if isinstance(y_test, pd.DataFrame):
    y_test = y_test.squeeze("columns")

# Save model
model_filename = "tuned_rf.pkl"
joblib.dump(best_rf, model_filename)

# Ensure 1D arrays for y_true
y_true = y_test.to_numpy() if hasattr(y_test, "to_numpy") else np.asarray(y_test)

# Predictions from the tuned Random Forest
y_pred_rf = best_rf.predict(X_test_ready)
y_prob_rf = best_rf.predict_proba(X_test_ready)[:, 1]

# Optional gender column if present
if isinstance(X_test, pd.DataFrame) and "Sex" in X_test.columns:
    gender_vals = X_test["Sex"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results
results = pd.DataFrame({
    "Sex": gender_vals,
    "y_true": y_true,
    "y_prob": y_prob_rf,
    "y_pred": y_pred_rf
})

preds_filename = "HeartFailureData_75F25M_RF_predictions.csv"
results.to_csv(preds_filename, index=False)

print(f"Saved Tuned RF → {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved Tuned RF → tuned_rf.pkl
Saved predictions → HeartFailureData_75F25M_RF_predictions.csv
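
The saved predictions file carries the `Sex`, `y_true`, `y_prob` and `y_pred` columns, which is enough to compute group-wise metrics later. As a minimal sketch of that downstream step, the example below uses a small hypothetical table with the same columns (the values are made up for illustration and are not model output) and reports recall and precision per group.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Hypothetical rows with the same columns as the saved predictions CSV.
preds = pd.DataFrame({
    "Sex":    [1, 1, 1, 1, 0, 0, 0, 0],   # 1 = Male, 0 = Female
    "y_true": [1, 1, 0, 1, 1, 1, 0, 0],
    "y_prob": [0.9, 0.4, 0.2, 0.8, 0.7, 0.3, 0.6, 0.1],
    "y_pred": [1, 0, 0, 1, 1, 0, 1, 0],
})

rows = []
for sex, group in preds.groupby("Sex"):
    rows.append({
        "Sex": sex,
        "n": len(group),
        "recall": recall_score(group["y_true"], group["y_pred"]),
        "precision": precision_score(group["y_true"], group["y_pred"], zero_division=0),
    })

by_sex = pd.DataFrame(rows).set_index("Sex")
print(by_sex)
```

With the real file, `preds` would instead be loaded with `pd.read_csv(preds_filename)`.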


### Deep Learning - Multi-layer Perceptron

In [26]:
#import required library 
from sklearn.neural_network import MLPClassifier

In [27]:
# Initialize MLP model
mlp = MLPClassifier(
    hidden_layer_sizes=(100,),   # one hidden layer with 100 neurons
    activation='relu',           # or 'tanh'
    solver='adam',               # optimizer
    max_iter=1000,                # increase if convergence warning appears
    random_state=42
)

# Train the model
mlp.fit(X_train_ready, y_train)

# Predict
y_pred_mlp = mlp.predict(X_test_ready)
y_prob = mlp.predict_proba(X_test_ready)[:, 1]

evaluate_model(y_test, y_pred_mlp, "Multilayer Perceptron (MLP)")

=== Multilayer Perceptron (MLP) Evaluation ===
Accuracy : 0.7934782608695652
Precision: 0.8478260869565217
Recall   : 0.7647058823529411
F1 Score : 0.8041237113402062

Classification Report:
               precision    recall  f1-score   support

           0       0.74      0.83      0.78        82
           1       0.85      0.76      0.80       102

    accuracy                           0.79       184
   macro avg       0.79      0.80      0.79       184
weighted avg       0.80      0.79      0.79       184

Confusion Matrix:
 [[68 14]
 [24 78]]




## Multilayer Perceptron (MLP) Model Evaluation

### Overall Metrics
- **Accuracy**: 79.3%  
- **Precision**: 84.8%  
- **Recall**: 76.5%  
- **F1 Score**: 80.4%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 68           | 14           |
| **Actual: 1** | 24           | 78           |

### Interpretation
- The model achieves **good precision (85%)**, meaning most positive (CVD) predictions are correct.  
- **Recall is lower (76%)**, with **24 CVD cases missed** (false negatives).  
- Non-CVD cases are recognized with solid accuracy (83% recall), though **14 healthy cases** were incorrectly classified as CVD (false positives).  
- Overall, the MLP shows **balanced but modest performance**, leaning slightly towards precision while sacrificing sensitivity.  

➡️ This suggests the model is reliable for confirming positive cases but may under-detect some true CVD patients, which is a limitation in medical screening tasks.  
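
One way to see how the precision/recall balance depends on the decision rule is to threshold the predicted probabilities at different cut-offs. The sketch below is an illustration only, not a step taken in this notebook: it uses synthetic data from `make_classification` as a stand-in for the heart-failure features, and `scores_at` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption): not the heart-failure dataset.
X, y = make_classification(n_samples=600, n_features=11, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42)
clf.fit(X_tr, y_tr)
prob_pos = clf.predict_proba(X_te)[:, 1]

def scores_at(threshold):
    """Precision and recall when predicting positive for prob >= threshold."""
    pred = (prob_pos >= threshold).astype(int)
    return (
        precision_score(y_te, pred, zero_division=0),
        recall_score(y_te, pred),
    )

for t in (0.5, 0.3):
    p, r = scores_at(t)
    print(f"threshold={t:.1f}  precision={p:.3f}  recall={r:.3f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases, usually at some cost in precision. Any threshold chosen this way should be selected on validation data rather than on the test set.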

---

### Improvements - MLP

In [28]:
#Adam + Early Stopping 
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix

adammlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),   # slightly smaller/deeper can help
    activation='relu',
    solver='adam',
    learning_rate_init=1e-3,       # 1e-3 is the scikit-learn default step size
    alpha=1e-3,                    # L2 regularization to reduce overfitting
    batch_size=32,
    max_iter=1000,                 # increased max_iter
    early_stopping=True,           # use a validation split internally
    validation_fraction=0.15,
    n_iter_no_change=25,          
    tol=1e-4,
    random_state=42
)

adammlp.fit(X_train_ready, y_train)  
y_pred_mlp = adammlp.predict(X_test_ready)                     
y_prob_mlp = adammlp.predict_proba(X_test_ready)[:, 1]         

evaluate_model(y_test, y_pred_mlp, "(Adam + EarlyStopping)")

=== (Adam + EarlyStopping) Evaluation ===
Accuracy : 0.8206521739130435
Precision: 0.8709677419354839
Recall   : 0.7941176470588235
F1 Score : 0.8307692307692308

Classification Report:
               precision    recall  f1-score   support

           0       0.77      0.85      0.81        82
           1       0.87      0.79      0.83       102

    accuracy                           0.82       184
   macro avg       0.82      0.82      0.82       184
weighted avg       0.83      0.82      0.82       184

Confusion Matrix:
 [[70 12]
 [21 81]]




## Multilayer Perceptron (MLP – Adam + EarlyStopping) Evaluation

### Overall Metrics
- **Accuracy**: 82.1%  
- **Precision**: 87.1%  
- **Recall**: 79.4%  
- **F1 Score**: 83.1%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 70           | 12           |
| **Actual: 1** | 21           | 81           |

### Interpretation
- This model delivers **improved overall performance** compared to the baseline MLP.  
- **Precision is high (87%)**, indicating strong reliability in positive (CVD) predictions.  
- **Recall (79%)** is slightly better, reducing missed CVD cases to **21 false negatives**.  
- Non-CVD detection is also solid, with **85% recall**, though **12 healthy patients** were flagged incorrectly (false positives).  
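- The baseline MLP already used Adam, so the changes here are **early stopping**, stronger L2 regularization (`alpha=1e-3`), mini-batches of 32, and a (64, 32) architecture. Together they likely reduce overfitting and yield a more balanced trade-off between precision and recall; their individual contributions were not measured separately.  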

➡️ Overall, this is a **stronger and more stable MLP variant**, offering a good balance of sensitivity and precision for CVD detection.  
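
Whether early stopping actually triggered can be checked from the fitted estimator's attributes. The following is a minimal sketch on synthetic stand-in data (an assumption, not the notebook's training set) with settings similar to the cell above; `n_iter_`, `validation_scores_` and `best_validation_score_` are attributes that `MLPClassifier` exposes when `early_stopping=True`.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption): not the heart-failure dataset.
X, y = make_classification(n_samples=600, n_features=11, random_state=42)

es_mlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    alpha=1e-3,
    batch_size=32,
    max_iter=1000,
    early_stopping=True,       # hold out part of the training data internally
    validation_fraction=0.15,
    n_iter_no_change=25,
    random_state=42,
)
es_mlp.fit(X, y)

# Epochs actually run; a value well below max_iter means early stopping fired.
print("epochs run:", es_mlp.n_iter_)
# Accuracy on the internal validation split, tracked once per epoch.
print("best validation accuracy:", round(es_mlp.best_validation_score_, 3))
print("last validation accuracy:", round(es_mlp.validation_scores_[-1], 3))
```

If `n_iter_` equals `max_iter`, training hit the iteration cap instead of stopping on the validation score.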

---

In [29]:
# LBFGS solver - converges fast & well on small datasets
# LBFGS ignores batch_size, early_stopping, learning_rate. It optimizes the full-batch loss.
mlp_lbfgs = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    activation='tanh',         # tanh + lbfgs often works nicely on tabular data
    solver='lbfgs',            # quasi-Newton optimizer
    alpha=1e-3,
    max_iter=1000,
    random_state=42
)

mlp_lbfgs.fit(X_train_ready, y_train)
y_pred_lbfgs = mlp_lbfgs.predict(X_test_ready)
y_prob_lbfgs = mlp_lbfgs.predict_proba(X_test_ready)[:, 1]  
evaluate_model(y_test, y_pred_lbfgs, "MLP (LBFGS) ")

=== MLP (LBFGS)  Evaluation ===
Accuracy : 0.7608695652173914
Precision: 0.8085106382978723
Recall   : 0.7450980392156863
F1 Score : 0.7755102040816326

Classification Report:
               precision    recall  f1-score   support

           0       0.71      0.78      0.74        82
           1       0.81      0.75      0.78       102

    accuracy                           0.76       184
   macro avg       0.76      0.76      0.76       184
weighted avg       0.77      0.76      0.76       184

Confusion Matrix:
 [[64 18]
 [26 76]]




## Multilayer Perceptron (MLP – LBFGS) Evaluation

### Overall Metrics
- **Accuracy**: 76.1%  
- **Precision**: 80.9%  
- **Recall**: 74.5%  
- **F1 Score**: 77.6%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 64           | 18           |
| **Actual: 1** | 26           | 76           |

### Interpretation
- The model achieves **moderate precision (81%)** but has a lower **recall (74.5%)**, resulting in **26 missed CVD cases** (false negatives).  
- Performance on non-CVD patients is also weaker: **18 healthy patients** were misclassified as CVD (false positives).  
- Both accuracy (76%) and F1 score (77.6%) are **lower compared to other MLP variants**, indicating reduced reliability.  
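- The **LBFGS optimizer** appears less effective for this dataset, as it fails to reach the stronger balance between sensitivity and precision achieved by the Adam-based models. Because this variant also switches the activation from ReLU to tanh, the gap cannot be attributed to the optimizer alone.  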

➡️ Overall, this variant shows **the weakest performance among the MLP models**, making it less suitable for CVD detection.  
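
The code comment above notes that LBFGS ignores `early_stopping`. A small sketch can make this visible, assuming synthetic stand-in data and a shared hypothetical `common` settings dict: both solvers are fitted with `early_stopping=True`, but only the Adam model ends up tracking validation scores.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption): not the heart-failure dataset.
X, y = make_classification(n_samples=400, n_features=11, random_state=42)

# Shared settings; early_stopping=True is requested for both solvers.
common = dict(
    hidden_layer_sizes=(64, 32),
    alpha=1e-3,
    max_iter=300,
    early_stopping=True,
    random_state=42,
)

with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    adam_model = MLPClassifier(solver="adam", **common).fit(X, y)
    lbfgs_model = MLPClassifier(solver="lbfgs", **common).fit(X, y)

# Only the stochastic solver records per-epoch validation scores.
adam_scores = getattr(adam_model, "validation_scores_", None)
lbfgs_scores = getattr(lbfgs_model, "validation_scores_", None)
print("adam tracked validation scores:", adam_scores is not None)
print("lbfgs tracked validation scores:", lbfgs_scores is not None)
```

In practice this means the LBFGS variant relies on `alpha` alone for regularization, which may be one reason it generalizes less well here.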

---

In [30]:
#  Improved MLP pipeline: recall-first tuning  
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import (
    f1_score, recall_score, fbeta_score, make_scorer
)


# 1) Recall-first search (Adam + early_stopping)
base_mlp = MLPClassifier(
    solver="adam",
    early_stopping=True,          # uses internal 15% validation
    validation_fraction=0.15,
    n_iter_no_change=20,
    max_iter=2000,                # allow convergence
    random_state=42
)

param_dist = {
    "hidden_layer_sizes": [(64,), (128,), (64, 32), (128, 64)],
    "activation": ["relu", "tanh"],
    "alpha": [1e-5, 1e-4, 3e-4, 1e-3],
    "learning_rate_init": [1e-3, 5e-4, 3e-4, 1e-4],
    "batch_size": [16, 32, 64],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# multi-metric scoring; refit on recall-oriented F-beta
scoring = {
    "f1": make_scorer(f1_score),
    "recall": make_scorer(recall_score),
    "fbeta2": make_scorer(fbeta_score, beta=2)  # emphasize recall
}

rs = RandomizedSearchCV(
    estimator=base_mlp,
    param_distributions=param_dist,
    n_iter=30,
    scoring=scoring,
    refit="fbeta2",
    cv=cv,
    n_jobs=-1,
    verbose=1,
    random_state=42
)

rs.fit(X_train_ready, y_train)
recallfirst_best_mlp = rs.best_estimator_
print("Best MLP params:", rs.best_params_)

# 2) Evaluation
y_prob = recallfirst_best_mlp.predict_proba(X_test_ready)[:, 1]
y_pred_best_mlp = recallfirst_best_mlp.predict(X_test_ready)
evaluate_model(y_test, y_pred_best_mlp, model_name="Best MLP")

Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best MLP params: {'learning_rate_init': 0.001, 'hidden_layer_sizes': (64, 32), 'batch_size': 16, 'alpha': 0.001, 'activation': 'relu'}
=== Best MLP Evaluation ===
Accuracy : 0.8206521739130435
Precision: 0.8709677419354839
Recall   : 0.7941176470588235
F1 Score : 0.8307692307692308

Classification Report:
               precision    recall  f1-score   support

           0       0.77      0.85      0.81        82
           1       0.87      0.79      0.83       102

    accuracy                           0.82       184
   macro avg       0.82      0.82      0.82       184
weighted avg       0.83      0.82      0.82       184

Confusion Matrix:
 [[70 12]
 [21 81]]




## Multilayer Perceptron (MLP – Best Tuned) Evaluation

### Best Parameters
- **Hidden layers**: (64, 32)  
- **Activation**: ReLU  
- **Learning rate**: 0.001  
- **Batch size**: 16  
- **Alpha (regularization)**: 0.001  
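
With multi-metric scoring, the search object keeps the cross-validated mean of every scorer, not only the one used for refitting. The sketch below shows one way to rank candidates by F2 while viewing recall and F1 side by side. It is a scaled-down illustration on synthetic stand-in data with a smaller hypothetical parameter grid, so it runs quickly and does not reproduce the search above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, fbeta_score, make_scorer, recall_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption): not the heart-failure dataset.
X, y = make_classification(n_samples=300, n_features=11, random_state=42)

scoring = {
    "f1": make_scorer(f1_score),
    "recall": make_scorer(recall_score),
    "fbeta2": make_scorer(fbeta_score, beta=2),
}

search = RandomizedSearchCV(
    MLPClassifier(early_stopping=True, max_iter=300, random_state=42),
    param_distributions={
        "hidden_layer_sizes": [(32,), (64, 32)],
        "alpha": [1e-4, 1e-3],
        "batch_size": [16, 32],
    },
    n_iter=4,
    scoring=scoring,
    refit="fbeta2",   # the best candidate is chosen by mean F2
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X, y)

# cv_results_ holds one mean_test_<name> and rank_test_<name> per scorer.
cols = ["params", "mean_test_fbeta2", "mean_test_recall", "mean_test_f1",
        "rank_test_fbeta2"]
ranking = pd.DataFrame(search.cv_results_)[cols].sort_values("rank_test_fbeta2")
print(ranking.to_string(index=False))
```

Applied to the notebook's `rs` object, the same table would show how close the runner-up configurations were to the selected one.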

### Overall Metrics
- **Accuracy**: 82.1%  
- **Precision**: 87.1%  
- **Recall**: 79.4%  
- **F1 Score**: 83.1%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 70           | 12           |
| **Actual: 1** | 21           | 81           |

### Interpretation
- The tuned MLP achieves **strong and balanced performance**, with **accuracy of 82.1%** and an **F1 score of 83.1%**.  
- **Precision (87%)** indicates reliable positive predictions, while **recall (79%)** shows that most CVD cases are correctly detected, with **21 missed cases** (false negatives).  
- Non-CVD cases are also well recognized, with **85% recall** and only **12 false positives**.  
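- Compared to the baseline and LBFGS variants, this tuned MLP improves every reported metric. Its test-set results are identical to those of the manually configured Adam + EarlyStopping model, so the recall-oriented search (refit on F2) did not raise test recall beyond 79.4%.  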
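
The search refits on the F2 score, which weights recall more heavily than precision. In terms of counts, $F_\beta = \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + \beta^2\,FN + FP}$, so with $\beta = 2$ each false negative counts four times as much as a false positive. As a worked example, the sketch below rebuilds label vectors from the test confusion matrix above and compares scikit-learn's `fbeta_score` with the hand calculation.

```python
import numpy as np
from sklearn.metrics import f1_score, fbeta_score

# Counts from the test confusion matrix above: [[70, 12], [21, 81]].
tn, fp, fn, tp = 70, 12, 21, 81

# Rebuild label vectors that produce exactly these counts.
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

f1 = f1_score(y_true, y_pred)
f2 = fbeta_score(y_true, y_pred, beta=2)

# Hand calculation with beta = 2: (1 + 4) * TP / ((1 + 4) * TP + 4 * FN + FP).
f2_by_hand = 5 * tp / (5 * tp + 4 * fn + fp)

print(f"F1 = {f1:.4f}")
print(f"F2 = {f2:.4f} (by hand: {f2_by_hand:.4f})")
```

Because recall (79.4%) is lower than precision (87.1%) for this model, its F2 comes out below its F1.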

---

## Multilayer Perceptron (MLP) Model Comparison

| Model                      | Accuracy | Precision | Recall | F1 Score | Key Notes |
|-----------------------------|----------|-----------|--------|----------|-----------|
| **Baseline MLP**            | 0.793    | 0.848     | 0.765  | 0.804    | Decent baseline; 24 false negatives, 14 false positives. |
| **MLP (Adam + EarlyStopping)** | 0.821 | 0.871     | 0.794  | 0.831    | Improved stability; fewer false negatives (21) and false positives (12). |
| **MLP (LBFGS)**             | 0.761    | 0.809     | 0.745  | 0.776    | Weakest variant; highest error rates (26 FN, 18 FP). |
| **Best Tuned MLP**          | 0.821    | 0.871     | 0.794  | 0.831    | Identical test metrics to the Adam + EarlyStopping variant; the search selected nearly the same configuration. |

---

### Interpretation
- **Baseline MLP** delivers moderate results, leaning toward precision but missing a notable number of CVD cases.  
- **MLP with Adam + EarlyStopping** ties for **the best performance**, with higher accuracy and F1 score than the baseline and fewer false positives and false negatives.  
- **MLP (LBFGS)** underperforms, showing weaker precision and recall, and thus is less suitable.  
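- **Best Tuned MLP** replicates the performance of the EarlyStopping model. The search selected the same architecture, activation, learning rate and alpha, differing only in batch size (16 vs 32) and early-stopping patience (20 vs 25), so it confirms the manual configuration rather than improving on it.  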

➡️ **Conclusion**: The **Adam + EarlyStopping** and **Best Tuned MLP** variants are the most effective, offering the strongest balance of accuracy and recall, while the LBFGS version is the weakest.  

---

In [31]:
#saving best performing MLP Model for fairness evaluation

# Ensure y_test is a Series 
if isinstance(y_test, pd.DataFrame):
    y_test = y_test.squeeze("columns")

# Save model
model_filename = "RecallFirstTunedMLP.pkl"
joblib.dump(recallfirst_best_mlp, model_filename)

# Ensure 1D arrays for y_true
y_true = y_test.to_numpy() if hasattr(y_test, "to_numpy") else np.asarray(y_test)

# Predictions from MLP
y_pred_mlp = recallfirst_best_mlp.predict(X_test_ready)
y_prob_mlp = recallfirst_best_mlp.predict_proba(X_test_ready)[:, 1]

# Optional gender column if present
if isinstance(X_test, pd.DataFrame) and "Sex" in X_test.columns:
    gender_vals = X_test["Sex"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results
results = pd.DataFrame({
    "Sex": gender_vals,
    "y_true": y_true,
    "y_prob": y_prob_mlp,
    "y_pred": y_pred_mlp
})

preds_filename = "HeartFailureData_75F25M_RecallFirstTunedMLP_predictions.csv"
results.to_csv(preds_filename, index=False)

print(f"Saved Recall First Tuned MLP→ {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved Recall First Tuned MLP→ RecallFirstTunedMLP.pkl
Saved predictions → HeartFailureData_75F25M_RecallFirstTunedMLP_predictions.csv
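
Before relying on a saved model in a later notebook, it can be worth confirming that it reloads and predicts the same values. The sketch below shows such a round trip with `joblib`, using a small stand-in model trained on synthetic data and a temporary directory; the file name `model.pkl` is hypothetical.

```python
import os
import tempfile
import warnings

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data and a small model (assumption): not the tuned MLP.
X, y = make_classification(n_samples=200, n_features=11, random_state=42)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300,
                          random_state=42).fit(X, y)

# Save to a temporary location and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should give the same probabilities as the original.
same_probs = np.allclose(model.predict_proba(X), restored.predict_proba(X))
print("round trip preserved predictions:", same_probs)
```

Pickled scikit-learn models are generally only safe to reload with the same library version that saved them, so recording that version next to the file is a sensible precaution.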
