## CVD Prediction - Cardiovascular Disease Dataset (Source: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset/data)
Model Training and Evaluation

In [1]:
#load preprocessed data 
import pandas as pd
train_df = pd.read_csv("./data_subsets/train_75M_25F.csv")

X_test = pd.read_csv("./data_splits/X_test.csv")
y_test = pd.read_csv("./data_splits/y_test.csv")

#check out the data
train_df.head()

Unnamed: 0,source_id,gender,age,bmiclass,MAP,cholesterol,gluc,smoke,alco,active,cardio
0,25283,1,3,3,2,1,1,0,0,1,0
1,15317,1,5,1,3,1,1,0,0,1,0
2,1037,1,3,1,5,1,1,0,0,0,1
3,24418,1,2,1,3,1,1,0,0,1,0
4,13764,0,5,2,3,1,1,0,0,0,1


In [2]:
#drop source_id
train_df = train_df.drop(columns=["source_id"])

In [3]:
TARGET = "cardio"
SENSITIVE = "gender"   # 1 = Male, 0 = Female


# Identify feature types
binary_cols = ["gender", "smoke", "alco", "active"]
categorical_cols = ["age", "bmiclass", "MAP", "cholesterol", "gluc"]

X_train = train_df.drop(columns=[TARGET])
y_train = train_df[TARGET]

In [4]:
# One-hot encode the categorical columns; keep the binary columns as-is

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# 1) fit encoder on TRAIN categoricals only
ohe = OneHotEncoder(handle_unknown="ignore", drop="if_binary", sparse_output=False)
ohe.fit(X_train[categorical_cols])

# 2) transform TRAIN and TEST
X_train_cat = pd.DataFrame(
    ohe.transform(X_train[categorical_cols]),
    columns=ohe.get_feature_names_out(categorical_cols),
    index=X_train.index
)
X_test_cat = pd.DataFrame(
    ohe.transform(X_test[categorical_cols]),
    columns=ohe.get_feature_names_out(categorical_cols),
    index=X_test.index
)

# 3) concatenate: encoded categoricals + binary columns (passed through unchanged)
X_train_ready = pd.concat([X_train_cat, X_train[binary_cols]], axis=1)
X_test_ready  = pd.concat([X_test_cat,  X_test[binary_cols]],  axis=1)

print("Final feature shapes:", X_train_ready.shape, X_test_ready.shape)

Final feature shapes: (30000, 28) (11256, 28)


### Traditional ML Models - Baseline: K-Nearest Neighbors (KNN) & Decision Tree (DT)

In [5]:
#import required libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)

# Helper that prints accuracy, precision, recall, F1, the classification report, and the confusion matrix
def evaluate_model(y_true, y_pred, model_name):
    print(f"=== {model_name} Evaluation ===")
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred, average='binary'))
    print("Recall   :", recall_score(y_true, y_pred, average='binary'))
    print("F1 Score :", f1_score(y_true, y_pred, average='binary'))
    print("\nClassification Report:\n", classification_report(y_true, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
    print("\n" + "="*40 + "\n")

In [6]:
# KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_ready, y_train)

y_pred_knn = knn.predict(X_test_ready)
y_prob_knn = knn.predict_proba(X_test_ready)[:, 1] 

evaluate_model(y_test, y_pred_knn, "KNN")


# Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train_ready, y_train)

y_pred_dt = dt.predict(X_test_ready)
y_prob_dt = dt.predict_proba(X_test_ready)[:, 1]  
evaluate_model(y_test, y_pred_dt, "Decision Tree")

=== KNN Evaluation ===
Accuracy : 0.6566275764036958
Precision: 0.6507467870788468
Recall   : 0.6689876807712909
F1 Score : 0.6597411743991548

Classification Report:
               precision    recall  f1-score   support

           0       0.66      0.64      0.65      5655
           1       0.65      0.67      0.66      5601

    accuracy                           0.66     11256
   macro avg       0.66      0.66      0.66     11256
weighted avg       0.66      0.66      0.66     11256

Confusion Matrix:
 [[3644 2011]
 [1854 3747]]


=== Decision Tree Evaluation ===
Accuracy : 0.6850568585643213
Precision: 0.7045364106645444
Recall   : 0.6322085341903232
F1 Score : 0.6664157335089865

Classification Report:
               precision    recall  f1-score   support

           0       0.67      0.74      0.70      5655
           1       0.70      0.63      0.67      5601

    accuracy                           0.69     11256
   macro avg       0.69      0.68      0.68     11256
weighte

# Model Evaluation Results

## KNN Classifier
- **Accuracy:** 0.657  
- **Precision:** 0.651  
- **Recall:** 0.669  
- **F1 Score:** 0.660  

### Classification Report
| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| 0     | 0.66      | 0.64   | 0.65     | 5655    |
| 1     | 0.65      | 0.67   | 0.66     | 5601    |
| **Macro avg / Total** | **0.66** | **0.66** | **0.66** | **11256** |

### Confusion Matrix
|               | Predicted 0 | Predicted 1 |
|---------------|-------------|-------------|
| **Actual 0**  | 3644        | 2011        |
| **Actual 1**  | 1854        | 3747        |

KNN performs similarly on both classes, with slightly higher recall (0.669) than precision (0.651). It identifies about two thirds of the positive cases but also misclassifies 2011 of 5655 negatives (about 36%) as positives.
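The headline KNN metrics follow directly from the four confusion-matrix counts. The short check below recomputes them by hand; the counts are copied from the KNN confusion matrix above, and the variable names are illustrative only.

```python
# Counts copied from the KNN confusion matrix (rows = actual, columns = predicted).
tn, fp = 3644, 2011
fn, tp = 1854, 3747

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)           # of all predicted positives, how many are real
recall = tp / (tp + fn)              # of all real positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)  # share of negatives flagged as positive

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} fpr={false_positive_rate:.3f}")
```

The same arithmetic applies to every confusion matrix in this notebook, which makes it a quick way to cross-check a reported metric.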

---

## Decision Tree Classifier
- **Accuracy:** 0.685  
- **Precision:** 0.705  
- **Recall:** 0.632  
- **F1 Score:** 0.666  

### Classification Report
| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| 0     | 0.67      | 0.74   | 0.70     | 5655    |
| 1     | 0.70      | 0.63   | 0.67     | 5601    |
| **Macro avg / Total** | **0.69** | **0.68** | **0.68** | **11256** |

### Confusion Matrix
|               | Predicted 0 | Predicted 1 |
|---------------|-------------|-------------|
| **Actual 0**  | 4170        | 1485        |
| **Actual 1**  | 2060        | 3541        |

The Decision Tree has higher precision but lower recall than KNN. It is more conservative in predicting positives: false positives drop from 2011 to 1485, but false negatives rise from 1854 to 2060, so it misses more actual positives.

---

### KNN Improvement
The code improves the KNN model by running a **grid search** over key hyperparameters (`n_neighbors`, `weights`, and `metric`) with 5-fold stratified cross-validation, scored by F1, to find the best-performing configuration. The selected model is then evaluated on the test set, and its positive-class probabilities are stored in `y_prob_knn_best`; these are the input a **decision threshold tuning** step would need in order to boost recall, which is critical in medical prediction tasks.

In [7]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score, precision_score, f1_score
import numpy as np

# 1) Hyperparameter tuning for KNN 
param_grid = {
    "n_neighbors": list(range(1, 31)),
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan", "minkowski"],  # minkowski uses p=2 by default, so it duplicates euclidean here
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid=param_grid,
    cv=cv,
    scoring="f1",        
    n_jobs=-1,
    verbose=0,
    refit=True
)

# Fit 
grid.fit(X_train_ready, y_train)

print("Best KNN params:", grid.best_params_)
print("Best CV F1:", grid.best_score_)

best_knn = grid.best_estimator_

# 2) Evaluate best KNN on TEST 
y_pred_knn_best = best_knn.predict(X_test_ready)
y_prob_knn_best = best_knn.predict_proba(X_test_ready)[:, 1]   

evaluate_model(y_test, y_pred_knn_best, "KNN (best params)")

Best KNN params: {'metric': 'euclidean', 'n_neighbors': 27, 'weights': 'distance'}
Best CV F1: 0.6992184215344929
=== KNN (best params) Evaluation ===
Accuracy : 0.6888770433546553
Precision: 0.708606638839197
Recall   : 0.6364934833065524
F1 Score : 0.6706170052671181

Classification Report:
               precision    recall  f1-score   support

           0       0.67      0.74      0.71      5655
           1       0.71      0.64      0.67      5601

    accuracy                           0.69     11256
   macro avg       0.69      0.69      0.69     11256
weighted avg       0.69      0.69      0.69     11256

Confusion Matrix:
 [[4189 1466]
 [2036 3565]]




## KNN (Best Params)

### Best Parameters
- **Metric:** euclidean  
- **Neighbors:** 27  
- **Weights:** distance  

### Cross-Validation
- **Best CV F1 Score:** 0.699  

### Evaluation
- **Accuracy:** 0.689  
- **Precision:** 0.709  
- **Recall:** 0.636  
- **F1 Score:** 0.671  
- **Support:** 0→5655, 1→5601  

**Confusion Matrix**  
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 4189         | 1466         |
| **Actual: 1** | 2036         | 3565         |

- **False negatives:** **2036** (positives missed)  
- **False positives:** **1466** (negatives flagged)  

**Interpretation:**  
The tuned KNN achieves **68.9% accuracy** with a solid **precision (0.709)**, but its **recall (0.636)** is lower, meaning it misses a fair number of positives.  
This model is more conservative in labeling positives, resulting in fewer false alarms but at the cost of overlooking some actual positive cases.
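The recall shortfall described above is the kind of trade-off that decision-threshold tuning targets. The sketch below is not part of the original notebook: it uses synthetic data from `make_classification`, a hypothetical `TARGET_RECALL`, and a held-out validation split (so the test set is not used to pick the threshold). It sweeps thresholds over the positive-class probabilities and keeps the highest threshold that still meets the recall target.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the CVD dataset).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=42)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

knn_demo = KNeighborsClassifier(n_neighbors=27, weights="distance").fit(X_fit, y_fit)
val_prob = knn_demo.predict_proba(X_val)[:, 1]

TARGET_RECALL = 0.80  # hypothetical requirement, chosen only for illustration
thresholds = np.linspace(0.05, 0.95, 19)

# One (threshold, recall, precision) row per candidate threshold.
threshold_table = []
for t in thresholds:
    pred = (val_prob >= t).astype(int)
    threshold_table.append(
        (float(t), recall_score(y_val, pred), precision_score(y_val, pred, zero_division=0))
    )

# Highest threshold that still meets the recall target; fall back to the lowest threshold.
eligible = [row for row in threshold_table if row[1] >= TARGET_RECALL]
chosen = max(eligible, key=lambda row: row[0]) if eligible else threshold_table[0]
print(f"threshold={chosen[0]:.2f} recall={chosen[1]:.3f} precision={chosen[2]:.3f}")
```

Lowering the threshold can only keep or raise recall, so the sweep makes the precision cost of each recall gain explicit before a threshold is fixed.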

---

### Further KNN Improvement - Implementing PCA 

In [8]:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import numpy as np

# 1) PCA + KNN pipeline 
pca_knn = Pipeline([
    ('pca', PCA(n_components=0.95, random_state=42)),  # keep 95% variance
    ('knn', KNeighborsClassifier(
        n_neighbors=15, metric='manhattan', weights='distance'
    ))
])

pca_knn.fit(X_train_ready, y_train)

# Inspect PCA details
n_comp = pca_knn.named_steps['pca'].n_components_
expl_var = pca_knn.named_steps['pca'].explained_variance_ratio_.sum()
print(f"PCA components: {n_comp} | Explained variance retained: {expl_var:.3f}")

#2) Evaluate 
y_pred_pca_knn = pca_knn.predict(X_test_ready)
probs_pca_knn = pca_knn.predict_proba(X_test_ready)[:, 1]

evaluate_model(y_test, y_pred_pca_knn, "PCA+KNN")

PCA components: 16 | Explained variance retained: 0.953
=== PCA+KNN Evaluation ===
Accuracy : 0.6815031982942431
Precision: 0.6855670103092784
Recall   : 0.6648812712015711
F1 Score : 0.675065711955044

Classification Report:
               precision    recall  f1-score   support

           0       0.68      0.70      0.69      5655
           1       0.69      0.66      0.68      5601

    accuracy                           0.68     11256
   macro avg       0.68      0.68      0.68     11256
weighted avg       0.68      0.68      0.68     11256

Confusion Matrix:
 [[3947 1708]
 [1877 3724]]




# KNN Model Comparison

## 1. Baseline KNN
- **Accuracy:** 0.657  
- **Precision:** 0.651  
- **Recall:** 0.669  
- **F1 Score:** 0.660  
- **Support:** 0→5655, 1→5601  

**Confusion Matrix**  
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 3644         | 2011         |
| **Actual: 1** | 1854         | 3747         |

- **False negatives:** 1854  
- **False positives:** 2011  

**Interpretation:**  
Balanced recall and precision, but overall performance is modest. The model misses some positives and misclassifies many negatives.

---

## 2. Tuned KNN (Best Params)
- **Best Params:** {metric: euclidean, n_neighbors: 27, weights: distance}  
- **Best CV F1:** 0.699  
- **Accuracy:** 0.689  
- **Precision:** 0.709  
- **Recall:** 0.636  
- **F1 Score:** 0.671  
- **Support:** 0→5655, 1→5601  

**Confusion Matrix**  
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 4189         | 1466         |
| **Actual: 1** | 2036         | 3565         |

- **False negatives:** 2036  
- **False positives:** 1466  

**Interpretation:**  
Improved precision but lower recall compared to baseline. More conservative in predicting positives, resulting in fewer false alarms but more missed positives.

---

## 3. PCA + KNN (16 components, 95.3% variance retained)
- **Accuracy:** 0.682  
- **Precision:** 0.686  
- **Recall:** 0.665  
- **F1 Score:** 0.675  
- **Support:** 0→5655, 1→5601  

**Confusion Matrix**  
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 3947         | 1708         |
| **Actual: 1** | 1877         | 3724         |

- **False negatives:** 1877  
- **False positives:** 1708  

**Interpretation:**  
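Accuracy (0.682) and F1 (0.675) are higher than the baseline's (0.657 and 0.660), and precision and recall are more balanced than in the tuned KNN. PCA reduces the 28 one-hot features to 16 components, which lowers the dimensionality of the distance computations (efficiency and stability were not measured in this notebook).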
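Passing a fraction such as `n_components=0.95` tells PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction, which is how the pipeline above arrived at 16 components. The sketch below reproduces that selection by hand on synthetic data (an assumption: the 28-feature matrix from `make_classification` only mimics the shape of the notebook's features).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in with 28 features (not the notebook's data).
X_demo, _ = make_classification(
    n_samples=500, n_features=28, n_informative=10, random_state=42
)

# Manual selection: first component count whose cumulative variance exceeds 95%.
full_pca = PCA().fit(X_demo)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
manual_k = int(np.argmax(cumulative > 0.95)) + 1

# Automatic selection with a fractional n_components.
auto_pca = PCA(n_components=0.95).fit(X_demo)
retained = auto_pca.explained_variance_ratio_.sum()

print(f"manual k={manual_k} | PCA chose {auto_pca.n_components_} | retained={retained:.3f}")
```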

---

# Summary

| Model            | Accuracy | Precision | Recall | F1 Score |
|------------------|----------|-----------|--------|----------|
| Baseline KNN     | 0.657    | 0.651     | 0.669  | 0.660    |
| Tuned KNN        | 0.689    | 0.709     | 0.636  | 0.671    |
| PCA + KNN        | 0.682    | 0.686     | 0.665  | 0.675    |

- **Baseline KNN:** Balanced but weaker overall.  
- **Tuned KNN:** Best precision, higher accuracy, but lower recall (misses positives).  
- **PCA + KNN:** Most balanced precision and recall, higher recall than tuned KNN, the best F1 of the three, and fewer input dimensions (16 components instead of 28 features).  

---

# Final Conclusion
The **PCA + KNN** model offers the best **overall trade-off**:  
- It retains most of the variance (95.3%) while reducing the input from 28 features to 16 components.  
- It achieves **balanced precision and recall** (0.686 and 0.665) and the highest F1 (0.675), unlike the tuned KNN, which sacrifices recall.  
- Although tuned KNN has the highest precision, the drop in recall means it misses too many positives.  

**Therefore, PCA + KNN is the preferred version for deployment if balanced detection is the priority.**  
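The PCA + KNN pipeline above uses hand-picked KNN settings (`n_neighbors=15`, `metric='manhattan'`) rather than the grid-search winner. One possible follow-up, sketched here on synthetic data with a deliberately small, hypothetical grid (`joint_grid`), is to tune the PCA and KNN steps together by addressing pipeline parameters with the `step__parameter` naming convention.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: not the CVD dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=42
)

pipe = Pipeline([
    ("pca", PCA(random_state=42)),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical, intentionally small grid; keys use the "<step>__<param>" convention.
joint_grid = {
    "pca__n_components": [0.90, 0.95],
    "knn__n_neighbors": [5, 15, 27],
    "knn__weights": ["uniform", "distance"],
}

joint_search = GridSearchCV(
    pipe,
    param_grid=joint_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring="f1",
)
joint_search.fit(X_demo, y_demo)

print("Best joint params:", joint_search.best_params_)
print("Best CV F1:", round(joint_search.best_score_, 3))
```

Because PCA sits inside the pipeline, it is refit on each training fold, so the cross-validated score does not leak information from the held-out fold.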

In [9]:
# Save the selected KNN variant (the PCA + KNN pipeline, pca_knn) and its test predictions for fairness evaluation
import joblib, pandas as pd, numpy as np

# Ensure y_test is a Series (not a DataFrame)
if isinstance(y_test, pd.DataFrame):
    y_test = y_test.squeeze("columns")

# Save model
model_filename = "knn_best_model.pkl"
joblib.dump(pca_knn, model_filename)

# Ensure 1D arrays
y_true = y_test.to_numpy() if hasattr(y_test, "to_numpy") else np.asarray(y_test)
y_pred_pca_knn = pca_knn.predict(X_test_ready)
probs_pca_knn = pca_knn.predict_proba(X_test_ready)[:, 1]

# Optional gender column if present
if isinstance(X_test, pd.DataFrame) and "gender" in X_test.columns:
    gender_vals = X_test["gender"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results
results = pd.DataFrame({
    "gender": gender_vals,
    "y_true": y_true,
    "y_prob": probs_pca_knn,
    "y_pred": y_pred_pca_knn
})

preds_filename = "CVDKaggleData_75M25F_PCAKNN_predictions.csv"
results.to_csv(preds_filename, index=False)

print(f"Saved PCA+KNN model → {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved PCA+KNN model → knn_best_model.pkl
Saved predictions → CVDKaggleData_75M25F_PCAKNN_predictions.csv
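The predictions file is saved with a `gender` column so that a later fairness evaluation can compare error rates between groups. As a hedged illustration of what that comparison might look like, the sketch below computes per-group recall, precision, and positive-prediction rate. The frame `toy_results` and the helper `group_rates` are hypothetical; only the column names mirror the saved CSV.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy predictions with the same columns as the saved CSV.
toy_results = pd.DataFrame({
    "gender": [1, 1, 1, 1, 0, 0, 0, 0],   # 1 = Male, 0 = Female
    "y_true": [1, 1, 0, 0, 1, 1, 1, 0],
    "y_pred": [1, 0, 0, 1, 1, 1, 0, 0],
})

def group_rates(df, group_col="gender"):
    """Return recall, precision, and positive-prediction rate per group."""
    rows = {}
    for value, part in df.groupby(group_col):
        rows[value] = {
            "n": len(part),
            "recall": recall_score(part["y_true"], part["y_pred"], zero_division=0),
            "precision": precision_score(part["y_true"], part["y_pred"], zero_division=0),
            "positive_rate": part["y_pred"].mean(),
        }
    return pd.DataFrame(rows).T

rates = group_rates(toy_results)
print(rates)
print("Recall gap (male - female):", rates.loc[1, "recall"] - rates.loc[0, "recall"])
```

With the real file, `toy_results` would be replaced by `pd.read_csv(preds_filename)`; a recall gap between groups indicates that one group's positive cases are missed more often.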


### Improvement - Decision Tree (DT)

In [10]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# 1) Base model
dt = DecisionTreeClassifier(random_state=42)

# 2) Hyperparameter grid 
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, 9, None],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 6, 10],
}

# 3) Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 4) Grid search 
grid_dt = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=cv,
    scoring="recall",
    n_jobs=-1,
    verbose=0
)

grid_dt.fit(X_train_ready, y_train)

print("Best Decision Tree params:", grid_dt.best_params_)
print("Best CV Recall:", grid_dt.best_score_)  # scoring="recall", so best_score_ is the mean CV recall

# 5) Train & evaluate best DT
best_dt = grid_dt.best_estimator_
y_pred_dt_best = best_dt.predict(X_test_ready)
y_prob_dt_best = best_dt.predict_proba(X_test_ready)[:, 1]

evaluate_model(y_test, y_pred_dt_best, "Tuned Decision Tree")

Best Decision Tree params: {'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 10, 'min_samples_split': 2}
Best CV Recall: 0.6584
=== Tuned Decision Tree Evaluation ===
Accuracy : 0.7001599147121536
Precision: 0.7148648648648649
Recall   : 0.6611319407248706
F1 Score : 0.6869492625915963

Classification Report:
               precision    recall  f1-score   support

           0       0.69      0.74      0.71      5655
           1       0.71      0.66      0.69      5601

    accuracy                           0.70     11256
   macro avg       0.70      0.70      0.70     11256
weighted avg       0.70      0.70      0.70     11256

Confusion Matrix:
 [[4178 1477]
 [1898 3703]]




## Decision Tree (Best Params)

### Evaluation
- **Accuracy:** 0.700  
- **Precision:** 0.715  
- **Recall:** 0.661  
- **F1 Score:** 0.687  
- **Support:** 0→5655, 1→5601  

**Confusion Matrix**  
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 4178         | 1477         |
| **Actual: 1** | 1898         | 3703         |

- **False negatives:** **1898** (positives missed)  
- **False positives:** **1477** (negatives flagged)  

**Interpretation:**  
The tuned Decision Tree reaches **70% accuracy**, with **precision (0.715)** somewhat ahead of **recall (0.661)**. Its positive predictions are fairly reliable, but it still misses 1898 of 5601 actual positive cases (about 34%).

---

In [11]:
# Alternative DT tuning: simpler trees + class balancing + cost-complexity pruning
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.metrics import recall_score, precision_score, f1_score
import numpy as np

# Stage A: bias toward simpler trees with class_weight="balanced"
base_dt = DecisionTreeClassifier(random_state=42, class_weight="balanced")

param_grid_simple = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 4, 5, 6, 7],
    "min_samples_split": [5, 10, 20],
    "min_samples_leaf": [2, 4, 6],
    "min_impurity_decrease": [0.0, 1e-4, 1e-3],  # tiny regularization
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid_simple = GridSearchCV(
    estimator=base_dt,
    param_grid=param_grid_simple,
    cv=cv,
    scoring="recall",        # recall-focused search
    n_jobs=-1,
    verbose=0,
    refit=True
)
grid_simple.fit(X_train_ready, y_train)

print("Stage A — Best simple DT params:", grid_simple.best_params_)
print("Stage A — Best CV Recall:", grid_simple.best_score_)
simple_dt = grid_simple.best_estimator_

# Stage B: cost-complexity pruning on the best simple DT
path = simple_dt.cost_complexity_pruning_path(X_train_ready, y_train)
ccp_alphas = path.ccp_alphas

unique_alphas = np.unique(np.round(ccp_alphas, 6))
candidate_alphas = np.linspace(unique_alphas.min(), unique_alphas.max(), num=min(20, len(unique_alphas)))
candidate_alphas = np.unique(np.concatenate([candidate_alphas, [0.0]]))  # include no-pruning baseline

cv_scores = []
for alpha in candidate_alphas:
    dt_alpha = DecisionTreeClassifier(
        random_state=42,
        class_weight="balanced",
        criterion=simple_dt.criterion,
        max_depth=simple_dt.max_depth,
        min_samples_split=simple_dt.min_samples_split,
        min_samples_leaf=simple_dt.min_samples_leaf,
        min_impurity_decrease=simple_dt.min_impurity_decrease,
        ccp_alpha=alpha
    )
    # recall-focused CV
    recall_cv = cross_val_score(dt_alpha, X_train_ready, y_train, cv=cv, scoring="recall", n_jobs=-1).mean()
    cv_scores.append((alpha, recall_cv))

best_alpha, best_cv_recall = sorted(cv_scores, key=lambda x: x[1], reverse=True)[0]
print(f"Stage B — Best ccp_alpha: {best_alpha:.6f} | CV Recall: {best_cv_recall:.4f}")

# Final model fit with the chosen ccp_alpha
best_dt = DecisionTreeClassifier(
    random_state=42,
    class_weight="balanced",
    criterion=simple_dt.criterion,
    max_depth=simple_dt.max_depth,
    min_samples_split=simple_dt.min_samples_split,
    min_samples_leaf=simple_dt.min_samples_leaf,
    min_impurity_decrease=simple_dt.min_impurity_decrease,
    ccp_alpha=best_alpha
).fit(X_train_ready, y_train)

# Evaluation
y_pred_dt = best_dt.predict(X_test_ready)
y_prob_dt = best_dt.predict_proba(X_test_ready)[:, 1]   

evaluate_model(y_test, y_pred_dt, "Alternative Tuned & Pruned DT")

Stage A — Best simple DT params: {'criterion': 'entropy', 'max_depth': 6, 'min_impurity_decrease': 0.001, 'min_samples_leaf': 2, 'min_samples_split': 5}
Stage A — Best CV Recall: 0.7039333333333333
Stage B — Best ccp_alpha: 0.000000 | CV Recall: 0.7039
=== Alternative Tuned & Pruned DT Evaluation ===
Accuracy : 0.7082444918265813
Precision: 0.7008843419455523
Recall   : 0.7216568469916086
F1 Score : 0.711118930330753

Classification Report:
               precision    recall  f1-score   support

           0       0.72      0.69      0.71      5655
           1       0.70      0.72      0.71      5601

    accuracy                           0.71     11256
   macro avg       0.71      0.71      0.71     11256
weighted avg       0.71      0.71      0.71     11256

Confusion Matrix:
 [[3930 1725]
 [1559 4042]]




## Decision Tree (Tuned & Pruned)

### Best Parameters
- **Criterion:** entropy  
- **Max depth:** 6  
- **Min impurity decrease:** 0.001  
- **Min samples per leaf:** 2  
- **Min samples per split:** 5  
- **ccp_alpha (post-pruning):** 0.0  

### Cross-Validation
- **Best CV Recall (Stage A):** 0.704  
- **Best CV Recall (Stage B, with pruning):** 0.704  

### Evaluation
- **Accuracy:** 0.708  
- **Precision:** 0.701  
- **Recall:** **0.722**  
- **F1 Score:** 0.711  
- **Support:** 0→5655, 1→5601  

**Confusion Matrix**  
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 3930         | 1725         |
| **Actual: 1** | 1559         | 4042         |

- **False negatives:** **1559** (positives missed)  
- **False positives:** **1725** (negatives flagged)  

**Interpretation:**  
This tuned and pruned Decision Tree achieves **71% accuracy** with a strong balance between **precision (0.701)** and **recall (0.722)**.  
It shows **higher recall** than the earlier trees, meaning it catches more positives, at the cost of more false positives (1725 vs. 1477 for the tuned tree). Stage B selected `ccp_alpha=0.0`, i.e. no additional pruning: no candidate alpha improved cross-validated recall, so the depth and impurity constraints from Stage A already provide the regularization.
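To make the role of `ccp_alpha` concrete, the sketch below walks the cost-complexity pruning path of a small tree and records how many leaves survive at each alpha. It runs on synthetic data (an assumption; `X_demo` and `y_demo` are not the notebook's features), and clips the returned alphas at zero as a guard against tiny negative values from floating-point round-off.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the CVD dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

base_tree = DecisionTreeClassifier(max_depth=6, random_state=42).fit(X_demo, y_demo)

# Effective alphas at which successive subtrees are pruned away.
path = base_tree.cost_complexity_pruning_path(X_demo, y_demo)
alphas = np.clip(path.ccp_alphas, 0.0, None)

# Refit one tree per alpha and count its leaves.
leaf_counts = []
for alpha in alphas:
    pruned = DecisionTreeClassifier(
        max_depth=6, random_state=42, ccp_alpha=alpha
    ).fit(X_demo, y_demo)
    leaf_counts.append(pruned.get_n_leaves())

for alpha, leaves in zip(alphas, leaf_counts):
    print(f"ccp_alpha={alpha:.5f} -> {leaves} leaves")
```

An alpha of 0.0 leaves the tree untouched, so a search that selects 0.0 is reporting that none of the smaller subtrees scored better under the chosen metric.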

---

In [12]:
# Alternative DT tuning focused on higher recall
# Changes vs previous:
#  - Remove calibration (predict uses raw tree probs at 0.5)
#  - Tune class_weight (heavier positive weights allowed)
#  - Broaden depth a bit but keep regularization via min_samples_* and tiny impurity decrease
#  - Prune only with very small ccp_alphas to avoid killing recall

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, cross_val_score
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import numpy as np

# Simpler-but-expressive trees + tuned class weights
base_dt = DecisionTreeClassifier(random_state=42)

param_grid_simple = {
    "criterion": ["gini", "entropy"],                  # add "log_loss" if your sklearn supports it
    "max_depth": [4, 5, 6, 7, 8, 9, 10],               # a bit deeper to help recall
    "min_samples_split": [5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 6],
    "min_impurity_decrease": [0.0, 1e-4, 1e-3],
    "class_weight": ["balanced", {0:1,1:2}, {0:1,1:3}, {0:1,1:4}],  # stronger push toward positives
}

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

grid_simple = GridSearchCV(
    estimator=base_dt,
    param_grid=param_grid_simple,
    cv=cv,
    scoring="recall",      # prioritize sensitivity for class 1
    n_jobs=-1,
    verbose=0,
    refit=True
)
grid_simple.fit(X_train_ready, y_train)

best_params = grid_simple.best_params_
print("Stage A — Best DT params:", best_params)
print("Stage A — Best CV Recall:", round(grid_simple.best_score_, 4))

# Train a zero-pruned model with best params to get the pruning path
dt0 = DecisionTreeClassifier(random_state=42, **best_params, ccp_alpha=0.0).fit(X_train_ready, y_train)


# Stage B — Gentle cost-complexity pruning (favor small alphas)
path = dt0.cost_complexity_pruning_path(X_train_ready, y_train)
ccp_alphas = path.ccp_alphas

# Focus on tiny alphas only + 0.0 to avoid big recall loss
small_slice = ccp_alphas[: min(30, len(ccp_alphas))]  # ccp_alphas is returned in increasing order, so these are the smallest values
candidate_alphas = np.unique(np.r_[0.0, small_slice])

cv_scores = []
for alpha in candidate_alphas:
    dt_alpha = DecisionTreeClassifier(random_state=42, **best_params, ccp_alpha=alpha)
    rec = cross_val_score(dt_alpha, X_train_ready, y_train, cv=cv, scoring="recall", n_jobs=-1).mean()
    cv_scores.append((alpha, rec))

best_alpha, best_cv_recall = max(cv_scores, key=lambda x: x[1])
print(f"Stage B — Best ccp_alpha: {best_alpha:.6f} | CV Recall: {best_cv_recall:.4f}")

alt_best_dt = DecisionTreeClassifier(random_state=42, **best_params, ccp_alpha=best_alpha).fit(X_train_ready, y_train)


# Evaluation
y_pred = alt_best_dt.predict(X_test_ready)               
y_prob = alt_best_dt.predict_proba(X_test_ready)[:, 1]   

evaluate_model(y_test, y_pred, "Alternative Tuned & Pruned Decision Tree")

Stage A — Best DT params: {'class_weight': {0: 1, 1: 3}, 'criterion': 'gini', 'max_depth': 4, 'min_impurity_decrease': 0.0001, 'min_samples_leaf': 1, 'min_samples_split': 5}
Stage A — Best CV Recall: 1.0
Stage B — Best ccp_alpha: 0.000000 | CV Recall: 1.0000
=== Alternative Tuned & Pruned Decision Tree Evaluation ===
Accuracy : 0.4976012793176972
Precision: 0.4976012793176972
Recall   : 1.0
F1 Score : 0.6645310553479267

Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00      5655
           1       0.50      1.00      0.66      5601

    accuracy                           0.50     11256
   macro avg       0.25      0.50      0.33     11256
weighted avg       0.25      0.50      0.33     11256

Confusion Matrix:
 [[   0 5655]
 [   0 5601]]






# Decision Tree Model Comparison

## 1. Baseline Decision Tree
- **Accuracy:** 0.685  
- **Precision:** 0.705  
- **Recall:** 0.632  
- **F1 Score:** 0.666  
- **Support:** 0→5655, 1→5601  

**Confusion Matrix**  
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 4170         | 1485         |
| **Actual: 1** | 2060         | 3541         |

- **False negatives:** 2060  
- **False positives:** 1485  

**Interpretation:**  
The baseline Decision Tree shows decent precision but weaker recall. It correctly classifies most negatives but misses a substantial number of positives.

---

## 2. Tuned Decision Tree
- **Best Params:** {criterion: gini, max_depth: None, min_samples_leaf: 10, min_samples_split: 2}  
- **Best CV Recall:** 0.658 (the grid search used `scoring="recall"`)  
- **Accuracy:** 0.700  
- **Precision:** 0.715  
- **Recall:** 0.661  
- **F1 Score:** 0.687  
- **Support:** 0→5655, 1→5601  

**Confusion Matrix**  
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 4178         | 1477         |
| **Actual: 1** | 1898         | 3703         |

- **False negatives:** 1898  
- **False positives:** 1477  

**Interpretation:**  
This tuned version improves both accuracy and F1 score over the baseline. Recall rises from 0.632 to 0.661 while precision also improves slightly, and the minimum-samples-per-leaf constraint (`min_samples_leaf=10`) likely reduces overfitting.

---

## 3. Alternative Tuned & Pruned Decision Tree
- **Best Params:** {criterion: entropy, max_depth: 6, min_impurity_decrease: 0.001, min_samples_leaf: 2, min_samples_split: 5, ccp_alpha: 0.0}  
- **Best CV Recall:** 0.704  
- **Accuracy:** 0.708  
- **Precision:** 0.701  
- **Recall:** 0.722  
- **F1 Score:** 0.711  
- **Support:** 0→5655, 1→5601  

**Confusion Matrix**  
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 3930         | 1725         |
| **Actual: 1** | 1559         | 4042         |

- **False negatives:** 1559  
- **False positives:** 1725  

**Interpretation:**  
This pruned tree achieves the **highest recall among the non-degenerate trees (0.722)** and the best overall F1 score (0.711). It catches more positives at the cost of slightly more false positives, making it suitable when recall is a priority.

---

## 4. Alternative Tuned & Pruned Decision Tree (Class-weighted)
- **Best Params:** {class_weight: {0:1, 1:3}, criterion: gini, max_depth: 4, min_impurity_decrease: 0.0001, min_samples_leaf: 1, min_samples_split: 5, ccp_alpha: 0.0}  
- **Best CV Recall:** 1.0  
- **Accuracy:** 0.498  
- **Precision:** 0.498  
- **Recall:** 1.000  
- **F1 Score:** 0.665  
- **Support:** 0→5655, 1→5601  

**Confusion Matrix**  
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 0            | 5655         |
| **Actual: 1** | 0            | 5601         |

- **False negatives:** 0  
- **False positives:** 5655  

**Interpretation:**  
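This class-weighted model predicts **all cases as positive**: optimizing recall alone with heavy positive weights rewards exactly this degenerate solution (CV recall 1.0). Recall is perfect, but accuracy and precision both equal the positive-class prevalence (0.498), and class 0 is never predicted, so its precision is undefined (reported as 0.00). It is not practical for deployment.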
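The degenerate result can be reproduced from the class counts alone, without any model. The check below builds the test labels from the reported supports (5655 negatives, 5601 positives) and scores an always-positive prediction; it also reports balanced accuracy, which is shown here only as one example of a metric that exposes the problem.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score, precision_score, recall_score
)

# Labels rebuilt from the class supports reported above.
y_true_demo = np.array([0] * 5655 + [1] * 5601)
y_all_positive = np.ones_like(y_true_demo)  # "predict positive for everyone"

always_pos = {
    "recall": recall_score(y_true_demo, y_all_positive),
    "precision": precision_score(y_true_demo, y_all_positive),
    "accuracy": accuracy_score(y_true_demo, y_all_positive),
    "balanced_accuracy": balanced_accuracy_score(y_true_demo, y_all_positive),
}

for name, value in always_pos.items():
    print(f"{name:>18}: {value:.4f}")
```

Since recall cannot distinguish this predictor from a useful one, a recall-focused search is usually paired with a second criterion, for example scoring by `"f1"` or `"balanced_accuracy"`; which one suits this project is a judgment call the notebook does not settle.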

---

# Summary

| Model                                | Accuracy | Precision | Recall | F1 Score |
|--------------------------------------|----------|-----------|--------|----------|
| Baseline Decision Tree               | 0.685    | 0.705     | 0.632  | 0.666    |
| Tuned Decision Tree                  | 0.700    | 0.715     | 0.661  | 0.687    |
| Alt. Tuned & Pruned Decision Tree    | 0.708    | 0.701     | 0.722  | 0.711    |
| Alt. Class-weighted Decision Tree    | 0.498    | 0.498     | 1.000  | 0.665    |

- **Baseline Decision Tree:** Decent precision but limited recall.  
- **Tuned Decision Tree:** Regularization improves accuracy and F1, and precision and recall are more balanced.  
- **Alternative Tuned & Pruned DT:** Best overall performance, with the **highest F1** and the highest recall among the usable models, offering the best trade-off.  
- **Class-weighted DT:** Predicts every case as positive, so it is not useful in practice.  

---

# Final Conclusion
The **Alternative Tuned & Pruned Decision Tree** is the **best-performing model**.  
It achieves the strongest F1 score (0.711) and the highest recall of any non-degenerate tree (0.722), while also having the best accuracy (0.708). This makes it the most balanced and reliable option among the Decision Tree versions.

---

In [13]:
import joblib, pandas as pd, numpy as np

# Save the selected Decision Tree: best_dt was reassigned in In [11], so this is the tuned & pruned tree
model_filename = "tuned_dt_model.pkl"
joblib.dump(best_dt, model_filename)

# Ensure 1D arrays
y_true = y_test.to_numpy() if hasattr(y_test, "to_numpy") else np.asarray(y_test)
# Use the In [11] predictions/probabilities (y_pred_dt, y_prob_dt), not the degenerate In [12] outputs
y_pred = y_pred_dt
y_prob = y_prob_dt

# Optional gender column if present
if isinstance(X_test, pd.DataFrame) and "gender" in X_test.columns:
    gender_vals = X_test["gender"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results
results = pd.DataFrame({
    "gender": gender_vals,
    "y_true": y_true,
    "y_pred_dt": y_pred,
    "y_prob": y_prob
})

preds_filename = "CVDKaggleData_75M25F_DT_tuned_predictions.csv"
results.to_csv(preds_filename, index=False)

print(f"Saved tuned DT model → {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved tuned DT model → tuned_dt_model.pkl
Saved predictions → CVDKaggleData_75M25F_DT_tuned_predictions.csv


### Ensemble Model - Random Forest (RF)

In [14]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest
rf = RandomForestClassifier(random_state=42)

# Train the model
rf.fit(X_train_ready, y_train)

# Predict on test set
y_pred_rf = rf.predict(X_test_ready)
evaluate_model(y_test, y_pred_rf, "Random Forest")

=== Random Forest Evaluation ===
Accuracy : 0.6909203980099502
Precision: 0.7012518968133535
Recall   : 0.6600607034458132
F1 Score : 0.6800331095373862

Classification Report:
               precision    recall  f1-score   support

           0       0.68      0.72      0.70      5655
           1       0.70      0.66      0.68      5601

    accuracy                           0.69     11256
   macro avg       0.69      0.69      0.69     11256
weighted avg       0.69      0.69      0.69     11256

Confusion Matrix:
 [[4080 1575]
 [1904 3697]]




## Random Forest Evaluation

### Evaluation
- **Accuracy:** 0.691  
- **Precision:** 0.701  
- **Recall:** 0.660  
- **F1 Score:** 0.680  
- **Support:** 0→5655, 1→5601  

**Confusion Matrix**  
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 4080         | 1575         |
| **Actual: 1** | 1904         | 3697         |

- **False negatives:** 1904 (positives missed)  
- **False positives:** 1575 (negatives flagged)  

**Interpretation:**  
The Random Forest achieves **69.1% accuracy** with a balanced trade-off between **precision (0.701)** and **recall (0.660)**.  
It performs slightly better at identifying negatives (class 0 recall: 0.72) than positives (class 1 recall: 0.66).  
The model is fairly balanced overall, with **moderate precision and recall**, but it still misses about 34% of positives (1904 of 5601).  
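
The headline metrics above follow directly from the four confusion-matrix cells. A minimal sketch using only the counts printed for the standard Random Forest; the function name `metrics_from_confusion` is an illustrative helper, not something defined in the notebook:

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive the usual binary metrics from the four confusion-matrix cells."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # class 1 recall (sensitivity)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),  # class 0 recall
        "miss_rate": fn / (fn + tp),    # share of positives not detected
        "f1": 2 * precision * recall / (precision + recall),
    }

# Cells read from [[4080 1575] [1904 3697]] (rows = actual, columns = predicted).
rf_metrics = metrics_from_confusion(tn=4080, fp=1575, fn=1904, tp=3697)
for name, value in rf_metrics.items():
    print(f"{name:12s} {value:.3f}")
```

The same helper can be pointed at any of the other confusion matrices in this notebook to check a reported figure.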

---

### Improvement Random Forest (RF)

In [15]:
from sklearn.experimental import enable_halving_search_cv  # must be first
from sklearn.model_selection import HalvingGridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

Xtr = getattr(X_train_ready, "values", X_train_ready).astype("float32")
Xte = getattr(X_test_ready, "values", X_test_ready).astype("float32")

rf = RandomForestClassifier(random_state=42, n_jobs=1, bootstrap=True)

# Do NOT include 'n_estimators' in param_grid; Halving will vary it as the resource.
param_grid = {
    "max_depth": [None, 12],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt"],
    "class_weight": ["balanced"],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

search = HalvingGridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    resource="n_estimators",   # use trees as the resource
    min_resources=100,         # first round: 100 trees per candidate
    max_resources=400,         # upper bound; with factor=3 the rounds use 100, then 300 trees
    factor=3,
    cv=cv,
    scoring="recall",
    n_jobs=-1,
    verbose=1,
    refit=True,
    return_train_score=False,
)

search.fit(Xtr, y_train)
best_rf = search.best_estimator_
y_pred_rf = best_rf.predict(Xte)
y_prob_rf = best_rf.predict_proba(Xte)[:, 1]
evaluate_model(y_test, y_pred_rf, "Random Forest (best, halving over n_estimators)")



n_iterations: 2
n_required_iterations: 2
n_possible_iterations: 2
min_resources_: 100
max_resources_: 400
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 8
n_resources: 100
Fitting 3 folds for each of 8 candidates, totalling 24 fits
----------
iter: 1
n_candidates: 3
n_resources: 300
Fitting 3 folds for each of 3 candidates, totalling 9 fits
=== Random Forest (best, halving over n_estimators) Evaluation ===
Accuracy : 0.7018479033404407
Precision: 0.7114334149557355
Recall   : 0.6743438671665773
F1 Score : 0.6923923006416132

Classification Report:
               precision    recall  f1-score   support

           0       0.69      0.73      0.71      5655
           1       0.71      0.67      0.69      5601

    accuracy                           0.70     11256
   macro avg       0.70      0.70      0.70     11256
weighted avg       0.70      0.70      0.70     11256

Confusion Matrix:
 [[4123 1532]
 [1824 3777]]




# Random Forest Model Comparison

## 1. Standard Random Forest
- **Accuracy:** 0.691  
- **Precision:** 0.701  
- **Recall:** 0.660  
- **F1 Score:** 0.680  
- **Support:** 0→5655, 1→5601  

**Confusion Matrix**  
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 4080         | 1575         |
| **Actual: 1** | 1904         | 3697         |

- **False negatives:** 1904  
- **False positives:** 1575  

**Interpretation:**  
This baseline Random Forest achieves solid performance with a balanced trade-off. It identifies negatives slightly better than positives, but misses about 1,900 positives (1904).

---

## 2. Random Forest (Best, Halving over n_estimators)
- **Accuracy:** 0.702  
- **Precision:** 0.711  
- **Recall:** 0.674  
- **F1 Score:** 0.692  
- **Support:** 0→5655, 1→5601  

**Confusion Matrix**  
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 4123         | 1532         |
| **Actual: 1** | 1824         | 3777         |

- **False negatives:** 1824  
- **False positives:** 1532  

**Interpretation:**  
The tuned Random Forest (via HalvingGridSearchCV) improves **accuracy, precision, recall, and F1** compared to the standard model. It catches more positives (fewer false negatives) while also reducing false positives. This makes it a more balanced and effective version.
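
The search log for this model (8 candidates at 100 trees, then 3 candidates at 300 trees) can be reproduced with a small schedule calculation. This is a simplified sketch of successive halving written for illustration; it is not scikit-learn's own code, and the function name `halving_schedule` is an assumption:

```python
import math

def halving_schedule(n_candidates, min_resources, max_resources, factor):
    """Simplified successive-halving plan: (iteration, candidates, resources)."""
    # Rounds needed to narrow the candidates down to roughly `factor` or fewer.
    n_required = 1 + math.floor(math.log(n_candidates, factor))
    # Rounds that fit before the resource budget would exceed max_resources.
    n_possible = 1 + math.floor(math.log(max_resources // min_resources, factor))
    schedule = []
    candidates = n_candidates
    for it in range(min(n_required, n_possible)):
        resources = min(min_resources * factor ** it, max_resources)
        schedule.append((it, candidates, resources))
        candidates = math.ceil(candidates / factor)  # keep the top 1/factor
    return schedule

# Grid size: 2 max_depth x 2 min_samples_split x 2 min_samples_leaf x 1 x 1.
n_grid = 2 * 2 * 2 * 1 * 1
schedule = halving_schedule(n_grid, min_resources=100, max_resources=400, factor=3)
for it, cands, trees in schedule:
    print(f"iter {it}: {cands} candidates, {trees} trees each")
```

Under these settings the last round trains 300 trees, so the 400-tree ceiling is never reached; the refit model's tree count can be confirmed with `best_rf.n_estimators`.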

---

# Summary

| Model                                | Accuracy | Precision | Recall | F1 Score |
|--------------------------------------|----------|-----------|--------|----------|
| Standard Random Forest               | 0.691    | 0.701     | 0.660  | 0.680    |
| RF (Halving over n_estimators, best) | 0.702    | 0.711     | 0.674  | 0.692    |

- **Standard RF:** Balanced but slightly weaker on recall and overall score.  
- **Halving RF (best):** Stronger across all metrics, with better recall and fewer misclassifications.  

**Final Conclusion:**  
The **Random Forest tuned via Halving** is the stronger model, offering a consistent improvement in accuracy, precision, recall, and F1 over the standard Random Forest.

---

In [16]:
# Save Tuned Random Forest Results

# Save tuned Random Forest model
model_filename = "tuned_rf_model.pkl"
joblib.dump(best_rf, model_filename)

# Ensure 1D arrays for y_true and y_pred (ravel also flattens a single-column DataFrame)
y_true = np.asarray(y_test).ravel()
y_pred = y_pred_rf  # from best_rf.predict(Xte)
y_prob = y_prob_rf  # from best_rf.predict_proba(Xte)[:, 1]

# Optional gender column if present in test set
if isinstance(X_test, pd.DataFrame) and "gender" in X_test.columns:
    gender_vals = X_test["gender"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results DataFrame
results = pd.DataFrame({
    "gender": gender_vals,
    "y_true": y_true,
    "y_pred_rf_tuned": y_pred,
    "y_prob": y_prob_rf
})

preds_filename = "CVDKaggleData_75M25F_RF_tuned_predictions.csv"
results.to_csv(preds_filename, index=False)

print(f"Saved tuned RF model → {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved tuned RF model → tuned_rf_model.pkl
Saved predictions → CVDKaggleData_75M25F_RF_tuned_predictions.csv


### Deep Learning - Multi-layer Perceptron

In [17]:
#import required library 
from sklearn.neural_network import MLPClassifier

In [18]:
# Initialize MLP model
mlp = MLPClassifier(
    hidden_layer_sizes=(100,),   # one hidden layer with 100 neurons
    activation='relu',           # or 'tanh'
    solver='adam',               # optimizer
    max_iter=1000,                # increase if convergence warning appears
    random_state=42
)

# Train the model
mlp.fit(X_train_ready, y_train)

# Predict
y_pred_mlp = mlp.predict(X_test_ready)

evaluate_model(y_test, y_pred_mlp, "Multilayer Perceptron (MLP)")

=== Multilayer Perceptron (MLP) Evaluation ===
Accuracy : 0.6961620469083155
Precision: 0.7021316033364227
Recall   : 0.6763078021781824
F1 Score : 0.6889778101127683

Classification Report:
               precision    recall  f1-score   support

           0       0.69      0.72      0.70      5655
           1       0.70      0.68      0.69      5601

    accuracy                           0.70     11256
   macro avg       0.70      0.70      0.70     11256
weighted avg       0.70      0.70      0.70     11256

Confusion Matrix:
 [[4048 1607]
 [1813 3788]]




## Multilayer Perceptron (MLP) Evaluation

### Evaluation
- **Accuracy:** 0.696  
- **Precision:** 0.702  
- **Recall:** 0.676  
- **F1 Score:** 0.689  
- **Support:** 0→5655, 1→5601  

**Confusion Matrix**  
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 4048         | 1607         |
| **Actual: 1** | 1813         | 3788         |

- **False negatives:** 1813 (positives missed)  
- **False positives:** 1607 (negatives flagged)  

**Interpretation:**  
The MLP achieves **69.6% accuracy** with a balanced trade-off between **precision (0.702)** and **recall (0.676)**.  
It correctly identifies both negatives and positives at similar rates (class 0 recall: 0.72, class 1 recall: 0.68).  
The model slightly favors precision, making it reliable when predicting positives, but it still misses around 1,800 actual positives (1813 false negatives).  
Overall, the MLP performs comparably to Random Forest and tuned Decision Trees, showing balanced but not standout performance.  

---

### Improvements - MLP

In [19]:
#Adam + Early Stopping 
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix

adammlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),   # slightly smaller/deeper can help
    activation='relu',
    solver='adam',
    learning_rate_init=1e-3,       # scikit-learn's default step size
    alpha=1e-3,                    # L2 regularization to reduce overfitting
    batch_size=32,
    max_iter=1000,                 # increased max_iter
    early_stopping=True,           # use a validation split internally
    validation_fraction=0.15,
    n_iter_no_change=25,          
    tol=1e-4,
    random_state=42
)

adammlp.fit(X_train_ready, y_train)  
y_pred_mlp = adammlp.predict(X_test_ready)                     
y_prob_mlp = adammlp.predict_proba(X_test_ready)[:, 1]         

evaluate_model(y_test, y_pred_mlp, "(Adam + EarlyStopping)")

=== (Adam + EarlyStopping) Evaluation ===
Accuracy : 0.7133972992181947
Precision: 0.7283214766391078
Recall   : 0.6763078021781824
F1 Score : 0.7013516015552675

Classification Report:
               precision    recall  f1-score   support

           0       0.70      0.75      0.72      5655
           1       0.73      0.68      0.70      5601

    accuracy                           0.71     11256
   macro avg       0.71      0.71      0.71     11256
weighted avg       0.71      0.71      0.71     11256

Confusion Matrix:
 [[4242 1413]
 [1813 3788]]




## MLP (Adam + EarlyStopping) Evaluation

### Evaluation
- **Accuracy:** 0.713  
- **Precision:** 0.728  
- **Recall:** 0.676  
- **F1 Score:** 0.701  
- **Support:** 0→5655, 1→5601  

**Confusion Matrix**  
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 4242         | 1413         |
| **Actual: 1** | 1813         | 3788         |

- **False negatives:** 1813 (positives missed)  
- **False positives:** 1413 (negatives flagged)  

**Interpretation:**  
The MLP with **Adam and early stopping** delivers **71.3% accuracy**, the best so far among the neural models.  
It achieves the **highest precision so far (0.728)**, making it reliable when predicting positives, though recall (0.676) indicates that about a third of positives are still missed.  
The model improves over the plain MLP by reducing false positives (1413 vs. 1607), while false negatives are unchanged at 1813.  
Overall, this variant offers the **most balanced performance** among the neural approaches so far; early stopping on the held-out 15% validation split likely helps limit overfitting.  
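
The early-stopping settings used here (`n_iter_no_change=25`, `tol=1e-4`) describe a patience rule on the validation score. The following is a simplified sketch of such a rule, written for illustration; the function name and the toy score sequence are assumptions, and scikit-learn's internal bookkeeping may differ in detail:

```python
def early_stop_epoch(val_scores, patience=25, tol=1e-4):
    """Return (stopping epoch, best score) under a simple patience rule."""
    best = float("-inf")
    stale = 0  # consecutive epochs without an improvement of at least tol
    for epoch, score in enumerate(val_scores, start=1):
        if score < best + tol:
            stale += 1
        else:
            stale = 0
        best = max(best, score)
        if stale > patience:
            return epoch, best
    return len(val_scores), best  # ran out of epochs before triggering

# Toy validation curve: improves for three epochs, then plateaus.
toy_scores = [0.60, 0.65, 0.70, 0.70, 0.70, 0.70, 0.70, 0.70]
stop_epoch, best_score = early_stop_epoch(toy_scores, patience=3, tol=1e-4)
print(stop_epoch, best_score)
```

On the fitted estimator, the attributes `adammlp.n_iter_` and `adammlp.validation_scores_` show how many epochs actually ran and how the validation score evolved.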

---

### Further Improvement MLP 

In [20]:
# Randomized search: early stopping + single-metric scoring + lighter CV
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import fbeta_score, make_scorer
import numpy as np

# Keep arrays lean
Xtr = getattr(X_train_ready, "values", X_train_ready).astype("float32")
Xte = getattr(X_test_ready, "values", X_test_ready).astype("float32")

base_mlp = MLPClassifier(
    solver="adam",
    early_stopping=True,          # <- stop per-config when val score plateaus
    validation_fraction=0.1,
    n_iter_no_change=10,
    max_iter=200,                 # <- much lower; early_stopping will bail sooner
    tol=1e-4,
    random_state=42
)

# Trim the space; focus on what usually matters
param_dist = {
    "hidden_layer_sizes": [(64,), (128,), (64, 32)],
    "activation": ["relu"],       # 'relu' typically dominates for tabular MLPs
    "alpha": [1e-5, 1e-4, 3e-4, 1e-3],
    "learning_rate_init": [1e-3, 5e-4, 3e-4],
    "batch_size": [32, 64, 128],  # larger batch -> faster per-epoch
}

# Fewer folds = big speedup with little generalization loss
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Use a single scorer that matches your objective (recall-weighted)
fbeta2 = make_scorer(fbeta_score, beta=2)

rs = RandomizedSearchCV(
    estimator=base_mlp,
    param_distributions=param_dist,
    n_iter=20,                    # fewer trials; early stop handles over-training
    scoring=fbeta2,               # <- single metric speeds everything up
    refit=True,                   # refit on full train with best params
    cv=cv,
    n_jobs=-1,
    verbose=1,
    random_state=42
)

rs.fit(Xtr, y_train)
best_mlp = rs.best_estimator_

print("Best MLP params:", rs.best_params_)
print(f"Best CV F-beta (β=2): {rs.best_score_:.4f}")

y_pred = best_mlp.predict(Xte)
y_prob = best_mlp.predict_proba(Xte)[:, 1]
evaluate_model(y_test, y_pred, model_name="Best MLP (Adam + ES, fast)")

Fitting 3 folds for each of 20 candidates, totalling 60 fits
Best MLP params: {'learning_rate_init': 0.0003, 'hidden_layer_sizes': (64, 32), 'batch_size': 32, 'alpha': 0.0003, 'activation': 'relu'}
Best CV F-beta (β=2): 0.7020
=== Best MLP (Adam + ES, fast) Evaluation ===
Accuracy : 0.7144633972992182
Precision: 0.7250612860644918
Recall   : 0.6864845563292269
F1 Score : 0.7052457813646368

Classification Report:
               precision    recall  f1-score   support

           0       0.71      0.74      0.72      5655
           1       0.73      0.69      0.71      5601

    accuracy                           0.71     11256
   macro avg       0.72      0.71      0.71     11256
weighted avg       0.71      0.71      0.71     11256

Confusion Matrix:
 [[4197 1458]
 [1756 3845]]




# MLP Model Comparison

## 1. Baseline MLP
- **Accuracy:** 0.696  
- **Precision:** 0.702  
- **Recall:** 0.676  
- **F1 Score:** 0.689  
- **Support:** 0→5655, 1→5601  

**Confusion Matrix**  
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 4048         | 1607         |
| **Actual: 1** | 1813         | 3788         |

- **False negatives:** 1813  
- **False positives:** 1607  

**Interpretation:**  
The baseline MLP shows balanced performance with precision (0.702) slightly higher than recall (0.676). It misses about **1,800 positives** and has a moderate number of false alarms.

---

## 2. MLP (Adam + EarlyStopping)
- **Accuracy:** 0.713  
- **Precision:** 0.728  
- **Recall:** 0.676  
- **F1 Score:** 0.701  
- **Support:** 0→5655, 1→5601  

**Confusion Matrix**  
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 4242         | 1413         |
| **Actual: 1** | 1813         | 3788         |

- **False negatives:** 1813  
- **False positives:** 1413  

**Interpretation:**  
Adding **early stopping**, together with a two-layer (64, 32) architecture and stronger L2 regularization, boosts **accuracy and precision** (the baseline already used the Adam solver). False positives decrease compared to baseline (1413 vs 1607), while recall remains the same. This model is more reliable when predicting positives.
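
Because true positives and false negatives are identical in the two runs, the precision gain comes entirely from the drop in false positives. A short check using only the confusion-matrix counts reported above (the dictionary name `by_model` is illustrative):

```python
tp, fn = 3788, 1813  # identical for both MLP runs

false_positives = {
    "Baseline MLP": 1607,
    "MLP (Adam + EarlyStopping)": 1413,
}

by_model = {}
for name, fp in false_positives.items():
    precision = tp / (tp + fp)  # changes with the number of false alarms
    recall = tp / (tp + fn)     # unaffected by false positives
    by_model[name] = (precision, recall)
    print(f"{name:28s} precision={precision:.3f} recall={recall:.3f}")
```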

---

## 3. Best MLP (Adam + ES, tuned)
- **Best Params:** {learning_rate_init: 0.0003, hidden_layer_sizes: (64, 32), batch_size: 32, alpha: 0.0003, activation: relu}  
- **Best CV F-beta (β=2):** 0.702  
- **Accuracy:** 0.714  
- **Precision:** 0.725  
- **Recall:** 0.686  
- **F1 Score:** 0.705  
- **Support:** 0→5655, 1→5601  

**Confusion Matrix**  
|               | Predicted: 0 | Predicted: 1 |
|--------------:|-------------:|-------------:|
| **Actual: 0** | 4197         | 1458         |
| **Actual: 1** | 1756         | 3845         |

- **False negatives:** 1756  
- **False positives:** 1458  

**Interpretation:**  
The tuned MLP further improves recall (0.686) while keeping strong precision (0.725). It reduces missed positives (1756 vs 1813 earlier) and maintains a balanced trade-off. This is the **best-performing MLP variant** among the three.
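
The search optimized F-beta with β=2, which weights recall more heavily than precision, and the reported CV score (0.702) is an average over training folds rather than a test-set value. A minimal sketch of the formula applied to the test confusion matrix above; the helper name `fbeta_from_counts` is an assumption for illustration:

```python
def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta from counts: (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP)."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# Counts from the tuned MLP's test confusion matrix [[4197 1458] [1756 3845]].
tp, fp, fn = 3845, 1458, 1756

f1_test = fbeta_from_counts(tp, fp, fn, beta=1)  # the usual F1
f2_test = fbeta_from_counts(tp, fp, fn, beta=2)  # missed positives weigh 4x
print(f"F1={f1_test:.4f}  F2={f2_test:.4f}")
```

With β=2 each false negative counts four times as much as a false positive in the denominator, so F2 falls below F1 whenever recall is lower than precision, as it is here.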

---

# Summary

| Model                       | Accuracy | Precision | Recall | F1 Score |
|-----------------------------|----------|-----------|--------|----------|
| Baseline MLP                | 0.696    | 0.702     | 0.676  | 0.689    |
| MLP (Adam + EarlyStopping)  | 0.713    | 0.728     | 0.676  | 0.701    |
| Best MLP (Adam + ES, tuned) | 0.714    | 0.725     | 0.686  | 0.705    |

- **Baseline MLP:** Solid starting point, balanced precision/recall but lower accuracy.  
- **MLP (Adam + ES):** Improves accuracy and precision, reduces false positives, but recall unchanged.  
- **Best MLP (Adam + ES, tuned):** Delivers the best balance with improved recall and F1 score, while keeping precision high.  

---

# Final Conclusion
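The **Best MLP (Adam + EarlyStopping, tuned)** is the strongest of the three MLP variants.  
It achieves the **highest recall and F1** among them, meaning it captures more positives while remaining precise. Compared with the tuned Decision Tree and Random Forest results above, it has the highest accuracy (0.714), although the Alt. Tuned & Pruned Decision Tree reports higher recall (0.722) and F1 (0.711), so the choice for deployment depends on which metric matters most.  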
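
To see how the best variant of each model family compares, the rows from the three summary tables in this section can be ranked per metric. A small sketch using only figures already reported above; the dictionary and variable names are illustrative:

```python
# Best row from each summary table in this section.
best_per_family = {
    "Alt. Tuned & Pruned Decision Tree": {
        "accuracy": 0.708, "precision": 0.701, "recall": 0.722, "f1": 0.711},
    "RF (Halving over n_estimators, best)": {
        "accuracy": 0.702, "precision": 0.711, "recall": 0.674, "f1": 0.692},
    "Best MLP (Adam + ES, tuned)": {
        "accuracy": 0.714, "precision": 0.725, "recall": 0.686, "f1": 0.705},
}

# For each metric, pick the model with the highest value.
leaders = {
    metric: max(best_per_family, key=lambda model: best_per_family[model][metric])
    for metric in ("accuracy", "precision", "recall", "f1")
}

for metric, model in leaders.items():
    print(f"{metric:9s} -> {model} ({best_per_family[model][metric]:.3f})")
```

The differences between families are on the order of one to four percentage points, so which model counts as best depends on whether missed positives or false alarms are costlier.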

---

In [21]:
# Save Tuned MLP Results
import joblib, pandas as pd, numpy as np

# Save MLP model
model_filename =  "mlp_adamtuned.pkl"
joblib.dump(best_mlp, model_filename)

# Ensure 1D arrays for y_true and y_pred (ravel also flattens a single-column DataFrame)
y_true = np.asarray(y_test).ravel()
y_pred = best_mlp.predict(Xte)
y_prob = best_mlp.predict_proba(Xte)[:, 1]

# Optional gender column if present in test set
if isinstance(X_test, pd.DataFrame) and "gender" in X_test.columns:
    gender_vals = X_test["gender"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results DataFrame
results = pd.DataFrame({
    "gender": gender_vals,
    "y_true": y_true,
    "y_pred": y_pred,
    "y_prob": y_prob
})

preds_filename = "CVDKaggleData_75M25F_MLP_adamtuned_predictions.csv"
results.to_csv(preds_filename, index=False)

print(f"Saved Adam tuned MLP model → {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved Adam tuned MLP model → mlp_adamtuned.pkl
Saved predictions → CVDKaggleData_75M25F_MLP_adamtuned_predictions.csv
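
Since the saved predictions keep `gender` next to `y_true` and `y_pred`, recall can later be compared per group. A minimal sketch, assuming the column names written above and the coding 1 = Male, 0 = Female from the setup cell; the rows in `sample_csv` are made up for illustration and are not results from this notebook:

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for the saved predictions file (same column names).
sample_csv = """gender,y_true,y_pred,y_prob
1,1,1,0.81
1,1,0,0.42
1,0,0,0.20
0,1,1,0.77
0,1,1,0.64
0,0,1,0.58
"""

def recall_by_group(rows, group_col="gender", true_col="y_true", pred_col="y_pred"):
    """Recall (TP / actual positives) computed separately for each group value."""
    true_pos = defaultdict(int)
    actual_pos = defaultdict(int)
    for row in rows:
        if int(float(row[true_col])) != 1:
            continue  # recall only looks at actual positives
        group = int(float(row[group_col]))
        actual_pos[group] += 1
        if int(float(row[pred_col])) == 1:
            true_pos[group] += 1
    return {group: true_pos[group] / actual_pos[group] for group in actual_pos}

rows = list(csv.DictReader(io.StringIO(sample_csv)))
group_recall = recall_by_group(rows)
print(group_recall)
```

To run this on the real output, the `io.StringIO(sample_csv)` source would be replaced by an open handle on `preds_filename`.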
