## CVD Prediction - Cardiovascular Disease Dataset (Source: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset/data)
Model Training and Evaluation

In [1]:
#load preprocessed data 
import pandas as pd
train_df = pd.read_csv("./data_subsets/train_50_50.csv")

X_test = pd.read_csv("./data_splits/X_test.csv")
y_test = pd.read_csv("./data_splits/y_test.csv")

#check out the data
train_df.head()

Unnamed: 0,source_id,gender,age,bmiclass,MAP,cholesterol,gluc,smoke,alco,active,cardio
0,17997,1,5,2,3,1,1,0,0,1,0
1,24871,1,4,1,4,1,1,0,0,1,1
2,19175,1,5,1,3,1,1,0,0,1,1
3,3585,1,4,2,3,1,1,1,1,1,0
4,13478,1,3,1,5,1,1,1,1,1,1


In [2]:
#drop source_id
train_df = train_df.drop(columns=["source_id"])

In [3]:
TARGET = "cardio"
SENSITIVE = "gender"   # 1 = Male, 0 = Female


# Identify feature types
binary_cols = ["gender", "smoke", "alco", "active"]
categorical_cols = ["age", "bmiclass", "MAP", "cholesterol", "gluc"]

X_train = train_df.drop(columns=[TARGET])
y_train = train_df[TARGET]

In [4]:
#ONE-HOT ENCODE CATEGORICALS; KEEP BINARY COLUMNS AS-IS

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# 1) fit encoder on TRAIN categoricals only
ohe = OneHotEncoder(handle_unknown="ignore", drop="if_binary", sparse_output=False)
ohe.fit(X_train[categorical_cols])

# 2) transform TRAIN and TEST
X_train_cat = pd.DataFrame(
    ohe.transform(X_train[categorical_cols]),
    columns=ohe.get_feature_names_out(categorical_cols),
    index=X_train.index
)
X_test_cat = pd.DataFrame(
    ohe.transform(X_test[categorical_cols]),
    columns=ohe.get_feature_names_out(categorical_cols),
    index=X_test.index
)

# 3) concatenate: encoded categoricals + binary columns (passed through unchanged)
X_train_ready = pd.concat([X_train_cat, X_train[binary_cols]], axis=1)
X_test_ready  = pd.concat([X_test_cat,  X_test[binary_cols]],  axis=1)

print("Final feature shapes:", X_train_ready.shape, X_test_ready.shape)

Final feature shapes: (30000, 28) (11256, 28)


### Traditional ML Models - Baseline: K-Nearest Neighbors (KNN) & Decision Tree (DT)

In [5]:
#import required libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)

# helper: print accuracy, precision, recall, F1, the classification report and the confusion matrix
def evaluate_model(y_true, y_pred, model_name):
    print(f"=== {model_name} Evaluation ===")
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred, average='binary'))
    print("Recall   :", recall_score(y_true, y_pred, average='binary'))
    print("F1 Score :", f1_score(y_true, y_pred, average='binary'))
    print("\nClassification Report:\n", classification_report(y_true, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
    print("\n" + "="*40 + "\n")

In [6]:
# KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_ready, y_train)

y_pred_knn = knn.predict(X_test_ready)
y_prob_knn = knn.predict_proba(X_test_ready)[:, 1] 

evaluate_model(y_test, y_pred_knn, "KNN")


# Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train_ready, y_train)

y_pred_dt = dt.predict(X_test_ready)
y_prob_dt = dt.predict_proba(X_test_ready)[:, 1]  
evaluate_model(y_test, y_pred_dt, "Decision Tree")

=== KNN Evaluation ===
Accuracy : 0.6773276474769012
Precision: 0.6840530940362685
Recall   : 0.6532762006784503
F1 Score : 0.6683105022831051

Classification Report:
               precision    recall  f1-score   support

           0       0.67      0.70      0.69      5655
           1       0.68      0.65      0.67      5601

    accuracy                           0.68     11256
   macro avg       0.68      0.68      0.68     11256
weighted avg       0.68      0.68      0.68     11256

Confusion Matrix:
 [[3965 1690]
 [1942 3659]]


=== Decision Tree Evaluation ===
Accuracy : 0.6956289978678039
Precision: 0.7148784825133373
Recall   : 0.6459560792715586
F1 Score : 0.678671918964547

Classification Report:
               precision    recall  f1-score   support

           0       0.68      0.74      0.71      5655
           1       0.71      0.65      0.68      5601

    accuracy                           0.70     11256
   macro avg       0.70      0.70      0.69     11256
weighted

## KNN Model Evaluation

### Overall Metrics
- **Accuracy**: 67.7%  
- **Precision**: 68.4%  
- **Recall**: 65.3%  
- **F1 Score**: 66.8%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 3965         | 1690         |
| **Actual: 1** | 1942         | 3659         |

### Interpretation
- The model achieves **balanced but modest performance**, with accuracy below 70%.  
- **Precision (68.4%)** and **recall (65.3%)** are close, suggesting no strong bias toward precision or sensitivity.  
- The confusion matrix shows a substantial number of errors:  
  - **1690 false positives** (healthy classified as CVD).  
  - **1942 false negatives** (CVD cases missed).  
- Overall, this KNN struggles to capture patterns effectively, leading to many misclassifications.  
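
As a quick check on how the headline metrics follow from the confusion matrix, the counts above can be plugged into the standard definitions. This is a small worked example, not part of the original notebook; the four counts are copied from the baseline KNN confusion matrix.

```python
# Counts copied from the baseline KNN confusion matrix above.
tn, fp = 3965, 1690   # actual 0: correctly rejected, false alarms
fn, tp = 1942, 3659   # actual 1: missed cases, correctly detected

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)            # share of predicted CVD that is real CVD
recall = tp / (tp + fn)               # share of real CVD that is detected
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

The printed values (0.677, 0.684, 0.653, 0.668) agree with the `evaluate_model` output for the baseline KNN.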

---

## Decision Tree Model Evaluation

### Overall Metrics
- **Accuracy**: 69.6%  
- **Precision**: 71.5%  
- **Recall**: 64.6%  
- **F1 Score**: 67.9%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 4212         | 1443         |
| **Actual: 1** | 1983         | 3618         |

### Interpretation
- The Decision Tree achieves **slightly higher accuracy (69.6%)** than KNN.  
- **Precision is stronger (71.5%)**, indicating positive predictions are more reliable, but **recall remains modest (64.6%)**, with many CVD cases still missed.  
- Errors are distributed as:  
  - **1443 false positives** (fewer than KNN).  
  - **1983 false negatives** (similar to KNN).  
- This model is slightly more effective than KNN, with better precision and fewer false positives, but still limited in recall.  

---

### Conclusion
Both models provide **moderate predictive power**, with accuracies below 70%.  
- **KNN** shows balanced but weaker performance, with many misclassifications.  
- **Decision Tree** performs better, offering **higher precision and fewer false alarms**, though recall remains limited.  

---

### KNN Improvement
The code improves the KNN model by running a **grid search** over three hyperparameters (`n_neighbors`, `weights`, and `metric`), scored by F1 under 5-fold stratified cross-validation, to find the best-performing configuration. It then evaluates the selected model on the test set and stores the positive-class probabilities, so that **decision threshold tuning** to boost recall (which is critical in medical prediction tasks) can be explored afterwards.

In [7]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score, precision_score, f1_score
import numpy as np

# 1) Hyperparameter tuning for KNN 
param_grid = {
    "n_neighbors": list(range(1, 31)),
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan", "minkowski"],  # minkowski with p=2 is euclidean
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid=param_grid,
    cv=cv,
    scoring="f1",        
    n_jobs=-1,
    verbose=0,
    refit=True
)

# Fit 
grid.fit(X_train_ready, y_train)

print("Best KNN params:", grid.best_params_)
print("Best CV F1:", grid.best_score_)

best_knn = grid.best_estimator_

# 2) Evaluate best KNN on TEST 
y_pred_knn_best = best_knn.predict(X_test_ready)
y_prob_knn_best = best_knn.predict_proba(X_test_ready)[:, 1]   

evaluate_model(y_test, y_pred_knn_best, "KNN (best params)")

Best KNN params: {'metric': 'euclidean', 'n_neighbors': 29, 'weights': 'uniform'}
Best CV F1: 0.6960630156544079
=== KNN (best params) Evaluation ===
Accuracy : 0.7006929637526652
Precision: 0.6962715441435103
Recall   : 0.7068380646313158
F1 Score : 0.7015150172765128

Classification Report:
               precision    recall  f1-score   support

           0       0.71      0.69      0.70      5655
           1       0.70      0.71      0.70      5601

    accuracy                           0.70     11256
   macro avg       0.70      0.70      0.70     11256
weighted avg       0.70      0.70      0.70     11256

Confusion Matrix:
 [[3928 1727]
 [1642 3959]]




## KNN (Best Params) Model Evaluation

### Best Parameters
- **Metric**: Euclidean  
- **Neighbors**: 29  
- **Weights**: Uniform  
- **Best CV F1**: 0.696  

### Overall Metrics
- **Accuracy**: 70.1%  
- **Precision**: 69.6%  
- **Recall**: 70.7%  
- **F1 Score**: 70.2%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 3928         | 1727         |
| **Actual: 1** | 1642         | 3959         |

### Interpretation
- This tuned KNN achieves **balanced performance**, with accuracy and F1 score just above 70%.  
- **Precision (69.6%)** and **recall (70.7%)** are closely aligned, meaning the model does not strongly favor either avoiding false alarms or maximizing sensitivity.  
- **False negatives** (CVD cases missed): **1642**, improved compared to the baseline KNN.  
- **False positives** (healthy misclassified as CVD): **1727**, slightly higher than the baseline.  
- Compared to the untuned version, this model provides **a more even trade-off** between precision and recall, at the cost of more false positives.  

Overall, the tuned KNN delivers **modest but stable performance**, making it a more reliable option than the baseline, though recall and precision remain limited.  
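
Since the positive-class probabilities are available, the precision/recall trade-off could also be shifted by moving the decision threshold away from 0.5. The sketch below illustrates the idea on synthetic stand-in data (an assumption, not the CVD dataset); the threshold values are arbitrary, and in practice the threshold would be chosen on a validation split rather than on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): not the CVD dataset.
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=29).fit(X_tr, y_tr)
probs = knn.predict_proba(X_val)[:, 1]   # probability of the positive class

rows = []
for threshold in (0.3, 0.4, 0.5, 0.6):
    # Predict positive whenever the probability reaches the threshold.
    y_hat = (probs >= threshold).astype(int)
    rows.append((
        threshold,
        recall_score(y_val, y_hat),
        precision_score(y_val, y_hat, zero_division=0),
    ))

for threshold, rec, prec in rows:
    print(f"threshold={threshold:.1f} recall={rec:.3f} precision={prec:.3f}")
```

Lowering the threshold can only keep or increase recall, because every case flagged at a higher threshold is still flagged at a lower one; the cost usually shows up as lower precision.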

---

### Further KNN Improvement - Implementing PCA 

In [8]:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import numpy as np

# 1) PCA + KNN pipeline 
pca_knn = Pipeline([
    ('pca', PCA(n_components=0.95, random_state=42)),  # keep 95% variance
    ('knn', KNeighborsClassifier(
        n_neighbors=15, metric='manhattan', weights='distance'
    ))
])

pca_knn.fit(X_train_ready, y_train)

# Inspect PCA details
n_comp = pca_knn.named_steps['pca'].n_components_
expl_var = pca_knn.named_steps['pca'].explained_variance_ratio_.sum()
print(f"PCA components: {n_comp} | Explained variance retained: {expl_var:.3f}")

#2) Evaluate 
y_pred_pca_knn = pca_knn.predict(X_test_ready)
probs_pca_knn = pca_knn.predict_proba(X_test_ready)[:, 1]

evaluate_model(y_test, y_pred_pca_knn, "PCA+KNN")

PCA components: 16 | Explained variance retained: 0.950
=== PCA+KNN Evaluation ===
Accuracy : 0.6916311300639659
Precision: 0.7017809776430466
Recall   : 0.6613104802713801
F1 Score : 0.6809449397922603

Classification Report:
               precision    recall  f1-score   support

           0       0.68      0.72      0.70      5655
           1       0.70      0.66      0.68      5601

    accuracy                           0.69     11256
   macro avg       0.69      0.69      0.69     11256
weighted avg       0.69      0.69      0.69     11256

Confusion Matrix:
 [[4081 1574]
 [1897 3704]]




## KNN Model Versions – Interpretation

### Baseline KNN
- **Accuracy**: 67.7%  
- **Precision**: 68.4%  
- **Recall**: 65.3%  
- **F1 Score**: 66.8%  
- **Confusion Matrix**:  
  - False Positives: 1,690  
  - False Negatives: 1,942  

**Interpretation**:  
The baseline KNN shows **modest performance**, with accuracy below 70%. Recall (65.3%) is limited, meaning nearly 2,000 CVD cases are missed. Errors are distributed fairly evenly, suggesting the model struggles to separate classes clearly.  

---

### Tuned KNN (Best Params)
- **Parameters**: Euclidean distance, 29 neighbors, uniform weights  
- **Accuracy**: 70.1%  
- **Precision**: 69.6%  
- **Recall**: 70.7%  
- **F1 Score**: 70.2%  
- **Confusion Matrix**:  
  - False Positives: 1,727  
  - False Negatives: 1,642  

**Interpretation**:  
After tuning, the KNN improves in both **accuracy and recall**, achieving a more balanced performance. The number of missed CVD cases drops from 1,942 to 1,642, although false positives increase slightly (from 1,690 to 1,727). This configuration offers a better trade-off between sensitivity and reliability than the baseline.  

---

### PCA + KNN
- **PCA Components**: 16 (95% variance retained)  
- **Accuracy**: 69.2%  
- **Precision**: 70.2%  
- **Recall**: 66.1%  
- **F1 Score**: 68.1%  
- **Confusion Matrix**:  
  - False Positives: 1,574  
  - False Negatives: 1,897  

**Interpretation**:  
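Using PCA reduces the feature space from 28 columns to 16 components while retaining 95% of the variance, but recall is **lower (66.1%)** than in the tuned version (70.7%). While false positives decrease (1,574 vs 1,727), false negatives increase (1,897 vs 1,642 missed CVD cases). This makes the model more conservative but less sensitive to true CVD patients. Note that this pipeline uses different KNN settings (15 neighbors, Manhattan distance, distance weighting) from the tuned model, so the difference cannot be attributed to PCA alone.  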
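
To make the "95% variance retained" setting concrete, the number of components chosen by `PCA(n_components=0.95)` can be reproduced from the cumulative explained-variance ratio. This is a minimal sketch on synthetic stand-in data (an assumption; 28 columns are used only to mirror the notebook's feature count), with `svd_solver="full"` fixed so both fits use the same decomposition.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in data (assumption): 28 columns to mirror the feature count.
X, _ = make_classification(
    n_samples=1000, n_features=28, n_informative=10, random_state=42
)

# Fit all components, then find the smallest count whose cumulative
# explained-variance ratio exceeds 0.95.
full_pca = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
n_manual = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Let PCA make the same choice from the fractional n_components.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)
print(n_manual, pca_95.n_components_)
```

The two numbers should coincide, which is the sense in which 16 components were selected for the CVD features above.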

---

### Overall Summary
- **Baseline KNN**: Limited predictive ability, balanced but weak across all metrics.  
- **Tuned KNN**: Best performing version, with **higher recall and accuracy**, making it the most balanced option.  
- **PCA + KNN**: Reduces dimensionality effectively, but at the cost of **lower recall**, leading to more missed CVD cases, though fewer false positives.  

The **Tuned KNN** is the most suitable version, as it minimizes missed CVD cases while keeping precision stable.  

---

In [9]:
#saving best performing KNN Model for fairness evaluation
import joblib, pandas as pd, numpy as np

# Ensure y_test is a Series (not a DataFrame)
if isinstance(y_test, pd.DataFrame):
    y_test = y_test.squeeze("columns")

# Save model
model_filename = "knn_tuned_model.pkl"
joblib.dump(best_knn, model_filename)

# Ensure 1D arrays
y_true = y_test.to_numpy() if hasattr(y_test, "to_numpy") else np.asarray(y_test)
y_pred_knn = y_pred_knn_best
y_prob_knn = y_prob_knn_best

# Optional gender column if present
if isinstance(X_test, pd.DataFrame) and "gender" in X_test.columns:
    gender_vals = X_test["gender"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results
results = pd.DataFrame({
    "gender": gender_vals,
    "y_true": y_true,
    "y_prob": y_prob_knn,
    "y_pred": y_pred_knn
})

preds_filename = "CVDKaggleData_50F50M__tunedKNN_predictions.csv"
results.to_csv(preds_filename, index=False)

print(f"Saved tuned KNN model → {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved tuned KNN model → knn_tuned_model.pkl
Saved predictions → CVDKaggleData_50F50M__tunedKNN_predictions.csv


### Improvement - Decision Tree (DT)

In [10]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# 1) Base model
dt = DecisionTreeClassifier(random_state=42)

# 2) Hyperparameter grid 
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, 9, None],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 6, 10],
}

# 3) Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 4) Grid search 
grid_dt = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=cv,
    scoring="recall",
    n_jobs=-1,
    verbose=0
)

grid_dt.fit(X_train_ready, y_train)

print("Best Decision Tree params:", grid_dt.best_params_)
print("Best CV Recall:", grid_dt.best_score_)  # the search above uses scoring="recall"

# 5) Train & evaluate best DT
best_dt = grid_dt.best_estimator_
y_pred_dt_best = best_dt.predict(X_test_ready)
y_prob_dt_best = best_dt.predict_proba(X_test_ready)[:, 1]

evaluate_model(y_test, y_pred_dt_best, "Tuned Decision Tree")

Best Decision Tree params: {'criterion': 'entropy', 'max_depth': 7, 'min_samples_leaf': 1, 'min_samples_split': 5}
Best CV Recall: 0.6532
=== Tuned Decision Tree Evaluation ===
Accuracy : 0.7121535181236673
Precision: 0.7422532320952185
Recall   : 0.6457775397250491
F1 Score : 0.6906625930876455

Classification Report:
               precision    recall  f1-score   support

           0       0.69      0.78      0.73      5655
           1       0.74      0.65      0.69      5601

    accuracy                           0.71     11256
   macro avg       0.72      0.71      0.71     11256
weighted avg       0.72      0.71      0.71     11256

Confusion Matrix:
 [[4399 1256]
 [1984 3617]]




## Tuned Decision Tree Model Evaluation

### Best Parameters
- **Criterion**: Entropy  
- **Max Depth**: 7  
- **Min Samples Split**: 5  
- **Min Samples Leaf**: 1  
- **Best CV Recall**: 0.6532 (the grid search was scored on recall, not F1)  

### Overall Metrics
- **Accuracy**: 71.2%  
- **Precision**: 74.2%  
- **Recall**: 64.6%  
- **F1 Score**: 69.1%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 4399         | 1256         |
| **Actual: 1** | 1984         | 3617         |

### Interpretation
- The tuned Decision Tree achieves **accuracy of 71.2%**, a modest improvement over the baseline Decision Tree (69.6%).  
- **Precision is fairly high (74.2%)**, meaning predicted CVD cases are usually correct.  
- **Recall (64.6%)** is weaker, with **1,984 missed CVD patients (false negatives)**, limiting sensitivity.  
- Non-CVD patients are detected well (**78% correctly identified**), with **1,256 false positives**.  
- The model thus leans toward **precision over recall**, being more conservative in identifying positives but less effective at capturing all true CVD cases.  

➡️ Overall, this tuned Decision Tree provides **moderate performance**, but the relatively low recall makes it less suited for medical screening tasks where sensitivity is critical.  
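
Because the search selects on one metric while the discussion looks at several, it can help to record recall, precision and F1 in the same cross-validation run. The sketch below uses `GridSearchCV`'s multi-metric scoring on synthetic stand-in data (an assumption, not the CVD dataset) with a deliberately tiny grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): not the CVD dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, 7]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    # Track several metrics; refit names the one used to pick the winner.
    scoring={"recall": "recall", "precision": "precision", "f1": "f1"},
    refit="recall",
)
search.fit(X, y)

best = search.best_index_
for metric in ("recall", "precision", "f1"):
    score = search.cv_results_[f"mean_test_{metric}"][best]
    print(f"CV {metric}: {score:.4f}")
```

With `refit="recall"`, `best_score_` is the mean cross-validated recall of the selected candidate, so printed labels can be matched to the metric that was actually optimized.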

---

In [11]:
# Alternative DT tuning: simpler trees + class balancing + cost-complexity pruning
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.metrics import recall_score, precision_score, f1_score
import numpy as np

# Stage A: bias toward simpler trees with class_weight="balanced"
base_dt = DecisionTreeClassifier(random_state=42, class_weight="balanced")

param_grid_simple = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 4, 5, 6, 7],
    "min_samples_split": [5, 10, 20],
    "min_samples_leaf": [2, 4, 6],
    "min_impurity_decrease": [0.0, 1e-4, 1e-3],  # tiny regularization
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid_simple = GridSearchCV(
    estimator=base_dt,
    param_grid=param_grid_simple,
    cv=cv,
    scoring="recall",        # recall-focused search
    n_jobs=-1,
    verbose=0,
    refit=True
)
grid_simple.fit(X_train_ready, y_train)

print("Stage A — Best simple DT params:", grid_simple.best_params_)
print("Stage A — Best CV Recall:", grid_simple.best_score_)
simple_dt = grid_simple.best_estimator_

# Stage B: cost-complexity pruning on the best simple DT
path = simple_dt.cost_complexity_pruning_path(X_train_ready, y_train)
ccp_alphas = path.ccp_alphas

unique_alphas = np.unique(np.round(ccp_alphas, 6))
candidate_alphas = np.linspace(unique_alphas.min(), unique_alphas.max(), num=min(20, len(unique_alphas)))
candidate_alphas = np.unique(np.concatenate([candidate_alphas, [0.0]]))  # include no-pruning baseline

cv_scores = []
for alpha in candidate_alphas:
    dt_alpha = DecisionTreeClassifier(
        random_state=42,
        class_weight="balanced",
        criterion=simple_dt.criterion,
        max_depth=simple_dt.max_depth,
        min_samples_split=simple_dt.min_samples_split,
        min_samples_leaf=simple_dt.min_samples_leaf,
        min_impurity_decrease=simple_dt.min_impurity_decrease,
        ccp_alpha=alpha
    )
    # recall-focused CV
    recall_cv = cross_val_score(dt_alpha, X_train_ready, y_train, cv=cv, scoring="recall", n_jobs=-1).mean()
    cv_scores.append((alpha, recall_cv))

best_alpha, best_cv_recall = sorted(cv_scores, key=lambda x: x[1], reverse=True)[0]
print(f"Stage B — Best ccp_alpha: {best_alpha:.6f} | CV Recall: {best_cv_recall:.4f}")

# Final model fit with the chosen ccp_alpha
best_dt = DecisionTreeClassifier(
    random_state=42,
    class_weight="balanced",
    criterion=simple_dt.criterion,
    max_depth=simple_dt.max_depth,
    min_samples_split=simple_dt.min_samples_split,
    min_samples_leaf=simple_dt.min_samples_leaf,
    min_impurity_decrease=simple_dt.min_impurity_decrease,
    ccp_alpha=best_alpha
).fit(X_train_ready, y_train)

# Evaluation
y_pred_dt = best_dt.predict(X_test_ready)
y_prob_dt = best_dt.predict_proba(X_test_ready)[:, 1]   

evaluate_model(y_test, y_pred_dt, "Alternative Tuned & Pruned DT")

Stage A — Best simple DT params: {'criterion': 'gini', 'max_depth': 4, 'min_impurity_decrease': 0.0001, 'min_samples_leaf': 2, 'min_samples_split': 5}
Stage A — Best CV Recall: 0.6849333333333332
Stage B — Best ccp_alpha: 0.000000 | CV Recall: 0.6849
=== Alternative Tuned & Pruned DT Evaluation ===
Accuracy : 0.7133084577114428
Precision: 0.7193274205469328
Recall   : 0.6950544545616855
F1 Score : 0.7069826568600744

Classification Report:
               precision    recall  f1-score   support

           0       0.71      0.73      0.72      5655
           1       0.72      0.70      0.71      5601

    accuracy                           0.71     11256
   macro avg       0.71      0.71      0.71     11256
weighted avg       0.71      0.71      0.71     11256

Confusion Matrix:
 [[4136 1519]
 [1708 3893]]




## Decision Tree Model Comparison

| Model                                | Accuracy | Precision | Recall | F1 Score | Key Notes |
|--------------------------------------|----------|-----------|--------|----------|-----------|
| **Baseline DT**                      | 0.696    | 0.715     | 0.646  | 0.679    | Starting point; 1,983 false negatives and 1,443 false positives. |
| **Tuned DT (max_depth=7, entropy)**  | 0.712    | 0.742     | 0.646  | 0.691    | Improved precision and accuracy, but recall remains low (1,984 false negatives). |
| **Alt. Tuned & Pruned DT (gini, depth=4)** | 0.713 | 0.719     | 0.695  | 0.707    | More balanced trade-off; fewer false negatives (1,708), but slightly higher false positives (1,519). |

---

### Interpretation

- **Baseline DT**: Provides a reasonable baseline with accuracy of 69.6%. Recall (64.6%) is limited, resulting in nearly 2,000 missed CVD cases. Errors skew toward false negatives (1,983) over false positives (1,443).  

- **Tuned DT (entropy, depth=7)**: Achieves the highest **precision (74.2%)** and improves accuracy to 71.2%. However, recall does not improve (64.6%), still missing ~2,000 CVD cases. This makes it stronger for correct positive predictions but weaker for sensitivity.  

- **Alt. Tuned & Pruned DT (gini, depth=4)**: Achieves a **better balance** between precision (71.9%) and recall (69.5%). This reduces missed CVD cases (1,708 false negatives, fewer than both baseline and tuned DT). While false positives rise slightly (1,519), the overall F1 score (70.7%) is the best among the three.  

---

### Conclusion
- **Tuned DT (entropy)** is more precision-oriented, giving more reliable positive predictions but missing many CVD cases.  
- **Alt. Tuned & Pruned DT** strikes the best balance, offering **higher recall and the strongest F1 score**, making it the most suitable choice for applications where minimizing missed CVD patients is important.  
- The **Baseline DT** is adequate but outperformed by both tuned variants.  

**Preferred option**: The **Alt. Tuned & Pruned DT** because it improves recall without sacrificing too much precision, leading to the **lowest number of missed CVD cases** while keeping accuracy stable.  
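
Stage B selected `ccp_alpha = 0`, meaning no additional pruning on top of the depth limit. To show what the pruning path offers in general, the sketch below refits a shallow tree at each candidate alpha and reports the resulting tree size. It runs on synthetic stand-in data (an assumption, not the CVD dataset); the clipping step is a defensive guard against tiny negative alphas from floating-point error.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): not the CVD dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

base = DecisionTreeClassifier(max_depth=4, random_state=42)
path = base.cost_complexity_pruning_path(X, y)

node_counts = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)   # guard against tiny negative values
    tree = DecisionTreeClassifier(
        max_depth=4, random_state=42, ccp_alpha=alpha
    ).fit(X, y)
    node_counts.append(tree.tree_.node_count)
    print(f"ccp_alpha={alpha:.6f} nodes={tree.tree_.node_count}")
```

Larger alphas prune more aggressively, so the node count never grows along the path; whether a smaller tree also generalizes better has to be checked with cross-validation, as in Stage B.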

---

In [12]:
import joblib, pandas as pd, numpy as np

# Save tuned Decision Tree model
model_filename = "tunedpruned_dt_model.pkl"
joblib.dump(best_dt, model_filename)

# Ensure 1D arrays
y_true = y_test.to_numpy() if hasattr(y_test, "to_numpy") else np.asarray(y_test)
# Use tuned predictions/probabilities from the best estimator
y_pred_dt = best_dt.predict(X_test_ready)
y_prob_dt = best_dt.predict_proba(X_test_ready)[:, 1]   

# Optional gender column if present
if isinstance(X_test, pd.DataFrame) and "gender" in X_test.columns:
    gender_vals = X_test["gender"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results
results = pd.DataFrame({
    "gender": gender_vals,
    "y_true": y_true,
    "y_pred_dt": y_pred_dt,
    "y_prob": y_prob_dt
})

preds_filename = "CVDKaggleData_50F50M_DT_tunedpruned_predictions.csv"
results.to_csv(preds_filename, index=False)

print(f"Saved tuned and pruned DT model → {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved tuned and pruned DT model → tunedpruned_dt_model.pkl
Saved predictions → CVDKaggleData_50F50M_DT_tunedpruned_predictions.csv


### Ensemble Model - Random Forest (RF)

In [13]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest
rf = RandomForestClassifier(random_state=42)

# Train the model
rf.fit(X_train_ready, y_train)

# Predict on test set
y_pred_rf = rf.predict(X_test_ready)
evaluate_model(y_test, y_pred_rf, "Random Forest")

=== Random Forest Evaluation ===
Accuracy : 0.701137171286425
Precision: 0.7164699051674086
Recall   : 0.660953401178361
F1 Score : 0.687592867756315

Classification Report:
               precision    recall  f1-score   support

           0       0.69      0.74      0.71      5655
           1       0.72      0.66      0.69      5601

    accuracy                           0.70     11256
   macro avg       0.70      0.70      0.70     11256
weighted avg       0.70      0.70      0.70     11256

Confusion Matrix:
 [[4190 1465]
 [1899 3702]]




## Random Forest Model Evaluation

### Overall Metrics
- **Accuracy**: 70.1%  
- **Precision**: 71.6%  
- **Recall**: 66.1%  
- **F1 Score**: 68.8%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 4190         | 1465         |
| **Actual: 1** | 1899         | 3702         |

### Interpretation
- The model achieves **moderate accuracy (70.1%)**, in line with previous decision tree-based models.  
- **Precision (71.6%)** indicates that predicted CVD cases are fairly reliable.  
- **Recall (66.1%)** shows the model still misses a notable portion of CVD cases (**1,899 false negatives**).  
- For non-CVD cases, **4190 are correctly identified**, but there are **1465 false positives**, where healthy patients were incorrectly flagged as CVD.  
- The **F1 score (68.8%)** reflects a fair but not strong balance between precision and recall.  

➡️ Overall, this Random Forest provides **stable but modest predictive power**. It reduces variance compared to single decision trees but still struggles with sensitivity, meaning a considerable number of CVD patients remain undetected.  
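
The variance reduction mentioned above comes from averaging many trees. In scikit-learn, the forest's predicted probability is the mean of the per-tree probabilities, which the following sketch checks on synthetic stand-in data (an assumption, not the CVD dataset; the 25 trees are an arbitrary choice).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): not the CVD dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
rf = RandomForestClassifier(n_estimators=25, random_state=42).fit(X, y)

# Stack per-tree class probabilities: shape (n_trees, n_samples, n_classes).
tree_probs = np.stack([tree.predict_proba(X) for tree in rf.estimators_])
mean_probs = tree_probs.mean(axis=0)

forest_matches_mean = np.allclose(mean_probs, rf.predict_proba(X))
print("forest equals mean of trees:", forest_matches_mean)

# Average disagreement between trees on the positive-class probability.
print("mean per-sample std across trees:", tree_probs[:, :, 1].std(axis=0).mean())
```

Individual trees disagree with one another, and the averaged prediction smooths that disagreement out; it does not by itself raise recall, which matches the modest gains reported above.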

---

### Improvement - Random Forest (RF)

In [14]:
from sklearn.experimental import enable_halving_search_cv  # must be first
from sklearn.model_selection import HalvingGridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

Xtr = getattr(X_train_ready, "values", X_train_ready).astype("float32")
Xte = getattr(X_test_ready, "values", X_test_ready).astype("float32")

rf = RandomForestClassifier(random_state=42, n_jobs=1, bootstrap=True)

# Do NOT include 'n_estimators' in param_grid; Halving will vary it as the resource.
param_grid = {
    "max_depth": [None, 12],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt"],
    "class_weight": ["balanced"],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

search = HalvingGridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    resource="n_estimators",   # use trees as the resource
    min_resources=100,         # start small
    max_resources=400,         # end at your intended max
    factor=3,
    cv=cv,
    scoring="recall",
    n_jobs=-1,
    verbose=1,
    refit=True,
    return_train_score=False,
)

search.fit(Xtr, y_train)
best_rf = search.best_estimator_
y_pred_rf = best_rf.predict(Xte)
y_prob_rf = best_rf.predict_proba(Xte)[:, 1]
evaluate_model(y_test, y_pred_rf, "Random Forest (best, halving over n_estimators)")



n_iterations: 2
n_required_iterations: 2
n_possible_iterations: 2
min_resources_: 100
max_resources_: 400
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 8
n_resources: 100
Fitting 3 folds for each of 8 candidates, totalling 24 fits
----------
iter: 1
n_candidates: 3
n_resources: 300
Fitting 3 folds for each of 3 candidates, totalling 9 fits
=== Random Forest (best, halving over n_estimators) Evaluation ===
Accuracy : 0.7092217484008528
Precision: 0.7220526516596719
Recall   : 0.6757721835386538
F1 Score : 0.6981462694826155

Classification Report:
               precision    recall  f1-score   support

           0       0.70      0.74      0.72      5655
           1       0.72      0.68      0.70      5601

    accuracy                           0.71     11256
   macro avg       0.71      0.71      0.71     11256
weighted avg       0.71      0.71      0.71     11256

Confusion Matrix:
 [[4198 1457]
 [1816 3785]]




## Random Forest Model Comparison

| Model                           | Accuracy | Precision | Recall | F1 Score | Key Notes |
|---------------------------------|----------|-----------|--------|----------|-----------|
| **Baseline RF**                 | 0.701    | 0.716     | 0.661  | 0.688    | Solid baseline; 1,899 false negatives, 1,465 false positives. |
| **Tuned RF (halving search)**   | 0.709    | 0.722     | 0.676  | 0.698    | Slightly higher accuracy, recall, and F1; fewer false negatives (1,816). |

---

### Interpretation
- **Baseline RF** achieves balanced performance with ~70% accuracy. It performs reasonably on both classes but still misses **1,899 CVD cases** (false negatives).  
- **Tuned RF (halving over n_estimators)** improves performance slightly across all metrics:  
  - Accuracy increases to **70.9%**.  
  - Recall improves to **67.6%**, reducing missed CVD cases to **1,816**.  
  - Precision rises slightly to **72.2%**, with a small drop in false positives (1,457 vs. 1,465).  
- The F1 score also improves from **68.8% → 69.8%**, showing a better balance between precision and recall.  

---

### Conclusion
The **Tuned RF (halving search)** is the stronger variant, as it improves recall and overall balance while keeping precision stable. This leads to **fewer missed CVD cases**, which is especially valuable in medical screening contexts, even if the performance gains over the baseline are modest.  
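
The schedule in the verbose log above (8 candidates at 100 trees, then 3 candidates at 300 trees) can be reproduced with a few lines of arithmetic. The following is a simplified sketch of successive halving without aggressive elimination, not scikit-learn's actual implementation; `halving_schedule` is a hypothetical helper name.

```python
import math

def halving_schedule(n_candidates, min_resources, max_resources, factor):
    """Return (n_candidates, n_resources) per iteration of successive halving."""
    # Iterations needed to narrow the candidates down to at most `factor`.
    n_required = 1 + math.floor(math.log(n_candidates, factor))
    # Iterations that fit before the resource budget is exhausted.
    n_possible = 1 + math.floor(math.log(max_resources // min_resources, factor))
    schedule = []
    for itr in range(min(n_required, n_possible)):
        n_resources = min(factor**itr * min_resources, max_resources)
        schedule.append((n_candidates, n_resources))
        # Keep roughly the best 1/factor of the candidates for the next round.
        n_candidates = math.ceil(n_candidates / factor)
    return schedule

# 2 * 2 * 2 * 1 * 1 = 8 parameter combinations in the grid above.
schedule = halving_schedule(n_candidates=8, min_resources=100,
                            max_resources=400, factor=3)
print(schedule)
```

With `factor=3`, a third round would need 900 trees, which exceeds `max_resources=400`, so the last round in the log runs at 300 trees rather than at the cap.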

---

In [15]:
# Save Tuned Random Forest Results

# Save tuned Random Forest model
model_filename = "tuned_rf_model.pkl"
joblib.dump(best_rf, model_filename)

# Ensure 1D arrays for y_true and y_pred
y_true = y_test.to_numpy() if hasattr(y_test, "to_numpy") else np.asarray(y_test)
y_pred = y_pred_rf  # from best_rf.predict(Xte)
y_prob = y_prob_rf

# Optional gender column if present in test set
if isinstance(X_test, pd.DataFrame) and "gender" in X_test.columns:
    gender_vals = X_test["gender"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results DataFrame
results = pd.DataFrame({
    "gender": gender_vals,
    "y_true": y_true,
    "y_pred_rf": y_pred,
    "y_prob": y_prob
})

preds_filename = "CVDKaggleData_50M50F_RF_tuned_predictions.csv"
results.to_csv(preds_filename, index=False)

print(f"Saved tuned RF model → {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved tuned RF model → tuned_rf_model.pkl
Saved predictions → CVDKaggleData_50M50F_RF_tuned_predictions.csv


### Deep Learning - Multi-layer Perceptron

In [16]:
#import required library 
from sklearn.neural_network import MLPClassifier

In [17]:
# Initialize MLP model
mlp = MLPClassifier(
    hidden_layer_sizes=(100,),   # one hidden layer with 100 neurons
    activation='relu',           # or 'tanh'
    solver='adam',               # optimizer
    max_iter=1000,                # increase if convergence warning appears
    random_state=42
)

# Train the model
mlp.fit(X_train_ready, y_train)

# Predict
y_pred_mlp = mlp.predict(X_test_ready)

evaluate_model(y_test, y_pred_mlp, "Multilayer Perceptron (MLP)")

=== Multilayer Perceptron (MLP) Evaluation ===
Accuracy : 0.7083333333333334
Precision: 0.7230561970746728
Recall   : 0.6707730762363864
F1 Score : 0.6959340557562286

Classification Report:
               precision    recall  f1-score   support

           0       0.70      0.75      0.72      5655
           1       0.72      0.67      0.70      5601

    accuracy                           0.71     11256
   macro avg       0.71      0.71      0.71     11256
weighted avg       0.71      0.71      0.71     11256

Confusion Matrix:
 [[4216 1439]
 [1844 3757]]




## Multilayer Perceptron (MLP) Model Evaluation

### Overall Metrics
- **Accuracy**: 70.8%  
- **Precision**: 72.3%  
- **Recall**: 67.1%  
- **F1 Score**: 69.6%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 4216         | 1439         |
| **Actual: 1** | 1844         | 3757         |

### Interpretation
- The MLP reaches **70.8% accuracy**, comparable to tree-based models on this dataset.  
- **Precision (72.3%)** indicates that most predicted CVD cases are correct.  
- **Recall (67.1%)** shows that sensitivity is still limited, with **1,844 CVD cases missed** (false negatives).  
- For non-CVD patients, 4,216 are correctly identified, though **1,439 false positives** remain.  
- With an **F1 score of 69.6%**, the model maintains a fair balance between precision and recall but does not strongly outperform simpler models.  

➡️ Overall, this MLP provides **moderate performance with balanced trade-offs**, but its recall remains too low for critical medical screening where minimizing missed CVD cases is a priority.  
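
The reported metrics follow directly from the four confusion-matrix counts. As a quick sanity check, the sketch below recomputes them; the helper name `metrics_from_confusion` is hypothetical and not part of the notebook, and the counts are copied from the matrix above.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # share of predicted CVD cases that are real
    recall = tp / (tp + fn)      # share of real CVD cases that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts copied from the baseline MLP confusion matrix: [[TN, FP], [FN, TP]]
baseline_mlp = metrics_from_confusion(tn=4216, fp=1439, fn=1844, tp=3757)
for name, value in baseline_mlp.items():
    print(f"{name:<9}: {value:.3f}")
```

The printed values round to 0.708, 0.723, 0.671 and 0.696, matching the evaluation output.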

---

### Improvements - MLP

In [18]:
#Adam + Early Stopping 
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix

adammlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),   # slightly smaller/deeper can help
    activation='relu',
    solver='adam',
    learning_rate_init=1e-3,       # Adam step size (scikit-learn's default)
    alpha=1e-3,                    # L2 penalty, 10x the default 1e-4, to reduce overfitting
    batch_size=32,
    max_iter=1000,                 # upper bound on epochs; early stopping usually ends sooner
    early_stopping=True,           # use a validation split internally
    validation_fraction=0.15,
    n_iter_no_change=25,          
    tol=1e-4,
    random_state=42
)

adammlp.fit(X_train_ready, y_train)  
y_pred_mlp = adammlp.predict(X_test_ready)                     
y_prob_mlp = adammlp.predict_proba(X_test_ready)[:, 1]         

evaluate_model(y_test, y_pred_mlp, "(Adam + EarlyStopping)")

=== (Adam + EarlyStopping) Evaluation ===
Accuracy : 0.7149964463397299
Precision: 0.7269967748055397
Recall   : 0.6841635422246027
F1 Score : 0.7049300956585725

Classification Report:
               precision    recall  f1-score   support

           0       0.70      0.75      0.72      5655
           1       0.73      0.68      0.70      5601

    accuracy                           0.71     11256
   macro avg       0.72      0.71      0.71     11256
weighted avg       0.72      0.71      0.71     11256

Confusion Matrix:
 [[4216 1439]
 [1769 3832]]




## Multilayer Perceptron (MLP – Adam + EarlyStopping) Evaluation

### Overall Metrics
- **Accuracy**: 71.5%  
- **Precision**: 72.7%  
- **Recall**: 68.4%  
- **F1 Score**: 70.5%  

### Confusion Matrix
|               | Predicted: 0 | Predicted: 1 |
|---------------|--------------|--------------|
| **Actual: 0** | 4216         | 1439         |
| **Actual: 1** | 1769         | 3832         |

### Interpretation
- This model achieves **accuracy of 71.5%**, slightly stronger than the baseline MLP.  
- **Precision (72.7%)** remains high, meaning most predicted CVD cases are correct.  
- **Recall improves to 68.4%**, reducing false negatives to **1,769** (compared to 1,844 in the baseline MLP).  
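- Non-CVD detection remains stable: 4,216 correct, with **1,439 false positives**.  
- The **F1 score (70.5%)** reflects this better balance. Because this run also changes the architecture, L2 penalty, and batch size relative to the baseline MLP, the gain cannot be attributed to early stopping alone.  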

➡️ Overall, the **Adam + EarlyStopping MLP** offers **more stable and balanced performance** than the plain MLP, with fewer missed CVD cases and stronger generalization, making it the preferable neural network configuration.  
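
One way to see what early stopping actually did is to inspect the attributes scikit-learn stores on the fitted estimator. The sketch below is illustrative only: it uses synthetic stand-in data (an assumption) so it runs on its own, and the names `X_demo`, `y_demo` and `demo_mlp` are hypothetical. In the notebook, the same attributes can be read from `adammlp`.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption): replace with X_train_ready / y_train
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=42)

demo_mlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    alpha=1e-3,
    batch_size=32,
    max_iter=1000,
    early_stopping=True,        # holds out part of the training data for validation
    validation_fraction=0.15,
    n_iter_no_change=25,
    random_state=42,
)
demo_mlp.fit(X_demo, y_demo)

# n_iter_ well below max_iter indicates that early stopping ended training
print("Epochs run:", demo_mlp.n_iter_)
print("Best validation accuracy:", round(demo_mlp.best_validation_score_, 4))
print("Last 3 validation scores:", demo_mlp.validation_scores_[-3:])
```

If `n_iter_` equals `max_iter`, training hit the epoch cap instead, and the early-stopping settings had no effect on that run.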

---

### Further Improvement MLP 

In [19]:
# Faster search: early stopping + single-metric scoring + lighter CV
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import fbeta_score, make_scorer
import numpy as np

# Keep arrays lean
Xtr = getattr(X_train_ready, "values", X_train_ready).astype("float32")
Xte = getattr(X_test_ready, "values", X_test_ready).astype("float32")

base_mlp = MLPClassifier(
    solver="adam",
    early_stopping=True,          # <- stop per-config when val score plateaus
    validation_fraction=0.1,
    n_iter_no_change=10,
    max_iter=200,                 # <- much lower; early_stopping will bail sooner
    tol=1e-4,
    random_state=42
)

# Trim the space; focus on what usually matters
param_dist = {
    "hidden_layer_sizes": [(64,), (128,), (64, 32)],
    "activation": ["relu"],       # 'relu' typically dominates for tabular MLPs
    "alpha": [1e-5, 1e-4, 3e-4, 1e-3],
    "learning_rate_init": [1e-3, 5e-4, 3e-4],
    "batch_size": [32, 64, 128],  # larger batch -> faster per-epoch
}

# Fewer folds cut search time, at the cost of noisier CV estimates
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Single scorer matching the objective: F-beta with beta=2 weights recall above precision
fbeta2 = make_scorer(fbeta_score, beta=2)

rs = RandomizedSearchCV(
    estimator=base_mlp,
    param_distributions=param_dist,
    n_iter=20,                    # fewer trials; early stop handles over-training
    scoring=fbeta2,               # <- single metric speeds everything up
    refit=True,                   # refit on full train with best params
    cv=cv,
    n_jobs=-1,
    verbose=1,
    random_state=42
)

rs.fit(Xtr, y_train)
best_mlp = rs.best_estimator_

print("Best MLP params:", rs.best_params_)
print(f"Best CV F-beta (β=2): {rs.best_score_:.4f}")

y_pred = best_mlp.predict(Xte)
y_prob = best_mlp.predict_proba(Xte)[:, 1]
evaluate_model(y_test, y_pred, model_name="Best MLP (Adam + ES, fast)")

Fitting 3 folds for each of 20 candidates, totalling 60 fits
Best MLP params: {'learning_rate_init': 0.001, 'hidden_layer_sizes': (64,), 'batch_size': 32, 'alpha': 1e-05, 'activation': 'relu'}
Best CV F-beta (β=2): 0.6899
=== Best MLP (Adam + ES, fast) Evaluation ===
Accuracy : 0.7160625444207533
Precision: 0.7228915662650602
Recall   : 0.6963042313872523
F1 Score : 0.7093488541287741

Classification Report:
               precision    recall  f1-score   support

           0       0.71      0.74      0.72      5655
           1       0.72      0.70      0.71      5601

    accuracy                           0.72     11256
   macro avg       0.72      0.72      0.72     11256
weighted avg       0.72      0.72      0.72     11256

Confusion Matrix:
 [[4160 1495]
 [1701 3900]]
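
Before trusting the single "best" configuration, it can help to look at how close the other sampled candidates were. The sketch below shows one way to rank them from `cv_results_`. It runs a tiny search on synthetic stand-in data (an assumption) so it is self-contained; `X_demo`, `y_demo`, `demo_search` and `ranking` are hypothetical names. In the notebook, the same DataFrame can be built from the fitted `rs`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption); in the notebook, use the fitted `rs` directly
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

demo_search = RandomizedSearchCV(
    estimator=MLPClassifier(early_stopping=True, max_iter=200, random_state=42),
    param_distributions={
        "hidden_layer_sizes": [(16,), (32,)],
        "alpha": [1e-5, 1e-3],
    },
    n_iter=4,
    scoring=make_scorer(fbeta_score, beta=2),
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
)
demo_search.fit(X_demo, y_demo)

# One row per sampled configuration, best-ranked first
ranking = (
    pd.DataFrame(demo_search.cv_results_)
    [["rank_test_score", "mean_test_score", "std_test_score", "params"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(ranking)
```

When the top few `mean_test_score` values differ by less than their `std_test_score`, the choice among them is within cross-validation noise.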




## Multilayer Perceptron (MLP) Model Comparison

| Model                               | Accuracy | Precision | Recall | F1 Score | Key Notes |
|-------------------------------------|----------|-----------|--------|----------|-----------|
| **Baseline MLP**                    | 0.708    | 0.723     | 0.671  | 0.696    | Balanced, but misses 1,844 CVD cases; 1,439 false positives. |
| **MLP (Adam + EarlyStopping)**      | 0.715    | 0.727     | 0.684  | 0.705    | Better recall; 1,769 false negatives (reduced) while keeping 1,439 false positives. |
| **Best Tuned MLP (Adam + ES, fast)**| 0.716    | 0.723     | 0.696  | 0.709    | Strongest recall (69.6%); misses only 1,701 CVD cases; slightly higher false positives (1,495). |

---

### Interpretation
- **Baseline MLP**: Provides a fair balance (accuracy 70.8%), but recall (67.1%) is limited, leaving **1,844 missed CVD patients**. Precision is decent at 72.3%, so most positive predictions are correct.  

- **MLP (Adam + EarlyStopping)**: Improves recall to **68.4%**, reducing false negatives to **1,769**. Accuracy also rises slightly (71.5%). False positives stay at 1,439, so the recall gain comes at no cost in precision.  

- **Best Tuned MLP (Adam + ES, fast)**: Achieves the **best recall (69.6%)** and F1 score (70.9%). False negatives are further reduced to **1,701**, though false positives increase slightly (1,495). Accuracy (71.6%) and precision (72.3%) remain strong, making this the most effective version overall.  
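
The search optimized F-beta with β=2, while the table reports F1. As a rough cross-check, the sketch below computes F2 for all three models from their test-set confusion matrices; the helper `fbeta_from_counts` is a hypothetical name, and the counts are copied from the outputs above.

```python
def fbeta_from_counts(tp, fp, fn, beta=2.0):
    """F-beta from confusion-matrix counts; beta > 1 weights recall more heavily."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# (TP, FP, FN) copied from the three test-set confusion matrices above
models = {
    "Baseline MLP": (3757, 1439, 1844),
    "MLP (Adam + EarlyStopping)": (3832, 1439, 1769),
    "Best Tuned MLP": (3900, 1495, 1701),
}
f2_scores = {name: fbeta_from_counts(tp, fp, fn) for name, (tp, fp, fn) in models.items()}
for name, score in f2_scores.items():
    print(f"{name:<28} F2 = {score:.3f}")
```

The resulting F2 values are about 0.681, 0.692 and 0.701, so the ranking is the same under the recall-weighted metric as under F1.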

---

### Conclusion
- **Baseline MLP** is adequate but limited by lower recall.  
- **MLP (Adam + EarlyStopping)** offers a clear improvement, balancing recall and precision more effectively.  
- **Best Tuned MLP** is the **strongest performer**, achieving the **highest recall and F1 score**, meaning it misses fewer CVD cases while maintaining reliability in predictions.  

➡️ For CVD detection, where a missed case is the costlier error, the **Best Tuned MLP (Adam + ES, fast)** is the preferred choice.  

---

In [20]:
# Save Tuned MLP Results
import joblib, pandas as pd, numpy as np

# Save MLP model
model_filename = "mlp_adamtuned.pkl"
joblib.dump(best_mlp, model_filename)

# Ensure 1D arrays: ravel() flattens an (n, 1) column and is a no-op on 1D input
y_true = np.asarray(y_test).ravel()
y_pred = best_mlp.predict(Xte)
y_prob = best_mlp.predict_proba(Xte)[:, 1]

# Optional gender column if present in test set
if isinstance(X_test, pd.DataFrame) and "gender" in X_test.columns:
    gender_vals = X_test["gender"].to_numpy()
else:
    gender_vals = np.full(shape=len(y_true), fill_value=np.nan)

# Build and save results DataFrame
results = pd.DataFrame({
    "gender": gender_vals,
    "y_true": y_true,
    "y_pred": y_pred,
    "y_prob": y_prob
})

preds_filename = "CVDKaggleData_50M50F_MLP_adamtuned_predictions.csv"
results.to_csv(preds_filename, index=False)

print(f"Saved Adam tuned MLP model → {model_filename}")
print(f"Saved predictions → {preds_filename}")

Saved Adam tuned MLP model → mlp_adamtuned.pkl
Saved predictions → CVDKaggleData_50M50F_MLP_adamtuned_predictions.csv
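
Because the saved file keeps `gender` next to the labels and predictions, it can feed a per-group comparison later. The sketch below is one possible way to do that, not the notebook's own analysis: it uses a tiny made-up frame with the same column names (illustrative values only, not real results), and `preds` and `recall_by_gender` are hypothetical names. With the real data, `preds` would come from `pd.read_csv(preds_filename)`.

```python
import pandas as pd
from sklearn.metrics import recall_score

# Tiny made-up frame with the same columns as the saved predictions file
# (illustrative values only, not real results); gender: 1 = Male, 0 = Female
preds = pd.DataFrame({
    "gender": [1, 1, 1, 1, 0, 0, 0, 0],
    "y_true": [1, 1, 0, 0, 1, 1, 1, 0],
    "y_pred": [1, 0, 0, 1, 1, 1, 0, 0],
})

# Recall (share of true CVD cases detected) computed separately per gender
recall_by_gender = {
    gender: recall_score(group["y_true"], group["y_pred"])
    for gender, group in preds.groupby("gender")
}
print(recall_by_gender)
```

A large gap between the two recall values would mean the model misses CVD cases more often in one group than in the other.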
