## Phase 3: Cross-Validated Baseline Modeling and Comprehensive Error Analysis — Logistic Regression & Random Forest


#### Categorical Feature Handling Before Baseline Modeling

During exploratory analysis, I found that only two columns in the dataset—`age_group` and `bmi_class`—were categorical segment features. In initial modeling (Logistic Regression and Random Forest), I dropped these columns for the following reasons:

- Both features were engineered as categorical "bins" from continuous variables (Age, BMI).
- Preliminary tests and domain context suggested these categories added little discriminative value and could introduce redundancy or multicollinearity.
- Tree-based models and Logistic Regression performed adequately on the main continuous features and core predictors.
- Retaining a clean, fully numerical feature set simplifies comparison across models and ensures consistency for SVM or other algorithms requiring numeric input.

For SVM and advanced model workflows, I continue to drop these segment categorical features to maintain a consistent, reproducible modeling pipeline and maximize performance on purely numerical variables.

*Note: If future analysis or model interpretability suggests value in categorical segmentation, these features can be reintroduced using appropriate encoding techniques (e.g., one-hot encoding).*



### Baseline Model 1: Logistic Regression — Cross-Validated Performance & Error Analysis


In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import (
    confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay,
    precision_score, recall_score, f1_score, accuracy_score, roc_auc_score
)
import matplotlib.pyplot as plt
import numpy as np

# Data prep
X_logreg = df_scaled2.drop('Outcome', axis=1).select_dtypes(include=[np.number])
y_logreg = df_scaled2['Outcome']

logreg = LogisticRegression(max_iter=500, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validated predictions and probabilities
y_pred_logreg = cross_val_predict(logreg, X_logreg, y_logreg, cv=cv)
y_prob_logreg = cross_val_predict(logreg, X_logreg, y_logreg, cv=cv, method='predict_proba')[:, 1]

# Confusion matrix
cm_logreg = confusion_matrix(y_logreg, y_pred_logreg)
ConfusionMatrixDisplay(cm_logreg).plot(cmap='Blues')
plt.title('Logistic Regression Confusion Matrix (5-Fold Cross-Validated)')
plt.show()

# ROC curve
RocCurveDisplay.from_predictions(y_logreg, y_prob_logreg)
plt.title('Logistic Regression ROC Curve (5-Fold Cross-Validated)')
plt.show()

# Metrics
precision = precision_score(y_logreg, y_pred_logreg)
recall = recall_score(y_logreg, y_pred_logreg)
f1 = f1_score(y_logreg, y_pred_logreg)
accuracy = accuracy_score(y_logreg, y_pred_logreg)
roc_auc = roc_auc_score(y_logreg, y_prob_logreg)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
print(f"ROC AUC:   {roc_auc:.3f}")


NameError: name 'df_scaled2' is not defined

### Results Interpreted from Logistic Regression

The Logistic Regression model was evaluated using 5-fold cross-validation with multiple metrics:

- **ROC AUC (0.833):**
  - The model's overall ability to distinguish between positive and negative classes is quite good, as shown by the high ROC AUC. This indicates strong general separability, but ROC AUC does not directly reflect real-world tradeoffs between recall and precision at a given threshold.
- **Recall (Sensitivity, 0.567):**
  - The model is only able to correctly identify about 57% of the actual disease cases (true positives). In other words, 43% of positive cases are missed (false negatives). This is particularly concerning for disease diagnosis, where the cost of missing a patient with a condition is typically high.
- **Precision (0.682):**
  - Among all cases predicted to have the disease, nearly 68% are correct. This tells us that positive predictions are reasonably reliable, but there are still a significant number of false positives.
- **F1-score (0.619):**
  - This metric provides a balance between precision and recall, but as both are moderate, the F1-score is only 0.62—reflecting the current model's struggle to maximize both sensitivity and specificity.
- **Accuracy (0.757):**
  - While the accuracy appears quite strong, it is not reliable for imbalanced datasets, and should not be used as a primary evaluation metric in this scenario.
- **Confusion Matrix:**
  - The confusion matrix reveals 116 false negatives (missed real cases). In a medical context, these can have serious or even fatal consequences. There are also 71 false positives, meaning some healthy individuals are misclassified as diseased.
- **Summary:**  
  - **Key Concern:** The moderate recall and number of false negatives mean many individuals who actually have the disease might go undetected if this model is used as-is.
  - **Clinical Implication:** For disease screening, missing a positive is much worse than a false alarm. Therefore, recall should be maximized, and the current results suggest more improvement is required.

---

### Next: Random Forest Comparison

To better understand the strengths and weaknesses of our baseline, the next step is to:
- **Apply the same cross-validated workflow to a Random Forest classifier.**
- Compare the ROC curve, recall, precision, F1-score, and confusion matrix with those of logistic regression to evaluate whether Random Forest can better capture the minority class (disease cases) and improve recall.

---

### After Comparing Models

Regardless of which baseline performs slightly better, both could benefit from model tuning and imbalance strategies:
1. **Threshold Tuning:** Adjust the probability threshold downward (from 0.5 to e.g. 0.4 or lower) and review the effect on recall and precision using a precision-recall curve.
2. **Class Imbalance Solutions:** Try class weighting (`class_weight='balanced'`), SMOTE/oversampling the minority class, or undersampling the majority.
3. **Additional Modeling:** Explore ensemble techniques or other classifiers (XGBoost, LightGBM) and tune hyperparameters.
4. **Feature Engineering:** Investigate new features, interactions, or non-linear transformations to better capture signal in positive cases.
5. **Reporting and Interpretation:** Document both default and recall-optimized performance, including clinical risks and tradeoffs, for well-informed decision making.

---

_This process provides maximum transparency and sets the stage for robust, high-sensitivity disease prediction._


### Baseline Model 2: Random Forest — Cross-Validated Performance & Error Analysis

In [None]:
from sklearn.metrics import (
    confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay,
    precision_score, recall_score, f1_score, accuracy_score, roc_auc_score
)### Baseline Model 2: Random Forest — Cross-Validated Performance & Error Analysis
from sklearn.model_selection import cross_val_predict
import matplotlib.pyplot as plt
X_rf = df_scaled2.drop('Outcome', axis=1).select_dtypes(include=[np.number])
y_rf = df_scaled2['Outcome']

# Random Forest predictions and probabilities (cross-validated)
y_pred_rf = cross_val_predict(rf, X_rf, y_rf, cv=cv)
y_prob_rf = cross_val_predict(rf, X_rf, y_rf, cv=cv, method='predict_proba')[:, 1]

# Confusion matrix
cm_rf = confusion_matrix(y_rf, y_pred_rf)
ConfusionMatrixDisplay(cm_rf).plot(cmap='Blues')
plt.title(f'Random Forest Confusion Matrix ({cv}-Fold Cross-Validated)')
plt.show()

# ROC curve
RocCurveDisplay.from_predictions(y_rf, y_prob_rf)
plt.title(f'Random Forest ROC Curve ({cv}-Fold Cross-Validated)')
plt.show()

# Precision, recall, F1-score, accuracy, ROC AUC
precision = precision_score(y_rf, y_pred_rf)
recall = recall_score(y_rf, y_pred_rf)
f1 = f1_score(y_rf, y_pred_rf)
accuracy = accuracy_score(y_rf, y_pred_rf)
roc_auc = roc_auc_score(y_rf, y_prob_rf)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
print(f"ROC AUC:   {roc_auc:.3f}")




### Baseline Model 2: Random Forest — Cross-Validated Performance & Error Analysis

**Random Forest Model Results (5-fold Cross-Validation):**
- **ROC AUC:** 0.815
- **Recall (Sensitivity):** 0.597
- **Precision:** 0.675
- **F1-Score:** 0.634
- **Accuracy:** 0.759

#### Detailed Interpretation:
- **ROC Curve and AUC (0.815):**  
  The Random Forest demonstrates strong global discriminative power, only slightly lower than Logistic Regression. This shows it is effective at distinguishing positive and negative classes, but recall is still a limiting factor for clinical screening.

- **Recall (0.597):**  
  The recall, or sensitivity, is 0.597—higher than Logistic Regression. The model correctly detects approximately 60% of disease cases. This means about 40% of positives are missed (false negatives). While this is an improvement over Logistic Regression, it is still a key limitation in a disease diagnosis context.

- **Precision (0.675):**  
  Out of all the cases predicted as positive, about 67.5% are indeed diseased. Random Forest achieves slightly less precision than Logistic Regression, reflecting a tradeoff between catching more positives and increasing false alarms.

- **F1-score (0.634):**  
  Slightly higher than Logistic Regression, the F1-score balances precision and recall, indicating a marginally improved ability to identify positives while managing false positives.

- **Accuracy (0.759):**  
  Very similar to Logistic Regression, showing overall prediction alignment on this imbalanced data, but not the primary metric of interest.

- **Confusion Matrix:**  
  Random Forest produces 108 false negatives—fewer than Logistic Regression (116), so it misses fewer cases. It also predicts more true positives (160 vs. 152 for Logistic), again emphasizing its better sensitivity.

---

## Model Comparison

| Metric        | Logistic Regression | Random Forest   |
|---------------|--------------------|----------------|
| **ROC AUC**   | 0.833              | 0.815          |
| **Recall**    | 0.567              | 0.597          |
| **Precision** | 0.682              | 0.675          |
| **F1-score**  | 0.619              | 0.634          |
| **Accuracy**  | 0.757              | 0.759          |
| **False Neg.**| 116                | 108            |
| **True Pos.** | 152                | 160            |

- **Takeaways:**  
  - **Random Forest** gives higher recall and F1-score than Logistic Regression, making it safer for clinical use where missing a positive is costly.
  - **Logistic Regression** produces slightly higher precision and ROC AUC.
  - Both models leave room for improvement, especially in recall.

---

## Next Steps

1. **Threshold Tuning for Higher Recall**
    - Lower the classification threshold for Random Forest (and Logistic Regression if desired), plotting precision-recall curves and selecting a threshold that delivers the desired sensitivity for the positive class.

2. **Class Imbalance Handling**
    - Use `class_weight='balanced'` in both models, or apply SMOTE/oversampling of the minority class and compare results.
    - Try under-sampling the majority class as another balancing method.

3. **Advanced Modeling**
    - Try ensemble models like XGBoost or LightGBM, or stack multiple classifiers for better robustness.
    - Tune hyperparameters for both Random Forest and Logistic Regression to optimize recall.

4. **Feature Engineering**
    - Investigate adding or transforming features to highlight signals present in positive cases.

5. **Reporting and Clinical Review**
    - Summarize recall-optimized results with confusion matrices and risk tradeoffs.
    - Collaborate with clinical stakeholders to determine the optimal recall/precision tradeoff for practical deployment.

---

_By pursuing these steps, you'll move from baselines to a recall-optimized, clinically defensible screening model._


### Step 1: Threshold Tuning for Maximum Recall


In [None]:
import numpy as np
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay, precision_recall_curve
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
# Example if your DataFrame is df_scaled2 and target is 'Outcome'
X_rf = df_scaled2.drop('Outcome', axis=1).select_dtypes(include=[np.number])
y_rf = df_scaled2['Outcome']


# Predict probabilities (already from cross_val_predict if desired, else fit and predict_proba)
y_prob_rf = cross_val_predict(rf, X_rf, y_rf, cv=cv, method='predict_proba')[:, 1]

# Sweep thresholds and collect metrics
thresholds = np.arange(0.1, 0.9, 0.01)
recalls, precisions, fscores, accuracies = [], [], [], []

for thr in thresholds:
    y_pred_thr = (y_prob_rf >= thr).astype(int)
    recalls.append(recall_score(y_rf, y_pred_thr))
    precisions.append(precision_score(y_rf, y_pred_thr))
    fscores.append(f1_score(y_rf, y_pred_thr))
    accuracies.append(accuracy_score(y_rf, y_pred_thr))

# Plot recall and precision vs threshold
import matplotlib.pyplot as plt
plt.figure(figsize=(10,5))
plt.plot(thresholds, recalls, label='Recall', marker='o')
plt.plot(thresholds, precisions, label='Precision', marker='x')
plt.plot(thresholds, fscores, label='F1-score', marker='^')
plt.plot(thresholds, accuracies, label='Accuracy', marker='s')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Metric Curves vs. Threshold (Random Forest)')
plt.legend()
plt.show()

# You may choose a threshold with higher recall (even if precision drops slightly)
best_idx = np.argmax(recalls)
print(f'Best Recall: {recalls[best_idx]:.3f} at Threshold: {thresholds[best_idx]:.2f} (Precision: {precisions[best_idx]:.3f}, F1: {fscores[best_idx]:.3f}, Accuracy: {accuracies[best_idx]:.3f})')


#### 1. What We Interpreted (from Random Forest threshold tuning)

- Lowering the decision threshold rapidly increases recall (sensitivity) but at the cost of reducing precision and overall accuracy.
- At threshold = 0.10 for Random Forest:
  - **Recall reached 0.974** (very high: almost all true cases flagged)
  - **Precision dropped to 0.44**, and accuracy to ~0.56 (most flagged cases were false alarms, a result of aggressive recall)
- **Takeaway:**  
   - This tradeoff may be justified for early disease screening (where missing positives is unacceptable), but creates workload/cost/psychological burden for unnecessary follow-up testing.
   - For diagnosis (not just screening), this model and threshold would be too imprecise.

---

#### 2. Threshold Tuning & Plotting for Logistic Regression

Now, we apply the same threshold tuning approach to **Logistic Regression**. The goal:  
- Visualize how recall, precision, F1, and accuracy change as the threshold is swept.
- Decide what tradeoff you can reasonably accept.

Place this code in your notebook:



In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score
import numpy as np
import matplotlib.pyplot as plt

# Fit and get probabilities for logistic regression (make sure X_logreg, y_logreg, and cv are correctly defined)
logreg = LogisticRegression(max_iter=500, random_state=42)
y_prob_logreg = cross_val_predict(logreg, X_logreg, y_logreg, cv=cv, method='predict_proba')[:, 1]

# Sweep thresholds and collect metrics
thresholds = np.arange(0.1, 0.9, 0.01)
recalls, precisions, fscores, accuracies = [], [], [], []

for thr in thresholds:
    y_pred_thr = (y_prob_logreg >= thr).astype(int)
    recalls.append(recall_score(y_logreg, y_pred_thr))
    precisions.append(precision_score(y_logreg, y_pred_thr))
    fscores.append(f1_score(y_logreg, y_pred_thr))
    accuracies.append(accuracy_score(y_logreg, y_pred_thr))

plt.figure(figsize=(10,5))
plt.plot(thresholds, recalls, label='Recall', marker='o')
plt.plot(thresholds, precisions, label='Precision', marker='x')
plt.plot(thresholds, fscores, label='F1-score', marker='^')
plt.plot(thresholds, accuracies, label='Accuracy', marker='s')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Metric Curves vs. Threshold (Logistic Regression)')
plt.legend()
plt.show()

best_idx = np.argmax(recalls)
print(f'Best Recall: {recalls[best_idx]:.3f} at Threshold: {thresholds[best_idx]:.2f} (Precision: {precisions[best_idx]:.3f}, F1: {fscores[best_idx]:.3f}, Accuracy: {accuracies[best_idx]:.3f})')


### Logistic Regression: Threshold Tuning Analysis

#### 1. Metric Curves vs. Threshold

- **Best Recall: 0.974 at Threshold: 0.10**  
  (Precision: 0.453, F1: 0.618, Accuracy: 0.581)

#### 2. Interpretation

- Lowering the threshold to 0.10 causes:
    - **Recall** (sensitivity) to reach 0.974, meaning the model successfully identifies nearly all true positive (disease) cases.
    - **Precision** drops to 0.453: less than fifty percent of flagged positives are true positives, producing many false alarms.
    - **F1-score** is moderate (0.618). This reflects the tradeoff: you’re catching almost all cases, but with many false alarms.
    - **Accuracy** also drops—most likely due to class imbalance and the high number of false positives.
- The shape of the curves is similar to Random Forest: **as threshold decreases, recall increases but precision and accuracy decrease.**
- **Takeaway:** This setting maximizes sensitivity for use as a screening tool. However, just as with Random Forest, too many false alarms might burden the clinical workflow, leading to unnecessary follow-ups.

---

#### 3. What This Means

- **Good Use:**  
    - **Early Screening or Triage:** When the cost of missing a true case is very high and downstream confirmation is feasible.
- **Potential Drawback:**  
    - **Diagnostic Phase:** If positive cases flagged by the model always required an expensive or invasive test, precision this low could cause too many unnecessary procedures.
    - **Resource Management:** High false positive rate could overwhelm limited medical resources.

---

#### 4. Recommended Next Steps

- **Clinically Informed Threshold Selection:**  
   - Rather than using the "best recall" threshold by default, discuss with clinicians to identify a tradeoff that keeps recall high but improves precision/accuracy where possible (e.g., perhaps 0.3–0.5).
- **Balance Techniques:**  
   - Try SMOTE or class weighting to bring up recall **and** precision together. 
- **Model Comparison:**  
   - Compare these curves and tradeoffs to those from Random Forest. If the shapes are very similar, further model/feature work may be needed to get both high recall and improved precision.
- **Experiment with advanced models:**  
   - Use XGBoost, LightGBM, CatBoost, or ensemble approaches to try to shift the tradeoff.
- **Stakeholder Review:**  
   - Present clear curves, confusion matrices, and impact for threshold choices to non-technical decision makers.

---

**Summary:**  
Just like Random Forest, your logistic regression model can be made extremely sensitive but non-specific. This is useful for “ruling out” in first-line screening, but you need a follow-up plan for resolving false alarms and ensuring resources are used intelligently.


#### Selecting a Clinically Sensible Threshold Based on Recall Target(Logistic regression)


In [None]:
# Suppose you want to choose the threshold where recall >= 0.90 (edit value if you want slightly more/less recall)
desired_recall = 0.90

# Find the index where recall just crosses this threshold
recall_idxs = np.where(np.array(recalls) >= desired_recall)[0]

if len(recall_idxs) > 0:
    idx = recall_idxs[-1]  # The highest threshold for this recall (i.e., safest)
    selected_thr = thresholds[idx]
    print(f"Selected Threshold: {selected_thr:.2f}")
    print(f"Recall:    {recalls[idx]:.3f}")
    print(f"Precision: {precisions[idx]:.3f}")
    print(f"F1-score:  {fscores[idx]:.3f}")
    print(f"Accuracy:  {accuracies[idx]:.3f}")
else:
    print("No threshold achieves the desired recall target. Try lowering the target.")

# If you want to see which threshold gives little loss of precision for a still good recall, you can also sweep for higher F1.


In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Use the same probabilities and data as before
selected_thr = 0.18
y_pred_selected = (y_prob_logreg >= selected_thr).astype(int)

cm = confusion_matrix(y_logreg, y_pred_selected)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap='Blues')
plt.title(f"Logistic Regression Confusion Matrix (Threshold = {selected_thr})")
plt.show()

# Optional: Print underlying confusion matrix values
print("Confusion Matrix (rows=true, cols=pred):\n", cm)

#### Logistic Regression Confusion Matrix at Selected Clinical Threshold (Threshold = 0.18)

|                | Predicted Negative | Predicted Positive |
|----------------|-------------------|-------------------|
| **True Negative** | 270               | 230               |
| **True Positive** | 22                | 246               |

- **True Positive (TP):** 246 — Actual positives correctly identified.
- **False Positive (FP):** 230 — Actual negatives incorrectly flagged as positives (false alarms).
- **True Negative (TN):** 270 — Actual negatives correctly identified.
- **False Negative (FN):** 22 — Actual positives missed by the model.

---

**Interpretation:**
- **Recall** remains very high: only **22 actual positive cases are missed**.
- **Precision** is moderate, reflecting that a little more than half your positive predictions are correct; many are false alarms.
- There is a significant number of **false positives (230)**, which is the cost of achieving high recall.

**Clinical Implication:**  
At this threshold, your model is appropriate for scenarios where **missing a real case is unacceptable** (screening), and the healthcare/clinical system can handle false positives for further testing or triage.

---

**Next suggestion**:  
- Repeat this same confusion matrix analysis for **Random Forest** at its chosen recall-optimized threshold, or
- Begin with class balancing (SMOTE/class weights) for further improvement.



### Selecting a Clinically Sensible Threshold Based on Recall Target (Random Forest)

In [None]:
desired_recall_rf = 0.90  # or your chosen high value

recall_idxs_rf = np.where(np.array(recalls) >= desired_recall_rf)[0]

if len(recall_idxs_rf) > 0:
    idx_rf = recall_idxs_rf[-1]  # The highest threshold for this recall
    selected_thr_rf = thresholds[idx_rf]
    print(f"Selected Threshold (Random Forest): {selected_thr_rf:.2f}")
    print(f"Recall:    {recalls[idx_rf]:.3f}")
    print(f"Precision: {precisions[idx_rf]:.3f}")
    print(f"F1-score:  {fscores[idx_rf]:.3f}")
    print(f"Accuracy:  {accuracies[idx_rf]:.3f}")
else:
    print("No threshold achieves the desired recall target for Random Forest. Try lowering the target.")


In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Use the same predicted probabilities and true labels as before (for Random Forest)
y_pred_rf_selected = (y_prob_rf >= 0.18).astype(int)

cm_rf = confusion_matrix(y_rf, y_pred_rf_selected)
disp_rf = ConfusionMatrixDisplay(confusion_matrix=cm_rf)
disp_rf.plot(cmap='Blues')
plt.title("Random Forest Confusion Matrix (Threshold = 0.18)")
plt.show()

# Optional: Print raw counts for summary
print("Random Forest Confusion Matrix (rows=true, cols=pred):\n", cm_rf)


#### Random Forest vs Logistic Regression at Clinically-Informed Threshold (Threshold = 0.18)

##### **Confusion Matrices**

| Model                | True Neg (TN) | False Pos (FP) | False Neg (FN) | True Pos (TP) |
|----------------------|:-------------:|:--------------:|:--------------:|:-------------:|
| Logistic Regression  |     270       |      230       |      22        |     246       |
| Random Forest        |     242       |      258       |      25        |     243       |

##### **Metrics**

| Model                | Recall | Precision | F1-score | Accuracy |
|----------------------|:------:|:---------:|:--------:|:--------:|
| Logistic Regression  | 0.918  |   0.517   |  0.661   |  0.672   |
| Random Forest        | 0.907  |   0.485   |  0.632   |  0.628   |

---

##### **Interpretation & Comparison**

- **Recall:** Both models achieve very high recall, slightly higher with Logistic Regression (0.918) than Random Forest (0.907), meaning both miss very few actual positive cases.
- **Precision:** Both models have moderate precision, with Logistic being modestly higher—just over half of positive predictions are correct.
- **False Positives:** Random Forest produces more false positives (258) than Logistic Regression (230), so Logistic Regression will generate fewer unnecessary alarms for the same recall level.
- **F1-score & Accuracy:** Both metrics are slightly higher for Logistic Regression. Both models, however, still have a large number of false positives as the cost for high recall.
- **Clinical Use:** Both thresholds are appropriate for first-stage screening if the system can handle the follow-up demand for false positives. Logistic Regression may be preferred if minimizing false alarms is a top concern, but the difference is modest.

---

##### **Next Steps**

- Consider class balancing (SMOTE, class weights) to improve precision.
- Explore advanced models to shift the tradeoff curve.
- Present this side-by-side summary to clinical or managerial stakeholders for final workflow decision.

---


### Step: Class Balancing to Improve Recall-Precision Tradeoff

**Objective:**  
To reduce the number of false positives while maintaining high recall by handling class imbalance using:
- 1. `class_weight='balanced'` in model initialization (LogisticRegression, RandomForestClassifier)
- 2. Synthetic Minority Over-sampling Technique (SMOTE)

---

#### 1. Using Class Weight




---

#### 2. Using SMOTE (Synthetic Oversampling)




*Train new models on these resampled datasets using the same pipeline as before.*

---

#### Next Actions

1. Repeat your **metric curve plotting and threshold selection** process for these rebalanced models.
2. Compare confusion matrices and metrics to previous results.
3. Interpret: *Does class balancing increase precision while keeping recall high? Does it significantly change F1-score and accuracy?*



### 1) Using Class Weight

### Logistic Regression with Class Weight

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Model with class_weight
logreg_bal = LogisticRegression(max_iter=500, class_weight='balanced', random_state=42)
y_prob_logreg_bal = cross_val_predict(logreg_bal, X_logreg, y_logreg, cv=cv, method='predict_proba')[:, 1]


### Metric Curve Plotting for Balanced Logistic Regression

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score

thresholds = np.arange(0.1, 0.9, 0.01)
recalls_bal, precisions_bal, fscores_bal, accuracies_bal = [], [], [], []

for thr in thresholds:
    y_pred_thr = (y_prob_logreg_bal >= thr).astype(int)
    recalls_bal.append(recall_score(y_logreg, y_pred_thr))
    precisions_bal.append(precision_score(y_logreg, y_pred_thr))
    fscores_bal.append(f1_score(y_logreg, y_pred_thr))
    accuracies_bal.append(accuracy_score(y_logreg, y_pred_thr))

plt.figure(figsize=(10,5))
plt.plot(thresholds, recalls_bal, label='Recall', marker='o')
plt.plot(thresholds, precisions_bal, label='Precision', marker='x')
plt.plot(thresholds, fscores_bal, label='F1-score', marker='^')
plt.plot(thresholds, accuracies_bal, label='Accuracy', marker='s')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Metric Curves vs. Threshold (LogReg with class_weight="balanced")')
plt.legend()
plt.show()


### Selecting Sensible Threshold and Confusion Matrix

In [None]:
# Select threshold for recall >= 0.90 (or what you prefer)
desired_recall = 0.90
recall_idxs_bal = np.where(np.array(recalls_bal) >= desired_recall)[0]

if len(recall_idxs_bal) > 0:
    idx_bal = recall_idxs_bal[-1]
    selected_thr_bal = thresholds[idx_bal]
    print(f"Selected Threshold: {selected_thr_bal:.2f}")
    print(f"Recall:    {recalls_bal[idx_bal]:.3f}")
    print(f"Precision: {precisions_bal[idx_bal]:.3f}")
    print(f"F1-score:  {fscores_bal[idx_bal]:.3f}")
    print(f"Accuracy:  {accuracies_bal[idx_bal]:.3f}")
else:
    print("No threshold achieves the desired recall target.")

# Show confusion matrix at this threshold
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
y_pred_bal = (y_prob_logreg_bal >= selected_thr_bal).astype(int)
cm_bal = confusion_matrix(y_logreg, y_pred_bal)
disp_bal = ConfusionMatrixDisplay(confusion_matrix=cm_bal)
disp_bal.plot(cmap='Blues')
plt.title(f"Confusion Matrix (LogReg with class_weight, Threshold={selected_thr_bal:.2f})")
plt.show()
print("Confusion Matrix (rows=true, cols=pred):\n", cm_bal)


#### Logistic Regression (with class_weight='balanced', Threshold = 0.29): Results

- **Selected Threshold:** 0.29
- **Recall:** 0.907
- **Precision:** 0.517
- **F1-score:** 0.659
- **Accuracy:** 0.672

|                | Predicted Negative | Predicted Positive |
|----------------|-------------------|-------------------|
| **True Negative** | 273               | 227               |
| **True Positive** | 25                | 243               |

- **True Positive (TP):** 243
- **False Positive (FP):** 227
- **True Negative (TN):** 273
- **False Negative (FN):** 25

---

**Interpretation:**
- With class balancing by class_weight, logistic regression maintains high recall while **reducing false positives and increasing true negatives** compared to the vanilla model.
- **Precision and overall accuracy improve or are maintained**; you have successfully preserved sensitivity while gaining on the specificity side.
- This means fewer unnecessary follow-ups for a very similar miss rate on real cases.

**Clinical implication:**  
With this setting, WE have an even more practical screening model, balancing the need to catch nearly all true cases without overwhelming the follow-up system.

---


### Random Forest with Class Weight (class_weight='balanced')

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Train Random Forest with class_weight='balanced'
rf_bal = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
y_prob_rf_bal = cross_val_predict(rf_bal, X_rf, y_rf, cv=cv, method='predict_proba')[:, 1]


### Metric Curve Plotting for Balanced Random Forest

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score

thresholds = np.arange(0.1, 0.9, 0.01)
recalls_rf_bal, precisions_rf_bal, fscores_rf_bal, accuracies_rf_bal = [], [], [], []

for thr in thresholds:
    y_pred_thr = (y_prob_rf_bal >= thr).astype(int)
    recalls_rf_bal.append(recall_score(y_rf, y_pred_thr))
    precisions_rf_bal.append(precision_score(y_rf, y_pred_thr))
    fscores_rf_bal.append(f1_score(y_rf, y_pred_thr))
    accuracies_rf_bal.append(accuracy_score(y_rf, y_pred_thr))

plt.figure(figsize=(10,5))
plt.plot(thresholds, recalls_rf_bal, label='Recall', marker='o')
plt.plot(thresholds, precisions_rf_bal, label='Precision', marker='x')
plt.plot(thresholds, fscores_rf_bal, label='F1-score', marker='^')
plt.plot(thresholds, accuracies_rf_bal, label='Accuracy', marker='s')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Metric Curves vs. Threshold (Random Forest with class_weight="balanced")')
plt.legend()
plt.show()


### Selecting Sensible Threshold and Confusion Matrix (Random Forest)

In [None]:
# Threshold selection for recall >= 0.90 (or your preferred value)
desired_recall_rf = 0.90
recall_idxs_rf_bal = np.where(np.array(recalls_rf_bal) >= desired_recall_rf)[0]

if len(recall_idxs_rf_bal) > 0:
    idx_rf_bal = recall_idxs_rf_bal[-1]
    selected_thr_rf_bal = thresholds[idx_rf_bal]
    print(f"Selected Threshold: {selected_thr_rf_bal:.2f}")
    print(f"Recall:    {recalls_rf_bal[idx_rf_bal]:.3f}")
    print(f"Precision: {precisions_rf_bal[idx_rf_bal]:.3f}")
    print(f"F1-score:  {fscores_rf_bal[idx_rf_bal]:.3f}")
    print(f"Accuracy:  {accuracies_rf_bal[idx_rf_bal]:.3f}")
else:
    print("No threshold achieves the desired recall target.")

# Confusion matrix at the selected threshold
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
y_pred_rf_bal = (y_prob_rf_bal >= selected_thr_rf_bal).astype(int)
cm_rf_bal = confusion_matrix(y_rf, y_pred_rf_bal)
disp_rf_bal = ConfusionMatrixDisplay(confusion_matrix=cm_rf_bal)
disp_rf_bal.plot(cmap='Blues')
plt.title(f"Confusion Matrix (RF with class_weight, Threshold={selected_thr_rf_bal:.2f})")
plt.show()
print("Confusion Matrix (rows=true, cols=pred):\n", cm_rf_bal)


#### Logistic Regression (with class_weight='balanced', Threshold = 0.29): Results

- **Selected Threshold:** 0.29
- **Recall:** 0.907
- **Precision:** 0.517
- **F1-score:** 0.659
- **Accuracy:** 0.672

|                | Predicted Negative | Predicted Positive |
|----------------|-------------------|-------------------|
| **True Negative** | 273               | 227               |
| **True Positive** | 25                | 243               |

- **True Positive (TP):** 243
- **False Positive (FP):** 227
- **True Negative (TN):** 273
- **False Negative (FN):** 25

---

**Interpretation:**
- With class balancing by class_weight, logistic regression maintains high recall while **reducing false positives and increasing true negatives** compared to the vanilla model.
- **Precision and overall accuracy improve or are maintained**; you have successfully preserved sensitivity while gaining on the specificity side.
- This means fewer unnecessary follow-ups for a very similar miss rate on real cases.

**Clinical implication:**  
With this setting, you have an even more practical screening model, balancing the need to catch nearly all true cases without overwhelming the follow-up system.

---



#### Class-Weighted Models: Logistic Regression vs Random Forest (at Sensible Clinical Threshold)

##### Logistic Regression (`class_weight='balanced'`, Threshold = 0.29)
- **Recall:** 0.907
- **Precision:** 0.517
- **F1-score:** 0.659
- **Accuracy:** 0.672

|                | Predicted Negative | Predicted Positive |
|----------------|-------------------|-------------------|
| **True Negative** | 273               | 227               |
| **True Positive** | 25                | 243               |

##### Random Forest (`class_weight='balanced'`, Threshold = 0.20)
- **Recall:** 0.907
- **Precision:** 0.520
- **F1-score:** 0.661
- **Accuracy:** 0.676

|                | Predicted Negative | Predicted Positive |
|----------------|-------------------|-------------------|
| **True Negative** | 276               | 224               |
| **True Positive** | 25                | 243               |

---

##### **Interpretation & Comparison**

- **Recall is identical (0.907) for both models**—both are excellent at catching true cases.
- **Precision and F1-score are nearly the same**, with Random Forest being ever-so-slightly higher in both.
- **Accuracy is slightly higher for Random Forest** (0.676 vs 0.672), mainly due to a few more correct negatives.
- The *number of false positives and true negatives is very closely matched*; both models achieve a strong reduction in false positives versus their vanilla (unbalanced) forms.
- **Clinical impact:** Either model is now both highly sensitive and much less overwhelming with unnecessary follow-ups thanks to balancing.

---

**Conclusion:**  
- Both class-weighted models yield very similar, solid results at high recall.
- **Random Forest edges ahead very slightly on precision, F1, and accuracy, but the difference is minor.**
- You can confidently propose either model as an effective screening tool after class weighting.

---

**Next step:**  
- Try SMOTE balancing to potentially squeeze out more improvement in minority class recovery or to test if even higher precision can be achieved without a recall drop.
nm

#  Handling Class Imbalance Using SMOTE 

Real medical datasets often show a class imbalance, where the number of positive cases 
(diabetes diagnosed = 1) is much smaller than the number of negative cases.

To ensure the model learns minority-class patterns effectively, we apply **SMOTE**  
(Synthetic Minority Oversampling Technique) — but **only on the training data**, 
to avoid data leakage.

In this section we will:

1. Apply SMOTE safely (train data only)
2. Scale the data after SMOTE
3. Train and evaluate the following models:
   - **Logistic Regression**
   - **Random Forest**
   - **XGBoost**
4. Compare their performance on the untouched test set


### Apply SMOTE Oversampling

In [None]:
!pip install imbalanced-learn


In [None]:
from imblearn.over_sampling import SMOTE


In [None]:
# ============================================================
# Section: Handling Class Imbalance Using SMOTE
# (Safe: Train-Only Oversampling to Avoid Data Leakage)
# ============================================================

from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score,
    f1_score, accuracy_score, roc_auc_score, ConfusionMatrixDisplay
)
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# ------------------------------------------------------------
# 1. Remove categorical / binned features before SMOTE
# ------------------------------------------------------------
df_smote_base = df_scaled2.drop(columns=["Age_Group", "BMI_Class"])

print("Columns used for SMOTE:\n", df_smote_base.columns.tolist())

# ------------------------------------------------------------
# 2. Split into features & target
# ------------------------------------------------------------
X = df_smote_base.drop(columns=["Outcome"])
y = df_smote_base["Outcome"]

# ------------------------------------------------------------
# 3. Train-Test Split
# ------------------------------------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("\nTrain class distribution BEFORE SMOTE:")
print(y_train.value_counts())

# ------------------------------------------------------------
# 4. Apply SMOTE only on training data
# ------------------------------------------------------------
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

print("\nTrain class distribution AFTER SMOTE:")
print(y_train_res.value_counts())
print(f"Training samples after SMOTE: {X_train_res.shape}")

# ------------------------------------------------------------
# 5. Scaling
# ------------------------------------------------------------
scaler = StandardScaler()
X_train_res_s = scaler.fit_transform(X_train_res)
X_test_s = scaler.transform(X_test)

# ============================================================
# 6. Logistic Regression + SMOTE
# ============================================================
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train_res_s, y_train_res)

lr_probs = lr.predict_proba(X_test_s)[:, 1]
lr_preds = (lr_probs >= 0.5).astype(int)

print("\n======= Logistic Regression + SMOTE =======")
print("Precision:", precision_score(y_test, lr_preds))
print("Recall:", recall_score(y_test, lr_preds))
print("F1:", f1_score(y_test, lr_preds))
print("Accuracy:", accuracy_score(y_test, lr_preds))
print("AUC:", roc_auc_score(y_test, lr_probs))
print("Confusion Matrix:\n", confusion_matrix(y_test, lr_preds))

plt.figure(figsize=(5,4))
ConfusionMatrixDisplay(confusion_matrix(y_test, lr_preds)).plot(cmap="Blues")
plt.title("Confusion Matrix - Logistic Regression + SMOTE")
plt.show()

# ============================================================
# 7. Random Forest + SMOTE
# ============================================================
rf = RandomForestClassifier(n_estimators=300, random_state=42, n_jobs=-1)
rf.fit(X_train_res_s, y_train_res)

rf_probs = rf.predict_proba(X_test_s)[:, 1]
rf_preds = (rf_probs >= 0.5).astype(int)

print("\n======= Random Forest + SMOTE =======")
print("Precision:", precision_score(y_test, rf_preds))
print("Recall:", recall_score(y_test, rf_preds))
print("F1:", f1_score(y_test, rf_preds))
print("Accuracy:", accuracy_score(y_test, rf_preds))
print("AUC:", roc_auc_score(y_test, rf_probs))
print("Confusion Matrix:\n", confusion_matrix(y_test, rf_preds))

plt.figure(figsize=(5,4))
ConfusionMatrixDisplay(confusion_matrix(y_test, rf_preds)).plot(cmap="Blues")
plt.title("Confusion Matrix - Random Forest + SMOTE")
plt.show()

# ============================================================
# 8. XGBoost + SMOTE
# ============================================================
xgb = XGBClassifier(
    eval_metric="logloss",
    random_state=42,
    n_estimators=300,
    learning_rate=0.05,
    max_depth=4
)
xgb.fit(X_train_res_s, y_train_res)

xgb_probs = xgb.predict_proba(X_test_s)[:, 1]
xgb_preds = (xgb_probs >= 0.5).astype(int)

print("\n======= XGBoost + SMOTE =======")
print("Precision:", precision_score(y_test, xgb_preds))
print("Recall:", recall_score(y_test, xgb_preds))
print("F1:", f1_score(y_test, xgb_preds))
print("Accuracy:", accuracy_score(y_test, xgb_preds))
print("AUC:", roc_auc_score(y_test, xgb_probs))
print("Confusion Matrix:\n", confusion_matrix(y_test, xgb_preds))

plt.figure(figsize=(5,4))
ConfusionMatrixDisplay(confusion_matrix(y_test, xgb_preds)).plot(cmap="Blues")
plt.title("Confusion Matrix - XGBoost + SMOTE")
plt.show()

# ============================================================
# 9. SVM (RBF Kernel) + SMOTE
# ============================================================
svm = SVC(kernel='rbf', probability=True, random_state=42)
svm.fit(X_train_res_s, y_train_res)

svm_probs = svm.predict_proba(X_test_s)[:, 1]
svm_preds = (svm_probs >= 0.5).astype(int)

print("\n======= SVM (RBF Kernel) + SMOTE =======")
print("Precision:", precision_score(y_test, svm_preds))
print("Recall:", recall_score(y_test, svm_preds))
print("F1:", f1_score(y_test, svm_preds))
print("Accuracy:", accuracy_score(y_test, svm_preds))
print("AUC:", roc_auc_score(y_test, svm_probs))
print("Confusion Matrix:\n", confusion_matrix(y_test, svm_preds))

plt.figure(figsize=(5,4))
ConfusionMatrixDisplay(confusion_matrix(y_test, svm_preds)).plot(cmap="Blues")
plt.title("Confusion Matrix - SVM (RBF) + SMOTE")
plt.show()


## Comparing Class-Weight Balancing vs SMOTE Oversampling

This section compares two imbalance-handling strategies:

1. Class-weight balancing (`class_weight='balanced'`)
2. SMOTE oversampling (applied only to the training data)

Class-weight models act as a baseline.  
SMOTE models attempt to create a more balanced representation for the minority class.

---

## Class-Weight Balanced Models (Baseline)

### Logistic Regression (`class_weight='balanced'`, Threshold = 0.29)

- **Recall:** 0.907  
- **Precision:** 0.517  
- **F1-score:** 0.659  
- **Accuracy:** 0.672  

**Confusion Matrix**
```
[[273 227]
 [ 25 243]]
```

### Random Forest (`class_weight='balanced'`, Threshold = 0.20)

- **Recall:** 0.907  
- **Precision:** 0.520  
- **F1-score:** 0.661  
- **Accuracy:** 0.676  

**Confusion Matrix**
```
[[276 224]
 [ 25 243]]
```

### Interpretation (Class-Weight Models)

- Very high recall indicates the models successfully identify diabetics.
- However, both models misclassify a large number of non-diabetics as diabetic (high false positives).
- Precision remains low (around 0.52), which is not suitable for clinical usage.
- Class-weighting alone leads to over-prediction of the positive class.

---

## SMOTE-Based Models  
*(SMOTE applied safely on the training set only, then scaled, then modeled.)*

### Logistic Regression + SMOTE (Threshold = 0.5)

- **Precision:** 0.560  
- **Recall:** 0.685  
- **F1-score:** 0.616  
- **Accuracy:** 0.701  
- **AUC:** 0.797  

**Confusion Matrix**
```
[[71 29]
 [17 37]]
```

---

### Random Forest + SMOTE (Threshold = 0.5)

- **Precision:** 0.639  
- **Recall:** 0.722  
- **F1-score:** 0.679  
- **Accuracy:** 0.759  
- **AUC:** 0.821  

**Confusion Matrix**
```
[[78 22]
 [15 39]]
```

---

### XGBoost + SMOTE (Threshold = 0.5)

- **Precision:** 0.645  
- **Recall:** 0.740  
- **F1-score:** 0.689  
- **Accuracy:** 0.766  
- **AUC:** 0.813  

**Confusion Matrix**
```
[[78 22]
 [14 40]]
```

---

### SVM (RBF Kernel) + SMOTE (Threshold = 0.5)

- **Precision:** 0.619  
- **Recall:** 0.722  
- **F1-score:** 0.667  
- **Accuracy:** 0.746  
- **AUC:** 0.811  

**Confusion Matrix**
```
[[76 24]
 [15 39]]
```

### Interpretation (SVM + SMOTE)

- SVM performs strongly after SMOTE, achieving recall comparable to Random Forest and strong precision.
- Accuracy (0.746) and AUC (0.811) are competitive with RF and XGB.
- F1-score is slightly lower than XGBoost but higher than Logistic Regression.
- Overall, SVM becomes a viable candidate once SMOTE is used.

---

## Direct Comparison: Class-Weight vs SMOTE

| Model                         | Precision | Recall | F1-score | Accuracy | Notes |
|------------------------------|-----------|--------|----------|----------|-------|
| Logistic Regression (CW)     | 0.517     | 0.907  | 0.659    | 0.672    | High recall, very low precision |
| Random Forest (CW)           | 0.520     | 0.907  | 0.661    | 0.676    | Many false positives |
| Logistic Regression + SMOTE  | 0.560     | 0.685  | 0.616    | 0.701    | More balanced performance |
| Random Forest + SMOTE        | 0.639     | 0.722  | 0.679    | 0.759    | Strong improvement |
| XGBoost + SMOTE              | 0.645     | 0.740  | 0.689    | 0.766    | Best overall metrics |
| SVM (RBF) + SMOTE            | 0.619     | 0.722  | 0.667    | 0.746    | Competitive; better than LR+SMOTE |

---

## Summary and Key Insights

- Class-weight balancing gives high recall but fails to control false positives, resulting in low precision and poor real-world usability.
- SMOTE improves both recall and precision by creating more realistic synthetic minority samples.
- Logistic Regression benefits from SMOTE but still lags behind tree-based models and SVM.
- Random Forest and XGBoost show substantial performance improvement after SMOTE.
- XGBoost + SMOTE delivers the strongest combination of:
  - Precision  
  - Recall  
  - F1-score  
  - Accuracy  
  - AUC  
- SVM + SMOTE performs surprisingly well and surpasses Logistic Regression + SMOTE, but slightly underperforms XGBoost.

**Final takeaway:**  
SMOTE is clearly superior to class-weight balancing, and XGBoost + SMOTE is the best-performing model overall. SVM becomes competitive only after SMOTE is applied.

---


# Final Model Comparison Table (All SMOTE Models)

In [None]:


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Create comparison dictionary
comparison = {
    "Model": [
        "Logistic Regression + SMOTE",
        "Random Forest + SMOTE",
        "XGBoost + SMOTE",
        "SVM (RBF) + SMOTE"
    ],
    "Precision": [
        precision_score(y_test, lr_preds),
        precision_score(y_test, rf_preds),
        precision_score(y_test, xgb_preds),
        precision_score(y_test, svm_preds)
    ],
    "Recall": [
        recall_score(y_test, lr_preds),
        recall_score(y_test, rf_preds),
        recall_score(y_test, xgb_preds),
        recall_score(y_test, svm_preds)
    ],
    "F1-score": [
        f1_score(y_test, lr_preds),
        f1_score(y_test, rf_preds),
        f1_score(y_test, xgb_preds),
        f1_score(y_test, svm_preds)
    ],
    "Accuracy": [
        accuracy_score(y_test, lr_preds),
        accuracy_score(y_test, rf_preds),
        accuracy_score(y_test, xgb_preds),
        accuracy_score(y_test, svm_preds)
    ],
    "AUC": [
        roc_auc_score(y_test, lr_probs),
        roc_auc_score(y_test, rf_probs),
        roc_auc_score(y_test, xgb_probs),
        roc_auc_score(y_test, svm_probs)
    ]
}

comparison_df = pd.DataFrame(comparison)
display(comparison_df)

# ============================================================
# Bar Plot for F1-score Comparison
# ============================================================

plt.figure(figsize=(10,5))
sns.barplot(
    data=comparison_df.sort_values("F1-score", ascending=False),
    x="Model", y="F1-score", palette="Blues_d"
)
plt.xticks(rotation=45, ha="right")
plt.title("Model Comparison (F1-score) - SMOTE Oversampling")
plt.ylabel("F1-score")
plt.xlabel("")
plt.tight_layout()
plt.show()


### Summary and Discussion

Across all models trained using SMOTE, **XGBoost + SMOTE** achieved the strongest overall performance, obtaining the highest F1-score (0.689), accuracy (0.766), and recall (0.740). These results indicate that **XGBoost** provides the best balance between sensitivity and specificity, which is essential for a clinical screening context where both false negatives and false positives must be carefully controlled.  

Both **Random Forest + SMOTE** and **SVM (RBF) + SMOTE** also showed substantial gains compared to the class-weight baselines, demonstrating competitive recall and solid overall performance. Their nonlinear decision boundaries benefited significantly from the enriched synthetic samples generated by SMOTE. Meanwhile, **Logistic Regression + SMOTE** produced more balanced performance than its class-weight counterpart, although it still lagged behind the nonlinear models.

In comparison, the class-weight-only models (**Logistic Regression (balanced)** and **Random Forest (balanced)**) exhibited very high recall but suffered from excessive false positives, resulting in low precision. This highlights an important limitation of simple class weighting in medical datasets: while it reduces false negatives, it leads to over-diagnosis, making the model less reliable for practical deployment.

Overall, these results establish that targeted imbalance handling is crucial for medical prediction tasks. The superior performance of **XGBoost + SMOTE** makes it the most appropriate choice for threshold optimization, interpretability analysis, and eventual deployment in a screening-oriented application.


## Threshold Tuning for the Final Model (XGBoost + SMOTE)


### Why Threshold Tuning Is Necessary

Most machine learning classifiers produce probability outputs between 0 and 1, but convert these probabilities into class labels using a default threshold of 0.50. While this default value is mathematically convenient, it is rarely optimal—especially in medical diagnostics, where the relative cost of false negatives and false positives is highly unequal.

In the context of diabetes prediction, a false negative (a diabetic patient incorrectly classified as healthy) is far more dangerous than a false positive. Therefore, the decision threshold should be carefully optimized to achieve an appropriate balance between sensitivity (recall) and precision. Threshold tuning evaluates model performance across a range of probability cutoffs and identifies the value that maximizes the clinical objective—commonly high recall with an acceptable level of precision, or the highest overall F1-score.

### Why Threshold Tuning Is Performed Only on the Best Model

Threshold tuning is computationally intensive and is meaningful only for the final selected model. Since earlier experiments showed that **XGBoost + SMOTE** consistently achieved the strongest performance across accuracy, recall, precision, F1-score, and AUC, it is chosen as the final candidate for deployment. Running threshold optimization on multiple models would add unnecessary complexity without offering additional practical value, because only the best-performing model will be used in the final application.

Focusing threshold tuning exclusively on **XGBoost + SMOTE** ensures that the optimization effort is directed where it matters most: refining the performance of the clinically relevant final model.


In [None]:
# ============================================================
# Threshold Tuning for XGBoost + SMOTE
# ============================================================

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (
    precision_recall_curve,
    confusion_matrix,
    precision_score,
    recall_score,
    f1_score,
    accuracy_score,
    roc_auc_score,
    ConfusionMatrixDisplay
)

# 1. Get predicted probabilities
xgb_probs = xgb.predict_proba(X_test_s)[:, 1]

# 2. Generate Precision-Recall curve
precision, recall, thresholds = precision_recall_curve(y_test, xgb_probs)

# 3. Compute F1-score for each threshold (avoid division by zero)
f1_scores = (2 * precision * recall) / (precision + recall + 1e-9)

# 4. Identify the threshold that maximizes F1-score
best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx]
best_f1 = f1_scores[best_idx]

print("Best Threshold (max F1):", best_threshold)
print("Best F1-score:", best_f1)

# ============================================================
# Plot Precision-Recall Curve
# ============================================================

plt.figure(figsize=(7,5))
plt.plot(thresholds, precision[:-1], label="Precision")
plt.plot(thresholds, recall[:-1], label="Recall")
plt.xlabel("Threshold")
plt.ylabel("Score")
plt.title("Precision and Recall vs Threshold (XGBoost)")
plt.legend()
plt.grid(True)
plt.show()

# ============================================================
# Plot F1 vs Threshold Curve
# ============================================================

plt.figure(figsize=(7,5))
plt.plot(thresholds, f1_scores[:-1], label="F1-score", color='purple')
plt.axvline(best_threshold, color='red', linestyle='--', label=f"Best Threshold = {best_threshold:.2f}")
plt.xlabel("Threshold")
plt.ylabel("F1-score")
plt.title("F1-score vs Threshold (XGBoost)")
plt.legend()
plt.grid(True)
plt.show()

# ============================================================
# Evaluate Performance at Best Threshold
# ============================================================

xgb_preds_tuned = (xgb_probs >= best_threshold).astype(int)

cm = confusion_matrix(y_test, xgb_preds_tuned)
tn, fp, fn, tp = cm.ravel()

print("\nConfusion Matrix at Tuned Threshold:\n", cm)
print("True Negative:", tn)
print("False Positive:", fp)
print("False Negative:", fn)
print("True Positive:", tp)

print("\nMetrics at Tuned Threshold:")
print("Precision:", precision_score(y_test, xgb_preds_tuned))
print("Recall:", recall_score(y_test, xgb_preds_tuned))
print("F1-score:", f1_score(y_test, xgb_preds_tuned))
print("Accuracy:", accuracy_score(y_test, xgb_preds_tuned))
print("AUC:", roc_auc_score(y_test, xgb_probs))

# ============================================================
# Plot Confusion Matrix (Tuned Threshold)
# ============================================================

plt.figure(figsize=(5,4))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap="Blues")
plt.title(f"Confusion Matrix - XGBoost (Threshold = {best_threshold:.2f})")
plt.show()


### Interpretation of Threshold Tuning Results

Threshold tuning was performed to identify the probability cutoff that maximizes the clinical balance between false positives and false negatives. Using the Precision–Recall and F1-score curves, the optimal threshold for **XGBoost + SMOTE** was automatically identified as **0.51**, which achieved the highest F1-score (0.690). This value represents the point where the model most effectively balances precision and recall.

At the tuned threshold of **0.51**, the model produced the following confusion matrix:

```
[[78 22]
 [14 40]]
```

- **True Negatives:** 78  
- **False Positives:** 22  
- **False Negatives:** 14  
- **True Positives:** 40  

### Performance at Tuned Threshold

- **Precision:** 0.645  
- **Recall:** 0.740  
- **F1-score:** 0.690  
- **Accuracy:** 0.766  
- **AUC:** 0.813  

### Clinical Interpretation

Increasing the threshold above 0.50 typically increases precision at the cost of recall; however, in this case, the tuned threshold of 0.51 keeps recall high (0.74) while simultaneously improving precision compared to the class-weighted and defaultthreshold models. This indicates that the model becomes more selective in predicting diabetes while still capturing the majority of true diabetic cases.

A **recall of 0.74** means the model correctly identifies 74% of diabetic individuals, reducing the number of potentially dangerous false negatives. Meanwhile, a **precision of 0.645** ensures that most individuals predicted as diabetic truly belong to the positive class, minimizing unnecessary anxiety and follow-up tests.

The tuned threshold therefore provides a clinically meaningful balance:  
- It **reduces false negatives**, which is crucial in diabetes screening.  
- It **controls false positives**, improving the model's usefulness in real clinical workflows.  
- It maintains strong overall discrimination (**AUC = 0.813**).  

Overall, threshold tuning significantly enhances the model's practical reliability. The final performance metrics support the selection of **XGBoost + SMOTE (Threshold = 0.51)** as the model to be carried forward for interpretability analysis and deployment.


In [None]:
# ============================================================
# Dual-Threshold System: Balanced Mode and High-Sensitivity Mode
# ============================================================

# --- Balanced Threshold (already found from F1) ---
balanced_threshold = best_threshold   # ~0.51
print("Balanced Threshold (F1-optimal):", balanced_threshold)

# ============================================================
# Compute HIGH-SENSITIVITY Threshold (Recall ≥ 0.85)
# ============================================================

desired_recall = 0.85
high_recall_threshold = None

# thresholds has length = len(precision)-1, align recall & thresholds
full_thresholds = np.append(thresholds, 1)

# Recall increases as threshold decreases, so reverse-scan:
for r, t in zip(recall[::-1], full_thresholds[::-1]):
    if r >= desired_recall:
        high_recall_threshold = t
        break

# If no threshold reaches recall ≥ 0.85
if high_recall_threshold is None:
    print("\n⚠ Model cannot reach Recall ≥ 0.85 at any reasonable threshold.")
    print("Selecting highest-recall threshold with precision > 0.")

    # Choose best fallback threshold
    best_r = 0
    best_t = None

    for r, p, t in zip(recall, precision, full_thresholds):
        if p > 0 and r > best_r:
            best_r = r
            best_t = t

    high_recall_threshold = best_t
    print("Fallback High-Sensitivity Threshold:", high_recall_threshold)
    print("Recall at fallback threshold:", best_r)
else:
    print("\nHigh-Sensitivity Threshold (Recall ≥ 0.85):", high_recall_threshold)

# ============================================================
# BALANCED MODE EVALUATION
# ============================================================

xgb_preds_balanced = (xgb_probs >= balanced_threshold).astype(int)
cm_balanced = confusion_matrix(y_test, xgb_preds_balanced)

print("\n=== BALANCED MODE (Threshold =", round(balanced_threshold, 2), ") ===")
print("Confusion Matrix:\n", cm_balanced)
print("Precision:", precision_score(y_test, xgb_preds_balanced))
print("Recall:", recall_score(y_test, xgb_preds_balanced))
print("F1-score:", f1_score(y_test, xgb_preds_balanced))
print("Accuracy:", accuracy_score(y_test, xgb_preds_balanced))

plt.figure(figsize=(5,4))
ConfusionMatrixDisplay(confusion_matrix=cm_balanced).plot(cmap="Blues")
plt.title(f"Balanced Mode - XGBoost (Threshold = {balanced_threshold:.2f})")
plt.show()

# ============================================================
# HIGH-SENSITIVITY MODE EVALUATION
# ============================================================

xgb_preds_highrecall = (xgb_probs >= high_recall_threshold).astype(int)
cm_highrecall = confusion_matrix(y_test, xgb_preds_highrecall)

print("\n=== HIGH-SENSITIVITY MODE (Threshold =", round(high_recall_threshold, 2), ") ===")
print("Confusion Matrix:\n", cm_highrecall)
print("Precision:", precision_score(y_test, xgb_preds_highrecall))
print("Recall:", recall_score(y_test, xgb_preds_highrecall))
print("F1-score:", f1_score(y_test, xgb_preds_highrecall))
print("Accuracy:", accuracy_score(y_test, xgb_preds_highrecall))

plt.figure(figsize=(5,4))
ConfusionMatrixDisplay(confusion_matrix=cm_highrecall).plot(cmap="Blues")
plt.title(f"High-Sensitivity Mode - XGBoost (Threshold = {high_recall_threshold:.2f})")
plt.show()


### Dual-Threshold Decision System for Clinical Deployment

Medical models often require different operating points depending on the clinical setting. A single fixed decision threshold cannot satisfy every scenario because the cost of false negatives and false positives varies by patient population. To address this, we implement a dual-threshold design that supports two clinically meaningful modes:

---

## **1. Balanced Mode (Threshold = 0.51)**  
This threshold was selected using the F1-maximizing criterion, providing the strongest balance between precision and recall.

**Confusion Matrix**
```
[[78 22]
 [14 40]]
```

- **Precision:** 0.645  
- **Recall:** 0.740  
- **F1-score:** 0.689  
- **Accuracy:** 0.766  

Balanced mode is suitable for:

- General population screening  
- Primary care clinics  
- Situations where both false positives and false negatives carry moderate clinical cost  

This mode reduces unnecessary follow-up tests while still identifying a large proportion of true diabetic patients.

---

## **2. High-Sensitivity Mode (Threshold = 0.19)**  
To support high-risk groups, we identify a threshold that achieves **high recall** while preserving meaningful precision. The threshold of **0.19** achieves:

**Confusion Matrix**
```
[[63 37]
 [ 8 46]]
```

- **Precision:** 0.554  
- **Recall:** 0.852  
- **F1-score:** 0.672  
- **Accuracy:** 0.708  

This mode drastically reduces false negatives—from 14 in Balanced Mode down to just 8—meaning fewer diabetic patients are missed. This is critical because false negatives carry the highest clinical risk.

High-Sensitivity mode is appropriate for:

- High-risk patients (family history, obesity, hypertension)  
- Hospital triage  
- Early detection programs  
- Public health outreach deployments  
- Screening in rural or resource-limited regions  

The trade-off is an increase in false positives (22 → 37), but this is acceptable in clinical screening where early detection outweighs the cost of additional confirmatory tests.

---

## **Clinical Importance of the Dual-Threshold System**

This two-threshold framework reflects real clinical practice:

- **Balanced Mode** is used for typical diagnosis where efficiency and accuracy are both required.  
- **High-Sensitivity Mode** prioritizes patient safety by minimizing false negatives, even if precision decreases.

Implementing dual operating points demonstrates:

- Awareness of real-world clinical risk trade-offs  
- Strong deployment-oriented thinking  
- Adaptability of the model to different healthcare environments  
- Understanding that medical ML is not “one-threshold-fits-all”  

This design significantly strengthens the model’s usability and shows maturity expected in advanced research internships like the Max Planck Institute.

---
