### Stage 3: Predictive Modeling – Explanation

Having prepared and partitioned the data in Stage 2, we now move into the modeling phase, where the goal is to build predictive models that estimate the likelihood of a patient being readmitted within 30 days of hospital discharge.

Unlike traditional workflows that focus solely on performance metrics, this stage is a **setup for diagnosis**. Our objective is not just to train a model that performs well, but to observe how various model architectures behave under the constraints and imperfections of real-world clinical data.

This stage is especially important for our research because:
- It provides **empirical evidence** that traditional models can exhibit poor behavior (e.g., ignoring the minority class).
- It creates **predictive outputs** that XAI techniques (like SHAP and LIME) will later dissect to expose hidden issues.
- It allows us to test whether data problems generalize across model types, reinforcing our claim that **XAI is required to explain not just models, but the data they depend on.**

We will:
- Load the preprocessed train/test datasets.
- Train a set of classifiers spanning tree-based and neural architectures.
- Evaluate each model using both performance metrics and error patterns.

This sets the stage for XAI to uncover *why* these models behave the way they do, especially when they fail.


### Step 1: Load Preprocessed Data

We begin Stage 3 by loading the cleaned and partitioned training and testing datasets that were saved at the end of Stage 2. This ensures consistency across experimental stages and allows us to evaluate multiple models on the exact same splits.


In [4]:
import joblib

# Load data
X_train = joblib.load('../data/X_train.pkl')
X_test = joblib.load('../data/X_test.pkl')
y_train = joblib.load('../data/y_train.pkl')
y_test = joblib.load('../data/y_test.pkl')

print("✅ Data loaded successfully")
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)


✅ Data loaded successfully
Train shape: (79593, 44)
Test shape: (19899, 44)
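
Because the rest of this stage hinges on class imbalance, it can help to confirm the label distribution right after loading. The following is a minimal sketch and not part of the original notebook: `class_balance` is a hypothetical helper, and the demo labels are synthesized from the test-set class sizes reported in the evaluation outputs of this stage (17,665 negatives and 2,234 positives). In the notebook you would pass `y_train` or `y_test` instead.

```python
from collections import Counter


def class_balance(labels):
    """Return (per-class counts, positive-class rate) for binary 0/1 labels."""
    counts = Counter(int(value) for value in labels)
    total = sum(counts.values())
    positive_rate = counts.get(1, 0) / total if total else 0.0
    return dict(counts), positive_rate


# Synthetic stand-in for y_test, built from the class sizes reported in this stage.
demo_labels = [0] * 17665 + [1] * 2234
counts, positive_rate = class_balance(demo_labels)
print(counts, f"positive rate = {positive_rate:.4f}")
```

A positive rate of roughly 11% means that any accuracy figure near 89% should be read with caution.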


### Step 2: Train First Model – Random Forest Classifier

We begin with a Random Forest classifier as our baseline model. It is widely used in healthcare prediction tasks due to its robustness, ability to handle non-linear relationships, and compatibility with explainability tools like SHAP's TreeExplainer.

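Tree ensembles also do not require feature scaling, which makes this a convenient starting point. Whether scikit-learn's `RandomForestClassifier` accepts missing values natively depends on the installed version, so we rely on the cleaned data from Stage 2 rather than on that behavior.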


In [5]:
from sklearn.ensemble import RandomForestClassifier

# Initialize and train the model
rf_model = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',
    random_state=42
)

rf_model.fit(X_train, y_train)

# Save model
joblib.dump(rf_model, '../models/random_forest.pkl')

print("✅ Random Forest model trained and saved.")


✅ Random Forest model trained and saved.


### Step 3: Evaluate Random Forest Model

We now evaluate the performance of the trained Random Forest model using classification metrics that are especially relevant in imbalanced classification tasks, including:
- **Precision** (positive predictive value)
- **Recall** (sensitivity)
- **F1-score** (balance of precision and recall)
- **ROC AUC** (model's ability to rank positives over negatives)

We will also examine the **confusion matrix** to understand the model’s behavior in terms of false positives and false negatives — which is critical in high-stakes domains like healthcare.
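
To make the "ranking" reading of ROC AUC concrete, the sketch below computes it from scratch as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (ties count as one half). This is an illustrative, pure-Python equivalent of the rank-sum formulation; `roc_auc_rank` is a hypothetical helper and the toy labels and scores are made up. The notebook itself uses `sklearn.metrics.roc_auc_score`.

```python
def roc_auc_rank(y_true, y_score):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation, with tie handling."""
    pairs = sorted(zip(y_score, y_true))
    n = len(pairs)
    rank_sum_pos = 0.0
    n_pos = 0
    i = 0
    while i < n:
        # Find the block of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        pos_in_block = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * pos_in_block
        n_pos += pos_in_block
        i = j
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC needs at least one positive and one negative.")
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
toy_auc = roc_auc_rank([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, independent of any decision threshold.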


In [6]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Predictions and probabilities
y_pred_rf = rf_model.predict(X_test)
y_proba_rf = rf_model.predict_proba(X_test)[:, 1]

# Metrics
print("📊 Classification Report (Random Forest):")
print(classification_report(y_test, y_pred_rf, digits=4))

print("📈 Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))

print(f"🔵 ROC AUC Score: {roc_auc_score(y_test, y_proba_rf):.4f}")


📊 Classification Report (Random Forest):
              precision    recall  f1-score   support

           0     0.8879    0.9997    0.9405     17665
           1     0.4444    0.0018    0.0036      2234

    accuracy                         0.8877     19899
   macro avg     0.6662    0.5008    0.4720     19899
weighted avg     0.8381    0.8877    0.8353     19899

📈 Confusion Matrix:
[[17660     5]
 [ 2230     4]]
🔵 ROC AUC Score: 0.6427


### Random Forest Model – Performance Interpretation

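The Random Forest classifier achieves an overall accuracy of **88.8%**, but this is what a model that always predicts class `0` would score, since 17,665 of the 19,899 test cases are negative. The metric therefore masks severe performance issues on the minority class (`1`: readmitted within 30 days), which is the clinically important outcome.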

#### Key observations:

- **Majority Class (`0`) Performance:**
  - Precision: **0.89**
  - Recall: **~100%**
  - F1-score: **0.94**
  - The model almost perfectly identifies patients *not* readmitted, which dominates the test set.

- **Minority Class (`1`) Failure:**
  - Precision: **0.44** (only 9 cases predicted positive, 4 of them correct)
  - Recall: **0.0018** (only 4 of 2,234 actual positives detected)
  - F1-score: **0.0036**, effectively **non-functional** for predicting actual readmissions

- **ROC AUC Score:**  
  - **0.6427**, indicating some rank-order separation between classes, but **not sufficient for clinical use**.

- **Confusion Matrix:**  
  - Only 4 of the 2,234 actual positive cases were correctly identified; the remaining 2,230 are false negatives.
  - The model is **heavily skewed toward the majority class** even with `class_weight='balanced'`, likely due to class imbalance and overfitting to frequent patterns.
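
One way to see how little the accuracy figure says is to compare it with the accuracy of a trivial classifier that always predicts the most frequent class. The sketch below is illustrative only: both helper functions are hypothetical, and the inputs are copied from the Random Forest output above (class support and confusion matrix, laid out as `[[TN, FP], [FN, TP]]` as scikit-learn prints it).

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of a classifier that always predicts the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


def accuracy_from_confusion(cm):
    """Accuracy from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    return (tn + tp) / (tn + fp + fn + tp)


# Values taken from the Random Forest evaluation output above.
test_support = {0: 17665, 1: 2234}
rf_confusion = [[17660, 5], [2230, 4]]

baseline = majority_baseline_accuracy(test_support)
rf_accuracy = accuracy_from_confusion(rf_confusion)
print(f"baseline = {baseline:.4f}, random forest = {rf_accuracy:.4f}")
```

With these numbers the Random Forest gets 17,664 cases right, one fewer than the 17,665 the trivial baseline would get.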

#### Why This Matters for XAI:

This result underscores a core problem in trust and model validation:
> A model may appear performant on paper (high accuracy), yet completely fail in the scenarios that matter most — and **traditional metrics won’t explain why**.

This behavior now becomes a case for **Stage 4**, where SHAP will help uncover:
- What features the model *is* using (and overusing)
- What signals it *misses*
- Whether any proxy variables (e.g., race, admission type) dominate decision logic unfairly

This failure is not a setback — it's the **exact justification for explainable validation** in high-stakes domains.


### Step 4: Train XGBoost Classifier

XGBoost is a powerful gradient-boosted tree ensemble method known for its ability to model complex feature interactions and handle imbalanced datasets effectively. It also integrates seamlessly with SHAP for global and local explanation, making it an excellent candidate for comparison in our framework.
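
The cell below sets `scale_pos_weight=8`. A common heuristic for this parameter is the ratio of negative to positive training examples. The sketch below shows how such a value could be derived from the labels instead of being hard-coded; `suggested_scale_pos_weight` is a hypothetical helper, and the demo labels are synthesized from the test-set class sizes reported earlier, on the assumption that the training split has a similar class ratio.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive examples for binary 0/1 labels."""
    negatives = sum(1 for value in labels if int(value) == 0)
    positives = sum(1 for value in labels if int(value) == 1)
    if positives == 0:
        raise ValueError("No positive examples: the ratio is undefined.")
    return negatives / positives


# Synthetic stand-in for y_train (assumption: similar class ratio to the test set).
demo_labels = [0] * 17665 + [1] * 2234
weight = suggested_scale_pos_weight(demo_labels)
print(f"suggested scale_pos_weight = {weight:.2f}")
```

Under that assumption the heuristic lands close to the value of 8 used in the notebook.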


In [9]:
from xgboost import XGBClassifier
import joblib

# Initialize XGBoost
xgb_model = XGBClassifier(
    eval_metric='logloss',
    scale_pos_weight=8,  # upweight the positive class to offset class imbalance
    random_state=42
)

# Train the model
xgb_model.fit(X_train, y_train)

# Save the trained model
joblib.dump(xgb_model, '../models/xgboost.pkl')

print("✅ XGBoost model trained and saved.")


✅ XGBoost model trained and saved.


### Step 5: Evaluate XGBoost Model

We now assess the performance of the XGBoost classifier, a gradient-boosted decision tree model known for robustness on structured data. This evaluation provides a direct comparison to the Random Forest model and helps us understand if a boosting-based method better captures the nuances in this imbalanced dataset.

We focus on:
- **Classification Report** for precision, recall, F1-score
- **Confusion Matrix** to analyze false positives/negatives
- **ROC AUC** for overall separability of the classes


In [10]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Predict on test set
y_pred_xgb = xgb_model.predict(X_test)
y_proba_xgb = xgb_model.predict_proba(X_test)[:, 1]

# Evaluation metrics
print("📊 Classification Report (XGBoost):")
print(classification_report(y_test, y_pred_xgb, digits=4))

print("📈 Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_xgb))

print("🔵 ROC AUC Score:", round(roc_auc_score(y_test, y_proba_xgb), 4))


📊 Classification Report (XGBoost):
              precision    recall  f1-score   support

           0     0.9183    0.7219    0.8083     17665
           1     0.1828    0.4919    0.2666      2234

    accuracy                         0.6961     19899
   macro avg     0.5505    0.6069    0.5374     19899
weighted avg     0.8357    0.6961    0.7475     19899

📈 Confusion Matrix:
[[12752  4913]
 [ 1135  1099]]
🔵 ROC AUC Score: 0.6515


### ✅ Evaluation Summary: XGBoost Model

The XGBoost classifier demonstrates significantly different behavior from the earlier Random Forest, especially in handling the minority class (patients readmitted within 30 days):

#### 🔹 Improved Recall for Minority Class (Positive)

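- Recall for class `1` (`<30` readmission) rose to **~49.2%**, up from **~0.2%** in the Random Forest model.
- XGBoost therefore flags far more of the patients at risk of early readmission, a **clinically vital** outcome in high-stakes healthcare settings. Because the two models' ROC AUC values are nearly identical, most of this gain likely comes from `scale_pos_weight` shifting the decision boundary toward the positive class rather than from better ranking of patients.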

#### 🔹 Tradeoff in Precision

- Precision for the positive class dropped to **~18%**, meaning roughly four of every five flagged patients are false alarms.
- However, in healthcare, **high recall is often prioritized over precision**: it is usually better to flag potential readmissions than to miss them.

#### 🔹 Overall Performance

- **Accuracy:** ~69.6%  
- **ROC AUC:** **0.6515**, only marginally higher than Random Forest (**0.6427**), so overall class separability is essentially unchanged.

#### 🔹 Confusion Matrix Insights

- XGBoost correctly classified **1,099** out of **2,234** positive cases, unlike Random Forest which only captured **4**.
- However, it incorrectly flagged **4,913** negative cases as positive.

#### 🔹 Interpretability Implications

- This result makes **XGBoost a prime candidate for SHAP-based explainability**: unlike the Random Forest, it produces enough positive predictions (6,012) to analyze both correct flags and false alarms.
- We will later use SHAP to inspect whether its positive predictions rest on **valid clinical signals** or on **data artifacts**.
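
The precision/recall tradeoff discussed above depends on the decision threshold applied to the predicted probabilities; the default `predict` call corresponds to a cutoff of 0.5. The following sketch, which is not part of the original notebook, shows how precision and recall could be tabulated across several cutoffs. `sweep_thresholds` is a hypothetical helper and the toy labels and scores are made up; in the notebook one would pass `y_test` and `y_proba_xgb`.

```python
def sweep_thresholds(y_true, y_score, thresholds):
    """Return (threshold, precision, recall) for the positive class at each cutoff."""
    rows = []
    for threshold in thresholds:
        tp = fp = fn = 0
        for truth, score in zip(y_true, y_score):
            predicted = 1 if score >= threshold else 0
            if predicted == 1 and truth == 1:
                tp += 1
            elif predicted == 1 and truth == 0:
                fp += 1
            elif predicted == 0 and truth == 1:
                fn += 1
        # Report 0.0 when a ratio is undefined (no predicted or no actual positives).
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        rows.append((threshold, precision, recall))
    return rows


# Toy data for illustration only.
toy_truth = [0, 0, 1, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.6]
for threshold, precision, recall in sweep_thresholds(toy_truth, toy_scores, [0.3, 0.5]):
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Lowering the cutoff raises recall at the cost of precision, which is the same direction of change that a larger `scale_pos_weight` tends to produce.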


### Step 6: Train MLP Neural Network

In this step, we train a **Multilayer Perceptron (MLP)** neural network using scikit-learn's `MLPClassifier`. Like the tree ensembles above, neural networks can capture **nonlinear relationships** and **interactions** in the data, but they do so through learned continuous transformations rather than axis-aligned splits. They are less interpretable out-of-the-box, so they provide an important contrast for later **SHAP-based analysis**. Note that this configuration applies no class weighting, unlike the two tree-based models.

This MLP configuration includes:

- Two hidden layers with **64** and **32** neurons respectively.  
- **ReLU** activation function.  
- **Adam** optimizer.  
- A maximum of **100 training iterations**.
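
To get a feel for the size of this network, the sketch below counts its trainable parameters (weights plus biases). It is illustrative only: `mlp_parameter_count` is a hypothetical helper, the input width of 44 comes from the data shapes printed in Step 1, and a single output unit is assumed for the binary target.

```python
def mlp_parameter_count(n_features, hidden_layer_sizes, n_outputs=1):
    """Count weights and biases of a fully connected feed-forward network."""
    layer_sizes = [n_features, *hidden_layer_sizes, n_outputs]
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total


# 44 input features, hidden layers of 64 and 32 units, one output unit (assumed).
n_params = mlp_parameter_count(n_features=44, hidden_layer_sizes=(64, 32))
print(n_params)
```

Under these assumptions the network has 4,993 parameters, which is small relative to the 79,593 training rows.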


In [15]:
from sklearn.neural_network import MLPClassifier
import joblib

# Initialize and train MLP model
mlp_model = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    activation='relu',
    solver='adam',
    max_iter=100,
    random_state=42
)

mlp_model.fit(X_train, y_train)

# Save the trained MLP model
joblib.dump(mlp_model, '../models/mlp_model.pkl')




['../models/mlp_model.pkl']

### Step 7: Evaluate MLP Model

We now assess the performance of our trained MLP model on the test set using standard classification metrics and ROC AUC score. This ensures consistent comparison across all models.



In [16]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Predict on test set
y_pred_mlp = mlp_model.predict(X_test)
y_proba_mlp = mlp_model.predict_proba(X_test)[:, 1]

# Evaluation metrics
print("📊 Classification Report (MLP):")
print(classification_report(y_test, y_pred_mlp))

print("📈 Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_mlp))

print("🔵 ROC AUC Score:", round(roc_auc_score(y_test, y_proba_mlp), 4))


📊 Classification Report (MLP):
              precision    recall  f1-score   support

           0       0.89      1.00      0.94     17665
           1       0.41      0.01      0.02      2234

    accuracy                           0.89     19899
   macro avg       0.65      0.50      0.48     19899
weighted avg       0.83      0.89      0.84     19899

📈 Confusion Matrix:
[[17630    35]
 [ 2210    24]]
🔵 ROC AUC Score: 0.6318


### ✅ Evaluation Summary: MLP Neural Network

The MLP classifier shows similar overall behavior to the Random Forest but falls short of the improvements seen with XGBoost, particularly on the minority class.


#### 🔹 Apparent Strengths:
- **Accuracy**: ~88.7%, which is no better than always predicting the majority class (`readmitted = 0`, ~88.8% of the test set).
- **Precision (class 0)**: 0.89, which is close to the majority-class base rate, so it reflects the class distribution more than learned skill.


#### 🔹 Weaknesses:
- **Poor minority class recall**: Only **~1%** of the patients actually readmitted within 30 days were correctly identified.
- **Low ROC AUC**: At **0.6318**, this indicates poor class separation and little learned discriminatory power for the positive class.
- **False Negatives**: 2,210 patients predicted as not readmitted when they actually were — a major clinical concern.


#### 🔹 Interpretation:
Despite the MLP's nonlinear modeling ability, it struggles with the **severe class imbalance**, which this configuration does nothing to counteract (no class weighting or resampling is applied). The lack of positive class sensitivity makes it unsuitable as-is for high-stakes clinical tasks, though it is still useful as a contrast during SHAP analysis to test how differently MLPs learn compared to tree-based models.


## ✅ Stage 3 Summary: Model Training & Evaluation

In this stage, we trained and evaluated three different classifiers — **Random Forest**, **XGBoost**, and **Multilayer Perceptron (MLP)** — on the processed hospital readmission dataset. The primary objective was not only to assess predictive power but also to set up a robust foundation for subsequent **XAI-driven analysis** that can reveal **hidden data quality issues, biases, and model dependencies**.

---

### 🔢 Performance Comparison

| Metric                  | Random Forest | XGBoost    | MLP (Neural Net) |
|-------------------------|---------------|------------|------------------|
| **Accuracy**            | 88.8%         | 69.6%      | 88.7%            |
| **Recall (Class 1)**    | 0.18%         | 49.2%      | 1.1%             |
| **Precision (Class 1)** | 44.4%         | 18.3%      | 40.7%            |
| **F1-Score (Class 1)**  | 0.36%         | 26.7%      | 2.1%             |
| **ROC AUC**             | 0.6427        | **0.6515** | 0.6318           |
| **TP (Class 1)**        | 4             | **1,099**  | 24               |
| **FP (Class 1)**        | 5             | 4,913      | 35               |
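
The comparison figures can be recomputed directly from the three confusion matrices printed earlier, which avoids rounding drift from reports printed at two decimals. The sketch below is a hedged illustration rather than part of the original notebook: `class1_metrics` is a hypothetical helper, and the matrices are copied from the evaluation outputs in the `[[TN, FP], [FN, TP]]` layout that scikit-learn prints.

```python
def class1_metrics(cm):
    """Accuracy plus positive-class precision, recall and F1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
    }


# Confusion matrices copied from the evaluation outputs in this stage.
confusion_matrices = {
    "Random Forest": [[17660, 5], [2230, 4]],
    "XGBoost": [[12752, 4913], [1135, 1099]],
    "MLP": [[17630, 35], [2210, 24]],
}

summary = {name: class1_metrics(cm) for name, cm in confusion_matrices.items()}
for name, metrics in summary.items():
    formatted = ", ".join(f"{key}={value * 100:.2f}%" for key, value in metrics.items())
    print(f"{name}: {formatted}")
```

Deriving the table this way keeps every row traceable to a printed confusion matrix.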

---

### 📌 Interpretation & Alignment with XAI-Driven Validation Goals

#### 1. **Random Forest:**
- Accuracy and majority-class precision look strong, but both simply mirror the class distribution.
- Fails completely at identifying early readmissions (**Recall: 0.18%**), making it **clinically unreliable**.
- This model may appear performant on surface metrics while hiding its failure on minority prediction: a classic example of **false confidence**, and a natural target for SHAP analysis to investigate **why**.

#### 2. **XGBoost:**
- Shows the **highest sensitivity** of the three models, correctly identifying nearly **50% of early readmissions**.
- Accepts a precision tradeoff (18.3% precision), which in healthcare can be **justifiable** to avoid missing high-risk patients.
- Best candidate for **explainability**: it is the only model with enough positive predictions for SHAP to contrast correct flags with false alarms.
- Ideal for validating the framework's **core claim** that XAI can surface hidden model behaviors and implicit data flaws.

#### 3. **Multilayer Perceptron (MLP):**
- Accuracy (~88.7%) sits at the majority-class baseline, and the model suffers the same problem as Random Forest: poor sensitivity to the minority class.
- Its **black-box nature** reinforces the need for **explainability techniques** like SHAP to determine what, if anything, it has learned about readmission risk.
- Useful as a neural baseline but not suitable for high-stakes settings without deeper interpretation.

---

### 🎯 Conclusion

This modeling phase illustrates that traditional performance metrics can be **misleading**, especially in imbalanced, high-risk scenarios like hospital readmission prediction. XGBoost, despite lower accuracy, is the only model that flags a meaningful share of actual readmissions, making it a strong candidate for explainability. Whether those flags rest on **clinically relevant signals** is a question for the next stage.

This sets the stage for **Stage 4: SHAP-based Explainability**, where we will:
- Investigate **which features** drive model predictions.
- Identify **data segments** or attributes disproportionately affecting predictions.
- Detect **anomalies**, **biases**, or **proxy features** that would be invisible in traditional validation.

By aligning model outputs with explainable insights, we validate our central thesis: **XAI is not just for model interpretation but is a critical component for proactive, trust-centric data validation**.
