## 🛠️ Stage 2: Data Preprocessing for XAI-Compatible Modeling

This preprocessing stage transforms raw hospital encounter data into a structured format ready for machine learning, while intentionally preserving features and patterns that XAI tools can later interrogate. Rather than aggressively "cleaning" the data to perfection, we apply *strategic preparation* that supports both modeling and explainability.

Key actions and their relevance:

1. **Identifier Removal:**  
   Dropping `encounter_id` and `patient_nbr` prevents leakage of non-predictive, uniquely identifying information into the model, ensuring that predictions are driven by clinical or contextual features — not row IDs.

2. **Missing Value Normalization:**  
   We replaced all `'?'` placeholders with proper `NaN` values. This is essential because most validation or imputation logic depends on standard missing value representations. It also allows SHAP to later reveal if imputed or missing features contribute disproportionately to predictions.

3. **Target Mapping (`readmitted_binary`):**  
   The original `readmitted` target has three classes (`NO`, `>30`, `<30`). We collapse it to a binary label: **readmitted within 30 days** (1) or not (0), where "not" covers both `NO` and `>30`. This framing is both clinically relevant and consistent with the literature.

4. **Invalid Demographic Cleanup:**  
   Rows with unknown gender or missing race were removed. This avoids introducing ambiguous demographic signals that could complicate both model fairness and interpretation through XAI later.

5. **Selective Feature Dropping:**  
   Features with high missingness (roughly 40-97% missing: `weight`, `payer_code`, `medical_specialty`) were dropped. While these may be interesting, their sparsity could create misleading patterns during model training and SHAP analysis.

6. **Label Encoding for Categorical Features:**  
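   All string-based categorical features were label encoded. Label encoding assigns each category an arbitrary integer code (sorted order in scikit-learn's `LabelEncoder`), so it does not encode any true ordinal relationship. Tree-based models (e.g., Random Forest, XGBoost) tolerate this because they split on thresholds, and keeping one column per feature keeps SHAP attributions at the feature level. The fitted encoders are saved so codes can be mapped back to the original labels.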

7. **Stratified Train-Test Split:**  
   We split the dataset using stratified sampling on the binary readmission label. This preserves the original imbalance and ensures a representative test set for evaluation and explanation.

This stage sets the foundation for interpretable modeling. By cleaning with intent — rather than aggressively eliminating all irregularities — we leave room for XAI tools to detect subtle biases, outliers, and inconsistencies later on.


### Loading the dataset

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv('../data/diabetic_data.csv')

# Preview top rows
display(df.head())


| | encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | ? | 6 | 25 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | ? | 1 | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ... | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | ? | 1 | 1 | 7 | 2 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |


### Step 1: Drop Identifiers

`encounter_id` and `patient_nbr` are identifiers, not features. Including them could lead to data leakage or distort model learning.

We drop:
- `encounter_id` (unique per visit)
- `patient_nbr` (can be used later for longitudinal grouping but excluded for now)
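
Since `patient_nbr` may matter later for longitudinal grouping, it can be useful to check how often patients recur before the column is dropped. The following is a minimal sketch on a made-up toy frame (the IDs are illustrative, not from the dataset); the variable names are assumptions introduced here.

```python
import pandas as pd

# Toy encounters; IDs are made up for illustration.
toy = pd.DataFrame({
    'encounter_id': [101, 102, 103, 104, 105],
    'patient_nbr': [1, 2, 1, 3, 1],
})

# encounter_id is expected to be unique per row.
encounters_unique = toy['encounter_id'].is_unique

# Number of encounters per patient, and how many patients appear more than once.
visits = toy['patient_nbr'].value_counts()
repeat_patients = int((visits > 1).sum())

print(encounters_unique, repeat_patients)
```

On the real data, the same two checks would run on `df` before the drop in the next cell.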


In [2]:
df = df.drop(columns=['encounter_id', 'patient_nbr'])

### Step 2: Replace '?' with NaN

Many features use `'?'` as a placeholder for missing data, which pandas does not recognize as null. We replace all such instances with `np.nan` so that `isna()`, imputation, and other downstream steps handle them correctly.


In [3]:
import numpy as np

df.replace('?', np.nan, inplace=True)
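
As an alternative, `pd.read_csv` accepts an `na_values` argument, so the placeholder could also be converted while loading. The sketch below uses a tiny made-up CSV string (not rows from the dataset) to check that both routes flag the same cells as missing.

```python
import io

import numpy as np
import pandas as pd

# Tiny illustrative CSV; these rows are made up, not taken from diabetic_data.csv.
raw = "race,gender,weight\nCaucasian,Female,?\n?,Male,[75-100)\n"

# Route 1: load as-is, then replace the placeholder (the approach used in this notebook).
replaced = pd.read_csv(io.StringIO(raw)).replace('?', np.nan)

# Route 2: let pandas treat '?' as missing while parsing.
at_load = pd.read_csv(io.StringIO(raw), na_values='?')

# Both routes should mark the same cells as missing.
same_mask = replaced.isna().equals(at_load.isna())
n_missing = int(at_load.isna().sum().sum())

print(same_mask, n_missing)
```

Either route works; replacing after load keeps the raw preview above unchanged, which is why the `'?'` values are still visible there.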


### Step 3: Filter Ambiguous or Invalid Entries

We remove:
- Rows where gender is `Unknown/Invalid`
- Rows with missing race (to ensure cleaner bias analysis later)


In [4]:
# Remove ambiguous gender
df = df[df['gender'] != 'Unknown/Invalid']

# Remove rows with missing race
df = df[df['race'].notna()]


### Step 4: Encode Target Variable

We define a binary target:
- `1` → Readmitted **within 30 days** (`'<30'`)
- `0` → Not readmitted or readmitted after 30 days (`'NO'`, `'>30'`)


In [5]:
# 1 if readmitted within 30 days ('<30'), 0 otherwise ('NO' or '>30')
df['readmitted_binary'] = (df['readmitted'] == '<30').astype(int)
df['readmitted_binary'].value_counts()


readmitted_binary
0    88323
1    11169
Name: count, dtype: int64
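
Note that the mapping above sends every value other than `'<30'` to 0, including any unexpected label. A stricter variant, sketched here on a toy Series, maps through an explicit dictionary so that an unknown label surfaces as `NaN` instead of silently becoming a negative. The `TARGET_MAP` name is introduced for illustration only.

```python
import pandas as pd

# Explicit mapping of the three original classes (name is illustrative).
TARGET_MAP = {'<30': 1, '>30': 0, 'NO': 0}

# Toy labels standing in for df['readmitted'].
readmitted = pd.Series(['NO', '>30', '<30', 'NO'])

binary = readmitted.map(TARGET_MAP)

# Any label missing from TARGET_MAP would show up as NaN here.
unmapped = int(binary.isna().sum())

# Safe to cast only when nothing was left unmapped.
binary = binary.astype(int)

print(binary.tolist(), unmapped)
```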

### Step 5: Handle High-Missing Columns

We drop `weight`, `payer_code`, and `medical_specialty` due to high missing values (~40-97%) that could distort model learning. These features may be revisited later if needed for advanced imputation/XAI.


In [6]:
df.drop(columns=['weight', 'payer_code', 'medical_specialty'], inplace=True)
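
The three columns above were chosen by hand. For reference, a per-column missing fraction can be computed directly, as in this minimal sketch on a made-up toy frame. The `THRESHOLD` value is a hypothetical cutoff for illustration, not a rule used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame; values are illustrative only.
toy = pd.DataFrame({
    'weight': [np.nan, np.nan, np.nan, '[75-100)'],
    'race': ['Caucasian', np.nan, 'Asian', 'Hispanic'],
    'age': ['[0-10)', '[10-20)', '[20-30)', '[30-40)'],
})

# Fraction of missing values per column, highest first.
missing_frac = toy.isna().mean().sort_values(ascending=False)

# Hypothetical cutoff; columns above it become candidates for dropping.
THRESHOLD = 0.4
to_drop = missing_frac[missing_frac > THRESHOLD].index.tolist()

print(missing_frac)
print(to_drop)
```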


### Step 6: Encode Categorical Variables

To ensure model compatibility, we encode all remaining object-type columns using Label Encoding.

Note:
- One-Hot Encoding could explode dimensionality (too many categories)
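- Label encoding keeps one column per feature, so SHAP attributions stay at the feature level. The integer codes are arbitrary, which tree models tolerate but which a linear model would misread as an ordering.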


In [7]:
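from sklearn.preprocessing import LabelEncoder

# All remaining string columns except the original (unencoded) target
cat_cols = df.select_dtypes(include='object').columns.drop('readmitted')
label_encoders = {}

for col in cat_cols:
    le = LabelEncoder()
    # astype(str) turns NaN into the string 'nan', so missing values get their own code
    df[col] = le.fit_transform(df[col].astype(str))
    # Keep the fitted encoder so codes can be mapped back to labels later
    label_encoders[col] = le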
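
To make the encoding behavior concrete, the sketch below fits a `LabelEncoder` on a small made-up column. It fills the missing value with the string `'nan'` explicitly so the result does not depend on how a given pandas version handles `astype(str)`; that fill step is an assumption of this sketch, not part of the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with one missing value; values are illustrative.
col = pd.Series(['Up', 'No', np.nan, 'Steady', 'Down'])

# Make the missing value an explicit category before encoding.
filled = col.fillna('nan').astype(str)

le = LabelEncoder()
codes = le.fit_transform(filled)

# classes_ is sorted, so the codes follow string sort order, not clinical meaning.
classes = list(le.classes_)

# inverse_transform recovers the original strings, e.g. for labelling SHAP plots.
recovered = le.inverse_transform(codes).tolist()

print(classes)
print(codes.tolist())
print(recovered)
```

The integer assigned to a category therefore carries no magnitude; only the saved encoder gives it meaning.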


### Step 7: Split Data for Modeling

We split into training and testing sets using stratified sampling to preserve the original distribution of the target variable (`readmitted_binary`).


In [8]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=['readmitted', 'readmitted_binary'])
y = df['readmitted_binary']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
print("Positive class rate in train:", y_train.mean())
print("Positive class rate in test:", y_test.mean())


Train shape: (79593, 44)
Test shape: (19899, 44)
Positive class rate in train: 0.11225861570741145
Positive class rate in test: 0.11226694808784361


### Class Balance & Data Split Interpretation

The target variable `readmitted_binary` shows a significant **class imbalance**:
- 88,323 instances (≈ 89%) belong to the **negative class** (`0`: not readmitted within 30 days).
- 11,169 instances (≈ 11%) belong to the **positive class** (`1`: readmitted within 30 days).

This imbalance is clinically consistent — most patients are not readmitted within 30 days — but poses a **modeling challenge**, as naive classifiers may favor the majority class.

Stratified splitting does not correct this imbalance; it keeps the imbalance identical in both sets so that evaluation and explanation reflect the real class mix. The resulting split:
- Training set: 79,593 samples
- Test set: 19,899 samples
- Class proportions are preserved across both sets (~11.2% positive rate)

Maintaining class distribution across splits is **essential for fair model evaluation** and for XAI to uncover insights across both frequent and rare cases (e.g., why a rare 30-day readmission is predicted).
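
To see why plain accuracy is a weak yardstick here, the short calculation below uses the class counts reported by `value_counts()` earlier to work out what a classifier that always predicts "not readmitted" would score. It is a worked arithmetic example, not a model evaluation.

```python
# Class counts reported by value_counts() earlier in this notebook.
n_negative = 88323
n_positive = 11169
n_total = n_negative + n_positive

# Accuracy of a classifier that always predicts class 0.
majority_accuracy = n_negative / n_total

# Such a classifier finds none of the true 30-day readmissions.
majority_recall = 0 / n_positive

print(f"total: {n_total}")
print(f"majority-class accuracy: {majority_accuracy:.3f}")
print(f"recall on positive class: {majority_recall:.3f}")
```

Any model in Stage 3 should be judged against this baseline, with attention to metrics that focus on the positive class.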


### 🔹 1. Cleaned & Encoded Full Dataset

In [9]:
df.to_csv('../data/cleaned_diabetes.csv', index=False)

### 🔹 2. Train/Test Splits
We save the final `X_train`, `X_test`, `y_train`, and `y_test` as `.pkl` files with joblib so Stage 3 can load them directly.

In [10]:
import joblib

joblib.dump(X_train, '../data/X_train.pkl')
joblib.dump(X_test, '../data/X_test.pkl')
joblib.dump(y_train, '../data/y_train.pkl')
joblib.dump(y_test, '../data/y_test.pkl')


['../data/y_test.pkl']
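
For reference, a saved object can be restored with `joblib.load`. The sketch below runs a round trip on a small made-up frame in a temporary directory, so it does not touch the files written above; the frame and file name are illustrative.

```python
import os
import tempfile

import joblib
import pandas as pd

# Toy frame standing in for X_train; values are illustrative.
X_toy = pd.DataFrame({'time_in_hospital': [1, 3, 2], 'insulin': [0, 2, 0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'X_toy.pkl')
    joblib.dump(X_toy, path)       # returns the list of written file names
    restored = joblib.load(path)   # load while the temporary file still exists

# The restored frame should match the original, including dtypes and index.
round_trip_ok = restored.equals(X_toy)

print(round_trip_ok)
```

Stage 3 would call `joblib.load` in the same way on the paths saved in the cell above.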

### 🔹 3. Label Encoders (if needed in SHAP explanations)
We also save the `label_encoders` dictionary so that encoded categorical values can be reverse-transformed to their original labels in SHAP outputs.

In [11]:
joblib.dump(label_encoders, '../data/label_encoders.pkl')

['../data/label_encoders.pkl']

### Overall Conclusion – Stage 2: Data Preprocessing

This stage successfully transformed the raw diabetes hospitalization dataset into a structured, model-ready format, while **intentionally preserving signals of data quality and potential bias** for later explainability analysis.

Key outcomes include:

- ✅ **Identifier features** were dropped to prevent leakage.
- ✅ **Placeholder values** (`'?'`) were normalized to `NaN`, allowing proper handling and future interpretability.
- ✅ **Target variable (`readmitted`)** was converted to a binary outcome: predicting whether a patient will be readmitted within 30 days — a clinically meaningful and high-stakes decision.
- ✅ **Demographic filters** (e.g., removing unknown gender) ensured that fairness analysis will not be contaminated by invalid data.
- ✅ **Categorical variables** were label encoded, and the fitted encoders were saved so that SHAP outputs can be mapped back to the original category labels.
- ✅ **Train-test split** was performed using stratified sampling to retain the original class imbalance across both sets, ensuring fair and trustworthy model evaluation.

This preprocessing approach reflects a core tenet of our framework: rather than over-cleaning the data, we **strategically structure it** so that **explainable AI techniques can later uncover what traditional validation might miss** — including subtle patterns of bias, inconsistency, and trust degradation.

With the data now partitioned and cleaned, we proceed to Stage 3: model training — the first point where we’ll observe how these decisions affect predictive performance.
