# Cardiovascular Disease Risk Prediction: Data Preprocessing Pipeline

## Executive Summary

This notebook implements a **rigorous, clinical-grade data preprocessing pipeline** for cardiovascular disease risk stratification. The workflow transforms raw clinical data into **model-ready artifacts** suitable for reproducible machine learning research and HPC deployment.

### Core Principles

1. **No Data Leakage**: Transformers are fit on training data only; test set remains truly "unseen"
2. **Clinical Validity**: Stratified splitting preserves the dataset's disease prevalence (~8%) in both the train and test sets
3. **Type-Specific Preprocessing**: Different feature types receive appropriate transformations
4. **HPC Optimization**: Dense NumPy arrays can be passed directly to GPU-capable learners (XGBoost CUDA, CatBoost GPU)
5. **Reproducibility**: Fixed random states and explicit configuration ensure deterministic runs

### Clinical Context

Cardiovascular disease screening is a **high-stakes application** where:
- **False Negatives (FN)**: Missing an at-risk patient → disease progression, cardiac event, mortality
- **False Positives (FP)**: Unnecessary follow-up → anxiety, cost, but no clinical risk

The cost asymmetry is **severe**. We prioritize recall (minimize FN) and accept more FP as the clinical trade-off.

### Workflow Overview

| Phase | Input | Output | Purpose |
|-------|-------|--------|---------|  
| **1. Load & Split** | `HeartDisease.csv` | Train/Test (stratified 80/20) | Ensure consistent class ratios |
| **2. Analyze Imbalance** | Train/Test labels | Class distribution report | Verify stratification; confirm ~8% positive rate |
| **3. Feature Engineering** | Raw feature columns | Categorized feature groups | Prepare for type-specific transformations |
| **4. Build Pipeline** | Feature specifications | `ColumnTransformer` object | Encapsulate preprocessing logic |
| **5. Fit & Transform** | Training data | Preprocessed arrays | Learn parameters from train only |
| **6. Validate** | Transformed data | Sanity check report | Confirm zero missing values, consistent shapes |
| **7. Export** | Preprocessed arrays | Compressed `.joblib` files | HPC-ready artifacts for modeling |

### Handoff Artifacts

| File | Format | Size | Used By |
|------|--------|------|---------|  
| `X_train_ready.joblib` | Dense NumPy array | ~2–5 MB | Model training on GPU |
| `X_test_ready.joblib` | Dense NumPy array | ~0.5–1 MB | Model evaluation |
| `y_train_ready.joblib` | 1D array (0/1) | <1 MB | Classification labels |
| `y_test_ready.joblib` | 1D array (0/1) | <1 MB | Evaluation labels |
| `feature_names.joblib` | List of strings | <1 MB | SHAP/LIME interpretability |
| `preprocessor.joblib` | Fitted transformer | ~1 MB | Production inference pipelines |

---

## Step 0: Environment Configuration & Reproducibility Setup

### Software Dependencies

This preprocessing pipeline uses **only essential libraries** to minimize dependencies and maximize reproducibility:

| Library | Version | Role |
|---------|---------|------|  
| `pandas` | ≥1.3.0 | DataFrame manipulation, type handling |
| `numpy` | ≥1.21.0 | Numerical computing, array operations |
| `scikit-learn` | ≥1.2.0 | Preprocessing pipelines, transformers (`OneHotEncoder(sparse_output=...)` requires 1.2+) |
| `joblib` | ≥1.1.0 | Serialization of transformers and models |
| `pathlib` | Standard library | Cross-platform file path operations |

**Intentionally Excluded**: XGBoost, CatBoost, Optuna (reserved for modeling phase)

### Reproducibility & Determinism

To make runs **repeatable** (same split and same fitted parameters, given the same input data and library versions), we fix a single seed:

```python
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
```

This seed is propagated through stratified train-test splitting and any future cross-validation folds.
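
As a quick illustration of what the fixed seed provides, the sketch below runs the same stratified split twice and checks that both runs select identical rows. The arrays `X_demo` and `y_demo` are hypothetical toy data invented for this example, not the project dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42

# Hypothetical toy data: 1,000 rows, roughly 8% positives (not the real dataset)
rng = np.random.default_rng(0)
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = (rng.random(1000) < 0.08).astype(int)

def split_once():
    """Run one stratified 80/20 split with the fixed seed."""
    return train_test_split(
        X_demo, y_demo, test_size=0.20, random_state=RANDOM_STATE, stratify=y_demo
    )

first = split_once()
second = split_once()

# Same seed + same input + same library version -> same rows in each split
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(f"Identical splits across two runs: {same_split}")
```

The guarantee is conditional: a different scikit-learn or NumPy version, or a reordered input file, may yield a different split even with the same seed.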

In [16]:
# Core libraries for preprocessing
import pandas as pd
import numpy as np
import joblib
from pathlib import Path

# Scikit-learn preprocessing components
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Configuration
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Paths
PROCESSED_DIR = Path("processed")
PROCESSED_DIR.mkdir(exist_ok=True)

print("=" * 60)
print("PREPROCESSING NOTEBOOK INITIALIZED")
print("=" * 60)
print(f"Output directory: {PROCESSED_DIR.resolve()}")
print(f"Random state: {RANDOM_STATE}")

PREPROCESSING NOTEBOOK INITIALIZED
Output directory: C:\Users\natha\code\github\xai-cardiovascular-risk-prediction\processed
Random state: 42


## Step 1: Data Loading & Stratified Train-Test Splitting

### Why Stratification Matters

The cardiovascular dataset exhibits **severe class imbalance**: approximately **8% of patients have diagnosed heart disease**. Without stratification, a random split can leave the train and test sets with different class ratios, leading to:

- Performance estimates that depend on the luck of the split
- High variance across cross-validation folds

**Stratified splitting** ensures both train and test sets maintain the original class ratio:

$$\frac{n_+^{\text{train}}}{n^{\text{train}}} \approx \frac{n_+^{\text{test}}}{n^{\text{test}}} \approx \frac{n_+^{\text{total}}}{n^{\text{total}}} = p \approx 0.08$$
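
The relation above can be checked numerically. The following is a minimal sketch on hypothetical labels with exactly 8% positives (`X_demo` and `y_demo` are invented for the example); it compares the positive rate of each split with the overall rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 800 positives out of 10,000 rows (not the real data)
y_demo = np.zeros(10_000, dtype=int)
y_demo[:800] = 1
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

p_total = y_demo.mean()  # overall prevalence p
p_train = y_tr.mean()    # n_+^train / n^train
p_test = y_te.mean()     # n_+^test / n^test
print(f"total={p_total:.4f}  train={p_train:.4f}  test={p_test:.4f}")
```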

### Dual-Mode Loading Strategy

The code supports two input scenarios for flexibility:

1. **Pre-split Mode** (Preferred): Loads from `processed/X_train_raw.csv`, etc. when available
2. **Fallback Mode**: Loads from `HeartDisease.csv` and performs stratified split if intermediate files are missing

This design maximizes robustness and maintains reproducibility via `RANDOM_STATE=42`.

In [17]:
# Define expected split files
x_train_path = PROCESSED_DIR / "X_train_raw.csv"
x_test_path = PROCESSED_DIR / "X_test_raw.csv"
y_train_path = PROCESSED_DIR / "y_train_raw.csv"
y_test_path = PROCESSED_DIR / "y_test_raw.csv"

if all(p.exists() for p in [x_train_path, x_test_path, y_train_path, y_test_path]):
    # Load pre-split data from EDA
    X_train = pd.read_csv(x_train_path)
    X_test = pd.read_csv(x_test_path)
    y_train = pd.read_csv(y_train_path).squeeze("columns")
    y_test = pd.read_csv(y_test_path).squeeze("columns")
    print("Loaded existing stratified train/test splits from /processed")
else:
    # Fallback: create stratified split from the cleaned full dataset
    full_data_path = Path("HeartDisease.csv")
    if not full_data_path.exists():
        raise FileNotFoundError(
            "Could not find pre-split data in /processed or HeartDisease.csv in the project root."
        )

    df = pd.read_csv(full_data_path)
    y = df["Heart_Disease"]
    X = df.drop(columns=["Heart_Disease"])

    # Convert labels to numeric if needed
    if y.dtype == "object":
        y = y.map({"No": 0, "Yes": 1}).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=RANDOM_STATE, stratify=y
    )

    # Save splits for reproducibility
    X_train.to_csv(x_train_path, index=False)
    X_test.to_csv(x_test_path, index=False)
    y_train.to_csv(y_train_path, index=False)
    y_test.to_csv(y_test_path, index=False)
    print("Created and saved stratified train/test splits to /processed")

# Ensure binary targets are numeric (0/1)
if y_train.dtype == "object":
    y_train = y_train.map({"No": 0, "Yes": 1}).astype(int)
    y_test = y_test.map({"No": 0, "Yes": 1}).astype(int)

# Create Age_num if missing
def _age_category_to_num(value: str) -> float:
    if pd.isna(value):
        return np.nan
    text = str(value).strip()
    if text.lower() in {"80 or older", "80+", "80+ years"}:
        return 80.0
    if "-" in text:
        parts = text.split("-")
        try:
            low, high = float(parts[0]), float(parts[1])
            return (low + high) / 2
        except ValueError:
            return np.nan
    return np.nan

if "Age_num" not in X_train.columns and "Age_Category" in X_train.columns:
    X_train["Age_num"] = X_train["Age_Category"].apply(_age_category_to_num)
    X_test["Age_num"] = X_test["Age_Category"].apply(_age_category_to_num)
    print("Created Age_num from Age_Category")

# Display shapes for verification
print("=" * 60)
print("DATA LOADED SUCCESSFULLY")
print("=" * 60)
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape:  {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape:  {y_test.shape}")
print(f"\nTarget dtype: {y_train.dtype}")

Created and saved stratified train/test splits to /processed
Created Age_num from Age_Category
DATA LOADED SUCCESSFULLY
X_train shape: (247083, 19)
X_test shape:  (61771, 19)
y_train shape: (247083,)
y_test shape:  (61771,)

Target dtype: int64


### Data Preview

Quick sanity check to verify data loaded correctly before proceeding with transformations.

In [18]:
# Preview training features and labels
print("Training Features (first 5 rows):")
display(X_train.head())

print("\nTraining Labels (first 5 values):")
print(y_train.head())
print(f"\nLabel distribution: {dict(y_train.value_counts())}")

## Step 2: Class Imbalance Analysis

### Clinical Significance

**Cardiovascular disease screening is asymmetric**: the cost of different error types is vastly different.

| Event | Clinical Consequence | Model Trade-off | Policy |
|-------|---------------------|-----------------|--------| 
| **False Negative (FN)** | Missed patient → disease progression, MI, stroke, death | ❌ Unacceptable | Minimize via high recall (target 85%+) |
| **False Positive (FP)** | Unnecessary follow-up → anxiety, cost, no mortality risk | ✓ Acceptable trade-off | Accepted to reduce FN |

With ~8% disease prevalence, naive accuracy (predict "No disease" for all) = **92%** but sensitivity (recall) = **0%** → clinically worthless.
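
The arithmetic behind that statement is short enough to spell out. This sketch uses the test-set class counts printed in this step's class distribution output and evaluates a hypothetical model that always predicts "No disease".

```python
# Test-set class counts from the class distribution output in this step
n_negative = 56_777
n_positive = 4_994
n_total = n_negative + n_positive

# A model that always predicts "No disease":
true_negatives = n_negative  # every healthy patient is classified correctly
true_positives = 0           # no diseased patient is ever flagged

accuracy = (true_negatives + true_positives) / n_total
recall = true_positives / n_positive

print(f"Accuracy: {accuracy:.4f}")
print(f"Recall:   {recall:.4f}")
```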

### Our Strategy: Class Weighting

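We use class weighting (`class_weight='balanced'` in scikit-learn models) rather than synthetic oversampling (SMOTE) because:

- **No synthetic data**: all training examples are real patients
- **The training data is left untouched**: it keeps the true ~8% prevalence, and the imbalance is handled inside the loss function
- **Simplicity**: the weighting is a single model argument and needs no resampling step

Neither approach yields calibrated probabilities by default. SMOTE changes the training class ratio (8% → 50%) and class weighting rescales the loss to similar effect, so both push predicted probabilities above the true prevalence. Scores should be recalibrated on held-out data before they are read as absolute risk in clinical decision-making.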
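
To make the weighting concrete, the sketch below reproduces the `balanced` heuristic, `n_samples / (n_classes * count_per_class)`, by hand for the training-set class counts printed in this step's output. It is plain NumPy arithmetic for illustration and does not call scikit-learn.

```python
import numpy as np

# Training-set class counts from the class distribution output in this step
counts = np.array([227_106, 19_977])  # [class 0, class 1]
n_samples = counts.sum()
n_classes = len(counts)

# Heuristic behind class_weight='balanced'
weights = n_samples / (n_classes * counts)

print(f"Weight for class 0: {weights[0]:.3f}")
print(f"Weight for class 1: {weights[1]:.3f}")
print(f"Ratio (class 1 / class 0): {weights[1] / weights[0]:.2f}")
```

With these weights, each class contributes the same total weight to the loss (`weights * counts` is equal for both classes), which is what "balanced" means here.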

In [20]:
def display_class_distribution(y, set_name):
    """Display class counts and percentages for a target vector."""
    total = len(y)
    positive_count = y.sum()
    negative_count = total - positive_count
    positive_pct = 100 * positive_count / total
    negative_pct = 100 * negative_count / total
    
    print(f"\n{set_name}:")
    print(f"  Class 0 (No Disease): {negative_count:,} ({negative_pct:.2f}%)")
    print(f"  Class 1 (Disease):    {positive_count:,} ({positive_pct:.2f}%)")
    return positive_pct

print("=" * 60)
print("CLASS DISTRIBUTION ANALYSIS")
print("=" * 60)

train_pct = display_class_distribution(y_train, "Training Set")
test_pct = display_class_distribution(y_test, "Test Set")

# Verify stratification worked
print(f"\nStratification check: Train={train_pct:.2f}%, Test={test_pct:.2f}%")
if abs(train_pct - test_pct) < 1.0:
    print("   → Class ratios are consistent (good stratification)")

CLASS DISTRIBUTION ANALYSIS

Training Set:
  Class 0 (No Disease): 227,106 (91.91%)
  Class 1 (Disease):    19,977 (8.09%)

Test Set:
  Class 0 (No Disease): 56,777 (91.92%)
  Class 1 (Disease):    4,994 (8.08%)

Stratification check: Train=8.09%, Test=8.08%
   → Class ratios are consistent (good stratification)


## Step 3: Feature Engineering & Type Categorization

### Rationale for Feature Type Grouping

Different feature types have **inherently different distributions and meanings**. A "one-size-fits-all" preprocessing approach fails because:

1. **Numeric vs. Categorical**: Numeric features (BMI, height) have continuous ranges; categorical features (health status) have discrete categories
2. **Scaling Sensitivity**: Numeric features need scaling for convergence; categorical features need encoding for model compatibility
3. **Missing Value Patterns**: Numeric missing values are best handled by median; categorical missing values by mode

### Feature Groups

#### Numeric Features: Robust Imputation & Standardization

**Columns**: Height, Weight, BMI, Alcohol Consumption, Age (numeric), Fruit/Vegetable Consumption, Fried Potato Consumption

**Pipeline**: Median Imputation → StandardScaler

**Why Median?** Robust to outliers common in medical data (extreme BMI, unusual weight)

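**Why StandardScaler?** Z-score transformation ($z_i = \frac{x_i - \mu}{\sigma}$, with $\mu$ and $\sigma$ estimated on the training set) ensures:
- Faster gradient descent convergence
- Fair regularization penalties across features
- Comparable feature magnitudes, which helps numerical stability (tree-based models are largely insensitive to this scaling)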
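
A small worked example may help here. The sketch below uses hypothetical BMI values (invented for illustration) with one missing entry and one extreme outlier, compares the mean and the median as fill values, and then applies the z-score formula by hand.

```python
import numpy as np

# Hypothetical BMI values with one missing entry and one extreme outlier
bmi = np.array([22.0, 24.5, 27.0, np.nan, 29.5, 95.0])

mean_fill = np.nanmean(bmi)      # pulled upward by the outlier
median_fill = np.nanmedian(bmi)  # stays near the typical patient

# Median imputation, then z-score standardization
imputed = np.where(np.isnan(bmi), median_fill, bmi)
mu = imputed.mean()
sigma = imputed.std()  # population std (ddof=0), the convention StandardScaler uses
z = (imputed - mu) / sigma

print(f"Mean fill: {mean_fill:.1f}, median fill: {median_fill:.1f}")
print(f"z mean: {z.mean():.3f}, z std: {z.std():.3f}")
```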

#### Categorical Features: Mode Imputation & One-Hot Encoding

**Columns**: General_Health, Checkup, Diabetes

**Pipeline**: Mode Imputation → OneHotEncoder

**Why One-Hot?** 
- Prevents false ordinal relationships
- Works with all model types
- Explicitly represents category presence/absence

**Implementation**: `OneHotEncoder(handle_unknown='ignore', sparse_output=False)`
- `handle_unknown='ignore'`: A category not seen during fitting is encoded as all zeros for that feature instead of raising an error
- `sparse_output=False`: Dense arrays for GPU compatibility

#### Binary Features: Simple Imputation

**Columns**: Exercise, Sex, Smoking History, Skin Cancer, Depression, Arthritis, Other Cancer

**Pipeline**: Mode Imputation only (values are mapped to {0, 1} before the pipeline runs, so no scaling or encoding is needed)

In [22]:
# Define feature groups explicitly
CATEGORICAL_COLS = [
    "General_Health",
    "Checkup",
    "Diabetes",
]

NUMERIC_COLS = [
    "Height_(cm)",
    "Weight_(kg)",
    "BMI",
    "Alcohol_Consumption",
    "Fruit_Consumption",
    "Green_Vegetables_Consumption",
    "FriedPotato_Consumption",
    "Age_num",
]

BINARY_COLS = [
    "Exercise",
    "Skin_Cancer",
    "Other_Cancer",
    "Depression",
    "Arthritis",
    "Sex",
    "Smoking_History",
]

DROP_COLS = ["Age_Category"]  # Redundant with Age_num

# Validate that required columns exist
missing_cols = {
    "categorical": [col for col in CATEGORICAL_COLS if col not in X_train.columns],
    "numeric": [col for col in NUMERIC_COLS if col not in X_train.columns],
    "binary": [col for col in BINARY_COLS if col not in X_train.columns],
}

if any(missing_cols.values()):
    raise ValueError(f"Missing expected columns in X_train: {missing_cols}")

print("=" * 60)
print("FEATURE GROUP DEFINITIONS")
print("=" * 60)
print(f"Categorical columns: {len(CATEGORICAL_COLS)}")
print(f"Numeric columns:     {len(NUMERIC_COLS)}")
print(f"Binary columns:      {len(BINARY_COLS)}")
print(f"Columns to drop:     {DROP_COLS}")

FEATURE GROUP DEFINITIONS
Categorical columns: 3
Numeric columns:     8
Binary columns:      7
Columns to drop:     ['Age_Category']


### Preprocessing: Drop Redundant Columns & Normalize Binary Fields

We remove `Age_Category` since we have `Age_num` as its numeric equivalent. This prevents:
- Feature redundancy
- Multicollinearity issues in Logistic Regression
- Wasted memory during GPU training

Binary fields stored as `Yes`/`No` or `Male`/`Female` strings are then mapped to 1/0 so that the binary pipeline receives numeric input.

In [23]:
# Drop redundant columns
cols_to_drop = [col for col in DROP_COLS if col in X_train.columns]

if cols_to_drop:
    X_train = X_train.drop(columns=cols_to_drop)
    X_test = X_test.drop(columns=cols_to_drop)
    print(f"Dropped columns: {cols_to_drop}")

# Normalize binary fields to 0/1
_yes_no_map = {"Yes": 1, "No": 0}
_sex_map = {"Male": 1, "Female": 0}

for col in BINARY_COLS:
    if col not in X_train.columns:
        continue

    if X_train[col].dtype == "object":
        if set(X_train[col].dropna().unique()).issubset({"Yes", "No"}):
            X_train[col] = X_train[col].map(_yes_no_map)
            X_test[col] = X_test[col].map(_yes_no_map)
        elif set(X_train[col].dropna().unique()).issubset({"Male", "Female"}):
            X_train[col] = X_train[col].map(_sex_map)
            X_test[col] = X_test[col].map(_sex_map)

    # Ensure numeric dtype
    X_train[col] = pd.to_numeric(X_train[col], errors="coerce")
    X_test[col] = pd.to_numeric(X_test[col], errors="coerce")

print(f"Remaining features: {X_train.shape[1]}")

Dropped columns: ['Age_Category']
Remaining features: 18


## Step 4: Build Preprocessing Pipeline

### Design Philosophy: Encapsulation & Reproducibility

We use scikit-learn's `ColumnTransformer` to apply **type-specific preprocessing in a single reproducible object**:

1. **Encapsulates Logic** → All preprocessing in one fitted object
2. **Prevents Leakage** → Fit only on training data; apply to test
3. **Maintains Correspondence** → Feature-to-transformation mapping is explicit
4. **Enables Serialization** → Save via `joblib` for production
5. **Supports Composition** → Can chain with models in `Pipeline`

### Pipeline Architecture

```
ColumnTransformer (fit on X_train only)
├── Numeric Transformer (8 features)
│   ├── SimpleImputer(strategy='median')
│   └── StandardScaler()
├── Categorical Transformer (3 features)
│   ├── SimpleImputer(strategy='most_frequent')
│   └── OneHotEncoder(handle_unknown='ignore', sparse_output=False)
└── Binary Transformer (7 features)
    └── SimpleImputer(strategy='most_frequent')

Result: Dense NumPy array, GPU-compatible
```

### Key Parameters

| Setting | Effect | Rationale |
|---------|--------|-----------|
| `strategy='median'` (numeric) | Fills gaps with the training median | Robust to the extreme values common in medical data |
| `strategy='most_frequent'` (categorical, binary) | Fills gaps with the training mode | Fast, deterministic, and valid for non-numeric values |
| `sparse_output=False` | Returns a dense array | GPU-compatible; no conversion overhead |
| `handle_unknown='ignore'` | Unseen categories → all zeros | New data may contain categories absent from training |
| `remainder='drop'` | Drops columns not assigned to a transformer | Removes extraneous columns |

In [26]:
# Numeric feature pipeline
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Categorical feature pipeline
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

# Binary feature pipeline
binary_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent"))
])

# Combine into a single ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, NUMERIC_COLS),
        ("cat", categorical_pipeline, CATEGORICAL_COLS),
        ("bin", binary_pipeline, BINARY_COLS),
    ],
    remainder="drop",
    verbose_feature_names_out=True
)

print("=" * 60)
print("PREPROCESSING PIPELINE CREATED")
print("=" * 60)
print("Pipeline structure:")
print("  Numeric:     Imputer(median) → StandardScaler")
print("  Categorical: Imputer(mode) → OneHotEncoder")
print("  Binary:      Imputer(mode)")
print()
print("Pipeline is NOT yet fitted. Fitting happens in the next step.")

PREPROCESSING PIPELINE CREATED
Pipeline structure:
  Numeric:     Imputer(median) → StandardScaler
  Categorical: Imputer(mode) → OneHotEncoder
  Binary:      Imputer(mode)

Pipeline is NOT yet fitted. Fitting happens in the next step.


## Step 5: Fit on Training Data & Transform Both Sets

### Critical Principle: Data Leakage Prevention

**Data leakage** occurs when information from the test set influences the training process, leading to overoptimistic performance estimates.

#### How to Prevent Leakage

Let $\theta$ denote learned parameters (μ, σ, categories, modes).

**Correct Workflow:**

$$\theta \leftarrow \text{fit}(X_{\text{train}}) \quad \text{[Learn from training data only]}$$

$$X'_{\text{train}} = T(X_{\text{train}}; \theta)$$

$$X'_{\text{test}} = T(X_{\text{test}}; \theta) \quad \text{[Transform with SAME parameters]}$$

**The Key**: $\theta$ depends ONLY on $X_{\text{train}}$, not on $X_{\text{test}}$.

### Implementation

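scikit-learn's separate `fit` and `transform` methods make this discipline explicit, although it remains the caller's job to pass only training data to `fit`: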

```python
# CORRECT:
preprocessor.fit(X_train)                      # Learn params from training only
X_train_pre = preprocessor.transform(X_train)  # Apply to training
X_test_pre = preprocessor.transform(X_test)    # Apply same params to test

# WRONG (causes leakage):
preprocessor.fit(X_train_and_test)  # NEVER DO THIS
```
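
To see what leakage does to the learned parameters $\theta$, the sketch below fits a `StandardScaler` twice on hypothetical single-feature data: once on the training rows only, and once on training and test rows together. The arrays are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data; the test set contains a much larger value
X_train_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test_demo = np.array([[10.0]])

# Correct: parameters come from the training rows only
scaler_ok = StandardScaler().fit(X_train_demo)

# Leaky: the test row shifts the learned mean and scale
scaler_leaky = StandardScaler().fit(np.vstack([X_train_demo, X_test_demo]))

print(f"Train-only mean: {scaler_ok.mean_[0]:.2f}")
print(f"Leaky mean:      {scaler_leaky.mean_[0]:.2f}")
```

The leaky scaler has already "seen" the test value, so any evaluation on that row is no longer an evaluation on unseen data.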

In [27]:
print("Fitting preprocessor on training data...")

# FIT on training data (learns parameters)
# TRANSFORM training data
X_train_pre = preprocessor.fit_transform(X_train)

# TRANSFORM test data (using parameters learned from training)
X_test_pre = preprocessor.transform(X_test)

# Extract feature names after transformation
feature_names = list(preprocessor.get_feature_names_out())

print("=" * 60)
print("PREPROCESSING COMPLETE")
print("=" * 60)
print(f"X_train_pre shape: {X_train_pre.shape}")
print(f"X_test_pre shape:  {X_test_pre.shape}")
print(f"Number of features (after OHE): {len(feature_names)}")
print(f"Output type: {type(X_train_pre).__name__}")

Fitting preprocessor on training data...
PREPROCESSING COMPLETE
X_train_pre shape: (247083, 29)
X_test_pre shape:  (61771, 29)
Number of features (after OHE): 29
Output type: ndarray


## Step 6: Validation & Sanity Checks

### Purpose: Silent Corruption Detection

Before exporting, we validate that preprocessing has **not introduced silent errors**:

| Check | Test | Why It Matters |
|-------|------|----------------| 
| **No Missing Values** | `np.isnan(X_train_pre).sum() == 0` | Imputation succeeded |
| **Shape Consistency** | `X_train_pre.shape[1] == X_test_pre.shape[1]` | Train/test alignment |
| **Sample Preservation** | `X_train_pre.shape[0] == X_train.shape[0]` | No data loss |
| **Stratification Integrity** | `abs(train_pct - test_pct) < 1.0` (percentage points) | Class ratio preserved |
| **Type Consistency** | `isinstance(X_train_pre, np.ndarray)` | GPU-compatible format |
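
One way to make all five checks fail loudly is to collect them into a single function built on `assert`. The function name `validate_preprocessed` and the toy arrays are hypothetical; this is a sketch of the idea rather than the notebook's own validation cell.

```python
import numpy as np

def validate_preprocessed(X_tr, X_te, n_train_rows, n_test_rows, y_tr, y_te, tol_pct=1.0):
    """Raise AssertionError if any sanity check from the table fails."""
    assert isinstance(X_tr, np.ndarray) and isinstance(X_te, np.ndarray), "not ndarrays"
    assert not np.isnan(X_tr).any() and not np.isnan(X_te).any(), "NaN after imputation"
    assert X_tr.shape[1] == X_te.shape[1], "train/test feature count mismatch"
    assert X_tr.shape[0] == n_train_rows and X_te.shape[0] == n_test_rows, "rows lost"
    gap = abs(np.mean(y_tr) - np.mean(y_te)) * 100  # gap in percentage points
    assert gap < tol_pct, f"class-ratio gap {gap:.2f} pp exceeds {tol_pct} pp"
    return True

# Hypothetical toy arrays standing in for X_train_pre / X_test_pre
X_tr_demo = np.zeros((100, 3))
X_te_demo = np.zeros((25, 3))
y_tr_demo = np.array([1] * 8 + [0] * 92)
y_te_demo = np.array([1] * 2 + [0] * 23)

ok = validate_preprocessed(X_tr_demo, X_te_demo, 100, 25, y_tr_demo, y_te_demo)
print(f"All checks passed: {ok}")
```

Raising instead of printing a warning stops the notebook before a corrupted array can be exported.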

In [29]:
print("=" * 60)
print("SANITY CHECKS")
print("=" * 60)

# Check 1: No NaN values remain
nan_train = np.isnan(X_train_pre).sum()
nan_test = np.isnan(X_test_pre).sum()
print(f"\nNaN values in X_train_pre: {nan_train}")
print(f"NaN values in X_test_pre:  {nan_test}")

if nan_train == 0 and nan_test == 0:
    print("  → All missing values successfully imputed!")
else:
    print("WARNING: NaN values detected after preprocessing!")

# Check 2: Shape consistency
print(f"\nFeature count matches: {X_train_pre.shape[1] == X_test_pre.shape[1]}")
print(f"  Train features: {X_train_pre.shape[1]}")
print(f"  Test features:  {X_test_pre.shape[1]}")

# Check 3: Target balance preserved
train_positive_rate = y_train.mean() * 100
test_positive_rate = y_test.mean() * 100
print(f"\nTrain positive rate: {train_positive_rate:.2f}%")
print(f"Test positive rate:  {test_positive_rate:.2f}%")
print(f"  → Stratification preserved: {abs(train_positive_rate - test_positive_rate) < 1.0}")

SANITY CHECKS

NaN values in X_train_pre: 0
NaN values in X_test_pre:  0
  → All missing values successfully imputed!

Feature count matches: True
  Train features: 29
  Test features:  29

Train positive rate: 8.09%
Test positive rate:  8.08%
  → Stratification preserved: True


## Step 7: Export Model-Ready Artifacts

### Artifact Specification

We export **6 core artifacts** to the `processed/` directory:

| File | Data Type | Purpose | Primary Consumer |
|------|-----------|---------|------------------| 
| `X_train_ready.joblib` | NumPy array (n_train, n_features) | Training features (preprocessed) | Model training on HPC |
| `X_test_ready.joblib` | NumPy array (n_test, n_features) | Test features | Model evaluation |
| `y_train_ready.joblib` | 1D array (n_train,) | Training labels (0/1) | Model training targets |
| `y_test_ready.joblib` | 1D array (n_test,) | Test labels (0/1) | Model evaluation |
| `feature_names.joblib` | List of strings | Feature names after OHE | SHAP/LIME interpretability |
| `preprocessor.joblib` | Fitted ColumnTransformer | Serialized preprocessing pipeline | Production inference |

### Why `.joblib` Format?

- Optimized for scikit-learn objects
- Handles NumPy arrays efficiently
- Built-in compression reduces storage
- Industry standard in ML research

### Compression Strategy

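We use **compression level 3** (zlib, joblib's default compressor for integer levels):
- Shrinks the arrays well below their in-memory size (the float64 training matrix alone is about 57 MB: 247,083 × 29 × 8 bytes)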
- Takes <1 sec per file
- Balances speed and storage for HPC workflows
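
The trade-off can be measured directly. The sketch below writes a hypothetical, mostly-zero matrix (a stand-in for one-hot and binary columns, not the real data) with and without compression into a temporary directory, then checks that the round trip is lossless.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np

# Hypothetical stand-in for a preprocessed matrix: mostly 0/1 values compress well
rng = np.random.default_rng(42)
X_demo = (rng.random((50_000, 29)) < 0.1).astype(np.float64)

with tempfile.TemporaryDirectory() as tmp:
    raw_path = Path(tmp) / "X_raw.joblib"
    zip_path = Path(tmp) / "X_compressed.joblib"

    joblib.dump(X_demo, raw_path)              # no compression
    joblib.dump(X_demo, zip_path, compress=3)  # zlib, level 3

    raw_mb = raw_path.stat().st_size / 1e6
    zip_mb = zip_path.stat().st_size / 1e6
    restored = joblib.load(zip_path)

print(f"Uncompressed: {raw_mb:.1f} MB, compressed: {zip_mb:.1f} MB")
print(f"Round trip is lossless: {np.array_equal(X_demo, restored)}")
```

Actual savings depend on the data: standardized continuous columns compress less well than 0/1 columns, so the ratio on the real arrays may differ from this toy case.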

In [31]:
import os

# Define compression level (3 is a common speed/size compromise)
COMPRESSION = 3

# Export model-ready data with compression
joblib.dump(X_train_pre, PROCESSED_DIR / 'X_train_ready.joblib', compress=COMPRESSION)
joblib.dump(y_train,     PROCESSED_DIR / 'y_train_ready.joblib', compress=COMPRESSION)
joblib.dump(X_test_pre,  PROCESSED_DIR / 'X_test_ready.joblib',  compress=COMPRESSION)
joblib.dump(y_test,      PROCESSED_DIR / 'y_test_ready.joblib',  compress=COMPRESSION)
joblib.dump(feature_names, PROCESSED_DIR / 'feature_names.joblib', compress=COMPRESSION)
joblib.dump(preprocessor, PROCESSED_DIR / 'preprocessor.joblib', compress=COMPRESSION)

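# Cleanup: delete the raw CSV splits to keep the repo clean.
# Note: on the next run, Step 1 will fall back to re-splitting HeartDisease.csv
# with the same RANDOM_STATE, because the pre-split files no longer exist.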
raw_files = [x_train_path, x_test_path, y_train_path, y_test_path]
for f in raw_files:
    if f.exists():
        os.remove(f)
        print(f"Deleted intermediate file: {f.name}")

print("✅ Pipeline Cleaned: Only compressed .joblib files remain in /processed")

Deleted intermediate file: X_train_raw.csv
Deleted intermediate file: X_test_raw.csv
Deleted intermediate file: y_train_raw.csv
Deleted intermediate file: y_test_raw.csv
✅ Pipeline Cleaned: Only compressed .joblib files remain in /processed


## Summary: Preprocessing Pipeline Complete

### Workflow Recap

| Phase | Input | Process | Output | Safeguard |
|-------|-------|---------|--------|---------------------| 
| **1. Load & Split** | `HeartDisease.csv` | Stratified 80/20 split | X_train, X_test | ✓ Stratify by y |
| **2. Analyze Imbalance** | Train/Test labels | Compute positive rate | Class distribution | N/A |
| **3. Feature Engineering** | Raw columns | Define feature groups | Feature group lists | N/A |
| **4. Build Pipeline** | Feature specs | Compose ColumnTransformer | Unfitted preprocessor | N/A |
| **5. Fit & Transform** | X_train, X_test | Fit on train only | X_train_pre, X_test_pre | ✓ Fit on train only |
| **6. Validate** | Transformed arrays | Check NaN, shape, stratification | Validation report | ✓ No data loss |
| **7. Export** | Preprocessed arrays | Compress with joblib | `.joblib` artifacts | ✓ Frozen artifacts |

### Downstream Consumers

**Consumer 1: Local Model Development** (`cardiovascular_modelsREAL.ipynb`)
- Loads: `X_train_ready.joblib`, `X_test_ready.joblib`, labels
- Uses: `feature_names.joblib` for SHAP plots
- Trains: Logistic Regression, XGBoost, CatBoost

**Consumer 2: HPC Hyperparameter Tuning** (`cardiovascular_optuna_gpu.py`)
- Loads: Data → GPU memory on A40 cluster
- Optimizer: Optuna with TPE sampler
- Objective: Maximize PR-AUC while maintaining ≥85% recall

**Consumer 3: Explainability** (`cardio_SHAP.ipynb`)
- Loads: `feature_names.joblib`, best model
- Methods: SHAP TreeExplainer or KernelExplainer
- Outputs: Feature importance plots, clinical insights

### Key Principles

| Principle | Implementation | Benefit |
|-----------|----------------|---------|  
| **Stratified Splitting** | `stratify=y` | Preserves disease prevalence |
| **Fit on Train Only** | `preprocessor.fit(X_train)` | Prevents data leakage |
| **Type-Specific Pipelines** | Separate transformers | Appropriate handling |
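| **Median Imputation** | `SimpleImputer(strategy='median')` | Robust to outliers and skewed distributions |
| **StandardScaler** | Zero mean, unit variance (training statistics) | Convergence, numerical stability |
| **One-Hot Encoding** | `OneHotEncoder(sparse_output=False)` | No false ordering; dense, GPU-compatible output |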
| **Class Weighting** | `class_weight='balanced'` | Handles imbalance without synthetic data |
| **Comprehensive Validation** | NaN/shape checks | Detects silent corruption |
| **Deterministic Export** | Fixed `RANDOM_STATE` | Reproducible artifacts |

## Conclusion

This preprocessing notebook **successfully transforms raw clinical data into production-ready artifacts**. The pipeline:

* **Prevents data leakage** via training-only fitting
* **Preserves clinical validity** through stratified splitting
* **Handles missing data** defensively
* **Optimizes for GPU** with dense arrays
* **Enables reproducibility** via fixed seeds
* **Documents reasoning** for interpretability

**Next Steps:**
1. Load artifacts in `cardiovascular_modelsREAL.ipynb` for local experiments
2. Deploy to HPC cluster and run `cardiovascular_optuna_gpu.py` for GPU tuning
3. Use best model in `cardio_SHAP.ipynb` for explainability analysis
4. Communicate findings to clinical stakeholders with SHAP plots