# Foundation PLR: Comprehensive Research Guide

**A complete guide for researchers who want to:**
1. Understand what data is stored and how
2. Reproduce our exact results
3. Train new classifiers (including future ones released in 2027+)
4. Apply new biostatistics methods
5. Run the pipeline on their own PLR data
6. Customize the featurization (bins, windows, etc.)

---

> **Who is this for?**
>
> - Researchers wanting to validate or extend our work
> - Clinicians/PIs who want to understand what we did
> - Developers implementing new classifiers or methods
> - Anyone with their own PLR data who wants to use our pipeline

---

# PART 1: Understanding the Data

## 1.1 What is PLR (Pupillary Light Reflex)?

> **ELI5 for PIs:**
>
> When you shine a light into someone's eye, their pupil gets smaller (constricts).
> When you turn the light off, it gets bigger again (redilates).
>
> **The key insight:** In glaucoma, certain cells in the eye (melanopsin-containing retinal ganglion cells)
> are damaged. These cells are responsible for a specific part of the pupil response - the
> "sustained" constriction after blue light.
>
> By measuring the pupil response precisely with different colors of light (red vs blue),
> we can potentially detect glaucoma damage.

## 1.2 The Data Pipeline Overview

```
Raw PLR Recording (pupillometer) 
    |
    v
[1] OUTLIER DETECTION - Remove blinks, artifacts
    |
    v
[2] IMPUTATION - Fill in missing data points
    |
    v
[3] FEATURIZATION - Extract meaningful numbers from the curves
    |                (e.g., "maximum constriction was 2.3mm at 0.8 seconds")
    v
[4] CLASSIFICATION - Train ML models to predict glaucoma
    |
    v
[5] EVALUATION - Bootstrap confidence intervals, effect sizes, calibration
```

## 1.3 What's in the Shared DuckDB Files?

We provide **two** DuckDB database files:

### File 1: `foundation_plr_results.db` (~4 MB)

Contains the **final results** - everything you need to reproduce our statistical analysis:

| Table | What it contains | Rows |
|-------|-----------------|------|
| `predictions` | Every single prediction made by every classifier | ~20,000 |
| `metrics_per_fold` | Performance metrics per cross-validation fold | ~10,000 |
| `metrics_aggregate` | Summary statistics (mean, CI) per classifier | ~14,000 |
| `mlflow_runs` | Experiment metadata | ~300 |

### File 2: `foundation_plr_distributions.db` (~40 MB)

Contains the **full distributions** for uncertainty quantification:

| Table | What it contains | Rows |
|-------|-----------------|------|
| `bootstrap_distributions` | All bootstrap iterations (not just summary) | ~1.3 million |
| `subject_predictions` | Per-subject predictions with uncertainty | ~115,000 |

In [None]:
# Setup
import duckdb
import pandas as pd
import numpy as np
from pathlib import Path

# Paths - adjust to where you saved the files
DATA_DIR = Path("../outputs")
RESULTS_DB = DATA_DIR / "foundation_plr_results.db"
DISTRIBUTIONS_DB = DATA_DIR / "foundation_plr_distributions.db"

print(f"Results DB exists: {RESULTS_DB.exists()}")
print(f"Distributions DB exists: {DISTRIBUTIONS_DB.exists()}")

## 1.4 Complete Column Reference

### Table: `predictions`

> **What is this?** Each row = one prediction for one subject by one classifier in one fold.

| Column | Type | What it means | Example |
|--------|------|---------------|--------|
| `prediction_id` | INT | Unique row identifier | 1, 2, 3... |
| `subject_id` | VARCHAR | Anonymous patient ID | "PLR1001" |
| `eye` | VARCHAR | Which eye | "OD" (right), "OS" (left) |
| `fold` | INT | Cross-validation fold (0-4) | 0, 1, 2, 3, 4 |
| `bootstrap_iter` | INT | Bootstrap iteration (0=original) | 0-999 |
| `outlier_method` | VARCHAR | How blinks/artifacts were detected | "ensemble-LOF-MOMENT-..." |
| `imputation_method` | VARCHAR | How missing data was filled | "SAITS" |
| `featurization` | VARCHAR | Feature extraction method | "simple1.0" |
| `classifier` | VARCHAR | ML algorithm | "TabM", "XGBOOST", "TabPFN", "LogisticRegression" |
| `source_name` | VARCHAR | Full pipeline configuration string | "XGBOOST_eval-auc__simple1.0__SAITS__ensemble-..." |
| `y_true` | INT | **Ground truth**: Does patient have glaucoma? | 0 (no), 1 (yes) |
| `y_pred` | INT | **Binary prediction**: Classifier's decision | 0 (no), 1 (yes) |
| `y_prob` | FLOAT | **Probability**: Classifier's confidence (0.0-1.0) | 0.73 = "73% confident it's glaucoma" |
| `mlflow_run_id` | VARCHAR | Experiment tracking ID | "abc123..." |

### Understanding `source_name`

> **ELI5:** The source_name encodes the ENTIRE pipeline configuration in one string.
>
> Format: `{CLASSIFIER}_{METRIC}__{FEATURIZATION}__{IMPUTATION}__{OUTLIER_DETECTION}`
>
> Example: `XGBOOST_eval-auc__simple1.0__SAITS__ensemble-LOF-MOMENT-OneClassSVM-PROPHET-SubPCA-TimesNet-UniTS-gt-finetune`
>
> Means:
> - Classifier: XGBoost optimized for AUC
> - Features: Simple v1.0 handcrafted features
> - Imputation: SAITS (Self-Attention Imputation for Time Series)
> - Outliers: Ensemble of 7 methods (LOF, MOMENT, OneClassSVM, etc.)
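
If you need these parts as separate fields (for grouping or filtering), the format above can be split mechanically. The helper below is a minimal sketch, not part of the released pipeline: `parse_source_name` is a hypothetical name, and it assumes that classifier names never contain an underscore and that every string has exactly four `__`-separated parts.

```python
def parse_source_name(source_name: str) -> dict:
    """Split a pipeline configuration string into its named parts."""
    parts = source_name.split("__")
    if len(parts) != 4:
        raise ValueError(
            f"expected 4 '__'-separated parts, got {len(parts)}: {source_name!r}"
        )
    clf_and_metric, featurization, imputation, outlier_detection = parts
    # Assumption: classifier names contain no underscore, so the first '_'
    # separates the classifier from its optimization metric.
    classifier, _, metric = clf_and_metric.partition("_")
    return {
        "classifier": classifier,
        "metric": metric or None,  # None when no metric suffix is present
        "featurization": featurization,
        "imputation": imputation,
        "outlier_detection": outlier_detection,
    }


example = (
    "XGBOOST_eval-auc__simple1.0__SAITS__"
    "ensemble-LOF-MOMENT-OneClassSVM-PROPHET-SubPCA-TimesNet-UniTS-gt-finetune"
)
for key, value in parse_source_name(example).items():
    print(f"{key:>18}: {value}")
```

The outlier-detection part is kept as one string on purpose: its internal hyphens mix method names with suffixes such as `gt-finetune`, so splitting it further would require knowledge the string alone does not provide.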

In [None]:
# Let's explore the predictions table
con = duckdb.connect(str(RESULTS_DB), read_only=True)

print("=== PREDICTIONS TABLE ===")
print("\nColumn info:")
print(con.execute("DESCRIBE predictions").fetchdf().to_string())

print("\n\nSample rows:")
print(con.execute("SELECT * FROM predictions LIMIT 3").fetchdf().to_string())

print("\n\nUnique classifiers:")
print(con.execute("SELECT DISTINCT classifier FROM predictions").fetchdf())

print("\n\nData summary:")
summary = con.execute("""
    SELECT 
        COUNT(*) as total_predictions,
        COUNT(DISTINCT subject_id) as unique_subjects,
        COUNT(DISTINCT classifier) as classifiers,
        ROUND(AVG(y_true), 3) as glaucoma_prevalence
    FROM predictions
""").fetchdf()
print(summary.to_string())

### Table: `metrics_per_fold`

> **What is this?** Performance metrics calculated separately for each CV fold.

| Column | What it means |
|--------|---------------|
| `metric_id` | Unique row ID |
| `classifier` | Which classifier |
| `fold` | Which CV fold (0-4) |
| `metric_name` | Which metric (see below) |
| `metric_value` | The value |
| `bootstrap_iter` | Bootstrap iteration |
| `source_name` | Full pipeline config |

**Available metrics:**

| Metric | What it measures | Range | Ideal |
|--------|-----------------|-------|-------|
| `auroc` | Area Under ROC Curve | 0-1 | 1.0 |
| `auprc` | Area Under Precision-Recall Curve | 0-1 | 1.0 |
| `brier` | Brier Score (mean squared error of the predicted probabilities; reflects both calibration and discrimination) | 0-1 | 0.0 |
| `accuracy` | Correct predictions / Total | 0-1 | 1.0 |
| `sensitivity` | True Positives / All Actual Positives | 0-1 | 1.0 |
| `specificity` | True Negatives / All Actual Negatives | 0-1 | 1.0 |
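
To make the definitions in this table concrete, here is a small sketch that computes several of them from `y_true` and `y_prob` in plain Python. It is for illustration only: the function names are hypothetical, the 0.5 threshold is an assumption, and the stored values were produced by the pipeline itself, which may differ in details such as tie handling.

```python
def auroc(y_true, y_prob):
    """Probability that a random case gets a higher score than a random control."""
    cases = [p for t, p in zip(y_true, y_prob) if t == 1]
    controls = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not cases or not controls:
        raise ValueError("AUROC needs at least one case and one control")
    # Ties between a case and a control count as half a win.
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def brier(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def threshold_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, sensitivity and specificity after thresholding y_prob."""
    pairs = [(t, int(p >= threshold)) for t, p in zip(y_true, y_prob)]
    tp = sum(1 for t, yp in pairs if t == 1 and yp == 1)
    tn = sum(1 for t, yp in pairs if t == 0 and yp == 0)
    fp = sum(1 for t, yp in pairs if t == 0 and yp == 1)
    fn = sum(1 for t, yp in pairs if t == 1 and yp == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Toy data, not taken from the study
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.9, 0.6]
print(f"AUROC: {auroc(y_true, y_prob):.3f}")
print(f"Brier: {brier(y_true, y_prob):.3f}")
print(threshold_metrics(y_true, y_prob))
```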

### Table: `bootstrap_distributions` (in distributions DB)

> **ELI5 for PIs:** Bootstrap is like asking "if we had slightly different patients, would we get the same result?"
> We resample the data 1000 times and calculate metrics each time.
> This gives us a distribution of possible values, not just one number.

| Column | What it means |
|--------|---------------|
| `dist_id` | Unique row ID |
| `classifier` | Which classifier |
| `metric_name` | Which metric |
| `fold` | Which CV fold |
| `bootstrap_iter` | Which bootstrap iteration (0-999) |
| `metric_value` | Metric value for THIS bootstrap sample |
| `source_name` | Pipeline config |
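
Once you have pulled the `metric_value` column for one classifier and metric, turning the distribution into a summary is a matter of taking percentiles. The sketch below shows one way to do it in plain Python; `summarize_bootstrap` is a hypothetical helper, and the percentile interval with linear interpolation is an assumption about how you want to summarize, not a statement of how `metrics_aggregate` was built.

```python
import statistics


def percentile(sorted_values, q):
    """Percentile q (0-100) of an already sorted list, with linear interpolation."""
    if not sorted_values:
        raise ValueError("need at least one value")
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def summarize_bootstrap(values, alpha=0.05):
    """Mean, standard deviation and percentile interval of bootstrap values."""
    ordered = sorted(values)
    return {
        "mean": statistics.fmean(ordered),
        "std": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
        "ci_lower": percentile(ordered, 100 * alpha / 2),
        "ci_upper": percentile(ordered, 100 * (1 - alpha / 2)),
        "n": len(ordered),
    }


# Toy values standing in for one classifier's bootstrap AUROCs
toy_aurocs = [0.78, 0.81, 0.80, 0.83, 0.79, 0.82, 0.84, 0.77, 0.80, 0.81]
print(summarize_bootstrap(toy_aurocs))
```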

### Table: `subject_predictions` (in distributions DB)

> **What is this?** Per-subject predictions with uncertainty measures.
> Use this for the "probability distributions per outcome category" plots.

| Column | What it means |
|--------|---------------|
| `pred_id` | Unique row ID |
| `source_name` | Full pipeline config |
| `classifier` | Which classifier |
| `split` | "train" or "test" |
| `subject_code` | Anonymous subject ID |
| `y_true` | Ground truth (0=healthy, 1=glaucoma) |
| `y_pred_proba` | Predicted probability (0.0-1.0) |
| `y_pred` | Binary prediction (0 or 1) |
| `confidence` | Prediction confidence (if available) |
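| `entropy_of_expected` | Entropy of the mean predicted probability (total uncertainty) |
| `expected_entropy` | Mean entropy of the individual predictions (data, or aleatoric, uncertainty) |
| `mutual_information` | `entropy_of_expected` minus `expected_entropy` (model, or epistemic, uncertainty) |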
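
The three uncertainty columns carry the names of a standard decomposition used with ensembles or repeated predictions. The sketch below shows that decomposition for a binary outcome. It rests on assumptions: that the columns follow this textbook definition, that the inputs are repeated probabilities for one subject (for example across bootstrap models), and that entropy is in nats. The function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli outcome with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def decompose_uncertainty(probs):
    """Split predictive uncertainty for one subject into its components."""
    mean_p = sum(probs) / len(probs)
    entropy_of_expected = binary_entropy(mean_p)            # total uncertainty
    expected_entropy = sum(binary_entropy(p) for p in probs) / len(probs)
    return {
        "entropy_of_expected": entropy_of_expected,
        "expected_entropy": expected_entropy,
        # What remains is disagreement between the individual predictions
        "mutual_information": entropy_of_expected - expected_entropy,
    }


# Models agree on an uncertain subject: high total, near-zero mutual information
print(decompose_uncertainty([0.5, 0.5, 0.5]))
# Models disagree strongly: high mutual information
print(decompose_uncertainty([0.1, 0.9]))
```

A subject with high `mutual_information` is one where the resampled models disagree, which is a different situation from a subject every model places near 0.5.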

---

# PART 2: Reproducing Our Results

## 2.1 Get the Exact Numbers from Our Paper

In [None]:
# Reproduce Table 1: Classifier Performance
print("=" * 70)
print("TABLE 1: Classification Performance for Glaucoma Detection")
print("=" * 70)

query = """
SELECT 
    classifier as Classifier,
    ROUND(mean, 3) as AUROC,
    '[' || ROUND(ci_lower, 3) || ', ' || ROUND(ci_upper, 3) || ']' as "95% CI",
    ROUND(std, 3) as SE
FROM metrics_aggregate
WHERE metric_name = 'auroc'
ORDER BY mean DESC
"""

result = con.execute(query).fetchdf()
print(result.to_string(index=False))

In [None]:
# Reproduce the pairwise comparisons with effect sizes
print("\n" + "=" * 70)
print("TABLE 2: Pairwise Comparisons (Effect Sizes)")
print("=" * 70)

# Get fold-level AUROCs for effect size calculation
fold_aurocs = con.execute("""
    SELECT classifier, fold, metric_value as auroc
    FROM metrics_per_fold
    WHERE metric_name = 'auroc' AND bootstrap_iter = 0
    ORDER BY classifier, fold
""").fetchdf()

# Calculate Cohen's d for each pair
from scipy import stats
import itertools

classifiers = fold_aurocs['classifier'].unique()
comparisons = []

for clf1, clf2 in itertools.combinations(classifiers, 2):
    x1 = fold_aurocs[fold_aurocs['classifier'] == clf1]['auroc'].values
    x2 = fold_aurocs[fold_aurocs['classifier'] == clf2]['auroc'].values
    
    # Cohen's d
    pooled_std = np.sqrt(((len(x1)-1)*np.var(x1, ddof=1) + (len(x2)-1)*np.var(x2, ddof=1)) / (len(x1)+len(x2)-2))
    d = (np.mean(x1) - np.mean(x2)) / pooled_std if pooled_std > 0 else 0
    
    # Paired t-test
    if len(x1) == len(x2):  # Same folds
        t_stat, p_val = stats.ttest_rel(x1, x2)
    else:
        t_stat, p_val = stats.ttest_ind(x1, x2)
    
    # Interpretation
    if abs(d) < 0.2:
        interp = "negligible"
    elif abs(d) < 0.5:
        interp = "small"
    elif abs(d) < 0.8:
        interp = "medium"
    else:
        interp = "large"
    
    comparisons.append({
        'Classifier 1': clf1,
        'Classifier 2': clf2,
        "Cohen's d": round(d, 3),
        'p-value': f"{p_val:.2e}" if p_val < 0.001 else f"{p_val:.4f}",
        'Effect': interp
    })

print(pd.DataFrame(comparisons).to_string(index=False))

## 2.2 Reproduce Figures

In [None]:
import matplotlib.pyplot as plt

# Figure: Probability distributions per outcome category (Van Calster 2024 requirement)
if DISTRIBUTIONS_DB.exists():
    con_dist = duckdb.connect(str(DISTRIBUTIONS_DB), read_only=True)
    
    # Get test set predictions
    df = con_dist.execute("""
        SELECT classifier, y_true, y_pred_proba 
        FROM subject_predictions 
        WHERE split = 'test'
        AND classifier IN ('TabM', 'XGBOOST', 'TabPFN', 'LogisticRegression')
    """).fetchdf()
    
    # Plot
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    
    for ax, clf in zip(axes.flatten(), ['TabM', 'XGBOOST', 'TabPFN', 'LogisticRegression']):
        clf_df = df[df['classifier'] == clf]
        
        controls = clf_df[clf_df['y_true'] == 0]['y_pred_proba']
        cases = clf_df[clf_df['y_true'] == 1]['y_pred_proba']
        
        ax.hist(controls, bins=30, alpha=0.6, label=f'Controls (n={len(controls)})', 
                color='#3498db', density=True)
        ax.hist(cases, bins=30, alpha=0.6, label=f'Glaucoma (n={len(cases)})', 
                color='#e74c3c', density=True)
        ax.axvline(x=0.5, color='gray', linestyle='--', label='Threshold')
        ax.set_xlabel('Predicted Probability')
        ax.set_ylabel('Density')
        ax.set_title(clf)
        ax.legend()
    
    plt.suptitle('Probability Distributions by True Outcome (Test Set)', fontsize=14)
    plt.tight_layout()
    plt.show()
    
    con_dist.close()
else:
    print("Distributions database not found")

---

# PART 3: Training New Classifiers

> **Scenario:** It's 2027 and a new amazing classifier called "SuperTabNet" was just released.
> You want to test it on our data.

## 3.1 Load the Features

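The results database stores predictions and labels, not the feature vectors themselves.
To train a new classifier you need the feature vectors, which live in a separate features
database (`foundation_plr_features.db`) if you have access to it.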

In [None]:
# Method 1: Use the predictions to get labels and subject IDs,
# then join with features if you have access to them

# Get unique subjects and their labels from the test fold of the best classifier
query = """
SELECT DISTINCT 
    subject_id,
    eye,
    y_true as label
FROM predictions
WHERE classifier = 'TabM' AND fold = 0
ORDER BY subject_id, eye
"""

subjects = con.execute(query).fetchdf()
print(f"Subjects in the dataset: {len(subjects)}")
print(f"\nClass distribution:")
print(subjects['label'].value_counts())
print(f"\nFirst few subjects:")
print(subjects.head(10))

In [None]:
# Method 2: If you have the features database (foundation_plr_features.db),
# you can load the actual feature vectors

FEATURES_DB = DATA_DIR / "foundation_plr_features.db"

if FEATURES_DB.exists():
    con_feat = duckdb.connect(str(FEATURES_DB), read_only=True)
    
    # Get schema
    print("Feature columns:")
    schema = con_feat.execute("DESCRIBE plr_features").fetchdf()
    print(schema['column_name'].tolist())
    
    # Load features
    X_df = con_feat.execute("SELECT * FROM plr_features").fetchdf()
    
    # Separate features from metadata
    metadata_cols = ['subject_id', 'eye', 'source_name', 'has_glaucoma', 'split']
    feature_cols = [c for c in X_df.columns if c not in metadata_cols]
    
    X = X_df[feature_cols].values
    y = X_df['has_glaucoma'].values
    
    print(f"\nFeature matrix shape: {X.shape}")
    print(f"Labels shape: {y.shape}")
    print(f"Feature names: {feature_cols[:10]}...")
    
    con_feat.close()
else:
    print("Features database not found.")
    print("Using prediction probabilities as proxy features for demonstration.")
    
    # Get probabilities from all classifiers as pseudo-features
    query = """
    SELECT 
        subject_id,
        eye,
        MAX(CASE WHEN classifier = 'TabM' THEN y_prob END) as prob_TabM,
        MAX(CASE WHEN classifier = 'XGBOOST' THEN y_prob END) as prob_XGBOOST,
        MAX(CASE WHEN classifier = 'TabPFN' THEN y_prob END) as prob_TabPFN,
        MAX(CASE WHEN classifier = 'LogisticRegression' THEN y_prob END) as prob_LR,
        MAX(y_true) as label
    FROM predictions
    WHERE fold = 0
    GROUP BY subject_id, eye
    """
    
    df = con.execute(query).fetchdf().dropna()
    X = df[['prob_TabM', 'prob_XGBOOST', 'prob_TabPFN', 'prob_LR']].values
    y = df['label'].values
    
    print(f"Pseudo-feature matrix shape: {X.shape}")

## 3.2 Train a New Classifier

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import roc_auc_score, roc_curve

# Example: Train a Random Forest (pretend this is "SuperTabNet 2027")
from sklearn.ensemble import RandomForestClassifier

# Initialize your new classifier
new_clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    random_state=42,
    class_weight='balanced'
)

# Use 5-fold CV like we did
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Get cross-validated predictions
y_prob_cv = cross_val_predict(new_clf, X, y, cv=cv, method='predict_proba')[:, 1]

# Calculate AUROC
auroc = roc_auc_score(y, y_prob_cv)
print(f"\nYour new classifier AUROC: {auroc:.4f}")

# Compare to our results
print("\nComparison to our classifiers:")
our_results = con.execute("""
    SELECT classifier, ROUND(mean, 4) as auroc
    FROM metrics_aggregate
    WHERE metric_name = 'auroc'
    ORDER BY mean DESC
""").fetchdf()
print(our_results.to_string(index=False))

## 3.3 Bootstrap Confidence Intervals for Your Classifier

In [None]:
def bootstrap_auroc_ci(y_true, y_prob, n_iterations=1000, alpha=0.05, random_state=42):
    """
    Calculate bootstrap confidence interval for AUROC.
    
    This is the same method we used in the paper.
    """
    rng = np.random.default_rng(random_state)
    n = len(y_true)
    aurocs = []
    
    for _ in range(n_iterations):
        # Resample with replacement
        idx = rng.choice(n, n, replace=True)
        
        # Skip if only one class in sample
        if len(np.unique(y_true[idx])) < 2:
            continue
            
        aurocs.append(roc_auc_score(y_true[idx], y_prob[idx]))
    
    aurocs = np.array(aurocs)
    ci_lower = np.percentile(aurocs, 100 * alpha / 2)
    ci_upper = np.percentile(aurocs, 100 * (1 - alpha / 2))
    
    return {
        'mean': np.mean(aurocs),
        'std': np.std(aurocs),
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'n_valid_iterations': len(aurocs)
    }

# Calculate CI for your classifier
ci_result = bootstrap_auroc_ci(y.astype(int), y_prob_cv, n_iterations=1000)

print(f"Your classifier results:")
print(f"  AUROC: {ci_result['mean']:.4f}")
print(f"  95% CI: [{ci_result['ci_lower']:.4f}, {ci_result['ci_upper']:.4f}]")
print(f"  SE: {ci_result['std']:.4f}")

---

# PART 4: Applying New Biostatistics Methods

> **Scenario:** A new calibration method or clinical utility metric is published.
> You want to apply it to our predictions.

## 4.1 Get Raw Predictions for Custom Analysis

In [None]:
# Get all predictions for the best classifier (TabM)
query = """
SELECT 
    subject_id,
    eye,
    fold,
    y_true,
    y_prob,
    y_pred
FROM predictions
WHERE classifier = 'TabM'
  AND bootstrap_iter = 0  -- Original predictions (not bootstrap)
ORDER BY subject_id, eye, fold
"""

df_predictions = con.execute(query).fetchdf()

# Convert to numpy for analysis
y_true_all = df_predictions['y_true'].values
y_prob_all = df_predictions['y_prob'].values
y_pred_all = df_predictions['y_pred'].values

print(f"Loaded {len(df_predictions)} predictions")
print(f"Unique subjects: {df_predictions['subject_id'].nunique()}")
print(f"Folds: {sorted(df_predictions['fold'].unique())}")

## 4.2 Example: Custom Calibration Metric

In [None]:
def integrated_calibration_index(y_true, y_prob, n_bins=10):
    """
    Binned approximation of the Integrated Calibration Index (ICI).
    
    The ICI proper averages |observed - predicted| over all subjects using a
    smoothed (e.g. loess) calibration curve. Here the curve is approximated
    with n_bins uniform bins, and every non-empty bin counts equally.
    
    Lower is better (0 = perfect calibration).
    """
    from sklearn.calibration import calibration_curve
    
    # Get calibration curve
    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins, strategy='uniform')
    
    # ICI = mean absolute deviation
    ici = np.mean(np.abs(prob_true - prob_pred))
    
    return ici

# Calculate ICI for each classifier
print("Integrated Calibration Index (ICI):")
print("(Lower is better, 0 = perfect)\n")

for clf in ['TabM', 'XGBOOST', 'TabPFN', 'LogisticRegression']:
    df_clf = con.execute(f"""
        SELECT y_true, y_prob 
        FROM predictions 
        WHERE classifier = '{clf}' AND bootstrap_iter = 0 AND fold = 0
    """).fetchdf()
    
    ici = integrated_calibration_index(df_clf['y_true'].values, df_clf['y_prob'].values)
    print(f"  {clf}: ICI = {ici:.4f}")
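
The bin average in `integrated_calibration_index` gives a bin with two subjects the same influence as a bin with two hundred. A count-weighted variant, commonly called expected calibration error (ECE), can be sketched as follows. This is an illustration in plain Python under the same uniform-binning assumption; `expected_calibration_error` is a hypothetical helper and not a metric reported in the paper.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Count-weighted mean |observed rate - mean predicted| over uniform bins."""
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, y_prob):
        # A probability of exactly 1.0 belongs to the last bin
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((t, p))

    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        observed = sum(t for t, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += len(members) / n * abs(observed - predicted)
    return ece


# Toy data: the model says 0.9 for everyone, but only half are cases
print(expected_calibration_error([1, 1, 0, 0], [0.9, 0.9, 0.9, 0.9]))
```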

## 4.3 Example: Custom Clinical Utility Metric

In [None]:
def net_benefit_at_threshold(y_true, y_prob, threshold):
    """
    Calculate net benefit at a specific decision threshold.
    
    Net Benefit = TP/n - FP/n × (pt / (1-pt))
    
    where pt is the threshold probability.
    
    This is from Vickers & Elkin (2006) Decision Curve Analysis.
    """
    n = len(y_true)
    y_pred = (y_prob >= threshold).astype(int)
    
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    
    if threshold >= 1 or threshold <= 0:
        return 0
    
    nb = (tp / n) - (fp / n) * (threshold / (1 - threshold))
    return nb


def standardized_net_benefit(y_true, y_prob, threshold):
    """
    Standardized Net Benefit (sNB) - rescaled so a perfect model scores 1 (values can be negative).
    
    sNB = NB / prevalence
    
    This makes it easier to compare across studies with different prevalences.
    """
    nb = net_benefit_at_threshold(y_true, y_prob, threshold)
    prevalence = np.mean(y_true)
    return nb / prevalence if prevalence > 0 else 0


# Calculate at clinically relevant thresholds
print("Net Benefit at Clinically Relevant Thresholds:")
print("(Higher is better, compare to 'treat all' baseline)\n")

thresholds = [0.1, 0.2, 0.3, 0.5]
prevalence = y_true_all.mean()

print(f"Prevalence: {prevalence:.3f}\n")

for t in thresholds:
    treat_all_nb = prevalence - (1 - prevalence) * (t / (1 - t))
    print(f"\nThreshold = {t}:")
    print(f"  Treat All NB: {treat_all_nb:.4f}")
    
    for clf in ['TabM', 'XGBOOST', 'TabPFN', 'LogisticRegression']:
        df_clf = con.execute(f"""
            SELECT y_true, y_prob 
            FROM predictions 
            WHERE classifier = '{clf}' AND bootstrap_iter = 0 AND fold = 0
        """).fetchdf()
        
        nb = net_benefit_at_threshold(df_clf['y_true'].values, df_clf['y_prob'].values, t)
        improvement = nb - max(treat_all_nb, 0)
        print(f"  {clf}: NB = {nb:.4f} (improvement: {improvement:+.4f})")

---

# PART 5: Running the Pipeline on Your Own PLR Data

> **Important:** This section is for researchers who have their own PLR measurements
> and want to use our exact pipeline.

## 5.1 Data Requirements

Your PLR data must be in a specific format:

### Required Columns

| Column | Type | Description | Example |
|--------|------|-------------|--------|
| `subject_code` | str | Unique subject ID | "SUBJ001" |
| `time` | float | Time in seconds | 0.0, 0.033, 0.067... |
| `pupil_raw` | float | Raw pupil size (mm) | 4.52, 4.48, 4.31... |
| `Red` | int | Red light on (1) or off (0) | 0, 0, 1, 1, 1, 0... |
| `Blue` | int | Blue light on (1) or off (0) | 0, 0, 0, 0, 1, 1... |
| `class_label` | int | 0=control, 1=glaucoma | 0, 1 |
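
Before converting anything, it can help to check your rows against this table. The validator below is a rough sketch in plain Python that works on a list of row dictionaries. `validate_plr_rows` is a hypothetical helper; the checks that time never goes backwards within a subject and that `class_label` is constant per subject are assumptions about sensible input, not documented pipeline requirements.

```python
REQUIRED_COLUMNS = ("subject_code", "time", "pupil_raw", "Red", "Blue", "class_label")


def validate_plr_rows(rows):
    """Return a list of problems found; an empty list means the rows look usable."""
    problems = []
    last_time = {}      # subject_code -> most recent time seen
    first_label = {}    # subject_code -> first class_label seen
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
            continue
        for flag in ("Red", "Blue", "class_label"):
            if row[flag] not in (0, 1):
                problems.append(f"row {i}: {flag} must be 0 or 1, got {row[flag]!r}")
        subject = row["subject_code"]
        # Assumption: samples are ordered in time within each subject
        if subject in last_time and row["time"] < last_time[subject]:
            problems.append(f"row {i}: time goes backwards for {subject}")
        last_time[subject] = row["time"]
        # Assumption: one diagnosis per subject
        if first_label.setdefault(subject, row["class_label"]) != row["class_label"]:
            problems.append(f"row {i}: class_label changes within {subject}")
    return problems


rows = [
    {"subject_code": "SUBJ001", "time": 0.0, "pupil_raw": 4.52,
     "Red": 0, "Blue": 0, "class_label": 1},
    {"subject_code": "SUBJ001", "time": 0.033, "pupil_raw": 4.48,
     "Red": 0, "Blue": 0, "class_label": 1},
]
print(validate_plr_rows(rows) or "no problems found")
```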

### Optional Columns (if available)

| Column | Description |
|--------|--------------|
| `pupil_gt` | Ground truth denoised signal |
| `denoised` | CEEMD-denoised signal |
| `age` | Subject age |
| `sex` | Subject sex |

## 5.2 Expected PLR Recording Protocol

Our pipeline expects a specific recording protocol:

```
Timeline (seconds):

0     5      20     25     40     45     60     65
|-----|------|------|------|------|------|------|---->
  ^       ^          ^          ^          ^         
  |       |          |          |          |         
Baseline Red ON   Red OFF   Blue ON   Blue OFF      
         (15s)     (5s)     (15s)     (recovery)
```

- **Sampling rate**: 30 Hz (30 samples per second)
- **Total duration**: ~66 seconds (1981 samples)
- **Light stimuli**: Red (620nm) and Blue (470nm)
- **Stimulus duration**: ~15 seconds each
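
The sampling figures above fit together as follows: 1981 samples at 30 Hz span sample 0 to sample 1980, which is exactly 66 seconds. The sketch below builds that time axis and a 0/1 stimulus column from an onset and offset time. The helper names are hypothetical, and the 5 s and 20 s values in the example are read off the timeline above, so substitute your own protocol's times.

```python
FS_HZ = 30          # sampling rate stated above
N_SAMPLES = 1981    # samples per recording stated above


def time_axis(n_samples=N_SAMPLES, fs_hz=FS_HZ):
    """Time in seconds of every sample, starting at 0."""
    return [i / fs_hz for i in range(n_samples)]


def stimulus_flags(times, onset_s, offset_s):
    """1 while the light is on (onset_s <= t < offset_s), otherwise 0."""
    return [1 if onset_s <= t < offset_s else 0 for t in times]


times = time_axis()
red = stimulus_flags(times, onset_s=5, offset_s=20)

print(f"Last sample at {times[-1]:.2f} s")
print(f"Red light on for {sum(red)} samples ({sum(red) / FS_HZ:.1f} s)")
```

Treating the window as half-open (on at the onset sample, off at the offset sample) is a design choice that keeps adjacent stimuli from overlapping by one sample.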

> **What if my protocol is different?**
>
> You'll need to modify the feature extraction windows (see Part 6).

## 5.3 Prepare Your Data

In [None]:
# Example: Creating properly formatted data

def prepare_plr_for_pipeline(your_data_df):
    """
    Convert your PLR data to the format expected by our pipeline.
    
    Parameters
    ----------
    your_data_df : pd.DataFrame
        Your data with columns: subject_id, timestamp_ms, pupil_diameter, 
        red_light_on, blue_light_on, has_glaucoma
    
    Returns
    -------
    pd.DataFrame
        Data formatted for the pipeline
    """
    # Rename columns to expected names
    column_map = {
        'subject_id': 'subject_code',
        'timestamp_ms': 'time',  # Will convert to seconds
        'pupil_diameter': 'pupil_raw',
        'red_light_on': 'Red',
        'blue_light_on': 'Blue',
        'has_glaucoma': 'class_label'
    }
    
    df = your_data_df.rename(columns=column_map).copy()
    
    # Convert time to seconds if needed
    if df['time'].max() > 100:  # Probably in milliseconds
        df['time'] = df['time'] / 1000
    
    # Ensure binary columns are integers
    df['Red'] = df['Red'].astype(int)
    df['Blue'] = df['Blue'].astype(int)
    df['class_label'] = df['class_label'].astype(int)
    
    # Validate
    assert 'subject_code' in df.columns
    assert 'time' in df.columns
    assert 'pupil_raw' in df.columns
    assert 'Red' in df.columns
    assert 'Blue' in df.columns
    assert 'class_label' in df.columns
    
    print(f"Data prepared: {df['subject_code'].nunique()} subjects")
    print(f"Time range: {df['time'].min():.2f} to {df['time'].max():.2f} seconds")
    print(f"Samples per subject: ~{len(df) // df['subject_code'].nunique()}")
    
    return df

# Example usage (with synthetic data)
print("Example data preparation:")
print("--" * 30)

# Create synthetic example
n_subjects = 3
n_samples = 1981  # Our standard length

example_data = []
for subj in range(n_subjects):
    for t in range(n_samples):
        time_s = t / 30  # 30 Hz
        example_data.append({
            'subject_id': f'SUBJ{subj:03d}',
            'timestamp_ms': t * (1000/30),
            'pupil_diameter': 4.5 + np.random.randn() * 0.3,
            'red_light_on': 1 if 5 < time_s < 20 else 0,
            'blue_light_on': 1 if 25 < time_s < 40 else 0,
            'has_glaucoma': subj % 2  # Alternating labels
        })

example_df = pd.DataFrame(example_data)
prepared_df = prepare_plr_for_pipeline(example_df)
print(f"\nFirst few rows:\n{prepared_df.head()}")

## 5.4 Save to DuckDB

In [None]:
def save_plr_to_duckdb(df, output_path):
    """
    Save prepared PLR data to DuckDB format for the pipeline.
    """
    import duckdb
    
    con = duckdb.connect(str(output_path))
    
    # Create table
    con.execute("""
        CREATE TABLE IF NOT EXISTS plr_recordings (
            subject_code VARCHAR,
            time FLOAT,
            pupil_raw FLOAT,
            Red INTEGER,
            Blue INTEGER,
            class_label INTEGER,
            pupil_gt FLOAT,  -- Optional
            denoised FLOAT   -- Optional
        )
    """)
    
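    # Insert data by column name. The optional columns are filled with NaN
    # when absent, so the frame always matches the 8-column table.
    columns = ['subject_code', 'time', 'pupil_raw', 'Red', 'Blue',
               'class_label', 'pupil_gt', 'denoised']
    df = df.copy()
    for optional in ('pupil_gt', 'denoised'):
        if optional not in df.columns:
            df[optional] = float('nan')
    col_list = ', '.join(f'"{c}"' for c in columns)
    con.register('plr_df', df[columns])
    con.execute(f"INSERT INTO plr_recordings ({col_list}) SELECT {col_list} FROM plr_df")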
    
    # Verify
    count = con.execute("SELECT COUNT(*) FROM plr_recordings").fetchone()[0]
    print(f"Saved {count} rows to {output_path}")
    
    con.close()

# Example (commented out to avoid creating files)
# save_plr_to_duckdb(prepared_df, 'my_plr_data.db')

## 5.5 Run the Pipeline

Once your data is in DuckDB format:

```bash
# 1. Clone the repository
git clone https://github.com/YOUR_REPO/foundation-PLR.git
cd foundation-PLR

# 2. Install dependencies
uv venv --python 3.11
uv sync

# 3. Update config to point to your data
# Edit configs/defaults.yaml:
#   DATA:
#     filename_DuckDB: 'my_plr_data.db'

# 4. Run the pipeline
python src/pipeline_PLR.py

# 5. View results in MLflow
mlflow ui --port 5000
# Open http://localhost:5000 in browser
```

---

# PART 6: Customizing the Featurization (BINS)

> **ELI5 for PIs:**
>
> "Featurization" means extracting meaningful numbers from the pupil curves.
> Instead of using all 1981 data points, we calculate things like:
> - "Maximum constriction was 2.3mm" 
> - "Time to maximum constriction was 0.8 seconds"
> - "Post-illumination pupil response (PIPR) was 15% below baseline"
>
> The "bins" define WHAT to measure and WHEN (time windows).

## 6.1 How Our Features Are Defined

Each feature is defined by:

| Parameter | What it means | Options |
|-----------|--------------|--------|
| `time_from` | Reference point | `onset` (light turns on) or `offset` (light turns off) |
| `time_start` | Window start (seconds relative to reference) | Any number (negative = before) |
| `time_end` | Window end (seconds relative to reference) | Any number |
| `measure` | What to extract | `amplitude` (pupil size) or `timing` (latency) |
| `stat` | How to summarize | `min`, `max`, `mean`, `median`, `AUC` |
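
To show how these five parameters turn a trace into one number, here is a minimal sketch of a feature extractor in plain Python. It is not the pipeline's implementation: `extract_feature` is a hypothetical function, the window is taken as inclusive at both ends, `AUC` uses the trapezoid rule, and `timing` is assumed to mean the time (relative to the reference point) at which the minimum or maximum occurs.

```python
import statistics


def extract_feature(times, pupil, ref_time, time_start, time_end,
                    measure="amplitude", stat="min"):
    """Summarize the pupil trace inside [ref_time+time_start, ref_time+time_end]."""
    lo, hi = ref_time + time_start, ref_time + time_end
    window = [(t, v) for t, v in zip(times, pupil) if lo <= t <= hi]
    if not window:
        raise ValueError(f"no samples between {lo} s and {hi} s")
    ts = [t for t, _ in window]
    vs = [v for _, v in window]

    if measure == "timing":
        # Assumption: latency = when the extreme value is reached
        if stat not in ("min", "max"):
            raise ValueError("timing supports only 'min' or 'max'")
        extreme = min(vs) if stat == "min" else max(vs)
        return ts[vs.index(extreme)] - ref_time

    if stat == "AUC":
        # Trapezoid rule over consecutive samples
        return sum((ts[i + 1] - ts[i]) * (vs[i] + vs[i + 1]) / 2
                   for i in range(len(ts) - 1))

    reducers = {"min": min, "max": max,
                "mean": statistics.fmean, "median": statistics.median}
    return reducers[stat](vs)


# Toy trace: 5 mm baseline, 3 mm during light from t = 5 s, one dip at t = 6 s
times = [i / 30 for i in range(1981)]
pupil = [5.0 if t < 5 else 3.0 for t in times]
pupil[180] = 2.5

onset = 5
print("baseline:", extract_feature(times, pupil, onset, -5, 0, stat="median"))
print("max constriction:", extract_feature(times, pupil, onset, 0, 15, stat="min"))
print("latency:", extract_feature(times, pupil, onset, 0, 5, measure="timing"))
```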

In [None]:
# The features we used (featuresSimple.yaml)

our_features = {
    'BASELINE': {
        'description': 'Pupil size before light stimulus',
        'time_from': 'onset',
        'time_start': -5,   # 5 seconds BEFORE light onset
        'time_end': 0,      # Up to light onset
        'measure': 'amplitude',
        'stat': 'median'
    },
    'MAX_CONSTRICTION': {
        'description': 'Maximum pupil constriction during light',
        'time_from': 'onset',
        'time_start': 0,    # From light onset
        'time_end': 15,     # To 15 seconds after
        'measure': 'amplitude',
        'stat': 'min'       # Minimum = maximum constriction
    },
    'PHASIC': {
        'description': 'Initial rapid constriction (first 5 seconds)',
        'time_from': 'onset',
        'time_start': 0,
        'time_end': 5,
        'measure': 'amplitude',
        'stat': 'min'
    },
    'SUSTAINED': {
        'description': 'Sustained constriction (last 5 seconds of light)',
        'time_from': 'offset',
        'time_start': -5,   # 5 seconds BEFORE light offset
        'time_end': 0,
        'measure': 'amplitude',
        'stat': 'min'
    },
    'PIPR': {
        'description': 'Post-Illumination Pupil Response (after light off)',
        'time_from': 'offset',
        'time_start': 0,    # From light offset
        'time_end': 15,     # To 15 seconds after
        'measure': 'amplitude',
        'stat': 'min'
    },
    'PIPR_AUC': {
        'description': 'Area under PIPR curve',
        'time_from': 'offset',
        'time_start': 0,
        'time_end': 12,
        'measure': 'amplitude',
        'stat': 'AUC'
    },
    'LATENCY': {
        'description': 'Time to reach maximum constriction',
        'time_from': 'onset',
        'time_start': 0,
        'time_end': 5,
        'measure': 'timing',  # Note: timing, not amplitude
        'stat': 'min'
    }
}

print("Our feature definitions:")
print("=" * 70)
for name, params in our_features.items():
    print(f"\n{name}:")
    print(f"  {params['description']}")
    print(f"  Window: {params['time_start']}s to {params['time_end']}s relative to {params['time_from']}")
    print(f"  Measure: {params['measure']} with {params['stat']}")

## 6.2 Creating Custom Features

> **When would you modify features?**
>
> 1. Your recording protocol has different timing (e.g., 10s light instead of 15s)
> 2. You want to test new hypotheses (e.g., "early vs late PIPR")
> 3. Your pupillometer has different sampling rate
> 4. You're studying a different disease with different expected responses

In [None]:
# Example: Create a custom feature config

custom_features_yaml = """
# Save this as configs/PLR_FEATURIZATION/featuresCustom.yaml

FEATURES_METADATA:
  name: 'custom'
  version: 1.0
  feature_method: 'handcrafted_features'
  
FEATURES:
  # Standard baseline
  BASELINE:
    time_from: 'onset'
    time_start: -3      # Only 3 seconds (shorter protocol)
    time_end: 0
    measure: 'amplitude'
    stat: 'median'
    
  # Your custom feature: very early constriction
  EARLY_CONSTRICTION:
    time_from: 'onset'
    time_start: 0
    time_end: 1         # Just first second!
    measure: 'amplitude'
    stat: 'min'
    
  # Your custom feature: late sustained response  
  LATE_SUSTAINED:
    time_from: 'onset'
    time_start: 8       # 8-10 seconds after onset
    time_end: 10
    measure: 'amplitude'
    stat: 'mean'
    
  # Your custom features: early and late PIPR (take their ratio downstream)
  PIPR_EARLY:
    time_from: 'offset'
    time_start: 0
    time_end: 5         # First 5 seconds after light off
    measure: 'amplitude'
    stat: 'mean'
    
  PIPR_LATE:
    time_from: 'offset'
    time_start: 5
    time_end: 15        # 5-15 seconds after light off
    measure: 'amplitude'
    stat: 'mean'
"""

print(custom_features_yaml)
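
A typo in a custom config (a swapped window, a misspelled `stat`) is easy to miss. The sketch below checks a feature dictionary against the options listed in section 6.1 before you run anything. `check_feature_config` is a hypothetical helper; it works on a Python dict shaped like the `FEATURES` block, so loading the YAML file is left to you.

```python
REQUIRED_KEYS = {"time_from", "time_start", "time_end", "measure", "stat"}
ALLOWED = {
    "time_from": {"onset", "offset"},
    "measure": {"amplitude", "timing"},
    "stat": {"min", "max", "mean", "median", "AUC"},
}


def check_feature_config(features):
    """Return a list of problems; an empty list means every feature passed."""
    problems = []
    for name, params in features.items():
        missing = sorted(REQUIRED_KEYS - params.keys())
        if missing:
            problems.append(f"{name}: missing {missing}")
            continue
        for key, allowed in ALLOWED.items():
            if params[key] not in allowed:
                problems.append(
                    f"{name}: {key}={params[key]!r} not in {sorted(allowed)}")
        if params["time_start"] >= params["time_end"]:
            problems.append(f"{name}: time_start must be smaller than time_end")
    return problems


custom = {
    "EARLY_CONSTRICTION": {"time_from": "onset", "time_start": 0, "time_end": 1,
                           "measure": "amplitude", "stat": "min"},
    "PIPR_LATE": {"time_from": "offset", "time_start": 5, "time_end": 15,
                  "measure": "amplitude", "stat": "mean"},
}
print(check_feature_config(custom) or "config looks consistent")
```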

## 6.3 Visualizing Feature Windows

In [None]:
# Visualize what the feature windows mean on a PLR curve

def plot_feature_windows(features_dict):
    """
    Create a visual diagram showing feature time windows on a PLR curve.
    """
    import matplotlib.pyplot as plt
    import matplotlib.patches as mpatches
    
    fig, ax = plt.subplots(figsize=(14, 6))
    
    # Simulate a PLR curve
    t = np.linspace(0, 66, 1981)
    
    # Baseline
    pupil = np.ones_like(t) * 5.0
    
    # Light onset at t=5, offset at t=20 (Red)
    red_on = (t >= 5) & (t <= 20)
    pupil[red_on] = 5.0 - 1.5 * (1 - np.exp(-(t[red_on] - 5) / 0.5))
    
    # Recovery after red
    red_recovery = (t > 20) & (t <= 40)
    pupil[red_recovery] = 3.5 + 1.2 * (1 - np.exp(-(t[red_recovery] - 20) / 3))
    
    # Blue onset at t=40, offset at t=55
    blue_on = (t >= 40) & (t <= 55)
    pupil[blue_on] = 4.7 - 1.8 * (1 - np.exp(-(t[blue_on] - 40) / 0.5))
    
    # PIPR after blue (slower recovery - key glaucoma marker)
    pipr = t > 55
    pupil[pipr] = 2.9 + 1.3 * (1 - np.exp(-(t[pipr] - 55) / 8))
    
    # Plot pupil
    ax.plot(t, pupil, 'k-', linewidth=2, label='Pupil diameter')
    
    # Plot light stimuli
    ax.fill_between(t, 0, 1, where=red_on, alpha=0.3, color='red', 
                    label='Red light', transform=ax.get_xaxis_transform())
    ax.fill_between(t, 0, 1, where=blue_on, alpha=0.3, color='blue',
                    label='Blue light', transform=ax.get_xaxis_transform())
    
    # Plot feature windows
    colors = plt.cm.Set3(np.linspace(0, 1, len(features_dict)))
    y_offset = 1.5  # Start position for annotations
    
    # Reference points
    red_onset = 5
    red_offset = 20
    blue_onset = 40
    blue_offset = 55
    
    for i, (name, params) in enumerate(features_dict.items()):
        # Calculate absolute times (using Blue stimulus as example)
        if params['time_from'] == 'onset':
            ref = blue_onset
        else:
            ref = blue_offset
            
        t_start = ref + params['time_start']
        t_end = ref + params['time_end']
        
        # Draw bracket
        y_pos = y_offset + i * 0.25
        ax.annotate('', xy=(t_start, y_pos), xytext=(t_end, y_pos),
                   arrowprops=dict(arrowstyle='<->', color=colors[i], lw=2))
        ax.text((t_start + t_end) / 2, y_pos + 0.1, name, 
               ha='center', va='bottom', fontsize=8, color=colors[i])
    
    ax.set_xlabel('Time (seconds)', fontsize=12)
    ax.set_ylabel('Pupil diameter (mm)', fontsize=12)
    ax.set_title('Feature Extraction Windows on PLR Curve\n(Blue stimulus shown)', fontsize=14)
    ax.legend(loc='upper right')
    ax.set_xlim(0, 66)
    ax.set_ylim(1, 6)
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

# Plot our features
plot_feature_windows(our_features)

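The window arithmetic used in the plot can also be checked numerically. The sketch below converts a relative window into absolute recording time and flags windows that run past the end of the recording. The helper name `absolute_window` is an assumption for illustration. The stimulus times (blue on at 40 s, off at 55 s) and the 66 s recording length are taken from the simulated curve above.

```python
RECORDING_END = 66.0  # seconds, matching the simulated curve


def absolute_window(params, onset, offset, recording_end=RECORDING_END):
    """Return (t_start, t_end, clipped) in absolute recording time."""
    ref = onset if params["time_from"] == "onset" else offset
    t_start = float(ref + params["time_start"])
    t_end = float(ref + params["time_end"])
    # A window is "clipped" if part of it lies outside the recording
    clipped = t_start < 0.0 or t_end > recording_end
    return max(t_start, 0.0), min(t_end, recording_end), clipped


baseline = {"time_from": "onset", "time_start": -3, "time_end": 0}
pipr_late = {"time_from": "offset", "time_start": 5, "time_end": 15}

baseline_blue = absolute_window(baseline, onset=40, offset=55)
pipr_late_blue = absolute_window(pipr_late, onset=40, offset=55)

print("BASELINE :", baseline_blue)
print("PIPR_LATE:", pipr_late_blue)
```

Under these simulated timings, a window of 5 to 15 s after blue offset would extend to 70 s, past the 66 s recording, so a feature like `PIPR_LATE` would be computed over a shorter span than its definition suggests.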
## 6.4 Extracting Features from Your Data

In [None]:
def extract_feature(pupil, time, light_on, feature_params):
    """
    Extract a single feature from PLR data.
    
    Parameters
    ----------
    pupil : np.array
        Pupil diameter time series
    time : np.array
        Time in seconds
    light_on : np.array
        Binary array (1 when light on, 0 when off)
    feature_params : dict
        Feature definition
        
    Returns
    -------
    dict
        'value', 'std' and 'n_samples' for the window, or 'value' (NaN)
        and 'error' if the feature could not be computed
    """
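    # Find light onset and offset. Cast to int first: np.diff on a boolean
    # array returns booleans, so the == -1 test would never match.
    light_changes = np.diff(np.asarray(light_on).astype(int))
    onset_idx = np.where(light_changes == 1)[0]
    offset_idx = np.where(light_changes == -1)[0]
    
    if len(onset_idx) == 0 or len(offset_idx) == 0:
        return {'value': np.nan, 'error': 'Light stimulus not found'}
    
    # Use the first stimulus in the trace. The +1 is needed because
    # diff index i marks the transition between samples i and i+1.
    onset_time = time[onset_idx[0] + 1]
    offset_time = time[offset_idx[0] + 1]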
    
    # Calculate absolute window
    if feature_params['time_from'] == 'onset':
        ref_time = onset_time
    else:
        ref_time = offset_time
    
    t_start = ref_time + feature_params['time_start']
    t_end = ref_time + feature_params['time_end']
    
    # Extract samples in window
    mask = (time >= t_start) & (time <= t_end)
    window_pupil = pupil[mask]
    window_time = time[mask]
    
    if len(window_pupil) == 0:
        return {'value': np.nan, 'error': 'Empty window'}
    
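    # Compute statistic
    stat = feature_params['stat']
    measure = feature_params['measure']
    
    if measure == 'amplitude':
        if stat == 'min':
            value = np.min(window_pupil)
        elif stat == 'max':
            value = np.max(window_pupil)
        elif stat == 'mean':
            value = np.mean(window_pupil)
        elif stat == 'median':
            value = np.median(window_pupil)
        elif stat == 'AUC':
            # np.trapz was renamed to np.trapezoid in NumPy 2.0
            trapezoid = np.trapezoid if hasattr(np, 'trapezoid') else np.trapz
            value = trapezoid(window_pupil, window_time)
        else:
            # Without this branch, an unknown stat would leave `value` undefined
            return {'value': np.nan, 'error': f'Unknown stat: {stat}'}
            
    elif measure == 'timing':
        # Time to minimum (latency), relative to the reference time
        min_idx = np.argmin(window_pupil)
        value = window_time[min_idx] - ref_time
    else:
        return {'value': np.nan, 'error': f'Unknown measure: {measure}'}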
    
    return {
        'value': value,
        'std': np.std(window_pupil),
        'n_samples': len(window_pupil)
    }


# Example: Extract features from synthetic data
# (In practice, you'd use your actual PLR recordings)

# Generate synthetic PLR
t = np.linspace(0, 66, 1981)
pupil = 5.0 * np.ones_like(t)

# Blue light on at t=40, off at t=55
blue = np.zeros_like(t, dtype=int)
blue[(t >= 40) & (t <= 55)] = 1

# Simulate pupil response (vectorized, same curve as the loop version)
light_mask = (t >= 40) & (t <= 55)
after_mask = t > 55
pupil[light_mask] = 5.0 - 1.8 * (1 - np.exp(-(t[light_mask] - 40) / 0.5))
pupil[after_mask] = 3.2 + 1.0 * (1 - np.exp(-(t[after_mask] - 55) / 5))

# Extract features
print("Extracted features from synthetic PLR:")
print("-" * 50)

for name, params in our_features.items():
    result = extract_feature(pupil, t, blue, params)
    print(f"{name}: {result['value']:.3f}")

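The `n_samples` value returned by `extract_feature` depends on the sampling rate, which matters when adapting windows to a different pupillometer. As a rough sketch, the synthetic time vector above (1981 points over 66 s) corresponds to 30 samples per second. The helpers `sampling_rate` and `samples_in_window` are hypothetical names, written in plain Python, and assume evenly spaced samples. The 120 Hz device is only an example value.

```python
def sampling_rate(n_points, duration_s):
    """Samples per second for evenly spaced samples that include both endpoints."""
    return (n_points - 1) / duration_s


def samples_in_window(t_start, t_end, fs):
    """Approximate number of samples in the closed window [t_start, t_end]."""
    return int(round((t_end - t_start) * fs)) + 1


fs = sampling_rate(1981, 66)             # synthetic trace used in this guide
n_guide = samples_in_window(0, 1, fs)    # 1 s window at the guide's rate
n_fast = samples_in_window(0, 1, 120)    # same window on a hypothetical 120 Hz device

print(f"fs = {fs} Hz, 1 s window: {n_guide} samples (guide) vs {n_fast} samples (120 Hz)")
```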
---

# PART 7: Quick Reference

## SQL Queries Cheatsheet

```sql
-- Get all results for one classifier
SELECT * FROM metrics_aggregate WHERE classifier = 'TabM';

-- Get predictions for analysis
SELECT y_true, y_prob FROM predictions WHERE classifier = 'TabM' AND fold = 0;

-- Get bootstrap distribution for custom CI
SELECT metric_value FROM bootstrap_distributions 
WHERE classifier = 'TabM' AND metric_name = 'auroc';

-- Compare classifiers
SELECT classifier, mean, ci_lower, ci_upper 
FROM metrics_aggregate WHERE metric_name = 'auroc' ORDER BY mean DESC;
```
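Once the bootstrap values have been fetched (for example with the third query above), a percentile confidence interval can be computed directly. The sketch below is a plain-Python illustration: `percentile_ci` is a hypothetical helper, the input values are made up for the example, and the percentile method is one common choice, not necessarily the one used to produce `ci_lower` and `ci_upper` in `metrics_aggregate`.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of an already sorted list."""
    if not sorted_values:
        raise ValueError("no values given")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])


def percentile_ci(values, level=0.95):
    """Two-sided percentile interval at the given confidence level."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0 * 100.0
    return percentile(ordered, tail), percentile(ordered, 100.0 - tail)


# Made-up AUROC values standing in for rows of bootstrap_distributions
boot = [0.80 + 0.001 * i for i in range(101)]
ci_low, ci_high = percentile_ci(boot, level=0.90)
print(f"90% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

Changing `level` gives other interval widths from the same stored distribution without rerunning the bootstrap.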

## File Locations

| File | Purpose |
|------|--------|
| `configs/defaults.yaml` | Main configuration |
| `configs/PLR_FEATURIZATION/*.yaml` | Feature definitions |
| `configs/CLS_MODELS/*.yaml` | Classifier settings |
| `src/featurization/` | Feature extraction code |
| `src/classification/` | Classifier training code |
| `src/stats/` | Biostatistics modules |

## Contact

For questions about this data or pipeline:
- GitHub Issues: [repository URL]
- Email: [contact email]

In [None]:
# Clean up
con.close()
print("Tutorial complete!")