# Phase 6: Cross-Dataset Evaluation & Generalization

## 1. Objective
This notebook evaluates the **robustness** of learned PPG representations by testing models on unseen datasets. Specifically, we compare:
- **Supervised Only**: Models trained on Dataset A -> Tested on Dataset B.
- **SSL Fine-Tuned**: Models pre-trained via SSL on Dataset A -> Tested on Dataset B.

## 2. Experimental Setup
We use three distinct datasets representing different domains:
- **BIDMC**: Clinical, Finger-tip, Clean.
- **PPG-DaLiA**: Wearable, Wrist, High Motion Noise.
- **WESAD**: Wearable, Wrist, BVP.

We load the results generated by `cross_dataset_test.py`.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path

# Load Results from Phase 6 Benchmark
csv_path = Path("generalization_results_phase6.csv")
if not csv_path.exists():
    raise FileNotFoundError("Please run 'python cross_dataset_test.py' first to generate results.")

df = pd.read_csv(csv_path)
print("Loaded Generalization Results:")
print(df.head())
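
The plotting code in this notebook expects the columns `Method`, `Train Dataset`, `Test Dataset`, and `MAE`. A minimal sketch of a schema check is shown here, assuming those four columns are all that is required; the helper `validate_results` and the one-row demo frame are hypothetical and carry no real results.

```python
import pandas as pd

# Column names taken from the plotting code used later in this notebook.
REQUIRED_COLUMNS = ("Method", "Train Dataset", "Test Dataset", "MAE")


def validate_results(results: pd.DataFrame) -> pd.DataFrame:
    """Return the frame unchanged, or raise ValueError if it cannot be plotted."""
    missing = [c for c in REQUIRED_COLUMNS if c not in results.columns]
    if missing:
        raise ValueError(f"Results file is missing columns: {missing}")
    if not pd.api.types.is_numeric_dtype(results["MAE"]):
        raise ValueError("Column 'MAE' must be numeric.")
    if (results["MAE"] < 0).any():
        raise ValueError("MAE values must be non-negative.")
    return results


# Placeholder row for illustration only (not a measured result).
demo = pd.DataFrame(
    {"Method": ["demo"], "Train Dataset": ["A"], "Test Dataset": ["B"], "MAE": [1.0]}
)
validate_results(demo)
```

In the notebook itself this would be called as `validate_results(df)` right after loading the CSV.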

## 3. Generalization Matrix (Heatmap)
We visualize the cross-dataset Mean Absolute Error (MAE) as one train-by-test heatmap per method.
- **Diagonal**: In-Domain Performance (Train on A, Test on A).
- **Off-Diagonal**: Cross-Domain Performance (Generalization).

Lower MAE (lighter colors) is better.
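
The diagonal/off-diagonal split can also be summarized numerically. The sketch below averages in-domain and cross-domain MAE per method and reports their difference as a generalization gap. It assumes the same column names as the heatmap code; the helper `summarize_domains` and the toy frame are hypothetical, and the toy numbers are placeholders, not results.

```python
import pandas as pd


def summarize_domains(results: pd.DataFrame) -> pd.DataFrame:
    """Mean in-domain and cross-domain MAE per method, plus their difference."""
    same_domain = results["Train Dataset"] == results["Test Dataset"]
    tagged = results.assign(
        Setting=same_domain.map({True: "In-Domain", False: "Cross-Domain"})
    )
    summary = tagged.pivot_table(
        index="Method", columns="Setting", values="MAE", aggfunc="mean"
    )
    # Fix the column order; a missing setting shows up as NaN.
    summary = summary.reindex(columns=["In-Domain", "Cross-Domain"])
    # Positive gap: the method is worse on unseen datasets than on its own.
    summary["Gap"] = summary["Cross-Domain"] - summary["In-Domain"]
    return summary


# Placeholder values for illustration only.
toy = pd.DataFrame(
    {
        "Method": ["demo"] * 4,
        "Train Dataset": ["A", "A", "B", "B"],
        "Test Dataset": ["A", "B", "A", "B"],
        "MAE": [1.0, 3.0, 5.0, 2.0],
    }
)
summary = summarize_domains(toy)
print(summary)
```

Calling `summarize_domains(df)` on the loaded results gives one row per method, which makes the gap easier to compare than reading it off the heatmaps.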

In [None]:
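def plot_generalization_matrix(method_name, ax):
    """Draw the train-vs-test MAE heatmap for one method on the given axes."""
    subset = df[df['Method'] == method_name]
    if subset.empty:
        return

    # Rows are source (train) datasets, columns are target (test) datasets.
    # pivot_table averages MAE if a (train, test) pair appears more than once.
    pivot = subset.pivot_table(index='Train Dataset', columns='Test Dataset', values='MAE')
    # The reversed viridis colormap maps low MAE to light colors.
    sns.heatmap(pivot, annot=True, fmt=".2f", cmap="viridis_r", cbar=False, ax=ax)
    ax.set_title(f"{method_name} - MAE Matrix")
    ax.set_ylabel("Train Dataset (Source)")
    ax.set_xlabel("Test Dataset (Target)")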

methods = df['Method'].unique()
# squeeze=False always returns a 2-D array of axes, so a single method needs no special case.
fig, axes = plt.subplots(1, len(methods), figsize=(6 * len(methods), 5), sharey=True, squeeze=False)

for ax, method in zip(axes[0], methods):
    plot_generalization_matrix(method, ax)

plt.tight_layout()
plt.show()

## 4. Key Findings
### Domain Gap Analysis
- **Clinical -> Wearable**: Models trained on BIDMC (clean) typically fail on PPG-DaLiA (noisy) because they never encounter motion artifacts during training.
- **Wearable -> Clinical**: Models trained on PPG-DaLiA typically generalize better to BIDMC, likely because training on noisy data forces them to separate signal from noise.
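
The two transfer directions can be compared directly from the results table. This is a minimal sketch, assuming the same column names as the heatmap code. The helper `transfer_asymmetry`, the method name `demo`, and the dataset labels `A` and `B` are hypothetical, and the toy numbers are placeholders, not results.

```python
import pandas as pd


def transfer_asymmetry(results: pd.DataFrame, method: str, a: str, b: str) -> float:
    """MAE(a -> b) minus MAE(b -> a) for one method.

    A positive value means transfer from a to b is harder than from b to a.
    """
    subset = results[results["Method"] == method]
    matrix = subset.pivot_table(
        index="Train Dataset", columns="Test Dataset", values="MAE"
    )
    return float(matrix.loc[a, b] - matrix.loc[b, a])


# Placeholder values for illustration only.
toy = pd.DataFrame(
    {
        "Method": ["demo"] * 4,
        "Train Dataset": ["A", "A", "B", "B"],
        "Test Dataset": ["A", "B", "A", "B"],
        "MAE": [1.0, 3.0, 5.0, 2.0],
    }
)
asymmetry = transfer_asymmetry(toy, "demo", "A", "B")
print(asymmetry)
```

On the real results, `a` and `b` would be whatever strings the CSV uses for BIDMC and PPG-DaLiA; the exact labels written by `cross_dataset_test.py` are not shown in this notebook, so check `df['Train Dataset'].unique()` first.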

### Comparison: Supervised vs. SSL
- **SSL Stability**: SSL pre-training (domain-aware) generally reduces the generalization gap by learning more robust morphological features that persist across sensor types.
- **Performance**: Compare the off-diagonal cells of the heatmaps. A lower MAE for the SSL model on the same (train, test) pair indicates better transferability.
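
To compare methods pair by pair rather than by eye, the off-diagonal entries can be placed side by side. This is a sketch under the same column-name assumptions as the heatmap code. The helper `compare_methods_cross_domain` and the method labels `baseline` and `ssl` are hypothetical stand-ins for whatever the `Method` column actually contains, and the toy numbers are placeholders, not results.

```python
import pandas as pd


def compare_methods_cross_domain(results: pd.DataFrame) -> pd.DataFrame:
    """One row per (train, test) pair with train != test, one column per method."""
    cross = results[results["Train Dataset"] != results["Test Dataset"]]
    return cross.pivot_table(
        index=["Train Dataset", "Test Dataset"], columns="Method", values="MAE"
    )


# Placeholder values for illustration only.
toy = pd.DataFrame(
    {
        "Method": ["baseline"] * 4 + ["ssl"] * 4,
        "Train Dataset": ["A", "A", "B", "B"] * 2,
        "Test Dataset": ["A", "B", "A", "B"] * 2,
        "MAE": [1.0, 4.0, 6.0, 2.0, 1.0, 3.0, 5.0, 2.0],
    }
)
table = compare_methods_cross_domain(toy)
# Positive values mean the second method has lower cross-domain MAE on that pair.
improvement = table["baseline"] - table["ssl"]
print(table)
print(improvement)
```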