# Pipeline Validation with Synthetic Data

**Purpose:** Validate that our ML pipelines work correctly by testing on synthetic data with known structure.

This notebook uses the **actual pipeline functions** (not raw sklearn) to check that:
1. The pipeline code recovers a known, planted signal, which is evidence against major code bugs
2. Any poor performance on real data is therefore more likely due to data/signal than to code bugs

**Realistic conditions tested:**
- **Imbalanced data** (~20:1 ratio like real ABCD data)
- **Site batch effects** (removed by ComBat harmonization)
- **Downsampling** during training (100 iterations)
- **Testing on full imbalanced data**
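
The downsampling step listed above can be pictured as repeated random undersampling of the control class. The sketch below is a minimal illustration, not the code in `core.svm.pipeline`: the function name `downsample_indices`, the use of one seed per iteration, and the choice to keep every clinical sample are assumptions made for this example.

```python
import random


def downsample_indices(labels, majority="Control", minority="Clinical", seed=0):
    """Return sorted row indices of a balanced subset.

    Keeps every minority row and draws an equal number of majority rows
    without replacement. (Hypothetical helper, for illustration only.)
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    sampled_majority = rng.sample(majority_idx, k=len(minority_idx))
    return sorted(minority_idx + sampled_majority)


# Same class counts as the tests in this notebook: 2000 controls, 100 clinical
labels = ["Control"] * 2000 + ["Clinical"] * 100

# 100 iterations, each with a different random draw of controls
subsets = [downsample_indices(labels, seed=s) for s in range(100)]

n_clinical = sum(labels[i] == "Clinical" for i in subsets[0])
print(len(subsets[0]), n_clinical)  # 200 100
```

Each balanced subset would be used for training only. Evaluation still happens on the full, imbalanced held-out data, which is why the thresholds are stated in balanced accuracy.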

**Tests:**
- Test 1: Strong signal with realistic imbalance (should achieve >80% balanced accuracy)
- Test 2: Moderate signal with realistic imbalance (should achieve >65% balanced accuracy)
- Test 3: Weak signal with realistic imbalance (should beat chance, i.e. >55% balanced accuracy)
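
All thresholds above are stated in balanced accuracy because plain accuracy is misleading at a 20:1 ratio. The following sketch computes both metrics for a trivial classifier that always predicts the majority class. The helper `balanced_accuracy` is written out here for illustration and is not taken from the pipeline code.

```python
def balanced_accuracy(y_true, y_pred, positive="Clinical"):
    """Mean of sensitivity and specificity for a binary problem."""
    pos_preds = [p for t, p in zip(y_true, y_pred) if t == positive]
    neg_preds = [p for t, p in zip(y_true, y_pred) if t != positive]
    sensitivity = sum(p == positive for p in pos_preds) / len(pos_preds)
    specificity = sum(p != positive for p in neg_preds) / len(neg_preds)
    return (sensitivity + specificity) / 2


# Same class counts as the tests in this notebook
y_true = ["Control"] * 2000 + ["Clinical"] * 100
y_pred = ["Control"] * 2100  # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)

print(f"Accuracy:          {accuracy:.3f}")  # 0.952
print(f"Balanced accuracy: {bal_acc:.3f}")   # 0.500
```

A classifier that learns nothing scores about 95% accuracy but exactly 50% balanced accuracy, so the 55%, 65%, and 80% thresholds are all measured against a chance level of 50%.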

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.metrics import roc_curve, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

from core.config import initialize_notebook

# Initialize to get configs
env = initialize_notebook(regenerate_run_id=False)

SEED = env.configs.run['seed']
np.random.seed(SEED)

print(f"Pipeline Validation with Synthetic Data")
print(f"Random Seed: {SEED}")
print(f"Research Question: {env.configs.run['run_name']}")

## Helper Functions

Create synthetic DataFrames that match the expected pipeline format.
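
The helper below applies site effects as a per-feature affine transform, `x * scale + mean`. To show why a location-scale correction can undo that kind of effect, here is a deliberately simplified sketch that z-scores one feature within each site. This is not ComBat (which uses empirical Bayes estimates and can protect covariates). The site parameters and the function names `standardize_by_site` and `site_means` are hypothetical.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical (scale, shift) per scanner site; values chosen only for illustration
site_params = {"Siemens": (1.2, 1.5), "GE": (0.8, -1.0), "Philips": (1.0, 0.3)}

# One feature, 500 subjects per site: x = z * scale + shift
rows = [
    (site, rng.gauss(0.0, 1.0) * scale + shift)
    for site, (scale, shift) in site_params.items()
    for _ in range(500)
]


def standardize_by_site(rows):
    """Z-score one feature within each site (a crude stand-in for ComBat)."""
    values_by_site = {}
    for site, x in rows:
        values_by_site.setdefault(site, []).append(x)
    stats = {
        site: (statistics.fmean(vals), statistics.stdev(vals))
        for site, vals in values_by_site.items()
    }
    return [(site, (x - stats[site][0]) / stats[site][1]) for site, x in rows]


def site_means(rows):
    """Mean of the feature within each site."""
    return {
        site: statistics.fmean(x for s, x in rows if s == site)
        for site in site_params
    }


before = site_means(rows)
after = site_means(standardize_by_site(rows))
for site in site_params:
    print(f"{site}: mean before = {before[site]:+.2f}, after = {after[site]:+.2f}")
```

After standardization every site mean is numerically zero. Naive per-site z-scoring would also erase any class signal that happens to be confounded with site, which is one reason the pipeline relies on ComBat rather than a shortcut like this.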

In [None]:
def create_synthetic_dataframe(
    n_control: int,
    n_clinical: int,
    n_features: int = 100,
    separation: float = 2.0,
    noise_scale: float = 1.0,
    seed: int = 42,
    n_informative: int = 20,
    site_effect_scale: float = 1.5,
) -> pd.DataFrame:
    """
    Create synthetic DataFrame matching pipeline expected format.
    
    Includes realistic site effects that ComBat should harmonize away,
    while preserving the clinical vs control signal.
    
    Args:
        n_control: Number of control samples
        n_clinical: Number of clinical samples
        n_features: Total number of imaging features
        separation: Mean shift between classes (higher = easier to classify)
        noise_scale: Standard deviation of noise
        seed: Random seed
        n_informative: Number of features with actual signal
        site_effect_scale: Magnitude of site batch effects (ComBat should remove these)
    
    Returns:
        DataFrame with imaging features, covariates, and group labels
    """
    rng = np.random.RandomState(seed)
    n_total = n_control + n_clinical
    
    # Define 3 sites with different batch effects
    sites = ['Siemens', 'GE', 'Philips']
    n_sites = len(sites)
    
    # Site-specific effects (mean shifts and scale differences)
    # These should be REMOVED by ComBat harmonization
    site_means = {
        'Siemens': rng.randn(n_features) * site_effect_scale,
        'GE': rng.randn(n_features) * site_effect_scale,
        'Philips': rng.randn(n_features) * site_effect_scale,
    }
    site_scales = {
        'Siemens': 1.0 + rng.uniform(-0.3, 0.3, n_features),
        'GE': 1.0 + rng.uniform(-0.3, 0.3, n_features),
        'Philips': 1.0 + rng.uniform(-0.3, 0.3, n_features),
    }
    
    # Assign subjects to sites (roughly equal distribution)
    site_assignments = rng.choice(sites, size=n_total)
    
    # Create base feature matrix with class signal
    # Control: centered at 0
    X_control = rng.randn(n_control, n_features) * noise_scale
    
    # Clinical: first n_informative features shifted by `separation`
    X_clinical = rng.randn(n_clinical, n_features) * noise_scale
    X_clinical[:, :n_informative] += separation
    
    X = np.vstack([X_control, X_clinical])
    groups = ['Control'] * n_control + ['Clinical'] * n_clinical
    
    # Apply site-specific batch effects
    # (ComBat should remove these while preserving class differences)
    for i in range(n_total):
        site = site_assignments[i]
        X[i, :] = X[i, :] * site_scales[site] + site_means[site]
    
    # Create DataFrame with imaging column names (using dmri prefix)
    imaging_cols = [f'dmri_dtimd_fibat_allfibers_{i}' for i in range(n_features)]
    df = pd.DataFrame(X, columns=imaging_cols)
    
    # Add required metadata columns
    df['src_subject_id'] = [f'NDAR_INV{i:08d}' for i in range(n_total)]
    df['eventname'] = 'baseline_year_1_arm_1'
    
    # Add site column (needed for ComBat harmonization)
    df['mri_info_manufacturer'] = site_assignments
    
    # Add covariates (needed for ComBat)
    # Age: slight site differences (realistic - different recruitment)
    base_age = rng.uniform(108, 132, size=n_total)
    for i, site in enumerate(site_assignments):
        if site == 'Siemens':
            base_age[i] += rng.uniform(-2, 2)
        elif site == 'GE':
            base_age[i] += rng.uniform(-1, 3)
    df['demo_brthdat_v2'] = base_age
    
    # Sex: balanced across sites
    df['demo_sex_v2'] = rng.choice([1, 2], size=n_total)
    df['sex_mapped'] = df['demo_sex_v2'].map({1: 'male', 2: 'female'})
    
    # Add group labels
    df['psych_group'] = groups
    df['anx_group'] = groups  # Same for anxiety
    
    # Add t-score (for anxiety research question compatibility)
    df['cbcl_scr_dsm5_anxdisord_t'] = np.where(
        df['psych_group'] == 'Control',
        rng.uniform(40, 54, size=n_total),
        rng.uniform(70, 90, size=n_total)
    )
    
    # Shuffle to mix classes and sites
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    
    # Print site distribution
    print(f"Site distribution:")
    for site in sites:
        n_site = (df['mri_info_manufacturer'] == site).sum()
        n_ctrl = ((df['mri_info_manufacturer'] == site) & (df['psych_group'] == 'Control')).sum()
        n_clin = ((df['mri_info_manufacturer'] == site) & (df['psych_group'] == 'Clinical')).sum()
        print(f"  {site}: {n_site} total ({n_ctrl} control, {n_clin} clinical)")
    
    return df


def get_synthetic_task_config(name: str = 'synthetic_test') -> dict:
    """Create task config for synthetic classification."""
    return {
        'name': name,
        'positive_class': 'Clinical',
        'negative_class': 'Control',
    }


print("Helper functions defined.")
print("\nSynthetic data includes:")
print("  - Site-specific batch effects (mean shifts + scale differences)")
print("  - Class signal in first N features (clinical shifted from control)")
print("  - ComBat should remove site effects while preserving class signal")

## Test 1: Strong Signal with Realistic Imbalance

Create data with strong class separation but realistic imbalance (~20:1 ratio).
The pipeline should achieve **>80% balanced accuracy** despite the imbalance.

In [None]:
# Create synthetic data with strong signal and realistic imbalance
df_easy = create_synthetic_dataframe(
    n_control=2000,        # ~20:1 imbalance like real ABCD data
    n_clinical=100,
    n_features=100,
    separation=3.0,        # Strong separation
    noise_scale=0.5,       # Low noise
    n_informative=30,      # Many informative features
    site_effect_scale=1.5, # Realistic site effects
    seed=SEED,
)

print(f"\nTest 1: Strong Signal with Realistic Imbalance")
print(f"Total samples: {len(df_easy)}")
print(f"Imbalance ratio (control:clinical): {2000/100:.0f}:1")
print(f"Feature columns: {len([c for c in df_easy.columns if c.startswith('dmri')])}")
print(f"\nThis mirrors real ABCD data structure:")
print(f"  - High imbalance (downsampling will balance training)")
print(f"  - Site effects (ComBat will harmonize)")
print(f"  - Strong signal (should be detectable after preprocessing)")

In [None]:
# Run SVM pipeline on easy synthetic data
from core.svm.pipeline import run_task_with_nested_cv

task_config_easy = get_synthetic_task_config('test1_easy')

print("Running SVM pipeline on linearly separable synthetic data...")
print("(This tests the full pipeline: ComBat → PCA → Downsampling → SVM)")
print()

results_easy = run_task_with_nested_cv(
    env, 
    df_easy, 
    task_config_easy, 
    use_wandb=False, 
    sweep_mode=True  # Suppress file saving
)

In [None]:
# Evaluate Test 1 results
test1_bal_acc = results_easy['svm']['overall']['balanced_accuracy']
test1_roc_auc = results_easy['svm']['overall']['roc_auc']
test1_passed = test1_bal_acc > 0.80  # Slightly lower threshold due to imbalance

print("="*60)
print("TEST 1 RESULTS: Strong Signal + Realistic Imbalance")
print("="*60)
print(f"\nSVM Performance:")
print(f"  Balanced Accuracy: {test1_bal_acc:.3f}")
print(f"  ROC-AUC: {test1_roc_auc:.3f}")
print(f"  Per-fold std: {results_easy['svm']['per_fold']['balanced_accuracy_std']:.3f}")
print(f"\nBaseline (Logistic Regression):")
print(f"  Balanced Accuracy: {results_easy['baseline']['overall']['balanced_accuracy']:.3f}")
print(f"\nExpected: >0.80 balanced accuracy")
print(f"Status: {'PASSED' if test1_passed else 'FAILED'}")
print("="*60)

## Test 2: Moderate Signal with Realistic Imbalance

Create data with moderate signal and realistic imbalance.
The pipeline should achieve **>65% balanced accuracy**.

In [None]:
# Create synthetic data with moderate signal and realistic imbalance
df_medium = create_synthetic_dataframe(
    n_control=2000,        # ~20:1 imbalance
    n_clinical=100,
    n_features=100,
    separation=1.5,        # Moderate separation
    noise_scale=1.0,       # Medium noise
    n_informative=15,      # Fewer informative features
    site_effect_scale=1.5,
    seed=SEED + 1,
)

print(f"\nTest 2: Moderate Signal with Realistic Imbalance")
print(f"Total samples: {len(df_medium)}")
print(f"Imbalance ratio (control:clinical): {2000/100:.0f}:1")

In [None]:
# Run SVM pipeline on medium synthetic data
task_config_medium = get_synthetic_task_config('test2_medium')

print("Running SVM pipeline on moderately separable synthetic data...")
print()

results_medium = run_task_with_nested_cv(
    env, 
    df_medium, 
    task_config_medium, 
    use_wandb=False, 
    sweep_mode=True
)

In [None]:
# Evaluate Test 2 results
test2_bal_acc = results_medium['svm']['overall']['balanced_accuracy']
test2_roc_auc = results_medium['svm']['overall']['roc_auc']
test2_passed = test2_bal_acc > 0.65

print("="*60)
print("TEST 2 RESULTS: Moderate Signal + Realistic Imbalance")
print("="*60)
print(f"\nSVM Performance:")
print(f"  Balanced Accuracy: {test2_bal_acc:.3f}")
print(f"  ROC-AUC: {test2_roc_auc:.3f}")
print(f"  Per-fold std: {results_medium['svm']['per_fold']['balanced_accuracy_std']:.3f}")
print(f"\nBaseline (Logistic Regression):")
print(f"  Balanced Accuracy: {results_medium['baseline']['overall']['balanced_accuracy']:.3f}")
print(f"\nExpected: >0.65 balanced accuracy")
print(f"Status: {'PASSED' if test2_passed else 'FAILED'}")
print("="*60)

## Test 3: Weak Signal with Realistic Imbalance (Hardest)

Create data with weak signal and realistic imbalance - similar to real ABCD data conditions.
The pipeline should **beat chance significantly** (>55% balanced accuracy).
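
To give a rough sense of what "significantly" means at this sample size, the sketch below estimates the standard error of balanced accuracy for a coin-flip classifier. It assumes the overall metric is computed once over all 2000 control and 100 clinical samples and that predictions are independent. Both are simplifying assumptions about how the pipeline aggregates results, and `chance_standard_error` is a hypothetical helper.

```python
import math


def chance_standard_error(n_pos, n_neg):
    """Standard error of balanced accuracy under random guessing.

    Sensitivity and specificity are treated as independent binomial
    proportions with p = 0.5, so each has variance 0.25 / n.
    """
    var_sensitivity = 0.25 / n_pos
    var_specificity = 0.25 / n_neg
    # Balanced accuracy is the mean of the two, hence the factor 0.5
    return 0.5 * math.sqrt(var_sensitivity + var_specificity)


se = chance_standard_error(n_pos=100, n_neg=2000)
z = (0.55 - 0.50) / se

print(f"Standard error at chance: {se:.4f}")  # 0.0256
print(f"z-score of 0.55:          {z:.2f}")   # 1.95
```

Under these assumptions the 0.55 threshold sits about two standard errors above chance, and the uncertainty is dominated by the 100 clinical samples. A result only marginally above 0.55 is therefore borderline rather than decisive.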

In [None]:
# Create synthetic data with weak signal and realistic imbalance (hardest test)
df_hard = create_synthetic_dataframe(
    n_control=2000,        # ~20:1 imbalance
    n_clinical=100,
    n_features=100,
    separation=0.8,        # Weak separation (similar to real data?)
    noise_scale=1.2,       # Higher noise
    n_informative=10,      # Few informative features
    site_effect_scale=1.5,
    seed=SEED + 2,
)

print(f"\nTest 3: Weak Signal with Realistic Imbalance (Hardest)")
print(f"Total samples: {len(df_hard)}")
print(f"Imbalance ratio (control:clinical): {2000/100:.0f}:1")
print(f"\nThis is the closest of the three tests to real ABCD data conditions.")

In [None]:
# Run SVM pipeline on hard synthetic data
task_config_hard = get_synthetic_task_config('test3_hard')

print("Running SVM pipeline on weakly separable synthetic data...")
print()

results_hard = run_task_with_nested_cv(
    env, 
    df_hard, 
    task_config_hard, 
    use_wandb=False, 
    sweep_mode=True
)

In [None]:
# Evaluate Test 3 results
test3_bal_acc = results_hard['svm']['overall']['balanced_accuracy']
test3_roc_auc = results_hard['svm']['overall']['roc_auc']
test3_passed = test3_bal_acc > 0.55  # Must beat chance significantly

print("="*60)
print("TEST 3 RESULTS: Weak Signal + Realistic Imbalance")
print("="*60)
print(f"\nSVM Performance:")
print(f"  Balanced Accuracy: {test3_bal_acc:.3f}")
print(f"  ROC-AUC: {test3_roc_auc:.3f}")
print(f"  Per-fold std: {results_hard['svm']['per_fold']['balanced_accuracy_std']:.3f}")
print(f"\nBaseline (Logistic Regression):")
print(f"  Balanced Accuracy: {results_hard['baseline']['overall']['balanced_accuracy']:.3f}")
print(f"\nExpected: >0.55 balanced accuracy (significantly better than chance)")
print(f"Status: {'PASSED' if test3_passed else 'FAILED'}")
print("="*60)
print(f"\nNote: If real ABCD data achieves similar or worse performance,")
print(f"it suggests the biological signal is weak, not that the pipeline is broken.")

## Test Random Forest Pipeline

In [None]:
# Run Random Forest pipeline on easy synthetic data
from core.randomforest.pipeline import run_task_with_nested_cv as run_rf_task

task_config_rf = get_synthetic_task_config('test_rf_easy')

print("Running Random Forest pipeline on linearly separable synthetic data...")
print()

results_rf = run_rf_task(
    env, 
    df_easy, 
    task_config_rf, 
    use_wandb=False, 
    sweep_mode=True
)

In [None]:
# Evaluate RF results
rf_bal_acc = results_rf['rf']['overall']['balanced_accuracy']
rf_roc_auc = results_rf['rf']['overall']['roc_auc']
rf_passed = rf_bal_acc > 0.80  # Same threshold as SVM for strong signal

print("="*60)
print("RANDOM FOREST TEST RESULTS: Strong Signal + Imbalance")
print("="*60)
print(f"\nRandom Forest Performance:")
print(f"  Balanced Accuracy: {rf_bal_acc:.3f}")
print(f"  ROC-AUC: {rf_roc_auc:.3f}")
print(f"\nExpected: >0.80 balanced accuracy")
print(f"Status: {'PASSED' if rf_passed else 'FAILED'}")
print("="*60)

## Test MLP Pipeline

In [None]:
# Run MLP pipeline on easy synthetic data
from core.mlp.pipeline import run_task_with_nested_cv as run_mlp_task

task_config_mlp = get_synthetic_task_config('test_mlp_easy')

print("Running MLP pipeline on linearly separable synthetic data...")
print()

results_mlp = run_mlp_task(
    env, 
    df_easy, 
    task_config_mlp, 
    use_wandb=False, 
    sweep_mode=True
)

In [None]:
# Evaluate MLP results
mlp_bal_acc = results_mlp['mlp']['overall']['balanced_accuracy']
mlp_roc_auc = results_mlp['mlp']['overall']['roc_auc']
mlp_passed = mlp_bal_acc > 0.75  # Slightly lower - MLP can struggle with small minority class

print("="*60)
print("MLP TEST RESULTS: Strong Signal + Imbalance")
print("="*60)
print(f"\nMLP Performance:")
print(f"  Balanced Accuracy: {mlp_bal_acc:.3f}")
print(f"  ROC-AUC: {mlp_roc_auc:.3f}")
print(f"\nExpected: >0.75 balanced accuracy")
print(f"Status: {'PASSED' if mlp_passed else 'FAILED'}")
print("="*60)

## Validation Summary

In [None]:
# Comprehensive validation summary
print("\n" + "="*80)
print("PIPELINE VALIDATION SUMMARY")
print("="*80)
print("\nAll tests use realistic conditions:")
print("  - 20:1 class imbalance (like real ABCD data)")
print("  - Site batch effects (harmonized by ComBat)")
print("  - Downsampling during training")
print("  - Testing on full imbalanced data")

all_tests = [
    ("SVM - Strong signal", test1_bal_acc, 0.80, test1_passed),
    ("SVM - Moderate signal", test2_bal_acc, 0.65, test2_passed),
    ("SVM - Weak signal", test3_bal_acc, 0.55, test3_passed),
    ("Random Forest - Strong", rf_bal_acc, 0.80, rf_passed),
    ("MLP - Strong", mlp_bal_acc, 0.75, mlp_passed),
]

print(f"\n{'Test':<30} | {'Bal Acc':^10} | {'Threshold':^10} | {'Status':^10}")
print("-" * 70)

all_passed = True
for test_name, bal_acc, threshold, passed in all_tests:
    status = "PASSED" if passed else "FAILED"
    print(f"{test_name:<30} | {bal_acc:^10.3f} | {threshold:^10.2f} | {status:^10}")
    if not passed:
        all_passed = False

print("="*80)
if all_passed:
    print("\nALL PIPELINE VALIDATION TESTS PASSED!")
    print("\nConclusion:")
    print("  - The pipeline (ComBat → PCA → Downsampling → ML) works correctly")
    print("  - It can detect signal when signal exists, even with 20:1 imbalance")
    print("  - Poor performance on real ABCD data more likely reflects weak biological signal than code bugs")
else:
    print("\nSOME TESTS FAILED - Review pipeline implementations")
print("="*80)

In [None]:
# Create comprehensive summary visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Pipeline Validation Results (20:1 Imbalance)', fontsize=14, fontweight='bold')

# 1. SVM by signal strength
ax = axes[0, 0]
difficulties = ['Strong\n(sep=3.0)', 'Moderate\n(sep=1.5)', 'Weak\n(sep=0.8)']
svm_accs = [test1_bal_acc, test2_bal_acc, test3_bal_acc]
thresholds = [0.80, 0.65, 0.55]
colors = ['green' if a > t else 'red' for a, t in zip(svm_accs, thresholds)]

bars = ax.bar(difficulties, svm_accs, color=colors, alpha=0.7)
for i, t in enumerate(thresholds):
    ax.axhline(t, xmin=i/3, xmax=(i+1)/3, color='black', linestyle='--', alpha=0.5)
ax.set_ylabel('Balanced Accuracy')
ax.set_title('SVM: Performance by Signal Strength\n(all with 20:1 imbalance)')
ax.set_ylim([0.4, 1])
ax.axhline(0.5, color='red', linestyle=':', alpha=0.3, label='Chance')
ax.legend()

# 2. Model comparison on strong signal data
ax = axes[0, 1]
models = ['SVM', 'Random\nForest', 'MLP']
model_accs = [test1_bal_acc, rf_bal_acc, mlp_bal_acc]
model_thresholds = [0.80, 0.80, 0.75]
model_colors = ['green' if a > t else 'red' for a, t in zip(model_accs, model_thresholds)]

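ax.bar(models, model_accs, color=model_colors, alpha=0.7)
# Thresholds differ by model (MLP uses 0.75), so draw one dashed segment per bar
for i, t in enumerate(model_thresholds):
    ax.axhline(t, xmin=i/3, xmax=(i+1)/3, color='black', linestyle='--', alpha=0.5,
               label='Threshold' if i == 0 else '_nolegend_')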
ax.set_ylabel('Balanced Accuracy')
ax.set_title('Model Comparison: Strong Signal\n(20:1 imbalance)')
ax.set_ylim([0.4, 1])
ax.legend()

# 3. ROC-AUC comparison
ax = axes[0, 2]
models_full = ['SVM\nStrong', 'SVM\nMod', 'SVM\nWeak', 'RF', 'MLP']
aucs = [test1_roc_auc, test2_roc_auc, test3_roc_auc, rf_roc_auc, mlp_roc_auc]

ax.bar(models_full, aucs, color='steelblue', alpha=0.7)
ax.axhline(0.5, color='red', linestyle=':', alpha=0.5, label='Chance')
ax.set_ylabel('ROC-AUC')
ax.set_title('ROC-AUC Across Tests')
ax.set_ylim([0.4, 1])
ax.legend()

# 4. Test results summary
ax = axes[1, 0]
test_names = ['SVM\nStrong', 'SVM\nMod', 'SVM\nWeak', 'RF', 'MLP']
test_results = [test1_passed, test2_passed, test3_passed, rf_passed, mlp_passed]
result_colors = ['green' if r else 'red' for r in test_results]

ax.bar(test_names, [1 if r else 0 for r in test_results], color=result_colors, alpha=0.7)
ax.set_ylabel('Status')
ax.set_title('Test Pass/Fail Summary')
ax.set_ylim([0, 1.3])
ax.set_yticks([0, 1])
ax.set_yticklabels(['Failed', 'Passed'])

for i, (name, status) in enumerate(zip(test_names, test_results)):
    ax.text(i, 0.5, '\u2713' if status else '\u2717', ha='center', va='center',
            fontsize=24, color='white', fontweight='bold')

# 5. Per-fold variance
ax = axes[1, 1]
fold_stds = [
    results_easy['svm']['per_fold']['balanced_accuracy_std'],
    results_medium['svm']['per_fold']['balanced_accuracy_std'],
    results_hard['svm']['per_fold']['balanced_accuracy_std'],
]
ax.bar(['Strong', 'Moderate', 'Weak'], fold_stds, color='purple', alpha=0.7)
ax.set_ylabel('Std Dev (Balanced Accuracy)')
ax.set_title('SVM Per-Fold Variance')
ax.set_ylim([0, max(fold_stds) * 1.5 if max(fold_stds) > 0 else 0.1])

# 6. Summary table
ax = axes[1, 2]
ax.axis('off')

table_data = [
    ['Test', 'Model', 'Bal Acc', 'ROC-AUC', 'Status'],
    ['Strong', 'SVM', f'{test1_bal_acc:.3f}', f'{test1_roc_auc:.3f}', 'PASS' if test1_passed else 'FAIL'],
    ['Moderate', 'SVM', f'{test2_bal_acc:.3f}', f'{test2_roc_auc:.3f}', 'PASS' if test2_passed else 'FAIL'],
    ['Weak', 'SVM', f'{test3_bal_acc:.3f}', f'{test3_roc_auc:.3f}', 'PASS' if test3_passed else 'FAIL'],
    ['Strong', 'RF', f'{rf_bal_acc:.3f}', f'{rf_roc_auc:.3f}', 'PASS' if rf_passed else 'FAIL'],
    ['Strong', 'MLP', f'{mlp_bal_acc:.3f}', f'{mlp_roc_auc:.3f}', 'PASS' if mlp_passed else 'FAIL'],
]

table = ax.table(cellText=table_data, cellLoc='center', loc='center',
                 colWidths=[0.18, 0.15, 0.18, 0.18, 0.15])
table.auto_set_font_size(False)
table.set_fontsize(9)
table.scale(1, 1.8)

# Style header
for i in range(len(table_data[0])):
    table[(0, i)].set_facecolor('#4CAF50')
    table[(0, i)].set_text_props(weight='bold', color='white')

# Color status column
for i in range(1, len(table_data)):
    status_cell = table[(i, 4)]
    if table_data[i][4] == 'PASS':
        status_cell.set_facecolor('#c8e6c9')
    else:
        status_cell.set_facecolor('#ffcdd2')

ax.set_title('Results Summary', fontweight='bold', pad=20)

plt.tight_layout()
plt.show()

print("\nValidation complete!")

## Interpretation

**All tests use realistic ABCD-like conditions:**
- 20:1 class imbalance (2000 controls vs 100 clinical)
- Site batch effects that ComBat must harmonize
- Downsampling during training (100 iterations)
- Testing on full imbalanced test set

**If all tests pass:**
- The full pipeline (ComBat → PCA → Downsampling → ML) works correctly
- It can detect signal even with severe class imbalance
- Poor performance on real ABCD data more likely reflects **weak biological signal** than code bugs

**If tests fail:**
- There may be bugs in the pipeline
- Check: ComBat harmonization, PCA, downsampling logic, threshold optimization

**Comparing to real ABCD results:**
- If real data performs similarly to the "Weak signal" test → biological signal is weak
- If real data performs worse than "Weak signal" test → may indicate no real signal or confounds
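
As a reference point when reading the thresholds, the sketch below estimates the best balanced accuracy an ideal linear classifier could reach on each synthetic dataset, using the `separation`, `noise_scale`, and `n_informative` values from the three SVM tests. It assumes independent Gaussian features with equal variance in both classes, perfectly removed site effects, and a decision threshold at the midpoint between class means. The real pipeline (PCA, downsampling, finite samples) will fall short of this bound, and the helper names are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ideal_balanced_accuracy(separation, noise_scale, n_informative):
    """Balanced accuracy of the optimal midpoint-threshold linear classifier.

    Per-feature effect size is separation / noise_scale; with n_informative
    independent features the Mahalanobis distance grows with sqrt(n).
    """
    per_feature_d = separation / noise_scale
    mahalanobis = per_feature_d * math.sqrt(n_informative)
    return normal_cdf(mahalanobis / 2.0)


# (separation, noise_scale, n_informative) as used in Tests 1-3
tests = {
    "strong": (3.0, 0.5, 30),
    "moderate": (1.5, 1.0, 15),
    "weak": (0.8, 1.2, 10),
}

ceilings = {name: ideal_balanced_accuracy(*params) for name, params in tests.items()}
for name, value in ceilings.items():
    print(f"{name:<9} ideal balanced accuracy = {value:.3f}")
```

Under these idealized assumptions even the "weak" configuration has a ceiling of roughly 0.85, well above its 0.55 pass threshold. A pipeline that only barely clears 0.55 on the weak test is recovering a small fraction of the planted signal, which is worth keeping in mind when comparing against real ABCD results.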