# Batch Processing Testing - Phase 4B

This notebook demonstrates and tests the new batch processing infrastructure for analyzing multiple subjects efficiently.

## Features Tested
- Parallel processing with automatic core detection
- Progress monitoring with progress bars
- Error handling per subject (failures don't stop the batch)
- Batch export to CSV/TSV for group analysis
- Performance comparison: sequential vs parallel

## ‚ö†Ô∏è Important: Binary Lesion Masks Required

**RegionalDamage analysis requires binary lesion masks (values 0 and 1 only).**

Real neuroimaging data often contains continuous probability values. This notebook includes a binarization step to convert continuous lesion maps to binary masks before analysis.

## 1. Setup and Imports

In [1]:
import sys

sys.path.insert(0, "/home/marvin/projects/lesion_decoding_toolkit/src")

from pathlib import Path
import time
import pandas as pd

from ldk import LesionData, batch_process
from ldk.analysis import RegionalDamage, AtlasAggregation
from ldk.io import export_results_to_csv, batch_export_to_csv

print("‚úì All imports successful")

‚úì All imports successful


## 2. Configure Lesion File Paths

Update this cell with paths to your lesion files. You can use:
- List of file paths
- Glob pattern to find files
- BIDS dataset directory

In [24]:
lesion_paths[0]

PosixPath('/media/moritz/Storage2/projects_marvin/202509_PSCI_DISCONNECTIVITY/data/raw/lesion_masks/acuteinfarct/SB-11177-1_infarct.nii.gz')

In [2]:
from pathlib import Path

# Option 1: Explicit list of lesion files
lesion_dir = Path(
    "/media/moritz/Storage2/projects_marvin/202509_PSCI_DISCONNECTIVITY/data/raw/lesion_masks/acuteinfarct/"
)

# Option 2: Use glob pattern to find all lesion files
lesion_paths = list(lesion_dir.glob("*.nii.gz"))[:100]

# Option 3: Find files matching a pattern
# lesion_dir = Path("/path/to/lesions")
# lesion_paths = list(lesion_dir.glob("sub-*_lesion.nii.gz"))

print(f"Found {len(lesion_paths)} lesion files")
for i, path in enumerate(lesion_paths[:5], 1):
    print(f"  {i}. {Path(path).name}")
if len(lesion_paths) > 5:
    print(f"  ... and {len(lesion_paths) - 5} more files")

Found 100 lesion files
  1. SB-11177-1_infarct.nii.gz
  2. SB-08224-1_infarct.nii.gz
  3. GRECOGVASC214_infarct.nii.gz
  4. STRIDE319_infarct.nii.gz
  5. USCOG_047_infarct.nii.gz
  ... and 95 more files


## 3. Load Lesion Data

Load all lesion files into LesionData objects. This step validates each file and prepares them for batch processing.

In [4]:
from ldk.core import LesionData

print("Loading lesion files...")
lesions = []
failed_loads = []

for i, lesion_path in enumerate(lesion_paths):
    try:
        # Extract subject ID from filename
        subject_id = Path(lesion_path).stem.split("_")[0]  # Adjust based on your naming convention

        # Load lesion
        lesion = LesionData.from_nifti(lesion_path)

        # Add subject ID to metadata
        lesion.metadata["subject_id"] = subject_id

        lesions.append(lesion)
        print(f"‚úì Loaded {subject_id}: {Path(lesion_path).name}")
    except Exception as e:
        failed_loads.append((lesion_path, str(e)))
        print(f"‚úó Failed to load {Path(lesion_path).name}: {e}")

print("\n" + "=" * 60)
print(f"Successfully loaded: {len(lesions)} subjects")
print(f"Failed to load: {len(failed_loads)} subjects")
print("=" * 60)

Loading lesion files...
‚úì Loaded SB-11177-1: SB-11177-1_infarct.nii.gz
‚úì Loaded SB-08224-1: SB-08224-1_infarct.nii.gz
‚úì Loaded GRECOGVASC214: GRECOGVASC214_infarct.nii.gz
‚úì Loaded STRIDE319: STRIDE319_infarct.nii.gz
‚úì Loaded USCOG: USCOG_047_infarct.nii.gz
‚úì Loaded GRECOGVASC380: GRECOGVASC380_infarct.nii.gz
‚úì Loaded H0018: H0018_infarct.nii.gz
‚úì Loaded L131: L131_infarct.nii.gz
‚úì Loaded CROMIS057: CROMIS057_infarct.nii.gz
‚úì Loaded H0366: H0366_infarct.nii.gz
‚úì Loaded H0715: H0715_infarct.nii.gz
‚úì Loaded STRIDE357: STRIDE357_infarct.nii.gz
‚úì Loaded CROMIS030: CROMIS030_infarct.nii.gz
‚úì Loaded SB-04124-1: SB-04124-1_infarct.nii.gz
‚úì Loaded DEDEMAS016: DEDEMAS016_infarct.nii.gz
‚úì Loaded GRECOGVASC423: GRECOGVASC423_infarct.nii.gz
‚úì Loaded SB-11272-1: SB-11272-1_infarct.nii.gz
‚úì Loaded STRIDE121: STRIDE121_infarct.nii.gz
‚úì Loaded SB-03783-1: SB-03783-1_infarct.nii.gz
‚úì Loaded CODECS013: CODECS013_infarct.nii.gz
‚úì Loaded SB-11534-1: SB-11534-1_infa

## 4. Test Batch Processing - Sequential Mode

First, let's run in sequential mode (n_jobs=1) to establish a baseline and ensure everything works.

In [None]:
from ldk.analysis import RegionalDamage
import time
from ldk.batch import batch_process

# Create analysis instance
analysis = RegionalDamage(
    atlas_names=["Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm"]
)  # Uses bundled atlases by default

print(f"Analysis: {analysis.__class__.__name__}")
print(f"Batch strategy: {analysis.batch_strategy}")
print(f"Number of subjects: {len(lesions)}")
print("\n‚ö†Ô∏è  Using BINARY lesions (binarized in previous cell)")
print("   RegionalDamage requires binary masks (0/1 only)\n")
print("Starting sequential processing...\n")

# Run batch processing in sequential mode with BINARY lesions
start_time = time.time()
results_sequential = batch_process(
    lesion_data_list=lesions,  # ‚Üê Use binary lesions!
    analysis=analysis,
    n_jobs=8,  # Sequential processing
    show_progress=True,
    backend="loky",
)
sequential_time = time.time() - start_time

print(f"\n‚úì Sequential processing complete!")
print(f"Processed: {len(results_sequential)}/{len(lesions)} subjects")
print(f"Time: {sequential_time:.2f} seconds ({sequential_time / len(lesions):.2f}s per subject)")

In [23]:
results_sequential[0].metadata

{'subject_id': 'sub-unknown'}

In [21]:
results_sequential[0].results

{'RegionalDamage': {'Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_1': 9.188416160171613,
  'Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_2': 47.30861244019139,
  'Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_3': 5.2526062550120285,
  'Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_4': 86.84127485011045,
  'Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_5': 74.46364719904648,
  'Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_6': 2.7591085956844714,
  'Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_7': 37.49157113958193,
  'Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_8': 0.0,
  'Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_9': 36.74630261660978,
  'Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_10': 59.08045977011495,
  'Schaefer2018_400Parc

### Diagnose Issue: Check Lesion Data Values

Let's inspect the lesion data to see if they're binary or continuous values.

In [None]:
import numpy as np

# Check the first few lesions for their value distribution
print("Checking lesion data values...\n")

for i, lesion in enumerate(lesions[:3]):
    lesion_array = lesion.lesion_img.get_fdata()
    unique_vals = np.unique(lesion_array)
    is_binary = np.all(np.isin(unique_vals, [0, 1]))

    subject_id = lesion.metadata.get("subject_id", f"subject_{i}")
    print(f"Subject {subject_id}:")
    print(f"  Shape: {lesion_array.shape}")
    print(f"  Data type: {lesion_array.dtype}")
    print(f"  Value range: [{lesion_array.min():.4f}, {lesion_array.max():.4f}]")
    print(f"  Unique values: {len(unique_vals)} unique")
    print(f"  Is binary (0/1): {is_binary}")
    if not is_binary and len(unique_vals) <= 10:
        print(f"  Unique values: {unique_vals}")
    print()

# Check if lesions need binarization
needs_binarization = not all(
    np.all(np.isin(np.unique(l.lesion_img.get_fdata()), [0, 1]))
    for l in lesions[:10]  # Check first 10
)

if needs_binarization:
    print("‚ö†Ô∏è  ISSUE FOUND: Lesion masks are not binary!")
    print("   RegionalDamage requires binary masks (0 and 1 only)")
    print("   Real lesion data often contains continuous probability values")
    print("\nüí° SOLUTION: Binarize the lesions before processing")
else:
    print("‚úÖ Lesion masks are binary - should work fine")

### Solution: Binarize Lesions

Since `RegionalDamage` requires binary masks, we need to binarize the continuous probability maps.

In [None]:
import nibabel as nib


def binarize_lesion(lesion_data, threshold=0.5):
    """Binarize a lesion mask using a threshold."""
    lesion_array = lesion_data.lesion_img.get_fdata()
    binary_array = (lesion_array > threshold).astype(np.uint8)

    # Create new binary image
    binary_img = nib.Nifti1Image(binary_array, lesion_data.lesion_img.affine)

    # Create new LesionData with binary mask
    binary_lesion = LesionData(
        lesion_img=binary_img,
        anatomical_img=lesion_data.anatomical_img,
        metadata=lesion_data.metadata.copy(),
        results=lesion_data.results.copy(),
        provenance=lesion_data.provenance.copy(),
    )

    return binary_lesion


# Binarize all lesions
print(f"Binarizing {len(lesions)} lesions (threshold=0.5)...\n")
lesions = []

for lesion in lesions:
    binary_lesion = binarize_lesion(lesion, threshold=0.5)
    lesions.append(binary_lesion)

    # Show progress for first few
    if len(lesions) <= 3:
        subject_id = lesion.metadata.get("subject_id", "unknown")
        orig_nonzero = np.sum(lesion.lesion_img.get_fdata() > 0)
        binary_nonzero = np.sum(binary_lesion.lesion_img.get_fdata() > 0)
        print(f"‚úì {subject_id}: {orig_nonzero} ‚Üí {binary_nonzero} non-zero voxels")

print(f"\n‚úÖ Binarized {len(lesions)} lesions")
print("   Now ready for RegionalDamage analysis!")

In [13]:
results_sequential

[]

## 5. Test Batch Processing - Parallel Mode

Now let's use parallel processing to leverage all CPU cores.

**‚ú® Solution for Jupyter:** Use `backend='threading'` to enable parallel processing in notebooks without pickling issues!

### Understanding Backend Options

The `backend` parameter controls how parallel processing works:

- **`'threading'`** (recommended for Jupyter):
  - ‚úÖ Works perfectly in Jupyter notebooks (no pickling issues)
  - ‚úÖ No serialization overhead
  - ‚ö†Ô∏è Limited by Python's Global Interpreter Lock (GIL)
  - üìä Still provides good speedup for I/O-bound operations
  
- **`'loky'`** (best for standalone scripts):
  - ‚úÖ True multiprocessing (no GIL limitation)
  - ‚úÖ Best performance for CPU-bound operations
  - ‚ùå Requires pickling (may fail in Jupyter)
  - üìä Provides maximum speedup (4-8x on multi-core systems)

For this notebook, we'll use `backend='threading'` to ensure everything works smoothly!

In [None]:
import os

# Detect available cores
n_cores = len(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else os.cpu_count()
print(f"Available CPU cores: {n_cores}")
print(f"Number of subjects: {len(lesions)}")
print("\nStarting parallel processing...\n")

# Run batch processing in parallel mode with BINARY lesions
# Using backend='threading' for Jupyter compatibility (no pickling issues!)
# Note: 'threading' uses threads (GIL-limited) vs 'loky' uses processes (true parallelism)
# For standalone scripts, you can use backend='loky' for better performance

start_time = time.time()
results_parallel = batch_process(
    lesion_data_list=lesions,  # ‚Üê Use binary lesions!
    analysis=analysis,
    n_jobs=-1,  # Use all cores
    show_progress=True,
    backend="threading",  # ‚ú® Works in Jupyter! Use 'loky' in standalone scripts
)
parallel_time = time.time() - start_time

print(f"\n‚úì Parallel processing complete!")
print(f"Processed: {len(results_parallel)}/{len(lesions)} subjects")
print(f"Time: {parallel_time:.2f} seconds ({parallel_time / len(lesions):.2f}s per subject)")
print(f"\nüöÄ Speedup: {sequential_time / parallel_time:.2f}x faster than sequential")

## 6. Inspect Results

Let's look at the results from one subject to verify the analysis worked correctly.

In [25]:
results_parallel = results_sequential

In [26]:
if results_parallel:
    # Get first result
    sample_result = results_parallel[0]
    subject_id = sample_result.metadata.get("subject_id", "unknown")

    print(f"Subject: {subject_id}")
    print(f"\nAvailable analyses: {list(sample_result.results.keys())}")

    # Show AtlasAggregation results (from RegionalDamage)
    if "AtlasAggregation" in sample_result.results:
        regional_results = sample_result.results["AtlasAggregation"]
        print(f"\nNumber of regions analyzed: {len(regional_results)}")

        # Show top 10 most damaged regions
        sorted_regions = sorted(regional_results.items(), key=lambda x: x[1], reverse=True)
        print("\nTop 10 most damaged regions:")
        for i, (region, damage_pct) in enumerate(sorted_regions[:10], 1):
            print(f"  {i}. {region}: {damage_pct:.1f}%")
else:
    print("No results available (all subjects may have failed)")

Subject: sub-unknown

Available analyses: ['RegionalDamage']


## 7. Export Results to CSV/TSV

Export batch results to CSV for group-level statistical analysis.

In [27]:
# Create output directory
output_dir = Path("~/_tmp/batch_results").expanduser()
output_dir.mkdir(parents=True, exist_ok=True)

# Export to CSV using batch export
csv_path = output_dir / "batch_regional_damage.csv"
batch_export_to_csv(results_parallel, csv_path)

print(f"‚úì Results exported to: {csv_path}")
print(f"File size: {csv_path.stat().st_size / 1024:.1f} KB")

# Also export to TSV (BIDS-compatible)
tsv_path = output_dir / "batch_regional_damage.tsv"
from ldk.io import batch_export_to_tsv

batch_export_to_tsv(results_parallel, tsv_path)

print(f"‚úì Results exported to: {tsv_path}")
print(f"File size: {tsv_path.stat().st_size / 1024:.1f} KB")

‚úì Results exported to: /home/marvin/_tmp/batch_results/batch_regional_damage.csv
File size: 207.6 KB
‚úì Results exported to: /home/marvin/_tmp/batch_results/batch_regional_damage.tsv
File size: 207.6 KB


## 8. Load and Inspect CSV Results

Let's load the CSV back and inspect the structure for group analysis.

In [28]:
# Load CSV
df = pd.read_csv(csv_path)

print(f"Shape: {df.shape[0]} subjects √ó {df.shape[1]} columns")
print(f"\nColumns (first 20):")
for col in df.columns[:20]:
    print(f"  - {col}")
if len(df.columns) > 20:
    print(f"  ... and {len(df.columns) - 20} more columns")

print("\nFirst few rows:")
display(df.head())

Shape: 95 subjects √ó 403 columns

Columns (first 20):
  - subject_id
  - session_id
  - coordinate_space
  - RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_1
  - RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_2
  - RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_3
  - RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_4
  - RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_5
  - RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_6
  - RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_7
  - RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_8
  - RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_9
  - RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI1

Unnamed: 0,subject_id,session_id,coordinate_space,RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_1,RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_2,RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_3,RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_4,RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_5,RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_6,RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_LH_Vis_7,...,RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_RH_Default_PFCdPFCm_13,RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_RH_Default_pCunPCC_1,RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_RH_Default_pCunPCC_2,RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_RH_Default_pCunPCC_3,RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_RH_Default_pCunPCC_4,RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_RH_Default_pCunPCC_5,RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_RH_Default_pCunPCC_6,RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_RH_Default_pCunPCC_7,RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_RH_Default_pCunPCC_8,RegionalDamage.Schaefer2018_400Parcels_7Networks_order_FSLMNI152_1mm_7Networks_RH_Default_pCunPCC_9
0,sub-unknown,,native,9.188416,47.308612,5.252606,86.841275,74.463647,2.759109,37.491571,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,sub-unknown,,native,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,sub-unknown,,native,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,sub-unknown,,native,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,sub-unknown,,native,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 9. Performance Comparison Summary

In [None]:
import matplotlib.pyplot as plt

# Create performance comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Bar chart: Total time
modes = ["Sequential\n(n_jobs=1)", "Parallel\n(n_jobs=-1)"]
times = [sequential_time, parallel_time]
colors = ["#3498db", "#2ecc71"]

bars = ax1.bar(modes, times, color=colors)
ax1.set_ylabel("Total Time (seconds)")
ax1.set_title("Batch Processing Performance")
ax1.grid(axis="y", alpha=0.3)

# Add value labels on bars
for bar, time_val in zip(bars, times):
    height = bar.get_height()
    ax1.text(
        bar.get_x() + bar.get_width() / 2.0, height, f"{time_val:.2f}s", ha="center", va="bottom"
    )

# Speedup indicator
speedup = sequential_time / parallel_time
ax2.text(
    0.5,
    0.5,
    f"{speedup:.2f}x\nSpeedup",
    ha="center",
    va="center",
    fontsize=48,
    fontweight="bold",
    transform=ax2.transAxes,
    color="#2ecc71",
)
ax2.axis("off")
ax2.set_title("Parallel vs Sequential", pad=20)

plt.tight_layout()
plt.show()

# Print summary statistics
print("\n" + "=" * 60)
print("BATCH PROCESSING PERFORMANCE SUMMARY")
print("=" * 60)
print(f"Number of subjects: {len(lesions)}")
print(f"Successfully processed: {len(results_parallel)}")
print(f"CPU cores used: {n_cores}")
print(f"\nSequential time: {sequential_time:.2f}s ({sequential_time / len(lesions):.2f}s/subject)")
print(f"Parallel time: {parallel_time:.2f}s ({parallel_time / len(lesions):.2f}s/subject)")
print(f"\nüöÄ Speedup: {speedup:.2f}x")
print(
    f"‚ö° Time saved: {sequential_time - parallel_time:.2f}s ({100 * (1 - parallel_time / sequential_time):.1f}%)"
)
print("=" * 60)

## 10. Test with Different Number of Jobs

Let's test how performance scales with different numbers of parallel workers.

In [None]:
# Test different n_jobs values
n_jobs_values = [1, 2, 4, n_cores]
times_by_njobs = {}

print("Testing different parallelization levels...\n")

for n_jobs in n_jobs_values:
    print(f"Testing n_jobs={n_jobs}...")
    start = time.time()
    results = batch_process(
        lesion_data_list=lesions,  # ‚Üê Use binary lesions!
        analysis=analysis,
        n_jobs=n_jobs,
        show_progress=False,  # Disable progress bar for cleaner output
        backend="threading",  # Use threading backend for Jupyter
    )
    elapsed = time.time() - start
    times_by_njobs[n_jobs] = elapsed
    print(f"  Time: {elapsed:.2f}s\n")

# Plot scaling
fig, ax = plt.subplots(figsize=(10, 6))
jobs = list(times_by_njobs.keys())
times = list(times_by_njobs.values())

ax.plot(jobs, times, marker="o", linewidth=2, markersize=8)

## 11. Test Error Handling

Batch processing gracefully handles individual subject failures without stopping the entire batch.

**Note:** The code now explicitly sets `keep_masked_labels=False` in the `NiftiLabelsMasker` to:
- Suppress the nilearn deprecation warning
- Remove empty region signals from output (future nilearn default)
- Align with best practices for handling masked regions

In [None]:
# Create a mix of valid and invalid lesion data
print("Testing error handling with mixed valid/invalid data...\n")

# Add a mock invalid lesion (this should fail gracefully)
from unittest.mock import Mock

invalid_lesion = Mock(spec=LesionData)
invalid_lesion.metadata = {"subject_id": "INVALID_MOCK"}
invalid_lesion.lesion_img = Mock()
invalid_lesion.results = {}

# Mix valid and invalid (use binary lesions)
mixed_lesions = lesions[:2] + [invalid_lesion] + lesions[2:4]

print(f"Processing {len(mixed_lesions)} subjects (including 1 intentionally invalid)...\n")

# Process - should warn but continue
import warnings

with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")

    results_mixed = batch_process(
        lesion_data_list=mixed_lesions, analysis=analysis, n_jobs=1, show_progress=False
    )

    print(f"\n‚úì Batch processing completed despite errors")
    print(f"Successfully processed: {len(results_mixed)}/{len(mixed_lesions)} subjects")

    # Filter out nilearn deprecation warnings (expected and handled)
    relevant_warnings = [
        warning for warning in w if "keep_masked_labels" not in str(warning.message)
    ]
    print(f"Warnings issued: {len(relevant_warnings)}")

    if relevant_warnings:
        print("\nWarning messages:")
        for warning in relevant_warnings:
            print(f"  - {warning.message}")

Testing error handling with mixed valid/invalid data...

Processing 5 subjects (including 1 intentionally invalid)...


‚úì Batch processing completed despite errors
Successfully processed: 4/5 subjects

  - Applying "mask_img" before signal extraction may result in empty region signals in the output. These are currently kept. Starting from version 0.13, the default behavior will be changed to remove them by setting "keep_masked_labels=False". "keep_masked_labels" parameter will be removed in version 0.15.
  - Applying "mask_img" before signal extraction may result in empty region signals in the output. These are currently kept. Starting from version 0.13, the default behavior will be changed to remove them by setting "keep_masked_labels=False". "keep_masked_labels" parameter will be removed in version 0.15.
  - Analysis failed for subject INVALID_MOCK (index 2): RegionalDamage requires binary lesion mask (0 and 1 only).
Found values: [<Mock name='mock.lesion_img.get_fdata()' id='12854

## 12. Summary and Next Steps

### ‚úÖ Batch Processing Features Verified

1. **Parallel Processing**: Successfully utilized multiple CPU cores for speedup
2. **Progress Monitoring**: Real-time progress bars during processing
3. **Error Handling**: Individual failures don't stop the batch
4. **Batch Export**: Easy export to CSV/TSV for statistical analysis
5. **Scalability**: Performance scales well with number of workers

### üìä Use Cases

- **Group Analysis**: Process entire cohorts efficiently
- **Statistical Comparison**: Export to CSV for R/SPSS/Python analysis
- **Quality Control**: Batch process to identify outliers or errors
- **Reproducibility**: Same analysis across all subjects automatically

### üöÄ Next Steps

1. Try with your own lesion data
2. Experiment with different analyses (AtlasAggregation with different methods)
3. Chain multiple analyses in batch
4. Integrate with your statistical analysis workflow

## 13. Alternative: Standalone Script (Recommended for Production)

For production workflows or when encountering Jupyter pickling issues, use the standalone script which doesn't have multiprocessing limitations.

In [None]:
# The standalone script provides full parallel processing without Jupyter limitations
# Location: examples/batch_processing_example.py

print("=" * 70)
print("STANDALONE SCRIPT USAGE")
print("=" * 70)
print()
print("For production batch processing, use the standalone script:")
print()
print("Basic usage:")
print("  python examples/batch_processing_example.py \\")
print("    --lesion-dir /path/to/lesions \\")
print("    --output-dir /path/to/output")
print()
print("Advanced options:")
print("  python examples/batch_processing_example.py \\")
print("    --lesion-dir /path/to/lesions \\")
print("    --pattern 'sub-*_lesion.nii.gz' \\")
print("    --output-dir ~/results \\")
print("    --n-jobs -1 \\  # Use all CPU cores")
print("    --analysis regional \\")
print("    --limit 50  # Process only first 50 subjects")
print()
print("Example with your data:")
lesion_dir_example = "/media/moritz/Storage2/projects_marvin/202509_PSCI_DISCONNECTIVITY/data/raw/lesion_masks/acuteinfarct/"
print(f"  python examples/batch_processing_example.py \\")
print(f"    --lesion-dir {lesion_dir_example} \\")
print(f"    --output-dir ~/psci_batch_results \\")
print(f"    --n-jobs -1")
print()
print("=" * 70)
print()
print("üí° The standalone script:")
print("  ‚úì Works perfectly with parallel processing (no pickling issues)")
print("  ‚úì Provides timing comparisons (sequential vs parallel)")
print("  ‚úì Auto-exports results to CSV/TSV")
print("  ‚úì Better for production workflows")
print("  ‚úì Can be integrated into shell scripts and pipelines")
print("=" * 70)

### Run the Standalone Script from Jupyter

You can even run the standalone script from within this notebook using `!` command:

In [None]:
# Uncomment to run the standalone script with a subset of your data
# This will use true parallel processing without Jupyter's pickling limitations

# !python ../examples/batch_processing_example.py \
#     --lesion-dir {lesion_dir} \
#     --output-dir ~/batch_test_results \
#     --n-jobs -1 \
#     --limit 10

print("üí° Uncomment the lines above to test the standalone script")
print("   It will process 10 subjects with full parallel processing")