# 2. Compute DWPC for Dataset 2 (HPC Subset)

This notebook computes Degree-Weighted Path Counts (DWPC) for a subset of Dataset 2 files:
- 2016 real data (1 file)
- 2016 permutation 001 (1 file)
- 2016 random 001 (1 file)
- 2024 real data (1 file)
- 2024 permutation 001 (1 file)
- 2024 random 001 (1 file)

**Total: 6 datasets**

**Expected runtime:** 2-3 hours

**Outputs:**
- 6 DWPC result CSVs in `output/dwpc_com/dataset2/results/`
- 6 histograms in `output/dwpc_com/dataset2/histograms/`
- Intermediate parquet files in `output/dwpc_com/dataset2/metapaths_parquet/`

## Setup Hetionet Docker API

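Before running this notebook, make sure the Hetionet Docker API is running and reachable at `http://localhost:8015`, the address the setup cell assigns to `CONNECTIVITY_SEARCH_API`: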

```bash
cd connectivity-search-backend
sudo ./run_stack.sh
```
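Once the stack is up, a quick reachability check can save a failed run. The following is a minimal sketch, not part of `dwpc_api`: the helper name `api_is_reachable` and its `opener` parameter are hypothetical, and it assumes only that the API answers a plain GET with a 2xx status.

```python
import urllib.error
import urllib.request


def api_is_reachable(url: str, timeout: float = 5.0,
                     opener=urllib.request.urlopen) -> bool:
    """Return True if a GET to `url` answers with an HTTP 2xx status.

    `opener` is injectable (hypothetical parameter) so the check can be
    exercised without a live server.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status
        return False
```

In a notebook cell this could be called as `api_is_reachable("http://localhost:8015/v1/nodes/")`; the `/v1/nodes/` path is the one that appears in the health-check message printed during batch processing.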

In [None]:
import os
import sys
import time
import logging
import warnings
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt

# Endpoints of the local Docker stack started above
os.environ['CONNECTIVITY_SEARCH_API'] = 'http://localhost:8015'
os.environ['NEO4J_HOST'] = 'localhost'

# Make the repository's src/ directory importable so dwpc_api can be loaded
src_path = str(Path.cwd().parent / 'src')
if src_path not in sys.path:
    sys.path.insert(0, src_path)

from dwpc_api import run_metapaths_for_df, validate_parquet_files

QUIET_MODE = True

if QUIET_MODE:
    warnings.filterwarnings("ignore", category=FutureWarning, module=r"pandas\..*")
    warnings.filterwarnings("ignore", category=UserWarning, module=r"pyarrow\..*")
    warnings.filterwarnings("ignore", category=FutureWarning, module=r"httpx\..*")
    warnings.filterwarnings("ignore", category=FutureWarning, module=r"urllib3\..*")
    warnings.filterwarnings("ignore", category=DeprecationWarning, module=r"tqdm\..*")
    for name in ("httpx", "urllib3", "chardet"):
        logging.getLogger(name).setLevel(logging.WARNING)

print("Imports loaded successfully")
print("DWPC API functions imported from src/dwpc_api.py")

## Safety Check Configuration

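These flags and thresholds configure data quality checks that detect failures early and prevent wasted computation. The cells in this notebook reference only the final-validation settings directly (`ENABLE_FINAL_VALIDATION` and `FINAL_MIN_COMPLETION_RATE`, passed to `validate_parquet_files`).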

In [2]:
ENABLE_EARLY_VALIDATION = True
ENABLE_PERIODIC_MONITORING = True
ENABLE_FINAL_VALIDATION = True

EARLY_CHECK_THRESHOLD = 10
PERIODIC_CHECK_INTERVAL = 100
PERIODIC_FAILURE_THRESHOLD = 0.5
FINAL_MIN_COMPLETION_RATE = 0.2

print("Safety check configuration:")
print(f"  Early validation: {'ENABLED' if ENABLE_EARLY_VALIDATION else 'DISABLED'} (check first {EARLY_CHECK_THRESHOLD} pairs)")
print(f"  Periodic monitoring: {'ENABLED' if ENABLE_PERIODIC_MONITORING else 'DISABLED'} (every {PERIODIC_CHECK_INTERVAL} pairs)")
print(f"  Final validation: {'ENABLED' if ENABLE_FINAL_VALIDATION else 'DISABLED'} (min {FINAL_MIN_COMPLETION_RATE:.0%} completion)")

Safety check configuration:
  Early validation: ENABLED (check first 10 pairs)
  Periodic monitoring: ENABLED (every 100 pairs)
  Final validation: ENABLED (min 20% completion)
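How the early and periodic thresholds are applied is not shown in this notebook. The sketch below is one plausible reading, offered as an assumption only: the function `should_abort` is hypothetical, and the rules it encodes (abort when every one of the first pairs failed, or when the failure rate at a checkpoint exceeds the threshold) are illustrative rather than the behavior of `dwpc_api`.

```python
def should_abort(n_checked: int, n_failed: int,
                 early_check_threshold: int = 10,
                 periodic_check_interval: int = 100,
                 periodic_failure_threshold: float = 0.5) -> bool:
    """Decide whether to stop a run based on the failures seen so far.

    Hypothetical helper; defaults mirror the constants configured above.
    """
    if n_checked == 0:
        return False
    # Early validation (assumed rule): abort if all of the first N pairs failed
    if n_checked == early_check_threshold and n_failed == n_checked:
        return True
    # Periodic monitoring (assumed rule): at each interval, abort if the
    # cumulative failure rate exceeds the threshold
    if n_checked % periodic_check_interval == 0:
        return n_failed / n_checked > periodic_failure_threshold
    return False
```

Under these assumed rules, a run with 60 failures in the first 100 pairs would stop at the first periodic checkpoint, while one with exactly 50 failures would continue because the rate does not exceed 0.5.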


## Dataset Configuration

Define 6 datasets to process:
- 2 real datasets (2016 and 2024)
- 2 permutation datasets (perm_001 for each year)
- 2 random datasets (random_001 for each year)

In [3]:
datasets_to_process = [
    # 2016 Real
    {
        'name': 'dataset2_2016_real',
        'path': '../output/intermediate/hetio_bppg_dataset2_filtered.csv',
        'col_source': 'neo4j_source_id',
        'col_target': 'neo4j_target_id',
        'year': 2016,
        'type': 'real'
    },
    # 2016 Permutation 001
    {
        'name': 'dataset2_2016_perm_001',
        'path': '../output/permutations/dataset2_2016/perm_001.csv',
        'col_source': 'neo4j_source_id',
        'col_target': 'neo4j_target_id',
        'year': 2016,
        'type': 'permuted'
    },
    # 2016 Random 001
    {
        'name': 'dataset2_2016_random_001',
        'path': '../output/random_samples/dataset2_2016/random_001.csv',
        'col_source': 'neo4j_source_id',
        'col_target': 'neo4j_pseudo_target_id',
        'year': 2016,
        'type': 'random'
    },
    # 2024 Real
    {
        'name': 'dataset2_2024_real',
        'path': '../output/intermediate/hetio_bppg_dataset2_2024_filtered.csv',
        'col_source': 'neo4j_source_id',
        'col_target': 'neo4j_target_id',
        'year': 2024,
        'type': 'real'
    },
    # 2024 Permutation 001
    {
        'name': 'dataset2_2024_perm_001',
        'path': '../output/permutations/dataset2_2024/perm_001.csv',
        'col_source': 'neo4j_source_id',
        'col_target': 'neo4j_target_id',
        'year': 2024,
        'type': 'permuted'
    },
    # 2024 Random 001
    {
        'name': 'dataset2_2024_random_001',
        'path': '../output/random_samples/dataset2_2024/random_001.csv',
        'col_source': 'neo4j_source_id',
        'col_target': 'neo4j_pseudo_target_id',
        'year': 2024,
        'type': 'random'
    },
]

print(f"Configured {len(datasets_to_process)} datasets for processing")
print(f"  - 2016: 1 real + 1 permutation + 1 random = 3 datasets")
print(f"  - 2024: 1 real + 1 permutation + 1 random = 3 datasets")
print(f"  - Total: 6 datasets")

Configured 6 datasets for processing
  - 2016: 1 real + 1 permutation + 1 random = 3 datasets
  - 2024: 1 real + 1 permutation + 1 random = 3 datasets
  - Total: 6 datasets
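Because each entry names its own source and target columns (the random samples use `neo4j_pseudo_target_id`), a misconfigured entry would only surface once its turn comes in a multi-hour batch. A minimal pre-flight sketch follows; the helper `check_dataset_config` is hypothetical and reads only the CSV header using the standard library.

```python
import csv
from pathlib import Path


def check_dataset_config(dataset: dict) -> list[str]:
    """Return a list of problems for one dataset entry (empty if none).

    Hypothetical helper: checks that the file exists and that the configured
    source and target columns appear in the CSV header.
    """
    path = Path(dataset["path"])
    if not path.exists():
        return [f"{dataset['name']}: file not found: {path}"]
    with path.open(newline="") as fh:
        header = next(csv.reader(fh), [])  # empty file -> empty header
    problems = []
    for key in ("col_source", "col_target"):
        if dataset[key] not in header:
            problems.append(f"{dataset['name']}: missing column '{dataset[key]}'")
    return problems
```

Running it over the whole configuration, for example `[p for d in datasets_to_process for p in check_dataset_config(d)]`, yields an empty list when every file and column is in place.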


## Batch Processing Loop

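Process all 6 datasets sequentially. A dataset whose result CSV already exists and is non-empty is skipped (checkpoint). Otherwise the loop will:
1. Load GO-gene pairs from CSV
2. Compute DWPC for all unique pairs via async API calls, making up to 3 passes to fill in missing pairs
3. Validate the per-pair parquet files, then merge them and save the result to CSV
4. Generate a histogram of the positive DWPC values (if a `dwpc` column exists)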

Expected runtime: 2-3 hours
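The async step is driven by three settings passed to `run_metapaths_for_df`: `max_concurrency`, `retries`, and `backoff_first`. Its internals are not shown here, so the following is only a sketch of the general pattern those names suggest: a semaphore that bounds in-flight requests, plus retries with a growing delay. The helper names and the doubling of the delay are assumptions.

```python
import asyncio


async def call_with_retries(task, retries: int = 3, backoff_first: float = 1.0):
    """Await `task()` and retry on failure, doubling the delay each time.

    Hypothetical helper; the doubling schedule is an assumption.
    """
    delay = backoff_first
    for attempt in range(retries + 1):
        try:
            return await task()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            await asyncio.sleep(delay)
            delay *= 2


async def run_bounded(tasks, max_concurrency: int = 4, **retry_kwargs):
    """Run coroutine factories with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(task):
        async with semaphore:
            return await call_with_retries(task, **retry_kwargs)

    # gather preserves the order of `tasks` in its result list
    return await asyncio.gather(*(guarded(t) for t in tasks))
```

In a notebook cell the entry point would be awaited directly (`await run_bounded(...)`), as the batch loop does with `run_metapaths_for_df`; in a plain script it would be wrapped in `asyncio.run(...)`.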

In [4]:
# pandas, matplotlib, time, and Path are already imported in the setup cell above

Path('../output/dwpc_com/dataset2/results').mkdir(parents=True, exist_ok=True)
Path('../output/dwpc_com/dataset2/histograms').mkdir(parents=True, exist_ok=True)

batch_start_time = time.perf_counter()
batch_summary = []

print("="*80)
print(f"STARTING BATCH PROCESSING: {len(datasets_to_process)} datasets")
print("="*80)

try:
    for i, dataset in enumerate(datasets_to_process, 1):
        dataset_start = time.perf_counter()

        print(f"\n{'='*80}")
        print(f"Dataset {i}/{len(datasets_to_process)}: {dataset['name']}")
        print(f"  Year: {dataset['year']}, Type: {dataset['type']}")
        print(f"  Path: {dataset['path']}")
        print(f"{'='*80}\n")

        csv_path = f'../output/dwpc_com/dataset2/results/res_{dataset["name"]}.csv'

        if Path(csv_path).exists():
            try:
                existing_df = pd.read_csv(csv_path)
                if len(existing_df) > 0:
                    print(f"CHECKPOINT: Dataset already processed ({len(existing_df):,} rows)")
                    print(f"  Skipping to next dataset...\n")
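                    # Use the same keys as processed datasets so the summary
                    # table always has a 'completion_pct' column to filter on
                    batch_summary.append({
                        'dataset_number': i,
                        'dataset_name': dataset['name'],
                        'year': dataset['year'],
                        'type': dataset['type'],
                        'n_input_pairs': None,
                        'n_unique_pairs': None,
                        'n_completed_pairs': None,
                        'completion_pct': None,
                        'n_output_rows': len(existing_df),
                        'time_seconds': 0,
                        'status': 'skipped (already complete)'
                    })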
                    continue
            except Exception as e:
                print(f"Warning: Could not read existing CSV: {e}")
                print(f"  Will reprocess dataset...\n")

        df = pd.read_csv(dataset['path'])
        n_pairs = len(df)
        n_unique_pairs = len(df[[dataset['col_source'], dataset['col_target']]].dropna().drop_duplicates())
        print(f"Loaded {n_pairs:,} GO-gene pairs ({n_unique_pairs:,} unique)\n")

        max_attempts = 3
        for attempt in range(1, max_attempts + 1):
            if attempt > 1:
                print(f"\n{'='*60}")
                print(f"RETRY ATTEMPT {attempt}/{max_attempts}")
                print(f"{'='*60}\n")

            summary = await run_metapaths_for_df(
                df,
                col_source=dataset['col_source'],
                col_target=dataset['col_target'],
                base_out_dir=Path('../output/dwpc_com/dataset2/metapaths_parquet'),
                group=dataset['name'],
                clear_group=False,
                max_concurrency=120,
                retries=10,
                backoff_first=10.0,
            )

            parquet_dir = Path(f"../output/dwpc_com/dataset2/metapaths_parquet/{dataset['name']}")
            parquet_files = list(parquet_dir.glob("*.parquet"))

            if not parquet_files:
                raise FileNotFoundError(f"No parquet files found in {parquet_dir}")

            n_completed = len(parquet_files)
            completion_rate = (n_completed / n_unique_pairs) * 100

            print(f"\nCompletion check:")
            print(f"  Expected pairs: {n_unique_pairs:,}")
            print(f"  Completed pairs: {n_completed:,}")
            print(f"  Completion rate: {completion_rate:.2f}%")

            if n_completed == n_unique_pairs:
                print(f"  Status: COMPLETE")
                break
            else:
                missing = n_unique_pairs - n_completed
                print(f"  Status: INCOMPLETE ({missing:,} pairs missing)")
                
                if attempt < max_attempts:
                    print(f"  Will retry missing pairs (attempt {attempt + 1}/{max_attempts})...")
                else:
                    print(f"  WARNING: Maximum retry attempts reached")
                    print(f"  Proceeding with {n_completed:,} of {n_unique_pairs:,} pairs")

        parquet_files = sorted(parquet_dir.glob("*.parquet"))
        
        if ENABLE_FINAL_VALIDATION:
            try:
                n_valid, n_empty, completion_rate = validate_parquet_files(
                    parquet_dir, 
                    n_unique_pairs, 
                    min_completion_rate=FINAL_MIN_COMPLETION_RATE
                )
                print(f"\nFinal validation passed:")
                print(f"  Valid files: {n_valid}")
                print(f"  Empty files: {n_empty}")
                print(f"  Completion rate: {completion_rate:.1%}")
            except RuntimeError as e:
                # Report why validation failed instead of discarding the message
                print(f"\nFinal validation failed - skipping merge: {e}")
                batch_summary.append({
                    'dataset_number': i,
                    'dataset_name': dataset['name'],
                    'year': dataset['year'],
                    'type': dataset['type'],
                    'n_input_pairs': n_pairs,
                    'n_unique_pairs': n_unique_pairs,
                    'n_completed_pairs': len(parquet_files),
                    'completion_pct': (len(parquet_files) / n_unique_pairs) * 100,
                    'n_output_rows': 0,
                    'time_seconds': time.perf_counter() - dataset_start,
                    'status': 'validation failed'
                })
                continue
        
        print(f"\nMerging {len(parquet_files)} parquet files...")
        res_df = pd.concat(
            [pd.read_parquet(fp) for fp in parquet_files],
            ignore_index=True
        )
        print(f"  Merged {len(res_df):,} rows")

        res_df.to_csv(csv_path, index=False)
        print(f"  Saved: {csv_path}")
        
        if 'dwpc' in res_df.columns:
            dwpc_vals = res_df.loc[res_df["dwpc"] > 0, "dwpc"].dropna()
            
            if len(dwpc_vals) > 0:
                dwpc_mean = dwpc_vals.mean()
                
                plt.figure(figsize=(6, 4))
                plt.hist(dwpc_vals, bins=50, density=False, edgecolor="black", linewidth=0.5)
                plt.axvline(dwpc_mean, color="red", linestyle="--", linewidth=1.2, 
                           label=f"Mean = {dwpc_mean:.2f}")
                plt.title(f"DWPC Distribution: {dataset['name']}", fontsize=12, weight="bold")
                plt.xlabel("DWPC", fontsize=11)
                plt.ylabel("Count", fontsize=11)
                plt.legend(fontsize=10)
                plt.tick_params(axis="both", labelsize=10)
                plt.grid(axis="y", linestyle="--", linewidth=0.5, alpha=0.7)
                plt.tight_layout()
                
                hist_path = f"../output/dwpc_com/dataset2/histograms/hist_{dataset['name']}.png"
                plt.savefig(hist_path, dpi=300, bbox_inches="tight")
                plt.close()
                print(f"  Saved histogram: {hist_path}")
            else:
                print(f"  Skipping histogram: no positive DWPC values")
        else:
            print(f"  Skipping histogram: 'dwpc' column not found")
        
        dataset_time = time.perf_counter() - dataset_start
        
        final_completion = (len(parquet_files) / n_unique_pairs) * 100
        # Compare integer counts rather than testing a float for equality
        status = 'completed' if len(parquet_files) == n_unique_pairs else f'incomplete ({final_completion:.1f}%)'
        
        batch_summary.append({
            'dataset_number': i,
            'dataset_name': dataset['name'],
            'year': dataset['year'],
            'type': dataset['type'],
            'n_input_pairs': n_pairs,
            'n_unique_pairs': n_unique_pairs,
            'n_completed_pairs': len(parquet_files),
            'completion_pct': final_completion,
            'n_output_rows': len(res_df),
            'time_seconds': dataset_time,
            'status': status
        })
        
        print(f"\nCompleted {dataset['name']} in {dataset_time/60:.1f} minutes")
        
        elapsed = time.perf_counter() - batch_start_time
        avg_time_per_dataset = elapsed / i
        remaining = (len(datasets_to_process) - i) * avg_time_per_dataset
        print(f"  Progress: {i}/{len(datasets_to_process)} datasets")
        print(f"  Elapsed: {elapsed/3600:.2f} hours")
        print(f"  Estimated remaining: {remaining/3600:.2f} hours")
    
    batch_time = time.perf_counter() - batch_start_time
    print(f"\n{'='*80}")
    print(f"BATCH PROCESSING COMPLETE")
    print(f"  Total time: {batch_time/3600:.2f} hours")
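    # Distinguish datasets computed in this run from those skipped via checkpoint
    n_skipped = sum(1 for s in batch_summary if s['status'].startswith('skipped'))
    print(f"  Datasets processed: {len(batch_summary) - n_skipped} ({n_skipped} skipped via checkpoint)")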
    print(f"  All results saved to: output/dwpc_com/dataset2/results/")
    print(f"{'='*80}")

except Exception as e:
    print(f"\n{'='*80}")
    print(f"ERROR: Batch processing failed at dataset {i}: {dataset['name']}")
    print(f"Error message: {str(e)}")
    print(f"{'='*80}")
    
    batch_summary.append({
        'dataset_number': i,
        'dataset_name': dataset['name'],
        'year': dataset['year'],
        'type': dataset['type'],
        'n_input_pairs': None,
        'n_unique_pairs': None,
        'n_completed_pairs': None,
        'completion_pct': None,
        'n_output_rows': None,
        'time_seconds': None,
        'status': 'failed'
    })
    
    raise

batch_summary_df = pd.DataFrame(batch_summary)
print("\nBatch Processing Summary:")
print(batch_summary_df.to_string(index=False))

incomplete_datasets = batch_summary_df[
    (batch_summary_df['status'] != 'skipped (already complete)') & 
    (batch_summary_df['completion_pct'] < 100.0)
]

if len(incomplete_datasets) > 0:
    print("\n" + "="*80)
    print("WARNING: Some datasets are incomplete!")
    print("="*80)
    for _, row in incomplete_datasets.iterrows():
        print(f"  {row['dataset_name']}: {row['completion_pct']:.1f}% complete "
              f"({row['n_completed_pairs']}/{row['n_unique_pairs']} pairs)")
    print("\nRecommendation: Rerun this notebook to fill gaps automatically.")

STARTING BATCH PROCESSING: 6 datasets

Dataset 1/6: dataset2_2016_real
  Year: 2016, Type: real
  Path: ../output/intermediate/hetio_bppg_dataset2_filtered.csv



CHECKPOINT: Dataset already processed (1,521,268 rows)
  Skipping to next dataset...


Dataset 2/6: dataset2_2016_perm_001
  Year: 2016, Type: permuted
  Path: ../output/permutations/dataset2_2016/perm_001.csv



CHECKPOINT: Dataset already processed (1,543,256 rows)
  Skipping to next dataset...


Dataset 3/6: dataset2_2016_random_001
  Year: 2016, Type: random
  Path: ../output/random_samples/dataset2_2016/random_001.csv



CHECKPOINT: Dataset already processed (1,524,065 rows)
  Skipping to next dataset...


Dataset 4/6: dataset2_2024_real
  Year: 2024, Type: real
  Path: ../output/intermediate/hetio_bppg_dataset2_2024_filtered.csv



CHECKPOINT: Dataset already processed (1,022,372 rows)
  Skipping to next dataset...


Dataset 5/6: dataset2_2024_perm_001
  Year: 2024, Type: permuted
  Path: ../output/permutations/dataset2_2024/perm_001.csv

Loaded 19,661 GO-gene pairs (19,661 unique)

Checking Neo4j health at http://localhost:8015/v1/nodes/...


  Neo4j is healthy!


CHECKPOINT: Found 15276 existing pair results
  Skipping 15276 completed pairs
  Processing 4385 remaining pairs
Connection pooling configuration:
  Max connections: 250
  Keepalive connections: 200
  Global concurrency: 120
  Timeout: 90.0s



Processing [dataset2_2024_perm_001]:   0%|                                                                                      | 0/4385 [00:00<?, ?task/s]




Processing [dataset2_2024_perm_001]: 100%|███████████████████████████████████████████████████████████████████████████| 4385/4385 [50:53<00:00,  1.44task/s]




## Validation

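Verify that each of the 6 datasets has a non-empty result CSV, and report whether its histogram exists. A dataset passes on the CSV check alone; a missing histogram is reported but does not fail validation.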
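The result CSVs can run to more than a million rows, so loading each one into a DataFrame just to count rows is relatively expensive. As an alternative, here is a minimal sketch of a streaming row count using only the standard library; the helper `count_csv_rows` is hypothetical and is not used by the validation cell.

```python
import csv


def count_csv_rows(path) -> int:
    """Count data rows in a CSV without loading it into memory.

    Hypothetical helper: skips the header line and ignores blank lines.
    """
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        if next(reader, None) is None:  # empty file: no header, no rows
            return 0
        return sum(1 for row in reader if row)
```

Using `csv.reader` rather than counting raw lines keeps the count correct when a quoted field contains a newline.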

In [5]:
results_dir = Path('../output/dwpc_com/dataset2/results')
histograms_dir = Path('../output/dwpc_com/dataset2/histograms')

validation_results = []

print("="*80)
print("VALIDATION RESULTS")
print("="*80)
print(f"\n{'Dataset Name':<30} {'CSV Exists':<12} {'Rows':<12} {'Histogram':<12} {'Status':<10}")
print("-"*80)

all_passed = True

for dataset in datasets_to_process:
    name = dataset['name']
    csv_path = results_dir / f"res_{name}.csv"
    hist_path = histograms_dir / f"hist_{name}.png"
    
    csv_exists = csv_path.exists()
    hist_exists = hist_path.exists()
    
    if csv_exists:
        try:
            df = pd.read_csv(csv_path)
            n_rows = len(df)
            status = "PASS" if n_rows > 0 else "FAIL"
        except Exception as e:
            n_rows = "ERROR"
            status = "FAIL"
    else:
        n_rows = "N/A"
        status = "FAIL"
    
    csv_status = "YES" if csv_exists else "NO"
    hist_status = "YES" if hist_exists else "NO"
    
    print(f"{name:<30} {csv_status:<12} {str(n_rows):<12} {hist_status:<12} {status:<10}")
    
    validation_results.append({
        'dataset': name,
        'csv_exists': csv_exists,
        # Store 0 when the row count is a placeholder string ("N/A" or "ERROR")
        'n_rows': n_rows if isinstance(n_rows, int) else 0,
        'histogram_exists': hist_exists,
        'status': status
    })
    
    if status != "PASS":
        all_passed = False

print("-"*80)
if all_passed:
    print(f"\nAll {len(datasets_to_process)} datasets processed successfully!")
    print(f"  Results: {results_dir}")
    print(f"  Histograms: {histograms_dir}")
else:
    print("\nSome datasets failed validation. Check output files.")

validation_df = pd.DataFrame(validation_results)
print(f"\nSummary:")
print(f"  Total datasets: {len(datasets_to_process)}")
print(f"  Passed: {sum(1 for v in validation_results if v['status'] == 'PASS')}")
print(f"  Failed: {sum(1 for v in validation_results if v['status'] == 'FAIL')}")