# Input Data Inspector Demo

This notebook demonstrates the `intccms.metrics.inspector` module for characterizing input ROOT files.

The inspector allows you to:
- Extract metadata from ROOT files (events, file sizes, branch sizes, compression ratios)
- Run distributed inspection using Dask
- Aggregate statistics across datasets
- Create visualizations of input data characteristics

**Works directly with DatasetManager**

In [1]:
import copy
import sys
from pathlib import Path

# Add the src and example_cms directories to the Python path
# (assumes the notebook is run from the repository root)
repo_root = Path.cwd()
src_dir = repo_root / "src"
examples_dir = repo_root / "example_cms"
if str(src_dir) not in sys.path:
    sys.path.insert(0, str(src_dir))
if str(examples_dir) not in sys.path:
    sys.path.insert(0, str(examples_dir))
print(f"✅ Added {src_dir} to Python path")
print(f"✅ Added {examples_dir} to Python path")


from dask.distributed import Client, PipInstall
import matplotlib.pyplot as plt

import cloudpickle
import intccms
import example_cms

# Register modules for cloud pickle
cloudpickle.register_pickle_by_value(intccms)
cloudpickle.register_pickle_by_value(example_cms)

from intccms.datasets import DatasetManager
from intccms.metrics.inspector import (
    extract_files_from_dataset_manager,
    get_dataset_file_counts,
    inspect_dataset_distributed,
    aggregate_statistics,
    group_by_dataset,
    compute_dataset_statistics,
    compute_compression_stats,
    format_error_summary,
    plot,
)

✅ Added /home/cms-jovyan/intccms-agc-demo-10/src to Python path
✅ Added /home/cms-jovyan/intccms-agc-demo-10/example_cms to Python path


In [2]:
def acquire_client():
    """Connect to an already running Dask scheduler; no local cluster is created."""
    client = Client("tls://localhost:8786")
    cluster = None  # no local cluster in this mode
    return client, cluster

## Step 1: Load Dataset Configuration

Load the dataset configuration from `example_cms/configs/configuration.py`.
This is the same configuration used by the processing workflow.

In [3]:
from example_cms.configs.configuration import config as original_config
from intccms.schema import Config, load_config_with_restricted_cli
# Configuration setup
config = copy.deepcopy(original_config)

cli_args = []
full_config = load_config_with_restricted_cli(config, cli_args)
validated_config = Config(**full_config)

# Create DatasetManager
dm = DatasetManager(validated_config.datasets)

print("Available datasets:")
for name in dm.datasets.keys():
    print(f"  - {name}")

Available datasets:
  - signal
  - ttbar_semilep
  - ttbar_had
  - ttbar_lep
  - wjets
  - dyjets
  - single_top
  - qcd
  - diboson
  - data


## Step 2: Quick File Count Summary

Get a quick count of files per dataset without full inspection.

In [4]:
file_counts = get_dataset_file_counts(dm)

print("\nFile counts per dataset:")
for dataset, count in file_counts.items():
    print(f"  {dataset}: {count} files")


File counts per dataset:
  signal: 26 files
  ttbar_semilep: 826 files
  ttbar_had: 684 files
  ttbar_lep: 303 files
  wjets: 7989 files
  dyjets: 529 files
  single_top: 792 files
  qcd: 1150 files
  diboson: 165 files
  data: 1057 files


## Step 3: Extract Files for Inspection

Extract file paths from DatasetManager. You can:
- Inspect all datasets
- Inspect specific processes
- Limit files per process (useful for quick sampling)
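The selection options above amount to filtering a process-to-files mapping and truncating each list. The following is a minimal sketch under that assumption; `select_files`, its input dict, and the `filepath -> process` shape of `dataset_map` are hypothetical illustrations, not the `intccms` implementation.

```python
def select_files(files_by_process, processes=None, max_files_per_process=None):
    """Return (file_list, dataset_map) for the selected processes.

    Hypothetical helper for illustration only; the real inspector may
    structure its inputs and outputs differently.
    """
    file_list = []
    dataset_map = {}
    for process, files in files_by_process.items():
        if processes is not None and process not in processes:
            continue
        # Slicing with None keeps every file; with an int it keeps the first N.
        for path in files[:max_files_per_process]:
            file_list.append(path)
            dataset_map[path] = process
    return file_list, dataset_map


# Made-up file names, not real dataset contents.
example = {
    "signal": ["s1.root", "s2.root", "s3.root"],
    "wjets": ["w1.root", "w2.root"],
}
sampled_files, sampled_map = select_files(
    example, processes=["signal"], max_files_per_process=2
)
print(sampled_files)  # ['s1.root', 's2.root']
```

Sampling a few files per process first is a cheap way to check that the files are reachable before launching a full inspection.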

In [5]:
# Option A: Sample first 5 files per dataset for quick testing
# file_list, dataset_map = extract_files_from_dataset_manager(
#     dm,
#     max_files_per_process=5,
# )

# Option B: Inspect specific datasets
# file_list, dataset_map = extract_files_from_dataset_manager(
#     dm,
#     processes=["signal", "ttbar_semilep"],
#     max_files_per_process=10,
# )

# Option C: Inspect all files (can be slow for large datasets!)
file_list, dataset_map = extract_files_from_dataset_manager(dm)

print(f"\nExtracted {len(file_list)} files for inspection")
print(f"Example file: {file_list[0]}")


Extracted 13521 files for inspection
Example file: root://xcache//store/mc/RunIISummer20UL16NanoAODv9/ZPrimeToTT_M2000_W200_TuneCP2_13TeV-madgraph-pythia8/NANOAODSIM/106X_mcRun2_asymptotic_v17-v2/270000/7A83E108-8CB7-5647-8848-C2C48F430E09.root


## Step 4: Distributed Inspection with Dask

Run distributed file inspection using Dask.
This extracts metadata from all files in parallel.
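The pattern behind this step is to run one inspection task per file and collect successes and failures separately, so a single unreadable file does not abort the run. The sketch below uses the standard library's `ThreadPoolExecutor` as a stand-in for Dask; `inspect_many` and `fake_inspect` are hypothetical names, and the error record fields are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def inspect_many(paths, inspect_one, max_workers=4):
    """Run inspect_one over paths in parallel; return (results, errors)."""
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(inspect_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:
                # Record the failure and keep going with the remaining files.
                errors.append(
                    {
                        "filepath": path,
                        "error_type": type(exc).__name__,
                        "message": str(exc),
                    }
                )
    return results, errors


def fake_inspect(path):
    """Stand-in for reading a ROOT file; fails for paths ending in '.bad'."""
    if path.endswith(".bad"):
        raise OSError(f"cannot open {path}")
    return {"filepath": path, "num_events": 100}


ok, failed = inspect_many(["a.root", "b.bad", "c.root"], fake_inspect)
print(len(ok), len(failed))  # 2 1
```

Per-file error capture only helps with failures inside a task; a cluster-level failure can still cancel every future at once.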

In [6]:
client = None
try:
    # Connect to the existing Dask scheduler (see acquire_client)
    client, _ = acquire_client()
    print(f"Dask dashboard: {client.dashboard_link}")
    # Run distributed inspection
    # Note: max_branches limits the number of branches inspected per file for faster results
    results, errors = inspect_dataset_distributed(
        client,
        file_list,
        max_branches=500,  # Limit to 500 branches per file for speed
    )

    # Print formatted error summary
    print(format_error_summary(errors))

    if results:
        print("\n=== Example Result ===")
        print(f"  File: {results[0]['filepath']}")
        print(f"  Events: {results[0]['num_events']:,}")
        print(f"  Branches: {results[0]['num_branches']}")
    else:
        print("\nNo files were successfully inspected!")

finally:
    # Close the client only if the connection was established
    if client is not None:
        client.close()
    

Dask dashboard: /user/mohamed.aly@cern.ch/proxy/8787/status




=== Inspection Summary ===
Total files:   13521
Successful:       0
Failed:        13521
Success rate:   0.0%

=== Failures ===
File       Error Type            Error Message                                               
─────────────────────────────────────────────────────────────────────────────────────────────
ALL FILES  FutureCancelledError  Dask compute failed: ('lambda-8876cedc64ba6ede4d693ada671...

No files were successfully inspected!


## Step 5: Aggregate Statistics

Compute aggregate statistics across all inspected files.
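As a rough illustration of what such an aggregation involves, the sketch below summarizes events per file and returns zeros when no file was inspected. It assumes each result is a dict with a `num_events` key, as in the example result printed earlier; `summarize_events` and the nearest-rank definition of P90 are assumptions, not necessarily what `aggregate_statistics` does.

```python
import statistics


def summarize_events(results):
    """Summarize events per file; return zeros when nothing was inspected."""
    counts = sorted(r["num_events"] for r in results)
    if not counts:
        return {
            "total_files": 0,
            "total_events": 0,
            "avg": 0.0,
            "median": 0.0,
            "p90": 0.0,
        }
    # Nearest-rank P90: the smallest value with at least 90% of files at or
    # below it. Integer arithmetic avoids floating-point rounding in the rank.
    rank = (9 * len(counts) + 9) // 10
    return {
        "total_files": len(counts),
        "total_events": sum(counts),
        "avg": statistics.fmean(counts),
        "median": statistics.median(counts),
        "p90": counts[rank - 1],
    }


# Made-up event counts, not real inspection output.
sample = [{"num_events": n} for n in (10, 20, 30, 40, 100)]
summary = summarize_events(sample)
empty_summary = summarize_events([])
print(summary["avg"], summary["median"], summary["p90"])  # 40.0 30 100
```

Handling the empty case explicitly matters here, because reductions such as `max()` or a mean over an empty list otherwise raise errors or emit warnings.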

### Optional: Fetch File Sizes from Rucio

Use the inspector's Rucio helper to retrieve authoritative file sizes. This
requires a valid Rucio environment (credentials and network access). If the
lookup fails, the notebook continues with locally derived statistics.

In [7]:
from intccms.metrics.inspector import rucio as inspector_rucio
from rich.console import Console

size_summary = None
try:
    size_summary = inspector_rucio.fetch_file_sizes(
        dm,
        processes=["signal", "ttbar_semilep"],
        max_files_per_process=5,
    )
    console = Console(force_jupyter=False)
    console.print(inspector_rucio.format_dataset_size_table(size_summary))
except Exception as exc:
    print("Skipping Rucio size lookup (set size_summary=None):", exc)
    size_summary = None

Wrong Rucio configuration, impossible to create client
Skipping Rucio size lookup (set size_summary=None): VOMS proxy expirend or non-existing: please run `voms-proxy-init -voms cms -rfc --valid 168:0`



Couldn't find a valid proxy.



In [8]:
from rich.console import Console
from intccms.metrics.inspector import (
    format_overall_stats_table,
    format_branch_stats_table,
    format_dataset_stats_table,
    format_compression_stats_table,
)

# Create console
console = Console(force_jupyter=False)

# Aggregate and display statistics
# (size_summary is None when the Rucio lookup above was skipped)
stats = aggregate_statistics(results, size_summary=size_summary)
table = format_overall_stats_table(stats)
console.print(table)

         Overall Statistics         
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Metric                 ┃   Value ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ Total Files            │       0 │
│ Total Events           │       0 │
│ Total Size             │ 0.00 GB │
├────────────────────────┼─────────┤
│ Avg Events/File        │       0 │
│ Median Events/File     │       0 │
│ P90 Events/File        │       0 │
├────────────────────────┼─────────┤
│ Avg File Size          │ 0.00 GB │
│ Median File Size       │

## Step 6: Branch statistics

Analyze branch size and compression distributions.

In [9]:
from intccms.metrics.inspector.aggregator import compute_branch_statistics

branch_stats = compute_branch_statistics(results)

table = format_branch_stats_table(branch_stats)
console.print(table)

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


ValueError: max() iterable argument is empty

## Step 7: Per-Dataset Statistics

Group results by dataset and compute per-dataset statistics.
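Grouping can be pictured as bucketing each result by the dataset its file belongs to, then reducing each bucket. This is a minimal sketch, assuming `dataset_map` maps a file path to a dataset name and each result carries `filepath` and `num_events`; `group_results` and `events_per_dataset` are hypothetical helpers, not the inspector's functions.

```python
from collections import defaultdict


def group_results(results, dataset_map):
    """Bucket result records by dataset name (assumed filepath -> dataset map)."""
    grouped = defaultdict(list)
    for record in results:
        # Files missing from the map are kept under an explicit "unknown" key.
        dataset = dataset_map.get(record["filepath"], "unknown")
        grouped[dataset].append(record)
    return dict(grouped)


def events_per_dataset(grouped):
    """Compute file and event totals for each dataset."""
    return {
        name: {
            "files": len(records),
            "events": sum(r["num_events"] for r in records),
        }
        for name, records in grouped.items()
    }


# Made-up records, not real inspection output.
demo_results = [
    {"filepath": "s1.root", "num_events": 100},
    {"filepath": "s2.root", "num_events": 50},
    {"filepath": "w1.root", "num_events": 10},
]
demo_map = {"s1.root": "signal", "s2.root": "signal", "w1.root": "wjets"}
per_dataset = events_per_dataset(group_results(demo_results, demo_map))
print(per_dataset["signal"])  # {'files': 2, 'events': 150}
```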

In [None]:
# Group by dataset
grouped = group_by_dataset(results, dataset_map)

# Compute per-dataset statistics
dataset_stats = compute_dataset_statistics(grouped)

# Display as rich table
table = format_dataset_stats_table(dataset_stats)
console.print(table)

## Step 8: Compression Statistics

Analyze compression ratios across all files.
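A compression ratio is commonly defined as uncompressed bytes divided by compressed bytes, so larger values mean better compression. The sketch below uses that definition; the record keys `uncompressed_bytes` and `compressed_bytes` are hypothetical and may not match the fields the inspector returns.

```python
def compression_ratio(uncompressed_bytes, compressed_bytes):
    """Return uncompressed / compressed, or None when the size is unknown."""
    if compressed_bytes <= 0:
        return None
    return uncompressed_bytes / compressed_bytes


def overall_compression(records):
    """Size-weighted ratio: total uncompressed bytes over total compressed bytes."""
    total_uncompressed = sum(r["uncompressed_bytes"] for r in records)
    total_compressed = sum(r["compressed_bytes"] for r in records)
    return compression_ratio(total_uncompressed, total_compressed)


# Made-up sizes, not real inspection output.
size_records = [
    {"uncompressed_bytes": 400, "compressed_bytes": 100},
    {"uncompressed_bytes": 3000, "compressed_bytes": 1500},
]
per_file = [
    compression_ratio(r["uncompressed_bytes"], r["compressed_bytes"])
    for r in size_records
]
print(per_file)  # [4.0, 2.0]
print(overall_compression(size_records))  # 2.125
```

The overall value (2.125) differs from the plain mean of the per-file ratios (3.0) because large files dominate the totals, so it is worth stating which of the two a table reports.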

In [None]:
compression_stats = compute_compression_stats(results)
# Display as rich table
table = format_compression_stats_table(compression_stats)
console.print(table)

## Step 9: Visualizations

Create plots to visualize the inspection results.

### Events per file ratio by dataset

In [None]:
fig, ax = plot.plot_events_per_file_by_dataset(dataset_stats)
plt.show()

### Event Distribution

In [None]:
fig, ax = plot.plot_event_distribution(results)
plt.show()

### Dataset Comparison

In [None]:
fig, (ax1, ax2) = plot.plot_dataset_comparison(dataset_stats)
plt.show()

### Understanding Box Plots

The box plots in this analysis show statistical distributions:

- **Box**: Contains the middle 50% of data (interquartile range, IQR)
- **Line inside box**: Median value (50th percentile)
- **Whiskers**: Extend to the 5th and 95th percentiles
- **Points beyond whiskers**: Outliers outside the 5th-95th percentile range

This visualization helps identify data skewness, outliers, and distribution characteristics.
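The landmarks described above can be computed directly, which is useful for checking a plot against the numbers. This is a minimal sketch using the standard library's `statistics.quantiles`; `box_summary` is a hypothetical helper, and the linear interpolation it uses may differ from what the plotting code applies.

```python
import statistics


def box_summary(values):
    """Return box-plot landmarks with whiskers at the 5th and 95th percentiles."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    # n=20 yields cut points at 5%, 10%, ..., 95%; method="inclusive"
    # interpolates linearly between the sorted data points.
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    low, q1, median, q3, high = cuts[0], cuts[4], cuts[9], cuts[14], cuts[18]
    return {
        "whisker_low": low,
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": high,
        "iqr": q3 - q1,
        # Points beyond the whiskers are the ones drawn individually.
        "outliers": sorted(v for v in values if v < low or v > high),
    }


box = box_summary(list(range(1, 102)))  # the integers 1..101
print(box["q1"], box["median"], box["q3"])  # 26.0 51.0 76.0
```

With whiskers fixed at percentiles, about 10% of the points fall outside them by construction, so the drawn outliers indicate the tails rather than anomalies.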


### Branch Size Distribution

In [None]:
fig, ax = plot.plot_branch_size_distribution(results)
plt.show()

### Branch Compression Distribution

In [None]:
fig, ax = plot.plot_branch_compression_distribution(results)
if fig is not None:
    plt.show()
else:
    print("No compression data available")

### Branch Distributions by Dataset

In [None]:
fig, (ax1, ax2) = plot.plot_branch_distributions_by_dataset(results, dataset_map)
plt.show()

### File Size Distribution

In [None]:
fig, ax = plot.plot_file_size_distribution(results)
if fig is not None:
    plt.show()
else:
    print("No file size data available (all files are remote)")

### Summary Dashboard

Create a comprehensive dashboard with all key plots.

In [None]:
fig = plot.plot_summary_dashboard(results, dataset_stats, dataset_map)
plt.show()

## Step 10: Save Results

You can save plots to files and export statistics to JSON.
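Statistics dictionaries often contain values that the `json` module cannot encode by default, such as `Path` objects or NumPy scalars. The sketch below shows one way to handle that with a `default` hook; `to_jsonable` and `export_stats` are hypothetical helpers, and whether the inspector's statistics contain such values is an assumption.

```python
import json
import tempfile
from pathlib import Path


def to_jsonable(value):
    """Fallback for json.dump: convert values the encoder cannot handle."""
    if isinstance(value, Path):
        return str(value)
    # NumPy scalars (and similar objects) expose .item(), which returns a
    # plain Python scalar.
    if hasattr(value, "item"):
        return value.item()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def export_stats(output, path):
    """Write output as indented JSON and return the path written."""
    path = Path(path)
    with path.open("w") as f:
        json.dump(output, f, indent=2, default=to_jsonable)
    return path


# Made-up statistics, not real inspection output.
example_output = {"overall_stats": {"total_files": 3, "source": Path("input.root")}}
with tempfile.TemporaryDirectory() as tmp:
    written = export_stats(example_output, Path(tmp) / "inspection_results.json")
    loaded = json.loads(written.read_text())
print(loaded["overall_stats"])  # {'total_files': 3, 'source': 'input.root'}
```

Raising `TypeError` for anything unrecognized keeps unexpected objects from being silently written as strings.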

In [None]:
import json

# Save summary dashboard (same arguments as the dashboard cell above)
# fig = plot.plot_summary_dashboard(
#     results, dataset_stats, dataset_map,
#     save_path="input_summary.png"
# )

# Export statistics to JSON
# NOTE: top_branches is not defined in this notebook; compute it before
# uncommenting, or drop the "top_branches" entry.
# output = {
#     "overall_stats": stats,
#     "dataset_stats": dataset_stats,
#     "compression_stats": compression_stats,
#     "top_branches": [
#         {"name": name, "size_bytes": size, "compression_ratio": ratio}
#         for name, size, ratio in top_branches
#     ],
# }
# 
# with open("inspection_results.json", "w") as f:
#     json.dump(output, f, indent=2)

print("Done! You can save plots and export statistics as needed.")