# Input Data Inspector Demo

This notebook demonstrates the `intccms.metrics.inspector` module for characterizing input ROOT files.

The inspector allows you to:
- Extract metadata from ROOT files (events, file sizes, branch sizes, compression ratios)
- Run distributed inspection using Dask
- Aggregate statistics across datasets
- Create visualizations of input data characteristics

**Key feature**: The inspector works directly with `DatasetManager`, so no metadata preprocessing is required.

In [None]:
import sys
from pathlib import Path

# Add project root to path
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

from dask.distributed import Client, LocalCluster
import matplotlib.pyplot as plt

from intccms.datasets import DatasetManager
from intccms.metrics.inspector import (
    extract_files_from_dataset_manager,
    get_dataset_file_counts,
    inspect_dataset_distributed,
    aggregate_statistics,
    group_by_dataset,
    compute_dataset_statistics,
    compute_compression_stats,
    format_error_summary,
    plot,
)

## Step 1: Load Dataset Configuration

Load the dataset configuration from `example_cms/configs/skim.py`.
This uses the same configuration as your processing workflow.

In [None]:
# Load dataset configuration
from example_cms.configs.skim import datasets_config

# Create DatasetManager
dm = DatasetManager(datasets_config)

print("Available datasets:")
for name in dm.datasets.keys():
    print(f"  - {name}")

## Step 2: Quick File Count Summary

Get a quick count of files per dataset without full inspection.

In [None]:
file_counts = get_dataset_file_counts(dm)

print("\nFile counts per dataset:")
for dataset, count in file_counts.items():
    print(f"  {dataset}: {count} files")

## Step 3: Extract Files for Inspection

Extract file paths from DatasetManager. You can:
- Inspect all datasets
- Inspect specific processes
- Limit files per process (useful for quick sampling)
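
The three options above amount to filtering a mapping of process names to file lists and capping each list. Here is a minimal sketch of that logic, assuming a plain dict in place of `DatasetManager`. The `select_files` helper, the file names, and the file-to-process layout of the returned map are hypothetical and are not the inspector's API.

```python
from typing import Optional


def select_files(
    files_by_process: dict[str, list[str]],
    processes: Optional[list[str]] = None,
    max_files_per_process: Optional[int] = None,
) -> tuple[list[str], dict[str, str]]:
    """Return a flat file list and a file -> process map (hypothetical helper)."""
    file_list: list[str] = []
    dataset_map: dict[str, str] = {}
    for process, files in files_by_process.items():
        # Skip processes that were not requested (None means "all").
        if processes is not None and process not in processes:
            continue
        # Slicing with None keeps every file; an int keeps the first N.
        for path in files[:max_files_per_process]:
            file_list.append(path)
            dataset_map[path] = process
    return file_list, dataset_map


# Made-up catalogue for illustration only.
catalogue = {
    "signal": ["sig_0.root", "sig_1.root", "sig_2.root"],
    "ttbar_semilep": ["tt_0.root", "tt_1.root"],
    "wjets": ["w_0.root"],
}

sampled_files, sampled_map = select_files(
    catalogue, processes=["signal", "ttbar_semilep"], max_files_per_process=2
)
print(sampled_files)
```

Capping per process rather than globally keeps small samples from being dominated by whichever process happens to be listed first.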

In [None]:
# Option A: Sample the first 5 files per process for quick testing
file_list, dataset_map = extract_files_from_dataset_manager(
    dm,
    max_files_per_process=5,
)

# Option B: Inspect specific processes
# file_list, dataset_map = extract_files_from_dataset_manager(
#     dm,
#     processes=["signal", "ttbar_semilep"],
#     max_files_per_process=10,
# )

# Option C: Inspect all files (can be slow for large datasets!)
# file_list, dataset_map = extract_files_from_dataset_manager(dm)

print(f"\nExtracted {len(file_list)} files for inspection")
print(f"Example file: {file_list[0]}")

## Step 4: Distributed Inspection with Dask

Run distributed file inspection using Dask.
This extracts metadata from all files in parallel.
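
The pattern behind this step is to map an inspection function over the file list and keep failures separate from successes, so that one unreadable file does not abort the whole run. The sketch below illustrates that pattern with the standard library's `concurrent.futures` as a stand-in for Dask, so it runs without a cluster. `inspect_file` and its fake metadata are hypothetical placeholders, not the inspector's implementation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def inspect_file(path: str) -> dict:
    """Hypothetical per-file inspection; a real one would open the ROOT file."""
    if "broken" in path:
        raise OSError(f"cannot open {path}")
    return {"filepath": path, "num_events": 1000}


def inspect_all(paths: list[str], max_workers: int = 4):
    """Run inspect_file over paths in parallel, separating results and errors."""
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(inspect_file, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:  # record the failure and keep going
                errors.append({"filepath": path, "error": str(exc)})
    # Completion order is arbitrary, so sort for reproducible output.
    results.sort(key=lambda r: r["filepath"])
    return results, errors


demo_results, demo_errors = inspect_all(["a.root", "broken.root", "b.root"])
print(len(demo_results), "ok,", len(demo_errors), "failed")
```

With Dask, `Client.submit` and `dask.distributed.as_completed` follow the same submit-then-collect shape, with the work spread across worker processes instead of threads.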

In [None]:
# Start a local Dask cluster
cluster = LocalCluster(n_workers=4, threads_per_worker=1, processes=True)
client = Client(cluster)

print(f"Dask dashboard: {client.dashboard_link}")

In [None]:
# Run distributed inspection with error handling
# Note: max_branches limits the number of branches inspected per file for faster results
results, errors = inspect_dataset_distributed(
    client,
    file_list,
    max_branches=100,  # Limit to first 100 branches for speed
)

# Print formatted error summary
print(format_error_summary(errors))

if results:
    first = results[0]
    print("\n=== Example Result ===")
    print(f"  File: {first['filepath']}")
    print(f"  Events: {first['num_events']:,}")
    print(f"  Branches: {first['num_branches']}")
    print(f"  File size: {first['file_size_bytes'] / 1024**2:.1f} MB")
else:
    print("\nNo files were successfully inspected!")

## Step 5: Aggregate Statistics

Compute aggregate statistics across all inspected files.

### Optional: Fetch File Sizes from Rucio

Use the inspector's Rucio helper to retrieve authoritative file sizes. This
requires a valid Rucio environment (credentials and network access). If the
lookup fails, the notebook continues with locally derived statistics.

In [None]:
from intccms.metrics.inspector import rucio as inspector_rucio
from rich.console import Console

size_summary = None
try:
    size_summary = inspector_rucio.fetch_file_sizes(
        dm,
        processes=["signal", "ttbar_semilep"],
        max_files_per_process=5,
    )
    console = Console(force_jupyter=False)
    console.print(inspector_rucio.format_dataset_size_table(size_summary))
except Exception as exc:
    print("Skipping Rucio size lookup; continuing with size_summary=None:", exc)
    size_summary = None


In [None]:
from rich.console import Console
from intccms.metrics.inspector import (
    format_overall_stats_table,
    format_branch_stats_table,
    format_dataset_stats_table,
    format_compression_stats_table,
)

# Create console
console = Console(force_jupyter=False)

# Aggregate and display statistics
stats = aggregate_statistics(results, size_summary=size_summary)
table = format_overall_stats_table(stats)
console.print(table)
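
To make the aggregation concrete, the sketch below reduces a list of per-file records to overall totals. It uses the field names printed in Step 4 (`num_events`, `file_size_bytes`), but the `summarize` helper and the keys of its output are illustrative assumptions, not the actual return value of `aggregate_statistics`.

```python
from statistics import mean, median


def summarize(results: list[dict]) -> dict:
    """Aggregate per-file records into overall totals (illustrative only)."""
    events = [r["num_events"] for r in results]
    sizes = [r["file_size_bytes"] for r in results]
    return {
        "num_files": len(results),
        "total_events": sum(events),
        "avg_events_per_file": mean(events) if events else 0,
        "median_events_per_file": median(events) if events else 0,
        "total_size_bytes": sum(sizes),
    }


# Made-up records for illustration only.
sample_results = [
    {"filepath": "a.root", "num_events": 1000, "file_size_bytes": 2 * 1024**2},
    {"filepath": "b.root", "num_events": 3000, "file_size_bytes": 4 * 1024**2},
    {"filepath": "c.root", "num_events": 2000, "file_size_bytes": 6 * 1024**2},
]

summary = summarize(sample_results)
print(f"{summary['total_events']:,} events in {summary['num_files']} files")
```

Reporting both the mean and the median is useful because a few very large files can pull the mean well away from the typical file.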


### Understanding Box Plots

The box plots in this analysis show statistical distributions:

- **Box**: Contains the middle 50% of data (interquartile range, IQR)
- **Line inside box**: Median value (50th percentile)
- **Whiskers**: Extend to the 5th and 95th percentiles
- **Points beyond whiskers**: Outliers outside the 5th-95th percentile range

This visualization helps identify data skewness, outliers, and distribution characteristics.
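
The five markers described above can be computed directly from the data. The sketch below does so with the standard library; `box_plot_summary` is a hypothetical helper, and the exact percentile interpolation the inspector's plots use may differ.

```python
from statistics import quantiles


def box_plot_summary(values: list[float]) -> dict:
    """Compute the five markers of a box plot with 5th/95th-percentile whiskers."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = quantiles(values, n=20, method="inclusive")
    p5, q1, med, q3, p95 = cuts[0], cuts[4], cuts[9], cuts[14], cuts[18]
    return {
        "whisker_low": p5,
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_high": p95,
        # Everything outside the whiskers is drawn as an individual point.
        "outliers": [v for v in values if v < p5 or v > p95],
    }


# Made-up sample: the integers 1..100.
summary_box = box_plot_summary(list(range(1, 101)))
print(summary_box["median"], summary_box["outliers"])
```

With percentile whiskers, roughly 10% of the points fall outside the whiskers by construction, so "outlier" here means "tail of the distribution" rather than "anomaly". In matplotlib this convention corresponds to passing `whis=(5, 95)` to `boxplot`; whether the inspector's plots are built that way is an assumption.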


## Step 6: Branch Statistics

Analyze branch size and compression distributions.

In [None]:
from intccms.metrics.inspector.aggregator import compute_branch_statistics

branch_stats = compute_branch_statistics(results)
table = format_branch_stats_table(branch_stats)
console.print(table)


## Step 7: Per-Dataset Statistics

Group results by dataset and compute per-dataset statistics.
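
Conceptually, this step buckets the per-file records by dataset name and then reduces each bucket. Here is a minimal sketch, assuming `dataset_map` maps a file path to its dataset name. `group_records` and `per_dataset_stats` are hypothetical helpers, not the inspector's functions.

```python
from collections import defaultdict


def group_records(results: list[dict], dataset_map: dict[str, str]) -> dict:
    """Group per-file records by dataset name (hypothetical helper)."""
    grouped = defaultdict(list)
    for record in results:
        # Files missing from the map are collected under "unknown".
        grouped[dataset_map.get(record["filepath"], "unknown")].append(record)
    return dict(grouped)


def per_dataset_stats(grouped: dict) -> dict:
    """Compute file count, total events, and average events per file."""
    stats = {}
    for name, records in grouped.items():
        total = sum(r["num_events"] for r in records)
        stats[name] = {
            "num_files": len(records),
            "total_events": total,
            "avg_events_per_file": total / len(records),
        }
    return stats


# Made-up records and mapping for illustration only.
demo_records = [
    {"filepath": "sig_0.root", "num_events": 1000},
    {"filepath": "sig_1.root", "num_events": 3000},
    {"filepath": "tt_0.root", "num_events": 500},
]
demo_map = {
    "sig_0.root": "signal",
    "sig_1.root": "signal",
    "tt_0.root": "ttbar_semilep",
}

demo_stats = per_dataset_stats(group_records(demo_records, demo_map))
print(demo_stats["signal"])
```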

In [None]:
# Group by dataset
grouped = group_by_dataset(results, dataset_map)

# Compute per-dataset statistics
dataset_stats = compute_dataset_statistics(grouped, size_summary=size_summary)

print("\n=== Per-Dataset Statistics ===")
for dataset_name, ds_stats in dataset_stats.items():
    print(f"\n{dataset_name}:")
    print(f"  Files: {ds_stats['num_files']}")
    print(f"  Total events: {ds_stats['total_events']:,}")
    print(f"  Avg events/file: {ds_stats['avg_events_per_file']:,.0f}")
    if ds_stats['total_size_bytes'] > 0:
        print(f"  Total size: {ds_stats['total_size_bytes'] / 1024**3:.2f} GB")
        print(f"  Avg file size: {ds_stats['avg_file_size_bytes'] / 1024**2:.1f} MB")

In [None]:
# Display the per-dataset statistics computed in the previous cell as a rich table
table = format_dataset_stats_table(dataset_stats)
console.print(table)


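## Step 8: Compression Statistics

Summarize compression ratios and compressed versus uncompressed sizes across the inspected files.

In [None]: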
compression_stats = compute_compression_stats(results)

print("\n=== Compression Statistics ===")
print(f"Files with compression info: {compression_stats['files_with_compression']}")
print(f"Overall compression ratio: {compression_stats['overall_compression_ratio']:.2f}x")
print(f"Average tree compression: {compression_stats['avg_tree_compression_ratio']:.2f}x")
print(f"Median tree compression: {compression_stats['median_tree_compression_ratio']:.2f}x")
print(f"Total compressed: {compression_stats['total_compressed_bytes'] / 1024**3:.2f} GB")
print(f"Total uncompressed: {compression_stats['total_uncompressed_bytes'] / 1024**3:.2f} GB")
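
The overall ratio and the average ratio printed above can differ noticeably. Assuming the overall ratio is total uncompressed bytes divided by total compressed bytes (an assumption about `compute_compression_stats`, not confirmed here), large files dominate it, while the average and median treat every file equally. The sketch below shows the effect on made-up numbers.

```python
from statistics import mean, median

# Made-up (compressed_bytes, uncompressed_bytes) pairs for three files.
file_sizes = [
    (100, 200),    # ratio 2.0
    (100, 400),    # ratio 4.0
    (1000, 9000),  # ratio 9.0, and by far the largest file
]

per_file_ratios = [unc / comp for comp, unc in file_sizes]

total_compressed = sum(comp for comp, _ in file_sizes)
total_uncompressed = sum(unc for _, unc in file_sizes)

overall_ratio = total_uncompressed / total_compressed  # size-weighted
average_ratio = mean(per_file_ratios)                  # each file counts once
median_ratio = median(per_file_ratios)

print(
    f"overall {overall_ratio:.2f}x, "
    f"average {average_ratio:.2f}x, "
    f"median {median_ratio:.2f}x"
)
```

When the overall ratio is far above the median, a handful of large, highly compressible files is likely driving the total.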

In [None]:
# Display the compression statistics as a rich table
compression_stats = compute_compression_stats(results)
table = format_compression_stats_table(compression_stats)
console.print(table)

## Step 9: Visualizations

Create plots to visualize the inspection results.


### Events per File by Dataset

In [None]:
# Events per file, by dataset
from intccms.metrics.inspector.plot import plot_events_per_file_by_dataset

fig, ax = plot_events_per_file_by_dataset(dataset_stats)
plt.show()


### Event Distribution

In [None]:
fig, ax = plot.plot_event_distribution(results)
plt.show()

### Dataset Comparison

In [None]:
fig, (ax1, ax2) = plot.plot_dataset_comparison(dataset_stats)
plt.show()

### Branch Size Distribution

In [None]:
fig, ax = plot.plot_branch_size_distribution(results)
plt.show()

### Branch Compression Distribution

In [None]:
fig, ax = plot.plot_branch_compression_distribution(results)
if fig is not None:
    plt.show()
else:
    print("No compression data available")

### Branch Distributions by Dataset

In [None]:
fig, (ax1, ax2) = plot.plot_branch_distributions_by_dataset(results, dataset_map)
plt.show()

### File Size Distribution

In [None]:
fig, ax = plot.plot_file_size_distribution(results, size_summary=size_summary)
if fig is not None:
    plt.show()
else:
    print("No file size data available (all files are remote)")

### Summary Dashboard

Create a comprehensive dashboard with all key plots.

In [None]:
fig = plot.plot_summary_dashboard(results, dataset_stats, dataset_map)
plt.show()

## Step 10: Save Results

You can save plots to files and export statistics to JSON.
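
One practical caveat for the JSON export: `json.dump` raises `TypeError` on values it does not recognize. If the statistics dictionaries contain such values (for example `Path` objects, sets, or NumPy scalars, which is an assumption about their contents), a `default=` callback can convert them. The sketch below uses a hypothetical `to_jsonable` helper and made-up data.

```python
import json
from pathlib import Path


def to_jsonable(value):
    """Fallback for json.dumps: convert types the encoder does not know."""
    if isinstance(value, Path):
        return value.as_posix()
    if isinstance(value, (set, frozenset)):
        return sorted(value)
    # NumPy scalars and similar objects expose .item() to get a Python value.
    if hasattr(value, "item"):
        return value.item()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


# Made-up statistics for illustration only.
example_stats = {
    "source": Path("data") / "a.root",
    "datasets": {"signal", "ttbar_semilep"},
    "total_events": 6000,
}

text = json.dumps(example_stats, indent=2, default=to_jsonable)
restored = json.loads(text)
print(restored["datasets"])
```

The callback is only invoked for values the encoder cannot handle itself, so plain numbers, strings, lists, and dicts pass through unchanged.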

In [None]:
import json

# Save summary dashboard
# fig = plot.plot_summary_dashboard(
#     results, dataset_stats, dataset_map,
#     save_path="input_summary.png"
# )

# Export statistics to JSON
# output = {
#     "overall_stats": stats,
#     "dataset_stats": dataset_stats,
#     "compression_stats": compression_stats,
#     "branch_stats": branch_stats,
# }
# 
# with open("inspection_results.json", "w") as f:
#     json.dump(output, f, indent=2)

print("Done! You can save plots and export statistics as needed.")

## Cleanup

In [None]:
# Close Dask client and cluster
client.close()
cluster.close()