# IRIS-HEP CMS Integration Challenge: Workflow Demonstration

This notebook demonstrates the workflow of the IRIS-HEP CMS integration challenge. The workflow is modular and consists of the following steps:

1. Environment Setup
2. Configuration Construction
3. Output Manager & Directory Structure
4. Metadata Extraction
5. Skimming & Event Selection
6. Analysis & Histogramming
7. Statistical Analysis
8. Full Workflow

## 1. Environment Setup

To set up the environment, use the following command to launch JupyterLab with the required dependencies:

```sh
pixi run lab
```

Alternatively, if you prefer a Conda-like environment, activate it using the provided script:

```sh
source pixi_activate.sh
```

In [None]:
# Dask stuff
# Register some modules for dask workers to pickle/unpickle
import cloudpickle
import utils, user
cloudpickle.register_pickle_by_value(utils)
cloudpickle.register_pickle_by_value(user)

# Setup some functions for dask client handling
from dask.distributed import Client, LocalCluster, PipInstall
from dask.distributed import WorkerPlugin
import os

DIST = True  # Use distributed/remote bucket
S3 = True    # Use S3 protocol (vs local filesystem)

# Read credentials from the environment; never hardcode secrets in a notebook
AWS_ID = os.environ["AWS_ACCESS_KEY_ID"]
AWS_SECRET = os.environ["AWS_SECRET_ACCESS_KEY"]
env_vars = {
    "AWS_ACCESS_KEY_ID": AWS_ID,
    "AWS_SECRET_ACCESS_KEY": AWS_SECRET,
}

deps = ["git+https://github.com/scikit-hep/coffea.git@master"]  # , "s3pathlib", "aiobotocore[boto3]==2.24.1"]

# --- helpers for plugin approach ---
class EnvSetter(WorkerPlugin):
    def __init__(self, env_vars: dict):
        self.env_vars = env_vars

    def setup(self, worker):
        import os
        os.environ.update(self.env_vars)


# --- helpers for the client.run approach ---
def _set_env_on_worker(env: dict):
    import os
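    try:
        os.environ.update(env)
    except Exception as exc:
        # Chain the original error so the real cause stays visible
        raise ValueError("Failed to set environment variables on worker") from exc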
    return True

def _get_env_on_worker(keys):
    import os
    return {k: os.environ.get(k) for k in keys}

def get_client(
    use_plugin: bool = True,
    env_vars: dict | None = env_vars,
    dependencies: list[str] | None = deps,
    scheduler_address: str = "tls://localhost:8786",
) -> Client:
    """
    Connect to an existing Dask scheduler, install deps, set env on workers.

    Parameters
    ----------
    use_plugin : bool
        If True, set env using WorkerPlugin; else use client.run.
    env_vars : dict | None
        Environment variables to propagate. Defaults to AWS creds from os.environ.
    dependencies : list[str] | None
        Packages to install on workers via PipInstall.
    scheduler_address : str
        Existing scheduler address to connect to.
    """
    client = Client(scheduler_address)

    # 1) install dependencies first (workers may restart)
    client.register_plugin(PipInstall(packages=dependencies))
    client.wait_for_workers(1)

    # 2) set env on workers
    if use_plugin:
        client.register_plugin(EnvSetter(env_vars))
    else:
        client.run(_set_env_on_worker, env_vars)

    return client

def close_client(client=None):
    if client:
        client.close()


# Install required packages
!pip install omegaconf

## Configuration Flags

The workflow behavior is controlled by two flags defined in the Dask setup cell:

### `DIST` - Storage Location
- `DIST = True`: Use remote/distributed S3 storage at `https://red-s3.unl.edu/cmsaf-test-oshadura`
- `DIST = False`: Use local Ceph storage at `http://rook-ceph-rgw-my-store.rook-ceph.svc/...`

### `S3` - Storage Protocol  
- `S3 = True`: Use S3 protocol with storage_options (for remote S3 or S3-compatible storage)
- `S3 = False`: Use local filesystem protocol (no storage_options)

### Common Combinations

| DIST | S3 | Description | Use Case |
|------|-----|-------------|----------|
| `True` | `True` | Remote S3 with S3 protocol | Distributed processing with remote bucket |
| `False` | `True` | Local Ceph with S3 protocol | Local cluster with S3-compatible storage |
| `False` | `False` | Local filesystem | Development/testing on local machine |

**Note:** `DIST=True, S3=False` is not a meaningful combination: with `S3 = False` the remote endpoint is never used, so the skimming output is written to the local filesystem.
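
As a minimal sketch of the table above, the mapping from the two flags to storage settings could look like the following. The helper `resolve_storage` is hypothetical and not part of the notebook; the endpoints are the ones listed above, with the local Ceph path left elided as in the text.

```python
# Hypothetical helper (not part of the notebook): maps the two flags to the
# protocol and endpoint used later when configuring the skimming output.
REMOTE_ENDPOINT = "https://red-s3.unl.edu/cmsaf-test-oshadura"
LOCAL_CEPH_ENDPOINT = "http://rook-ceph-rgw-my-store.rook-ceph.svc/..."  # path elided as above


def resolve_storage(dist: bool, s3: bool) -> dict:
    """Return the protocol and endpoint implied by the DIST and S3 flags."""
    if not s3:
        # Local filesystem: no endpoint is needed, so DIST has no effect.
        return {"protocol": "local", "endpoint_url": None}
    endpoint = REMOTE_ENDPOINT if dist else LOCAL_CEPH_ENDPOINT
    return {"protocol": "s3", "endpoint_url": endpoint}


for dist, s3 in [(True, True), (False, True), (False, False), (True, False)]:
    print(f"DIST={dist!s:<5} S3={s3!s:<5} -> {resolve_storage(dist, s3)}")
```

The last combination resolves to the same settings as `DIST=False, S3=False`, which is why it is listed as not meaningful.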

---

## 2. Configuration Construction

A lightweight Python config defines general settings, datasets, cuts, observables, and systematics. All entries are type-checked, and most can be overridden from the CLI.
Everything else in the workflow reads from this config.
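
To illustrate what type-checking a config means here, the sketch below validates a plain dictionary with standard-library dataclasses. The classes `GeneralConfig` and `DatasetsConfig` and the function `validate` are hypothetical; the actual `Config` schema in `utils/schema.py` is not shown in this notebook and may work differently. Only the option names (`processes`, `run_skimming`, `max_files`) are taken from the cells below.

```python
from dataclasses import dataclass, field


@dataclass
class GeneralConfig:
    # Option names mirror ones used in this notebook; the class is hypothetical.
    processes: list = field(default_factory=list)
    run_skimming: bool = True

    def __post_init__(self):
        if not isinstance(self.run_skimming, bool):
            raise TypeError("general.run_skimming must be a bool")
        if not all(isinstance(p, str) for p in self.processes):
            raise TypeError("general.processes must be a list of strings")


@dataclass
class DatasetsConfig:
    max_files: int = -1

    def __post_init__(self):
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(self.max_files, bool) or not isinstance(self.max_files, int):
            raise TypeError("datasets.max_files must be an int")


def validate(raw: dict) -> tuple:
    """Build typed sections from a plain dict, raising TypeError on bad values."""
    return (
        GeneralConfig(**raw.get("general", {})),
        DatasetsConfig(**raw.get("datasets", {})),
    )


general, datasets_cfg = validate(
    {"general": {"processes": ["signal"]}, "datasets": {"max_files": 1}}
)
print(general, datasets_cfg)

try:
    validate({"datasets": {"max_files": "one"}})
except TypeError as err:
    print(f"rejected: {err}")
```

Failing early at construction time means a mistyped option is reported before any expensive step runs.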

In [None]:
# Import Rich-based Configuration Display from logging module
from utils.logging import display_config_table, get_config_logger

# Create a global config logger instance for this notebook
config_logger = get_config_logger()

In [None]:
# Example: Demonstrate configuration comparison and change detection
from example_cms.configs.configuration import config as original_config
from utils.schema import Config, load_config_with_restricted_cli

import copy

# Work on a deep copy so the original configuration stays available for comparison
config = copy.deepcopy(original_config)

print("=== Full configuration ===")
display_config_table(config, expand=False)

=== Full configuration ===


In [None]:
# Let's look at the datasets config
display_config_table({"datasets": config["datasets"]}, expand=True)

In [None]:
# Make a modification
config["datasets"]["max_files"] = 1  # Limit to 1 file per dataset
config["general"]["processes"] = ["signal", "ttbar_semilep"]

print("=== SHOW ONLY CHANGES ===")
display_config_table({"datasets": config["datasets"]},
                    expand=True,
                    compare_with={"datasets": original_config["datasets"]},
                    show_only_changes=True)

cli_args = []  # No CLI overrides for this demo
full_config = load_config_with_restricted_cli(config, cli_args)

print("✅ Processed CLI arguments and loaded full configuration")
validated_config = Config(**full_config)  # This is the key validation step!

## 3. Output Manager & Directory Structure

The framework uses a centralized `OutputDirectoryManager` (`utils/output_manager.py`) to organize all analysis outputs.
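
The idea of a central directory manager can be sketched with `pathlib`. The class `SketchOutputManager` and its directory layout are assumptions made for illustration; only the accessor names `get_skimmed_dir`, `get_histograms_dir`, and `get_statistics_dir` appear in this notebook, and the real `OutputDirectoryManager` may behave differently.

```python
import tempfile
from pathlib import Path


class SketchOutputManager:
    """Hypothetical stand-in for OutputDirectoryManager; the layout is assumed."""

    def __init__(self, root_output_dir, skimmed_dir=None):
        self.root = Path(root_output_dir)
        # An explicit override wins; otherwise use a subdirectory of the root.
        self._skimmed = Path(skimmed_dir) if skimmed_dir else self.root / "skimmed"

    def _ensure(self, path: Path) -> Path:
        # Create the directory on first access so callers can write immediately.
        path.mkdir(parents=True, exist_ok=True)
        return path

    def get_skimmed_dir(self) -> Path:
        return self._ensure(self._skimmed)

    def get_histograms_dir(self) -> Path:
        return self._ensure(self.root / "histograms")

    def get_statistics_dir(self) -> Path:
        return self._ensure(self.root / "statistics")


with tempfile.TemporaryDirectory() as tmp:
    manager = SketchOutputManager(tmp)
    hist_file = manager.get_histograms_dir() / "histograms.root"
    print(hist_file.relative_to(tmp))
    histograms_dir_exists = hist_file.parent.is_dir()
```

Routing every path through one object keeps the later steps free of hardcoded directory names.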

In [None]:
from utils.output_manager import OutputDirectoryManager

output_manager = OutputDirectoryManager(
    root_output_dir=validated_config.general.output_dir,
    cache_dir=validated_config.general.cache_dir,
    metadata_dir=validated_config.general.metadata_dir,
    skimmed_dir=validated_config.general.skimmed_dir
)

print("✅ Created OutputDirectoryManager with validated paths")


## 4. Metadata Extraction

Metadata extraction runs `coffea`'s preprocessor to build analysis workitems, a `coffea`-compatible fileset, and a summary of the processed `NanoAOD`s. Its outputs stay in memory but are also stored in `JSON` files, so that subsequent steps can be rerun without preprocessing every time.
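
The store-and-reuse pattern described above can be sketched as follows. The function `load_or_generate`, the file name `fileset.json`, and the metadata structure are all made up for illustration. Falling back to generation when the file is missing is an assumption of this sketch, not confirmed behaviour of `NanoAODMetadataGenerator`.

```python
import json
import tempfile
from pathlib import Path


def load_or_generate(path: Path, generate: bool, build):
    """Return metadata from `path`; call `build()` only if asked or if no file exists."""
    if generate or not path.exists():
        data = build()
        path.write_text(json.dumps(data, indent=2))
        return data
    return json.loads(path.read_text())


calls = []


def expensive_preprocess():
    # Stand-in for the real preprocessing; the returned structure is invented.
    calls.append(1)
    return {"signal": {"files": {"file_0.root": {"num_entries": 1000}}}}


with tempfile.TemporaryDirectory() as tmp:
    fileset_path = Path(tmp) / "fileset.json"
    first = load_or_generate(fileset_path, generate=True, build=expensive_preprocess)
    second = load_or_generate(fileset_path, generate=False, build=expensive_preprocess)

print(f"same content: {first == second}, preprocess calls: {len(calls)}")
```

The second call reads the JSON file written by the first, so the expensive function runs only once.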

In [None]:
from utils.metadata_extractor import NanoAODMetadataGenerator
from utils.datasets import ConfigurableDatasetManager

# Step 1: Create dataset manager instance
dataset_manager = ConfigurableDatasetManager(validated_config.datasets)

# Step 2: Create metadata generator instance
validated_config.general.run_metadata_generation = False

client, cluster = None, None  # get_client()
metadata_generator = NanoAODMetadataGenerator(
    dataset_manager=dataset_manager,
    output_manager=output_manager,
)
# Under the hood, this uses coffea's preprocess function
metadata = metadata_generator.run(
    generate_metadata=validated_config.general.run_metadata_generation,
    # Restrict to the configured processes, if any are set
    processes_filter=getattr(validated_config.general, "processes", None),
)
# close_client(client)

## 5. Skimming & Event Selection

The skimming step processes the workitems from the previous step and applies event and branch selections with `dask` and a `coffea`-like processor (see [Alex's issue](https://github.com/scikit-hep/coffea/issues/1393)). Currently, the skimming results are held in memory as events and are also written to disk as `ROOT` files, so that the step does not have to be rerun. The events are additionally cached in `.pkl` files for faster re-reading. Eventually, this step needs to be integrated with the subsequent steps into a complete `coffea` or `coffea`-like processor.
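
Conceptually, a skim combines a row filter (event selection) with a column filter (branch selection). The pure-Python sketch below shows this on a toy columnar record. The function `skim`, the branch names, and the cut values are invented; the actual step operates on `NanoAOD` events with `dask` rather than on Python lists.

```python
def skim(events: dict, keep_branches: list, selection) -> dict:
    """Keep only `keep_branches`, and only the events for which `selection` is True."""
    n_events = len(next(iter(events.values())))
    # Evaluate the selection once per event, giving it access to all branches.
    mask = [
        selection({name: column[i] for name, column in events.items()})
        for i in range(n_events)
    ]
    return {
        name: [value for value, keep in zip(events[name], mask) if keep]
        for name in keep_branches
    }


# Toy columnar events; branch names and values are invented for illustration.
events = {
    "muon_pt": [55.0, 12.0, 80.0, 30.0],
    "n_jets": [4, 2, 5, 1],
    "unused_branch": [0, 0, 0, 0],
}

skimmed = skim(
    events,
    keep_branches=["muon_pt", "n_jets"],
    selection=lambda ev: ev["muon_pt"] > 25.0 and ev["n_jets"] >= 4,
)
print(skimmed)
```

Dropping unused branches and failing events early is what makes the skimmed files small enough to reread quickly in later steps.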

In [None]:
from utils.skimming import process_and_load_events

# Extract workitems and fileset from metadata generator
fileset = metadata_generator.fileset
workitems = metadata_generator.workitems

print(f"📊 Processing {len(workitems)} workitems across {len(fileset)} datasets")

# Disable caching for demonstration
validated_config.general.read_from_cache = False
validated_config.general.run_skimming = True

# Configure output based on DIST and S3 flags
if DIST:
    new_endpoint = "https://red-s3.unl.edu/cmsaf-test-oshadura"
else:
    new_endpoint = "http://rook-ceph-rgw-my-store.rook-ceph.svc/triton-116ed3e4-b173-48c1-aea0-affee451feda"

if S3:
    # Use S3 protocol with storage_options
    validated_config.preprocess.skimming.output.protocol = "s3"
    validated_config.preprocess.skimming.output.base_uri = "s3://"
    validated_config.preprocess.skimming.output.to_kwargs["storage_options"]["client_kwargs"]["endpoint_url"] = new_endpoint
    validated_config.preprocess.skimming.output.from_kwargs["storage_options"]["client_kwargs"]["endpoint_url"] = new_endpoint
    print(f"🔗 Using S3 protocol with endpoint: {new_endpoint}")
else:
    # Use local protocol without storage_options
    validated_config.preprocess.skimming.output.protocol = "local"
    validated_config.preprocess.skimming.output.base_uri = None
    # Remove storage_options for local protocol
    if "storage_options" in validated_config.preprocess.skimming.output.to_kwargs:
        del validated_config.preprocess.skimming.output.to_kwargs["storage_options"]
    if "storage_options" in validated_config.preprocess.skimming.output.from_kwargs:
        del validated_config.preprocess.skimming.output.from_kwargs["storage_options"]
    print(f"📁 Using local filesystem protocol")

# Skim data with dask according to the workitems
datasets = process_and_load_events(
    workitems,
    validated_config,
    output_manager,
    fileset,
    metadata_generator.nanoaods_summary,
)

# Display the structure of processed datasets
print(f"\n✅ Skimming complete! Processed datasets structure:")
for dataset in datasets:
    if dataset.events:
        print(f"  📁 {dataset.name} processed")
        for i, (events, metadata) in enumerate(dataset.events):
            print(f"    └── File {i+1}: {len(events)} events, {len(events.fields)} branches")

print(f"\n💾 Skimmed data saved to: {output_manager.get_skimmed_dir()}")

for dataset in datasets:
    print(dataset.events)

In [None]:
# Enable caching to show how it speeds up subsequent runs
validated_config.general.read_from_cache = True
validated_config.general.run_skimming = False
# Run the skimming again; it will look for cached files and,
# if they don't exist, fall back to regular skimming
cached_datasets = process_and_load_events(
    workitems,
    validated_config,
    output_manager,
    fileset,
    metadata_generator.nanoaods_summary
)

# Show cache directory contents
import os
cache_files = os.listdir(output_manager.get_skimmed_dir())
print(f"💾 Cached files in skimmed directory: {len(cache_files)} files")
print(f"📁 Cache location: {output_manager.get_cache_dir()}")

## 6. Analysis & Histogramming

This step encapsulates several underlying function calls:
1. Apply the global event selection
2. Trigger an MVA training, if one is configured and enabled
3. Apply corrections from `correctionlib`
4. Compute ghost observables and add them to the event record
5. Build channels with their corresponding selections
6. Compute all observables (once for the nominal case, once per systematic variation)
7. Create histograms
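
The last two steps, computing observables once for the nominal case and once per systematic variation and then histogramming them, can be sketched in plain Python. The function `fill_histogram`, the variation names, and all numbers are invented. In this sketch a variation only changes the event weights, which is a simplification; the real analysis may also vary the observables themselves.

```python
from bisect import bisect_right


def fill_histogram(values, weights, edges):
    """Weighted 1D histogram; values outside [edges[0], edges[-1]) are dropped."""
    counts = [0.0] * (len(edges) - 1)
    for value, weight in zip(values, weights):
        index = bisect_right(edges, value) - 1
        if 0 <= index < len(counts):
            counts[index] += weight
    return counts


# Toy observable values and per-event weights for each variation (all invented).
mass = [120.0, 250.0, 90.0, 180.0]
weights = {
    "nominal": [1.0, 1.0, 1.0, 1.0],
    "pileup_up": [1.1, 0.9, 1.0, 1.2],
    "pileup_down": [0.9, 1.1, 1.0, 0.8],
}
edges = [0.0, 100.0, 200.0, 300.0]

# One histogram per variation, all sharing the same binning.
histograms = {name: fill_histogram(mass, w, edges) for name, w in weights.items()}
for name, counts in histograms.items():
    print(f"{name:<12} {counts}")
```

Keeping the binning identical across variations is what allows the statistical model to compare each varied histogram with the nominal one bin by bin.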

In [None]:
# Import the analysis class for the analysis
from analysis.nondiff import NonDiffAnalysis
from utils.output_files import save_histograms_to_root

validated_config.general.run_histogramming = True

# Initialize the analysis object
# This analysis object will handle histogram creation, observable computation, and systematic variations
nondiff_analysis = NonDiffAnalysis(validated_config, datasets, output_manager)

print(f"🔬 Analysis initialized for {len(nondiff_analysis.datasets)} datasets")

# Loop over each dataset and its associated event data
histogram_count = 0
for dataset in datasets:
    if dataset.events:
        for events, metadata in dataset.events:
            print(f"   • Processing {len(events)} events with metadata: {metadata['process']}")
            nondiff_analysis.process(events, metadata)
            histogram_count += 1

save_histograms_to_root(
    nondiff_analysis.nD_hists_per_region,
    output_file=nondiff_analysis.output_manager.get_histograms_dir() / "histograms.root",
)

print(f"📈 Generated histograms for channels: {[ch.name for ch in validated_config.channels]}")

## 7. Statistical Analysis

The statistical analysis step performs a fit to the histograms and visualizes the results. It uses `cabinetry` with a manually created config, and it also creates and stores the corresponding `pyhf` workspace.

**Note:** The following fit is performed with the full CMS open-data dataset, not the small example used here.
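
For intuition about what a fit to histograms does, the sketch below minimises a binned Poisson negative log-likelihood for a single signal-strength parameter. This is a strong simplification and not how `cabinetry` or `pyhf` are implemented: there are no nuisance parameters or constraint terms, and the yields are invented. The functions `nll` and `fit_signal_strength` are hypothetical.

```python
import math


def nll(mu, signal, background, observed):
    """Binned Poisson negative log-likelihood, up to a constant independent of mu."""
    total = 0.0
    for s, b, n in zip(signal, background, observed):
        expected = mu * s + b
        total += expected - n * math.log(expected)
    return total


def fit_signal_strength(signal, background, observed, low=0.0, high=5.0, tol=1e-6):
    """Minimise the NLL, which is convex in mu, with a ternary search on [low, high]."""
    while high - low > tol:
        left = low + (high - low) / 3
        right = high - (high - low) / 3
        if nll(left, signal, background, observed) < nll(right, signal, background, observed):
            high = right
        else:
            low = left
    return 0.5 * (low + high)


# Invented per-bin yields; the "observed" data equal the prediction for mu = 1.
signal = [2.0, 5.0, 3.0]
background = [50.0, 40.0, 30.0]
observed = [s + b for s, b in zip(signal, background)]

mu_hat = fit_signal_strength(signal, background, observed)
print(f"best-fit signal strength: {mu_hat:.3f}")
```

Because the toy data are built from the prediction at a signal strength of one, the minimum of the likelihood sits at that value, which makes the sketch easy to check.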

In [None]:
import cabinetry
validated_config.statistics.cabinetry_config = "./example-old/outputs/cabinetry/"
cabinetry_config = cabinetry.configuration.load(
    validated_config.statistics.cabinetry_config
)
data, fit_results, pre_fit_predictions, postfit_predictions = (
    nondiff_analysis.run_fit(cabinetry_config=cabinetry_config)
)
cabinetry.visualize.data_mc(
    pre_fit_predictions,
    data,
    close_figure=False,
    config=cabinetry_config,
    figure_folder=nondiff_analysis.output_manager.get_statistics_dir(),
)
cabinetry.visualize.data_mc(
    postfit_predictions,
    data,
    close_figure=False,
    config=cabinetry_config,
    figure_folder=nondiff_analysis.output_manager.get_statistics_dir(),
)
cabinetry.visualize.pulls(
    fit_results,
    close_figure=False,
    figure_folder=nondiff_analysis.output_manager.get_statistics_dir(),
)

## 8. Full Workflow

The workflow above is broken into steps for demonstration. In practice, the simplest way to use it is to combine the main pieces in one steering script and run it from the command line, overriding config options through CLI arguments. The current implementation is steered by `analysis.py` and can be executed with:

```sh
python analysis.py
```
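
The override syntax accepted by `analysis.py` is not shown in this notebook. As a rough sketch, assuming dotted `section.key=value` arguments, overrides could be applied to a nested config as follows. The function `apply_overrides` is hypothetical and is not the implementation behind `load_config_with_restricted_cli`.

```python
import ast


def apply_overrides(config: dict, args: list) -> dict:
    """Apply `dotted.key=value` overrides to a nested dict, in place."""
    for arg in args:
        dotted_key, _, raw_value = arg.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node[name]  # unknown sections raise KeyError on purpose
        if leaf not in node:
            raise KeyError(f"unknown option: {dotted_key}")
        try:
            # Parse Python literals such as True, 1, or ["a", "b"]
            node[leaf] = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            node[leaf] = raw_value  # anything else is kept as a plain string
    return config


# Assumed option names, taken from the cells above; the values are arbitrary.
config = {
    "general": {"run_skimming": True, "output_dir": "outputs"},
    "datasets": {"max_files": -1},
}
apply_overrides(
    config,
    ["general.run_skimming=False", "datasets.max_files=1", "general.output_dir=results/test"],
)
print(config)
```

Rejecting unknown keys catches typos in override names instead of silently adding new options.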

---