# CMS Z' single-lepton: Skimming

This notebook extends the full_run workflow to focus on skimming: filtering events and saving them to disk in a configurable output format. Supported backends are Parquet on S3, ROOT TTree on XRootD, and ROOT RNTuple on XRootD.

A verification cell reads back all skimmed files and checks event counts, and a final section re-runs the processor on the skimmed files.

## Workflow Overview

1. Setup Python path for intccms package
2. Install dependencies and register modules for cloud pickle
3. Acquire Dask client from AF environment
4. Configure analysis parameters (skimming mode)
5. Configure skimming output (format, destination)
6. Run metadata extraction (`coffea` preprocessing)
7. Run processor with `coffea.processor.Runner`
8. Verify skimmed output by reading back all files
9. Re-run the processor on the skimmed files (analysis pass)

## AF flag
We might want to run this code on different analysis facilities (AFs), each of which may have its own limitations or require a different Dask client setup. To switch between facilities, set the `AF` variable to the one you want. If your facility is not supported yet, you can add it in the relevant sections of this notebook.

In [None]:
# Facilities currently supported: coffeacasa-condor, coffeacasa-gateway, purdue-af
AF = "coffeacasa-condor"
# The client is set up with a context manager; this flag decides whether the client
# is closed automatically on exit. If False, you close it manually.
AUTO_CLOSE_CLIENT = False

## Imports and dependencies

### The intccms package
The CMS implementation of the integration challenge is organized as a package, which means we have to add its source code to the Python path. The package is referred to as `intccms`.

In [None]:
# Setup Python path to include intccms package
import sys
from pathlib import Path

# Add src directory to Python path
repo_root = Path.cwd()
src_dir = repo_root / "src"
examples_dir = repo_root
if str(src_dir) not in sys.path:
    sys.path.insert(0, str(src_dir))
if str(examples_dir) not in sys.path:
    sys.path.insert(0, str(examples_dir))
print(f"\u2705 Added {src_dir} to Python path")
print(f"\u2705 Added {examples_dir} to Python path")

### Installing extra dependencies
The `intccms` package requires `omegaconf` and `roastcoffea`, which are not installed by default on an AF. `roastcoffea` is a tool developed while working on this project; it provides an API to extract metrics from coffea-processor workflows.

In [None]:
try:
    import omegaconf
except ImportError:
    print("\u26a0\ufe0f omegaconf not found, installing...")
    ! pip install omegaconf;

try:
    import roastcoffea
except ImportError:
    print("\u26a0\ufe0f roastcoffea not found, installing...")
    ! pip install roastcoffea;
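
The try/except pattern above can also be expressed as a lookup that does not import anything. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of `intccms`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on this system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "json" ships with Python; the second name is deliberately bogus.
print(missing_modules(["json", "surely_not_an_installed_module"]))
```

The returned list could then be passed to `pip install` in a single call, instead of one try/except per package.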

### Alternative coffea version
In some cases, we might need to install a `coffea` version that is not available on the AF, for example when testing a new feature or using a recently released version that contains a fix.

In [None]:
COFFEA_VERSION = "2025.12.0"
COFFEA_PIP = f"coffea=={COFFEA_VERSION}" if "git" not in COFFEA_VERSION else COFFEA_VERSION

! pip install $COFFEA_PIP ;

# Pip-installable dependencies to install on workers
WORKER_DEPENDENCIES = [COFFEA_PIP, "roastcoffea==0.1.2"]
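
The one-line conditional above treats any string containing `git` as a complete pip target and everything else as a version number. As a sketch of that rule in isolation (the function name `coffea_pip_spec` and the branch name in the example URL are made up for illustration):

```python
def coffea_pip_spec(version):
    """Turn a version string or a git URL into a pip requirement string."""
    # Anything mentioning "git" is passed to pip unchanged, as in the cell above.
    if "git" in version:
        return version
    return f"coffea=={version}"


print(coffea_pip_spec("2025.12.0"))
# The branch name below is a placeholder, not a real branch.
print(coffea_pip_spec("git+https://github.com/scikit-hep/coffea.git@my-branch"))
```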

### Imports from stdlib and other libraries

In this notebook we use `dask` and `coffea`. 

In [None]:
# stdlib
import copy
import os
import time

# third-party
import cloudpickle
from coffea.processor import DaskExecutor, IterativeExecutor
from coffea.nanoevents import NanoAODSchema

### Imports from intccms and other integration-challenge specific tooling

In [None]:
# intccms
from intccms.schema import Config, load_config_with_restricted_cli
from intccms.skimming.io import get_reader
from intccms.utils.output import OutputDirectoryManager
from intccms.utils.tools import load_dotenv
from intccms.metadata_extractor import DatasetMetadataManager
from intccms.datasets import DatasetManager
from intccms.analysis import run_processor_workflow

### Registering packages with cloudpickle
The `intccms` package cannot be installed on the workers via `pip`, and the configuration files live in Python modules that cannot be installed on the workers either. We therefore register both with `cloudpickle`, so that Dask serializes them by value and ships them to the workers.

In [None]:
import intccms
import example_cms

# Register modules for cloud pickle
cloudpickle.register_pickle_by_value(intccms)
cloudpickle.register_pickle_by_value(example_cms)

## Dask client setup

This notebook uses the `DaskExecutor` from `coffea` to distribute the task graph on the AF. The client setup differs between facilities, so we use a function that returns the correct client for the chosen `AF`. It is implemented as a context manager, and the client is alive inside the `with` block.

In [None]:
from intccms.utils.dask_client import acquire_client
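
To illustrate the context-manager pattern, here is a minimal sketch of how a `close_after` flag can control client cleanup. It is not the `intccms` implementation: `DummyClient` and `acquire_client_sketch` are hypothetical stand-ins so the example runs without a cluster.

```python
from contextlib import contextmanager


class DummyClient:
    """Stand-in for a Dask client so the sketch runs without a cluster."""

    def __init__(self, facility):
        self.facility = facility
        self.closed = False

    def close(self):
        self.closed = True


@contextmanager
def acquire_client_sketch(facility, close_after=True):
    """Yield (client, cluster); close the client on exit only if requested."""
    client = DummyClient(facility)  # a real version would branch on `facility`
    cluster = None                  # placeholder; no cluster object in this sketch
    try:
        yield client, cluster
    finally:
        if close_after:
            client.close()


with acquire_client_sketch("coffeacasa-condor", close_after=False) as (client, _):
    pass  # submit work here

print(client.closed)  # the client is still open, so the caller closes it
client.close()
```

Leaving the client open after the `with` block is what makes the manual `client.close()` at the end of this notebook necessary when `AUTO_CLOSE_CLIENT` is `False`.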

## Configuration Setup

The CMS analysis implementation is configurable via python modules, which we have to import. For this notebook, the configuration files are found in `example_cms/configs/`. You can modify the modules in this directory manually, or you can dynamically change settings using python dictionary manipulation. Below are some settings of interest that you might want to tune when you are testing your setup.

In [None]:
# intccms configuration import
from example_cms.configs.configuration import config as original_config

# Create a deepcopy that we can manipulate
config = copy.deepcopy(original_config)

# Limit the number of files for quick tests; None runs over all available files
config["datasets"]["max_files"] = None

# Use local output directory
config["general"]["output_dir"] = "example_cms/outputs/"

# Preprocessing (coffea) can be executed once and results loaded
config["general"]["run_metadata_generation"] = False # If True, run analysis pre-processing

# Processor = Skimming (filter and save) + Analysis
config["general"]["run_processor"] = True  # If True, the coffea processor is executed
config["general"]["run_analysis"] = False # Skimming only, no analysis
config["general"]["save_skimmed_output"] = True  # Save skimmed events to disk
config["general"]["use_skimmed_input"] = False  # Use skimmed events

# Analysis = Systematics + histogramming + statistics
config["general"]["run_histogramming"] = False
config["general"]["run_systematics"] = False
config["general"]["run_statistics"] = False

# Datasets to process, by default this is all datasets
#config["general"]["processes"] = ["data"] 

cli_args = []  # the code can also be run from the CLI, but that is not needed here
full_config = load_config_with_restricted_cli(config, cli_args)

# Config validates all settings with pydantic and exposes them as attributes
validated_config = Config(**full_config)
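
The deep copy at the top of the cell matters because the configuration is nested. A small sketch with a toy dictionary (the keys mirror the cell above, the values are made up) shows the difference from a shallow copy:

```python
import copy

original = {"general": {"run_analysis": True}, "datasets": {"max_files": None}}

shallow = dict(original)        # copies only the top-level mapping
deep = copy.deepcopy(original)  # copies the nested dictionaries too

deep["datasets"]["max_files"] = 2           # does not touch `original`
shallow["general"]["run_analysis"] = False  # leaks into `original`

print(original["datasets"]["max_files"])    # None
print(original["general"]["run_analysis"])  # False
```

With the deep copy, the imported `original_config` stays untouched, so the cell can be re-run with different overrides.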

## Skimming Output Configuration

Override the skimming output settings to test different backends.

| Backend | `OUTPUT_FORMAT` | `OUTPUT_DIR` example |
|---------|----------------|---------------------|
| Parquet on S3 | `"parquet"` | `"s3:///skim_out"` |
| ROOT TTree on XRootD | `"ttree"` | `"root://xrootd-local.unl.edu:1094//store/user/maly/skim_out/"` |
| ROOT RNTuple on XRootD | `"rntuple"` | `"root://xrootd-local.unl.edu:1094//store/user/maly/skim_out/"` |

In [None]:
# options: "parquet", "ttree", "rntuple"
OUTPUT_FORMAT = "rntuple" #"ttree" #"parquet"
OUTPUT_DIR = "root://xrootd-local.unl.edu:1094//store/user/maly/TEST_180226_SKIM_RNTUP/" # "s3:///TEST_180226_SKIM"

# S3 settings (only used when OUTPUT_DIR starts with "s3://")
S3_ENDPOINT = "https://red-s3.unl.edu/cmsaf-test-oshadura"

# --- Derived configuration (no need to edit below) ---

to_kwargs = {}
from_kwargs = {}
PROPAGATE_AWS = False

if OUTPUT_DIR.startswith("s3://"):
    load_dotenv()
    storage_options = {
        "key": os.environ["AWS_ACCESS_KEY_ID"],
        "secret": os.environ["AWS_SECRET_ACCESS_KEY"],
        "client_kwargs": {"endpoint_url": S3_ENDPOINT},
    }
    to_kwargs["storage_options"] = storage_options
    to_kwargs["compression"] = "zstd"
    from_kwargs["storage_options"] = storage_options
    PROPAGATE_AWS = True

validated_config.preprocess.skimming.output.format = OUTPUT_FORMAT
validated_config.preprocess.skimming.output.output_dir = OUTPUT_DIR
validated_config.preprocess.skimming.output.to_kwargs = to_kwargs
validated_config.preprocess.skimming.output.from_kwargs = from_kwargs

print(f"Output format: {OUTPUT_FORMAT}")
print(f"Output dir:    {OUTPUT_DIR}")
if to_kwargs:
    print(f"Writer kwargs: {to_kwargs}")
if PROPAGATE_AWS:
    print("AWS credentials will be propagated to workers")
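
The derived-configuration logic above can be read as a pure function of the output location. The sketch below restates it that way so it can be tested without real credentials; `build_io_kwargs` is a hypothetical helper, and the endpoint and keys in the usage lines are placeholders.

```python
def build_io_kwargs(output_dir, endpoint, env):
    """Return (to_kwargs, from_kwargs, propagate_aws) for an output location."""
    to_kwargs, from_kwargs = {}, {}
    if not output_dir.startswith("s3://"):
        # XRootD destinations need no extra reader or writer options here.
        return to_kwargs, from_kwargs, False
    storage_options = {
        "key": env["AWS_ACCESS_KEY_ID"],
        "secret": env["AWS_SECRET_ACCESS_KEY"],
        "client_kwargs": {"endpoint_url": endpoint},
    }
    to_kwargs.update(storage_options=storage_options, compression="zstd")
    from_kwargs.update(storage_options=storage_options)
    return to_kwargs, from_kwargs, True


fake_env = {"AWS_ACCESS_KEY_ID": "example-key", "AWS_SECRET_ACCESS_KEY": "example-secret"}
print(build_io_kwargs("root://some-host//store/skim/", "https://s3.example.org", fake_env))
print(build_io_kwargs("s3://bucket/skim", "https://s3.example.org", fake_env)[2])
```

Passing the environment as an argument, rather than reading `os.environ` inside the function, is a design choice of this sketch that keeps it easy to test.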

## Running the Workflow

Running the CMS integration challenge workflow is split into a few modular steps:

1. Setting up output directories
2. Building an input dataset manager
3. Running or loading the coffea preprocessing
4. Running the coffea processor

### Output manager setup

In [None]:
output_manager = OutputDirectoryManager(
    root_output_dir=validated_config.general.output_dir,
    cache_dir=validated_config.general.cache_dir,
    metadata_dir=validated_config.general.metadata_dir,
    skimmed_dir=validated_config.general.skimmed_dir
)

### Configure Data Redirector (Optional)

Override the redirector in the config used to access dataset files. This is useful for testing different storage backends. You can also change it in `example_cms/configs/skim.py`.

In [None]:
# Override redirector for all datasets
# Examples:
#   "root://xcache/"                    
#   "root://cmsxrootd.fnal.gov/"
#   "root://cms-xrd-global.cern.ch/"
REDIRECTOR = "root://xcache/"  # Change this to use a different redirector

print(f"Initial redirector  {validated_config.datasets.datasets[0].name}: {validated_config.datasets.datasets[0].redirector}")

# Apply to all datasets in config
for dataset in validated_config.datasets.datasets:
    dataset.redirector = REDIRECTOR

print(f"Redirector set to: {REDIRECTOR}")

# Verify the change
print(f"New redirector:  {validated_config.datasets.datasets[0].name}: {validated_config.datasets.datasets[0].redirector}")
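
A redirector is the host part of an XRootD URL that is placed in front of a file's `/store/...` path. How `intccms` combines the two is not shown in this notebook, so the following is only a sketch of the usual convention; `with_redirector` and the file path are hypothetical.

```python
def with_redirector(redirector, logical_path):
    """Prefix a /store/... path with an XRootD redirector."""
    # XRootD URLs keep a double slash between the host part and an absolute path.
    return redirector.rstrip("/") + "//" + logical_path.lstrip("/")


print(with_redirector("root://xcache/", "/store/mc/example.root"))
print(with_redirector("root://cmsxrootd.fnal.gov/", "/store/mc/example.root"))
```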

### Input dataset manager setup

In [None]:
dataset_manager = DatasetManager(validated_config.datasets)

### Coffea preprocessing

In [None]:
metadata_generator = DatasetMetadataManager(
    dataset_manager=dataset_manager,
    output_manager=output_manager,
    config=validated_config,
)

if metadata_generator.generate_metadata:
    with acquire_client(AF, close_after=AUTO_CLOSE_CLIENT, pip_packages=WORKER_DEPENDENCIES) as (client, cluster):
        metadata_generator.run(executor=DaskExecutor(client=client))
else:
    metadata_generator.run()  # No client needed

# Build metadata lookup and extract workitems
metadata_lookup = metadata_generator.build_metadata_lookup()
workitems = metadata_generator.workitems

### Run processor

In [None]:
# Run processor workflow
from intccms.analysis.processor import UnifiedProcessor

with acquire_client(AF, close_after=AUTO_CLOSE_CLIENT, pip_packages=WORKER_DEPENDENCIES, propagate_aws_env=PROPAGATE_AWS) as (client, cluster):
    # Create processor instance for MetricsCollector
    unified_processor = UnifiedProcessor(
        config=validated_config,
        output_manager=output_manager,
        metadata_lookup=metadata_lookup,
    )

    t0 = time.perf_counter()
    output, report = run_processor_workflow(
        config=validated_config,
        output_manager=output_manager,
        metadata_lookup=metadata_lookup,
        workitems=workitems,
        executor=DaskExecutor(client=client, treereduction=8, retries=0),
        schema=NanoAODSchema,
    )
    t1 = time.perf_counter()


print(f"Processor workflow complete in {t1-t0:.1f} seconds!")

# Print summary
print(f"Total events processed: {output.get('processed_events', 0):,}")
print(f"Events after skim: {output.get('skimmed_events', 0):,}")
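
Beyond the raw counts, the same numbers give a skim efficiency and a throughput. This is a small sketch that assumes only the two `output` keys used above; `skim_summary` is a hypothetical helper and the numbers in the usage line are invented.

```python
def skim_summary(output, elapsed_seconds):
    """Derive skim efficiency and event throughput from the processor output."""
    processed = output.get("processed_events", 0)
    skimmed = output.get("skimmed_events", 0)
    return {
        # Guard against division by zero for empty runs.
        "efficiency": skimmed / processed if processed else 0.0,
        "events_per_second": processed / elapsed_seconds if elapsed_seconds > 0 else 0.0,
    }


print(skim_summary({"processed_events": 1000, "skimmed_events": 250}, 4.0))
```

In the cell above this could be called as `skim_summary(output, t1 - t0)`.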

## Verification

Read back all skimmed files and verify that the event counts match. The cell below is currently commented out; uncomment it to run the check.

In [None]:
# from intccms.skimming.fileset_manager import FilesetManager

# skim_output = validated_config.preprocess.skimming.output
# fileset_manager = FilesetManager(
#     skimmed_dir=output_manager.skimmed_dir,
#     format=skim_output.format,
# )

# datasets = list(set(md["dataset"] for md in metadata_lookup.values()))
# fileset = fileset_manager.build_fileset(datasets)

# print(fileset)

# reader = get_reader(skim_output.format)
# reader_kwargs = dict(skim_output.from_kwargs) if skim_output.from_kwargs else {}
# tree_name = validated_config.preprocess.skimming.tree_name 

# total_readback = 0
# for dataset_name, entry in fileset.items():
#     for path in entry["files"]:
#         events = reader.read(path, tree_name=tree_name, **reader_kwargs)
#         n = len(events)
#         total_readback += n
#         print(f"  [{dataset_name}] {path.split('/')[-1]}: {n:,} events")

# print(f"\nTotal read back:  {total_readback:,}")
# print(f"Total from skim:  {output.get('skimmed_events', 0):,}")
# assert total_readback == output.get("skimmed_events", 0), \
#     f"Event count mismatch! readback={total_readback} vs skim={output.get('skimmed_events', 0)}"
# print("Verification passed!")

## Second pass: Analysis on skimmed files

Re-run the processor using the skimmed fileset (`use_skimmed_input=True`). This skips re-reading original NanoAOD and loads the smaller skimmed files instead.

In [None]:
# Reconfigure for analysis on skimmed input
validated_config.general.use_skimmed_input = True
validated_config.general.save_skimmed_output = False
validated_config.general.run_analysis = True
validated_config.general.run_histogramming = True
validated_config.general.run_systematics = True

with acquire_client(AF, close_after=AUTO_CLOSE_CLIENT, pip_packages=WORKER_DEPENDENCIES, propagate_aws_env=PROPAGATE_AWS) as (client, cluster):
    t0 = time.perf_counter()
    analysis_output, analysis_report = run_processor_workflow(
        config=validated_config,
        output_manager=output_manager,
        metadata_lookup=metadata_lookup,
        workitems=None,  # Not needed: fileset is built from manifests
        executor=DaskExecutor(client=client, treereduction=8, retries=0),
        schema=NanoAODSchema,
    )
    t1 = time.perf_counter()

print(f"Analysis on skimmed files complete in {t1-t0:.1f} seconds!")
print(f"Total events processed: {analysis_output.get('processed_events', 0):,}")

In [None]:
analysis_report

In [None]:
if AF != "iterative" and not AUTO_CLOSE_CLIENT:
    client.close()