# CMS search for a Z' boson in the single-lepton channel with full Run 2 dataset

This notebook demonstrates the Z' → tt̄ single-lepton analysis workflow on various analysis facilities (AFs), including the skimming, analysis, histogramming, and statistics steps.

## Workflow Overview

1. Set up the Python path for the `intccms` package
2. Install dependencies and register modules with `cloudpickle`
3. Acquire Dask client from AF environment
4. Configure analysis parameters
5. Run metadata extraction (`coffea` preprocessing)
6. Initialize analysis processor
7. Run processor with coffea.processor.Runner
8. Save histograms and run statistical analysis (if enabled)

## AF flag
We might want to run this code on different facilities, each of which may have its own limitations or require a different Dask client setup. To switch between facilities, set the `AF` variable to the one of your choice. If your facility is not supported yet, you can add it in the relevant sections of this notebook.

In [None]:
AF="coffeacasa-condor" # options currently supported: [coffeacasa-condor, coffeacasa-gateway, purdue-af]
# The client is created inside a context manager. If True, the client is closed automatically on exit; if False, you close it manually.
AUTO_CLOSE_CLIENT=False
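
Since a typo in `AF` would otherwise only surface later, during client setup, it can help to check the value right away. The following is a minimal sketch; `SUPPORTED_AFS` and `check_af` are hypothetical names introduced here for illustration and are not part of `intccms`.

```python
# Hypothetical helper, not part of intccms: fail fast on an unknown AF name.
SUPPORTED_AFS = ("coffeacasa-condor", "coffeacasa-gateway", "purdue-af")


def check_af(af: str) -> str:
    """Return the AF name unchanged, or raise if it is not supported yet."""
    if af not in SUPPORTED_AFS:
        options = ", ".join(SUPPORTED_AFS)
        raise ValueError(f"Unknown AF {af!r}; supported options: {options}")
    return af


print(check_af("coffeacasa-condor"))
```

When you add a new facility to the notebook, the tuple is the one place that would need extending in this sketch.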

## Imports and dependencies

### The intccms package
The CMS implementation of the integration challenge is organized in a package-like structure, which means we have to add the source code to the Python path. The package is referred to as `intccms`.

In [None]:
# Setup Python path to include intccms package
import sys
from pathlib import Path

# Add src directory to Python path
repo_root = Path.cwd()
src_dir = repo_root / "src"
examples_dir = repo_root
if str(src_dir) not in sys.path:
    sys.path.insert(0, str(src_dir))
if str(examples_dir) not in sys.path:
    sys.path.insert(0, str(examples_dir))
print(f"✅ Added {src_dir} to Python path")
print(f"✅ Added {examples_dir} to Python path")
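
The cell above prints its confirmation message whether or not the path was newly inserted. If you want the message to reflect what actually happened, one possible variant is sketched below; `prepend_to_sys_path` is a hypothetical helper name, not something provided by `intccms`.

```python
import sys
from pathlib import Path


def prepend_to_sys_path(directory: Path) -> bool:
    """Put `directory` at the front of sys.path; return True if it was added."""
    entry = str(directory)
    if entry in sys.path:
        return False
    sys.path.insert(0, entry)
    return True


src_dir = Path.cwd() / "src"
status = "added" if prepend_to_sys_path(src_dir) else "already present"
print(f"{src_dir}: {status}")
```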

### Installing extra dependencies
The `intccms` package requires `omegaconf` and `roastcoffea`, which are not available by default on an AF. `roastcoffea` is a tool developed while working on this project; it provides an API to extract metrics from coffea-processor workflows.

In [None]:
try:
    import omegaconf
except ImportError:
    print("⚠️ omegaconf not found, installing...")
    ! pip install omegaconf;

try:
    import roastcoffea
except ImportError:
    print("⚠️ roastcoffea not found, installing...")
    ! pip install roastcoffea;
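
The same check can be written without importing the packages, and with a pip command tied to the interpreter that runs the kernel. This is a sketch under the assumption that only top-level module names are checked; `missing_packages` and `pip_install_command` are hypothetical helper names.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the subset of top-level module names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def pip_install_command(packages):
    """Build a pip command that targets the interpreter running this kernel."""
    return [sys.executable, "-m", "pip", "install", *packages]


missing = missing_packages(["omegaconf", "roastcoffea"])
if missing:
    # Pass this list to subprocess.check_call(...) to actually install.
    print("Would run:", " ".join(pip_install_command(missing)))
else:
    print("All extra dependencies are already importable.")
```

The sketch only prints the command, so it is safe to run on a shared facility before deciding whether to install anything.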

### Alternative coffea version
In some cases, we might need to install our own `coffea` version that is not available on the AF, for example when testing a new feature or using a recently released version that contains a fix.

In [None]:
COFFEA_VERSION = "2025.12.0"
COFFEA_PIP = f"coffea=={COFFEA_VERSION}" if "git" not in COFFEA_VERSION else COFFEA_VERSION

! pip install $COFFEA_PIP ;

# Pip-installable dependencies to install on workers
WORKER_DEPENDENCIES = [COFFEA_PIP, "roastcoffea==0.1.2"]
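
The `COFFEA_PIP` expression treats any value containing `git` as a complete pip target and pins everything else as a released version. A small sketch of that rule as a function follows; `coffea_pip_spec` is a hypothetical name, and the git URL is a placeholder rather than a real repository location.

```python
def coffea_pip_spec(version: str) -> str:
    """Mirror the COFFEA_PIP expression above as a reusable function."""
    # Anything containing "git" is treated as a complete pip target already.
    if "git" in version:
        return version
    return f"coffea=={version}"


print(coffea_pip_spec("2025.12.0"))
# The URL below is a placeholder, not a real repository location.
print(coffea_pip_spec("git+https://example.org/coffea.git@my-branch"))
```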

### Imports from stdlib and other libraries

In this notebook we use `dask` and `coffea`. 

In [None]:
# stdlib
import copy
import os
import time

# third-party
import cloudpickle
from coffea.processor import DaskExecutor, IterativeExecutor
from coffea.nanoevents import NanoAODSchema

### Imports from intccms and other integration-challenge specific tooling

In [None]:
# intccms
from intccms.schema import Config, load_config_with_restricted_cli
from intccms.utils.output import OutputDirectoryManager
from intccms.metadata_extractor import DatasetMetadataManager
from intccms.datasets import DatasetManager
from intccms.analysis import run_processor_workflow

### Registering packages with cloudpickle
The `intccms` package cannot be installed on the workers via `pip`, and neither can the Python modules that hold the configuration. We therefore register both with `cloudpickle`, so that Dask serializes them by value and ships them to the workers.

In [None]:
import intccms
import example_cms

# Register modules for cloud pickle
cloudpickle.register_pickle_by_value(intccms)
cloudpickle.register_pickle_by_value(example_cms)

## Dask client setup

This notebook uses the `DaskExecutor` from `coffea` to distribute the task graph on the AF. The client setup varies between facilities, so we use a function that returns the correct client for the chosen `AF`. The function is a context manager: the client is alive inside the `with` block.

In [None]:
from intccms.utils.dask_client import acquire_client, live_prints
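
To illustrate the context-manager pattern and the role of the close-on-exit flag, here is a self-contained sketch that uses a dummy client instead of Dask. It is not the `intccms` implementation of `acquire_client`; `DummyClient` and `acquire_dummy_client` are hypothetical names, and a real version would create a facility-specific client and cluster.

```python
from contextlib import contextmanager


class DummyClient:
    """Stand-in for a Dask client; it only tracks whether it is open."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


@contextmanager
def acquire_dummy_client(af, close_after=True):
    """Yield (client, cluster) and optionally close the client on exit."""
    client, cluster = DummyClient(), None  # a real setup would branch on `af`
    try:
        yield client, cluster
    finally:
        if close_after:
            client.close()


with acquire_dummy_client("coffeacasa-condor", close_after=True) as (client, cluster):
    print("inside the block, closed =", client.closed)
print("after the block, closed =", client.closed)
```

With `close_after=False` the client outlives the `with` block, which matches the manual-closing mode selected by `AUTO_CLOSE_CLIENT=False` earlier in the notebook.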

## Configuration Setup

The CMS analysis implementation is configured via Python modules, which we have to import. For this notebook, the configuration files are found in `example_cms/configs/`. You can edit the modules in this directory directly, or you can change settings dynamically by manipulating the configuration dictionary. Below are some settings that you might want to tune while testing your setup.

In [None]:
# intccms configuration import
from example_cms.configs.configuration import config as original_config

# Create a deepcopy that we can manipulate
config = copy.deepcopy(original_config)

# Limit files for testing
config["datasets"]["max_files"] = 5  # set to None to run over all available files

# Use local output directory
config["general"]["output_dir"] = "example_cms/outputs/"

# Preprocessing (coffea) can be executed once and results loaded
config["general"]["run_metadata_generation"] = False # If True, run analysis pre-processing

# Processor = Skimming (filter and save) + Analysis
config["general"]["run_processor"] = True  # If True, the coffea processor is executed
config["general"]["run_analysis"] = True # If True, the analysis part of the processor is executed
config["general"]["save_skimmed_output"] = False  # If True, skimmed events are saved to disk, otherwise filter executed on-the-fly

# Analysis = Systematics + histogramming + statistics
config["general"]["run_histogramming"] = True
config["general"]["run_systematics"] = True
config["general"]["run_statistics"] = False

# Datasets to process, by default this is all datasets
#config["general"]["processes"] = ["data"] 

cli_args = []  # the code can also be run from the CLI, which we do not use here
full_config = load_config_with_restricted_cli(config, cli_args)

# Validating with pydantic gives us a Config object whose settings have all been checked
validated_config = Config(**full_config)
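
The deep copy matters because the configuration is a nested dictionary: a shallow copy would share the inner dictionaries with `original_config`, so overrides would leak back into the imported module. The sketch below shows the difference on a toy dictionary whose keys mirror the ones used above; the values are made up.

```python
import copy

# Toy stand-in for the imported configuration; keys mirror the ones used above.
original_config = {
    "datasets": {"max_files": None},
    "general": {"run_statistics": False},
}

config = copy.deepcopy(original_config)
config["datasets"]["max_files"] = 5

# A shallow copy shares the nested dictionaries with the original.
shallow = dict(original_config)
shallow["general"]["run_statistics"] = True

print(original_config["datasets"]["max_files"])      # untouched by the deep copy
print(original_config["general"]["run_statistics"])  # changed through the shallow copy
```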

## Running the Workflow

Running the CMS integration challenge workflow is split into a few steps, with a modular design that allows us flexibility. The steps are:

1. Setting up output directories
2. Building an input dataset manager
3. Running or loading the coffea preprocessing
4. Running the coffea processor

### Output manager setup

In [None]:
output_manager = OutputDirectoryManager(
    root_output_dir=validated_config.general.output_dir,
    cache_dir=validated_config.general.cache_dir,
    metadata_dir=validated_config.general.metadata_dir,
    skimmed_dir=validated_config.general.skimmed_dir
)

### Configure Data Redirector (Optional)

Override the redirector that the config uses to access dataset files. This is useful for testing different storage backends. You can also change it in `example_cms/configs/skim.py`.

In [None]:
# Override redirector for all datasets
# Examples:
#   "root://xcache/"                    
#   "root://cmsxrootd.fnal.gov/"
#   "root://cms-xrd-global.cern.ch/"
REDIRECTOR = "root://xcache/"  # Change this to use a different redirector

print(f"Initial redirector for {validated_config.datasets.datasets[0].name}: {validated_config.datasets.datasets[0].redirector}")

# Apply to all datasets in config
for dataset in validated_config.datasets.datasets:
    dataset.redirector = REDIRECTOR

print(f"Redirector set to: {REDIRECTOR}")

# Verify the change
print(f"New redirector for {validated_config.datasets.datasets[0].name}: {validated_config.datasets.datasets[0].redirector}")
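
A redirector is the host part of an XRootD URL, and the logical file name is appended to it. How `intccms` combines the two is not shown in this notebook, so the following is only a sketch of the usual convention; `with_redirector` is a hypothetical helper and the file path is invented.

```python
def with_redirector(redirector: str, lfn: str) -> str:
    """Join an XRootD redirector and a logical file name into one URL."""
    # XRootD URLs conventionally have a double slash before the path.
    return redirector.rstrip("/") + "//" + lfn.lstrip("/")


# The file path below is made up for illustration.
lfn = "/store/mc/example/file.root"
for redirector in ("root://xcache/", "root://cms-xrd-global.cern.ch/"):
    print(with_redirector(redirector, lfn))
```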

### Input dataset manager setup

In [None]:
dataset_manager = DatasetManager(validated_config.datasets)

### Coffea preprocessing

In [None]:
metadata_generator = DatasetMetadataManager(
    dataset_manager=dataset_manager,
    output_manager=output_manager,
    config=validated_config,
)

if metadata_generator.generate_metadata:
    with acquire_client(AF, close_after=AUTO_CLOSE_CLIENT, pip_packages=WORKER_DEPENDENCIES) as (client, cluster):
        metadata_generator.run(executor=DaskExecutor(client=client))
else:
    metadata_generator.run()  # No client needed

# Build metadata lookup and extract workitems
metadata_lookup = metadata_generator.build_metadata_lookup()
workitems = metadata_generator.workitems

### Run Analysis Processor

In [None]:
# Run processor workflow
with acquire_client(AF, close_after=AUTO_CLOSE_CLIENT, pip_packages=WORKER_DEPENDENCIES) as (client, cluster):
    t0 = time.perf_counter()
    stop = live_prints(client)
    output, report = run_processor_workflow(
        config=validated_config,
        output_manager=output_manager,
        metadata_lookup=metadata_lookup,
        workitems=workitems,
        executor=DaskExecutor(client=client, treereduction=8, retries=0),
        schema=NanoAODSchema,
    )
    stop.set()
    t1 = time.perf_counter()


print(f"Processor workflow complete in {t1-t0:.1f} seconds!")

# Print summary
print(f"Total events processed: {output.get('processed_events', 0):,}")
print(f"Events after skim: {output.get('skimmed_events', 0):,}")

In [None]:
report
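
The two event counts printed in the summary can be combined into a skim efficiency, which is a quick sanity check when changing the skim selection. The sketch below assumes the `processed_events` and `skimmed_events` keys used in the summary; `skim_efficiency` is a hypothetical helper and the counts are invented.

```python
def skim_efficiency(output: dict) -> float:
    """Fraction of processed events that survive the skim (0.0 if none processed)."""
    processed = output.get("processed_events", 0)
    skimmed = output.get("skimmed_events", 0)
    return skimmed / processed if processed else 0.0


# Made-up counts, only to show the call.
toy_output = {"processed_events": 200_000, "skimmed_events": 5_000}
print(f"Skim efficiency: {skim_efficiency(toy_output):.2%}")
```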

## Systematic variation diagnostics

The cell below draws a grid of ratio-to-nominal plots for each MC process across all systematic variations. This is useful for verifying that year-decorrelated systematics have the correct nominal fills.

In [None]:
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
from intccms.utils.output import load_histograms_from_pickle

histograms_pkl = output_manager.histograms_dir / "processor_histograms.pkl"

if not histograms_pkl.exists():
    print(f"No saved histograms at {histograms_pkl} -- run the processor first.")
else:
    histograms = load_histograms_from_pickle(histograms_pkl)

    # Known year keys for grouping decorrelated variations
    corrections_cfg = validated_config.corrections
    years = sorted(corrections_cfg.keys(), key=len, reverse=True) if isinstance(corrections_cfg, dict) else []

    def get_base_name(source_name):
        """Strip year suffix to find the base variation name."""
        for year in years:
            if source_name.endswith(f"_{year}"):
                return source_name[: -len(f"_{year}")]
        return source_name

    for channel in validated_config.channels:
        channel_name = channel.name
        obs_name = channel.fit_observable
        if channel_name not in histograms or obs_name not in histograms[channel_name]:
            continue

        h = histograms[channel_name][obs_name]
        processes = sorted(p for p in h.axes["process"] if p != "data")
        all_variations = [v for v in h.axes["variation"] if v != "nominal"]

        if not processes or not all_variations:
            continue

        # Each systematic contributes an "up" and a "down" variation, so half the count is the number of systematics
        print(f"Channel: {channel_name} | Processes: {len(processes)} | Systematic variations: {len(all_variations) // 2}")

        # Group variations by base name (year-decorrelated share a row)
        groups = {}
        for var in all_variations:
            if var.endswith("_up"):
                source = var[:-3]
            elif var.endswith("_down"):
                source = var[:-5]
            else:
                source = var
            base = get_base_name(source)
            groups.setdefault(base, set()).add(var)

        group_names = sorted(groups)
        nrows = len(group_names)
        ncols = len(processes)
        fig, axes = plt.subplots(
            nrows, ncols,
            figsize=(5 * ncols, 2.5 * nrows),
            squeeze=False,
        )

        for row, base_name in enumerate(group_names):
            var_names = sorted(groups[base_name])

            # Collect all ratios across processes for this row to set shared y-limits
            row_ratios = []
            for col, proc in enumerate(processes):
                ax = axes[row][col]
                nom_vals = h[{"process": proc, "variation": "nominal"}].values(flow=False)
                bin_centers = h.axes["observable"].centers

                for var in var_names:
                    var_vals = h[{"process": proc, "variation": var}].values(flow=False)
                    with np.errstate(divide="ignore", invalid="ignore"):
                        ratio = np.where(nom_vals > 0, var_vals / nom_vals, 1.0)
                    ax.step(bin_centers, ratio, where="mid", label=var, linewidth=0.8)
                    row_ratios.append(ratio)

                ax.axhline(1.0, color="black", linestyle="--", linewidth=0.5)

                if row == 0:
                    ax.set_title(proc, fontsize=10)
                if row == nrows - 1:
                    ax.set_xlabel(h.axes["observable"].label)
                else:
                    ax.set_xticklabels([])
                if col == 0:
                    ax.set_ylabel(f"{base_name}\nvar / nom", fontsize=8)

                ax.legend(fontsize=5, ncol=2, loc="upper right")

            # Symmetric y-limits from the max deviation across the whole row
            all_ratios = np.concatenate(row_ratios)
            max_dev = max(np.nanmax(np.abs(all_ratios - 1.0)), 0.01)
            margin = max_dev * 1.2
            for col in range(ncols):
                axes[row][col].set_ylim(1.0 - margin, 1.0 + margin)

        fig.tight_layout()
        plt.show()
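
The grouping step in the cell above decides which variations share a row in the grid. The following standalone sketch reproduces that idea on invented variation names, so the effect of stripping the `_up`/`_down` and year suffixes is easy to see; `split_variation` is a hypothetical helper, and the year keys are assumptions rather than values taken from the configuration.

```python
def split_variation(name, years):
    """Return (base, direction) with the _up/_down and year suffixes removed."""
    direction = ""
    for suffix in ("_up", "_down"):
        if name.endswith(suffix):
            name, direction = name[: -len(suffix)], suffix[1:]
            break
    # Try longer year keys first so that overlapping keys strip correctly.
    for year in sorted(years, key=len, reverse=True):
        if name.endswith(f"_{year}"):
            name = name[: -len(year) - 1]
            break
    return name, direction


# Variation and year names below are invented for the example.
years = ["2016", "2017", "2018"]
variations = ["pileup_up", "pileup_down", "btag_2017_up", "btag_2017_down", "btag_2018_up"]

groups = {}
for var in variations:
    base, _ = split_variation(var, years)
    groups.setdefault(base, []).append(var)

for base, members in sorted(groups.items()):
    print(base, "->", members)
```

In this toy input, the year-decorrelated `btag` variations end up in one group, while `pileup` keeps its own.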