# Coffea Processors
Coffea relies mainly on [uproot](https://github.com/scikit-hep/uproot) to provide access to ROOT files for analysis.
Because a typical analysis processes tens to thousands of files, totalling gigabytes to terabytes of data, some work is needed to build a parallelized framework that processes the data in a reasonable amount of time.

From the beginning, coffea has provided a `coffea.processor` module that encapsulates the core functionality of an analysis, which can then be run locally or distributed via a number of Executors. This lets users concentrate on the actual analysis code rather than on how to implement efficient parallelization, assuming that the parallelization is a trivial map-reduce operation (e.g. filling histograms and adding them together). This API was unavailable for some time, but we have since brought it back.
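
The map-reduce pattern mentioned above can be sketched in plain Python, without coffea. This is only an illustration: `process_chunk`, `merge`, and the toy `chunks` are hypothetical stand-ins for a processor and its input data, and `collections.Counter` plays the role of an addable histogram.

```python
from collections import Counter
from functools import reduce


def process_chunk(chunk):
    """Map step: fill a small 'histogram' (bin lower edge -> count) for one chunk."""
    hist = Counter()
    for value in chunk:
        # Bins of width 10, chosen purely for illustration.
        hist[value // 10 * 10] += 1
    return {"entries": len(chunk), "hist": hist}


def merge(a, b):
    """Reduce step: add two partial results together."""
    return {"entries": a["entries"] + b["entries"], "hist": a["hist"] + b["hist"]}


# Three independent chunks, as if they came from three files.
chunks = [[1, 12, 15], [3, 27], [11, 29, 35, 8]]
partials = [process_chunk(c) for c in chunks]  # each call is independent of the others
total = reduce(merge, partials)
print(total["entries"], dict(total["hist"]))
```

Because each chunk is processed independently and the partial results are combined only by addition, the map step can be handed to any parallel backend without changing the analysis code.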

In coffea 202x (CalVer), you also have the option of deeper integration with `dask` (via `dask_awkward` and `uproot.dask`). In this case, a task graph (TG) encapsulating the analysis is created, whether the analysis is to be executed on local or distributed resources. We will demonstrate how to use callable code to build these task graphs.

We'll always showcase both ways of using coffea to write and execute your analysis.

Let's start by writing a simple processor class that reads some CMS open data and plots a dimuon mass spectrum.
We'll start by copying the [ProcessorABC](https://coffea-hep.readthedocs.io/en/latest/api/coffea.processor.ProcessorABC.html#coffea.processor.ProcessorABC) skeleton and filling in some details:

 * Removing `flag`, as we won't use it
 * Adding a new histogram for $m_{\mu \mu}$
 * Building a [Candidate](https://coffea-hep.readthedocs.io/en/latest/api/coffea.nanoevents.methods.candidate.PtEtaPhiMCandidate.html#coffea.nanoevents.methods.candidate.PtEtaPhiMCandidate) record for muons, since we will read it with `BaseSchema` interpretation (the files used here could be read with `NanoAODSchema` but we want to show how to build vector objects from other TTree formats) 
 * Calculating the dimuon invariant mass
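
As a reference for the last step, the invariant mass of two objects given as $(p_T, \eta, \phi, m)$ can be worked out by hand. The sketch below uses only the standard library and is not how coffea computes it internally; `four_vector`, `invariant_mass`, and the example kinematics are hypothetical, with the muon values chosen so that the result lands near the Z boson mass.

```python
import math


def four_vector(pt, eta, phi, mass):
    """Convert (pt, eta, phi, mass) to Cartesian (E, px, py, pz)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return energy, px, py, pz


def invariant_mass(mu1, mu2):
    """Invariant mass of the sum of two (pt, eta, phi, mass) objects."""
    e1, px1, py1, pz1 = four_vector(*mu1)
    e2, px2, py2, pz2 = four_vector(*mu2)
    e, px, py, pz = e1 + e2, px1 + px2, py1 + py2, pz1 + pz2
    m2 = e**2 - (px**2 + py**2 + pz**2)
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative round-off


MUON_MASS = 0.10566  # GeV
mu_plus = (45.6, 0.0, 0.0, MUON_MASS)       # back-to-back muons in the transverse plane
mu_minus = (45.6, 0.0, math.pi, MUON_MASS)
m_mumu = invariant_mass(mu_plus, mu_minus)
print(f"{m_mumu:.2f} GeV")
```

In the processor, adding two `PtEtaPhiMCandidate` records and reading `.mass` performs the equivalent calculation for every selected event at once.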

In [None]:
import awkward as ak
import dask_awkward as dak
from coffea import processor
from coffea.nanoevents.methods import candidate
import hist
import hist.dask
import dask

class MyProcessor(processor.ProcessorABC):
    def __init__(self, mode="virtual"):
        assert mode in ["eager", "virtual", "dask"]
        self._mode = mode

    def process(self, events):
        dataset = events.metadata['dataset']
        muons = ak.zip(
            {
                "pt": events.Muon_pt,
                "eta": events.Muon_eta,
                "phi": events.Muon_phi,
                "mass": events.Muon_mass,
                "charge": events.Muon_charge,
            },
            with_name="PtEtaPhiMCandidate",
            behavior=candidate.behavior,
        )

        if self._mode == "dask":
            hist_class = hist.dask.Hist
        else:
            hist_class = hist.Hist
        h_mass = (
            hist_class.new
            .StrCat(["opposite", "same"], name="sign")
            .Log(1000, 0.2, 200., name="mass", label=r"$m_{\mu\mu}$ [GeV]")
            .Int64()
        )

        cut = (ak.num(muons) == 2) & (ak.sum(muons.charge, axis=1) == 0)
        # add first and second muon in every event together
        dimuon = muons[cut][:, 0] + muons[cut][:, 1]
        h_mass.fill(sign="opposite", mass=dimuon.mass)

        cut = (ak.num(muons) == 2) & (ak.sum(muons.charge, axis=1) != 0)
        dimuon = muons[cut][:, 0] + muons[cut][:, 1]
        h_mass.fill(sign="same", mass=dimuon.mass)

        if self._mode == "dask":
            return {
                "entries": ak.num(events, axis=0),
                "mass": h_mass,
            }
        else:
            return {
                dataset: {
                    "entries": len(events),
                    "mass": h_mass,
                }
            }
    
    def postprocess(self, accumulator):
        pass

If we were to just use bare uproot to execute this processor, we could do that with the following example, which:

 * Opens a CMS open data file
 * Creates a NanoEvents object using `BaseSchema` (roughly equivalent to the output of reading with plain `uproot`)
 * Creates a `MyProcessor` instance
 * Runs the `process()` function, which returns our accumulators

In [None]:
from coffea.nanoevents import NanoEventsFactory, BaseSchema
import matplotlib.pyplot as plt

In [None]:
filename = "root://xcache//store/user/ncsmith/opendata_mirror/Run2012B_DoubleMuParked.root"
access_log = []
events = NanoEventsFactory.from_root(
    {filename: "Events"},
    entry_stop=500_000,
    metadata={"dataset": "DoubleMuon"},
    schemaclass=BaseSchema,
    mode="virtual",
    access_log=access_log,
).events()
p = MyProcessor("virtual")
out = p.process(events)
out, access_log

In [None]:
%%time
out = p.process(events)
out

In [None]:
access_log

In [None]:
fig, ax = plt.subplots()
out["DoubleMuon"]["mass"].plot1d(ax=ax)
ax.set_xscale("log")
ax.legend(title="Dimuon charge")
plt.show()

In [None]:
filename = "root://xcache//store/user/ncsmith/opendata_mirror/Run2012B_DoubleMuParked.root"
events = NanoEventsFactory.from_root(
    {filename: {"object_path": "Events", "steps": [[0, 500_000]]}},
    metadata={"dataset": "DoubleMuon"},
    schemaclass=BaseSchema,
    mode="dask",
).events()
p = MyProcessor("dask")
taskgraph = p.process(events)
taskgraph

In [None]:
dask.visualize(taskgraph, rankdir="LR", optimize_graph=False)

In [None]:
dask.visualize(taskgraph, rankdir="LR", optimize_graph=True)

In [None]:
%%time
(out,) = dask.compute(taskgraph)
out

In [None]:
dak.necessary_columns(taskgraph)

In [None]:
fig, ax = plt.subplots()
out["mass"].plot1d(ax=ax)
ax.set_xscale("log")
ax.legend(title="Dimuon charge")
plt.show()

# Filesets
We'll need to construct a fileset to run over.

## Users without xcache access
Uncomment the `eospublic` files in the following dictionary and comment out the `xcache` files, so that you still have one file per dataset (`DoubleMuon` and `ZZ to 4mu`). The `eospublic` files should be reachable from anywhere.
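
The same swap can also be done programmatically. The helper below is a hypothetical convenience, not part of coffea; it assumes the `dataset -> "files" -> {url: treename}` layout used in this notebook and simply rewrites the URL prefix.

```python
import copy

XCACHE_PREFIX = "root://xcache//store/user/ncsmith/opendata_mirror/"
EOSPUBLIC_PREFIX = "root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/"


def use_public_mirror(fileset):
    """Return a copy of `fileset` with xcache URLs rewritten to eospublic URLs."""
    new_fileset = copy.deepcopy(fileset)
    for dataset in new_fileset.values():
        dataset["files"] = {
            url.replace(XCACHE_PREFIX, EOSPUBLIC_PREFIX, 1): tree
            for url, tree in dataset["files"].items()
        }
    return new_fileset


example = {
    "DoubleMuon": {
        "files": {XCACHE_PREFIX + "Run2012B_DoubleMuParked.root": "Events"},
        "metadata": {"is_mc": False},
    }
}
public = use_public_mirror(example)
print(list(public["DoubleMuon"]["files"]))
```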

In [None]:
initial_fileset = {
    "DoubleMuon": {
        "files": {
            "root://xcache//store/user/ncsmith/opendata_mirror/Run2012B_DoubleMuParked.root": "Events",
            #"root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/Run2012B_DoubleMuParked.root": "Events",
        },
        "metadata": {
            "is_mc": False,
        },
    },
    "ZZ to 4mu": {
        "files": {
            "root://xcache//store/user/ncsmith/opendata_mirror/ZZTo4mu.root": "Events",
            #"root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/ZZTo4mu.root": "Events",
        },
        "metadata": {
            "is_mc": True,
        }
    }
}

# Processing with Virtual mode

With `processor.Runner`, preprocessing is hidden inside the interface.

In [None]:
%%time
iterative_run = processor.Runner(
    executor=processor.IterativeExecutor(compression=None),
    schema=BaseSchema,
    maxchunks=3,
    savemetrics=True,
)

out, metrics = iterative_run(
    initial_fileset,
    processor_instance=MyProcessor("virtual"),
)

In [None]:
out

In [None]:
fig, ax = plt.subplots()
out["DoubleMuon"]["mass"].plot1d(ax=ax)
ax.set_xscale("log")
ax.legend(title="Dimuon charge")
plt.show()

Now, if we want to use more than a single core on our machine, we simply swap `IterativeExecutor` for `FuturesExecutor`, which uses Python's `concurrent.futures` standard library module. We can then set the most interesting argument of `FuturesExecutor`: `workers`, the number of cores to use.
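
To give a feel for what such an executor does, here is a minimal sketch of mapping chunks onto a `concurrent.futures` pool and reducing the partial results. It is not coffea's implementation: `process_chunk`, `merge`, and the toy `chunks` are hypothetical, and a thread pool is used only to keep the sketch easy to run anywhere.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def process_chunk(chunk):
    """Stand-in for a processor: count entries and sum values in one chunk."""
    return {"entries": len(chunk), "sum": sum(chunk)}


def merge(a, b):
    """Add two partial results key by key."""
    return {key: a[key] + b[key] for key in a}


chunks = [list(range(0, 100)), list(range(100, 250)), list(range(250, 300))]

# Map the chunks onto a pool of workers, then reduce the partial outputs.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_chunk, chunks))
result = reduce(merge, partials)
print(result)
```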

In [None]:
%%time
futures_run = processor.Runner(
    executor=processor.FuturesExecutor(workers=4, compression=None),
    schema=BaseSchema,
    savemetrics=True,
)

out, metrics = futures_run(
    initial_fileset,
    processor_instance=MyProcessor("virtual")
)

In [None]:
out

In [None]:
fig, ax = plt.subplots()
out["DoubleMuon"]["mass"].plot1d(ax=ax)
ax.set_xscale("log")
ax.legend(title="Dimuon charge")
plt.show()

# Processing with Dask mode

# Preprocessing
There are dataset discovery tools inside of coffea to help construct such datasets. Those will not be demonstrated here. For now, we'll take the above `initial_fileset` and preprocess it.
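
Conceptually, preprocessing records how many entries each file holds and divides them into `[start, stop]` ranges ("steps") of roughly `step_size` entries. The helper below is a simplified illustration with a hypothetical name, `make_steps`; coffea's actual calculation may differ, for instance by balancing step sizes or aligning them to cluster boundaries.

```python
def make_steps(num_entries, step_size):
    """Split [0, num_entries) into consecutive [start, stop] ranges."""
    return [
        [start, min(start + step_size, num_entries)]
        for start in range(0, num_entries, step_size)
    ]


# A hypothetical file with 250,000 entries, split into steps of 100,000.
steps = make_steps(num_entries=250_000, step_size=100_000)
print(steps)
```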

In [None]:
from coffea.dataset_tools import apply_to_fileset, max_chunks, max_files, preprocess

In [None]:
preprocessed_available, preprocessed_total = preprocess(
        initial_fileset,
        step_size=100_000,
        align_clusters=False,
        skip_bad_files=True,
        recalculate_steps=False,
        files_per_batch=1,
        file_exceptions=(OSError,),
        save_form=False,
        uproot_options={},
        step_size_safety_factor=0.5,
    )

# Preprocessed fileset
Let's have a look at the contents of the `preprocessed_available` part of the fileset.

In [None]:
preprocessed_available

# Saving a preprocessed fileset
We can use the `gzip`, `pickle`, and `json` modules to save and reload filesets directly. The short example below uses `gzip` and `json`.

In [None]:
import gzip, pickle, json
output_file = "example_fileset"
with gzip.open(f"{output_file}_available.json.gz", "wt") as file:
    json.dump(preprocessed_available, file, indent=2)
    print(f"Saved available fileset chunks to {output_file}_available.json.gz")
with gzip.open(f"{output_file}_all.json.gz", "wt") as file:
    json.dump(preprocessed_total, file, indent=2)
    print(f"Saved complete fileset chunks to {output_file}_all.json.gz")

We could then reload these filesets and quickly pick up where we left off. Often we'll want to preprocess again shortly before analyzing data, because this lets us catch which files are currently accessible and which are not. The saved filesets may be useful for tracking, and we may have enough stability to reuse them for some period of time.
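
One way to track accessibility is to compare the complete fileset against the available one. The sketch below uses toy dictionaries with hypothetical file names and a hypothetical helper, `missing_files`. It assumes the `dataset -> "files" -> filename` layout and that inaccessible files are simply absent from the available fileset.

```python
def missing_files(total, available):
    """Map each dataset to the files listed in `total` but not in `available`."""
    missing = {}
    for dataset, info in total.items():
        available_files = set(available.get(dataset, {}).get("files", {}))
        lost = sorted(set(info["files"]) - available_files)
        if lost:
            missing[dataset] = lost
    return missing


# Toy filesets with hypothetical file names.
total = {
    "DoubleMuon": {"files": {"a.root": {}, "b.root": {}}},
    "ZZ to 4mu": {"files": {"c.root": {}}},
}
available = {
    "DoubleMuon": {"files": {"a.root": {}}},
    "ZZ to 4mu": {"files": {"c.root": {}}},
}
print(missing_files(total, available))
```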

In [None]:
with gzip.open(f"{output_file}_available.json.gz", "rt") as file:
    reloaded_available = json.load(file)
with gzip.open(f"{output_file}_all.json.gz", "rt") as file:
    reloaded_all = json.load(file)

# Slicing chunks and files
Given this preprocessed fileset, we can test our processor on just a few chunks of a handful of files. To do this, we use the `max_files` and `max_chunks` functions from the dataset tools.
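
To make the effect of this slicing concrete, here is a simplified pure-Python sketch on a toy fileset. `keep_first_files` and `keep_first_chunks` are hypothetical re-implementations for illustration only, not the coffea functions, and the toy dictionary assumes each file carries an `object_path` and a list of `steps`.

```python
import copy


def keep_first_files(fileset, n):
    """Keep at most the first `n` files of every dataset."""
    out = copy.deepcopy(fileset)
    for info in out.values():
        info["files"] = dict(list(info["files"].items())[:n])
    return out


def keep_first_chunks(fileset, n):
    """Keep at most the first `n` steps of every file."""
    out = copy.deepcopy(fileset)
    for info in out.values():
        for file_info in info["files"].values():
            file_info["steps"] = file_info["steps"][:n]
    return out


toy = {
    "DoubleMuon": {
        "files": {
            "a.root": {"object_path": "Events", "steps": [[0, 10], [10, 20], [20, 30]]},
            "b.root": {"object_path": "Events", "steps": [[0, 10]]},
        }
    }
}
small = keep_first_chunks(keep_first_files(toy, 1), 2)
print(small)
```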

In [None]:
test_preprocessed_files = max_files(preprocessed_available, 1)
test_preprocessed = max_chunks(test_preprocessed_files, 3)

In [None]:
test_preprocessed

In [None]:
small_tg, small_rep = apply_to_fileset(
    data_manipulation=MyProcessor("dask"),
    fileset=test_preprocessed,
    schemaclass=BaseSchema,
    uproot_options={"allow_read_errors_with_report": (OSError, ValueError)},
)

In [None]:
dask.visualize(small_tg, optimize_graph=True)

In [None]:
%%time
small_computed, small_rep_computed = dask.compute(small_tg, small_rep)

In [None]:
small_rep_computed['DoubleMuon']

In [None]:
small_computed

In [None]:
fig, ax = plt.subplots()
small_computed["DoubleMuon"]["mass"].plot1d(ax=ax)
ax.set_xscale("log")
ax.legend(title="Dimuon charge")
plt.show()

In [None]:
full_tg, rep = apply_to_fileset(
    data_manipulation=MyProcessor("dask"),
    fileset=preprocessed_available,
    schemaclass=BaseSchema,
    uproot_options={"allow_read_errors_with_report": (OSError, ValueError)},
)

In [None]:
%%time
out, rep = dask.compute(full_tg, rep)

In [None]:
out

In [None]:
fig, ax = plt.subplots()
out["DoubleMuon"]["mass"].plot1d(ax=ax)
ax.set_xscale("log")
ax.legend(title="Dimuon charge")
plt.show()