# Title

## What is Coffea?

Coffea stands for *Columnar Object Framework For Effective Analysis.* It contains a variety of tools which help physicists perform their analyses in a columnar fashion. By "a columnar fashion," we mean that data is contained in numpy-like arrays upon which we can perform operations without calling an explicit event loop. Coffea's arrays are built on Awkward, and any Awkward operation will work on a Coffea data array. But, of course, Coffea isn't just Awkward! Its tools include:

* **NanoEvents** wraps data into an awkward array with physics object methods (such as LorentzVector methods). Depending on the schema used, additional arrays and cross-references may be generated. For NanoAODs, the default schema is [NanoAODSchema](https://coffeateam.github.io/coffea/api/coffea.nanoevents.NanoAODSchema.html#), but the exact details aren't important for the moment. Importantly, NanoEvents accesses data lazily with the help of uproot - data is not instantiated until it is needed!

* **Hists** are as they sound! Once you've accessed and manipulated your data with NanoEvents and awkward, you're going to want to make some plots. The hist object has the functionality you'll need to make ROOT-like histograms.

* **Processors** are Coffea's way of wrapping up an analysis in a way that is deployment-neutral. Once you've turned your analysis into a processor, you can run it on a variety of executors (e.g. Dask, Parsl, Spark) without making any dramatic changes to the analysis itself. This makes scale-out simple and dynamic.

* **Lookup tools** are available in Coffea for any corrections that need to be made to physics data. These tools read a variety of correction file formats and turn them into lookup tables. As corrections are experiment-specific, we won't be covering this aspect in this tutorial.

Conveniently, these tools build on each other. You can't use a hist without data, and you can't use a processor without an analysis (which is, loosely, data + hists). Thus, the above serves as the agenda for our tutorial today!
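To make "columnar" concrete before we dive in, here is a minimal NumPy-only sketch (not Coffea itself) that counts entries above a threshold twice: once with an explicit loop and once with a single array expression. The toy `pt` values, the seed, and the threshold of 10 are made up for illustration.

```python
import numpy as np

# Toy data: one made-up pT value per "event" (not taken from any real file).
rng = np.random.default_rng(seed=0)
pt = rng.exponential(scale=20.0, size=1000)

# Event-loop style: visit every entry and test it by hand.
n_pass_loop = 0
for value in pt:
    if value > 10:
        n_pass_loop += 1

# Columnar style: one expression acts on the whole array at once.
n_pass_columnar = int(np.count_nonzero(pt > 10))

print(n_pass_loop, n_pass_columnar)
```

Both approaches give the same count; the columnar version simply hands the loop over to the array library.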

## A Motivating Comparison

Placeholder. Showcase coffea vs. event loop to motivate why we'd want to use it.
* Find a suitable analysis for this.
* Maybe show the analysis we'll build through the tutorial, just to also show "where we'll end up?"
* If we do make a comparison here, then imports/NanoAODSchema warning messages will be here, so they remain here for now.

In [2]:
from coffea.nanoevents import NanoEventsFactory, NanoAODSchema
from coffea import hist
import awkward as ak

# NanoEvents will try to build crossrefs that aren't in our file! Silence this as it's irrelevant for our purposes.
NanoAODSchema.warn_missing_crossrefs = False

## **NanoEvents**: Data, Awkwardly

As mentioned in the overview, NanoEvents bundles data with physical meaning. We input some nTuple file and we get an awkward array with the desirable physics methods.

To use NanoEvents, we of course need data. (Insert information about the file we'll be using. Currently, we're using a placeholder.)

The simplest way to access our data is to use NanoEventsFactory. Note that I will only take the first 100000 events in our file - this ensures we don't end up waiting for blocks to process. You can remove this constraint if you desire.

In [4]:
events = NanoEventsFactory.from_root("root://eospublic.cern.ch//eos/root-eos/benchmark/Run2012B_SingleMu.root", entry_stop=100000).events()

What does our data look like? Well, it's an awkward array! If you missed the talk about awkward, you can imagine an awkward array as a numpy array that can handle jagged (non-rectangular) data. We won't dive too deep into its details - but we will at least explore the structure of our NanoEvents array. Let's look at <code>events</code>:

In [6]:
events

<NanoEventsArray [<event 194711:299:263142897>, ... ] type='100000 * event'>

What does this mean? Well, we have an array of size 100000 which is populated with <code>event</code> entries. Each <code>event</code> contains a variable amount of the data we are interested in. We can see the data that it contains by exploring its <code>fields</code>.

In [9]:
events.fields

['PV',
 'MET',
 'Muon',
 'luminosityBlock',
 'run',
 'Electron',
 'Photon',
 'Jet',
 'event',
 'HLT',
 'Tau']

As this is a muon dataset, let's look at <code>Muon</code>.

In [12]:
events.Muon

<MuonArray [[Muon], [Muon], ... [Muon, Muon]] type='100000 * var * muon'>

Looking at <code>events.Muon</code> we begin to see the jagged structure of our events array. Each subarray in <code>MuonArray</code> represents one event (note that the size of <code>MuonArray</code> is still 100000). Then, each event can have one Muon, n Muons, or none in its subarray. Of course, this is what we expect to be the case physically, but we now have a data structure that can represent it. That's the beauty of awkward!

To drive the point home, let's count up the number of muons across all of our events.

In [30]:
import awkward as ak

ak.sum(ak.num(events.Muon, axis=-1))

136481

A quick note about axes in Awkward: <code>axis=0</code> is always the shallowest (outermost) level, while <code>axis=-1</code> is the deepest. In other words, <code>ak.num</code> with <code>axis=0</code> would tell us the number of subarrays (events), while <code>axis=-1</code> counts the muons in each subarray; <code>ak.sum</code> then adds up those per-event counts. In this case, <code>axis=1</code> works just as well; <code>axis=2</code> is meaningless since our subarrays have no further subarrays within them.

Now we have some idea about the structure of our data. How do we actually get information about a <code>Muon</code>, though? Well, a <code>Muon</code> has its own fields:

In [15]:
events.Muon.fields

['charge',
 'dxy',
 'dxyErr',
 'dz',
 'dzErr',
 'eta',
 'genPartIdx',
 'jetIdx',
 'jetIdxG',
 'mass',
 'pfRelIso03_all',
 'pfRelIso04_all',
 'phi',
 'pt',
 'softId',
 'tightId']

And these fields can be accessed in the same way that we accessed Muons! The p<sub>T</sub>, for example:

In [16]:
events.Muon.pt

<Array [[12.8], [24.8], ... [14.8, 4.89]] type='100000 * var * float32[parameter...'>

Note, as a sanity check, that the shape of this array is the same as that of the Muon array above. Indeed:

In [17]:
ak.sum(ak.num(events.Muon.pt))

136481

Now that we know how to access data, we can manipulate it as we desire in the standard awkward way. Most cuts in columnar analysis are achieved through masking. In short, a mask is a Boolean array generated by applying a condition to a data array. For example, if we want only muons with p<sub>T</sub> > 10 GeV, our mask would be:

In [20]:
events.Muon.pt > 10

<Array [[True], [True], ... [True, False]] type='100000 * var * bool'>

Then, we can apply the mask to our data. This will pick out only the elements of our data which correspond to a <code>True</code>. The data and the mask thus must have the same shape up to the depth of the selection. Since we're making a selection on muons, the mask must have the exact same shape as the data. If we made a selection on events, the mask should be flat with size 100000. The shape of the output array will differ from the data and mask arrays since we are downselecting data.

In [22]:
events.Muon.pt[events.Muon.pt > 10]

<Array [[12.8], [24.8], ... [38.7], [14.8]] type='100000 * var * float32[paramet...'>

Note that we still have 100000 subarrays: we only made a selection on muons, so all 100000 events remain. If none of an event's muons passed the cut, that event now simply has an empty subarray.

Compare the output array to our original data array. The last Muon, with a pT of 4.89, is no longer present in the array. We'd expect to have fewer muons overall. Let's take another count!

In [23]:
ak.sum(ak.num(events.Muon.pt[events.Muon.pt > 10]))

109624

Conversely, we can examine the muons whose p<sub>T</sub> is less than 10.

In [24]:
events.Muon.pt[events.Muon.pt < 10]

<Array [[], [], [6.99, 3.35, ... [], [4.89]] type='100000 * var * float32[parame...'>

Here, we see the last Muon *is* present, as we'd expect. Provided no muon has a p<sub>T</sub> of exactly 10 (which neither mask would select), this array should hold 136481 - 109624 = 26857 muons in total.

In [25]:
ak.sum(ak.num(events.Muon.pt[events.Muon.pt < 10]))

26857

## **Hists**: AKA, Yet Another ZPeak

We have data, we can work with our data, and we undoubtedly want to move on from looking at code to looking at pretty colors.

Let's do a simple example and plot the dimuon mass of opposite-charge muon pairs. This should give us a peak at ~91.19 GeV, the mass of the Z boson, as many such dimuon pairs result from Z decays.

In [29]:
# Keep only events with exactly 2 muons whose charges sum to zero.
dimuons = events.Muon[(ak.num(events.Muon, axis=1) == 2) & (ak.sum(events.Muon.charge, axis=1) == 0)]
dimuons

<MuonArray [[Muon, Muon], ... [Muon, Muon]] type='11827 * var * muon'>

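Note that we can chain conditionals with <code>&</code>; each comparison needs its own parentheses because <code>&</code> binds more tightly than <code>==</code>. We could just as easily have applied two separate masks, one after the other.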

Our <code>dimuons</code> array should now contain only opposite-charge muon pairs. Let's check!

In [31]:
dimuons.charge

<Array [[-1, 1], [-1, 1], ... [1, -1], [1, -1]] type='11827 * var * int32[parame...'>

Note that this time, the mask performed a cut at the event level rather than the muon level. We have fewer events (only 11827 now), but every event that we kept still has all of its muons.

All we need now is the dimuon mass. Awkward arrays can be indexed much like NumPy arrays, so <code>dimuons[:, 0]</code> will select the first muon in every <code>dimuon</code> event. If we do any mathematical operations at the <code>MuonArray</code> level, NanoEvents ensures each <code>Muon</code> is treated as a LorentzVector object. That makes our life easy:

In [32]:
mumu_mass = (dimuons[:, 0] + dimuons[:, 1]).mass


mumu_mass

<Array [79.9, 2.78, 49.3, ... 24.1, 13, 44.5] type='11827 * float32'>
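For the curious, adding two Lorentz vectors and asking for the mass boils down to some standard kinematics. The NumPy sketch below is an illustration of that math, not how Coffea implements it; the helper name `dimuon_mass`, the approximate muon mass constant, and the input values are all assumptions made for the example.

```python
import numpy as np

def dimuon_mass(pt1, eta1, phi1, m1, pt2, eta2, phi2, m2):
    """Invariant mass of two particles, each given as (pt, eta, phi, mass)."""
    def four_vector(pt, eta, phi, m):
        px = pt * np.cos(phi)
        py = pt * np.sin(phi)
        pz = pt * np.sinh(eta)
        energy = np.sqrt(px**2 + py**2 + pz**2 + m**2)
        return energy, px, py, pz

    e1, px1, py1, pz1 = four_vector(pt1, eta1, phi1, m1)
    e2, px2, py2, pz2 = four_vector(pt2, eta2, phi2, m2)
    mass_squared = (
        (e1 + e2) ** 2
        - (px1 + px2) ** 2
        - (py1 + py2) ** 2
        - (pz1 + pz2) ** 2
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.sqrt(np.maximum(mass_squared, 0.0))

MUON_MASS = 0.1057  # GeV, approximate

# Two made-up pairs of back-to-back muons at eta = 0.
pt = np.array([45.6, 20.0])
zeros = np.zeros(2)
masses = dimuon_mass(pt, zeros, zeros, MUON_MASS,
                     pt, zeros, zeros + np.pi, MUON_MASS)
print(masses)
```

For back-to-back muons the momenta cancel, so the pair mass is just the sum of the two energies. Like the NanoEvents expression, the function works on whole arrays at once.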

We've effectively collapsed our subarrays by finding the mass of the pairs, so now we have a flat array with one entry per event in our <code>dimuons</code> array above (11827 in total). On to plotting!

## Broader Deployment: Processors

Go through a processor step by step
