# CoDaS-HEP Columnar Data Analysis, part 2

This is the second of three notebooks on [columnar data analysis](https://indico.cern.ch/event/1151367/timetable/#41-columnar-data-analysis), presented at CoDaS-HEP at 12:30pm on August 3, 2022 by Jim Pivarski and Ioana Ifrim.

See the [GitHub repo](https://github.com/jpivarski-talks/2022-08-03-codas-hep-columnar-tutorial#readme) for instructions on how to run it.

<br><br><br><br><br>

## From ROOT files into arrays

Physics data are in ROOT files. For columnar analysis, we'll need to get the data into arrays.

In [None]:
import ROOT

In [None]:
rdf = ROOT.RDataFrame("Events", "data/SMHiggsToZZTo4L.root")

In [None]:
rdf.AsNumpy(["MET_pt", "MET_phi"])

<br><br><br><br><br>

But for variable-length data, such as particle quantities in events with arbitrarily many particles, the NumPy array has `dtype=object`.

In [None]:
muon_quantities = rdf.AsNumpy(["Muon_pt", "Muon_phi"])
muon_quantities

<br><br><br><br><br>

NumPy `dtype=object` arrays are essentially lists: every element is a Python object and NumPy does not know the internal structure.
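As a small illustration, the sketch below builds a `dtype=object` array by hand. The values are made up (not taken from the file) and stand in for three events with 2, 0, and 1 muons. NumPy sees only a one-dimensional array of opaque Python objects.

```python
import numpy as np

# Three hypothetical events with 2, 0, and 1 muon pT values (made-up numbers).
ragged = np.empty(3, dtype=object)
ragged[0] = [10.5, 20.25]
ragged[1] = []
ragged[2] = [33.0]

# NumPy only knows about the outer dimension; the inner lists are opaque.
print(ragged.shape)     # (3,)
print(ragged.dtype)     # object
print(type(ragged[0]))  # <class 'list'>

# Anything that depends on the inner structure needs a Python-level loop.
counts = [len(event) for event in ragged]
print(counts)           # [2, 0, 1]
```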

On the one hand, this limits speed of calculation (notice the units):

In [None]:
import numpy as np

numeric_array = np.arange(500000, dtype=np.int64)
python_objects = np.array(range(500000), dtype=object)

In [None]:
%%timeit

numeric_array**2

In [None]:
%%timeit

python_objects**2
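The `%%timeit` magic only works in IPython or Jupyter. Outside of a notebook, roughly the same comparison can be sketched with the standard library's `timeit` module. The absolute numbers depend on the machine, so none are quoted here.

```python
import timeit

import numpy as np

numeric_array = np.arange(500000, dtype=np.int64)
python_objects = np.array(range(500000), dtype=object)

# Time a few repetitions of each expression and keep the best one.
t_numeric = min(timeit.repeat(lambda: numeric_array**2, number=1, repeat=3))
t_objects = min(timeit.repeat(lambda: python_objects**2, number=1, repeat=3))

print(f"int64 array:  {t_numeric * 1e3:.3f} ms")
print(f"object array: {t_objects * 1e3:.3f} ms")

# Both expressions produce the same values; only the speed differs.
same_values = bool((numeric_array**2 == python_objects**2).all())
print(same_values)
```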

And on the other hand, it limits expressiveness:

In [None]:
numeric_array = np.empty((2, 2, 3), dtype=np.int64)
python_objects = np.empty(2, dtype=object)

numeric_array[:] = [[[1, 2, 3], [4, 5, 6]], [[10, 20, 30], [40, 50, 60]]]
python_objects[:] = [[[1, 2, 3], [4, 5, 6]], [[10, 20, 30], [40, 50, 60]]]

In [None]:
numeric_array[:, :, 1:]    # drop first element from innermost dimension

In [None]:
python_objects[:, :, 1:]   # can't manipulate anything past the first dimension
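The failing slice can be made explicit by catching the exception, as in the sketch below. It fills the object array one element at a time (a choice made here for clarity) and then spells out the same selection with nested list comprehensions.

```python
import numpy as np

python_objects = np.empty(2, dtype=object)
python_objects[0] = [[1, 2, 3], [4, 5, 6]]
python_objects[1] = [[10, 20, 30], [40, 50, 60]]

# The object array is one-dimensional, so a three-dimensional slice fails.
try:
    python_objects[:, :, 1:]
    slice_failed = False
except IndexError as err:
    slice_failed = True
    print("IndexError:", err)

# The same selection has to be spelled out with Python-level loops.
dropped_first = [[inner[1:] for inner in outer] for outer in python_objects]
print(dropped_first)
```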

<br><br><br><br><br>

Thus, to compute $p_x$ and $p_y$ from muon $p_T$ and $\phi$ in Python, we'd have to drop down into imperative loops or (functional) list comprehensions:

In [None]:
all_pt, all_phi = muon_quantities["Muon_pt"], muon_quantities["Muon_phi"]

all_px = np.array([[pt * np.cos(phi) for pt, phi in zip(event_pt, event_phi)] for event_pt, event_phi in zip(all_pt, all_phi)], dtype=object)
all_py = np.array([[pt * np.sin(phi) for pt, phi in zip(event_pt, event_phi)] for event_pt, event_phi in zip(all_pt, all_phi)], dtype=object)

all_px, all_py

<br><br><br><br><br>

However, if we instead read the file with Uproot (to be described later),

In [None]:
import uproot

In [None]:
with uproot.open("data/SMHiggsToZZTo4L.root:Events") as events:
    muon_quantities2 = events.arrays(["Muon_pt", "Muon_phi"])

muon_quantities2.show()

Operations on these Awkward Arrays (to be described later) can be expressed in an array-oriented way:

In [None]:
all_pt, all_phi = muon_quantities2["Muon_pt"], muon_quantities2["Muon_phi"]

all_px = all_pt * np.cos(all_phi)
all_py = all_pt * np.sin(all_phi)

all_px, all_py
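One way to see why variable-length data can still be handled in an array-oriented way is to store all the muons in one flat array, plus offsets that mark where each event starts and stops. The NumPy-only sketch below uses made-up values and illustrates the idea only; it is not Awkward Array's API, and the names `offsets`, `flat_pt`, and `flat_phi` are hypothetical.

```python
import numpy as np

# Made-up muon values for three events with 2, 0, and 1 muons.
# Event i spans the flat arrays from offsets[i] to offsets[i + 1].
offsets = np.array([0, 2, 2, 3])
flat_pt = np.array([30.0, 20.0, 50.0])
flat_phi = np.array([0.0, np.pi / 2, np.pi])

# One vectorized operation over all muons in all events; no Python loop.
flat_px = flat_pt * np.cos(flat_phi)
flat_py = flat_pt * np.sin(flat_phi)

# The event structure is unchanged, so the same offsets describe the result.
px_per_event = [
    flat_px[start:stop].tolist() for start, stop in zip(offsets[:-1], offsets[1:])
]
print(px_per_event)
```

The list comprehension at the end is only there to display the result event by event; the calculation itself never loops over events.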

<br><br><br><br><br>

To be fair, that's not how RDataFrame is _supposed_ to be used: it's a functional programming framework that takes functions as strings of C++ code (which it compiles).

The `ROOT::VecOps` library presents an array-oriented style _per event_.

In [None]:
(
    rdf.Define("px", "Muon_pt * ROOT::VecOps::cos(Muon_phi)")
       .Define("py", "Muon_pt * ROOT::VecOps::sin(Muon_phi)")
       .AsNumpy(["px", "py"])
)

<br><br><br><br><br>

## Tools from Scikit-HEP

<img src="img/scikit-hep-logo.svg" width="300">

Scikit-HEP is an umbrella organization for particle physics software in Python.

See [scikit-hep.org](https://scikit-hep.org/) for more information.

<br><br><br><br><br>

### Uproot

<img src="img/uproot-logo.svg" width="300">

Uproot is a reimplementation of ROOT file I/O in Python.

See [uproot.readthedocs.io](https://uproot.readthedocs.io/) for tutorials and reference documentation.

<img src="img/abstraction-layers.svg" width="800">

<br><br><br><br><br>

ROOT files can contain standalone objects, such as histograms, and tables of data ("TTrees") whose columns are arrays ("TBranches").

As a low-level detail, ranges of entries in the arrays can only be read in granular units ("TBaskets").

<img src="img/terminology.svg" width="700">

Uproot reads standalone objects, including TTree metadata, in slow, imperative Python.

It reads, decompresses, and interprets TBranch arrays in fast, array-oriented NumPy.
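To make the TBasket granularity concrete, here is a toy model with hypothetical basket boundaries. It is not Uproot's implementation, and `basket_edges` and `baskets_for_range` are names invented for this sketch. It shows that a request for a small range of entries can require reading more than one whole basket.

```python
import numpy as np

# Hypothetical entry boundaries of four TBaskets: [0, 1000), [1000, 2500), ...
basket_edges = np.array([0, 1000, 2500, 4000, 5000])

def baskets_for_range(entry_start, entry_stop):
    """Return indices of the baskets that overlap [entry_start, entry_stop)."""
    first = np.searchsorted(basket_edges, entry_start, side="right") - 1
    last = np.searchsorted(basket_edges, entry_stop, side="left") - 1
    return list(range(int(first), int(last) + 1))

# Only 200 entries are requested, but they straddle a basket boundary.
needed = baskets_for_range(900, 1100)
print(needed)   # [0, 1]
```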

<br><br><br><br><br>

Here's an example of how you would interact with Uproot to get some of the arrays.

Note that this is using a pre-release of Uproot 5, which will be [formally released in December 2022](https://github.com/scikit-hep/awkward/wiki#grand-view-and-history).

<br><br>

"Open a file."

In [None]:
file = uproot.open("data/SMHiggsToZZTo4L.root")
file

"What's in the file?"

In [None]:
file.keys()

In [None]:
file.classnames()

"Read the TTree metadata. (Not the arrays!)"

In [None]:
tree = file["Events"]
tree

"What TBranch types are in the TTree?"

In [None]:
tree.show()

"Can I get that information programmatically?"

(Yes.)

In [None]:
{key: branch.typename for key, branch in tree.items()}

"Read the muon $p_T$, $\eta$, $\phi$, and mass, and no other arrays."

In [None]:
muon_kinematics = tree.arrays(["Muon_pt", "Muon_eta", "Muon_phi", "Muon_mass"])
muon_kinematics

"Show me that (already read) array in more detail, including data types."

In [None]:
muon_kinematics.show(type=True)

"Which TBranches have anything to do with muons or electrons?"

In [None]:
tree.keys(filter_name=["Muon_*", "Electron_*"])
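The `"Muon_*"` and `"Electron_*"` strings are glob-style wildcard patterns. The sketch below mimics that kind of matching with the standard library's `fnmatch` module on a short, illustrative list of branch names. It is not how Uproot implements `filter_name`, and `nMuon` is a made-up name included to show a non-match.

```python
from fnmatch import fnmatchcase

# An illustrative subset of branch names; "nMuon" is hypothetical.
branch_names = ["MET_pt", "MET_phi", "Muon_pt", "Muon_phi", "Electron_pt", "nMuon"]

patterns = ["Muon_*", "Electron_*"]

# Keep a name if it matches at least one of the glob patterns.
selected = [
    name for name in branch_names
    if any(fnmatchcase(name, pattern) for pattern in patterns)
]
print(selected)   # ['Muon_pt', 'Muon_phi', 'Electron_pt']
```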

"Read all the TBranches that have anything to do with muons or electrons (_re-reading_ the muon kinematics!)."

In [None]:
muons_and_electrons = tree.arrays(filter_name=["Muon_*", "Electron_*"])
muons_and_electrons

"More detail on that (already read) array, please."

In [None]:
muons_and_electrons.show(type=True)

<br><br><br><br><br>

We can pull individual arrays out of this using syntax like

In [None]:
muons_and_electrons["Muon_pt"]

but please be aware of the distinction between accessing data that have already been read (above)...

...and reading or re-reading new data from disk (below).

In [None]:
tree["Muon_pt"].array()

Uproot and Awkward Array are "eager": they do what you tell them to, when you tell them to.

<br><br><br><br><br>

Unless you're using Dask* (brand new; highly experimental).

(\* Thanks to Kush Kothari!)

In [None]:
delayed_read = uproot.dask("data/SMHiggsToZZTo4L.root", library="np")

delayed_px = delayed_read["MET_pt"] * np.cos(delayed_read["MET_phi"])

delayed_px

In [None]:
delayed_px.visualize()

In [None]:
delayed_px.compute()
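The difference between eager and delayed evaluation can be sketched with a toy class that stores a recipe and only runs it when asked. This is a hypothetical illustration with made-up values, not Dask's design or API; `Delayed` and `delayed_cos` are names invented here.

```python
import numpy as np

class Delayed:
    """Toy stand-in for a lazy array: stores a recipe, runs it on compute()."""

    def __init__(self, recipe, description):
        self._recipe = recipe
        self.description = description

    def __mul__(self, other):
        return Delayed(
            lambda: self.compute() * other.compute(),
            f"({self.description} * {other.description})",
        )

    def compute(self):
        return self._recipe()

def delayed_cos(x):
    """Wrap np.cos so that it is only applied when compute() is called."""
    return Delayed(lambda: np.cos(x.compute()), f"cos({x.description})")

# Made-up values standing in for MET_pt and MET_phi.
met_pt = Delayed(lambda: np.array([10.0, 20.0]), "MET_pt")
met_phi = Delayed(lambda: np.array([0.0, np.pi]), "MET_phi")

toy_px = met_pt * delayed_cos(met_phi)   # nothing is computed yet
print(toy_px.description)                # (MET_pt * cos(MET_phi))
result = toy_px.compute()                # now the work happens
print(result)
```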

<br><br><br><br><br>

### Awkward Array

<img src="img/awkward-logo.svg" width="300">

Awkward Array is a library for manipulating arrays of arbitrary data types as though they were NumPy arrays.

See [awkward-array.org](https://awkward-array.org/) for tutorials and reference documentation.

Note that this is using a pre-release of Awkward Array 2, which will be [formally released in December 2022](https://github.com/scikit-hep/awkward/wiki#grand-view-and-history).

In [None]:
import awkward._v2 as ak

<br><br><br><br><br>

As an example with some generality, consider an array of variable-length lists of records with fields "x" and "y": each "x" is either missing (`None`) or a floating-point value, and each "y" is a list of integers.

Like this:

In [None]:
array = ak.Array([
    [{"x": 1.1, "y": [1]}, {"x": None, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],
    [],
    [{"x": None, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]
] * 10000)

The following NumPy-like expression

   * accesses field "y"
   * drops the first element of each list (`1:`) from the innermost dimension (`...`)
   * squares each value with `np.square`, a NumPy function
   * returns a structure that is unmodified from the original, except where dictated.

In [None]:
output = np.square(array["y", ..., 1:])
output

Looking at that in more detail:

In [None]:
output.show()

<br><br><br><br><br>

To do the equivalent in Python, we'd have to write the following:

In [None]:
array_as_lists = array.tolist()

In [None]:
%%timeit

output = []
for sublist in array_as_lists:
    tmp1 = []
    for record in sublist:
        tmp2 = []
        for number in record["y"][1:]:
            tmp2.append(np.square(number))
        tmp1.append(tmp2)
    output.append(tmp1)

The array-oriented expression is faster, too.

In [None]:
%%timeit

output = np.square(array["y", ..., 1:])

<br><br><br><br><br>

While we're showing off brand-new, highly experimental features, how about converting an Awkward Array to RDataFrame and back*?

(\* Thanks to Yana Osborne!)

In [None]:
rdf2 = ak.to_rdataframe({"array": array})

In [None]:
rdf3 = rdf2.Define("output", """
std::vector<std::vector<int64_t>> tmp1;

for (auto record : array) {
    std::vector<int64_t> tmp2;
    for (auto number : record.y()) {
        tmp2.push_back(number * number);
    }
    tmp1.push_back(tmp2);
}
return tmp1;
""")

In [None]:
output = ak.from_rdataframe(rdf3, "output")
output

In [None]:
output["output"].show()

<br><br><br><br><br>

### Vector

<br>

<img src="img/vector-logo.svg" width="300">

<br>

Vector is a library for manipulating arrays of Lorentz vectors (and 2D, 3D Euclidean vectors).

See [vector.readthedocs.io](https://vector.readthedocs.io/) for tutorials and reference documentation.

We'll use this to add vectors and compute masses without having to write the formulae by hand.

But because Vector currently requires Awkward 1, we'll reload the data for the project in Awkward 1.
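For reference, these are the kinds of formulae that Vector spares us from writing by hand. The NumPy sketch below converts $(p_T, \eta, \phi, m)$ to Cartesian components, adds two four-vectors, and computes the invariant mass. The function names and the input values are made up for illustration; this is not Vector's API.

```python
import numpy as np

def to_cartesian(pt, eta, phi, mass):
    """Convert (pT, eta, phi, mass) to (px, py, pz, E) in natural units."""
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    energy = np.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return px, py, pz, energy

def invariant_mass(p1, p2):
    """Invariant mass of the sum of two four-vectors given as (px, py, pz, E)."""
    px, py, pz, energy = (a + b for a, b in zip(p1, p2))
    m_squared = energy**2 - px**2 - py**2 - pz**2
    return np.sqrt(np.maximum(m_squared, 0.0))   # clip tiny negative rounding errors

# Made-up example: two massless particles, back to back, with pT = 45 each.
zero = np.array([0.0])
p1 = to_cartesian(np.array([45.0]), zero, np.array([0.0]), zero)
p2 = to_cartesian(np.array([45.0]), zero, np.array([np.pi]), zero)

pair_mass = invariant_mass(p1, p2)
print(pair_mass)
```

The two momenta cancel and the energies add, so the pair mass comes out at about 90 in the same units as the inputs.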

<br><br><br><br><br>

### hist

<img src="img/hist-logo.svg" width="300">

Hist is a library for filling, manipulating, and plotting histograms.

See [hist.readthedocs.io](https://hist.readthedocs.io/) for tutorials and reference documentation.

We'll use this to plot distributions.

<br><br><br><br><br>

# Next stop: the hands-on project

Go to [project.ipynb](project.ipynb) for the hands-on project.