# CoDaS-HEP Columnar Data Analysis, part 2

This is the second of three sessions on [columnar data analysis](https://indico.cern.ch/event/1151367/timetable/#41-columnar-data-analysis), presented at CoDaS-HEP at 12:30pm on August 3, 2022 by Jim Pivarski and Ioana Ifrim.

See the [GitHub repo](https://github.com/jpivarski-talks/2022-08-03-codas-hep-columnar-tutorial#readme) for instructions on how to run it.

<br><br><br><br><br>

## From ROOT files into arrays

Physics data are in ROOT files. For columnar analysis, we'll need to get the data into arrays.

In [4]:
import ROOT

In [5]:
rdf = ROOT.RDataFrame("Events", "data/SMHiggsToZZTo4L.root")

In [6]:
rdf.AsNumpy(["MET_pt", "MET_phi"])

{'MET_pt': ndarray([21.92993 , 16.972134, 19.061464, ..., 17.671701, 23.999083,
          12.943779], dtype=float32),
 'MET_phi': ndarray([-2.7301223,  2.8669462, -2.1664631, ...,  1.8889483, -1.973488 ,
           1.6512431], dtype=float32)}
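
Because every event has exactly one MET value, these are ordinary numeric arrays, and array-at-a-time arithmetic applies to them directly. As a minimal sketch, the first three values printed above are copied by hand into small arrays so that it runs without ROOT. The names `met_pt`, `met_phi`, `met_px`, and `met_py` are illustrative and not part of the tutorial.

```python
import numpy as np

# First three values from the AsNumpy output above, copied by hand
met_pt = np.array([21.92993, 16.972134, 19.061464], dtype=np.float32)
met_phi = np.array([-2.7301223, 2.8669462, -2.1664631], dtype=np.float32)

# One expression computes the component for every event at once
met_px = met_pt * np.cos(met_phi)
met_py = met_pt * np.sin(met_phi)

print(met_px.dtype, met_px.shape)
```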

<br><br><br><br><br>

But for variable-length data, such as particle quantities in events with arbitrarily many particles, the NumPy array has `dtype=object`.

In [22]:
muon_quantities = rdf.AsNumpy(["Muon_pt", "Muon_phi"])
muon_quantities

{'Muon_pt': ndarray([<cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x7fac8b574010>,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x7fac8b574050>,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x7fac8b574090>,
          ...,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x7fac8c7c3090>,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x7fac8c7c30d0>,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x7fac8c7c3110>],
         dtype=object),
 'Muon_phi': ndarray([<cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x55a6e3d1b4f0>,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x55a6e3d1b530>,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x55a6e3d1b570>,
          ...,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x55a6e4f6a570>,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x55a6e4f6a5b0>,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x55a6e4f6a5f0>],
         dtype=object)}

<br><br><br><br><br>

NumPy `dtype=object` arrays are essentially lists: every element is a Python object and NumPy does not know the internal structure.

On the one hand, this limits the speed of calculation:

In [8]:
import numpy as np

numeric_array = np.arange(1000000, dtype=np.int64)
python_objects = np.array(range(1000000), dtype=object)

In [9]:
%%timeit

numeric_array**2

279 µs ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [10]:
%%timeit

python_objects**2

148 ms ± 960 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
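
The gap comes from what each array actually stores. The following sketch, on deliberately tiny arrays, inspects the two layouts; it is meant as an illustration of the difference, not a benchmark.

```python
import numpy as np

numeric_array = np.arange(5, dtype=np.int64)
python_objects = np.array(range(5), dtype=object)

# Fixed-width machine integers, 8 bytes each
print(numeric_array.dtype, numeric_array.itemsize)

# Each element is a reference to a separate Python int
print(python_objects.dtype, type(python_objects[0]))

# Squaring the object array applies Python's ** to each element in turn,
# and the result is again an array of Python objects
squared = python_objects**2
print(squared.dtype, type(squared[0]))
```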


And on the other hand, it limits expressiveness:

In [17]:
numeric_array = np.empty((2, 2, 3), dtype=np.int64)
python_objects = np.empty(2, dtype=object)

numeric_array[:] = [[[1, 2, 3], [4, 5, 6]], [[10, 20, 30], [40, 50, 60]]]
python_objects[:] = [[[1, 2, 3], [4, 5, 6]], [[10, 20, 30], [40, 50, 60]]]

In [20]:
numeric_array[:, :, 1:]    # drop first element from innermost dimension

array([[[ 2,  3],
        [ 5,  6]],

       [[20, 30],
        [50, 60]]])

In [21]:
python_objects[:, :, 1:]   # can't manipulate anything past the first dimension

IndexError: too many indices for array: array is 1-dimensional, but 3 were indexed
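
To get the same result from the object array, the inner dimensions have to be walked in Python. The sketch below shows one possible workaround; it fills the array element by element, and the name `trimmed` is illustrative.

```python
import numpy as np

python_objects = np.empty(2, dtype=object)
# Assign element by element so each slot holds one nested Python list
python_objects[0] = [[1, 2, 3], [4, 5, 6]]
python_objects[1] = [[10, 20, 30], [40, 50, 60]]

# Drop the first element of the innermost lists with explicit Python iteration
trimmed = np.empty(len(python_objects), dtype=object)
for i, outer in enumerate(python_objects):
    trimmed[i] = [inner[1:] for inner in outer]

print(trimmed)
```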

<br><br><br><br><br>

Thus, to compute $p_x$ and $p_y$ from muon $p_T$ and $\phi$ in Python, we'd have to fall back on imperative loops or (functional) list comprehensions:

In [34]:
all_pt, all_phi = muon_quantities["Muon_pt"], muon_quantities["Muon_phi"]

all_px = np.array([[pt * np.cos(phi) for pt, phi in zip(event_pt, event_phi)] for event_pt, event_phi in zip(all_pt, all_phi)], dtype=object)
all_py = np.array([[pt * np.sin(phi) for pt, phi in zip(event_pt, event_phi)] for event_pt, event_phi in zip(all_pt, all_phi)], dtype=object)

all_px, all_py

(array([list([-62.09642131826239, 19.5441283607252, 2.05475040026448]),
        list([]), list([]), ...,
        list([2.37480227860668, 3.9543648339704807, 3.018615575574286, 2.254885007906879]),
        list([]), list([])], dtype=object),
 array([list([10.888704252275756, -32.729005959017954, 3.4885342087314886]),
        list([]), list([]), ...,
        list([3.604087973668882, 1.8335333778519876, 4.755512408592984, 4.181554816344941]),
        list([]), list([])], dtype=object))
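
The nested comprehensions above are the functional form. For comparison, here is a sketch of the imperative form, using plain Python lists with illustrative values as a stand-in for the per-event `RVec` objects so that it runs without ROOT.

```python
import math

# Stand-in for the per-event RVec objects: plain Python lists (illustrative values)
all_pt = [[38.5, 47.0], [], [4.45]]
all_phi = [[2.05, -1.15], [], [1.12]]

all_px, all_py = [], []
for event_pt, event_phi in zip(all_pt, all_phi):      # loop over events
    event_px, event_py = [], []
    for pt, phi in zip(event_pt, event_phi):          # loop over muons in the event
        event_px.append(pt * math.cos(phi))
        event_py.append(pt * math.sin(phi))
    all_px.append(event_px)
    all_py.append(event_py)

print(all_px)
print(all_py)
```

Either way, every multiplication and every trigonometric call passes through the Python interpreter once per muon.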

<br><br><br><br><br>

However, if we instead read the file with Uproot (to be described later),

In [29]:
import uproot

In [33]:
with uproot.open("data/SMHiggsToZZTo4L.root:Events") as events:
    muon_quantities2 = events.arrays(["Muon_pt", "Muon_phi"])

muon_quantities2.show()

[{Muon_pt: [63, 38.1, 4.05], Muon_phi: [2.97, ..., 1.04]},
 {Muon_pt: [], Muon_phi: []},
 {Muon_pt: [], Muon_phi: []},
 {Muon_pt: [54.3, 23.5, ..., 8.39, 3.49], Muon_phi: [...]},
 {Muon_pt: [], Muon_phi: []},
 {Muon_pt: [38.5, 47], Muon_phi: [2.05, -1.15]},
 {Muon_pt: [4.45], Muon_phi: [1.12]},
 {Muon_pt: [], Muon_phi: []},
 {Muon_pt: [], Muon_phi: []},
 {Muon_pt: [], Muon_phi: []},
 ...,
 {Muon_pt: [37.2, 50.1], Muon_phi: [-0.875, 2.65]},
 {Muon_pt: [43.2, 24], Muon_phi: [-1.3, 1.38]},
 {Muon_pt: [24.2, 79.5], Muon_phi: [-0.997, 2.51]},
 {Muon_pt: [], Muon_phi: []},
 {Muon_pt: [9.81, 25.5], Muon_phi: [1.66, -3.09]},
 {Muon_pt: [32.6, 43.1], Muon_phi: [-0.981, 2.27]},
 {Muon_pt: [4.32, 4.36, 5.63, 4.75], Muon_phi: [0.988, ...]},
 {Muon_pt: [], Muon_phi: []},
 {Muon_pt: [], Muon_phi: []}]


Operations on these Awkward Arrays (to be described later) can be expressed in an array-oriented way:

In [35]:
all_pt, all_phi = muon_quantities2["Muon_pt"], muon_quantities2["Muon_phi"]

all_px = all_pt * np.cos(all_phi)
all_py = all_pt * np.sin(all_phi)

all_px, all_py

(<Array [[-62.1, 19.5, 2.05], [], [], ..., [], []] type='299973 * var * float32'>,
 <Array [[10.9, -32.7, 3.49], [], [], ..., [], []] type='299973 * var * float32'>)
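
One way to picture why this needs no Python loops: a jagged array can be stored as one flat numeric buffer plus the number of entries per event, so the arithmetic runs on the flat buffer. The sketch below does this by hand in NumPy with illustrative values. The names `counts`, `flat_pt`, and `flat_phi` are hypothetical, and this is a simplified picture rather than a description of Awkward Array's actual internals.

```python
import numpy as np

# Three events with 2, 0, and 1 muons (illustrative values)
counts = np.array([2, 0, 1])
flat_pt = np.array([38.5, 47.0, 4.45], dtype=np.float32)
flat_phi = np.array([2.05, -1.15, 1.12], dtype=np.float32)

# The arithmetic ignores event boundaries: one vectorized pass over all muons
flat_px = flat_pt * np.cos(flat_phi)
flat_py = flat_pt * np.sin(flat_phi)

# Event boundaries are only needed when regrouping the results
boundaries = np.cumsum(counts)[:-1]
px_per_event = np.split(flat_px, boundaries)

print([event.tolist() for event in px_per_event])
```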

<br><br><br><br><br>

To be fair, that's not how RDataFrame is _supposed_ to be used: it's a functional programming framework that takes functions as strings of C++ code (which it compiles).

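The `ROOT::VecOps` library offers an array-oriented style _within each event_: expressions operate on whole `RVec`s, but RDataFrame still evaluates them one event at a time.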

In [37]:
(
    rdf.Define("px", "Muon_pt * ROOT::VecOps::cos(Muon_phi)")
       .Define("py", "Muon_pt * ROOT::VecOps::sin(Muon_phi)")
       .AsNumpy(["px", "py"])
)

{'px': ndarray([<cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x7fac79678010>,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x7fac79678050>,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x7fac79678090>,
          ...,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x7fac7a8c7090>,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x7fac7a8c70d0>,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x7fac7a8c7110>],
         dtype=object),
 'py': ndarray([<cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x55a6e96880a0>,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x55a6e96880e0>,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x55a6e9688120>,
          ...,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x55a6ea8d7120>,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x55a6ea8d7160>,
          <cppyy.gbl.ROOT.VecOps.RVec<float> object at 0x55a6ea8d71a0>],
         dtype=object)}

<br><br><br><br><br>

### Uproot

<br><br><br><br><br>

### Awkward Array

<br><br><br><br><br>

## Project: H → ZZ → 4ℓ

<br><br><br><br><br>

### 4 leptons of the same flavor

<br><br><br><br><br>

### Opposite charges

<br><br><br><br><br>

### On your own: the H → ZZ → 2μ2e case

<br><br><br><br><br>

### Hint!