# Columnar analysis with Awkward Array

## How this works as a hands-on tutorial

Even though I don't have formal exercises scattered throughout these notebooks, this session can still be interactive.

   * **You** should open each notebook in Binder (see [GitHub README](https://github.com/jpivarski/2020-06-08-uproot-awkward-columnar-hats)) and evaluate cells, following along with me.
   * **I** should pause frequently and stay open to questions. I'll be monitoring the videoconference chat.
   * **We** should feel free to step off the path and try to answer "What if?" questions in real time.

Not all digressions will lead to an answer—I often realize, "That's why it didn't work!" long after the tutorial is over—but tinkering is how we learn.

Consider this a tour and I'm your guide. The planned route is a suggestion to get things started, but your questions are more important.

(Also, I'm awful at writing formal exercises; they end up being too easy _and_ too hard.)

<br><br><br>

## Array-based programming

One of the first programming languages, named **APL** ("A Programming Language"), was array-based. It started as a notation for _describing_ hand-written machine code and was later made interactive.

**Nial** was also theoretically motivated, and the two of these inspired a generation of direct descendants (green).

Meanwhile, the **S** language for statistics borrowed many of these ideas while being focused on a particular domain. Its descendant, **R**, is still widely used.

**IDL** was invented for the sciences and gained a lot of traction as an alternative to writing custom Fortran, again using vectorization as a first-class concept.

**MATLAB** similarly gained traction in the sciences as a commercial product.

**PDL** (Perl Data Language) and **NumPy** introduced the same concepts as libraries within an established language (Perl and Python). **Julia** has some vector-like interfaces, though its focus is on just-in-time compiling imperative code.

![](img/apl-timeline.png)

<br><br><br>

Common features of array-based languages:

   * Arrays are the central data type with most operations applying to arrays. (By contrast, C requires explicit iteration over the arrays: it's imperative.)
   * They are _all_ interactive languages. The array-at-a-time logic makes it possible to define precompiled routines that run in response to user commands.
   * They are primarily data analysis languages, highly targeted to the sciences and statistics.

In retrospect, it sounds like a perfect fit.
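As a minimal sketch of the first bullet point, here is the same calculation written in the imperative style and in the array-at-a-time style, using NumPy. The array `values` and the variable names are made up for illustration.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# Imperative style: the loop over elements is written out explicitly.
squared_loop = np.empty(len(values))
for i in range(len(values)):
    squared_loop[i] = values[i] ** 2

# Array-at-a-time style: one expression applies to the whole array,
# and the loop runs inside NumPy's precompiled routine.
squared_array = values ** 2

print(squared_loop)
print(squared_array)
```

Both give the same numbers; the difference is where the loop lives.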

<br><br><br>

In this plot of the "astronomical" rise of Python, note that 2 of the 3 languages it's displacing are array languages.

![](img/mentions-of-programming-languages.png)

<br><br><br>

## Why not for particle physics?

Because **data structures**. Particle physicists have _always_ needed to deal with complex data structures, so much so that we invented packages to add them to Fortran.

The following is from [_Initiation to HYDRA_ by R.K. Böck (1976)](https://cds.cern.ch/record/864527?ln=en) as part of an explanation of what a "data structure" is, at a time before Fortran had `FOR` loops. (HYDRA was merged into ZEBRA, which became the basis for ROOT I/O.)

We would draw similar diagrams today.

![](img/hydra-2.png)

<br><br><br>

But the modify-compile-rerun cycle of C++ is too long for interactive data analysis. That's why ROOT invented CINT and then Cling.

But C++ is too complex of a language for data-focused tasks. That's why I was thinking a lot about [extending query languages (like SQL) to data structures](https://stackoverflow.com/questions/38831961/what-declarative-language-is-good-at-analysis-of-tree-like-data).

But I was surprised by how useful the simple JaggedArray class in Uproot turned out to be. My conclusion was that you don't need a new language, just some data types and operations.

![](img/uproot-awkward-timeline.png)

<br><br><br>

<font size="15">That's what </font><img src="img/awkward-logo-300px.png" style="vertical-align:middle"><font size="15"> is.</font>


Just arrays, but with awkward shapes.

![](img/cartoon-schematic.png)

<br><br><br>

## Let's start with a non-physics example

To get a feel for what this means, let's look at _something completely different_ from a Z peak: [Chicago bike paths](https://github.com/Chicago/osd-bike-routes/blob/master/data/Bikeroutes.geojson).

In [1]:
import json

bikeroutes_json = open("data/Bikeroutes.geojson").read()
bikeroutes_pyobj = json.loads(bikeroutes_json)

In [2]:
# First thousand bytes...
print(bikeroutes_json[:1000])

{
"type": "FeatureCollection",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
                                                                                
"features": [
{ "type": "Feature", "properties": { "STREET": "W FULLERTON AVE", "TYPE": "4", "BIKEROUTE": "RECOMMENDED BIKE ROUTE", "F_STREET": "W GRAND AVE", "T_STREET": "W GRAND AVE" }, "geometry": { "type": "MultiLineString", "coordinates": [ [ [ -87.788572682391163, 41.923652047961923 ], [ -87.788645591836797, 41.923651405921802 ], [ -87.788844988373143, 41.923649881816345 ], [ -87.788950897155686, 41.923649066751238 ], [ -87.789091716222416, 41.923648060311677 ], [ -87.789279058225759, 41.923646752738705 ], [ -87.789396101278086, 41.923645907691565 ], [ -87.789781787237345, 41.923641532135363 ], [ -87.789851574317836, 41.923640764779442 ], [ -87.78989703352525, 41.923640238994977 ], [ -87.790052557345319, 41.923638477701097 ], [ -87.790253898265888, 41.92363731709856 ], [ -87.7903925124

It has a lot of structure—metadata and street names mixed in with the longitude, latitude points. But we can read it in as arrays.

In [3]:
import awkward1 as ak

bikeroutes = ak.Record(bikeroutes_pyobj)
bikeroutes

<Record ... [-87.7, 42], [-87.7, 42]]]}}]} type='{"type": string, "crs": {"type"...'>

The analysis-relevant information about the data's structure is contained in its `type`.

In [4]:
ak.type(bikeroutes)

{"type": string, "crs": {"type": string, "properties": {"name": string}}, "features": var * {"type": string, "properties": {"STREET": string, "TYPE": string, "BIKEROUTE": string, "F_STREET": string, "T_STREET": option[string]}, "geometry": {"type": string, "coordinates": var * var * var * float64}}}

Everything is a `string` or an `option[string]` (i.e. can be `null`) except the coordinates, which are `var * var * var * float64` (triply jagged array).

Let's go straight for the coordinates.

In [5]:
bikeroutes["features", "geometry", "coordinates"]

<Array [[[[-87.8, 41.9], ... [-87.7, 42]]]] type='1061 * var * var * var * float64'>

or

In [6]:
bikeroutes.features.geometry.coordinates

<Array [[[[-87.8, 41.9], ... [-87.7, 42]]]] type='1061 * var * var * var * float64'>

In [7]:
ak.to_list(bikeroutes.features.geometry.coordinates[:3])

[[[[-87.78857268239116, 41.92365204796192],
   [-87.7886455918368, 41.9236514059218],
   [-87.78884498837314, 41.923649881816345],
   [-87.78895089715569, 41.92364906675124],
   [-87.78909171622242, 41.92364806031168],
   [-87.78927905822576, 41.923646752738705],
   [-87.78939610127809, 41.923645907691565],
   [-87.78978178723735, 41.92364153213536],
   [-87.78985157431784, 41.92364076477944],
   [-87.78989703352525, 41.92364023899498],
   [-87.79005255734532, 41.9236384777011],
   [-87.79025389826589, 41.92363731709856],
   [-87.79039251240177, 41.923636326077215],
   [-87.79131231855902, 41.92362983196139],
   [-87.79145920250602, 41.923628604266135],
   [-87.79148375037359, 41.92362841824402]]],
 [[[-87.74815752805499, 41.914431860310785],
   [-87.74816482757203, 41.91443315985752],
   [-87.74819817563908, 41.914438543841555],
   [-87.74823564553337, 41.914451221037915],
   [-87.74829849193455, 41.91448927446621],
   [-87.74836277977153, 41.914546517396424],
   [-87.74841516152057, 

The third axis happens to have length 2 in all cases, but since we came from JSON (which can't guarantee that lists have a certain length), the type still describes it as variable-length (`var`).

We _could_ enforce this by applying a slice that has a fixed length.

In [8]:
ak.type(bikeroutes.features.geometry.coordinates)

1061 * var * var * var * float64

In [9]:
ak.type(bikeroutes.features.geometry.coordinates[:, :, :, [0, 1]])

1061 * var * var * 2 * float64

This distinction between fixed-size and in-principle-variable size is important in general, though not very important for this example.

A more important question is, what do these levels of jaggedness represent?

Let's pick one item and print it out in full detail.

In [10]:
ak.to_list(bikeroutes.features[751])

{'type': 'Feature',
 'properties': {'STREET': 'E 26TH ST',
  'TYPE': '1',
  'BIKEROUTE': 'EXISTING BIKE LANE',
  'F_STREET': 'S STATE ST',
  'T_STREET': 'S DR MARTIN LUTHER KING JR DR'},
 'geometry': {'type': 'MultiLineString',
  'coordinates': [[[-87.62685625163756, 41.84558714841179],
    [-87.62675996392576, 41.84558902593194],
    [-87.62637708895348, 41.845596494328554],
    [-87.62626461651281, 41.845598326696425],
    [-87.62618268489398, 41.84559966093136],
    [-87.6261438116618, 41.84560027230502],
    [-87.62613206507362, 41.845600474403334],
    [-87.6261027723024, 41.8456009526551],
    [-87.62579736038116, 41.84560626159298],
    [-87.62553890383363, 41.845610239979905],
    [-87.62532611036139, 41.845613593674],
    [-87.6247932635836, 41.84562202574476]],
   [[-87.62532611036139, 41.845613593674],
    [-87.6247932635836, 41.84562202574476]],
   [[-87.6247932635836, 41.84562202574476],
    [-87.62446484629727, 41.84562675013391],
    [-87.62444032614908, 41.8456270927620

The hint is "MultiLineString": a bike route consists of disconnected lines. (I guess you have to pick up your bike and walk it.)

Most routes are a single connected line; I found this extreme case with [ak.num](https://awkward-array.readthedocs.io/en/latest/_auto/ak.num.html), a function for jagged multiplicity.

In [11]:
ak.num(bikeroutes.features.geometry.coordinates)

<Array [1, 1, 1, 1, 1, 1, ... 1, 1, 1, 1, 1, 1] type='1061 * int64'>

In [12]:
ak.argmax(ak.num(bikeroutes.features.geometry.coordinates))

751

[ak.max](https://awkward-array.readthedocs.io/en/latest/_auto/ak.max.html) is just like [np.max](https://numpy.org/doc/1.18/reference/generated/numpy.amax.html) from NumPy except that it recognizes Awkward Arrays.

By contrast, [ak.num](https://awkward-array.readthedocs.io/en/latest/_auto/ak.num.html) could not have a NumPy equivalent because it provides information that would always be trivial with NumPy's rectangular arrays.

Functions that overlap NumPy functions, like [ak.max](https://awkward-array.readthedocs.io/en/latest/_auto/ak.max.html) does for [np.max](https://numpy.org/doc/1.18/reference/generated/numpy.amax.html), have exactly the same interface and defaults. If you have NumPy 1.17 or above, they're actually interchangeable (NumPy recognizes that it's looking at a non-NumPy array and defers to our implementation).

In [13]:
import numpy as np

np.argmax(ak.num(bikeroutes.features.geometry.coordinates))

751
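The mechanism behind this deferral is NumPy's `__array_function__` protocol. The following is only a toy sketch of that protocol, not Awkward Array's actual implementation: `ToyJagged` is a hypothetical class invented here, and it handles `np.sum` and nothing else.

```python
import numpy as np

class ToyJagged:
    """Hypothetical stand-in for a non-NumPy array type (not Awkward's real class)."""

    def __init__(self, lists):
        self.lists = lists

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this hook instead of its own implementation
        # whenever one of the arguments defines it.
        if func is np.sum:
            return sum(sum(inner) for inner in self.lists)
        return NotImplemented

toy = ToyJagged([[1, 2, 3], [], [4, 5]])
total = np.sum(toy)   # dispatched to ToyJagged.__array_function__
print(total)
```

Any function the class doesn't handle returns `NotImplemented`, and NumPy then raises a `TypeError` rather than guessing.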

To fill out the pattern set by NumPy, most of the functions have an `axis` parameter indicating the depth of nestedness where you want the function to apply.

This can be particularly useful for [ak.num](https://awkward-array.readthedocs.io/en/latest/_auto/ak.num.html).

In [14]:
# most routes have a single contiguous path
ak.num(bikeroutes.features.geometry.coordinates, axis=1)

<Array [1, 1, 1, 1, 1, 1, ... 1, 1, 1, 1, 1, 1] type='1061 * int64'>

In [15]:
# paths can have many or few longitude, latitude points
ak.num(bikeroutes.features.geometry.coordinates, axis=2)

<Array [[16], [16], [7], ... [80], [20], [11]] type='1061 * var * int64'>

In [16]:
# all of the longitude, latitude points have exactly two numbers
ak.num(bikeroutes.features.geometry.coordinates, axis=3)

<Array [[[2, 2, 2, 2, 2, ... 2, 2, 2, 2, 2]]] type='1061 * var * var * int64'>

Notice the data type: the number of entries in deeply nested data is itself a nested structure.
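To make the role of `axis` concrete, here is a pure-Python sketch of the same counting on nested lists. The helper `toy_num` and the tiny `routes` dataset are hypothetical, written only to mirror what the cells above show; this is not how `ak.num` is implemented.

```python
def toy_num(data, axis=1):
    """Count list lengths at nesting depth `axis` (axis=0 is the outermost list)."""
    if axis == 0:
        return len(data)
    return [toy_num(inner, axis - 1) for inner in data]

routes = [
    # one path with three points
    [[[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]],
    # two paths with two points each
    [[[0.0, 0.0], [1.0, 1.0]], [[5.0, 5.0], [6.0, 6.0]]],
]

print(toy_num(routes, axis=1))   # paths per route
print(toy_num(routes, axis=2))   # points per path
print(toy_num(routes, axis=3))   # numbers per point
```

Each step deeper adds one level of nesting to the result, just as the types above go from `1061 * int64` to `1061 * var * var * int64`.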

We can verify that the longitude, latitude points really do have two values like this:

In [17]:
num = ak.num(bikeroutes.features.geometry.coordinates, axis=3)
num == 2

<Array [[[True, True, True, ... True, True]]] type='1061 * var * var * bool'>

In [18]:
ak.all(num == 2)

True

[ak.all](https://awkward-array.readthedocs.io/en/latest/_auto/ak.all.html) is a reducer (like [np.all](https://numpy.org/doc/1.18/reference/generated/numpy.all.html)), which turns arrays into scalars.

Its default `axis` is `None`, meaning "reduce everything." We can also partially reduce.

`axis=-1` means "deepest axis," which is the most-often useful axis, apart from `None`.

In [19]:
ak.all(num == 2, axis=-1)

<Array [[True], [True], ... [True], [True]] type='1061 * var * bool'>
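The same two choices of `axis` behave identically on a plain rectangular NumPy array, which may be an easier place to see the difference. The array `flags` below is made up for illustration.

```python
import numpy as np

flags = np.array([[True, True, True],
                  [True, False, True]])

# axis=None (the default) reduces everything to a single scalar.
everything = np.all(flags)

# axis=-1 reduces only the deepest axis, leaving one value per row.
per_row = np.all(flags, axis=-1)

print(everything)
print(per_row)
```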

So now let's do something useful: how about computing the length of each bike route?

First, get the longitude and latitude separately.

In [20]:
longitude = bikeroutes.features.geometry.coordinates[..., 0]
latitude = bikeroutes.features.geometry.coordinates[..., 1]
longitude, latitude

(<Array [[[-87.8, -87.8, ... -87.7, -87.7]]] type='1061 * var * var * float64'>,
 <Array [[[41.9, 41.9, 41.9, ... 42, 42, 42]]] type='1061 * var * var * float64'>)

The ellipsis (`...`) saved me from having to type `coordinates[:, :, :, 0]`, and from having to know the exact depth when all I wanted was the deepest level.
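The ellipsis is ordinary NumPy slicing syntax, so it can be tried on a small rectangular array first. This sketch uses a made-up array named `points`.

```python
import numpy as np

points = np.arange(24).reshape(2, 3, 4)

# Spelling out one colon per outer axis...
explicit = points[:, :, 0]

# ...selects the same elements as letting the ellipsis fill them in.
with_ellipsis = points[..., 0]

print(np.array_equal(explicit, with_ellipsis))
```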

At our longitude and latitude, one degree of longitude corresponds to 82.7 km and one degree of latitude corresponds to 111.1 km (I looked that up elsewhere).
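As a rough consistency check on those two numbers: on a spherical Earth, a degree of longitude shrinks by the cosine of the latitude. The latitude of 41.88°N used below is my assumption for central Chicago, not a value taken from the dataset, and the spherical model is only an approximation.

```python
import math

KM_PER_DEGREE_LATITUDE = 111.1     # value quoted in the text
ASSUMED_LATITUDE_DEGREES = 41.88   # assumption: roughly central Chicago

# Circles of constant latitude shrink by cos(latitude).
km_per_degree_longitude = KM_PER_DEGREE_LATITUDE * math.cos(
    math.radians(ASSUMED_LATITUDE_DEGREES)
)

print(km_per_degree_longitude)
```

The result lands within a tenth of a kilometer of the 82.7 km quoted above.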

Functions like [ak.mean](https://awkward-array.readthedocs.io/en/latest/_auto/ak.mean.html)/[np.mean](https://numpy.org/doc/1.18/reference/generated/numpy.mean.html) have the same interface as reducers.

In [21]:
km_east = (longitude - np.mean(longitude)) * 82.7
km_north = (latitude - np.mean(latitude)) * 111.1
km_east, km_north

(<Array [[[-9.68, -9.69, ... -3.58, -3.62]]] type='1061 * var * var * float64'>,
 <Array [[[6.68, 6.68, 6.67, ... 9.68, 9.72]]] type='1061 * var * var * float64'>)

So now all of the paths are in distance units (km), relative to the center of Chicago (the center of all the points at least; we only needed a convenient origin).

Think, for a moment, about what that transformation would have required in "for" loops. Even if speed were not an issue, it would be a lot of typing.

To compute lengths, we need distances _between_ points. So we want to match pairs of points along each path.

The way you'd do this with a NumPy array is with slices:

In [22]:
path = np.array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
path[1:] - path[:-1]

array([1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1])

where `path[1:]` drops the first element and `path[:-1]` drops the last element, which is exactly what you want to subtract to get all the distances in between.

We can do the same thing with our Awkward Arrays, even though they have different lengths at the deepest level of jaggedness.

In [23]:
km_east[0, 0, :-1], km_east[0, 0, 1:]

(<Array [-9.68, -9.69, -9.7, ... -9.91, -9.92] type='15 * float64'>,
 <Array [-9.69, -9.7, -9.71, ... -9.92, -9.92] type='15 * float64'>)

In [24]:
km_east[0, 0, :-1] - km_east[0, 0, 1:]

<Array [0.00603, 0.0165, ... 0.0121, 0.00203] type='15 * float64'>

Doing it for all paths at once is no more difficult than doing it for the first path.

In [25]:
km_east[:, :, :-1] - km_east[:, :, 1:]

<Array [[[0.00603, 0.0165, ... 0.0385]]] type='1061 * var * var * float64'>

(The issue that might come up in a physics analysis is that you might have some empty lists. This dataset didn't have empty paths—why would it?—so we'll deal with empty lists in the physics section, below.)
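To see why this can be done for all lists at once, it may help to picture a jagged array as one flat array plus offsets marking where each list starts and stops. The sketch below is only an illustration of that idea in NumPy, under the assumption that no list is empty; it is not Awkward Array's implementation, and `content`, `offsets`, and the other names are invented here.

```python
import numpy as np

# Three jagged lists, [1, 2, 4], [10, 13], [20], stored as one flat array
# plus offsets marking where each list starts and stops.
content = np.array([1.0, 2.0, 4.0, 10.0, 13.0, 20.0])
offsets = np.array([0, 3, 5, 6])

# Subtract neighbors across the whole flat array in one vectorized step.
all_diffs = content[1:] - content[:-1]

# A difference at flat index i pairs elements i and i + 1, so it straddles
# two lists whenever i + 1 is the start of a list. Drop those.
keep = np.ones(len(all_diffs), dtype=bool)
keep[offsets[1:-1] - 1] = False
diffs = all_diffs[keep]

# Each list loses one element, so list k now starts k positions earlier.
# (This is the step that breaks if any list is empty.)
diff_offsets = offsets - np.arange(len(offsets))

print(diffs)
print(diff_offsets)
```

The last step is where empty lists cause trouble: a list with no elements can't lose one.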

Since we're all familiar with using $\sqrt{(x_i - x_{i + 1})^2 + (y_i - y_{i + 1})^2}$ as a distance formula, let's jump to the answer.

In [26]:
segment_length = np.sqrt((km_east[:, :, 1:] - km_east[:, :, :-1])**2 +
                         (km_north[:, :, 1:] - km_north[:, :, :-1])**2)
segment_length

<Array [[[0.00603, 0.0165, ... 0.0523]]] type='1061 * var * var * float64'>

So now we have replaced paths of length _n_ (where _n_ is variable) with segment distances of length _n ‒ 1_.
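Here is the same formula on a single, made-up path with plain NumPy arrays, small enough to check by hand: three points forming a 3-4-5 triangle leg followed by a vertical drop.

```python
import numpy as np

# Three points: (0, 0) -> (3, 4) -> (3, 0)
x = np.array([0.0, 3.0, 3.0])
y = np.array([0.0, 4.0, 0.0])

# Pair each point with its successor and apply the distance formula.
segment = np.sqrt((x[1:] - x[:-1])**2 + (y[1:] - y[:-1])**2)

print(segment)            # one entry per segment: n - 1 of them
print(np.sum(segment))    # total path length
```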

We probably want the length of each path, so... reducer! This one is [ak.sum](https://awkward-array.readthedocs.io/en/latest/_auto/ak.sum.html)/[np.sum](https://numpy.org/doc/1.18/reference/generated/numpy.sum.html).

In [27]:
path_length = np.sum(segment_length, axis=-1)
path_length

<Array [[0.241], [0.0971], ... 0.347], [0.281]] type='1061 * var * float64'>
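A sum over the deepest axis of a jagged array can also be done without a Python loop. One way to sketch it, assuming the jagged array is stored as a flat `content` array plus `offsets` (names invented here, and not necessarily how Awkward Array does it), is with a running total:

```python
import numpy as np

# Lists [1, 2], [], [3, 4, 5] as flat content plus offsets.
content = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
offsets = np.array([0, 2, 2, 5])

# Running total with a leading zero, so running[i] is the sum of content[:i].
running = np.concatenate([[0.0], np.cumsum(content)])

# The sum of each list is the running total at its stop minus at its start.
sums = running[offsets[1:]] - running[offsets[:-1]]

print(sums)
```

An empty list naturally sums to zero here, because its start and stop coincide.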

Okay, but some routes have multiple paths (though most have exactly one). These `path_length` values have the same multiplicity as the paths.

In [28]:
ak.num(path_length), ak.num(bikeroutes.features.geometry.coordinates)

(<Array [1, 1, 1, 1, 1, 1, ... 1, 1, 1, 1, 1, 1] type='1061 * int64'>,
 <Array [1, 1, 1, 1, 1, 1, ... 1, 1, 1, 1, 1, 1] type='1061 * int64'>)

In [29]:
ak.num(path_length) == ak.num(bikeroutes.features.geometry.coordinates)

<Array [True, True, True, ... True, True, True] type='1061 * bool'>

In [30]:
ak.all(ak.num(path_length) == ak.num(bikeroutes.features.geometry.coordinates))

True

So let's reduce again.

In [31]:
route_length = np.sum(path_length, axis=-1)
route_length

<Array [0.241, 0.0971, 0.203, ... 0.347, 0.281] type='1061 * float64'>

There you have it. We can also put this new derived column into the original array, if that's useful for anything.

(Note: you have to assign with square brackets and strings, not attributes, because attribute-assignment would lead to confusion about assigning to temporary copies. Pandas has the same problem, and they're deprecating attribute-assignment because of it.)

In [32]:
bikeroutes["features", "route_length"] = route_length

In [36]:
ak.to_list(bikeroutes.features[751])

{'type': 'Feature',
 'properties': {'STREET': 'E 26TH ST',
  'TYPE': '1',
  'BIKEROUTE': 'EXISTING BIKE LANE',
  'F_STREET': 'S STATE ST',
  'T_STREET': 'S DR MARTIN LUTHER KING JR DR'},
 'geometry': {'type': 'MultiLineString',
  'coordinates': [[[-87.62685625163756, 41.84558714841179],
    [-87.62675996392576, 41.84558902593194],
    [-87.62637708895348, 41.845596494328554],
    [-87.62626461651281, 41.845598326696425],
    [-87.62618268489398, 41.84559966093136],
    [-87.6261438116618, 41.84560027230502],
    [-87.62613206507362, 41.845600474403334],
    [-87.6261027723024, 41.8456009526551],
    [-87.62579736038116, 41.84560626159298],
    [-87.62553890383363, 41.845610239979905],
    [-87.62532611036139, 41.845613593674],
    [-87.6247932635836, 41.84562202574476]],
   [[-87.62532611036139, 41.845613593674],
    [-87.6247932635836, 41.84562202574476]],
   [[-87.6247932635836, 41.84562202574476],
    [-87.62446484629727, 41.84562675013391],
    [-87.62444032614908, 41.8456270927620

Now every record has a `route_length` field.

In [37]:
bikeroutes.features.route_length

<Array [0.241, 0.0971, 0.203, ... 0.347, 0.281] type='1061 * float64'>

That's how these calculations go. If we were to do the same thing with Python for loops, it would be a lot more verbose and slower.

In [38]:
%%timeit

total_length = []
for route in bikeroutes_pyobj["features"]:
    route_length = []
    for polyline in route["geometry"]["coordinates"]:
        segment_length = []
        last = None
        for lng, lat in polyline:
            km_east = lng * 82.7
            km_north = lat * 111.1
            if last is not None:
                dx2 = (km_east - last[0])**2
                dy2 = (km_north - last[1])**2
                segment_length.append(np.sqrt(dx2 + dy2))
            last = (km_east, km_north)

        route_length.append(sum(segment_length))
    total_length.append(sum(route_length))

57.9 ms ± 6.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [39]:
%%timeit

km_east = bikeroutes.features.geometry.coordinates[..., 0] * 82.7
km_north = bikeroutes.features.geometry.coordinates[..., 1] * 111.1

segment_length = np.sqrt((km_east[:, :, 1:] - km_east[:, :, :-1])**2 +
                         (km_north[:, :, 1:] - km_north[:, :, :-1])**2)

route_length = np.sum(segment_length, axis=-1)
total_length = np.sum(route_length, axis=-1)

9.63 ms ± 440 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In the limit, it's a factor of 8 faster. This isn't even _great_ scaling; other examples could easily reach a factor of 100. The point is that we're in the 

![](img/bikeroutes-scaling.png)

The way that the data has been split up into arrays is contained in its `form`. Unless you're developing software on Awkward Array, you probably don't need to know about an array's form.

In [35]:
bikeroutes.layout.form

{
    "class": "RecordArray",
    "contents": {
        "type": {
            "class": "ListArray64",
            "starts": "i64",
            "stops": "i64",
            "content": "uint8",
            "parameters": {
                "__array__": "string"
            }
        },
        "crs": {
            "class": "RecordArray",
            "contents": {
                "type": {
                    "class": "ListArray64",
                    "starts": "i64",
                    "stops": "i64",
                    "content": "uint8",
                    "parameters": {
                        "__array__": "string"
                    }
                },
                "properties": {
                    "class": "RecordArray",
                    "contents": {
                        "name": {
                            "class": "ListArray64",
                            "starts": "i64",
                            "stops": "i64",
                            "content": "uint8",


In general, any data structure can be built from nested nodes that each point to an array.

The key thing is that the number of nodes scales with the data complexity (5 in the example below, 32 for the bike routes), not with data volume.

![](img/example-hierarchy.png)
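A pure-Python sketch of the simplest case, a list of variable-length lists, shows why the node count doesn't grow with the data. The helpers `to_columnar` and `get_item` are hypothetical names for illustration; real layouts use typed array buffers rather than Python lists.

```python
def to_columnar(lists):
    """Flatten a list of lists into (offsets, content)."""
    offsets, content = [0], []
    for inner in lists:
        content.extend(inner)
        offsets.append(len(content))
    return offsets, content

def get_item(offsets, content, i):
    """Recover the i-th inner list from the columnar pieces."""
    return content[offsets[i]:offsets[i + 1]]

small = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
large = small * 1000   # a thousand times more data, same structure

small_offsets, small_content = to_columnar(small)
large_offsets, large_content = to_columnar(large)

print(small_offsets, small_content)
print(get_item(small_offsets, small_content, 2))
print(len(large_offsets), len(large_content))
```

Both datasets are described by the same two pieces, one `offsets` array and one `content` array; only the lengths of those arrays change.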

The CMS example that we'll be using for most of this tutorial has 17 layout nodes in its form but a million events.

"Bookkeeping" code that needs to iterate over slow, dynamically typed nodes iterates over 17 objects. Fast, precompiled, vectorized code runs over the million.

In [None]:
import uproot
cms_dict = uproot.open("data/cms_opendata_2012_nanoaod_DoubleMuParked.root")["Events"].arrays()

cms_events = ak.zip({
    "run": ak.from_awkward0(cms_dict[b"run"]),
    "luminosityBlock": ak.from_awkward0(cms_dict[b"luminosityBlock"]),
    "event": ak.from_awkward0(cms_dict[b"event"]),
    "PV": ak.zip({
        "x": ak.from_awkward0(cms_dict[b"PV_x"]),
        "y": ak.from_awkward0(cms_dict[b"PV_y"]),
        "z": ak.from_awkward0(cms_dict[b"PV_z"])
    }),
    "muons": ak.zip({
        "pt": ak.from_awkward0(cms_dict[b"Muon_pt"]),
        "eta": ak.from_awkward0(cms_dict[b"Muon_eta"]),
        "phi": ak.from_awkward0(cms_dict[b"Muon_phi"]),
        "mass": ak.from_awkward0(cms_dict[b"Muon_mass"]),
        "charge": ak.from_awkward0(cms_dict[b"Muon_charge"]),
        "pfRelIso04_all": ak.from_awkward0(cms_dict[b"Muon_pfRelIso04_all"]),
        "tightId": ak.from_awkward0(cms_dict[b"Muon_tightId"])
    })
}, depth_limit=1)

cms_events

In [None]:
ak.type(cms_events)

In [None]:
cms_events.layout.form

In [None]:
len(cms_events)