# Columnar analysis with Awkward Array

## How this works as a hands-on tutorial

Even though I don't have formal exercises scattered throughout these notebooks, this session can still be interactive.

   * **You** should open each notebook in Binder (see [GitHub README](https://github.com/jpivarski/2020-06-08-uproot-awkward-columnar-hats)) and evaluate cells, following along with me.
   * **I** should pause frequently and stay open to questions. I'll be monitoring the videoconference chat.
   * **We** should feel free to step off the path and try to answer "What if?" questions in real time.

Not all digressions will lead to an answer—I often realize, "That's why it didn't work!" long after the tutorial is over—but tinkering is how we learn.

Consider this a tour and I'm your guide. The planned route is a suggestion to get things started, but your questions are more important.

(Also, I'm awful at writing formal exercises; they end up being too easy _and_ too hard.)

<br><br><br>

## Array-based programming

One of the first programming languages, **APL** ("A Programming Language"), was array-based. It started as a notation for _describing_ hand-written machine code and was later made interactive.

**Nial** was also theoretically motivated, and these two inspired a generation of direct descendants (green in the timeline below).

Meanwhile, the **S** language for statistics borrowed many of these ideas while being focused on a particular domain. Its descendant, **R**, is still widely used.

**IDL** was invented for the sciences and gained a lot of traction as an alternative to writing custom Fortran, again using vectorization as a first-class concept.

**MATLAB** similarly gained traction in the sciences as a commercial product.

**PDL** (Perl Data Language) and **NumPy** introduced the same concepts as libraries within an established language (Perl and Python). **Julia** has some vector-like interfaces, though its focus is on just-in-time compiling imperative code.

![](img/apl-timeline.png)

<br><br><br>

Common features of array-based languages:

   * Arrays are the central data type with most operations applying to arrays. (By contrast, C requires explicit iteration over the arrays: it's imperative.)
   * They are _all_ interactive languages. The array-at-a-time logic makes it possible to define precompiled routines that run in response to user commands.
   * They are primarily data analysis languages, highly targeted to the sciences and statistics.
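
As a minimal sketch of the first point, here is the same calculation written both ways in NumPy. The values and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical data: five measurements.
values = np.array([12.5, 48.1, 3.2, 27.9, 60.4])

# Imperative style: the loop is spelled out and runs in the Python interpreter.
scaled_loop = []
for value in values:
    scaled_loop.append(value * 1.02)

# Array-at-a-time style: one expression; the loop runs inside precompiled NumPy code.
scaled_array = values * 1.02

# Selections work the same way: a boolean array picks elements without an explicit loop.
large_values = values[values > 20.0]

print(scaled_array)
print(large_values)
```

Both styles compute the same numbers. The difference is where the loop lives, and that is what lets a short interactive command run at compiled speed.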

In retrospect, it sounds like a perfect fit.

<br><br><br>

In this plot of the "astronomical" rise of Python, note that 2 of the 3 languages it's displacing are array languages.

![](img/mentions-of-programming-languages.png)

<br><br><br>

## Why not for particle physics?

Because **data structures**. Particle physicists have _always_ needed to deal with complex data structures, so much so that we invented packages to add them to Fortran.

The following is from [_Initiation to HYDRA_ by R.K. Böck (1976)](https://cds.cern.ch/record/864527?ln=en) as part of an explanation of what a "data structure" is, at a time before Fortran had `FOR` loops. (HYDRA was merged into ZEBRA, which became the basis for ROOT I/O.)

We would draw similar diagrams today.

![](img/hydra-2.png)

<br><br><br>

But the modify-compile-rerun cycle of C++ is too long for interactive data analysis. That's why ROOT invented CINT and then Cling.

But C++ is too complex a language for data-focused tasks. That's why I was thinking a lot about [extending query languages (like SQL) to data structures](https://stackoverflow.com/questions/38831961/what-declarative-language-is-good-at-analysis-of-tree-like-data).

But I was surprised by how useful the simple JaggedArray class in Uproot turned out to be. My conclusion was that you don't need a new language, just some data types and operations.

![](img/uproot-awkward-timeline.png)

<br><br><br>

<font size="15">That's what </font><img src="img/awkward-logo-300px.png" style="vertical-align:middle"><font size="15"> is.</font>


Just arrays, but with awkward shapes.

![](img/cartoon-schematic.png)

<br><br><br>

## Let's start with a non-physics example

Chicago bike paths [as a GeoJSON file](https://github.com/Chicago/osd-bike-routes/blob/master/data/Bikeroutes.geojson).

In [8]:
import json

bikeroutes_json = open("data/Bikeroutes.geojson").read()
bikeroutes_pyobj = json.loads(bikeroutes_json)

print(bikeroutes_json[:1000])

{
"type": "FeatureCollection",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
                                                                                
"features": [
{ "type": "Feature", "properties": { "STREET": "W FULLERTON AVE", "TYPE": "4", "BIKEROUTE": "RECOMMENDED BIKE ROUTE", "F_STREET": "W GRAND AVE", "T_STREET": "W GRAND AVE" }, "geometry": { "type": "MultiLineString", "coordinates": [ [ [ -87.788572682391163, 41.923652047961923 ], [ -87.788645591836797, 41.923651405921802 ], [ -87.788844988373143, 41.923649881816345 ], [ -87.788950897155686, 41.923649066751238 ], [ -87.789091716222416, 41.923648060311677 ], [ -87.789279058225759, 41.923646752738705 ], [ -87.789396101278086, 41.923645907691565 ], [ -87.789781787237345, 41.923641532135363 ], [ -87.789851574317836, 41.923640764779442 ], [ -87.78989703352525, 41.923640238994977 ], [ -87.790052557345319, 41.923638477701097 ], [ -87.790253898265888, 41.92363731709856 ], [ -87.7903925124

It has a lot of structure—metadata and street names mixed in with the longitude, latitude points. But we can read it in as arrays.

In [9]:
import awkward1 as ak

bikeroutes = ak.Record(bikeroutes_pyobj)
bikeroutes

<Record ... [-87.7, 42], [-87.7, 42]]]}}]} type='{"type": string, "crs": {"type"...'>

The analysis-relevant information about the data's structure is contained in its `type`.

In [10]:
ak.type(bikeroutes)

{"type": string, "crs": {"type": string, "properties": {"name": string}}, "features": var * {"type": string, "properties": {"STREET": string, "TYPE": string, "BIKEROUTE": string, "F_STREET": string, "T_STREET": option[string]}, "geometry": {"type": string, "coordinates": var * var * var * float64}}}

Everything is a `string` or an `option[string]` (i.e. can be `null`) except the coordinates, which are `var * var * var * float64` (a triply jagged array).
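
To make the three `var` levels concrete, here is a rough sketch in plain Python using made-up coordinates for a single feature (not values from the file). Each `var` corresponds to one level of list whose length is allowed to vary.

```python
# Hypothetical coordinates for one feature, shaped like a GeoJSON MultiLineString:
# a list of lines, each a list of points, each a [longitude, latitude] list.
coordinates = [
    [[-87.78, 41.92], [-87.79, 41.92], [-87.80, 41.92]],  # first line: 3 points
    [[-87.65, 41.88], [-87.66, 41.89]],                   # second line: 2 points
]

# Outer "var": how many lines this feature has.
num_lines = len(coordinates)

# Middle "var": how many points each line has.
points_per_line = [len(line) for line in coordinates]

# Inner "var": how many numbers each point has.
numbers_per_point = [[len(point) for point in line] for line in coordinates]

print(num_lines, points_per_line, numbers_per_point)
```

In this sketch every point has exactly two numbers, but the innermost level is still typed as `var`. Presumably that is because the data came from Python lists, which carry no fixed-length guarantee.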

The way that the data has been split up into arrays is contained in its `form`. Unless you're developing software on Awkward Array, you probably don't need to know about an array's form.

In [12]:
bikeroutes.layout.array.form

{
    "class": "RecordArray",
    "contents": {
        "type": {
            "class": "ListOffsetArray64",
            "offsets": "i64",
            "content": "uint8",
            "parameters": {
                "__array__": "string"
            }
        },
        "crs": {
            "class": "RecordArray",
            "contents": {
                "type": {
                    "class": "ListOffsetArray64",
                    "offsets": "i64",
                    "content": "uint8",
                    "parameters": {
                        "__array__": "string"
                    }
                },
                "properties": {
                    "class": "RecordArray",
                    "contents": {
                        "name": {
                            "class": "ListOffsetArray64",
                            "offsets": "i64",
                            "content": "uint8",
                            "parameters": {
                                "__array__"

In general, any data structure can be built from nested nodes that each point to an array.

The key thing is that the number of nodes scales with the data complexity (5 in the example below, 32 for the bike routes), not with data volume.

![](img/example-hierarchy.png)
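
As a rough illustration of a node pointing to arrays, the sketch below represents one jagged list with a flat `content` array and an `offsets` array, in the spirit of the `ListOffsetArray64` nodes in the form above. It uses plain NumPy with made-up data and a hypothetical helper `get_list`; it is not how Awkward Array is implemented internally.

```python
import numpy as np

# Hypothetical jagged data: [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
content = np.array([1.1, 2.2, 3.3, 4.4, 5.5])  # all values, flattened
offsets = np.array([0, 3, 3, 5])               # list i spans content[offsets[i]:offsets[i + 1]]

def get_list(i):
    """Return the i-th list as a slice of the flat content (hypothetical helper)."""
    return content[offsets[i]:offsets[i + 1]]

# Vectorized: the length of every list, with no Python loop over the lists.
counts = offsets[1:] - offsets[:-1]

# Vectorized: per-list sums, taken as differences of a running total at the list boundaries.
cumulative = np.concatenate([[0.0], np.cumsum(content)])
sums = cumulative[offsets[1:]] - cumulative[offsets[:-1]]

print(get_list(0), counts, sums)
```

However many lists there are, this structure is still one node holding two arrays. Adding data makes `content` and `offsets` longer; it does not add nodes.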

The CMS example that we'll be using for most of this tutorial has 17 layout nodes in its form but a million events.

"Bookkeeping" code that needs to iterate over slow, dynamically typed nodes iterates over 17 objects. Fast, precompiled, vectorized code runs over the million.

In [16]:
import uproot
cms_dict = uproot.open("data/cms_opendata_2012_nanoaod_DoubleMuParked.root")["Events"].arrays()

cms_events = ak.zip({
    "run": ak.from_awkward0(cms_dict[b"run"]),
    "luminosityBlock": ak.from_awkward0(cms_dict[b"luminosityBlock"]),
    "event": ak.from_awkward0(cms_dict[b"event"]),
    "PV": ak.zip({
        "x": ak.from_awkward0(cms_dict[b"PV_x"]),
        "y": ak.from_awkward0(cms_dict[b"PV_y"]),
        "z": ak.from_awkward0(cms_dict[b"PV_z"])
    }),
    "muons": ak.zip({
        "pt": ak.from_awkward0(cms_dict[b"Muon_pt"]),
        "eta": ak.from_awkward0(cms_dict[b"Muon_eta"]),
        "phi": ak.from_awkward0(cms_dict[b"Muon_phi"]),
        "mass": ak.from_awkward0(cms_dict[b"Muon_mass"]),
        "charge": ak.from_awkward0(cms_dict[b"Muon_charge"]),
        "pfRelIso04_all": ak.from_awkward0(cms_dict[b"Muon_pfRelIso04_all"]),
        "tightId": ak.from_awkward0(cms_dict[b"Muon_tightId"])
    })
}, depth_limit=1)

cms_events

<Array [{run: 194778, ... ] type='1000000 * {"run": int32, "luminosityBlock": ui...'>

In [17]:
ak.type(cms_events)

1000000 * {"run": int32, "luminosityBlock": uint32, "event": uint64, "PV": {"x": float32, "y": float32, "z": float32}, "muons": var * {"pt": float32, "eta": float32, "phi": float32, "mass": float32, "charge": int32, "pfRelIso04_all": float32, "tightId": bool}}

In [18]:
cms_events.layout.form

{
    "class": "RecordArray",
    "contents": {
        "run": "int32",
        "luminosityBlock": "uint32",
        "event": "uint64",
        "PV": {
            "class": "RecordArray",
            "contents": {
                "x": "float32",
                "y": "float32",
                "z": "float32"
            }
        },
        "muons": {
            "class": "ListOffsetArray64",
            "offsets": "i64",
            "content": {
                "class": "RecordArray",
                "contents": {
                    "pt": "float32",
                    "eta": "float32",
                    "phi": "float32",
                    "mass": "float32",
                    "charge": "int32",
                    "pfRelIso04_all": "float32",
                    "tightId": "bool"
                }
            }
        }
    }
}

In [19]:
len(cms_events)

1000000