## Awkward arrays: jaggedness and more

In [04-ttree-data-pyroot.ipynb](04-ttree-data-pyroot.ipynb), we saw some examples of jagged and object data. Uproot uses a package called "awkward" to deal with them.

This section focuses on various kinds of awkward arrays and what you can do with them (including making them less awkward: into pure Numpy arrays!).

Everything that comes out of uproot is a Numpy array, a `JaggedArray`, a `Table`, an `ObjectArray`, or some combination.

In [None]:
import uproot
a = uproot.open("http://scikit-hep.org/uproot/examples/HZZ.root")["events"].array("Muon_Px")
a

In [None]:
type(a)

In [None]:
type(a.content)

In [None]:
b = uproot.open("http://scikit-hep.org/uproot/examples/HZZ-objects.root")["events"].array("muonp4")
b

In [None]:
type(b)

In [None]:
type(b.content)

In [None]:
type(b.content.content)

In [None]:
type(b.content.content.contents["fX"])

If ROOT managed to "split" the objects into columns, then the data are in a columnar state: each attribute represented by a contiguous array.

In [None]:
b.content.content.contents["fX"]

In [None]:
b.content.content.contents["fY"]

Even if the data are "unsplit," they're presented as a bag of bytes and a Python function to interpret them, as an `ObjectArray`.

In [None]:
c = uproot.open("http://scikit-hep.org/uproot/examples/Event.root")["T"].array("fH")
c

In [None]:
c.content     # bags of bytes, for each entry

In [None]:
c.generator   # interpretation class

In [None]:
c[500].show()

A `JaggedArray` is a list of unequal-sized sublists, encoded as a contiguous `content` array divided up by an array of `offsets`.

In [None]:
import awkward
x = awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
x

In [None]:
x.content

In [None]:
x.offsets
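
The encoding can be illustrated without awkward itself. The following is a minimal pure-Python sketch of how `content` and `offsets` together describe the jagged array above; the helper name `sublists` is hypothetical and not part of the awkward API.

```python
# Flat content and offsets for [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
content = [1.1, 2.2, 3.3, 4.4, 5.5]
offsets = [0, 3, 3, 5]   # sublist i spans content[offsets[i]:offsets[i + 1]]

def sublists(content, offsets):
    """Rebuild the nested lists from the flat encoding (hypothetical helper)."""
    return [content[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]

jagged = sublists(content, offsets)

# the length of each sublist is the difference of neighbouring offsets
counts = [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]
```

An empty sublist costs nothing in `content`: it shows up only as a repeated value in `offsets`.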

A `Table` is an array of `Row` records, encoded as one contiguous array per column in its `contents` dict.

In [None]:
x = awkward.fromiter([{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}])
x

In [None]:
x.tolist()

In [None]:
x.contents["x"]

In [None]:
x.contents["y"]
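
As a rough picture of what "columnar" means here, the sketch below stores the same three records as plain Python lists, one per field, and assembles a row only when asked. The helper `row` is hypothetical; it is an illustration of the layout, not awkward's implementation.

```python
# Columnar storage: one list per field, all the same length
contents = {"x": [1, 2, 3], "y": [1.1, 2.2, 3.3]}

def row(contents, i):
    """Assemble record i from the columns (hypothetical helper)."""
    return {name: column[i] for name, column in contents.items()}

# materializing every row gives back the list of records
rows = [row(contents, i) for i in range(len(contents["x"]))]
```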

An `ObjectArray` is a virtual array of objects, represented by some array `content` and a `generator` that creates each object on demand.

In [None]:
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __repr__(self):
        return "Point({0}, {1})".format(self.x, self.y)

x = awkward.fromiter([Point(1, 1.1), Point(2, 2.2), Point(3, 3.3)])
x

In [None]:
x.content

In [None]:
x.content.contents["x"]

In [None]:
x.content.contents["y"]
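
The "content plus generator" idea can be sketched in a few lines of plain Python. The class `LazyObjects` below is a hypothetical stand-in for an `ObjectArray`, written only to show that objects are built on access rather than stored.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __repr__(self):
        return "Point({0}, {1})".format(self.x, self.y)

class LazyObjects:
    """Hypothetical stand-in for an ObjectArray: columns plus a generator."""
    def __init__(self, content, generator):
        self.content = content        # dict of columns, as in a Table
        self.generator = generator    # callable that builds one object
    def __len__(self):
        return len(next(iter(self.content.values())))
    def __getitem__(self, i):
        # the object is constructed only when it is asked for
        fields = {name: column[i] for name, column in self.content.items()}
        return self.generator(**fields)

points = LazyObjects({"x": [1, 2, 3], "y": [1.1, 2.2, 3.3]}, Point)
second = points[1]
```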

## Jagged operations

As much as possible, awkward arrays act like Numpy arrays.

In [None]:
x = awkward.fromiter([[1.1, 2.2, 3.3, 4.4], [5.5, 6.6], [7.7, 8.8, 9.9]])
x

In [None]:
# take the first two inner lists
x[:2]

In [None]:
# take the first two numbers in each inner list
x[:, :2]

In [None]:
# mask outer lists
x[[True, False, True]]

In [None]:
# mask inner lists
x[awkward.fromiter([[True, False, True, False], [False, True], [True, True, False]])]
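
For comparison, here is what the four selections above would look like with plain Python lists. This is only a sketch of the semantics; awkward performs the same selections on the flat arrays without Python loops.

```python
x = [[1.1, 2.2, 3.3, 4.4], [5.5, 6.6], [7.7, 8.8, 9.9]]

first_two_lists = x[:2]                       # like x[:2]
first_two_each = [sub[:2] for sub in x]       # like x[:, :2]

outer_mask = [True, False, True]              # one boolean per inner list
outer_masked = [sub for sub, keep in zip(x, outer_mask) if keep]

inner_mask = [[True, False, True, False], [False, True], [True, True, False]]
inner_masked = [[v for v, keep in zip(sub, m) if keep]
                for sub, m in zip(x, inner_mask)]
```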

Reductions (min, max, sum, ...) turn Numpy arrays into scalars and jagged arrays into flat Numpy arrays.

In [None]:
x

In [None]:
x.min()

In [None]:
x.max()

Empty sublists return the identity element of the reduction operation. (Strictly this is a monoid identity rather than a group identity, since max and min have no inverses.)

In [None]:
x = awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
x

In [None]:
x.sum()

In [None]:
x.max()    # what's the identity of max? of min?
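
One way to see where those values come from is to write the reduction as a loop that starts from the identity element. The sketch below uses plain Python and a hypothetical helper `reduce_jagged`; it assumes the usual conventions (0 for sum, negative infinity for max, positive infinity for min) and is not awkward's implementation.

```python
content = [1.1, 2.2, 3.3, 4.4, 5.5]
offsets = [0, 3, 3, 5]          # encodes [[1.1, 2.2, 3.3], [], [4.4, 5.5]]

def reduce_jagged(op, identity, content, offsets):
    """Reduce each sublist with `op`, starting from `identity` (hypothetical helper)."""
    out = []
    for start, stop in zip(offsets[:-1], offsets[1:]):
        result = identity
        for value in content[start:stop]:
            result = op(result, value)
        out.append(result)       # an empty sublist leaves `identity` untouched
    return out

sums = reduce_jagged(lambda a, b: a + b, 0.0, content, offsets)
maxes = reduce_jagged(max, float("-inf"), content, offsets)
mins = reduce_jagged(min, float("inf"), content, offsets)
```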

There's also an equivalent of `argmin/argmax` that returns jagged arrays of indexes.

In [None]:
x = awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
x

In [None]:
indexes = x.argmax()
indexes

What's this useful for? Maximizing by one attribute and applying to another.

In [None]:
y = awkward.fromiter([[300, 200, 100], [], [500, 400]])
y

In [None]:
y[indexes]
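
The same pattern in plain Python, as a sketch: find the position of the maximum in each sublist of `x`, then use those positions to pick from `y`. The helper `jagged_argmax` is hypothetical, and the one-element-list shape of its output is an assumption about how the jagged result is laid out.

```python
x = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
y = [[300, 200, 100], [], [500, 400]]

def jagged_argmax(lists):
    """Per sublist: [position of the maximum], or [] if the sublist is empty."""
    return [[max(range(len(sub)), key=sub.__getitem__)] if sub else []
            for sub in lists]

indexes = jagged_argmax(x)

# apply the positions found in x to the matching sublists of y
picked = [[ysub[i] for i in isub] for ysub, isub in zip(y, indexes)]
```

Keeping the result jagged means an empty sublist simply yields an empty selection instead of requiring a placeholder index.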

In Numpy, selecting elements by an array of indexes is called "fancy indexing."

Numpy's "universal functions" can be applied to awkward arrays. They apply element-by-element and maintain structure.

In [None]:
x

In [None]:
import numpy
numpy.sqrt(x)

This allows us to compute things with awkward arrays as we would Numpy arrays—as long as the structure matches.

In [None]:
x + y**2

This is how all Lorentz vector methods are implemented: array-by-array operations.

In [None]:
b    # a JaggedArray of TLorentzVector objects that also has TLorentzVector methods

In [None]:
b.pt

The `.pt` property is implemented as:

In [None]:
numpy.sqrt(b["fX"]**2 + b["fY"]**2)
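
The reason this works on jagged data is that the arithmetic only ever touches the flat content; the offsets are carried along unchanged. A minimal sketch with made-up momentum components (the values and variable names are illustrative assumptions):

```python
import math

offsets = [0, 2, 2, 3]          # three events with 2, 0, and 1 muons
fX = [3.0, -6.0, 5.0]           # made-up x components, one per muon
fY = [4.0, 8.0, 12.0]           # made-up y components, one per muon

# the calculation runs over the flat content only
pt_content = [math.sqrt(px**2 + py**2) for px, py in zip(fX, fY)]

# the original offsets are reused to present the result as jagged
pt = [pt_content[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
```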

**Physics case:** add the first and second muon of each event to get Z masses.

In [None]:
first = b[:, 0]
second = b[:, 1]         # fails if any event has fewer than two muons

In [None]:
hastwo = (b.counts >= 2)
hastwo

In [None]:
first = b[hastwo, 0]
second = b[hastwo, 1]
first                    # an ObjectArray with TLorentzVector methods, but no longer jagged

In [None]:
(first + second).mass

In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot
matplotlib.pyplot.hist((first + second).mass, 100);
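
For reference, the quantity being histogrammed is the invariant mass of the summed four-vectors, m² = E² − |p|². The sketch below spells that out for a single pair using plain tuples; the function name, the tuple layout `(px, py, pz, E)`, and the clamp on negative m² are choices made for this illustration, not the library's implementation.

```python
import math

def mass_of_sum(p1, p2):
    """Invariant mass of two four-vectors given as (px, py, pz, E) tuples."""
    px, py, pz, e = (a + b for a, b in zip(p1, p2))
    m2 = e**2 - (px**2 + py**2 + pz**2)
    return math.sqrt(max(m2, 0.0))   # clamp small negative rounding errors

# made-up back-to-back massless muons, 45 GeV each
mu1 = (45.0, 0.0, 0.0, 45.0)
mu2 = (-45.0, 0.0, 0.0, 45.0)
mass = mass_of_sum(mu1, mu2)
```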

As more analysis groups use awkward arrays, we add more functions for dealing with complex cases.

In [None]:
x = awkward.fromiter([[1.1, 2.2, 3.3, 4.4, 5.5], [], [6.6, 7.7, 8.8]])
x

In [None]:
x.pad(4)                                     # ensure at least four elements

In [None]:
x.pad(4, clip=True)                          # exactly four elements

In [None]:
x.pad(4, clip=True).fillna(-1000)            # turn "None" into -1000

In [None]:
x.pad(4, clip=True).fillna(-1000).regular()  # and make it a plain ol' Numpy array
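
The pad, clip, and fill steps can be mimicked with plain lists to make the intermediate shapes explicit. The functions `pad` and `fillna` below are hypothetical re-implementations for illustration only.

```python
x = [[1.1, 2.2, 3.3, 4.4, 5.5], [], [6.6, 7.7, 8.8]]

def pad(lists, n, clip=False):
    """Extend short sublists with None up to length n; optionally cut long ones."""
    out = []
    for sub in lists:
        padded = sub + [None] * max(0, n - len(sub))
        out.append(padded[:n] if clip else padded)
    return out

def fillna(lists, value):
    """Replace every None with `value`."""
    return [[value if v is None else v for v in sub] for sub in lists]

padded = pad(x, 4)                                   # long sublists keep their length
rectangular = fillna(pad(x, 4, clip=True), -1000)    # every sublist has length 4
```

Once every sublist has the same length and no missing values, the data fit a regular two-dimensional array.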

In [None]:
x

In [None]:
y = awkward.fromiter([[100, 200], [300], [400, 500]])
y

In [None]:
awkward.JaggedArray.concatenate([x, y])

In [None]:
awkward.JaggedArray.concatenate([x, y], axis=1)
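
In list terms, the two axes mean "append the events of `y` after those of `x`" versus "join the sublists event by event". A plain-Python sketch of that distinction:

```python
x = [[1.1, 2.2, 3.3, 4.4, 5.5], [], [6.6, 7.7, 8.8]]
y = [[100, 200], [300], [400, 500]]

axis0 = x + y                                # six sublists: x's, then y's
axis1 = [xs + ys for xs, ys in zip(x, y)]    # three sublists, each one longer
```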

Combinatorics: emulating nested "for" loops.

In [None]:
x = awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
y = awkward.fromiter([[10, 20], [30], [40]])

In [None]:
z = x.cross(y)
z

In [None]:
z.i0

In [None]:
z.i1
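
Written out as the nested "for" loops it emulates, the cross product looks like the sketch below. The variable names `i0` and `i1` mirror the attributes above; the loop itself is only an illustration of the pairing, not how awkward computes it.

```python
x = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
y = [[10, 20], [30], [40]]

i0, i1 = [], []
for xs, ys in zip(x, y):          # event by event
    left, right = [], []
    for a in xs:                  # outer loop over x's sublist
        for b in ys:              # inner loop over y's sublist
            left.append(a)
            right.append(b)
    i0.append(left)
    i1.append(right)
```

An event with an empty sublist on either side produces no pairs at all.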

Using combinatorics to do the Z peak better: use all muons, not just the first two.

In [None]:
b

In [None]:
pairs = b.cross(b)
(pairs.i0 + pairs.i1).mass

Why are some masses `2*0.106`?

Because that is the mass of a muon four-vector added to itself: `b.cross(b)` also pairs each muon with itself.

Now without double-counting.

In [None]:
pairs = b.distincts()   # like b.cross(b), but taking only pairs above the diagonal
pairs

In [None]:
(pairs.i0 + pairs.i1).mass
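
"Pairs above the diagonal" is the same idea as `itertools.combinations` applied within each event. A small sketch with placeholder labels instead of real muons (the ordering of pairs inside an event is an assumption and may differ from awkward's):

```python
from itertools import combinations

# placeholder labels standing in for the muons of four events
b = [["mu0", "mu1", "mu2"], ["mu0"], [], ["mu0", "mu1"]]

# every unordered pair of distinct muons, per event
pairs = [list(combinations(event, 2)) for event in b]
```

Events with fewer than two muons contribute no pairs, and no muon is ever paired with itself.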

More involved example from a CMS analysis: jet cleaning.

In [None]:
dataset = uproot.open("http://scikit-hep.org/uproot/examples/HZZ-objects.root")["events"]
muons = dataset.array("muonp4")
jets = dataset.array("jetp4")

In [None]:
def ΔR(combinations):
    return combinations.i0.delta_r(combinations.i1)

combinations = jets.cross(muons, nested=True)     # nested=True means make a doubly jagged array; "any()" reduces one level
jets[~(ΔR(combinations) < 0.5).any()]             # "jets for which not (~) any combination has ΔR < 0.5"
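
The same selection, spelled out with explicit loops over made-up `(eta, phi)` pairs, may make the logic easier to follow. The function `delta_r` and the sample values are assumptions for this sketch; the real `delta_r` method works on four-vector arrays.

```python
import math

def delta_r(a, b):
    """Angular distance between two (eta, phi) pairs, with phi wrapped to [-pi, pi)."""
    deta = a[0] - b[0]
    dphi = (a[1] - b[1] + math.pi) % (2 * math.pi) - math.pi
    return math.hypot(deta, dphi)

# made-up (eta, phi) values for two events
jets = [[(0.1, 0.2), (1.5, -2.0)], [(0.0, 3.1)]]
muons = [[(0.15, 0.25)], []]

# keep a jet only if no muon in the same event lies within 0.5 of it
cleaned = [[jet for jet in js if not any(delta_r(jet, mu) < 0.5 for mu in ms)]
           for js, ms in zip(jets, muons)]
```

In an event with no muons, `any()` over an empty sequence is false, so every jet survives.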

Numpy has a concept of "broadcasting," in which an array and a scalar may be combined element by element by duplicating the scalar to match the array.

In [None]:
numpy.array([1.1, 2.2, 3.3, 4.4, 5.5]) + 100

The jagged equivalent of this is broadcasting a Numpy array to match a jagged array:

In [None]:
x

In [None]:
x + numpy.array([100, 200, 300])

**Physics case:** consider a jagged array of timing measurements.

In [None]:
times = awkward.fromiter([[4.4, 2.6, 3.5, -0.6], [1.8, 7.4], [], [9.5, 5.2, 8.5]])   # in picoseconds, probably
times

Time-zero corrections (`t0`) may be applied globally:

In [None]:
times - 0.6

Or they may be applied per event:

In [None]:
times - numpy.array([0.6, 1.2, -0.4, 3.3])
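
Conceptually, the per-event correction is repeated once for each measurement in that event before subtracting. A plain-Python sketch of that expansion (awkward does the equivalent on flat arrays):

```python
times = [[4.4, 2.6, 3.5, -0.6], [1.8, 7.4], [], [9.5, 5.2, 8.5]]
t0 = [0.6, 1.2, -0.4, 3.3]          # one correction per event

# repeat each event's correction to line up with the flat content
counts = [len(event) for event in times]
t0_repeated = [c for c, n in zip(t0, counts) for _ in range(n)]

# the subtraction then pairs every time with its event's correction
corrected = [[t - c for t in event] for event, c in zip(times, t0)]
```

The correction for the empty third event is simply never used.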

Or by detector id:

In [None]:
times

In [None]:
detid = awkward.fromiter([[101, 274, 101, 97], [274, 97], [], [101, 634, 274]])
detid

In [None]:
lookup = awkward.SparseArray(1000, [97, 101, 274, 634], [0.1, 0.2, 0.3, 0.4])  # 97 → 0.1, 101 → 0.2, 274 → 0.3, 634 → 0.4
lookup

In [None]:
corrections = awkward.JaggedArray.fromoffsets(detid.offsets, lookup[detid.content])
corrections

In [None]:
times - corrections
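
The steps above amount to a table lookup applied to the flat detector ids, with the result reshaped by the same offsets. In plain Python, with a dict standing in for the `SparseArray`, the idea can be sketched as:

```python
times = [[4.4, 2.6, 3.5, -0.6], [1.8, 7.4], [], [9.5, 5.2, 8.5]]
detid = [[101, 274, 101, 97], [274, 97], [], [101, 634, 274]]
lookup = {97: 0.1, 101: 0.2, 274: 0.3, 634: 0.4}   # detector id -> correction

# same jagged shape as detid, with each id replaced by its correction
corrections = [[lookup[d] for d in event] for event in detid]

corrected = [[t - c for t, c in zip(ts, cs)]
             for ts, cs in zip(times, corrections)]
```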

## Other awkward types

The last example used a `SparseArray`, which I haven't explained yet. The awkward library has quite a few array classes, all mutually composable:

In [None]:
[x for x in dir(awkward) if "Array" in x]

<table style="font-size: 22pt; margin-top: 50px">
    <tr style="font-weight: bold"><td>Array type</td><td>Purpose</td><td>Members</td><td>Usage</td></tr>
    <tr><td>JaggedArray</td><td>variable-sized data structures</td><td>starts, stops, content</td><td>ubiquitous</td></tr>
    <tr><td>Table</td><td>struct-like objects in columns</td><td>contents (dict)</td><td>ubiquitous</td></tr>
    <tr><td>ObjectArray</td><td>arbitrary Python types on demand</td><td>generator, content</td><td>common</td></tr>
    <tr><td>Methods</td><td>mix-in methods and properties on any array type</td><td>(none)</td><td>common</td></tr>
    <tr><td>MaskedArray</td><td>allow nullable values (None)</td><td>mask (bytes), content</td><td>occasional</td></tr>
    <tr><td>BitMaskedArray</td><td>same, but with a bit-mask</td><td>mask (bits), content</td><td>from Arrow</td></tr>
    <tr><td>IndexedMaskedArray</td><td>same, but with dense content</td><td>mask-index (integers), content</td><td>rare</td></tr>
    <tr><td>IndexedArray</td><td>lazy fancy indexing: "pointers"</td><td>index, content</td><td>rare</td></tr>
    <tr><td>SparseArray</td><td>huge array defined at a few indexes</td><td>index, content, default</td><td>rare</td></tr>
    <tr><td>UnionArray</td><td>heterogeneous types or data sources</td><td>tags, index, contents (list)</td><td>rare</td></tr>
    <tr><td>StringArray</td><td>special case: jagged array of char</td><td>starts, stops, content, string methods</td><td>common</td></tr>
    <tr><td>ChunkedArray</td><td>discontiguous array presented as a whole</td><td>counts, chunks (lists)</td><td>from Parquet</td></tr>
    <tr><td>AppendableArray</td><td>chunked allocation for efficient appending</td><td>counts, chunks (lists)</td><td>rare</td></tr>
    <tr><td>VirtualArray</td><td>array generated from a function when needed</td><td>generator, possible cached array</td><td>from Parquet</td></tr>
</table>

Taken together, this allows for some complex data structures, all backed by arrays.

In [None]:
array = awkward.fromiter([[1.1, 2.2, None, 3.3, None],
                          [4.4, [5.5]],
                          [{"x": 6, "y": {"z": 7}}, None, {"x": 8, "y": {"z": 9}}]
                         ])
array

In [None]:
print(array.type)

"An array of 3 elements, containing arrays of any number of elements, containing nullable (`?`) data that may be `float64`, jagged arrays of `float64`, or records with fields `"x"` (`int64`) and `"y"` (records with single field `"z"` (`int64`))."

All the same broadcasting and slicing rules apply. They are complex data structures with Numpy idioms.

In [None]:
array.tolist()

In [None]:
(array + 100).tolist()

In [None]:
array[:, -2:].tolist()

In [None]:
# get Higgs → ZZ events
tree = uproot.open("http://scikit-hep.org/uproot/examples/HZZ.root")["events"]

# make a Table of MET (missing energy, one per event)
events = awkward.Table(tree.arrays(["MET_px", "MET_py"], namedecode="utf-8"))

# add a jagged table (JaggedArray of Table) so muons share a single "offsets"
events["muons"] = awkward.JaggedArray.zip(tree.arrays(["Muon_Px", "Muon_Py", "Muon_Pz"], namedecode="utf-8"))

# add a jagged table of jets in the same way
events["jets"] = awkward.JaggedArray.zip(tree.arrays(["Jet_Px", "Jet_Py", "Jet_Pz"], namedecode="utf-8"))

# here they are
events

In [None]:
print(events.type)

In [None]:
events[0].tolist()

In [None]:
events[3].tolist()

## Persistence

These data structures can be saved to and restored from disk in a variety of formats. (Not yet for ROOT, but that's planned for this summer.)

In [None]:
!rm tmp.awkd

In [None]:
awkward.save("tmp.awkd", events)             # this is just like numpy.save

In [None]:
awkward.load("tmp.awkd")[3].tolist()         # and numpy.load

In [None]:
!rm tmp.hdf5

In [None]:
import h5py
file = awkward.hdf5(h5py.File("tmp.hdf5", "a"))   # wrap HDF5 file (opened read/write, created if missing) as an awkward HDF5 file
file

In [None]:
file["events"] = events                      # translates awkward structures into groups of flat arrays

In [None]:
file["events"][3].tolist()                   # and translates back

Parquet is a format for columnar data. Support for it here is currently limited to jagged arrays, not jagged tables.

In [None]:
tree = uproot.open("http://scikit-hep.org/uproot/examples/HZZ.root")["events"]
events = awkward.Table(tree.arrays(["MET_px", "MET_py", "Muon_Px", "Muon_Py", "Muon_Pz"], namedecode="utf-8"))
print(events.type)

In [None]:
awkward.toparquet("tmp.parquet", events)

In [None]:
reconstituted = awkward.fromparquet("tmp.parquet")
reconstituted

In [None]:
reconstituted.chunks[0]                       # chunks are Parquet "row groups"

In [None]:
reconstituted.chunks[0].contents["MET_px"]    # fields are VirtualArrays: read on demand

In [None]:
reconstituted.chunks[0].contents["MET_px"].array

In [None]:
print(events.type, end="\n\n")                # all data from Parquet is, in principle, nullable
print(reconstituted.type)

The high-level type does not indicate that the data contain `ChunkedArrays` and `VirtualArrays` (or `IndexedArrays`, if it had them). Those are low-level details of how the data are delivered.

In [None]:
events[3].tolist()

In [None]:
reconstituted[3].tolist()

## Getting fancy: cross-references

In [None]:
data = awkward.fromiter([
    {"tracks": [{"phi": 1.0}, {"phi": 2.0}],
     "hits": [{"detid": 100, "pos": 3.7}, {"detid": 50, "pos": 2.1}, {"detid": 75, "pos": 2.5}]},
    {"tracks": [{"phi": 1.5}],
     "hits": [{"detid": 100, "pos": 1.4}, {"detid": 50, "pos": 0.7}, {"detid": 75, "pos": 3.0}]}])
print(data.type)

In [None]:
data["tracks"]["hits-on-track"] = \
    awkward.JaggedArray.fromcounts([2, 1],
        awkward.JaggedArray.fromcounts([2, 2, 1, 1],
            awkward.IndexedArray([0, 1, 1, 2, 3, 5],
                data["hits"].content)))

In [None]:
data.tolist()

In [None]:
data["hits"]["pos"] = data["hits"]["pos"] - 0.5

In [None]:
data.tolist()
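
The point of building `hits-on-track` from an `IndexedArray` is that tracks hold positions in the hit collection rather than copies of the hits, so (presumably) a change to the hits is visible through the tracks. A plain-Python sketch of that idea, with hypothetical names and only the first event's hits:

```python
hits = [{"detid": 100, "pos": 3.7}, {"detid": 50, "pos": 2.1}, {"detid": 75, "pos": 2.5}]
hits_on_track = [[0, 1], [1, 2]]     # per track: positions in `hits`, not copies

def resolve(track_hits):
    """Look up the hit records a track points to (hypothetical helper)."""
    return [hits[i] for i in track_hits]

before = [h["pos"] for h in resolve(hits_on_track[0])]

for h in hits:                       # update the shared hit records once
    h["pos"] = round(h["pos"] - 0.5, 6)

after = [h["pos"] for h in resolve(hits_on_track[0])]
```

A hit shared by two tracks (index 1 here) is stored once and updated once.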

## Getting fancier: cyclic references

In [None]:
infinite_well = awkward.JaggedArray([0], [1], [12345])
infinite_well.content = infinite_well

In [None]:
len(infinite_well), len(infinite_well[0]), len(infinite_well[0, 0]), len(infinite_well[0, 0, 0])

In [None]:
infinite_well
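
Plain Python lists can form the same kind of cycle, which may help with intuition. This is an analogy only, not how `JaggedArray` stores its content:

```python
well = []
well.append(well)      # the list's only element is the list itself

# indexing can go on forever and always lands on the same object
depth_three = well[0][0][0]
```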

In [None]:
tree = awkward.fromiter([
    {"value": 1.23, "left":    1, "right":    2},     # node 0   (how trees are often
    {"value": 3.21, "left":    3, "right":    4},     # node 1    stored in DataFrames)
    {"value": 9.99, "left":    5, "right":    6},     # node 2
    {"value": 3.14, "left":    7, "right": None},     # node 3
    {"value": 2.71, "left": None, "right":    8},     # node 4
    {"value": 5.55, "left": None, "right": None},     # node 5
    {"value": 8.00, "left": None, "right": None},     # node 6
    {"value": 9.00, "left": None, "right": None},     # node 7
    {"value": 0.00, "left": None, "right": None},     # node 8
])
left = tree.contents["left"].content
right = tree.contents["right"].content
left[(left < 0) | (left > 8)] = 0         # satisfy overzealous validity checks
right[(right < 0) | (right > 8)] = 0
tree.contents["left"].content = awkward.IndexedArray(left, tree)
tree.contents["right"].content = awkward.IndexedArray(right, tree)

In [None]:
tree[0].tolist()
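
What the `IndexedArray` columns achieve can be sketched with three plain lists and a recursive lookup: `left` and `right` hold node numbers that act as pointers into the same table. The function `expand` is a hypothetical helper for this illustration.

```python
value = [1.23, 3.21, 9.99, 3.14, 2.71, 5.55, 8.00, 9.00, 0.00]
left = [1, 3, 5, 7, None, None, None, None, None]
right = [2, 4, 6, None, 8, None, None, None, None]

def expand(i):
    """Follow the index 'pointers' from node i down to the leaves."""
    if i is None:
        return None
    return {"value": value[i], "left": expand(left[i]), "right": expand(right[i])}

root = expand(0)
```

The flat table stays the single source of truth; the nested view is generated by following indexes.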