# 2019-12-20-coffea-demo

## 1. Introduction

This demo of the new Awkward Array was presented on December 20, 2019, before the final 1.0 version was released. Some interfaces may have changed. To run this notebook, install version 0.1.36 ([GitHub](https://github.com/scikit-hep/awkward-1.0/releases/tag/0.1.36), [pip](https://pypi.org/project/awkward1/0.1.36/)):

```bash
pip install 'awkward1==0.1.36'
```

The basic concepts of Awkward arrays are presented in the [old Awkward README](https://github.com/scikit-hep/awkward-array/tree/0.12.17#readme) and the motivation for a 1.0 rewrite is presented in the [new Awkward README](https://github.com/scikit-hep/awkward-1.0/tree/0.1.32#readme).

In [1]:
# The base of the GitHub repo is two levels up from this notebook.
import sys
import os
sys.path.insert(0, os.path.join(os.getcwd(), "..", ".."))

## 2. High-level array class

The biggest user-facing change is that, instead of mixing NumPy arrays and `JaggedArray` objects, the new Awkward has a single `Array` class for data analysis.

In [2]:
import numpy as np
import awkward1 as ak

array1 = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
array1

<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>

In [3]:
array2 = ak.Array([{"x": 0, "y": []}, {"x": 1, "y": [1.1]}, {"x": 2, "y": [1.1, 2.2]}])
array2

<Array [{x: 0, y: []}, ... y: [1.1, 2.2]}] type='3 * {"x": int64, "y": var * flo...'>

The same `Array` class is used for all data structures, such as the array of lists in `array1` and the array of records in `array2`. (Incidentally, the width of that string representation is exactly large enough to fit into GitHub and StackOverflow text boxes without scrolling!)

There won't be any user-level functions that apply to some data types and not others. The result of an operation is likely type-dependent, but its accessibility is not. (At this time, the only existing operations are conversions like `ak.tolist` and descriptions like `ak.typeof`.)

In [4]:
ak.tolist(array1)

[[1.1, 2.2, 3.3], [], [4.4, 5.5]]

In [5]:
ak.tojson(array1)

'[[1.1,2.2,3.3],[],[4.4,5.5]]'

In [6]:
ak.tolist(array2)

[{'x': 0, 'y': []}, {'x': 1, 'y': [1.1]}, {'x': 2, 'y': [1.1, 2.2]}]

In [7]:
ak.tojson(array2)

'[{"x":0,"y":[]},{"x":1,"y":[1.1]},{"x":2,"y":[1.1,2.2]}]'

In [8]:
ak.typeof(array1)

3 * var * float64

In [9]:
ak.typeof(array2)

3 * {"x": int64, "y": var * float64}

Data types are described using the [datashape language](https://datashape.readthedocs.io/en/latest/). Some Awkward features are [not expressible](https://github.com/blaze/datashape/issues/237) in the current datashape specification, so they're expressed in an extension of the language using the same _style_ of syntax.

## 3. Low-level array classes

The old `JaggedArray` and `Table` are still available, but you have to ask for them explicitly with `layout`. They're not "private" or "internal implementations" (there's no underscore in `layout`): they're public for frameworks like Coffea but hidden from data analysts.

As such, their string representations have more low-level detail: the contents of indexes, rather than what they mean as high-level types. (The XML formatting is just an elaboration on Python's angle-bracket convention for `repr` and the fact that we need to denote nesting.)

In [10]:
array1.layout

<ListOffsetArray64>
    <type>var * float64</type>
    <offsets><Index64 i="[0 3 3 5]" offset="0" at="0x5616f0eb8070"/></offsets>
    <content><NumpyArray format="d" shape="5" data="1.1 2.2 3.3 4.4 5.5" at="0x5616f0eba080">
        <type>float64</type>
    </NumpyArray></content>
</ListOffsetArray64>
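To make the `offsets` and `content` in this representation concrete, here is a minimal pure-Python sketch of how a list-offset layout can be read back as nested lists. The function name `decode_list_offsets` is hypothetical and not part of Awkward; the real layout classes are implemented in C++.

```python
def decode_list_offsets(offsets, content):
    """Rebuild nested lists: list i spans content[offsets[i]:offsets[i + 1]]."""
    return [content[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]

# Values copied from the layout of array1 shown above.
offsets = [0, 3, 3, 5]
content = [1.1, 2.2, 3.3, 4.4, 5.5]

lists = decode_list_offsets(offsets, content)
print(lists)
```

The empty list takes no space in `content`: it appears only as a repeated offset (`3 3`).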

In [11]:
array2.layout

<RecordArray>
    <type>{"x": int64, "y": var * float64}</type>
    <field index="0" key="x">
        <NumpyArray format="l" shape="3" data="0 1 2" at="0x5616f0ec0d00">
            <type>int64</type>
        </NumpyArray>
    </field>
    <field index="1" key="y">
        <ListOffsetArray64>
            <type>var * float64</type>
            <offsets><Index64 i="[0 0 1 3]" offset="0" at="0x5616f0ec2d10"/></offsets>
            <content><NumpyArray format="d" shape="3" data="1.1 1.1 2.2" at="0x5616f0ec4d20">
                <type>float64</type>
            </NumpyArray></content>
        </ListOffsetArray64>
    </field>
</RecordArray>

These classes are defined in C++ and wrapped by pybind11. The `awkward1.Array` class is pure Python. Many of the same operations work for layout classes, though less attention has been paid to their interfaces.

In [12]:
ak.typeof(array1)

3 * var * float64

In [13]:
ak.typeof(array1.layout)

var * float64

In [14]:
ak.tojson(array1)

'[[1.1,2.2,3.3],[],[4.4,5.5]]'

In [15]:
ak.tojson(array1.layout)

'[[1.1,2.2,3.3],[],[4.4,5.5]]'

In [16]:
array1.layout.tojson()

'[[1.1,2.2,3.3],[],[4.4,5.5]]'

## 4. Behavioral mix-ins

In the original Awkward library, we added behaviors to objects and arrays of objects, like computing `pt` or boosting for arrays of Lorentz vectors, by letting structure classes such as `JaggedArray` multiply inherit from classes providing the implementations. That technique was not fully thought-through: it was easy to lose an array's "Lorentzness" when slicing it or performing other operations. It also relies on a Python language feature that can't pass through C++.

It has since become clear that behavioral mix-ins aren't an obscure use-case but a primary one, so their implementation deserves more thought. Adding behaviors to arrays is now a "first-class feature," built into the array types themselves.

In [17]:
class PointClass(ak.Record):
    def __repr__(self):
        return "<Point({}, {})>".format(self["x"], self["y"])
    
    def mag(self):
        return abs(np.sqrt(self["x"]**2 + self["y"]**2))

ak.namespace["Point"] = PointClass

In [18]:
array3 = ak.Array([{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}])
array3

<Array [{x: 1, y: 1.1}, ... {x: 3, y: 3.3}] type='3 * {"x": int64, "y": float64}'>

In [19]:
array3.layout.type

{"x": int64, "y": float64}

Types can have arbitrary parameters, which modify their meaning. These parameters are JSON-encoded and passed through C++ or wherever the arrays get sent.

In [20]:
pointtype = array3.layout.type
pointtype["__class__"] = "Point"
pointtype

struct[["x", "y"], [int64, float64], parameters={"__class__": "Point"}]

In [21]:
pointtype["__str__"] = "PointType[{}, {}]".format(pointtype.field("x"), pointtype.field("y"))
pointtype

PointType[int64, float64]

In [22]:
# There will be a better interface for assigning types...
array4 = ak.Array(array3.layout, type=ak.ArrayType(pointtype, len(array3.layout)))
array4

<Array [<Point(1, 1.1)>, ... <Point(3, 3.3)>] type='3 * PointType[int64, float64]'>

In [23]:
[x.mag() for x in array4]

[1.4866068747318506, 2.973213749463701, 4.459820624195552]

The elements of this array are `PointClass` instances because the `__class__` parameter is `"Point"`, a name that is recognized in Awkward's class namespace. The global namespace is in `ak.namespace`, but custom ones can be passed into the `Array` constructor to turn on/off or change behaviors.

In [24]:
ak.namespace

{'char': awkward1.behavior.string.CharBehavior,
 'string': awkward1.behavior.string.StringBehavior,
 'Point': __main__.PointClass}

As you can see, variable-length strings are also implemented as mix-ins. Apart from this type annotation, a string is just a jagged array of 8-bit integers.

In [25]:
array5 = ak.Array(["Daisy", "Daisy", "give", "me", "your", "answer", "do."])
array5

<Array ['Daisy', 'Daisy', ... 'answer', 'do.'] type='7 * string'>

In [26]:
array5.layout

<ListOffsetArray64>
    <type>string</type>
    <offsets><Index64 i="[0 5 10 14 16 20 26 29]" offset="0" at="0x5616f0edfc00"/></offsets>
    <content><NumpyArray format="B" shape="29" data="0x 44616973 79446169 73796769 76656d65 796f7572 616e7377 6572646f 2e" at="0x5616f0e821e0">
        <type>utf8</type>
    </NumpyArray></content>
</ListOffsetArray64>

In [27]:
ak.tolist(array5.layout)

[[68, 97, 105, 115, 121],
 [68, 97, 105, 115, 121],
 [103, 105, 118, 101],
 [109, 101],
 [121, 111, 117, 114],
 [97, 110, 115, 119, 101, 114],
 [100, 111, 46]]

In [28]:
# Slice it!
ak.tolist(array5.layout[:, 1:])

[[97, 105, 115, 121],
 [97, 105, 115, 121],
 [105, 118, 101],
 [101],
 [111, 117, 114],
 [110, 115, 119, 101, 114],
 [111, 46]]

In [29]:
# Slice it!
array5[:, 1:]

<Array ['aisy', 'aisy', ... 'nswer', 'o.'] type='7 * string'>

Like all behavioral mix-ins, the string interpretation is _only_ applied in the high-level `Array` view, not the layout classes. Thus, a C++ function that generates or uses jagged arrays of Lorentz vectors (e.g. a nice FastJet interface?) does not depend on Python. It only has to manipulate a map of strings.
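The two views of `array5` can be imitated in plain Python. This is only a sketch of the idea, using the offsets and bytes shown above; the helper names `as_lists`, `as_strings`, and `drop_first` are hypothetical and do not exist in Awkward.

```python
# Values copied from the layout of array5 shown above.
offsets = [0, 5, 10, 14, 16, 20, 26, 29]
content = b"DaisyDaisygivemeyouranswerdo."   # 29 bytes of UTF-8

def as_lists(starts, stops, content):
    """Low-level view: each element is a list of 8-bit integers."""
    return [list(content[a:b]) for a, b in zip(starts, stops)]

def as_strings(starts, stops, content):
    """High-level view: the same bytes, interpreted as UTF-8 strings."""
    return [content[a:b].decode("utf-8") for a, b in zip(starts, stops)]

def drop_first(starts, stops):
    """Equivalent of [:, 1:]: only the starts move; no bytes are copied or decoded."""
    return [min(a + 1, b) for a, b in zip(starts, stops)], stops

starts, stops = offsets[:-1], offsets[1:]
sliced_starts, sliced_stops = drop_first(starts, stops)

print(as_lists(starts, stops, content)[0])
print(as_strings(sliced_starts, sliced_stops, content))
```

The slice operates on integers alone, and the string interpretation is applied afterward, which is why the same slice works on both the layout and the high-level array. (Slicing at byte positions is only safe here because the text is pure ASCII.)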

The old Awkward also had an `ObjectArray`, which generated Python objects on demand, such as individual Lorentz vectors, and these had to have the same set of methods as arrays of Lorentz vectors. Keeping those coordinated was difficult. Now, however, individual objects stay attached to the Awkward arrays they come from: the strings above are merely views (which is why the slice worked). Instead of `Methods` and `ObjectArrays`, we now have a unified mechanism.

For instance, this `PointClass` object,

In [30]:
array4[2]

<Point(3, 3.3)>

is still an Awkward `Record`.

In [31]:
array4[2].layout

<Record at="2">
    <RecordArray>
        <type>PointType[int64, float64]</type>
        <field index="0" key="x">
            <NumpyArray format="l" shape="3" data="1 2 3" at="0x5616f0ed6a90">
                <type>int64</type>
            </NumpyArray>
        </field>
        <field index="1" key="y">
            <NumpyArray format="d" shape="3" data="1.1 2.2 3.3" at="0x5616f0ed8aa0">
                <type>float64</type>
            </NumpyArray>
        </field>
    </RecordArray>
</Record>
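The mechanism in this section has two parts: a class looked up by name from the `__class__` parameter, and a record that remains a view into its array. Both can be sketched in a few lines of plain Python. All names here (`ToyRecord`, `ToyPoint`, `toy_namespace`, `get_record`) are hypothetical stand-ins, not Awkward's implementation.

```python
class ToyRecord:
    """A record that is only a view: it keeps a reference to the columns and a position."""
    def __init__(self, columns, at):
        self.columns = columns
        self.at = at

    def __getitem__(self, key):
        return self.columns[key][self.at]

class ToyPoint(ToyRecord):
    def mag(self):
        return (self["x"]**2 + self["y"]**2) ** 0.5

# Hypothetical registry playing the role of ak.namespace.
toy_namespace = {"Point": ToyPoint}

def get_record(columns, parameters, at, namespace=toy_namespace):
    """Choose the record class from the "__class__" parameter, defaulting to ToyRecord."""
    cls = namespace.get(parameters.get("__class__"), ToyRecord)
    return cls(columns, at)

columns = {"x": [1, 2, 3], "y": [1.1, 2.2, 3.3]}
point = get_record(columns, {"__class__": "Point"}, 2)
plain = get_record(columns, {}, 2)

print(type(point).__name__, point.mag())
print(type(plain).__name__)
```

Only the string `"Point"` has to travel with the data; the class itself is resolved at the last moment, on the Python side.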

## 5. Agreement with NumPy

Awkward Array represents a superset of NumPy's core, so it must return the same results as NumPy wherever the two overlap. This was tricky in the old Awkward, where we restricted ourselves to vectorized functions, which led to hidden limitations: slices were limited to depth `2`, concatenation was limited to `axis <= 1`, and `choose(n)` was limited to `n < 5`. Now that we can write compiled `for` loops, there are no such limitations.

In [32]:
deepnumpy = np.arange(2*3*5*7).reshape(2, 3, 5, 7)
deepawkward = ak.Array(deepnumpy)
deepawkward

<Array [[[[0, 1, 2, 3, ... 207, 208, 209]]]] type='2 * 3 * 5 * 7 * int64'>

In [33]:
deepnumpy[1:, :2, [4, 1, 1, -2], ::-1]

array([[[[139, 138, 137, 136, 135, 134, 133],
         [118, 117, 116, 115, 114, 113, 112],
         [118, 117, 116, 115, 114, 113, 112],
         [132, 131, 130, 129, 128, 127, 126]],

        [[174, 173, 172, 171, 170, 169, 168],
         [153, 152, 151, 150, 149, 148, 147],
         [153, 152, 151, 150, 149, 148, 147],
         [167, 166, 165, 164, 163, 162, 161]]]])

In [34]:
deepawkward[1:, :2, [4, 1, 1, -2], ::-1]

<Array [... 166, 165, 164, 163, 162, 161]]]] type='1 * 2 * 4 * 7 * int64'>

In [35]:
ak.tolist(deepnumpy[1:, :2, [4, 1, 1, -2], ::-1]) == ak.tolist(deepawkward[1:, :2, [4, 1, 1, -2], ::-1])

True
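As a cross-check that depends on neither library, the same selection can be spelled out one axis at a time over nested Python lists. This is an illustrative sketch only; `nested_arange` is a hypothetical helper that mimics `np.arange(...).reshape(...)`.

```python
def nested_arange(shape, start=0):
    """Nested lists with the same contents as np.arange(...).reshape(shape)."""
    if len(shape) == 1:
        return list(range(start, start + shape[0]))
    step = 1
    for dim in shape[1:]:
        step *= dim
    return [nested_arange(shape[1:], start + i * step) for i in range(shape[0])]

deep = nested_arange((2, 3, 5, 7))

# Spell out [1:, :2, [4, 1, 1, -2], ::-1] one axis at a time.
result = [[[row[::-1] for row in (matrix[k] for k in [4, 1, 1, -2])]
           for matrix in block[:2]]
          for block in deep[1:]]

print(result[0][0][0])
print(result[0][1][3])
```

The first and last rows agree with the NumPy output above.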

## 6. Creating arrays

A few of the examples above create arrays by passing Python data to the `Array` constructor. This is like old Awkward's `fromiter` function. In fact, new Awkward has a `fromiter` function, but it's implicitly called by the `Array` constructor.

In [36]:
# Calls ak.fromiter, which converts rowwise → columnar data.
ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])

<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>

In [37]:
# Calls ak.fromjson, which deserializes.
ak.Array("[[1.1, 2.2, 3.3], [], [4.4, 5.5]]")

<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>

In [38]:
# Calls ak.fromnumpy, which views.
nparray = np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]])
akarray = ak.Array(nparray)
akarray

<Array [[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]] type='2 * 3 * float64'>

In [39]:
nparray[0, 1] = 999
akarray

<Array [[1.1, 999, 3.3], [4.4, 5.5, 6.6]] type='2 * 3 * float64'>
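The difference between viewing and copying, which is what makes the change to `nparray` visible through `akarray` above, can be illustrated with the standard library alone. This is an analogy using `array.array` and `memoryview`, not the mechanism Awkward or NumPy actually use.

```python
from array import array

buffer = array("d", [1.1, 2.2, 3.3, 4.4, 5.5, 6.6])
view = memoryview(buffer)    # shares the buffer's memory, like a view
copy = array("d", buffer)    # allocates new memory, like a copy

buffer[1] = 999.0            # modify the original in place

print(view[1])               # the view sees the change
print(copy[1])               # the copy does not
```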

You can also build these manually from the low-level layouts, but it's a lot of work!

In [40]:
content = ak.layout.NumpyArray(np.array([1.1, 2.2, 3.3, 4.4, 5.5]))
offsets = ak.layout.Index64(np.array([0, 3, 3, 5], dtype=np.int64))   # match 64-bit to 64-bit to avoid copy
listoffsetarray = ak.layout.ListOffsetArray64(offsets, content)
listoffsetarray

<ListOffsetArray64>
    <offsets><Index64 i="[0 3 3 5]" offset="0" at="0x5616f0ec99a0"/></offsets>
    <content><NumpyArray format="d" shape="5" data="1.1 2.2 3.3 4.4 5.5" at="0x5616f0ee3090"/></content>
</ListOffsetArray64>

In [41]:
ak.Array(listoffsetarray)

<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
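The rowwise-to-columnar conversion that produces these `offsets` and `content` by hand can be sketched for the single-level case as follows. The function name `rowwise_to_columnar` is hypothetical, and the real `fromiter` also handles records, deeper nesting, and type discovery, none of which is attempted here.

```python
def rowwise_to_columnar(rows):
    """One pass over the rows: append values to content, record where each list ends."""
    offsets = [0]
    content = []
    for row in rows:
        content.extend(row)
        offsets.append(len(content))
    return offsets, content

offsets, content = rowwise_to_columnar([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
print(offsets)
print(content)
```

The result matches the arrays passed to `Index64` and `NumpyArray` in the manual construction above.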

## 7. FillableArray

The `fromiter` algorithm has been expanded into a builder interface, so that you can accumulate Awkward arrays gradually.

In [42]:
builder = ak.FillableArray()

for i in range(10):
    builder.beginrecord()
    builder.field("x")
    builder.real(np.random.normal())
    builder.field("y")
    builder.beginlist()
    for j in range(np.random.poisson(2.5)):
        builder.integer(np.random.randint(0, 10))
    builder.endlist()
    builder.endrecord()

builder

<FillableArray [{x: -1.67, y: [3, 0, 4, ... y: [3]}] type='10 * {"x": float64, "...'>

This is not a regular array, but you can `snapshot` it to get one (and keep filling the `builder`). A `snapshot` does not copy array data: if you take several snapshots while filling, they _might_ share data. (And they _might_ not, if it has allocated new buffers to grow beyond its reserved space!)

In [43]:
array6 = builder.snapshot()
array6

<Array [{x: -1.67, y: [3, 0, 4, ... y: [3]}] type='10 * {"x": float64, "y": var ...'>

In [44]:
ak.tolist(array6)

[{'x': -1.6703813313070799, 'y': [3, 0, 4, 7]},
 {'x': -0.10744690340458463, 'y': [9, 9, 7]},
 {'x': 0.4079343007449045, 'y': [2]},
 {'x': -0.10967551671000512, 'y': [0, 5]},
 {'x': -0.3754715393782491, 'y': []},
 {'x': 0.8496379166275715, 'y': [8, 5]},
 {'x': -1.6003262252921946, 'y': [7, 9]},
 {'x': 0.7493345808171183, 'y': [3, 3, 1]},
 {'x': 0.5646686087544998, 'y': [7, 1, 6, 3]},
 {'x': 2.071962004838478, 'y': [3]}]

In [45]:
ak.typeof(array6)

10 * {"x": float64, "y": var * int64}

The array that you produce can have nested structure, as shown above. The structure was determined by the order in which `builder` methods were called.

You can write algorithms that build arrays as if you were printing out JSON:

   * call `beginlist()` instead of printing `"["`,
   * call `endlist()` instead of printing `"]"`,
   * call `beginrecord()` instead of printing `"{"`,
   * call `endrecord()` instead of printing `"}"`,
   * call `field(key)` instead of printing `"key":`, etc.
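The correspondence in this list can be made literal with a toy builder that really does write JSON text and parses it on `snapshot`. The class `ToyJsonBuilder` is hypothetical; it borrows the method names used in this notebook but says nothing about how `FillableArray` is implemented internally.

```python
import json

class ToyJsonBuilder:
    """Hypothetical stand-in for FillableArray that literally writes JSON text."""
    def __init__(self):
        self._tokens = []
        self._need_comma = [False]    # one flag per open container; [0] is the top level

    def _value(self, text):
        if self._need_comma[-1]:
            self._tokens.append(",")
        self._tokens.append(text)
        self._need_comma[-1] = True

    def _open(self, bracket):
        self._value(bracket)
        self._need_comma.append(False)

    def _close(self, bracket):
        self._need_comma.pop()
        self._tokens.append(bracket)

    def integer(self, x): self._value(json.dumps(int(x)))
    def real(self, x): self._value(json.dumps(float(x)))
    def beginlist(self): self._open("[")
    def endlist(self): self._close("]")
    def beginrecord(self): self._open("{")
    def endrecord(self): self._close("}")

    def field(self, key):
        if self._need_comma[-1]:
            self._tokens.append(",")
        self._tokens.append(json.dumps(key) + ":")
        self._need_comma[-1] = False  # the value after a key takes no comma

    def snapshot(self):
        # Only meaningful when every list and record has been closed.
        return json.loads("[" + "".join(self._tokens) + "]")

toy = ToyJsonBuilder()
for i in range(3):
    toy.beginrecord()
    toy.field("x")
    toy.real(i * 1.1)
    toy.field("y")
    toy.beginlist()
    for j in range(i):
        toy.integer(j)
    toy.endlist()
    toy.endrecord()

print(toy.snapshot())
```

The toy goes through text only to make the analogy explicit; the point is that the sequence of calls alone determines the nested structure.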

In [46]:
deepbuilder = ak.FillableArray()

def deepnesting(depth):
    if depth == 0:
        deepbuilder.integer(np.random.randint(0, 10))
    else:
        deepbuilder.beginlist()
        for j in range(np.random.poisson(2.5)):
            deepnesting(depth - 1)
        deepbuilder.endlist()

deepnesting(5)

In [47]:
ak.tolist(deepbuilder.snapshot())

[[[[[[5, 0, 4, 7, 3]],
    [[5, 7, 0, 6, 2], [6]],
    [[5, 2, 0]],
    [[8, 3], [6], [9, 0, 0, 7], [6]]],
   [[[4, 0, 0, 8], [3, 6, 5, 8], [8], [6], [9, 3], []],
    [[9, 6]],
    [[2, 2, 5, 7, 1], [2, 7, 4, 8], [6, 1], [2], [6, 6, 3, 3]],
    [[], [5]]],
   [],
   [[[0, 6]]]],
  [[[[8, 9, 6], [1, 0]],
    [[9], [6, 6, 3, 3, 1], [8]],
    [[0, 7], [0]],
    [[6], [1, 1], [9, 8, 7, 0]],
    [[0, 9, 1, 5, 2]]],
   [[[1, 9, 8, 3], [7, 5, 6], []], [[3, 8, 9]]],
   [[[4, 5], [7, 9, 8, 7], [2]]]]]]

In [48]:
ak.typeof(deepbuilder)

1 * var * var * var * var * var * int64

In [49]:
deepbuilder.snapshot().layout

<ListOffsetArray64>
    <type>var * var * var * var * var * int64</type>
    <offsets><Index64 i="[0 2]" offset="0" at="0x5616f0f00520"/></offsets>
    <content><ListOffsetArray64>
        <type>var * var * var * var * int64</type>
        <offsets><Index64 i="[0 4 7]" offset="0" at="0x5616f0f02530"/></offsets>
        <content><ListOffsetArray64>
            <type>var * var * var * int64</type>
            <offsets><Index64 i="[0 4 8 8 9 14 16 17]" offset="0" at="0x5616f0f04540"/></offsets>
            <content><ListOffsetArray64>
                <type>var * var * int64</type>
                <offsets><Index64 i="[0 1 3 4 8 ... 33 34 37 38 41]" offset="0" at="0x5616f0f06550"/></offsets>
                <content><ListOffsetArray64>
                    <type>var * int64</type>
                    <offsets><Index64 i="[0 5 10 11 14 ... 89 92 94 98 99]" offset="0" at="0x5616f0f08560"/></offsets>
                    <content><NumpyArray format="l" shape="99" data="5 0 4 7 3 ... 7 9 8 7 2" 

Both `fromiter` and `fromjson` are implemented using `FillableArray`, the latter using the RapidJSON C++ library for deserialization.

In [50]:
# !wget https://scikit-hep.org/uproot/examples/HZZ.json

In [51]:
hzz = ak.fromjson("HZZ.json")
hzz

<Array [{jets: [], ... weight: 0.00876}] type='2421 * {"jets": var * {"px": floa...'>

In [52]:
for key in hzz.layout.keys():
    print("{:18s} {}".format(key, hzz[key].type))

jets               2421 * var * {"px": float64, "py": float64, "pz": float64, "E": float64, "id": bool}
muons              2421 * var * {"px": float64, "py": float64, "pz": float64, "E": float64, "q": int64, "iso": float64}
electrons          2421 * var * {"px": float64, "py": float64, "pz": float64, "E": float64, "q": int64, "iso": float64}
photons            2421 * var * {"px": float64, "py": float64, "pz": float64, "E": float64, "iso": float64}
MET                2421 * {"x": float64, "y": float64}
MC_hadronic_b      2421 * {"px": float64, "py": float64, "pz": float64}
MC_leptonic_b      2421 * {"px": float64, "py": float64, "pz": float64}
MC_hadronicW_q     2421 * {"px": float64, "py": float64, "pz": float64}
MC_hadronicW_qbar  2421 * {"px": float64, "py": float64, "pz": float64}
MC_lepton          2421 * {"px": float64, "py": float64, "pz": float64, "pdgid": int64}
MC_neutrino        2421 * {"px": float64, "py": float64, "pz": float64}
num_PV             2421 * int64
trigger_isomu

The loop over Python objects or JSON nodes was moved from Python into C++, so it's faster. However, the implementation requires vtable-lookups (the type is discovered at runtime), so it's not _a lot_ faster. There's room for specialized methods when the type is known in advance. (See [src/libawkward/io/root.cpp](https://github.com/scikit-hep/awkward-1.0/blob/master/src/libawkward/io/root.cpp) for a ${\tt std::vector}^N{\tt<number>}$ implementation.)

In general, turning rowwise data into columnar data is about 10× faster than it used to be (roughly 7× for `fromiter` and 12× for `fromjson` in the timings below).

In [53]:
import awkward as oldawkward
import json
aslist = json.load(open("HZZ.json")) * 10
asjson = json.dumps(aslist)

In [54]:
%%timeit -r 3

ak.fromiter(aslist)                       # new fromiter

253 ms ± 15.1 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [55]:
%%timeit -r 3

oldawkward.fromiter(aslist)               # old fromiter

1.87 s ± 80.4 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [56]:
%%timeit -r 3

ak.fromjson(asjson)                       # new fromjson

191 ms ± 1.55 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


In [57]:
%%timeit -r 3

oldawkward.fromiter(json.loads(asjson))   # old equivalent of fromjson

2.29 s ± 97.2 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


## 8. Awkward arrays in Numba

One of the motivating goals of the Awkward re-write was to incorporate Numba on the same footing.

In [58]:
import numba

@numba.jit(nopython=True)
def muon_sumpt(events):
    out = np.zeros(len(events), np.float64)
    i = 0
    for event in events:
        for muon in event["muons"]:
            out[i] += np.sqrt(muon["px"]**2 + muon["py"]**2)
        i += 1
    return out

In [59]:
hzz = ak.Array(json.load(open("HZZ.json")) * 100).layout
muon_sumpt(hzz)

array([91.91225969, 24.41791248, 83.40026411, ..., 33.46153652,
       63.61981771, 42.93994828])

Notice that we can write for loops on event _records_ and muon _records_. We don't have to take apart `JaggedArrays` and write algorithms on offsets and indexes.

Incidentally, my first version of the above caused segfaults because the `i += 1` was in the inner loop rather than the outer loop (an indentation error). Since it's Numba, I could debug it by running the pure Python version.

In [60]:
muon_sumpt.py_func

<function __main__.muon_sumpt(events)>

In [61]:
muon_sumpt.py_func(hzz)

array([91.91225969, 24.41791248, 83.40026411, ..., 33.46153652,
       63.61981771, 42.93994828])
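One way to rule out the misplaced `i += 1` mentioned above is to let `enumerate` supply the event index. The sketch below runs in plain Python over lists of dicts with made-up momenta (not values from `HZZ.json`); whether `enumerate` over an Awkward layout compiles under Numba in this version is not something the sketch establishes.

```python
import math

def muon_sumpt_py(events):
    """Sum of muon transverse momenta per event, over plain lists of dicts."""
    out = [0.0] * len(events)
    for i, event in enumerate(events):    # the index advances exactly once per event
        for muon in event["muons"]:
            out[i] += math.sqrt(muon["px"]**2 + muon["py"]**2)
    return out

# Hypothetical toy events, chosen so the square roots are exact.
events = [
    {"muons": [{"px": 3.0, "py": 4.0}, {"px": 6.0, "py": 8.0}]},
    {"muons": []},
    {"muons": [{"px": 5.0, "py": 12.0}]},
]
sumpt = muon_sumpt_py(events)
print(sumpt)
```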

In [62]:
%%timeit -r 3

muon_sumpt(hzz)                           # in Numba

234 ms ± 5.36 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [63]:
%%timeit -r 1

muon_sumpt.py_func(hzz)                   # pure Python

11.4 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


So, a roughly 50× speedup over Python (11.4 s versus 234 ms) without changing any code. Debug in Python, accelerate with Numba.

Awkward arrays could be a benefit to Numba users in general: Numba can handle complex data types by converting Python objects to and from equivalent structs, but that puts a translation burden at the entry and exit of every Numba function. Awkward leaves the data in the same form (big array buffers), transforming only its handles to the data. (JSON → Awkward arrays → Numba could become a useful workflow in industry.)

#### Side note...

It looks like there's a performance bug in the current implementation: if we pass only the muons through Numba, removing all other particles, the function runs about 3× faster than when they are left in (76.2 ms versus 232 ms below), and the effect scales with the size of the dataset (100× vs 1000×). That shouldn't happen: unused fields are supposed to be ignored in the compiled code. Once everything is operational, we'll investigate these performance issues.

In [64]:
%%timeit -r 3

muon_sumpt(hzz.astype(None)[["muons"]])   # in Numba, passing only muons through

76.2 ms ± 916 µs per loop (mean ± std. dev. of 3 runs, 1 loop each)


In [65]:
%%timeit -r 3

muon_sumpt(hzz.astype(None))              # in Numba, passing everything through

232 ms ± 5.31 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


## 9. FillableArray in Numba

The example above took an Awkward array into a Numba function and did some processing on it. To write data out, we can use `FillableArrays`.

In [66]:
@numba.jit(nopython=True)
def make_data(builder):
    for i in range(10):
        builder.beginrecord()

        builder.field("x")
        builder.real(i*1.1)

        builder.field("y")
        builder.beginlist()
        for j in range(i):
            builder.integer(j)
        builder.endlist()

        builder.endrecord()

    return builder

In [67]:
builder = ak.layout.FillableArray()
make_data(builder)

<FillableArray length="10" type="{"x": float64, "y": var * int64}"/>

In [68]:
builder.snapshot()

<RecordArray>
    <type>{"x": float64, "y": var * int64}</type>
    <field index="0" key="x">
        <NumpyArray format="d" shape="10" data="0 1.1 2.2 3.3 4.4 5.5 6.6 7.7 8.8 9.9" at="0x561719fb6470">
            <type>float64</type>
        </NumpyArray>
    </field>
    <field index="1" key="y">
        <ListOffsetArray64>
            <type>var * int64</type>
            <offsets><Index64 i="[0 0 1 3 6 ... 15 21 28 36 45]" offset="0" at="0x56171a1d0470"/></offsets>
            <content><NumpyArray format="l" shape="45" data="0 0 1 0 1 ... 4 5 6 7 8" at="0x561719b661f0">
                <type>int64</type>
            </NumpyArray></content>
        </ListOffsetArray64>
    </field>
</RecordArray>

In [69]:
ak.tolist(builder.snapshot())

[{'x': 0.0, 'y': []},
 {'x': 1.1, 'y': [0]},
 {'x': 2.2, 'y': [0, 1]},
 {'x': 3.3000000000000003, 'y': [0, 1, 2]},
 {'x': 4.4, 'y': [0, 1, 2, 3]},
 {'x': 5.5, 'y': [0, 1, 2, 3, 4]},
 {'x': 6.6000000000000005, 'y': [0, 1, 2, 3, 4, 5]},
 {'x': 7.700000000000001, 'y': [0, 1, 2, 3, 4, 5, 6]},
 {'x': 8.8, 'y': [0, 1, 2, 3, 4, 5, 6, 7]},
 {'x': 9.9, 'y': [0, 1, 2, 3, 4, 5, 6, 7, 8]}]
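The `offsets` and `content` printed in the layout above follow directly from the loop in `make_data`: the `i`-th `"y"` list holds `i` integers, so the offsets are running totals of `0, 1, ..., 9`. A short check in plain Python:

```python
from itertools import accumulate

lengths = list(range(10))                    # the i-th "y" list holds i integers
offsets = [0] + list(accumulate(lengths))    # running total of list lengths
content = [j for i in range(10) for j in range(i)]

print(offsets)
print(len(content), content[:5], content[-5:])
```

The final offset, 45, equals the `shape` of the content array, and the first and last five values match the abbreviated `data` attribute.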

Since you can walk over data structures and create data structures (and later, assign fields to datasets like the old `Table`), you have complete freedom to manipulate data

   * at compiled-code speeds,
   * without having to leave the Python environment,
   * without having to rethink your algorithm in terms of array-at-a-time functions.

(This _supplements_ the array-at-a-time approach introduced last year.)

## 10. Awkward arrays in C++

Since everything has been implemented in C++, Awkward 1.0 can be used in C++ programs. More importantly, we will (someday) be able to create Awkward arrays in C++ and access them in Python or vice-versa.

In [70]:
open("test-program.cpp", "w").write("""

#include <iostream>

#include "awkward/fillable/FillableArray.h"
#include "awkward/fillable/FillableOptions.h"

namespace ak = awkward;

int main(int, char**) {
  ak::FillableArray builder(ak::FillableOptions(1024, 2.0));
  for (int i = 0;  i < 10;  i++) {
    builder.beginrecord();

    builder.field_fast("x");    // (field_fast means don't check the whole string, just its pointer)
    builder.real(i*1.1);

    builder.field_fast("y");
    builder.beginlist();
    for (int j = 0;  j < i;  j++) {
      builder.integer(j);
    }
    builder.endlist();

    builder.endrecord();
  }
  
  std::cout << builder.snapshot()->tojson(false, 1) << std::endl;
  return 0;
}
""")

672

In [71]:
import pygments.formatters
import pygments.lexers.c_cpp
print(pygments.highlight(open("test-program.cpp").read(),
                         pygments.lexers.c_cpp.CppLexer(),
                         pygments.formatters.Terminal256Formatter()))

#include <iostream>

#include "awkward/fillable/FillableArray.h"
#include "awkward/fillable/FillableOptions.h"

namespace ak = awkward;

int main(int, char**) {
  ak::FillableArray builder(ak::FillableOptions(1024, 2.0));
  for (int i = 0;  i < 10;  i++) {
    builder.beginrecord();

    builder.field_fast("x");    // (field_fast means don't check the whole string, just its pointer)
    builder.real(i*1.1

In [72]:
!g++ -I../../include -L../../awkward1 test-program.cpp -lawkward-static -lawkward-cpu-kernels-static -o test-program

In [73]:
!./test-program

[{"x":0.0,"y":[]},{"x":1.1,"y":[0]},{"x":2.2,"y":[0,1]},{"x":3.3,"y":[0,1,2]},{"x":4.4,"y":[0,1,2,3]},{"x":5.5,"y":[0,1,2,3,4]},{"x":6.6,"y":[0,1,2,3,4,5]},{"x":7.7,"y":[0,1,2,3,4,5,6]},{"x":8.8,"y":[0,1,2,3,4,5,6,7]},{"x":9.9,"y":[0,1,2,3,4,5,6,7,8]}]


## 11. Identities: database-like index for arrays

Work on the [PartiQL toy language](https://github.com/jpivarski/PartiQL#readme) made it apparent that set operations, in which unique records are identified by reference rather than by value, are important. They provide such operations as joins and lossless unions.

No set operations have been implemented, but implementing them will require an index that tracks particle identities through all other operations. This concept of an index is the primary distinction between an array library like NumPy and a relational library like Pandas. In Awkward, this index is called an `Identity` and can optionally be attached to arrays.

**Note:** this interface is the most likely to change. Identities have only been implemented at this early stage so that they don't have to be painfully retrofitted later.

In [74]:
hzzlayout = ak.fromjson("HZZ.json").layout

In [75]:
hzzlayout.setid()
hzzlayout.id

<Identity32 ref="0" fieldloc="[]" width="1" offset="0" length="2421" at="0x561719a2a800"/>

In [76]:
hzzlayout.field("muons").content.field("px").id

<Identity64 ref="2" fieldloc="[(0, 'muons') (1, 'px')]" width="2" offset="0" length="3825" at="0x561719cb5530"/>

In [77]:
np.asarray(hzzlayout.field("muons").content.field("px").id)

array([[   0,    0],
       [   0,    1],
       [   1,    0],
       ...,
       [2418,    0],
       [2419,    0],
       [2420,    0]], dtype=int64)

An `Identity` is a 2-dimensional array with the same structure as a Pandas row `MultiIndex`, plus a `fieldloc` for the nested columns. Its rows are equivalent to paths from the root (wherever you called `setid`) to the element in question.
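For a single level of jaggedness, the rows of such an index can be derived from the list offsets alone. The sketch below uses made-up offsets and a hypothetical helper, `identity_rows`; it illustrates the "path from root" idea rather than what `setid` does internally.

```python
def identity_rows(offsets):
    """One (outer index, inner index) row per element of a jagged array's content."""
    rows = []
    for outer in range(len(offsets) - 1):
        for inner in range(offsets[outer + 1] - offsets[outer]):
            rows.append((outer, inner))
    return rows

# Hypothetical offsets: four events with 2, 1, 0, and 2 muons.
muon_offsets = [0, 2, 3, 3, 5]
rows = identity_rows(muon_offsets)
print(rows)
```

Each row locates one muon as (event, position within event), and an event with no muons contributes no rows.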

In [78]:
hzzlayout[1000, "muons", 1].location

(1000, 'muons', 1)

As a nice side-effect of having indexes, we can give better error messages about where an indexing error occurs. You might use `Identities` just for debugging.

In [79]:
# Indexing error with an Identity:
try:
    hzzlayout[1000, "muons", 2]
except Exception as err:
    print(err)

in ListArray64 at id[1000, "muons"] attempting to get 2, index out of range


In [80]:
# Indexing error without an Identity:
try:
    ak.fromjson("HZZ.json").layout[1000, "muons", 2]
except Exception as err:
    print(err)

in ListArray64 attempting to get 2, index out of range


When the array goes through any kind of transformation, such as the boolean filter below, the `Identity` is similarly selected.

In [81]:
mask = np.random.randint(0, 100, len(hzzlayout)) == 0
mask

array([False, False, False, ..., False, False, False])

In [82]:
selected = hzzlayout[mask]
np.asarray(selected.id)

array([[  14],
       [ 190],
       [ 322],
       [ 336],
       [ 357],
       [ 531],
       [ 600],
       [ 932],
       [ 934],
       [1077],
       [1080],
       [1484],
       [1606],
       [1619],
       [1682],
       [1684],
       [1763],
       [1767],
       [1876],
       [2074],
       [2091],
       [2144]], dtype=int32)

In this way, the `Identity` acts as a set of labels that are permanently glued to the array elements.
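The "glued-on label" behavior can be imitated with two parallel lists that are always filtered together. The names `select`, `values`, and `ids` are hypothetical, and the data are made up; the sketch only shows why labels survive repeated filtering.

```python
def select(values, ids, mask):
    """Apply a boolean mask to the values and to their identity labels together."""
    kept = [(v, i) for v, i, keep in zip(values, ids, mask) if keep]
    return [v for v, _ in kept], [i for _, i in kept]

values = ["a", "b", "c", "d", "e", "f"]    # hypothetical records
ids = list(range(len(values)))             # label = original position

values1, ids1 = select(values, ids, [False, True, True, False, True, True])
values2, ids2 = select(values1, ids1, [True, False, False, True])

print(values1, ids1)
print(values2, ids2)
```

After two rounds of filtering, each surviving value still carries its position in the original array, which is the information a later join or union would need.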

## 12. Conclusions

Four months into the six-month sprint, the basic shape of the new library is established. It is lacking many features, mostly the ones that worked well in the old library and can be copied over. Yana has started working with me, Josh is beginning to use Awkward in an external (C++) library, and I'll be collaborating with Henry on behavioral mix-ins for Lorentz vectors, and maybe with Lukas on ATLAS data types, soon.

In the next two months, Yana and I will be solidifying and feature-loading the internals. I'll also be developing the high-level interface (e.g. adding NumPy ufuncs, NumExpr and Pandas interoperability) and keeping the Numba port up-to-date. I'm also open to contributions to the high-level interface from Coffea. Perhaps we can eliminate the need for a `JaggedCandidateArray` by integrating the ease-of-use features and putting all of the domain-specific methods in a thin behavioral mix-in.