# 2019-12-20-coffea-demo

This demo of the new Awkward Array was presented on December 20, 2019, before the final 1.0 version was released. Some interfaces may have changed. To run this notebook, make sure you have version 0.1.33 ([GitHub](https://github.com/scikit-hep/awkward-1.0/releases/tag/0.1.33), [pip](https://pypi.org/project/awkward1/0.1.33/)) by installing

```bash
pip install 'awkward1==0.1.33'
```

The basic concepts of Awkward arrays are presented in the [old Awkward README](https://github.com/scikit-hep/awkward-array/tree/0.12.17#readme), and the motivation for a 1.0 rewrite is presented in the [new Awkward README](https://github.com/scikit-hep/awkward-1.0/tree/0.1.32#readme).

In [1]:
# Please ignore the man behind the curtain...
import sys
import os
sys.path.insert(0, os.path.join(os.getcwd(), "..", ".."))

## High-level array class

The biggest user-facing change is that, instead of mixing NumPy arrays and `JaggedArray` objects, the new Awkward has a single `Array` class.

In [2]:
import numpy as np
import awkward1 as ak

array1 = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
array1

<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>

In [3]:
array2 = ak.Array([{"x": 0, "y": []}, {"x": 1, "y": [1.1]}, {"x": 2, "y": [1.1, 2.2]}])
array2

<Array [{x: 0, y: []}, ... y: [1.1, 2.2]}] type='3 * {"x": int64, "y": var * flo...'>

The same `Array` class is used for all data structures, such as the array of lists in `array1` and the array of records in `array2`.

There won't be any user-level functions that apply to some data types and not others. The result of an operation is likely type-dependent, but its accessibility is not. (At this time, the only existing operations are conversions and descriptions.)

(Incidentally, the width of that string representation is exactly large enough to fit into GitHub and StackOverflow text boxes without scrolling.)

In [4]:
ak.tolist(array1)

[[1.1, 2.2, 3.3], [], [4.4, 5.5]]

In [5]:
ak.tojson(array1)

'[[1.1,2.2,3.3],[],[4.4,5.5]]'

In [6]:
ak.tolist(array2)

[{'x': 0, 'y': []}, {'x': 1, 'y': [1.1]}, {'x': 2, 'y': [1.1, 2.2]}]

In [7]:
ak.tojson(array2)

'[{"x":0,"y":[]},{"x":1,"y":[1.1]},{"x":2,"y":[1.1,2.2]}]'

In [8]:
ak.typeof(array1)

3 * var * float64

In [9]:
ak.typeof(array2)

3 * {"x": int64, "y": var * float64}

(Data types are described using the [datashape language](https://datashape.readthedocs.io/en/latest/). Some Awkward features are [not expressible](https://github.com/blaze/datashape/issues/237) in the current datashape specification, so they're expressed in an extension of the language using the same style of syntax.)

The next major interface change is that operations on arrays, such as `ak.tolist` and `ak.typeof` above, are free-standing functions rather than class methods. This leaves the array object itself free for domain-specific (e.g. physics) methods: keeping array manipulations in free-standing functions avoids name conflicts. For example,

   * `ak.cross(array1, array2)` is an array-manipulation function (the cross-join of `array1` and `array2`)
   * `array1.cross(array2)` could be a user-defined method, such as the 3D cross-product, if `array1` and `array2` represent (arrays of) 3D vectors.
   * `array1.somefield` is a shortcut for `array1["somefield"]`.
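
The separation described in the list above can be sketched in plain Python. This is only an illustration: `MiniArray`, `Vectors3D`, and the free-standing `cross` here are hypothetical stand-ins, not Awkward's classes or functions.

```python
import itertools


class MiniArray:
    """Hypothetical stand-in for a high-level array of records, stored as columns."""

    def __init__(self, **columns):
        self._columns = columns

    def __len__(self):
        return len(next(iter(self._columns.values())))

    def __getitem__(self, field):
        return self._columns[field]

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so real methods win.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name) from None


def cross(left, right):
    """Free-standing manipulation: index pairs of the cross-join."""
    return list(itertools.product(range(len(left)), range(len(right))))


class Vectors3D(MiniArray):
    def cross(self, other):
        """Domain-specific method: element-wise 3D cross product."""
        return [
            (ay * bz - az * by, az * bx - ax * bz, ax * by - ay * bx)
            for ax, ay, az, bx, by, bz in zip(
                self.x, self.y, self.z, other.x, other.y, other.z
            )
        ]


a = Vectors3D(x=[1.0, 0.0], y=[0.0, 1.0], z=[0.0, 0.0])
b = Vectors3D(x=[0.0, 0.0], y=[1.0, 0.0], z=[0.0, 1.0])

print(a.x is a["x"])   # the attribute is a shortcut for the item lookup
print(cross(a, b))     # array manipulation: all index pairs
print(a.cross(b))      # physics method: 3D cross products
```

Because the manipulation lives in a function and the physics lives in a method, the two meanings of "cross" never compete for the same name.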

## Low-level array classes

The old `JaggedArray` and `Table` are still available, but you have to ask for them explicitly with `layout`. They're not "private" or "internal implementations" (there's no underscore in `layout`): they're public for frameworks like Coffea but hidden from data analysts.

As such, their string representations have more low-level detail: the contents of indexes, rather than what they mean as high-level types. (The XML formatting is just an elaboration on Python's angle-bracket convention for `repr` and the fact that we need to denote nesting.)

In [10]:
array1.layout

<ListOffsetArray64>
    <type>var * float64</type>
    <offsets><Index64 i="[0 3 3 5]" offset="0" at="0x55aa1e0824f0"/></offsets>
    <content><NumpyArray format="d" shape="5" data="1.1 2.2 3.3 4.4 5.5" at="0x55aa1e084500">
        <type>float64</type>
    </NumpyArray></content>
</ListOffsetArray64>
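
To read the layout above: consecutive pairs of `offsets` mark where each list starts and stops in the flat `content`. A minimal sketch in plain Python (the helper name `unflatten` is made up for this illustration and is not part of Awkward):

```python
def unflatten(offsets, content):
    """Rebuild the list of lists described by offsets into flat content."""
    return [content[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]


# Values copied from the layout printed above.
offsets = [0, 3, 3, 5]
content = [1.1, 2.2, 3.3, 4.4, 5.5]

lists = unflatten(offsets, content)
print(lists)
```

The repeated `3` in the offsets is what encodes the empty list in the middle.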

In [11]:
array2.layout

<RecordArray>
    <type>{"x": int64, "y": var * float64}</type>
    <field index="0" key="x">
        <NumpyArray format="l" shape="3" data="0 1 2" at="0x55aa1e08bd40">
            <type>int64</type>
        </NumpyArray>
    </field>
    <field index="1" key="y">
        <ListOffsetArray64>
            <type>var * float64</type>
            <offsets><Index64 i="[0 0 1 3]" offset="0" at="0x55aa1e08dd50"/></offsets>
            <content><NumpyArray format="d" shape="3" data="1.1 1.1 2.2" at="0x55aa1e08fd60">
                <type>float64</type>
            </NumpyArray></content>
        </ListOffsetArray64>
    </field>
</RecordArray>
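
The `RecordArray` above stores one column per field rather than one object per record. A rough sketch of how those columns correspond to the list of records, using plain Python lists (`unflatten` is a hypothetical helper, not an Awkward function):

```python
def unflatten(offsets, content):
    """Rebuild the list of lists described by offsets into flat content."""
    return [content[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]


# Columns copied from the layout printed above.
x = [0, 1, 2]
y = unflatten([0, 0, 1, 3], [1.1, 1.1, 2.2])

# Zipping the columns together recovers the row-wise view.
records = [{"x": xi, "y": yi} for xi, yi in zip(x, y)]
print(records)
```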

These layout classes are defined in C++ and wrapped by pybind11; the `awkward1.Array` class is pure Python. Many of the same operations work for layout classes, though less attention has been paid to their interface.

In [12]:
ak.typeof(array1)

3 * var * float64

In [13]:
ak.typeof(array1.layout)

var * float64

In [14]:
ak.tojson(array1)

'[[1.1,2.2,3.3],[],[4.4,5.5]]'

In [15]:
ak.tojson(array1.layout)

'[[1.1,2.2,3.3],[],[4.4,5.5]]'

In [16]:
array1.layout.tojson()

'[[1.1,2.2,3.3],[],[4.4,5.5]]'

## Behavioral mix-ins

The primary use of Awkward arrays so far has been to represent arrays or jagged arrays of physics objects with physics methods on the array objects themselves. In Awkward 0.x, this was implemented with Python multiple inheritance, but that's a Python-only solution that can't be passed into C++ (and it was brittle: easy for an array component to lose its methods).

Now behavioral mix-ins are "first-class citizens," built into Awkward 1.0's type system.

In [17]:
class PointClass(ak.Record):
    def __repr__(self):
        return "<Point({}, {})>".format(self["x"], self["y"])
    
    def mag(self):
        return abs(np.sqrt(self["x"]**2 + self["y"]**2))

ak.namespace["Point"] = PointClass

In [18]:
array3 = ak.Array([{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}])
array3

<Array [{x: 1, y: 1.1}, ... {x: 3, y: 3.3}] type='3 * {"x": int64, "y": float64}'>

In [19]:
array3.layout.type

{"x": int64, "y": float64}

Types can have arbitrary parameters, which modify their meaning. These parameters are JSON-encoded and passed through C++ or wherever the arrays get sent.

In [20]:
pointtype = array3.layout.type
pointtype["__class__"] = "Point"
pointtype

struct[["x", "y"], [int64, float64], parameters={"__class__": "Point"}]

In [21]:
pointtype["__str__"] = "PointType[{}, {}]".format(pointtype.field("x"), pointtype.field("y"))
pointtype

PointType[int64, float64]

In [22]:
# There will be a better interface for assigning types...
array4 = ak.Array(array3.layout, type=ak.ArrayType(pointtype, len(array3.layout)))
array4

<Array [<Point(1, 1.1)>, ... <Point(3, 3.3)>] type='3 * PointType[int64, float64]'>

In [23]:
[x.mag() for x in array4]

[1.4866068747318506, 2.973213749463701, 4.459820624195552]

The elements of this array are `PointClass` instances because the `__class__` parameter is `"Point"`, a name that is recognized in Awkward's class namespace.

In [24]:
ak.namespace

{'char': awkward1.behavior.string.CharBehavior,
 'string': awkward1.behavior.string.StringBehavior,
 'Point': __main__.PointClass}

As you can see, arrays of characters and variable-length strings are implemented as mix-ins. Apart from this type annotation, a string is just a jagged array of 8-bit integers.

In [25]:
array5 = ak.Array(["Daisy", "Daisy", "give", "me", "your", "answer", "do."])
array5

<Array ['Daisy', 'Daisy', ... 'answer', 'do.'] type='7 * string'>

In [26]:
array5.layout

<ListOffsetArray64>
    <type>string</type>
    <offsets><Index64 i="[0 5 10 14 16 20 26 29]" offset="0" at="0x55aa1e0aa120"/></offsets>
    <content><NumpyArray format="B" shape="29" data="0x 44616973 79446169 73796769 76656d65 796f7572 616e7377 6572646f 2e" at="0x55aa1e04c550">
        <type>utf8</type>
    </NumpyArray></content>
</ListOffsetArray64>

In [27]:
ak.tolist(array5.layout)

[[68, 97, 105, 115, 121],
 [68, 97, 105, 115, 121],
 [103, 105, 118, 101],
 [109, 101],
 [121, 111, 117, 114],
 [97, 110, 115, 119, 101, 114],
 [100, 111, 46]]

In [28]:
ak.tolist(array5.layout[:, 1:])

[[97, 105, 115, 121],
 [97, 105, 115, 121],
 [105, 118, 101],
 [101],
 [111, 117, 114],
 [110, 115, 119, 101, 114],
 [111, 46]]

In [29]:
array5[:, 1:]

<Array ['aisy', 'aisy', ... 'nswer', 'o.'] type='7 * string'>
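
The same result can be reproduced by hand from the layout, which may help show that a string array is just offsets over a byte buffer. This sketch uses only the Python standard library; the offsets and hex bytes are copied from the `array5.layout` output above, and `decode` is a helper invented for this illustration.

```python
offsets = [0, 5, 10, 14, 16, 20, 26, 29]
content = bytes.fromhex(
    "44616973 79446169 73796769 76656d65 796f7572 616e7377 6572646f 2e"
)


def decode(starts, stops):
    """Interpret each [start, stop) byte range as a UTF-8 string."""
    return [content[a:b].decode("utf-8") for a, b in zip(starts, stops)]


starts, stops = offsets[:-1], offsets[1:]
strings = decode(starts, stops)

# The [:, 1:] slice only moves each start forward by one byte (never past its stop).
tails = decode([min(a + 1, b) for a, b in zip(starts, stops)], stops)

print(strings)
print(tails)
```

Note that this slices bytes, not characters, so it is only safe here because every character in the example is a single byte in UTF-8.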

The string interpretation is _only_ applied to the high-level `Array` and _not_ to the layout classes. Thus,

   * superclass-based mix-ins don't have to be captured and passed on through all operations,
   * mix-ins can pass through C++ because they are only JSON-encoded type parameters, not a Python class,
   * mix-in classes don't have to be dynamically generated (`PointClass` has a "fixed address" for pickling),
   * the mechanism for array mix-ins (e.g. `string`) is the same as for producing objects (e.g. `PointClass`); there is no need to introduce an `ObjectArray`,
   * unlike old Awkward's `ObjectArray`, these records remain Awkward data structures when instantiated.

In [30]:
array4[2]

<Point(3, 3.3)>

In [31]:
array4[2].layout

<Record at="2">
    <RecordArray>
        <type>PointType[int64, float64]</type>
        <field index="0" key="x">
            <NumpyArray format="l" shape="3" data="1 2 3" at="0x55aa1e0a19c0">
                <type>int64</type>
            </NumpyArray>
        </field>
        <field index="1" key="y">
            <NumpyArray format="d" shape="3" data="1.1 2.2 3.3" at="0x55aa1e0a39d0">
                <type>float64</type>
            </NumpyArray>
        </field>
    </RecordArray>
</Record>
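
The mechanism described in this section can be sketched in plain Python: a `__class__` parameter names a class in a registry, and a record is a lightweight view at one index of the columnar data. The names `Record`, `PointRecord`, `registry`, and `get_record` below are hypothetical and only mimic the idea; they are not Awkward's implementation.

```python
import math


class Record:
    """One row, stored as a reference to the columns plus an index."""

    def __init__(self, columns, at):
        self.columns = columns
        self.at = at

    def __getitem__(self, field):
        return self.columns[field][self.at]


class PointRecord(Record):
    """Hypothetical behavior class, analogous to PointClass above."""

    def mag(self):
        return math.hypot(self["x"], self["y"])


# Hypothetical registry playing the role of ak.namespace.
registry = {"Point": PointRecord}


def get_record(columns, parameters, at):
    """Pick the record class named by the "__class__" parameter, if registered."""
    cls = registry.get(parameters.get("__class__"), Record)
    return cls(columns, at)


columns = {"x": [1, 2, 3], "y": [1.1, 2.2, 3.3]}
plain = get_record(columns, {}, 2)
point = get_record(columns, {"__class__": "Point"}, 2)

print(type(plain).__name__, type(point).__name__)
print(point["x"], point["y"], point.mag())
```

Only the string `"Point"` travels with the data; the Python class is looked up at the last moment, which is why the parameters can pass through non-Python code untouched.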

## Creating arrays

A few of the examples above create arrays by passing Python data to the `Array` constructor. This is like old Awkward's `fromiter` function. In fact, new Awkward has a `fromiter` function, but it's implicitly called by the `Array` constructor.

In [42]:
# Calls ak.fromiter, which converts rowwise → columnar data.
ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])

<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
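
What "rowwise → columnar" means for this input can be sketched in a few lines of plain Python. The function `flatten` is a made-up name for illustration, and the real conversion handles many more cases (records, nesting, missing values) than this one-level version.

```python
def flatten(lists):
    """Convert a list of lists into (offsets, content), one level deep."""
    offsets = [0]
    content = []
    for sublist in lists:
        content.extend(sublist)
        offsets.append(len(content))  # running total marks where each list ends
    return offsets, content


offsets, content = flatten([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
print(offsets)
print(content)
```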

In [43]:
# Calls ak.fromjson, which deserializes.
ak.Array("[[1.1, 2.2, 3.3], [], [4.4, 5.5]]")

<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>

In [44]:
# Calls ak.fromnumpy, which views.
nparray = np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]])
akarray = ak.Array(nparray)
akarray

<Array [[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]] type='2 * 3 * float64'>

In [45]:
nparray[0, 1] = 999
akarray

<Array [[1.1, 999, 3.3], [4.4, 5.5, 6.6]] type='2 * 3 * float64'>
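
The difference between viewing and copying can be shown with the standard library alone. This is an analogy using `array.array` and `memoryview`, not the mechanism Awkward uses internally:

```python
from array import array

buffer = array("d", [1.1, 2.2, 3.3, 4.4, 5.5, 6.6])
view = memoryview(buffer)   # shares memory with buffer
copy = buffer.tolist()      # independent Python list

buffer[1] = 999.0           # modify the original in place

print(view[1])              # the view sees the change
print(copy[1])              # the copy does not
```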

You can also build these manually from the layouts, but it's a lot of work!

In [46]:
content = ak.layout.NumpyArray(np.array([1.1, 2.2, 3.3, 4.4, 5.5]))
offsets = ak.layout.Index64(np.array([0, 3, 3, 5], dtype=np.int64))   # match 64-bit to avoid copy
listoffsetarray = ak.layout.ListOffsetArray64(offsets, content)
listoffsetarray

<ListOffsetArray64>
    <offsets><Index64 i="[0 3 3 5]" offset="0" at="0x55aa1e095760"/></offsets>
    <content><NumpyArray format="d" shape="5" data="1.1 2.2 3.3 4.4 5.5" at="0x55aa1e0e3ee0"/></content>
</ListOffsetArray64>

In [47]:
ak.Array(listoffsetarray)

<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>

## Interaction with C++

Since everything has been implemented in C++, it can be used in C++ programs. More importantly, we will (someday) be able to create Awkward arrays in C++ and access them in Python or vice-versa.

Here is a standalone example from the unit tests.

In [41]:
import pygments.formatters
import pygments.lexers.c_cpp

cpp_code = open(os.path.join(os.getcwd(), "..", "..", "tests", "test_PR019_use_json_library.cpp")).read()
print(pygments.highlight(cpp_code, pygments.lexers.c_cpp.CppLexer(),
                         pygments.formatters.Terminal256Formatter()))

// BSD 3-Clause License; see https://github.com/jpivarski/awkward-1.0/blob/master/LICENSE

#include "awkward/Slice.h"
#include "awkward/fillable/FillableArray.h"
#include "awkward/fillable/FillableOptions.h"

namespace ak = awkward;

int main(int, char**) {
  std::vector<std::vector<std::vector<double>>> vector =
    {{{0.0, 1.1, 2.2}, {}, {3.3, 4.4}}, {{5.5}}, {}, {{6.6, 

Below is the same thing in Python, demonstrating equivalence.

In [35]:
vector = [[[0.0, 1.1, 2.2], [], [3.3, 4.4]], [[5.5]], [], [[6.6, 7.7, 8.8, 9.9]]]

builder = ak.layout.FillableArray()
for x in vector: builder.fill(x)
array = builder.snapshot()

array[::-1, ::2, 1::].tojson()

'[[[7.7,8.8,9.9]],[],[[]],[[1.1,2.2],[4.4]]]'
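
As a cross-check, the same three-level slice can be spelled out on the plain Python lists with nested comprehensions. This sketch uses only the standard library and assumes nothing about Awkward beyond the output shown above:

```python
import json

vector = [[[0.0, 1.1, 2.2], [], [3.3, 4.4]], [[5.5]], [], [[6.6, 7.7, 8.8, 9.9]]]

# [::-1] reverses the outer list, [::2] keeps every other middle list,
# and [1:] drops the first element of each innermost list.
sliced = [[inner[1:] for inner in outer[::2]] for outer in vector[::-1]]

text = json.dumps(sliced, separators=(",", ":"))
print(text)
```

The comprehension makes a Python object for every list along the way, which is the per-element overhead that a columnar slice avoids.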