<br><br><br><br><br>

# Awkward datasets

<br><br><br><br><br>

Data are often non-rectangular. Jagged ("ragged") arrays, cross-references, trees, and graphs are common, but they are difficult to cast as Numpy arrays or Pandas DataFrames.

<br>

**Let's start with NASA's exoplanet database:** each star can have an arbitrary number of planets (jagged array).

<br><br><br><br><br>

In [None]:
import pandas

# NASA provides this dataset as a CSV file, which suggests a rectangular table: one row per planet.
exoplanets = pandas.read_csv("data/nasa-exoplanets.csv")
exoplanets

In [None]:
# Quite a few planets in this table have the same star ("host") name.

numplanets = exoplanets.groupby("pl_hostname").size()
numplanets[numplanets > 1]

In [None]:
# Use Pandas's MultiIndex to represent a sparse, 2D index (stars × planets without missing values).

exoplanets.index = pandas.MultiIndex.from_arrays([exoplanets["pl_hostname"], exoplanets["pl_letter"]])
exoplanets.index.names = ["star", "planet"]
exoplanets

In [None]:
# Simplify the table to show 5 star attributes and 5 planet attributes. Star attributes are repeated.

df = exoplanets[["ra", "dec", "st_dist", "st_mass", "st_rad", "pl_orbsmax", "pl_orbeccen", "pl_orbper", "pl_bmassj", "pl_radj"]]
df.columns = pandas.MultiIndex.from_arrays([["star"] * 5 + ["planet"] * 5,
    ["right asc. (deg)", "declination (deg)", "distance (pc)", "mass (solar)", "radius (solar)", "orbit (AU)", "eccen.", "period (days)", "mass (Jupiter)", "radius (Jupiter)"]])
df

In [None]:
# DataFrame.unstack moves the sparse planet index into a dense set of columns.
# Every column (reduced to 2: orbit and mass) is duplicated 8 times because one star has 8 planets.

df[[("planet", "orbit (AU)"), ("planet", "mass (Jupiter)")]].unstack("planet")

In [None]:
# We can also select a cross-section (xs) of the index by planet letter to focus on one at a time.

df.xs("b", level="planet")   # try "c", "d", "e", "f", "g", "h", "i"

<br><br><br><br><br>

### Alternative: stars and planets as nested objects

<br><br><br><br><br>

In [None]:
# Despite the nice tools Pandas provides, it's easier to think of stars and planets as objects.

stardicts = []
for (starname, planetname), row in df.iterrows():
    if len(stardicts) == 0 or stardicts[-1]["name"] != starname:
        stardicts.append({"name": starname,
                          "ra": row["star", "right asc. (deg)"],
                          "dec": row["star", "declination (deg)"],
                          "dist": row["star", "distance (pc)"],
                          "mass": row["star", "mass (solar)"],
                          "radius": row["star", "radius (solar)"],
                          "planets": []})
    stardicts[-1]["planets"].append({"name": planetname,
                                     "orbit": row["planet", "orbit (AU)"],
                                     "eccen": row["planet", "eccen."],
                                     "period": row["planet", "period (days)"],
                                     "mass": row["planet", "mass (Jupiter)"],
                                     "radius": row["planet", "radius (Jupiter)"]})

stardicts[:30]

In [None]:
# But this destroys Numpy's array-at-a-time performance and (in some cases) convenience.

# Here's a way to get both (disclosure: I'm the author).
import awkward

stars = awkward.fromiter(stardicts)
stars

In [None]:
# The data are logically a collection of nested lists and dicts...

stars[:30].tolist()

In [None]:
# ...but they have been entirely converted into arrays.
for starattr in "name", "ra", "dec", "dist", "mass", "radius":
    print("{:15s} =".format("stars[{!r}]".format(starattr)), stars[starattr])

print()
for planetattr in "name", "orbit", "eccen", "period", "mass", "radius":
    print("{:26s} =".format("stars['planets'][{!r}]".format(planetattr)), stars["planets"][planetattr])

In [None]:
# The object structure is a façade, built on Numpy arrays.

planet_masses = stars["planets"]["mass"]

# It appears to be a list of lists;
print("\nplanet_masses =", planet_masses)

# but it is a JaggedArray class instance;
print("\ntype(planet_masses) =", type(planet_masses))

# whose numerical data are in a content array;
print("\nplanet_masses.content =", planet_masses.content)

# and divisions between stars are encoded in an offsets array.
print("\nplanet_masses.offsets =", planet_masses.offsets)
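
The content/offsets layout described above can be sketched in plain Python. This is only an illustration of the idea, not the awkward library's implementation; the helper names `to_jagged` and `get_sublist` are hypothetical.

```python
# Plain-Python sketch of a jagged layout (hypothetical helpers, not awkward's API).

def to_jagged(lists):
    """Flatten nested lists into a flat content list and an offsets list."""
    content, offsets = [], [0]
    for sublist in lists:
        content.extend(sublist)
        offsets.append(len(content))   # where the next sublist starts
    return content, offsets

def get_sublist(content, offsets, i):
    """Recover the i-th inner list from the flat representation."""
    return content[offsets[i]:offsets[i + 1]]

content, offsets = to_jagged([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
print(content)                            # [1.1, 2.2, 3.3, 4.4, 5.5]
print(offsets)                            # [0, 3, 3, 5]
print(get_sublist(content, offsets, 2))   # [4.4, 5.5]
```

An empty inner list shows up as two equal consecutive offsets, so it costs nothing in the content.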

In [None]:
# Pandas's unstack becomes...

stars["planets"][["orbit", "mass"]].pad(8).tolist()

In [None]:
# ...which can be used to produce regular Numpy arrays.

maxplanets = stars["planets"].counts.max()

stars["planets"]["mass"].pad(maxplanets).fillna(float("nan")).regular()
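
As a rough picture of what padding to a regular shape means, here is a plain-Python sketch. The function `pad_to_regular` and the mass values are made up for illustration; it does not call the awkward library.

```python
import math

# Plain-Python sketch: pad every inner list to the longest length with NaN.
def pad_to_regular(lists, fill=math.nan):
    width = max((len(inner) for inner in lists), default=0)
    return [list(inner) + [fill] * (width - len(inner)) for inner in lists]

masses = [[0.5, 1.2], [], [3.0]]      # made-up values, one inner list per star
table = pad_to_regular(masses)
print(table)                          # every row now has the same length
```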

In [None]:
# Pandas's cross-section becomes...

stars["planets"][:, 0].tolist()

In [None]:
# ...but to ask for the nth planet, first select the stars that have at least n planets.

print("stars['planets'].counts =", stars["planets"].counts)

atleast3 = (stars["planets"].counts >= 3)
print("atleast3 =", atleast3)

stars["planets"][atleast3, 2].tolist()
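
The need to filter by counts before asking for a given position can be seen in a plain-Python sketch. The planet letters and the helper `nth_of_each` are hypothetical and only mirror the logic of the cell above.

```python
# Plain-Python sketch: take the n-th item (0-based) of each inner list,
# keeping only the inner lists that are long enough to have one.
def nth_of_each(lists, n):
    counts = [len(inner) for inner in lists]
    long_enough = [count > n for count in counts]
    return [inner[n] for inner, keep in zip(lists, long_enough) if keep]

planets = [["b"], ["b", "c", "d"], ["b", "c"], ["b", "c", "d", "e"]]
print(nth_of_each(planets, 2))   # ['d', 'd']
```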

In [None]:
# Awkward Array was motivated by particle physics analyses, which have particularly complex events.
import uproot

# Open a simplified file (for tutorials).
lhc_data = uproot.open("http://scikit-hep.org/uproot/examples/HZZ.root")["events"]

# Read columns of data for particle energies.
particle_energies = lhc_data.arrays(["*_E"], namedecode="utf-8")

# There's a different number of particles for each particle type in each event.
for name, array in particle_energies.items():
    print("\nparticle_energies['{}'] = {}".format(name, array))

<br><br>

### Overview of Awkward Arrays

Awkward Array (`import awkward`) is designed as a generalization of Numpy to:

   * jagged arrays
   * non-rectangular tables
   * nullable types
   * heterogeneous lists
   * cross-references and cyclic references
   * non-contiguous arrays
   * virtual data and objects

<br><br>

In [None]:
# Generate simple data or convert from JSON using fromiter.

a = awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])

# Columnar structure is built into the resulting object.
print("\na =", a)
print("\ntype(a) =", type(a))
print("\na.content =", a.content)
print("\na.offsets =", a.offsets)

In [None]:
# Numpy ufuncs pass through the structure for array-at-a-time calculations.

# (Uses the same __array_ufunc__ trick as CuPy and Dask...)

import numpy

a = awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
print(numpy.sqrt(a))
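
One way to picture a function passing through the jagged structure: apply it to the flat content and reuse the offsets unchanged. The sketch below is plain Python with a hypothetical helper `map_content`; it illustrates the idea only and is not how awkward dispatches through `__array_ufunc__`.

```python
import math

# Plain-Python sketch: an elementwise function touches only the flat content.
def map_content(func, content, offsets):
    return [func(x) for x in content], offsets   # structure is untouched

content = [1.0, 4.0, 9.0, 16.0, 25.0]
offsets = [0, 3, 3, 5]
roots, same_offsets = map_content(math.sqrt, content, offsets)

nested = [roots[same_offsets[i]:same_offsets[i + 1]]
          for i in range(len(same_offsets) - 1)]
print(nested)   # [[1.0, 2.0, 3.0], [], [4.0, 5.0]]
```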

In [None]:
# Array-at-a-time calculations are only possible if all arguments have the same structure.

a = awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
b = awkward.fromiter([[100, 200, 300], [], [400, 500]])

print("a + b =", a + b)

In [None]:
# In Numpy, scalars can be "broadcast" to be used in calculations with arrays.

# Generalizing this, Numpy arrays can be "broadcast" to fit jagged arrays.

a = awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
b = numpy.array([100, 200, 300])

print("a + b =", a + b)
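
Broadcasting a flat array against a jagged one pairs each flat value with every element of the corresponding inner list. A plain-Python sketch, using integers to keep the arithmetic exact (`broadcast_add` is a hypothetical name):

```python
# Plain-Python sketch: one flat value per inner list, reused for each element.
def broadcast_add(jagged, flat):
    if len(jagged) != len(flat):
        raise ValueError("need exactly one flat value per inner list")
    return [[x + value for x in inner] for inner, value in zip(jagged, flat)]

a = [[1, 2, 3], [], [4, 5]]
b = [100, 200, 300]
print(broadcast_add(a, b))   # [[101, 102, 103], [], [304, 305]]
```

The value 200 is paired with an empty list, so it contributes nothing to the result.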

In [None]:
# Slicing works like Numpy.

a = awkward.fromiter([[1.1, 2.2, 3.3, 4.4], [5.5, 6.6], [7.7, 8.8, 9.9]])

# Take the first two outer lists.
print("\na[:2]    =", a[:2])

# Take the first two of each inner list.
print("\na[:, :2] =", a[:, :2])

In [None]:
# Masking works like Numpy, but with new capabilities for jagged masks.

a          = awkward.fromiter([[ 1.1,   2.2,   3.3,  4.4], [  5.5,  6.6], [  7.7,  8.8,   9.9]])
mask       = awkward.fromiter([True,                       False,         True])
jaggedmask = awkward.fromiter([[True, False, False, True], [False, True], [False, False, False]])

# Filter outer lists.
print("\na[mask]       =", a[mask])

# Filter inner lists.
print("\na[jaggedmask] =", a[jaggedmask])
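
The difference between the two kinds of mask can be sketched in plain Python, assuming the same values as the cell above. The helper names are hypothetical.

```python
# Plain-Python sketch of outer and jagged masking.
def apply_outer_mask(lists, mask):
    """One boolean per inner list: keep or drop whole inner lists."""
    return [inner for inner, keep in zip(lists, mask) if keep]

def apply_jagged_mask(lists, jagged_mask):
    """One boolean per element: filter inside each inner list."""
    return [[x for x, keep in zip(inner, inner_mask) if keep]
            for inner, inner_mask in zip(lists, jagged_mask)]

a = [[1.1, 2.2, 3.3, 4.4], [5.5, 6.6], [7.7, 8.8, 9.9]]
mask = [True, False, True]
jaggedmask = [[True, False, False, True], [False, True], [False, False, False]]

print(apply_outer_mask(a, mask))         # [[1.1, 2.2, 3.3, 4.4], [7.7, 8.8, 9.9]]
print(apply_jagged_mask(a, jaggedmask))  # [[1.1, 4.4], [6.6], []]
```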

In [None]:
# Integer indexing works like Numpy, but with new capabilities for jagged indexes.

a           = awkward.fromiter([[1.1, 2.2, 3.3, 4.4], [5.5, 6.6], [7.7, 8.8, 9.9]])
index       = awkward.fromiter([2, 1, 1, 0])
jaggedindex = awkward.fromiter([[3, 0, 0, 1, 2], [], [-1]])

# Apply an integer index to the outer lists.
print("\na[index]       =", a[index])

# Apply a jagged integer index to the inner lists.
print("\na[jaggedindex] =", a[jaggedindex])
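
Integer indexing follows the same split as masking. A plain-Python sketch with hypothetical helper names, using the values from the cell above:

```python
# Plain-Python sketch of outer and jagged integer indexing.
def take_outer(lists, index):
    """Each integer picks a whole inner list (repeats allowed)."""
    return [lists[i] for i in index]

def take_jagged(lists, jagged_index):
    """Each inner index list picks elements within the matching inner list."""
    return [[inner[i] for i in inner_index]
            for inner, inner_index in zip(lists, jagged_index)]

a = [[1.1, 2.2, 3.3, 4.4], [5.5, 6.6], [7.7, 8.8, 9.9]]
index = [2, 1, 1, 0]
jaggedindex = [[3, 0, 0, 1, 2], [], [-1]]

print(take_outer(a, index))         # inner lists 2, 1, 1, 0 in that order
print(take_jagged(a, jaggedindex))  # [[4.4, 1.1, 1.1, 2.2, 3.3], [], [9.9]]
```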

In [None]:
# In Numpy, "reducers" turn arrays into scalars.

# Generalizing this, jagged arrays can be "reduced" to Numpy arrays.

a = awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])

print("\na.sum() =", a.sum())
print("\na.max() =", a.max())
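
Reducing a jagged array removes one level of nesting: each inner list collapses to a single value. In the plain-Python sketch below, an empty list reduces to the operation's identity (0 for sum, negative infinity for max). That choice is an assumption of this sketch, and the helper names are hypothetical.

```python
# Plain-Python sketch: reduce each inner list to one value.
def jagged_sum(lists):
    return [sum(inner) for inner in lists]

def jagged_max(lists):
    # Assumption: an empty inner list reduces to -inf, the identity for max.
    return [max(inner, default=float("-inf")) for inner in lists]

a = [[1, 2, 3], [], [4, 5]]
print(jagged_sum(a))   # [6, 0, 9]
print(jagged_max(a))   # [3, -inf, 5]
```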

In [None]:
# Like Numpy, argmax and argmin produce integer indexes appropriate for application to arrays.

a = awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
b = awkward.fromiter([[100, 200, 300], [], [400, 500]])

indexes = a.argmax()
print("\nindexes    =", indexes)
print("\nb[indexes] =", b[indexes])
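
An empty inner list has no position of its maximum, which is why a jagged result is convenient here. The plain-Python sketch below wraps each position in a list of length one or zero. This is an assumption about the shape, made for illustration, and the helper names are hypothetical.

```python
# Plain-Python sketch: argmax as a jagged index that can be applied to another array.
def jagged_argmax(lists):
    return [[max(range(len(inner)), key=inner.__getitem__)] if inner else []
            for inner in lists]

def take_jagged(lists, jagged_index):
    return [[inner[i] for i in inner_index]
            for inner, inner_index in zip(lists, jagged_index)]

a = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
b = [[100, 200, 300], [], [400, 500]]

indexes = jagged_argmax(a)
print(indexes)                   # [[2], [], [1]]
print(take_jagged(b, indexes))   # [[300], [], [500]]
```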

In [None]:
# Since we often deal with different numbers of objects in the same event, we need ways to
# match them for comparison.

a = awkward.fromiter([[1.1, 2.2, 3.3], [],   [4.4, 5.5]])
b = awkward.fromiter([[10, 20],        [30], [40]])

print("\na.cross(b) =", a.cross(b))
print("\na.cross(b).i0 (lefts)  =", a.cross(b).i0)
print("\na.cross(b).i1 (rights) =", a.cross(b).i1)
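
The cross operation can be pictured as a Cartesian product taken separately within each event. A plain-Python sketch using `itertools.product` (the function name `cross_per_event` is hypothetical):

```python
from itertools import product

# Plain-Python sketch: all (left, right) pairs, event by event.
def cross_per_event(lefts, rights):
    return [list(product(left, right)) for left, right in zip(lefts, rights)]

a = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
b = [[10, 20], [30], [40]]

pairs = cross_per_event(a, b)
print([len(event) for event in pairs])   # [6, 0, 2]
print(pairs[2])                          # [(4.4, 40), (5.5, 40)]
```

The second event has no pairs because its left side is empty, even though its right side has one element.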

<br><br><br><br><br>

### Application to a realistic problem

This example is based on a typical case in particle physics, but it is general enough to apply in other sciences.

<br><br><br><br><br>

In [None]:
# Suppose we have a variable number of real objects in each event.

import collections
T = collections.namedtuple("T", ["x", "y"])

truth = []
for i in range(10):
    truth.append([])
    for j in range(numpy.random.poisson(2)):
        truth[-1].append(T(*numpy.random.randint(0, 100, 2)/100))

truth

In [None]:
# When we try to reconstruct these objects from the signals they produce,
# the measurements have error, some unlucky objects are lost, and some spurious noise is added.

M = collections.namedtuple("M", ["x", "y"])

error = lambda: numpy.random.normal(0, 0.001)
unlucky = lambda: numpy.random.uniform(0, 1) < 0.2

observed = []
for event in truth:
    observed.append([M(x + error(), y + error()) for x, y in event if not unlucky()])
    for j in range(numpy.random.poisson(0.25)):
        observed[-1].append(M(*numpy.random.normal(0.5, 0.25, 2)))

observed

In [None]:
# So the simulated data look like this:

data = awkward.Table(truth=awkward.fromiter(truth), observed=awkward.fromiter(observed))
data.tolist()

In [None]:
# The measured objects were reconstructed from raw signals in our simulation by a complex process.

# We want to match real and measured to learn what the simulation is telling us about measurement
# errors, missing fraction, and spurious fraction.

pairs = data["truth"].cross(data["observed"], nested=True)    # pairs for all combinations

distances = numpy.sqrt((pairs.i0["x"] - pairs.i1["x"])**2 +   # compute distance for all
                       (pairs.i0["y"] - pairs.i1["y"])**2)
print("\ndistances[0] =", distances[0])

best = distances.argmin()                                     # pick smallest distance
print("\nbest =", best)

good_enough = (distances[best] < 0.005)                       # exclude if the distance is too large
print("\ngood_enough =", good_enough)

good_pairs = pairs[best][good_enough].flatten(axis=1)         # select best and good enough; reduce
print("\ngood_pairs[0] =", good_pairs[0])

#### **Explode:** create deeper structures by combining the ones we have

<center><img src="img/explode.png" width="25%"></center>

#### **Flat:** compute something in a vectorized way

<center><img src="img/flat.png" width="25%"></center>

#### **Reduce:** use the new values to eliminate structure (max, sum, mean...)

<center><img src="img/reduce.png" width="25%"></center>
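
The explode, flat, reduce steps used for the truth-to-observed matching can be sketched for a single event in plain Python. The function `match_event`, its `max_distance` parameter, and the sample points are hypothetical; the sketch follows the same logic as the matching cell but does not use the awkward library.

```python
import math

# Plain-Python sketch of explode / flat / reduce for one event.
def match_event(truth, observed, max_distance=0.005):
    matches = []
    for t in truth:
        # Explode: every (truth, observed) combination for this true object.
        candidates = [(t, o) for o in observed]
        if not candidates:
            continue
        # Flat: one distance per combination.
        distances = [math.hypot(t[0] - o[0], t[1] - o[1]) for _, o in candidates]
        # Reduce: keep only the closest candidate, and only if it is close enough.
        best = min(range(len(distances)), key=distances.__getitem__)
        if distances[best] < max_distance:
            matches.append(candidates[best])
    return matches

truth = [(0.10, 0.20), (0.70, 0.30)]         # made-up true positions
observed = [(0.101, 0.199), (0.50, 0.50)]    # one good measurement, one spurious
print(match_event(truth, observed))
```

Here the first true object is matched to its nearby measurement, while the second has no measurement within the threshold and is left unmatched.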

In [None]:
# Other awkward types: nullable, heterogeneous lists, nested records...

a = awkward.fromiter([[1.1, 2.2, None, 3.3, None],
                      [4.4, [5.5]],
                      [{"x": 6, "y": {"z": 7}}, None, {"x": 8, "y": {"z": 9}}]
                     ])

# Array type as a function signature
print(a.type)
print()

# Vectorized operations all the way down
(a + 100).tolist()

In [None]:
# Cross-references
data = awkward.fromiter([
    {"tracks": [{"phi": 1.0}, {"phi": 2.0}],
     "hits": [{"id": 100, "pos": 3.7}, {"id": 50, "pos": 2.1}, {"id": 75, "pos": 2.5}]},
    {"tracks": [{"phi": 1.5}],
     "hits": [{"id": 100, "pos": 1.4}, {"id": 50, "pos": 0.7}, {"id": 75, "pos": 3.0}]}])
data["tracks"]["hits-on-track"] = \
    awkward.JaggedArray.fromcounts([2, 1],
        awkward.JaggedArray.fromcounts([2, 2, 1, 1],
            awkward.IndexedArray([0, 1, 1, 2, 3, 5],
                data["hits"].content)))
data.tolist()

In [None]:
# Cyclic references
tree = awkward.fromiter([
    {"value": 1.23, "left":    1, "right":    2},     # node 0
    {"value": 3.21, "left":    3, "right":    4},     # node 1
    {"value": 9.99, "left":    5, "right":    6},     # node 2
    {"value": 3.14, "left":    7, "right": None},     # node 3
    {"value": 2.71, "left": None, "right":    8},     # node 4
    {"value": 5.55, "left": None, "right": None},     # node 5
    {"value": 8.00, "left": None, "right": None},     # node 6
    {"value": 9.00, "left": None, "right": None},     # node 7
    {"value": 0.00, "left": None, "right": None},     # node 8
])
left = tree.contents["left"].content
right = tree.contents["right"].content
left[(left < 0) | (left > 8)] = 0         # satisfy overzealous validity checks
right[(right < 0) | (right > 8)] = 0
tree.contents["left"].content = awkward.IndexedArray(left, tree)
tree.contents["right"].content = awkward.IndexedArray(right, tree)

tree[0].tolist()
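
The tree above stores its links as integer positions into the same set of arrays. A plain-Python sketch of following those positions to rebuild nested nodes (the function `materialize` is hypothetical, and the column values are copied from the cell above):

```python
# Plain-Python sketch: a tree stored as columns, with links as integer indexes.
values = [1.23, 3.21, 9.99, 3.14, 2.71, 5.55, 8.00, 9.00, 0.00]
left = [1, 3, 5, 7, None, None, None, None, None]
right = [2, 4, 6, None, 8, None, None, None, None]

def materialize(node):
    """Follow index links recursively to build a nested dict for one node."""
    if node is None:
        return None
    return {"value": values[node],
            "left": materialize(left[node]),
            "right": materialize(right[node])}

root = materialize(0)
print(root["left"]["left"]["left"]["value"])   # 9.0 (node 7)
```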

| Array type | Purpose | Members | Usage |
|:-----------|:--------|:--------|:------|
| JaggedArray | variable-sized data structures | starts, stops, content | ubiquitous |
| Table | struct-like objects in columns | contents _(dict)_ | ubiquitous |
| ObjectArray | arbitrary Python types on demand | generator, content | common |
| Methods | mix-in methods and properties on any array type | _(none)_ | common |
| MaskedArray | allow nullable values (`None`) | mask _(bytes)_, content | occasional |
| BitMaskedArray | same, but with a bit-mask | mask _(bits)_, content | from Arrow |
| IndexedMaskedArray | same, but with dense content | mask-index _(integers)_, content | rare |
| IndexedArray | lazy integer indexing: "pointers" | index, content | rare |
| SparseArray | huge array defined at a few indexes | index, content, default | rare |
| UnionArray | heterogeneous types or data sources | tags, index, contents _(list)_ | rare |
| StringArray | special case: jagged array of characters | starts, stops, content, string methods | common |
| ChunkedArray | discontiguous array presented as a whole | counts, chunks _(lists)_ | from Parquet |
| AppendableArray | chunked allocation for efficient appending | counts, chunks _(lists)_ | rare |
| VirtualArray | array generated from a function when needed | generator, possible cached array | from Parquet |