# Data model

Compared to what is provided in most general-purpose programming languages, only a small set of abstract data types is needed for analysis (or high-level programming in general). Numbers, booleans, and fixed-size (rectangular) arrays of numbers and booleans are sufficient for data analysis in most fields of study. Particle physicists demonstrably need variable-length lists of numbers and booleans as well: interest in awkward-array has overwhelmingly focused on the `JaggedArray` class, which provides only this capability. For the intuitive notion of a "particle object," some sort of record type is needed as well. To manipulate sets without restriction, we also need heterogeneity, which can be expressed by a union type.

## Data types

The following four type generators ("PLUR") provide a system of surprising generality:

   * **P**rimitive integers, floating-point numbers, booleans, and any fixed byte-width value (e.g. UUIDs, IP addresses, ...),
   * **L**ists of variable length but homogeneous type,
   * **U**nions of heterogeneous types, such as "electrons and muons" (with different fields in each), and
   * **R**ecords of named, typed fields (a.k.a. objects, structs, composites, classes...).

For instance, JSON (with a schema) is a PLUR system, in which numbers, booleans, and `null` are the primitives, and strings are regarded as a special case of "lists of 1-byte characters." Protobuf, Avro, Thrift, Parquet, and Arrow are statically typed PLUR systems. Unions and records are the sum types and product types of [algebraic type theory](https://en.wikipedia.org/wiki/Algebraic_data_type), respectively. Only one thing from general-purpose programming might be missed by physicists:

   * **P**ointers between objects.

However, we can add this as a fifth type generator ("PLURP") by allowing cross-references and circular references. In a PLUR system, data structures are trees with a maximum depth, limited by the type schema, but in a PLURP system, data structures may be arbitrarily deep trees or even graphs. Awkward-array is PLURP with extra features beyond just representing types.
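The depth bound can be made concrete with a small sketch. The class names `Primitive`, `List`, `Union`, `Record` and the helper `max_depth` below are hypothetical and are not part of awkward-array; they only illustrate that a PLUR schema is a finite tree, so the nesting depth of any conforming data structure is bounded by the schema.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union as TypingUnion


@dataclass
class Primitive:
    name: str  # e.g. "float64", "bool"


@dataclass
class List:
    content: "Schema"  # one homogeneous element type


@dataclass
class Union:
    options: Tuple["Schema", ...]  # heterogeneous alternatives


@dataclass
class Record:
    fields: Dict[str, "Schema"]  # named, typed fields


Schema = TypingUnion[Primitive, List, Union, Record]


def max_depth(schema):
    """Count nesting levels below this node; finite because the schema is a tree."""
    if isinstance(schema, Primitive):
        return 0
    if isinstance(schema, List):
        return 1 + max_depth(schema.content)
    if isinstance(schema, Union):
        return 1 + max((max_depth(o) for o in schema.options), default=0)
    if isinstance(schema, Record):
        return 1 + max((max_depth(f) for f in schema.fields.values()), default=0)
    raise TypeError(f"not a PLUR schema node: {schema!r}")


muon = Record({"pt": Primitive("float64"), "iso": Primitive("float64")})
event = Record({"muons": List(muon), "met": Primitive("float64")})
events = List(event)

print(max_depth(events))  # 4: list -> record -> list -> record -> primitive
```

A pointer node that referred back to an enclosing schema would make this recursion non-terminating, which is the sense in which PLURP data can be arbitrarily deep.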

Data types in a general-purpose programming language can be constructed from the above if interpreted through the appropriate interfaces. For instance, an open file object is an integer that makes system calls, a linked list is a tree of records, and a hash-table is a list with hash-collision handling. PLURP provides a layer of abstraction between raw, serialized memory (e.g. the arrays and structs of the C programming language) and the rarefied types of a high-level language (e.g. classes with hidden implementations).

## Multi-paradigm columnar processing

Awkward-array has been useful as a Numpy extension for particle physics, and I expect its role to increase. However, I don't think the array-programming paradigm is good for all problems. In fact, I'd like to provide three ways to perform computations on these same data structures:

   * array programming, in which the columnar nature of the arrays is visible and there is no index,
   * imperative programming in Numba, in which the columnar nature of the arrays is hidden—the user works with "lists" and "records" in compiled Python—and there is no index, and
   * declarative programming, in which the columnar nature of the arrays is hidden and there is an index to define identity for set operations.

My goal is for the three paradigms to be usable on the same data structures without translation. For example, a physicist-user might apply a first transformation of their data as an array, then do something more complex in a Numba-compiled block, treating the records as Python objects (though Numba compiles their actions into array manipulations under the hood), then do something even more complex, relying on the identity of records in a way not possible in imperative programming, using a declarative language like PartiQL.
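As a rough illustration of the first two paradigms, the sketch below counts muons above a `pt` threshold in each event, once with whole-column operations and once with an object-at-a-time loop. It is an assumption-laden toy: plain Python lists stand in for arrays, the function names are hypothetical, the values are illustrative, and the imperative version materializes Python objects, whereas a Numba-compiled version would run the same loop directly over the columns.

```python
from itertools import accumulate

# Columnar layout: one flat "pt" column plus per-event boundaries.
starts = [0, 3, 3, 5]
stops = [3, 3, 5, 9]
pt = [1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]


def count_array_style(starts, stops, pt, threshold):
    """Whole-column operations: the flat layout is visible to the user."""
    mask = [value > threshold for value in pt]         # elementwise comparison
    running = [0] + list(accumulate(map(int, mask)))   # prefix sum of the mask
    # Per-event count is a difference of prefix sums at the list boundaries.
    return [running[stop] - running[start] for start, stop in zip(starts, stops)]


def count_imperative_style(events, threshold):
    """Object-at-a-time: each event looks like a list of muon records."""
    counts = []
    for event in events:
        n = 0
        for muon in event:
            if muon["pt"] > threshold:
                n += 1
        counts.append(n)
    return counts


# Nested-object view of the same data, built only for the imperative version.
events = [[{"pt": value} for value in pt[start:stop]]
          for start, stop in zip(starts, stops)]

print(count_array_style(starts, stops, pt, 3.0))   # [1, 0, 2, 4]
print(count_imperative_style(events, 3.0))         # [1, 0, 2, 4]
```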

## Mini-awkward-array

Rather than using awkward-array in this demo, I reimplemented a simple version of it so that the relationships between the types and the index are clearer.

This implementation has only four classes: `PrimitiveArray`, `ListArray`, `UnionArray`, and `RecordArray`. As in awkward-array, the data are stored in columns but may be thought of as nested objects.

In [1]:
import data

events = data.RecordArray({
    "muons": data.ListArray([0, 3, 3, 5], [3, 3, 5, 9], data.RecordArray({
        "pt": data.PrimitiveArray([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]),
        "iso": data.PrimitiveArray([0, 0, 100, 50, 30, 1, 2, 3, 4])
    })),
    "jets": data.ListArray([0, 5, 6, 8], [5, 6, 8, 12], data.RecordArray({
        "pt": data.PrimitiveArray([1, 2, 3, 4, 5, 100, 30, 50, 1, 2, 3, 4]),
        "mass": data.PrimitiveArray([10, 10, 10, 10, 10, 5, 15, 15, 9, 8, 7, 6])
    })),
    "met": data.PrimitiveArray([100, 200, 300, 400])
})

The semi-realistic event structure above is defined in its columnar representation—Python lists are stand-ins for the arrays we would use in an efficient system. We can extract data as jagged arrays:

In [2]:
events["muons"]["pt"]

<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]]>

Or as the nested objects the arrays represent.

In [3]:
events[0]["muons"].tolist()

[{'pt': 1.1, 'iso': 0}, {'pt': 2.2, 'iso': 0}, {'pt': 3.3, 'iso': 100}]
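To show how flat columns plus `starts`/`stops` boundaries can produce both views, here is a minimal sketch of three of the classes. It is not the `data` module imported above: the internals, the use of plain dicts for single records, and the omission of `UnionArray` are all simplifying assumptions made for illustration.

```python
class PrimitiveArray:
    def __init__(self, data):
        self.data = list(data)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, where):
        if isinstance(where, slice):
            return PrimitiveArray(self.data[where])
        return self.data[where]  # integer argument: a scalar

    def tolist(self):
        return list(self.data)


class ListArray:
    def __init__(self, starts, stops, content):
        self.starts, self.stops, self.content = list(starts), list(stops), content

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, where):
        if isinstance(where, str):
            # String argument: project through the nested records.
            return ListArray(self.starts, self.stops, self.content[where])
        if isinstance(where, slice):
            return ListArray(self.starts[where], self.stops[where], self.content)
        # Integer argument: one variable-length list, cut out of the flat content.
        return self.content[self.starts[where]:self.stops[where]]

    def tolist(self):
        return [self[i].tolist() for i in range(len(self))]


class RecordArray:
    def __init__(self, contents):
        self.contents = dict(contents)

    def __len__(self):
        return min((len(c) for c in self.contents.values()), default=0)

    def __getitem__(self, where):
        if isinstance(where, str):
            return self.contents[where]  # string argument: pick one column
        if isinstance(where, slice):
            return RecordArray({n: c[where] for n, c in self.contents.items()})
        # Integer argument: one record, represented here as a plain dict.
        return {n: c[where] for n, c in self.contents.items()}

    def tolist(self):
        columns = {n: c.tolist() for n, c in self.contents.items()}
        return [{n: columns[n][i] for n in columns} for i in range(len(self))]


events = RecordArray({
    "muons": ListArray([0, 3, 3, 5], [3, 3, 5, 9], RecordArray({
        "pt": PrimitiveArray([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]),
        "iso": PrimitiveArray([0, 0, 100, 50, 30, 1, 2, 3, 4]),
    })),
    "met": PrimitiveArray([100, 200, 300, 400]),
})

print(events["muons"]["pt"].tolist())
# [[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]]
print(events[0]["muons"].tolist())
# [{'pt': 1.1, 'iso': 0}, {'pt': 2.2, 'iso': 0}, {'pt': 3.3, 'iso': 100}]
```

Nothing is copied into nested objects until `tolist` is called; every intermediate result is a view described by a column and a pair of boundaries.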

In awkward-array, we would describe the high-level type of this structure as follows:

```
[0, 4) -> "muons" -> [0, ?) -> "pt" -> float64
                               "iso" -> float64
          "jets" -> [0, ?) -> "pt" -> float64
                              "mass" -> float64
          "met" -> float64
```

because an array is like a function that takes an integer or string in square brackets and returns something else: an array or a scalar primitive. The first argument after `events` can be any non-negative integer less than `4`, and this returns an array/function that takes `"muons"`, `"jets"`, or `"met"`. The return type depends on which string is passed. The variable-length lists take a non-negative integer in `[0, ?)`; the upper limit is left unspecified because the per-list limits are too complex to encode in the type description. (The size of the type description should not be allowed to scale with the size of the array, and this limits its expressiveness for variable-length lists.) Because awkward-array is based on Numpy, the leaves of its type are Numpy dtypes, such as `float64`. If the array includes a cross-reference or circular reference, the type description would be a graph with interconnections, not a tree.
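The "array as a function" reading can be sketched by encoding the type description as a nested Python value and applying arguments to it one at a time. The encoding and the names `apply_argument` and `result_type` are hypothetical, chosen for this illustration only, and the sketch checks arguments only in the order in which the type description is written.

```python
# A type is one of:
#   ("range", low, high, inner)  takes an integer in [low, high); high=None means "?"
#   {"name": inner, ...}         takes one of the field names
#   "float64"                    a leaf dtype; takes no further arguments
events_type = ("range", 0, 4, {
    "muons": ("range", 0, None, {"pt": "float64", "iso": "float64"}),
    "jets": ("range", 0, None, {"pt": "float64", "mass": "float64"}),
    "met": "float64",
})


def apply_argument(type_, argument):
    """Return the type obtained by passing one square-bracket argument."""
    if isinstance(type_, tuple):
        _, low, high, inner = type_
        in_range = (isinstance(argument, int)
                    and argument >= low
                    and (high is None or argument < high))
        if not in_range:
            raise TypeError(f"expected an integer in [{low}, {high}), got {argument!r}")
        return inner
    if isinstance(type_, dict):
        if not isinstance(argument, str) or argument not in type_:
            raise TypeError(f"expected one of {sorted(type_)}, got {argument!r}")
        return type_[argument]
    raise TypeError(f"{type_} is a leaf and takes no arguments")


def result_type(type_, *arguments):
    """Apply a sequence of arguments, left to right."""
    for argument in arguments:
        type_ = apply_argument(type_, argument)
    return type_


print(result_type(events_type, 0, "muons", 2, "pt"))  # float64
print(result_type(events_type, 3, "met"))             # float64
```

Because the upper bound of the inner lists is `?`, the type alone cannot tell whether index `2` is valid for a particular event; only the outer bound `4` is checked.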

In awkward-array, it's possible to change the order of these arguments:

In [4]:
# [0, 4) before "muons"/"jets"/"met"
events[0]["muons"].tolist()

[{'pt': 1.1, 'iso': 0}, {'pt': 2.2, 'iso': 0}, {'pt': 3.3, 'iso': 100}]

In [5]:
# "muons"/"jets"/"met" before [0, 4)
events["muons"][0].tolist()

[{'pt': 1.1, 'iso': 0}, {'pt': 2.2, 'iso': 0}, {'pt': 3.3, 'iso': 100}]

Passing a string argument to a `ListArray`, as though it were a `RecordArray`, creates a `ListArray` of one of the nested `RecordArray`'s fields. It is a projection through the nested records.

In [6]:
events["muons"]["pt"]

<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]]>

In [7]:
events["muons"]["iso"]

<Array [[0, 0, 100], [], [50, 30], [1, 2, 3, 4]]>

Formally, we find that string arguments and integer arguments commute with each other, though string arguments do not commute with string arguments (you can't reverse the order of nested records) and integer arguments do not commute with integer arguments (you can't reverse the order of nested lists).
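The commutation rule can be checked on a small semantic model that uses ordinary nested Python objects rather than columns. The helpers `by_name`, `by_index`, and `getitems` are hypothetical names for this sketch: a string argument selects a field and projects through any enclosing lists, while an integer argument selects an element of the outermost list. Orders that would apply an integer to a single record, such as `[0][1]`, are not meaningful in this model and are left out.

```python
def by_name(node, name):
    """String argument: select a field, projecting through enclosing lists."""
    if isinstance(node, dict):
        return node[name]
    if isinstance(node, list):
        return [by_name(item, name) for item in node]
    raise TypeError(f"cannot apply field name {name!r} to {node!r}")


def by_index(node, index):
    """Integer argument: select one element of the outermost list."""
    if isinstance(node, list):
        return node[index]
    raise TypeError(f"cannot apply index {index!r} to {node!r}")


def getitems(node, *arguments):
    """Apply square-bracket arguments left to right."""
    for argument in arguments:
        if isinstance(argument, str):
            node = by_name(node, argument)
        else:
            node = by_index(node, argument)
    return node


rows = [
    {"muons": [{"pt": 1.1, "iso": 0}, {"pt": 2.2, "iso": 0}, {"pt": 3.3, "iso": 100}],
     "met": 100},
    {"muons": [], "met": 200},
    {"muons": [{"pt": 4.4, "iso": 50}, {"pt": 5.5, "iso": 30}], "met": 300},
]

# Strings keep their relative order ("muons" before "pt"), integers keep
# theirs (0 before 1), but strings and integers are interleaved freely.
orders = [
    (0, "muons", 1, "pt"),
    ("muons", 0, 1, "pt"),
    ("muons", 0, "pt", 1),
    (0, "muons", "pt", 1),
    ("muons", "pt", 0, 1),
]
results = [getitems(rows, *order) for order in orders]
print(results)  # [2.2, 2.2, 2.2, 2.2, 2.2]
```

Swapping the two integers, as in `getitems(rows, "muons", "pt", 1, 0)`, asks for the first muon of the second event and raises `IndexError` because that event has no muons. Swapping the two strings raises `KeyError` because events have no `"pt"` field.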

This commutation relation will be important for defining the awkward indexes.

## Indexes and keys

The key concept that an SQL-like query language would add to awkward-array is indexing—giving each field a unique identifier.

