# Array node layouts

This document describes, in a minimal way, each of the composable array layout nodes that collectively describe structured data in a columnar way.

All of the constraints on allowed values in an Awkward Array are expressed here as assertions in `__init__`. The `random` constructors create a random valid array. The data structure is defined by its length `__len__`, its `__getitem__` behavior for integers, slices (without step), and string fields. Iteration with `__iter__` converts the columnar data into rowwise data.

## General structure

Arrays are composed of [Content](https://github.com/scikit-hep/awkward-1.0/blob/master/include/awkward/Content.h) subclasses and integer arrays. Integer arrays (called [Index](https://github.com/scikit-hep/awkward-1.0/blob/master/include/awkward/Index.h)) determine the structure, but not the content of the data structure.

We'll define a Content base class with some facilities that will be used by all subclasses.

In [1]:
class Content:
    def __iter__(self):
        "Iterate over the data structure, converting columnar data into rowwise."
        
        def convert(x):
            if isinstance(x, Content):
                return list(x)
            elif isinstance(x, tuple):
                return tuple(convert(y) for y in x)
            elif isinstance(x, dict):
                return {n: convert(y) for n, y in x.items()}
            else:
                return x

        for i in range(len(self)):
            yield convert(self[i])

    def __repr__(self):
        "Print an XML representation of the data."
        
        return self.tostring_part("", "", "").rstrip()

    @staticmethod
    def random(minlen=0, choices=None):
        "Generate a random array from a set of possible classes."
        
        if choices is None:
            choices = [x for x in globals().values() if isinstance(x, type) and issubclass(x, Content)]
        else:
            choices = list(choices)
        if minlen != 0 and EmptyArray in choices:
            choices.remove(EmptyArray)
        assert len(choices) > 0
        cls = random.choice(choices)
        return cls.random(minlen, choices)

These are some utilities for generating random data.

In [2]:
import math
import random

def random_number():
    return round(random.gauss(5, 3), 1)

def random_length(minlen=0, maxlen=None):
    if maxlen is None:
        return minlen + int(math.floor(random.expovariate(0.1)))
    else:
        return random.randint(minlen, maxlen)

## Leaf nodes

Only three types of nodes can terminate an array data structure: RawArray, NumpyArray, and EmptyArray.

### RawArray

The [RawArray](https://github.com/scikit-hep/awkward-1.0/blob/master/include/awkward/array/RawArray.h) class is intended for use in C++. (Its implementation is header-only and templated.) RawArray is defined by

   * `ptr`: the data themselves, a raw buffer. It can contain anything; the RawArray just wraps the data as a Content.

If the type is a numerical type, a RawArray corresponds to an Apache Arrow [Primitive array](primitive-value-arrays).

In [3]:
class RawArray(Content):
    def __init__(self, ptr):
        assert isinstance(ptr, list)
        self.ptr = ptr

    @staticmethod
    def random(minlen=0, choices=None):
        return RawArray([random_number() for i in range(random_length(minlen))])

    def __len__(self):
        return len(self.ptr)

    def __getitem__(self, where):
        if isinstance(where, int):
            assert 0 <= where < len(self)
            return self.ptr[where]
        elif isinstance(where, slice) and where.step is None:
            return RawArray(self.ptr[where])
        else:
            raise AssertionError(where)

    def tostring_part(self, indent, pre, post):
        out = indent + pre + "<RawArray>\n"
        out += indent + "    <ptr>" + " ".join(repr(x) for x in self.ptr) + "</ptr>\n"
        out += indent + "</RawArray>" + post
        return out

Here is an example.

In [7]:
x = RawArray.random()
x

<RawArray>
    <ptr>4.2 6.8 3.1 3.4 7.6 9.4</ptr>
</RawArray>

In [8]:
list(x)

[4.2, 6.8, 3.1, 3.4, 7.6, 9.4]

### NumpyArray

The [NumpyArray](https://github.com/scikit-hep/awkward-1.0/blob/master/include/awkward/array/NumpyArray.h) class is more general, describing multidimensional data with

   * `ptr`: the data themselves, a raw buffer.
   * `shape`: non-negative integers, at least one. Each represents the number of components in a dimension; a `shape` of length _N_ represents an _N_ dimensional tensor. The number of items in the array is the product of all values in `shape` (may be zero).
   * `strides`: the same number of integers as `shape`. Each stride says how many items in `ptr` to skip per element in the corresponding dimension. (Strides can be negative or zero.)
   * `offset`: the number of items in `ptr` to skip before the first element of the array.

In NumPy and Awkward, there is also

   * `itemsize`: the number of bytes per item (i.e. 1 for characters, 4 for `int32`, 8 for `double` types).

The `strides` and `offset` are measured in bytes, but in this simplified representation, we ignore this, assuming all items have `itemsize` of 1.
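As a rough sketch of how `shape`, `strides`, and `offset` locate an element in `ptr` (assuming an `itemsize` of 1, as above), the position of a multidimensional index is the `offset` plus the sum of each index component times its stride. The helper name `flat_position` is hypothetical, introduced only for this illustration.

```python
def flat_position(index, strides, offset):
    """Position in ptr of the element at a multidimensional index (itemsize of 1)."""
    return offset + sum(i * s for i, s in zip(index, strides))

ptr = [5.4, 1.0, 3.5, 7.0, 2.2, 6.6]
shape, offset = [2, 2], 2

def view(strides):
    # Build the rowwise (nested list) form of a two-dimensional view of ptr.
    return [
        [ptr[flat_position((i, j), strides, offset)] for j in range(shape[1])]
        for i in range(shape[0])
    ]

rows = view([2, 1])        # contiguous rows after skipping two items
transposed = view([1, 2])  # same buffer, strides swapped

print(rows)        # [[3.5, 7.0], [2.2, 6.6]]
print(transposed)  # [[3.5, 2.2], [7.0, 6.6]]
```

Swapping the strides yields the transpose without copying `ptr`, which is why `strides` are stored separately from `shape`.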

If the `shape` is one-dimensional, a NumpyArray corresponds to an Apache Arrow [Primitive array](primitive-value-arrays).

In [9]:
class NumpyArray(Content):
    def __init__(self, ptr, shape, strides, offset):
        assert isinstance(ptr, list)
        assert isinstance(shape, list)
        assert isinstance(strides, list)
        for x in ptr:
            assert isinstance(x, (bool, int, float))
        assert len(shape) > 0
        assert len(strides) == len(shape)
        for x in shape:
            assert isinstance(x, int)
            assert x >= 0
        for x in strides:
            assert isinstance(x, int)
        assert isinstance(offset, int)
        if all(x != 0 for x in shape):
            assert 0 <= offset < len(ptr)
            assert shape[0] * strides[0] + offset <= len(ptr)
        self.ptr = ptr
        self.shape = shape
        self.strides = strides
        self.offset = offset

    @staticmethod
    def random(minlen=0, choices=None):
        shape = [random_length(minlen)]
        for i in range(random_length(0, 2)):
            shape.append(random_length(1, 3))
        strides = [1]
        for x in shape[:0:-1]:
            skip = random_length(0, 2)
            strides.insert(0, x * strides[0] + skip)
        offset = random_length()
        ptr = [random_number() for i in range(shape[0] * strides[0] + offset)]
        return NumpyArray(ptr, shape, strides, offset)

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, where):
        if isinstance(where, int):
            assert 0 <= where < len(self)
            offset = self.offset + self.strides[0] * where
            if len(self.shape) == 1:
                return self.ptr[offset]
            else:
                return NumpyArray(self.ptr, self.shape[1:], self.strides[1:], offset)
        elif isinstance(where, slice) and where.step is None:
            offset = self.offset + self.strides[0] * where.start
            shape = [where.stop - where.start] + self.shape[1:]
            return NumpyArray(self.ptr, shape, self.strides, offset)
        else:
            raise AssertionError(where)

    def tostring_part(self, indent, pre, post):
        out = indent + pre + "<NumpyArray>\n"
        out += indent + "    <ptr>" + " ".join(str(x) for x in self.ptr) + "</ptr>\n"
        out += indent + "    <shape>" + " ".join(str(x) for x in self.shape) + "</shape>\n"
        out += indent + "    <strides>" + " ".join(str(x) for x in self.strides) + "</strides>\n"
        out += indent + "    <offset>" + str(self.offset) + "</offset>\n"
        out += indent + "</NumpyArray>" + post
        return out

Here is an example.

In [13]:
x = NumpyArray.random()
x

<NumpyArray>
    <ptr>5.4 1.0 3.5 7.0 2.2 6.6</ptr>
    <shape>2 2</shape>
    <strides>2 1</strides>
    <offset>2</offset>
</NumpyArray>

In [14]:
list(x)

[[3.5, 7.0], [2.2, 6.6]]

### EmptyArray

The [EmptyArray](https://github.com/scikit-hep/awkward-1.0/blob/master/include/awkward/array/EmptyArray.h) class is used whenever an array's type is not known because it is empty (when determining types from observed elements).

EmptyArray has no equivalent in Apache Arrow.

In [15]:
class EmptyArray(Content):
    def __init__(self):
        pass

    @staticmethod
    def random(minlen=0, choices=None):
        assert minlen == 0
        return EmptyArray()

    def __len__(self):
        return 0

    def __getitem__(self, where):
        if isinstance(where, int):
            assert False
        elif isinstance(where, slice) and where.step is None:
            return EmptyArray()
        else:
            raise AssertionError(where)

    def tostring_part(self, indent, pre, post):
        return indent + pre + "<EmptyArray/>" + post

Here is an example.

In [16]:
x = EmptyArray.random()
x

<EmptyArray/>

In [17]:
list(x)

[]

## Arrays of lists

Lists may have uniform lengths or unequal lengths. RegularArray describes the first case and ListOffsetArray and ListArray are two ways of describing the second case.

### RegularArray

The [RegularArray](https://github.com/scikit-hep/awkward-1.0/blob/master/include/awkward/array/RegularArray.h) class describes lists that all have the same length, the single integer `size`. Its underlying `content` is a flattened view of the data—that is, each list is not stored separately in memory, but is inferred as a subinterval of the underlying data.

If the `content` is not an integer multiple of `size`, then the length of the RegularArray is truncated to the largest integer multiple.
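The truncation rule can be sketched with plain Python lists standing in for the `content`. The function `regular_lists` is a hypothetical helper for illustration, not part of the library.

```python
def regular_lists(content, size):
    """Split a flat content into equal-length lists, dropping any remainder."""
    assert size > 0
    length = len(content) // size  # floor division truncates the length
    return [content[i * size:(i + 1) * size] for i in range(length)]

content = [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6]
lists = regular_lists(content, 3)
print(lists)  # [[0.0, 1.1, 2.2], [3.3, 4.4, 5.5]]
```

Here the seventh item, `6.6`, is present in the `content` but unreachable because 7 is not a multiple of 3.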

A multidimensional NumpyArray is equivalent to a one-dimensional NumpyArray nested within several RegularArrays, one for each dimension. However, RegularArrays can be used to make lists of _any_ other type.

RegularArray corresponds to an Apache Arrow [Tensor](https://arrow.apache.org/docs/python/generated/pyarrow.Tensor.html).

In [18]:
class RegularArray(Content):
    def __init__(self, content, size):
        assert isinstance(content, Content)
        assert isinstance(size, int)
        assert size > 0
        self.content = content
        self.size = size

    @staticmethod
    def random(minlen=0, choices=None):
        size = random_length(1, 5)
        return RegularArray(Content.random(random_length(minlen) * size, choices), size)

    def __len__(self):
        return len(self.content) // self.size   # floor division

    def __getitem__(self, where):
        if isinstance(where, int):
            assert 0 <= where < len(self)
            return self.content[where * self.size:(where + 1) * self.size]
        elif isinstance(where, slice) and where.step is None:
            start = where.start * self.size
            stop = where.stop * self.size
            return RegularArray(self.content[start:stop], self.size)
        else:
            raise AssertionError(where)

    def tostring_part(self, indent, pre, post):
        out = indent + pre + "<RegularArray>\n"
        out += self.content.tostring_part(indent + "    ", "<content>", "</content>\n")
        out += indent + "    <size>" + str(self.size) + "</size>\n"
        out += indent + "</RegularArray>" + post
        return out

Here is an example.

In [25]:
x = RegularArray.random(choices=[RawArray])
x

<RegularArray>
    <content><RawArray>
        <ptr>2.1 5.0 3.9 4.4 7.9 8.8 7.8 3.4 3.8 5.1 7.5 5.7</ptr>
    </RawArray></content>
    <size>4</size>
</RegularArray>

In [26]:
list(x)

[[2.1, 5.0, 3.9, 4.4], [7.9, 8.8, 7.8, 3.4], [3.8, 5.1, 7.5, 5.7]]

### ListOffsetArray

The [ListOffsetArray](https://github.com/scikit-hep/awkward-1.0/blob/master/include/awkward/array/ListOffsetArray.h) class describes unequal-length lists (often called a "jagged" or "ragged" array). Like RegularArray, the underlying data for all lists are in a contiguous `content`. It is subdivided into lists according to an `offsets` array, which specifies the starting and stopping index of each list.

The `offsets` must have length at least 1 (a single offset describes an empty array), but it need not start with zero or completely cover the `content`. Just as RegularArray can have unreachable `content` if it is not an integer multiple of `size`, a ListOffsetArray can have unreachable `content` before the first list and after the last list.
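A minimal sketch of how `offsets` subdivide a flat `content`, using plain Python lists; `lists_from_offsets` is a hypothetical name used only here.

```python
def lists_from_offsets(offsets, content):
    """Each pair of neighboring offsets is the start and stop of one list."""
    return [content[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]

content = [1.1, 2.2, 3.3, 4.4, 5.5, 6.6]
offsets = [1, 1, 4, 5]  # does not start at 0 and does not reach len(content)

lists = lists_from_offsets(offsets, content)
print(lists)  # [[], [2.2, 3.3, 4.4], [5.5]]

empty = lists_from_offsets([0], content)  # a single offset: zero lists
print(empty)  # []
```

In this sketch, `1.1` and `6.6` are unreachable: they lie before the first offset and after the last one.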

ListOffsetArray corresponds to Apache Arrow [List type](https://arrow.apache.org/docs/memory_layout.html#list-type).

In [27]:
class ListOffsetArray(Content):
    def __init__(self, offsets, content):
        assert isinstance(offsets, list)
        assert isinstance(content, Content)
        assert len(offsets) != 0
        for i in range(len(offsets) - 1):
            start = offsets[i]
            stop = offsets[i + 1]
            assert isinstance(start, int)
            assert isinstance(stop, int)
            if start != stop:
                assert start < stop   # i.e. start <= stop
                assert start >= 0
                assert stop <= len(content)
        self.offsets = offsets
        self.content = content

    @staticmethod
    def random(minlen=0, choices=None):
        counts = [random_length() for i in range(random_length(minlen))]
        offsets = [random_length()]
        for x in counts:
            offsets.append(offsets[-1] + x)
        return ListOffsetArray(offsets, Content.random(offsets[-1], choices))
        
    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, where):
        if isinstance(where, int):
            assert 0 <= where < len(self)
            return self.content[self.offsets[where]:self.offsets[where + 1]]
        elif isinstance(where, slice) and where.step is None:
            offsets = self.offsets[where.start : where.stop + 1]
            if len(offsets) == 0:
                offsets = [0]
            return ListOffsetArray(offsets, self.content)
        else:
            raise AssertionError(where)

    def tostring_part(self, indent, pre, post):
        out = indent + pre + "<ListOffsetArray>\n"
        out += indent + "    <offsets>" + " ".join(str(x) for x in self.offsets) + "</offsets>\n"
        out += self.content.tostring_part(indent + "    ", "<content>", "</content>\n")
        out += indent + "</ListOffsetArray>" + post
        return out

Here is an example.

In [32]:
x = ListOffsetArray.random(choices=[RawArray])
x

<ListOffsetArray>
    <offsets>0 0 9 11</offsets>
    <content><RawArray>
        <ptr>7.7 5.1 -2.3 3.7 5.5 9.0 7.1 6.9 7.3 5.8 7.6 2.3 -0.4 8.2 8.1 5.3 3.4 2.0 -1.7 1.7 6.6 6.7 6.6 3.5 3.0 8.8 6.8 8.7 6.1 3.7 8.5 3.7 3.8 8.1</ptr>
    </RawArray></content>
</ListOffsetArray>

In [33]:
list(x)

[[], [7.7, 5.1, -2.3, 3.7, 5.5, 9.0, 7.1, 6.9, 7.3], [5.8, 7.6]]

### ListArray

The [ListArray](https://github.com/scikit-hep/awkward-1.0/blob/master/include/awkward/array/ListArray.h) class generalizes ListOffsetArray: its lists need not be in order within the `content`, and there may be unreachable elements between lists. Instead of a single `offsets` array, ListArray has

   * `starts`: the starting index of each list.
   * `stops`: the stopping index of each list.

ListOffsetArray's `offsets` may be related to `starts` and `stops`:

```python
starts = offsets[:-1]
stops = offsets[1:]
```

ListArrays are a useful byproduct of structure manipulation: as a result of some operation, we might want to view slices or permutations of the `content` without copying it to make a contiguous version of it. For that reason, ListArrays are more useful in a data-manipulation library like Awkward Array than in a data-representation library like Apache Arrow. There is no equivalent of ListArray in Apache Arrow.
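To illustrate the extra freedom that separate `starts` and `stops` provide, here is a small sketch with plain Python lists. The helper `lists_from_starts_stops` is hypothetical and only mirrors the slicing rule described above.

```python
def lists_from_starts_stops(starts, stops, content):
    """Pair each start with its stop; extra stops (if any) are ignored by zip."""
    return [content[start:stop] for start, stop in zip(starts, stops)]

content = [1.1, 2.2, 3.3, 4.4, 5.5]

# Derived from offsets, the result matches a ListOffsetArray.
offsets = [0, 2, 2, 5]
in_order = lists_from_starts_stops(offsets[:-1], offsets[1:], content)
print(in_order)  # [[1.1, 2.2], [], [3.3, 4.4, 5.5]]

# The same lists in a different order, without moving any content.
permuted = lists_from_starts_stops([2, 0, 0], [5, 0, 2], content)
print(permuted)  # [[3.3, 4.4, 5.5], [], [1.1, 2.2]]

# Lists may also overlap, which a single offsets array cannot express.
overlapping = lists_from_starts_stops([0, 1], [3, 4], content)
print(overlapping)  # [[1.1, 2.2, 3.3], [2.2, 3.3, 4.4]]
```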

In [34]:
class ListArray(Content):
    def __init__(self, starts, stops, content):
        assert isinstance(starts, list)
        assert isinstance(stops, list)
        assert isinstance(content, Content)
        assert len(stops) >= len(starts)   # usually equal
        for i in range(len(starts)):
            start = starts[i]
            stop = stops[i]
            assert isinstance(start, int)
            assert isinstance(stop, int)
            if start != stop:
                assert start < stop   # i.e. start <= stop
                assert start >= 0
                assert stop <= len(content)
        self.starts = starts
        self.stops = stops
        self.content = content

    @staticmethod
    def random(minlen=0, choices=None):
        content = Content.random(0, choices)
        length = random_length(minlen)
        if len(content) == 0:
            starts = [random.randint(0, 10) for i in range(length)]
            stops = list(starts)
        else:
            starts = [random.randint(0, len(content) - 1) for i in range(length)]
            stops = [x + min(random_length(), len(content) - x) for x in starts]
        return ListArray(starts, stops, content)
        
    def __len__(self):
        return len(self.starts)

    def __getitem__(self, where):
        if isinstance(where, int):
            assert 0 <= where < len(self)
            return self.content[self.starts[where]:self.stops[where]]
        elif isinstance(where, slice) and where.step is None:
            starts = self.starts[where.start:where.stop]
            stops = self.stops[where.start:where.stop]
            return ListArray(starts, stops, self.content)
        else:
            raise AssertionError(where)

    def tostring_part(self, indent, pre, post):
        out = indent + pre + "<ListArray>\n"
        out += indent + "    <starts>" + " ".join(str(x) for x in self.starts) + "</starts>\n"
        out += indent + "    <stops>" + " ".join(str(x) for x in self.stops) + "</stops>\n"
        out += self.content.tostring_part(indent + "    ", "<content>", "</content>\n")
        out += indent + "</ListArray>" + post
        return out

Here is an example.

In [37]:
x = ListArray.random(choices=[RawArray])
x

<ListArray>
    <starts>1 2 0 1 2 3 2 2 1 1 2 1 0 2 3 3 3</starts>
    <stops>4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4</stops>
    <content><RawArray>
        <ptr>9.8 2.2 3.6 5.7</ptr>
    </RawArray></content>
</ListArray>

In [38]:
list(x)

[[2.2, 3.6, 5.7],
 [3.6, 5.7],
 [9.8, 2.2, 3.6, 5.7],
 [2.2, 3.6, 5.7],
 [3.6, 5.7],
 [5.7],
 [3.6, 5.7],
 [3.6, 5.7],
 [2.2, 3.6, 5.7],
 [2.2, 3.6, 5.7],
 [3.6, 5.7],
 [2.2, 3.6, 5.7],
 [9.8, 2.2, 3.6, 5.7],
 [3.6, 5.7],
 [5.7],
 [5.7],
 [5.7]]

## Indirection

Data structures consist of more than just lists, so several nodes are dedicated to building structure.

### RedirectArray

**TODO:** points to another node within the data structure, allowing for the creation of directed acyclic graphs and fully cyclic graphs.

### SlicedArray

**TODO:** represents an array to be sliced (lazily) by a `start:stop:step` slice (internally represented by `offset:length:stride`).

### IndexedArray

The [IndexedArray](https://github.com/scikit-hep/awkward-1.0/blob/master/include/awkward/array/IndexedArray.h) class is a general-purpose tool for restructuring its `content`. Its `index` array is a lazily applied [numpy.take](https://docs.scipy.org/doc/numpy/reference/generated/numpy.take.html) (integer-array slice, also known as "advanced indexing"). It has many uses:

   * generalizing SlicedArray to advanced indexing.
   * emulating pointers when paired with RedirectArray.
   * emulating Apache Arrow's [dictionary encoding](https://arrow.apache.org/docs/memory_layout.html#dictionary-encoding).
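As a sketch of the dictionary-encoding use, a short `content` of distinct values can be expanded by an `index` of small integers. The string values below are made up for illustration; plain lists stand in for the array nodes.

```python
content = ["one", "two", "three"]  # each distinct value stored once
index = [2, 0, 0, 1, 2, 2]         # which content element each entry refers to

# Applying the index is an integer-array "take".
taken = [content[i] for i in index]
print(taken)  # ['three', 'one', 'one', 'two', 'three', 'three']

# Slicing only needs to slice the index; the content is shared, not copied.
sliced_index = index[1:4]
sliced = [content[i] for i in sliced_index]
print(sliced)  # ['one', 'one', 'two']
```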

In [39]:
class IndexedArray(Content):
    def __init__(self, index, content):
        assert isinstance(index, list)
        assert isinstance(content, Content)
        for x in index:
            assert isinstance(x, int)
            assert 0 <= x < len(content)   # index[i] must not be negative
        self.index = index
        self.content = content

    @staticmethod
    def random(minlen=0, choices=None):
        if minlen == 0:
            content = Content.random(0, choices)
        else:
            content = Content.random(1, choices)
        if len(content) == 0:
            index = []
        else:
            index = [random.randint(0, len(content) - 1) for i in range(random_length(minlen))]
        return IndexedArray(index, content)

    def __len__(self):
        return len(self.index)

    def __getitem__(self, where):
        if isinstance(where, int):
            assert 0 <= where < len(self)
            return self.content[self.index[where]]
        elif isinstance(where, slice) and where.step is None:
            return IndexedArray(self.index[where.start:where.stop], self.content)
        else:
            raise AssertionError(where)

    def tostring_part(self, indent, pre, post):
        out = indent + pre + "<IndexedArray>\n"
        out += indent + "    <index>" + " ".join(str(x) for x in self.index) + "</index>\n"
        out += self.content.tostring_part(indent + "    ", "<content>", "</content>\n")
        out += indent + "</IndexedArray>" + post
        return out

Here is an example.

In [42]:
x = IndexedArray.random(choices=[RawArray])
x

<IndexedArray>
    <index>4 0 4 3 4 6 8 7 1 5</index>
    <content><RawArray>
        <ptr>3.7 4.5 5.3 4.9 2.9 5.8 6.7 4.3 1.4 6.7 1.7</ptr>
    </RawArray></content>
</IndexedArray>

In [43]:
list(x)

[2.9, 3.7, 2.9, 4.9, 2.9, 6.7, 1.4, 4.3, 4.5, 5.8]

## Missing data

Missing values may be represented as _n/a_, floating-point `NaN`, or a value like `null` or `None`. [Data of this type](https://en.wikipedia.org/wiki/Nullable_type) are variously called "nullable", "maybe", "option", or "optional."

Awkward Array has several methods of representing missing data, the most general of which is IndexedOptionArray.

### IndexedOptionArray

The [IndexedOptionArray](https://github.com/scikit-hep/awkward-1.0/blob/master/include/awkward/array/IndexedArray.h) class is an IndexedArray for which negative values in the `index` are interpreted as missing. In C++, it is implemented as a template specialization of IndexedArray.
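The rule for negative `index` values can be sketched with plain lists; `take_option` is a hypothetical helper name.

```python
def take_option(index, content):
    """Any negative index (not only -1) marks a missing value."""
    return [None if i < 0 else content[i] for i in index]

content = [6.8, 9.4]
index = [0, -1, 0, 1, -2, -69]

values = take_option(index, content)
print(values)  # [6.8, None, 6.8, 9.4, None, None]
```

Note that negative values are never interpreted as Python-style indexing from the end of the `content`.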

In [45]:
class IndexedOptionArray(Content):
    def __init__(self, index, content):
        assert isinstance(index, list)
        assert isinstance(content, Content)
        for x in index:
            assert isinstance(x, int)
            assert x < len(content)   # index[i] may be negative
        self.index = index
        self.content = content

    @staticmethod
    def random(minlen=0, choices=None):
        content = Content.random(0, choices)
        index = []
        for i in range(random_length(minlen)):
            if len(content) == 0 or random.randint(0, 4) == 0:
                index.append(-random_length(1))   # a random number, but not necessarily -1
            else:
                index.append(random.randint(0, len(content) - 1))
        return IndexedOptionArray(index, content)

    def __len__(self):
        return len(self.index)

    def __getitem__(self, where):
        if isinstance(where, int):
            assert 0 <= where < len(self)
            if self.index[where] < 0:
                return None
            else:
                return self.content[self.index[where]]
        elif isinstance(where, slice) and where.step is None:
            return IndexedOptionArray(self.index[where.start:where.stop], self.content)
        else:
            raise AssertionError(where)

    def tostring_part(self, indent, pre, post):
        out = indent + pre + "<IndexedOptionArray>\n"
        out += indent + "    <index>" + " ".join(str(x) for x in self.index) + "</index>\n"
        out += self.content.tostring_part(indent + "    ", "<content>", "</content>\n")
        out += indent + "</IndexedOptionArray>" + post
        return out

Here is an example.

In [46]:
x = IndexedOptionArray.random(choices=[RawArray])
x

<IndexedOptionArray>
    <index>0 -1 0 1 -2 -69</index>
    <content><RawArray>
        <ptr>6.8 9.4</ptr>
    </RawArray></content>
</IndexedOptionArray>

In [47]:
list(x)

[6.8, None, 6.8, 9.4, None, None]

### ByteMaskedArray

**TODO:** option-typed data defined by a `mask` of bytes.

### BitMaskedArray

**TODO:** option-typed data defined by a `mask` of bits.

### UnmaskedArray

**TODO:** option-typed data in which all values are valid (i.e. no mask). This is a placeholder for possibly missing data that are not actually missing.

## Lazily-loaded data

It can be advantageous to represent data that are not loaded into memory yet. Virtual arrays are functions that produce arrays on demand.

### PyVirtualArray

**TODO:** materializes arrays by calling a Python function.

## Discontiguous data

It can also be advantageous for arrays that represent a single sequence to be physically separate, particularly if they are lazily loaded. ChunkedArray and RegularChunkedArray often contain PyVirtualArrays.

### ChunkedArray

**TODO:** independently allocated arrays representing a single sequence.

ChunkedArrays correspond to Apache Arrow [Chunked Arrays](https://arrow.apache.org/docs/cpp/arrays.html#chunked-arrays).

### RegularChunkedArray

**TODO:** a `ChunkedArray` with equal-sized chunks.

## Records and tuples

All of the above types are linearly nested: they each have zero or one `content`. It's common for data to be packaged together, variously called "records", "structs", or [product types](https://en.wikipedia.org/wiki/Product_type).

   * If the joined data are unnamed, they are "tuples" (a fixed number of ordered fields; each may have a different type).
   * If they are named, they are "records" (a fixed number of unordered, named fields; each may have a different type).

### RecordArray

The [RecordArray](https://github.com/scikit-hep/awkward-1.0/blob/master/include/awkward/array/RecordArray.h) class represents an array of tuples or records, all with the same type. Its `contents` is an ordered _list_ of array Contents.

   * If a `recordlookup` is absent, the data are tuples, defined only by their order.
   * If a `recordlookup` is present, it is an ordered list of names with the same length as the `contents`, associating a field name to every content.

The length of the RecordArray is the length of its shortest content; all are aligned element-by-element. If a RecordArray has zero `contents`, it may still represent a non-empty array. In that case, its length is specified by a `length` parameter (only used for this case).

`RecordArrays` correspond to Apache Arrow's [struct type](https://arrow.apache.org/docs/memory_layout.html#struct-type).
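The following sketch shows how columnar `contents` turn into rowwise tuples or records, including the shortest-content rule and the zero-`contents` case. The function `records` is a hypothetical helper over plain lists, not the library's API.

```python
def records(contents, recordlookup=None, length=None):
    """Convert aligned columns into a list of tuples or dicts."""
    if len(contents) == 0:
        n = length  # only consulted when there are no contents
    else:
        n = min(len(x) for x in contents)  # the shortest content wins
    out = []
    for i in range(n):
        row = [x[i] for x in contents]
        if recordlookup is None:
            out.append(tuple(row))
        else:
            out.append(dict(zip(recordlookup, row)))
    return out

x0 = [8.4, 3.8, 6.3, 5.4]
x1 = [3.8, 5.2]

as_tuples = records([x0, x1])
print(as_tuples)   # [(8.4, 3.8), (3.8, 5.2)]

as_records = records([x0, x1], ["x0", "x1"])
print(as_records)  # [{'x0': 8.4, 'x1': 3.8}, {'x0': 3.8, 'x1': 5.2}]

no_fields = records([], None, 3)
print(no_fields)   # [(), (), ()]
```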

In [48]:
class RecordArray(Content):
    def __init__(self, contents, recordlookup, length):
        assert isinstance(contents, list)
        if len(contents) == 0:
            assert isinstance(length, int)
            assert length >= 0
        else:
            assert length is None
            for x in contents:
                assert isinstance(x, Content)
        assert recordlookup is None or isinstance(recordlookup, list)
        if isinstance(recordlookup, list):
            assert len(recordlookup) == len(contents)
            for x in recordlookup:
                assert isinstance(x, str)
        self.contents = contents
        self.recordlookup = recordlookup
        self.length = length

    @staticmethod
    def random(minlen=0, choices=None):
        length = random_length(minlen)
        contents = []
        for i in range(random.randint(0, 2)):
            contents.append(Content.random(length, choices))
        if len(contents) != 0:
            length = None
        if random.randint(0, 1) == 0:
            recordlookup = None
        else:
            recordlookup = ["x" + str(i) for i in range(len(contents))]
        return RecordArray(contents, recordlookup, length)

    def __len__(self):
        if len(self.contents) == 0:
            return self.length
        else:
            return min(len(x) for x in self.contents)

    def __getitem__(self, where):
        if isinstance(where, int):
            assert 0 <= where < len(self)
            record = [x[where] for x in self.contents]
            if self.recordlookup is None:
                return tuple(record)
            else:
                return dict(zip(self.recordlookup, record))
        elif isinstance(where, slice) and where.step is None:
            if len(self.contents) == 0:
                start = min(max(where.start, 0), self.length)
                stop = min(max(where.stop, 0), self.length)
                if stop < start:
                    stop = start
                return RecordArray([], self.recordlookup, stop - start)
            else:
                return RecordArray([x[where] for x in self.contents], self.recordlookup, self.length)
        else:
            raise AssertionError(where)

    def tostring_part(self, indent, pre, post):
        out = indent + pre + "<RecordArray>\n"
        if len(self.contents) == 0:
            out += indent + "    <length>" + str(self.length) + "</length>\n"
        if self.recordlookup is None:
            for i, content in enumerate(self.contents):
                out += content.tostring_part(indent + "    ", "<content i=\"" + str(i) + "\">", "</content>\n")
        else:
            for i, (key, content) in enumerate(zip(self.recordlookup, self.contents)):
                out += content.tostring_part(indent + "    ", "<content i=\"" + str(i) + "\" key=\"" + repr(key) + "\">", "</content>\n")
        out += indent + "</RecordArray>" + post
        return out

Here are some examples.

In [51]:
x = RecordArray.random(choices=[RawArray])
x

<RecordArray>
    <content i="0"><RawArray>
        <ptr>8.4 3.8 6.3 5.4 3.8 2.5 0.1 4.1 4.1 5.1 8.8 7.2 5.8 7.7 2.4 7.9 2.3 -0.9 6.1 -0.2 7.9 6.2 5.0 3.5 3.0 3.4 4.0 7.9 6.9 2.5 6.0 3.6 5.4 3.5</ptr>
    </RawArray></content>
    <content i="1"><RawArray>
        <ptr>3.8 5.2 5.9 6.4 3.0</ptr>
    </RawArray></content>
</RecordArray>

In [52]:
list(x)

[(8.4, 3.8), (3.8, 5.2), (6.3, 5.9), (5.4, 6.4), (3.8, 3.0)]

In [71]:
x = RecordArray.random(choices=[RawArray])
x

<RecordArray>
    <content i="0" key="'x0'"><RawArray>
        <ptr>6.0 7.1 4.1 7.6 1.6 7.8 5.0 3.0 10.1 17.3 0.0 5.1 0.2 5.0 7.4 4.9 7.3 11.4 5.2 2.5 9.6 -0.3 6.0</ptr>
    </RawArray></content>
    <content i="1" key="'x1'"><RawArray>
        <ptr>2.4 7.0 6.4 7.0 5.7 7.6 6.0 2.6 0.3 5.9 6.8 3.8 6.2 5.3 4.3 3.0 0.3 5.2 4.9 6.3 8.7 4.5 3.8 1.8 4.8 2.1 7.3 3.8 1.1 3.3 0.5 5.7 5.0 6.3 5.4 3.9 10.7 6.3 4.2 6.3 3.8 7.4</ptr>
    </RawArray></content>
</RecordArray>

In [72]:
list(x)

[{'x0': 6.0, 'x1': 2.4},
 {'x0': 7.1, 'x1': 7.0},
 {'x0': 4.1, 'x1': 6.4},
 {'x0': 7.6, 'x1': 7.0},
 {'x0': 1.6, 'x1': 5.7},
 {'x0': 7.8, 'x1': 7.6},
 {'x0': 5.0, 'x1': 6.0},
 {'x0': 3.0, 'x1': 2.6},
 {'x0': 10.1, 'x1': 0.3},
 {'x0': 17.3, 'x1': 5.9},
 {'x0': 0.0, 'x1': 6.8},
 {'x0': 5.1, 'x1': 3.8},
 {'x0': 0.2, 'x1': 6.2},
 {'x0': 5.0, 'x1': 5.3},
 {'x0': 7.4, 'x1': 4.3},
 {'x0': 4.9, 'x1': 3.0},
 {'x0': 7.3, 'x1': 0.3},
 {'x0': 11.4, 'x1': 5.2},
 {'x0': 5.2, 'x1': 4.9},
 {'x0': 2.5, 'x1': 6.3},
 {'x0': 9.6, 'x1': 8.7},
 {'x0': -0.3, 'x1': 4.5},
 {'x0': 6.0, 'x1': 3.8}]

In [55]:
x = RecordArray.random(choices=[RawArray])
x

<RecordArray>
    <length>7</length>
</RecordArray>

In [56]:
list(x)

[{}, {}, {}, {}, {}, {}, {}]

## Heterogeneous data

All of the above types represent arrays of identically typed data. To allow different data types in the same array, we build a "[tagged union](https://en.wikipedia.org/wiki/Tagged_union)" or "sum type" with a UnionArray.

RecordArrays and UnionArrays are the only types with multiple `contents`. Product types and sum types have a fundamental duality in type theory.

### UnionArray

The [UnionArray](https://github.com/scikit-hep/awkward-1.0/blob/master/include/awkward/array/UnionArray.h) class represents data drawn from an ordered list of `contents`, which can have different types, using

   * `tags`: array of integers indicating which content each array element draws from.
   * `index`: array of integers indicating which element from the content to draw from.

`UnionArrays` correspond to Apache Arrow's [dense union type](https://arrow.apache.org/docs/memory_layout.html#dense-union-type). Awkward Array has no direct equivalent for Apache Arrow's [sparse union type](https://arrow.apache.org/docs/memory_layout.html#sparse-union-type), in which every content has the same length as `tags`, but a sparse union can be read as a UnionArray by generating the `index` `0, 1, 2, ...` with one entry per tag.
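As a minimal sketch of how `tags` and `index` work together, the lookup can be written with plain Python lists. The helper `union_getitem` and the sample data are illustrative assumptions, not part of the classes defined in this document. The second half shows a sparse layout being read with a generated `index`.

```python
contents = [
    [1.1, 2.2, 3.3],        # content 0: floats
    ["one", "two"],         # content 1: strings
]
tags = [0, 1, 0, 0, 1]      # which content each element draws from
index = [0, 0, 1, 2, 1]     # which element of that content to draw

def union_getitem(tags, index, contents, i):
    "Dense union lookup: follow the tag to a content, then the index within it."
    return contents[tags[i]][index[i]]

dense_rowwise = [union_getitem(tags, index, contents, i) for i in range(len(tags))]

# Sparse layout: every content is as long as tags, with placeholder values
# wherever that content is not selected.
sparse_contents = [
    [1.1, 0.0, 2.2, 3.3, 0.0],
    ["", "one", "", "", "two"],
]
# The index that lets the same lookup read a sparse layout is 0, 1, 2, ...
sparse_index = list(range(len(tags)))
sparse_rowwise = [union_getitem(tags, sparse_index, sparse_contents, i)
                  for i in range(len(tags))]

print(dense_rowwise)
print(sparse_rowwise)
```

Both layouts produce the same rowwise data. The dense form stores no placeholders, at the cost of the extra `index` array.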

In [73]:
class UnionArray(Content):
    def __init__(self, tags, index, contents):
        assert isinstance(tags, list)
        assert isinstance(index, list)
        assert isinstance(contents, list)
        assert len(index) >= len(tags)   # usually equal
        for x in tags:
            assert isinstance(x, int)
            assert 0 <= x < len(contents)
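        for i, x in enumerate(index):
            assert isinstance(x, int)
            # index may be longer than tags; the extra entries are unused and have no tag to check against
            if i < len(tags):
                assert 0 <= x < len(contents[tags[i]])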
        self.tags = tags
        self.index = index
        self.contents = contents

    @staticmethod
    def random(minlen=0, choices=None):
        contents = []
        unshuffled_tags = []
        unshuffled_index = []
        for i in range(random.randint(1, 3)):
            if minlen == 0:
                contents.append(Content.random(0, choices))
            else:
                contents.append(Content.random(1, choices))
            if len(contents[-1]) != 0:
                thisindex = [random.randint(0, len(contents[-1]) - 1) for _ in range(random_length(minlen))]
                unshuffled_tags.extend([i] * len(thisindex))
                unshuffled_index.extend(thisindex)
        permutation = list(range(len(unshuffled_tags)))
        random.shuffle(permutation)
        tags = [unshuffled_tags[i] for i in permutation]
        index = [unshuffled_index[i] for i in permutation]
        return UnionArray(tags, index, contents)

    def __len__(self):
        return len(self.tags)

    def __getitem__(self, where):
        if isinstance(where, int):
            assert 0 <= where < len(self)
            return self.contents[self.tags[where]][self.index[where]]
        elif isinstance(where, slice) and where.step is None:
            return UnionArray(self.tags[where], self.index[where], self.contents)
        else:
            raise AssertionError(where)

    def tostring_part(self, indent, pre, post):
        out = indent + pre + "<UnionArray>\n"
        out += indent + "    <tags>" + " ".join(str(x) for x in self.tags) + "</tags>\n"
        out += indent + "    <index>" + " ".join(str(x) for x in self.index) + "</index>\n"
        for i, content in enumerate(self.contents):
            out += content.tostring_part(indent + "    ", "<content i=\"" + str(i) + "\">", "</content>\n")
        out += indent + "</UnionArray>" + post
        return out
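The slice branch of `__getitem__` cuts only `tags` and `index` and passes `contents` through untouched. The following plain-list sketch illustrates that behavior under the assumption that lists stand in for the node classes. The helper `union_slice` is hypothetical.

```python
contents = [[1.1, 2.2, 3.3], ["one", "two"]]
tags = [0, 1, 0, 0, 1]
index = [0, 0, 1, 2, 1]

def union_slice(tags, index, contents, where):
    "Slice a union: only tags and index are cut; contents is shared, not copied."
    return tags[where], index[where], contents

sub_tags, sub_index, sub_contents = union_slice(tags, index, contents, slice(1, 4))

# The sliced union still points at the very same contents object.
shares_contents = sub_contents is contents

# Elements of contents that no remaining index entry points to are unreachable.
sub_rowwise = [sub_contents[t][i] for t, i in zip(sub_tags, sub_index)]
print(shares_contents, sub_rowwise)
```

Slicing is therefore cheap regardless of how large or deeply nested the contents are.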

Here is an example.

In [93]:
x = UnionArray.random(choices=[RawArray, ListOffsetArray])
x

<UnionArray>
    <tags>0 1 2 0 2 2 1</tags>
    <index>0 16 9 0 10 0 13</index>
    <content i="0"><ListOffsetArray>
        <offsets>12 13 13</offsets>
        <content><ListOffsetArray>
            <offsets>10 21 22 50 54 55 59 89 92 101 111 119 120 131 138 158 165 171 173</offsets>
            <content><RawArray>
                <ptr>0.5 4.8 8.6 -1.3 4.0 2.5 5.0 3.3 5.0 1.5 9.3 2.5 5.4 2.1 7.1 5.3 10.8 -2.1 6.4 7.6 5.6 6.2 4.9 8.0 6.2 4.1 6.6 -1.3 4.0 3.8 0.3 5.7 9.9 5.6 9.9 9.4 1.4 3.9 6.2 6.3 3.4 6.2 10.1 3.7 8.3 -0.6 2.8 9.7 3.3 6.5 6.5 2.1 4.9 5.8 1.0 6.8 2.7 3.2 6.0 6.4 1.9 8.1 5.5 6.3 4.8 5.5 1.1 0.1 4.0 1.8 10.0 3.8 3.9 2.5 1.8 6.0 5.2 6.0 9.6 11.7 6.4 7.9 4.3 5.3 4.4 7.0 8.6 6.1 11.2 4.7 5.9 9.3 7.0 5.1 8.0 6.9 8.4 3.7 5.8 4.8 1.6 -1.5 -0.9 6.0 2.8 -0.2 8.1 2.9 7.6 5.7 8.3 8.1 5.5 7.1 6.5 0.8 4.3 1.9 0.2 7.7 5.6 -0.5 2.1 6.1 7.1 4.5 4.5 4.2 9.1 5.7 2.2 9.0 2.6 3.8 7.2 3.2 5.1 6.6 3.0 6.6 6.3 4.8 2.6 3.7 7.0 5.2 1.8 4.2 5.9 2.2 7.1 6.1 1.8 4.2 3.6 3.0 5.7 2.1 7.7 1.5 3.8 6.4 

In [94]:
list(x)

[[[5.6, -0.5, 2.1, 6.1, 7.1, 4.5, 4.5, 4.2, 9.1, 5.7, 2.2]],
 0.5,
 5.6,
 [[5.6, -0.5, 2.1, 6.1, 7.1, 4.5, 4.5, 4.2, 9.1, 5.7, 2.2]],
 2.3,
 6.2,
 4.7]

## All together

Here is an example that draws from all the possible node types. Any node type may be the content of any node that has a `content` or `contents` attribute.

In [105]:
x = Content.random()
x

<UnionArray>
    <tags>1 1 1 2 2 1 1 1 2 2 1 1</tags>
    <index>0 0 0 8 25 1 1 0 5 4 0 1</index>
    <content i="0"><IndexedArray>
        <index></index>
        <content><IndexedArray>
            <index></index>
            <content><RawArray>
                <ptr>3.4 5.5 9.6 9.2 5.3</ptr>
            </RawArray></content>
        </IndexedArray>
    </IndexedArray>
    <content i="1"><NumpyArray>
        <ptr>6.6 9.3 8.1 3.6</ptr>
        <shape>2 1</shape>
        <strides>1 1</strides>
        <offset>2</offset>
    </NumpyArray></content>
    <content i="2"><RecordArray>
        <content i="0" key="'x0'"><IndexedArray>
            <index>5 14 5 0 6 5 5 14 3 12 6 1 9 12 11 13 15 6 10 4 13 7 3 14 7 10 13 0 5 15 6 1 6 15 6 13 15 7 1 12 13 7 5 4 11 10 8 7 7 8 7 4 11 0 9 11</index>
            <content><RawArray>
                <ptr>2.7 8.1 1.3 5.6 7.1 4.8 2.1 5.1 5.0 3.8 -2.6 5.8 5.1 2.6 4.8 9.1</ptr>
            </RawArray></content>
        </IndexedArray>
        <content i="1"

In [106]:
list(x)

[[8.1],
 [8.1],
 [8.1],
 {'x0': 5.6, 'x1': {}},
 {'x0': -2.6, 'x1': {}},
 [3.6],
 [3.6],
 [8.1],
 {'x0': 4.8, 'x1': {}},
 {'x0': 2.1, 'x1': {}},
 [8.1],
 [3.6]]
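The rowwise output can be traced by hand from the printed layout. The sketch below copies only the buffers that are visible above (the `x1` field is cut off, so it is left out) and follows each tag and index with plain lists. It assumes that the NumpyArray `strides` count items rather than bytes, which is inferred from the printed output. The helper names are illustrative.

```python
tags = [1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1]
index = [0, 0, 0, 8, 25, 1, 1, 0, 5, 4, 0, 1]

# content 1: NumpyArray with shape (2, 1), strides (1, 1), offset 2
numpy_ptr = [6.6, 9.3, 8.1, 3.6]
shape, strides, offset = (2, 1), (1, 1), 2

def numpy_row(i):
    # Assumption: strides are measured in items, not bytes.
    return [numpy_ptr[offset + i * strides[0] + j * strides[1]]
            for j in range(shape[1])]

# content 2, field 'x0': IndexedArray over a RawArray
# (only the first 26 index entries are needed for this trace)
x0_index = [5, 14, 5, 0, 6, 5, 5, 14, 3, 12, 6, 1, 9, 12, 11, 13,
            15, 6, 10, 4, 13, 7, 3, 14, 7, 10]
x0_ptr = [2.7, 8.1, 1.3, 5.6, 7.1, 4.8, 2.1, 5.1,
          5.0, 3.8, -2.6, 5.8, 5.1, 2.6, 4.8, 9.1]

def trace(i):
    "Follow tags[i] to a content, then index[i] into that content."
    if tags[i] == 1:
        return numpy_row(index[i])
    elif tags[i] == 2:
        return {"x0": x0_ptr[x0_index[index[i]]]}
    raise AssertionError("content 0 is empty and is never selected")

traced = [trace(i) for i in range(len(tags))]
print(traced)
```

Apart from the omitted `x1` field, `traced` matches the `list(x)` output entry by entry.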