# Array node layouts

This document gives a minimal description of each composable array layout node. Together, these nodes represent structured data in columnar form.

All of the constraints on allowed values in an Awkward Array are expressed here as assertions in `__init__`. The `random` constructors create a random valid array. Each node is defined by its length (`__len__`) and by its `__getitem__` behavior for integers, slices (without step), and string fields. Iteration with `__iter__` converts the columnar data into rowwise data.

## General structure

Arrays are composed of `Content` subclasses and integer arrays. Integer arrays (called `Indexes`) determine the structure, but not the content of the data structure.
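To make the split between structure and content concrete, here is a minimal sketch using plain Python lists rather than the classes defined in this document. The helper `to_rowwise` and the sample values are hypothetical, chosen only for illustration: the same flat `content` yields different rowwise data depending only on the integers that index it.

```python
def to_rowwise(offsets, content):
    """Split flat `content` into lists delimited by consecutive `offsets`."""
    return [content[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]

# The values are stored once, contiguously.
content = [1.1, 2.2, 3.3, 4.4, 5.5]

# Only the integer arrays differ between these two views of the same values.
rows_a = to_rowwise([0, 3, 3, 5], content)
rows_b = to_rowwise([0, 1, 5], content)

print(rows_a)   # [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
print(rows_b)   # [[1.1], [2.2, 3.3, 4.4, 5.5]]
```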

We'll define a `Content` base class with some facilities that will be used by all subclasses.

In [1]:
class Content:
    def __iter__(self):
        "Iterate over the data structure, converting columnar data into rowwise."
        
        def convert(x):
            if isinstance(x, Content):
                return list(x)
            elif isinstance(x, tuple):
                return tuple(convert(y) for y in x)
            elif isinstance(x, dict):
                return {n: convert(y) for n, y in x.items()}
            else:
                return x

        for i in range(len(self)):
            yield convert(self[i])

    def __repr__(self):
        "Print an XML representation of the data."
        
        return self.tostring_part("", "", "").rstrip()

    @staticmethod
    def random(minlen=0, choices=None):
        "Generate a random array from a set of possible classes."
        
        if choices is None:
            choices = [x for x in globals().values() if isinstance(x, type) and issubclass(x, Content)]
        else:
            choices = list(choices)
        if minlen != 0 and EmptyArray in choices:
            choices.remove(EmptyArray)
        assert len(choices) > 0
        cls = random.choice(choices)
        return cls.random(minlen, choices)

These are some utilities for generating random data.

In [2]:
import math
import random

def random_number():
    return round(random.gauss(5, 3), 1)

def random_length(minlen=0, maxlen=None):
    if maxlen is None:
        return minlen + int(math.floor(random.expovariate(0.1)))
    else:
        return random.randint(minlen, maxlen)

## Leaf nodes

Only three types of nodes can terminate an array data structure: `RawArray`, a one-dimensional array of fixed-width data; `NumpyArray`, a rectilinear tensor equivalent to NumPy data; and `EmptyArray`, data of unknown type and zero length.

### RawArray

The `RawArray` class is intended for use in C++, where the [RawArray](https://github.com/scikit-hep/awkward-1.0/blob/master/include/awkward/array/RawArray.h) class is header-only and templated. It is defined by

   * `ptr`: the data themselves, a raw buffer. It can contain anything; the `RawArray` just wraps the data as a `Content`.

If the type is a numerical type, a `RawArray` corresponds to an Apache Arrow [Primitive array](primitive-value-arrays).

In [3]:
class RawArray(Content):
    def __init__(self, ptr):
        assert isinstance(ptr, list)
        self.ptr = ptr

    @staticmethod
    def random(minlen=0, choices=None):
        return RawArray([random_number() for i in range(random_length(minlen))])

    def __len__(self):
        return len(self.ptr)

    def __getitem__(self, where):
        if isinstance(where, int):
            assert 0 <= where < len(self)
            return self.ptr[where]
        elif isinstance(where, slice) and where.step is None:
            return RawArray(self.ptr[where])
        else:
            raise AssertionError(where)

    def tostring_part(self, indent, pre, post):
        out = indent + pre + "<RawArray>\n"
        out += indent + "    <ptr>" + " ".join(repr(x) for x in self.ptr) + "</ptr>\n"
        out += indent + "</RawArray>" + post
        return out

Here is an example.

In [7]:
x = RawArray.random()
x

<RawArray>
    <ptr>4.2 6.8 3.1 3.4 7.6 9.4</ptr>
</RawArray>

In [8]:
list(x)

[4.2, 6.8, 3.1, 3.4, 7.6, 9.4]

### NumpyArray

The `NumpyArray` class is more general, describing multidimensional data with

   * `ptr`: the data themselves, a raw buffer.
   * `shape`: non-negative integers, at least one. Each represents the number of components in a dimension; a `shape` of length _N_ represents an _N_ dimensional tensor. The number of items in the array is the product of all values in `shape` (may be zero).
   * `strides`: the same number of integers as `shape`. Each stride is the number of items in `ptr` to skip per element in the corresponding dimension. (Strides can be negative or zero.)
   * `offset`: the number of items in `ptr` to skip before the first element of the array.

In NumPy and Awkward, there is also

   * `itemsize`: the number of bytes per item (i.e. 1 for characters, 4 for `int32`, 8 for `double` types).

The `strides` and `offset` are measured in bytes, but in this simplified representation, we ignore this, assuming all items have `itemsize` of 1.
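As a rough sketch of how `shape`, `strides`, and `offset` locate an element, the snippet below uses plain Python lists and item-based strides, as in this document's simplified representation. The helper `item_at` and the sample values are hypothetical. It also shows the scaling by `itemsize` that byte-based strides would require, assuming 8-byte items.

```python
def item_at(ptr, strides, offset, index):
    """Return the item at a multidimensional `index` of a strided view of `ptr`."""
    # Position in the flat buffer: offset plus stride * index for each dimension.
    position = offset + sum(s * i for s, i in zip(strides, index))
    return ptr[position]

ptr = [9.9, 0.0, 1.1, 2.2, 3.3, 4.4, 5.5]   # the first item is skipped by offset
shape = [3, 2]
strides = [1, 3]   # a transposed view: moving along dimension 1 skips 3 items
offset = 1

rows = [[item_at(ptr, strides, offset, (i, j)) for j in range(shape[1])]
        for i in range(shape[0])]
print(rows)   # [[0.0, 3.3], [1.1, 4.4], [2.2, 5.5]]

# With a real itemsize, byte-based strides and offset are the same numbers scaled.
itemsize = 8                                      # assumed: 8-byte doubles
byte_strides = [s * itemsize for s in strides]    # [8, 24]
byte_offset = offset * itemsize                   # 8
```

Because only the integers change, a transposed or sliced view like this one never copies `ptr`.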

If the `shape` is one-dimensional, a `NumpyArray` corresponds to an Apache Arrow [Primitive array](primitive-value-arrays).

In [9]:
class NumpyArray(Content):
    def __init__(self, ptr, shape, strides, offset):
        assert isinstance(ptr, list)
        assert isinstance(shape, list)
        assert isinstance(strides, list)
        for x in ptr:
            assert isinstance(x, (bool, int, float))
        assert len(shape) > 0
        assert len(strides) == len(shape)
        for x in shape:
            assert isinstance(x, int)
            assert x >= 0
        for x in strides:
            assert isinstance(x, int)
        assert isinstance(offset, int)
        if all(x != 0 for x in shape):
            assert 0 <= offset < len(ptr)
            assert shape[0] * strides[0] + offset <= len(ptr)
        self.ptr = ptr
        self.shape = shape
        self.strides = strides
        self.offset = offset

    @staticmethod
    def random(minlen=0, choices=None):
        shape = [random_length(minlen)]
        for i in range(random_length(0, 2)):
            shape.append(random_length(1, 3))
        strides = [1]
        for x in shape[:0:-1]:
            skip = random_length(0, 2)
            strides.insert(0, x * strides[0] + skip)
        offset = random_length()
        ptr = [random_number() for i in range(shape[0] * strides[0] + offset)]
        return NumpyArray(ptr, shape, strides, offset)

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, where):
        if isinstance(where, int):
            assert 0 <= where < len(self)
            offset = self.offset + self.strides[0] * where
            if len(self.shape) == 1:
                return self.ptr[offset]
            else:
                return NumpyArray(self.ptr, self.shape[1:], self.strides[1:], offset)
        elif isinstance(where, slice) and where.step is None:
            offset = self.offset + self.strides[0] * where.start
            shape = [where.stop - where.start] + self.shape[1:]
            return NumpyArray(self.ptr, shape, self.strides, offset)
        else:
            raise AssertionError(where)

    def tostring_part(self, indent, pre, post):
        out = indent + pre + "<NumpyArray>\n"
        out += indent + "    <ptr>" + " ".join(str(x) for x in self.ptr) + "</ptr>\n"
        out += indent + "    <shape>" + " ".join(str(x) for x in self.shape) + "</shape>\n"
        out += indent + "    <strides>" + " ".join(str(x) for x in self.strides) + "</strides>\n"
        out += indent + "    <offset>" + str(self.offset) + "</offset>\n"
        out += indent + "</NumpyArray>" + post
        return out

Here is an example.

In [13]:
x = NumpyArray.random()
x

<NumpyArray>
    <ptr>5.4 1.0 3.5 7.0 2.2 6.6</ptr>
    <shape>2 2</shape>
    <strides>2 1</strides>
    <offset>2</offset>
</NumpyArray>

In [14]:
list(x)

[[3.5, 7.0], [2.2, 6.6]]

### EmptyArray

The `EmptyArray` class is used whenever an array's type is not known because it is empty (when determining types from observed elements).

`EmptyArray` has no equivalent in Apache Arrow.

In [15]:
class EmptyArray(Content):
    def __init__(self):
        pass

    @staticmethod
    def random(minlen=0, choices=None):
        assert minlen == 0
        return EmptyArray()

    def __len__(self):
        return 0

    def __getitem__(self, where):
        if isinstance(where, int):
            assert False
        elif isinstance(where, slice) and where.step is None:
            return EmptyArray()
        else:
            raise AssertionError(where)

    def tostring_part(self, indent, pre, post):
        return indent + pre + "<EmptyArray/>" + post

Here is an example.

In [16]:
x = EmptyArray.random()
x

<EmptyArray/>

In [17]:
list(x)

[]

## Arrays of lists

Lists may have uniform lengths or unequal lengths. `RegularArray` describes the first case and `ListOffsetArray` and `ListArray` are two ways of describing the second case.

### RegularArray

The `RegularArray` class describes lists that all have the same length, the single integer `size`. Its underlying `content` is a flattened view of the data—that is, each list is not stored separately in memory, but is inferred as a subinterval of the underlying data.

If the length of the `content` is not an integer multiple of `size`, then the length of the `RegularArray` is truncated to the largest number of complete lists; the leftover items at the end of `content` are unreachable.

A multidimensional `NumpyArray` is equivalent to a one-dimensional `NumpyArray` nested within several `RegularArrays`, one for each dimension. However, `RegularArrays` can be used to make lists of _any_ other type.
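The truncation rule and the nesting equivalence can be sketched with plain Python lists. The helper `regular` is hypothetical, standing in for what a `RegularArray` over a flat `content` would produce when converted to rowwise data.

```python
def regular(content, size):
    """View `content` as equal-length lists of `size` items each."""
    # Floor division: a trailing partial list is dropped.
    length = len(content) // size
    return [content[i * size:(i + 1) * size] for i in range(length)]

flat = list(range(13))   # 13 items: not a multiple of 4 or 2

by_four = regular(flat, 4)
print(by_four)   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

# Nesting two levels gives the rowwise form of a tensor with shape (2, 3, 2).
inner = regular(flat, 2)    # 6 lists of 2 items; item 12 is unreachable
outer = regular(inner, 3)   # 2 lists of 3 lists
print(outer)   # [[[0, 1], [2, 3], [4, 5]], [[6, 7], [8, 9], [10, 11]]]
```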

`RegularArray` corresponds to an Apache Arrow [Tensor](https://arrow.apache.org/docs/python/generated/pyarrow.Tensor.html).

In [18]:
class RegularArray(Content):
    def __init__(self, content, size):
        assert isinstance(content, Content)
        assert isinstance(size, int)
        assert size > 0
        self.content = content
        self.size = size

    @staticmethod
    def random(minlen=0, choices=None):
        size = random_length(1, 5)
        return RegularArray(Content.random(random_length(minlen) * size, choices), size)

    def __len__(self):
        return len(self.content) // self.size   # floor division

    def __getitem__(self, where):
        if isinstance(where, int):
            assert 0 <= where < len(self)   # bounds check, as in the other nodes
            return self.content[(where) * self.size:(where + 1) * self.size]
        elif isinstance(where, slice) and where.step is None:
            start = where.start * self.size
            stop = where.stop * self.size
            return RegularArray(self.content[start:stop], self.size)
        else:
            raise AssertionError(where)

    def tostring_part(self, indent, pre, post):
        out = indent + pre + "<RegularArray>\n"
        out += self.content.tostring_part(indent + "    ", "<content>", "</content>\n")
        out += indent + "    <size>" + str(self.size) + "</size>\n"
        out += indent + "</RegularArray>" + post
        return out

Here is an example.

In [25]:
x = RegularArray.random(choices=[RawArray])
x

<RegularArray>
    <content><RawArray>
        <ptr>2.1 5.0 3.9 4.4 7.9 8.8 7.8 3.4 3.8 5.1 7.5 5.7</ptr>
    </RawArray></content>
    <size>4</size>
</RegularArray>

In [26]:
list(x)

[[2.1, 5.0, 3.9, 4.4], [7.9, 8.8, 7.8, 3.4], [3.8, 5.1, 7.5, 5.7]]

### ListOffsetArray

The `ListOffsetArray` class describes unequal-length lists (often called a "jagged" or "ragged" array). Like `RegularArray`, the underlying data for all lists are in a contiguous `content`. It is subdivided into lists according to an `offsets` array, which specifies the starting and stopping index of each list.

The `offsets` must have a length of at least 1 (a single offset describes an empty array), but it need not start with zero or completely cover the `content`. Just as a `RegularArray` can have unreachable `content` if the `content` length is not an integer multiple of `size`, a `ListOffsetArray` can have unreachable `content` before the first list and after the last list.
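A small sketch with plain Python lists shows which items are reachable when `offsets` does not start at zero or cover the whole `content`. The sample values are hypothetical and chosen only for illustration.

```python
content = [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7]
offsets = [2, 4, 4, 7]   # three lists; starts at 2 and stops at 7

# Each consecutive pair of offsets is the (start, stop) of one list.
lists = [content[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
print(lists)   # [[2.2, 3.3], [], [4.4, 5.5, 6.6]]

# Items before offsets[0] and from offsets[-1] onward belong to no list.
unreachable = content[:offsets[0]] + content[offsets[-1]:]
print(unreachable)   # [0.0, 1.1, 7.7]

# A single offset has no consecutive pairs, so it describes zero lists.
single = [5]
print(len(single) - 1)   # 0
```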

`ListOffsetArray` corresponds to Apache Arrow [List type](https://arrow.apache.org/docs/memory_layout.html#list-type).

In [27]:
class ListOffsetArray(Content):
    def __init__(self, offsets, content):
        assert isinstance(offsets, list)
        assert isinstance(content, Content)
        assert len(offsets) != 0
        for i in range(len(offsets) - 1):
            start = offsets[i]
            stop = offsets[i + 1]
            assert isinstance(start, int)
            assert isinstance(stop, int)
            if start != stop:
                assert start < stop   # i.e. start <= stop
                assert start >= 0
                assert stop <= len(content)
        self.offsets = offsets
        self.content = content

    @staticmethod
    def random(minlen=0, choices=None):
        counts = [random_length() for i in range(random_length(minlen))]
        offsets = [random_length()]
        for x in counts:
            offsets.append(offsets[-1] + x)
        return ListOffsetArray(offsets, Content.random(offsets[-1], choices))
        
    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, where):
        if isinstance(where, int):
            assert 0 <= where < len(self)
            return self.content[self.offsets[where]:self.offsets[where + 1]]
        elif isinstance(where, slice) and where.step is None:
            offsets = self.offsets[where.start : where.stop + 1]
            if len(offsets) == 0:
                offsets = [0]
            return ListOffsetArray(offsets, self.content)
        else:
            raise AssertionError(where)

    def tostring_part(self, indent, pre, post):
        out = indent + pre + "<ListOffsetArray>\n"
        out += indent + "    <offsets>" + " ".join(str(x) for x in self.offsets) + "</offsets>\n"
        out += self.content.tostring_part(indent + "    ", "<content>", "</content>\n")
        out += indent + "</ListOffsetArray>" + post
        return out

Here is an example.

In [32]:
x = ListOffsetArray.random(choices=[RawArray])
x

<ListOffsetArray>
    <offsets>0 0 9 11</offsets>
    <content><RawArray>
        <ptr>7.7 5.1 -2.3 3.7 5.5 9.0 7.1 6.9 7.3 5.8 7.6 2.3 -0.4 8.2 8.1 5.3 3.4 2.0 -1.7 1.7 6.6 6.7 6.6 3.5 3.0 8.8 6.8 8.7 6.1 3.7 8.5 3.7 3.8 8.1</ptr>
    </RawArray></content>
</ListOffsetArray>

In [33]:
list(x)

[[], [7.7, 5.1, -2.3, 3.7, 5.5, 9.0, 7.1, 6.9, 7.3], [5.8, 7.6]]

### ListArray

The `ListArray` class generalizes `ListOffsetArray` by dropping two requirements: the lists need not appear in order within the `content`, and there may be unreachable elements between lists. Instead of a single `offsets` array, `ListArray` has

   * `starts`: the starting index of each list.
   * `stops`: the stopping index of each list.

An `offsets` array can be converted into equivalent `starts` and `stops`:

```python
starts = offsets[:-1]
stops = offsets[1:]
```

`ListArrays` are a useful byproduct of structure manipulation: as a result of some operation, we might want to view slices or permutations of the `content` without copying it into a contiguous form. For that reason, `ListArrays` are more useful in a data-manipulation library like Awkward Array than in a data-representation library like Apache Arrow. There is no equivalent of `ListArray` in Apache Arrow.
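As an illustration of viewing a permutation without copying, the sketch below reorders lists by rearranging only `starts` and `stops`. It uses plain Python lists; the helper `as_lists`, the `order` permutation, and the sample values are hypothetical.

```python
def as_lists(starts, stops, content):
    """Convert starts/stops over a flat `content` into rowwise lists."""
    return [content[a:b] for a, b in zip(starts, stops)]

content = [0.0, 1.1, 2.2, 3.3, 4.4, 5.5]
offsets = [0, 3, 3, 5]

# An offsets array splits into starts and stops.
starts = offsets[:-1]   # [0, 3, 3]
stops = offsets[1:]     # [3, 3, 5]
original = as_lists(starts, stops, content)
print(original)   # [[0.0, 1.1, 2.2], [], [3.3, 4.4]]

# Reordering the lists touches only the integers; `content` is shared, not copied.
order = [2, 0, 1]
new_starts = [starts[i] for i in order]   # [3, 0, 3]
new_stops = [stops[i] for i in order]     # [5, 3, 3]
permuted = as_lists(new_starts, new_stops, content)
print(permuted)   # [[3.3, 4.4], [0.0, 1.1, 2.2], []]
```

The permuted `starts` are no longer monotonic, so this view could not be expressed with a single `offsets` array.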

In [34]:
class ListArray(Content):
    def __init__(self, starts, stops, content):
        assert isinstance(starts, list)
        assert isinstance(stops, list)
        assert isinstance(content, Content)
        assert len(stops) >= len(starts)   # usually equal
        for i in range(len(starts)):
            start = starts[i]
            stop = stops[i]
            assert isinstance(start, int)
            assert isinstance(stop, int)
            if start != stop:
                assert start < stop   # i.e. start <= stop
                assert start >= 0
                assert stop <= len(content)
        self.starts = starts
        self.stops = stops
        self.content = content

    @staticmethod
    def random(minlen=0, choices=None):
        content = Content.random(0, choices)
        length = random_length(minlen)
        if len(content) == 0:
            starts = [random.randint(0, 10) for i in range(length)]
            stops = list(starts)
        else:
            starts = [random.randint(0, len(content) - 1) for i in range(length)]
            stops = [x + min(random_length(), len(content) - x) for x in starts]
        return ListArray(starts, stops, content)
        
    def __len__(self):
        return len(self.starts)

    def __getitem__(self, where):
        if isinstance(where, int):
            assert 0 <= where < len(self)
            return self.content[self.starts[where]:self.stops[where]]
        elif isinstance(where, slice) and where.step is None:
            starts = self.starts[where.start:where.stop]
            stops = self.stops[where.start:where.stop]
            return ListArray(starts, stops, self.content)
        else:
            raise AssertionError(where)

    def tostring_part(self, indent, pre, post):
        out = indent + pre + "<ListArray>\n"
        out += indent + "    <starts>" + " ".join(str(x) for x in self.starts) + "</starts>\n"
        out += indent + "    <stops>" + " ".join(str(x) for x in self.stops) + "</stops>\n"
        out += self.content.tostring_part(indent + "    ", "<content>", "</content>\n")
        out += indent + "</ListArray>" + post
        return out

Here is an example.

In [37]:
x = ListArray.random(choices=[RawArray])
x

<ListArray>
    <starts>1 2 0 1 2 3 2 2 1 1 2 1 0 2 3 3 3</starts>
    <stops>4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4</stops>
    <content><RawArray>
        <ptr>9.8 2.2 3.6 5.7</ptr>
    </RawArray></content>
</ListArray>

In [38]:
list(x)

[[2.2, 3.6, 5.7],
 [3.6, 5.7],
 [9.8, 2.2, 3.6, 5.7],
 [2.2, 3.6, 5.7],
 [3.6, 5.7],
 [5.7],
 [3.6, 5.7],
 [3.6, 5.7],
 [2.2, 3.6, 5.7],
 [2.2, 3.6, 5.7],
 [3.6, 5.7],
 [2.2, 3.6, 5.7],
 [9.8, 2.2, 3.6, 5.7],
 [3.6, 5.7],
 [5.7],
 [5.7],
 [5.7]]

## Indirection

