# Part 3: Arbitrary data structures

So far, all the arrays we've dealt with have been rectangular (in $n$ dimensions; "rectilinear").

<center>
<img src="../img/8-layer_cube.jpg" width="50%">
</center>

What if we had data like this?

```json
[
  [
    [[1.84, 0.324]],
    [[-1.609, -0.713, 0.005], [0.953, -0.993, 0.011, 0.718]],
    [[0.459, -1.517, 1.545], [0.33, 0.292]],
    [[-0.376, -1.46, -0.206], [0.65, 1.278]]
  ],
  [
    [[-0.106, 0.611]],
    [[0.118, -1.788, 0.794, 0.658], [-0.105]]
  ],
  [
    [[-0.384], [0.697, -0.856]],
    [[0.778, 0.023, -1.455, -2.289], [-0.67], [1.153, -1.669, 0.305]]
  ],
  [
    [[0.205, -0.355], [-0.265], [1.042]],
    [[-0.004], [-1.167, -0.054, 0.726, 0.213]],
    [[1.741, -0.199, 0.827]]
  ]
]
```

What if we had data like this?

```json
[
  {"fill": "#b1b1b1", "stroke": "none", "points": [{"x": 5.27453, "y": 1.03276},
    {"x": -3.51280, "y": 1.74849}]},
  {"fill": "#b1b1b1", "stroke": "none", "points": [{"x": 8.21630, "y": 4.07844},
    {"x": -0.79157, "y": 3.49478}, {"x": 16.38932, "y": 5.29399},
    {"x": 10.38641, "y": 0.10832}, {"x": -2.07070, "y": 14.07140},
    {"x": 9.57021, "y": -0.94823}, {"x": 1.97332, "y": 3.62380},
    {"x": 5.66760, "y": 11.38001}, {"x": 0.25497, "y": 3.39276},
    {"x": 3.86585, "y": 6.22051}, {"x": -0.67393, "y": 2.20572}]},
  {"fill": "#d0d0ff", "stroke": "none", "points": [{"x": 3.59528, "y": 7.37191},
    {"x": 0.59192, "y": 2.91503}, {"x": 4.02932, "y": -1.13601},
    {"x": -1.01593, "y": 1.95894}, {"x": 1.03666, "y": 0.05251}]},
  {"fill": "#d0d0ff", "stroke": "none", "points": [{"x": -8.78510, "y": -0.00497},
    {"x": -15.22688, "y": 3.90244}, {"x": 5.74593, "y": 4.12718}]},
  {"fill": "none", "stroke": "#000000", "points": [{"x": 4.40625, "y": -6.953125},
    {"x": 4.34375, "y": -7.09375}, {"x": 4.3125, "y": -7.140625},
    {"x": 4.140625, "y": -7.140625}]},
  {"fill": "none", "stroke": "#808080", "points": [{"x": 0.46875, "y": -0.09375},
    {"x": 0.46875, "y": -0.078125}, {"x": 0.46875, "y": 0.53125}]}
]
```

What if we had data like this?

```json
[
    {"movie": "Evil Dead", "year": 1981, "actors":
        ["Bruce Campbell", "Ellen Sandweiss", "Richard DeManincor", "Betsy Baker"]
    },
    {"movie": "Darkman", "year": 1990, "actors":
        ["Liam Neeson", "Bruce Campbell", "Frances McDormand", "Larry Drake"]
    },
    {"movie": "Army of Darkness", "year": 1992, "actors":
        ["Bruce Campbell", "Embeth Davidtz", "Marcus Gilbert", "Bridget Fonda",
         "Ted Raimi", "Patricia Tallman"]
    },
    {"movie": "Spider-Man", "year": 2002, "actors":
        ["Tobey Maguire", "Kirsten Dunst", "Willem Dafoe", "James Franco",
         "Cliff Robertson", "Rosemary Harris"]
    },
    {"movie": "Spider-Man 2", "year": 2004, "actors":
        ["Tobey Maguire", "Kirsten Dunst", "Stan Lee", "Alfred Molina",
         "Bruce Campbell"]
    }
]
```

It might be possible to turn these datasets into tabular form using surrogate keys and database normalization, but

 * they may be inconvenient or less efficient in that form, depending on what we want to do
 * they are very likely _given_ in a ragged/untidy form; data cleaning is an important step in analysis.
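
As a rough illustration of the first point, the movie records could be normalized by hand into two flat tables joined by a surrogate key. This is only a sketch in plain Python on an abbreviated copy of the data; the names `movies_table`, `cast_table`, and `movie_id` are hypothetical.

```python
# Nested records, as in the movie example (abbreviated to two entries).
movies = [
    {"movie": "Evil Dead", "year": 1981,
     "actors": ["Bruce Campbell", "Ellen Sandweiss"]},
    {"movie": "Spider-Man 2", "year": 2004,
     "actors": ["Tobey Maguire", "Bruce Campbell"]},
]

movies_table = []  # one row per movie: (movie_id, movie, year)
cast_table = []    # one row per (movie, actor) pair: (movie_id, actor)

for movie_id, record in enumerate(movies):  # movie_id acts as the surrogate key
    movies_table.append((movie_id, record["movie"], record["year"]))
    for actor in record["actors"]:
        cast_table.append((movie_id, actor))

# "Which movies include Bruce Campbell?" now takes a join across both tables.
campbell_ids = {mid for mid, actor in cast_table if actor == "Bruce Campbell"}
campbell_movies = [title for mid, title, _ in movies_table if mid in campbell_ids]
print(campbell_movies)
```

The tables are rectangular, but a question that was a simple nested lookup now needs an explicit join.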

<br>

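Dealing with these datasets as JSON or Python objects is inefficient for the same reason that Python lists of numbers are: every value is a separate, dynamically typed object that the interpreter handles one at a time.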
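
One rough way to see the overhead, assuming CPython (object sizes are implementation details and vary by version and platform): each Python `float` is a full object, whereas a packed array stores only the raw 8-byte value. The sketch below uses only the standard library.

```python
import sys
from array import array

values = [1.84, 0.324, -1.609, -0.713, 0.005]

# Each list element is a reference to a separately allocated float object.
bytes_per_float_object = sys.getsizeof(values[0])

# A packed array of C doubles stores the raw values contiguously.
packed = array("d", values)
bytes_per_packed_value = packed.itemsize

print(bytes_per_float_object, "bytes per Python float object (CPython-specific)")
print(bytes_per_packed_value, "bytes per packed double")
```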

<br>

We want arbitrary data structures with an array-oriented interface and array-oriented performance...

<center>
<img src="../img/awkward-motivation-venn-diagram.svg" width="40%">
</center>

<table>
<tr style="background: white;"><td width="35%"><img src="../img/logo-arrow.svg" width="100%"></td><td style="padding-left: 50px;">In-memory format and an ecosystem of tools, an "exploded database" (database functionality provided as interchangeable pieces). Strong focus on delivering data, zero-copy, between processes.</td></tr>
<tr style="background: white; height: 30px;"><td></td><td></td></tr>
<tr style="background: white;"><td width="35%"><img src="../img/logo-awkward.svg" width="100%"></td><td style="padding-left: 50px;">Library for array-oriented programming like NumPy, but for arbitrary data structures. Losslessly zero-copy convertible to/from Arrow and Parquet.</td></tr>
<tr style="background: white; height: 30px;"><td></td><td></td></tr>
<tr style="background: white;"><td width="35%"><img src="../img/logo-parquet.svg" width="100%"></td><td style="padding-left: 50px;">Disk format for storing large datasets and (selectively) retrieving them.</td></tr>
</table>

<img src="../img/logo-arrow.svg" width="30%">

<br>

In [1]:
import pyarrow as pa

<br>

In [2]:
arrow_array = pa.array([
    [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],
    [],
    [{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]
])

<br>

In [3]:
arrow_array.type

ListType(list<item: struct<x: double, y: list<item: int64>>>)

<br>

In [4]:
arrow_array

<pyarrow.lib.ListArray object at 0x7f8bd30df7c0>
[
  -- is_valid: all not null
  -- child 0 type: double
    [
      1.1,
      2.2,
      3.3
    ]
  -- child 1 type: list<item: int64>
    [
      [
        1
      ],
      [
        1,
        2
      ],
      [
        1,
        2,
        3
      ]
    ],
  -- is_valid: all not null
  -- child 0 type: double
    []
  -- child 1 type: list<item: int64>
    [],
  -- is_valid: all not null
  -- child 0 type: double
    [
      4.4,
      5.5
    ]
  -- child 1 type: list<item: int64>
    [
      [
        1,
        2,
        3,
        4
      ],
      [
        1,
        2,
        3,
        4,
        5
      ]
    ]
]

<img src="../img/logo-awkward.svg" width="30%">

<br>

In [5]:
import awkward as ak

<br>

In [6]:
awkward_array = ak.from_arrow(arrow_array)
awkward_array

<img src="../img/logo-parquet.svg" width="30%">

<br>

In [7]:
ak.to_parquet(awkward_array, "/tmp/file.parquet")

<pyarrow._parquet.FileMetaData object at 0x7f8bd044db80>
  created_by: parquet-cpp-arrow version 9.0.0
  num_columns: 2
  num_rows: 3
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 0

<br>

In [8]:
ak.from_parquet("/tmp/file.parquet")

## Awkward Array

Dealing with ragged data as though it were a NumPy array:

In [10]:
ragged = ak.Array([
    [
      [[1.84, 0.324]],
      [[-1.609, -0.713, 0.005], [0.953, -0.993, 0.011, 0.718]],
      [[0.459, -1.517, 1.545], [0.33, 0.292]],
      [[-0.376, -1.46, -0.206], [0.65, 1.278]]
    ],
    [
      [[-0.106, 0.611]],
      [[0.118, -1.788, 0.794, 0.658], [-0.105]]
    ],
    [
      [[-0.384], [0.697, -0.856]],
      [[0.778, 0.023, -1.455, -2.289], [-0.67], [1.153, -1.669, 0.305]]
    ],
    [
      [[0.205, -0.355], [-0.265], [1.042]],
      [[-0.004], [-1.167, -0.054, 0.726, 0.213]],
      [[1.741, -0.199, 0.827]]
    ]
])

**Multidimensional indexing**

In [20]:
ragged[3, 1, -1, 2]

0.726

<br>

**Basic slicing**

In [27]:
ragged[3, 1:, -1, 1:3]

<br>

**Advanced slicing**

In [39]:
ak.num(ragged[[True, False, False, True], [1, -2]], axis=2)

ValueError: shape mismatch: objects cannot be broadcast to a single shape.  Mismatch is between arg 0 with shape (4,) and arg 1 with shape (2,).

This error occurred while attempting to slice

    <Array [[[[1.84, 0.324]], ..., [...]], ...] type='4 * var * var * var *...'>

with

    ([True, False, False, True], [1, -2])