# Part 4: Arbitrary data structures

So far, all the arrays we've dealt with have been rectangular (in $n$ dimensions; "rectilinear").

![](../img/8-layer_cube.jpg)

<br><br><br>

But what if we had data like this?

In [None]:
numbers = [
  [
    [[1.84, 0.324]],
    [[-1.609, -0.713, 0.005], [0.953, -0.993, 0.011, 0.718]],
    [[0.459, -1.517, 1.545], [0.33, 0.292]],
    [[-0.376, -1.46, -0.206], [0.65, 1.278]],
    [[], [], [1.617]],
    []
  ],
  [
    [[-0.106, 0.611]],
    [[0.118, -1.788, 0.794, 0.658], [-0.105]]
  ],
  [
    [[-0.384], [0.697, -0.856]],
    [[0.778, 0.023, -1.455, -2.289], [-0.67], [1.153, -1.669, 0.305, 1.517, -0.292]]
  ],
  [
    [[0.205, -0.355], [-0.265], [1.042]],
    [[-0.004], [-1.167, -0.054, 0.726, 0.213]],
    [[1.741, -0.199, 0.827]]
  ]
]

Or this?

In [None]:
nlp = [
    [
        ("John", "NNP"), ("arrived", "VBD"), ("yesterday", "RB"), (".", ".")
    ],
    [
        ("He", "PRP"), ("visited", "VBD"), ("the", "DT"), ("Eiffel", "NNP"),
        ("Tower", "NNP"), (".", ".")
    ],
    [
        ("IBM", "NNP"), ("hired", "VBD"), ("Alice", "NNP")
    ],
    [
        ("Alice", "NNP"), ("is", "VBZ"), ("from", "IN"), ("London", "NNP"),
        (".", ".")
    ],
    [
        ("Did", "VBD"), ("they", "PRP"), ("meet", "VB"), ("in", "IN"),
        ("New", "NNP"), ("York", "NNP"), ("City", "NNP"), ("?", ".")
    ],
]

Or this?

In [None]:
geojson = {
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"name": "Convention Center"},
      "geometry": {
        "type": "Polygon", "coordinates": [[
          [-122.43972, 47.24888], [-122.43958, 47.24829], [-122.43964, 47.24828],
          [-122.43958, 47.24802], [-122.43841, 47.24815], [-122.43860, 47.24899],
          [-122.43972, 47.24888]]]
      }
    },
    {
      "type": "Feature",
      "properties": {"name": "Marriott"},
      "geometry": {
        "type": "Polygon", "coordinates": [[
          [-122.43944, 47.24760], [-122.43908, 47.24764], [-122.43903, 47.24738],
          [-122.43855, 47.24743], [-122.43856, 47.24745], [-122.43822, 47.24749],
          [-122.43836, 47.24812], [-122.43845, 47.24811], [-122.43846, 47.24814],
          [-122.43954, 47.24803], [-122.43944, 47.24760]]]
      }
    },
    {
      "type": "Feature",
      "properties": {"name": "Carlton Center"},
      "geometry": {
        "type": "Polygon", "coordinates": [[
          [-122.43860, 47.24716], [-122.43833, 47.24720], [-122.43821, 47.24747],
          [-122.43865, 47.24742], [-122.43860, 47.24716]]]
      }
    },
    {
      "type": "Feature",
      "properties": {"name": "Loading Docks"},
      "geometry": {
        "type": "Polygon", "coordinates": [[
          [-122.43982, 47.24804], [-122.43974, 47.24772], [-122.43947, 47.24774],
          [-122.43954, 47.24803], [-122.43958, 47.24802], [-122.43962, 47.24816],
          [-122.43982, 47.24804]]]
      }
    },
    {
      "type": "Feature",
      "properties": {"name": "Parking Garage"},
      "geometry": {
        "type": "Polygon", "coordinates": [[
          [-122.44022, 47.24800], [-122.44000, 47.24702], [-122.43953, 47.24707],
          [-122.43968, 47.24772], [-122.43974, 47.24772], [-122.43982, 47.24804],
          [-122.44022, 47.24800]]]
      }
    }
  ]
}

<br><br><br>

It's possible to work with data like these in pure Python, but what if the dataset is huge?

<br><br><br>

It's possible to work with data like these as a set of rectangular tables, using integer keys to establish relationships between them.

In [None]:
import sqlite3
import pandas as pd

In [None]:
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

cursor.execute("""
CREATE TABLE properties (
  feature_id INTEGER PRIMARY KEY AUTOINCREMENT,
  name TEXT
)
""")

cursor.execute("""
CREATE TABLE coordinates (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  feature_id INTEGER,
  longitude REAL,
  latitude REAL,
  point_index INTEGER,
  FOREIGN KEY(feature_id) REFERENCES properties(feature_id)
)
""")

for feature in geojson["features"]:
    cursor.execute("INSERT INTO properties (name) VALUES (?)", [feature["properties"]["name"]])
    feature_id = cursor.lastrowid
    outer_ring = feature["geometry"]["coordinates"][0]
    for point_index, (longitude, latitude) in enumerate(outer_ring):
        cursor.execute(
            "INSERT INTO coordinates (feature_id, longitude, latitude, point_index) VALUES (?, ?, ?, ?)",
            [feature_id, longitude, latitude, point_index],
        )

connection.commit()

In [None]:
pd.read_sql_query("SELECT * FROM properties", connection).set_index("feature_id")

In [None]:
pd.read_sql_query("SELECT * FROM coordinates", connection).set_index("id")

But some operations are now much more complicated.

What if you want to compute the area of each polygon?
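
As a rough illustration of that extra complexity, here is a minimal sketch of a polygon-area query against tables shaped like the ones above. It assumes a planar shoelace formula applied directly to the coordinate values (so the result is in squared coordinate units, not a geodesic area), and it uses a hypothetical two-polygon dataset and the name `conn` so that it can run on its own.

```python
import sqlite3

# Hypothetical miniature dataset: closed rings, first point repeated at the end.
polygons = {
    "square": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)],
    "triangle": [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (0.0, 0.0)],
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE properties (feature_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE coordinates "
    "(feature_id INTEGER, longitude REAL, latitude REAL, point_index INTEGER)"
)
for name, ring in polygons.items():
    feature_id = conn.execute(
        "INSERT INTO properties (name) VALUES (?)", [name]
    ).lastrowid
    conn.executemany(
        "INSERT INTO coordinates VALUES (?, ?, ?, ?)",
        [(feature_id, x, y, i) for i, (x, y) in enumerate(ring)],
    )

# Shoelace formula: a self-join pairs each point (a) with its successor (b),
# then the cross products are summed per feature.
areas = conn.execute("""
SELECT p.name,
       ABS(SUM(a.longitude * b.latitude - b.longitude * a.latitude)) / 2.0 AS area
FROM coordinates AS a
JOIN coordinates AS b
  ON a.feature_id = b.feature_id AND b.point_index = a.point_index + 1
JOIN properties AS p ON p.feature_id = a.feature_id
GROUP BY p.feature_id
ORDER BY p.feature_id
""").fetchall()

print(areas)
```

The self-join is needed only to put neighboring points of the same polygon side by side, something a nested list gives you for free.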

<br><br><br>

We want arbitrary data structures with an array-oriented interface and array-oriented performance...

![](../img/awkward-motivation-venn-diagram.svg)

<br><br><br>

## Libraries for irregular arrays

<br><br><br>

![](../img/logo-arrow.svg)

Apache Arrow: an in-memory format and an ecosystem of tools, sometimes described as an "exploded database" (database functionality provided as interchangeable pieces). It focuses strongly on delivering data between processes without copying (zero-copy).

In [None]:
import pyarrow as pa

In [None]:
arrow_array = pa.array([
    [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],
    [],
    [{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]
])

In [None]:
arrow_array.type

In [None]:
arrow_array

<br><br><br>

![](../img/logo-awkward.svg)

Awkward Array: a library for NumPy-like array-oriented programming on arbitrary data structures. Its arrays are interconvertible with Arrow and Parquet.

In [None]:
import awkward as ak

In [None]:
awkward_array = ak.from_arrow(arrow_array)
awkward_array

<br><br><br>

![](../img/logo-parquet.svg)

Parquet: a disk format for storing large datasets and retrieving them selectively (for example, reading only some of the columns).

In [None]:
ak.to_parquet(awkward_array, "/tmp/file.parquet")

In [None]:
ak.from_parquet("/tmp/file.parquet")

<br><br><br>

## Data analysis in Awkward Array

In [None]:
ragged = ak.Array([
    [
      [[1.84, 0.324]],
      [[-1.609, -0.713, 0.005], [0.953, -0.993, 0.011, 0.718]],
      [[0.459, -1.517, 1.545], [0.33, 0.292]],
      [[-0.376, -1.46, -0.206], [0.65, 1.278]],
      [[], [], [1.617]],
      []
    ],
    [
      [[-0.106, 0.611]],
      [[0.118, -1.788, 0.794, 0.658], [-0.105]]
    ],
    [
      [[-0.384], [0.697, -0.856]],
      [[0.778, 0.023, -1.455, -2.289], [-0.67], [1.153, -1.669, 0.305, 1.517, -0.292]]
    ],
    [
      [[0.205, -0.355], [-0.265], [1.042]],
      [[-0.004], [-1.167, -0.054, 0.726, 0.213]],
      [[1.741, -0.199, 0.827]]
    ]
])

<br><br><br>

**Multidimensional indexing**

In [None]:
ragged[3, 1, -1, 2]

<br><br><br>

**Basic slicing**

In [None]:
ragged[3, 1:, -1, 1:3]
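
For comparison, here is a sketch of the same two selections written against plain Python lists. The name `fourth` is a hypothetical stand-in for the fourth top-level list of the data above, copied out so the example runs on its own.

```python
# Hypothetical stand-in: the fourth top-level list of the data, as Python lists.
fourth = [
    [[0.205, -0.355], [-0.265], [1.042]],
    [[-0.004], [-1.167, -0.054, 0.726, 0.213]],
    [[1.741, -0.199, 0.827]],
]

# Counterpart of ragged[3, 1, -1, 2]: one integer index per level of nesting.
single = fourth[1][-1][2]

# Counterpart of ragged[3, 1:, -1, 1:3]: the slice 1: keeps a list dimension,
# so the remaining indexes must be applied to each of its items in turn.
sliced = [group[-1][1:3] for group in fourth[1:]]

print(single)
print(sliced)
```

In pure Python, every slice that keeps a dimension turns into a loop or comprehension; the array-oriented form expresses the same thing in a single indexing expression.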

<br><br><br>

**Advanced slicing**

In [None]:
ragged[[False, False, True, True], [0, -1, 0, -1], 0, -1]

<br><br><br>

**Awkward slicing**

In [None]:
ragged > 0

In [None]:
ragged[ragged > 0]
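
To make the effect of this kind of slice concrete, here is a pure Python sketch of the same filtering. The helper `keep_positive` and the `sample` data are hypothetical names introduced for illustration. Numbers that fail the condition are dropped, but every level of list nesting is kept, including lists that end up empty.

```python
def keep_positive(nested):
    """Drop non-positive numbers while preserving the nesting of lists."""
    kept = []
    for item in nested:
        if isinstance(item, list):
            # Recurse into sublists; the (possibly empty) list itself is kept.
            kept.append(keep_positive(item))
        elif item > 0:
            kept.append(item)
    return kept


sample = [[[1.84, 0.324]], [[-1.609, -0.713, 0.005], []], []]
filtered = keep_positive(sample)
print(filtered)
```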

<br><br><br>

**Reductions**

In [None]:
ak.sum(ragged)

In [None]:
ak.sum(ragged, axis=-1)

In [None]:
ak.sum(ragged, axis=0)

<br><br><br>

How are reductions even defined for ragged arrays?

![](../img/example-reducer-2d.svg)

In [None]:
import numpy as np

In [None]:
regular = np.array([
    [  1,   2,   3,   4],
    [ 10,  20,  30,  40],
    [100, 200, 300, 400],
])

In [None]:
np.sum(regular, axis=0)

In [None]:
np.sum(regular, axis=1)

<br><br><br>

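To reduce along `axis=0`, treat all variable-length lists as left-justified: items at the same position, counting from the start of each list, are combined.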

![](../img/example-reduction-sum.svg)

In [None]:
irregular = ak.Array([
    [   1,    2,    4],
    [                ],
    [None,    8      ],
    [  16            ],
])

In [None]:
ak.sum(irregular, axis=0)

In [None]:
ak.sum(irregular, axis=1)
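
The left-justification rule can be sketched in pure Python. This is an illustration of the rule, not of how the library implements it; the function names are hypothetical, and it assumes that missing values (`None`) simply contribute nothing to a sum.

```python
from itertools import zip_longest

rows = [[1, 2, 4], [], [None, 8], [16]]


def sum_axis0(lists):
    # zip_longest lines items up by position from the left and pads short
    # lists with None, which is then skipped like any other missing value.
    return [
        sum(x for x in column if x is not None)
        for column in zip_longest(*lists)
    ]


def sum_axis1(lists):
    # One total per list; an empty list sums to 0.
    return [sum(x for x in row if x is not None) for row in lists]


print(sum_axis0(rows))
print(sum_axis1(rows))
```

The `axis=0` result has as many entries as the longest list, while the `axis=1` result has one entry per list.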

<br><br><br>

**Elementwise formulas**

In [None]:
svg_paths = ak.Array([
  {"fill": "#b1b1b1", "stroke": "none", "points": [{"x": 5.27453, "y": 1.03276},
    {"x": -3.51280, "y": 1.74849}]},
  {"fill": "#b1b1b1", "stroke": "none", "points": [{"x": 8.21630, "y": 4.07844},
    {"x": -0.79157, "y": 3.49478}, {"x": 16.38932, "y": 5.29399},
    {"x": 10.38641, "y": 0.10832}, {"x": -2.07070, "y": 14.07140},
    {"x": 9.57021, "y": -0.94823}, {"x": 1.97332, "y": 3.62380},
    {"x": 5.66760, "y": 11.38001}, {"x": 0.25497, "y": 3.39276},
    {"x": 3.86585, "y": 6.22051}, {"x": -0.67393, "y": 2.20572}]},
  {"fill": "#d0d0ff", "stroke": "none", "points": [{"x": 3.59528, "y": 7.37191},
    {"x": 0.59192, "y": 2.91503}, {"x": 4.02932, "y": -1.13601},
    {"x": -1.01593, "y": 1.95894}, {"x": 1.03666, "y": 0.05251}]},
  {"fill": "#d0d0ff", "stroke": "none", "points": [{"x": -8.78510, "y": -0.00497},
    {"x": -15.22688, "y": 3.90244}, {"x": 5.74593, "y": 4.12718}]},
  {"fill": "none", "stroke": "#000000", "points": [{"x": 4.40625, "y": -6.953125},
    {"x": 4.34375, "y": -7.09375}, {"x": 4.3125, "y": -7.140625},
    {"x": 4.140625, "y": -7.140625}]},
  {"fill": "none", "stroke": "#808080", "points": [{"x": 0.46875, "y": -0.09375},
    {"x": 0.46875, "y": -0.078125}, {"x": 0.46875, "y": 0.53125}]}
])

In [None]:
np.sqrt(svg_paths["points", "x"]**2 + svg_paths["points", "y"]**2)
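
For reference, this is roughly what the one-line formula replaces when written as nested loops over Python objects. The `paths` data below is a small hypothetical sample with the same record layout, not the dataset above.

```python
import math

# Hypothetical sample with the same shape: a list of records, each holding
# a variable-length list of {"x", "y"} points.
paths = [
    {"points": [{"x": 3.0, "y": 4.0}, {"x": 0.0, "y": 2.0}]},
    {"points": []},
    {"points": [{"x": -6.0, "y": 8.0}]},
]

# Distance of every point from the origin, keeping one list per path.
distances = [
    [math.hypot(point["x"], point["y"]) for point in path["points"]]
    for path in paths
]

print(distances)
```

The output has the same nesting as the input points: one list of distances per path, with an empty list for a path that has no points.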

<br><br><br>

## How to think in Awkward Arrays

We'll be getting to the challenge exercise soon. But first, let's do one together.

<br><br><br>

Given the following dataset:

In [None]:
sam_raimi_movies = ak.Array([
    {"movie": "Evil Dead", "year": 1981, "actors":
        ["Bruce Campbell", "Ellen Sandweiss", "Richard DeManincor", "Betsy Baker"]
    },
    {"movie": "Darkman", "year": 1990, "actors":
        ["Liam Neeson", "Frances McDormand", "Larry Drake", "Bruce Campbell"]
    },
    {"movie": "Army of Darkness", "year": 1992, "actors":
        ["Bruce Campbell", "Embeth Davidtz", "Marcus Gilbert", "Bridget Fonda",
         "Ted Raimi", "Patricia Tallman"]
    },
    {"movie": "A Simple Plan", "year": 1998, "actors":
        ["Bill Paxton", "Billy Bob Thornton", "Bridget Fonda", "Brent Briscoe"]
    },
    {"movie": "Spider-Man 2", "year": 2004, "actors":
        ["Tobey Maguire", "Kirsten Dunst", "Alfred Molina", "James Franco",
         "Rosemary Harris", "J.K. Simmons", "Stan Lee", "Bruce Campbell"]
    },
    {"movie": "Drag Me to Hell", "year": 2009, "actors":
        ["Alison Lohman", "Justin Long", "Lorna Raver", "Dileep Rao", "David Paymer"]
    }
])

Select the movies whose cast does _not_ include `"Bruce Campbell"`.

See [ak.all](https://awkward-array.org/doc/main/reference/generated/ak.all.html), [ak.any](https://awkward-array.org/doc/main/reference/generated/ak.any.html), [np.invert](https://numpy.org/doc/stable/reference/generated/numpy.invert.html), and [ak.num](https://awkward-array.org/doc/main/reference/generated/ak.num.html).

<br><br><br>

In [None]:
is_bruce_campbell = (sam_raimi_movies["actors"] == "Bruce Campbell")
is_bruce_campbell

<br><br><br>

In [None]:
all_not_bruce_campbell = ak.all(~is_bruce_campbell, axis=1)
all_not_bruce_campbell

<br><br><br>

In [None]:
sam_raimi_movies[all_not_bruce_campbell]

<br><br><br>

On to the [project.ipynb](project.ipynb)!