# Quick Start

This first example uses `pandas`, because it remains the most popular dataframe library for Python. However, exactly the same methods would be called for any of the other supported backends.

Import the packages that we'll use: 

In [1]:
import awkward as ak
import akimbo.pandas
import pandas as pd
import numpy as np

## Vectorizing ragged data

Consider a series made of Python lists. This happens a lot in ``pandas``; it is also possible in other dataframe libraries, but less likely. This data is only a couple of MB in size and has the simplest ragged nesting imaginable.

In [2]:
s = pd.Series([[1, 2, 3], [0], [4, 5]] * 100000)

In [3]:
s

0         [1, 2, 3]
1               [0]
2            [4, 5]
3         [1, 2, 3]
4               [0]
            ...    
299995          [0]
299996       [4, 5]
299997    [1, 2, 3]
299998          [0]
299999       [4, 5]
Length: 300000, dtype: object

First let's do a super simple operation: get the maximum of each list. There are a number of different ways to do this; we'll comment on and time several.

The cell below times each approach in turn:

In [4]:
print("\nnumpy function")
%timeit s.map(np.max);
print("\npython function")
%timeit s.map(max);
print("\ncomprehension/iteration")
%timeit [max(_) for _ in s];
print("\nak with conversion")
%timeit s.ak.max(axis=1);
print("\nak after conversion")
s2 = s.ak.to_output()
%timeit s2.ak.max(axis=1)


numpy function
883 ms ± 3.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

python function
48.6 ms ± 81.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

comprehension/iteration
34 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

ak with conversion
34.1 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

ak after conversion
3.32 ms ± 46 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Some interesting results!
- numpy is terrible at this: most of the cost is converting the lists to arrays, and numpy is not designed for tiny arrays
- using built-in python functions and iteration is OK when the data size isn't too big; this doesn't scale to millions of elements or lists-of-lists
- sometimes you can shave off runtime when you ignore the index; both ak versions maintain the index
- ak is just as fast even accounting for converting the data; but if the data is already in optimized form (which also uses less memory), ak is **much** faster than any other method. There is no equivalent numpy representation of the data.
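
To give a feel for what "optimized form" can mean, here is a rough sketch assuming an offsets-plus-values layout of the kind Arrow uses for list columns. No single numpy array can hold the ragged lists, but two flat ones can. The names `content` and `offsets` are illustrative only, and this is not a description of how akimbo or awkward implement `max` internally.

```python
import numpy as np

# The ragged lists [[1, 2, 3], [0], [4, 5]] stored as two flat arrays
# (illustrative names, assumed layout).
content = np.array([1, 2, 3, 0, 4, 5])  # every value, laid end to end
offsets = np.array([0, 3, 4, 6])        # list i is content[offsets[i]:offsets[i + 1]]

# Length of each list, without touching the values.
lengths = np.diff(offsets)

# Maximum of each list in one vectorized call.
# Note: reduceat only gives the right answer here if no list is empty.
per_list_max = np.maximum.reduceat(content, offsets[:-1])

print(lengths.tolist())       # [3, 1, 2]
print(per_list_max.tolist())  # [3, 0, 5]
```

With this kind of layout there are no per-row Python objects to convert, which is consistent with the gap between the "with conversion" and "after conversion" timings above.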

**NOTE**:
 Pandas supports arrow storage of data such as this, and some IO functions can create it
without intermediate python objects with the argument `dtype_backend="pyarrow"`. For dask,
arrow is already the default, but objects are still common; for polars and cuDF,
arrow is the only storage available, so you are guaranteed fast operations.


### Nested records

Let's look at a tiny example of nested record-oriented data. 

This small fake sports dataset contains some players' names, their team, and how many goals they've scored in some variable number of games that they've appeared in.


The raw data:

In [5]:
text = """- name: Bob\n  team: tigers\n  goals: [0, 0, 0, 1, 2, 0, 1]\n\n- name: Alice\n  team: bears\n  goals: [3, 2, 1, 0, 1]\n\n- name: Jack\n  team: bears\n  goals: [0, 0, 0, 0, 0, 0, 0, 0, 1]\n\n- name: Jill\n  team: bears\n  goals: [3, 0, 2]\n\n- name: Ted\n  team: tigers\n  goals: [0, 0, 0, 0, 0]\n\n- name: Ellen\n  team: tigers\n  goals: [1, 0, 0, 0, 2, 0, 1]\n\n- name: Dan\n  team: bears\n  goals: [0, 0, 3, 1, 0, 2, 0, 0]\n\n- name: Brad\n  team: bears\n  goals: [0, 0, 4, 0, 0, 1]\n\n- name: Nancy\n  team: tigers\n  goals: [0, 0, 1, 1, 1, 1, 0]\n\n- name: Lance\n  team: bears\n  goals: [1, 1, 1, 1, 1]\n\n- name: Sara\n  team: tigers\n  goals: [0, 1, 0, 2, 0, 3]\n\n- name: Ryan\n  team: tigers\n  goals: [1, 2, 3, 0, 0, 0, 0]\n"""

This is in YAML format, so that we can include it in a single line. Notice that YAML allows us to see the nesting and the variable length of the data clearly.
The data in YAML format:

In [6]:
print(text)

- name: Bob
  team: tigers
  goals: [0, 0, 0, 1, 2, 0, 1]

- name: Alice
  team: bears
  goals: [3, 2, 1, 0, 1]

- name: Jack
  team: bears
  goals: [0, 0, 0, 0, 0, 0, 0, 0, 1]

- name: Jill
  team: bears
  goals: [3, 0, 2]

- name: Ted
  team: tigers
  goals: [0, 0, 0, 0, 0]

- name: Ellen
  team: tigers
  goals: [1, 0, 0, 0, 2, 0, 1]

- name: Dan
  team: bears
  goals: [0, 0, 3, 1, 0, 2, 0, 0]

- name: Brad
  team: bears
  goals: [0, 0, 4, 0, 0, 1]

- name: Nancy
  team: tigers
  goals: [0, 0, 1, 1, 1, 1, 0]

- name: Lance
  team: bears
  goals: [1, 1, 1, 1, 1]

- name: Sara
  team: tigers
  goals: [0, 1, 0, 2, 0, 3]

- name: Ryan
  team: tigers
  goals: [1, 2, 3, 0, 0, 0, 0]



Awkward Array happily deals with this kind of data:

In [7]:
import yaml

dicts = yaml.safe_load(text)
data = ak.Array(dicts)

In [8]:
data

but we use `akimbo` to transform it into a Series. This will allow us to use dataframe functionality such as groupby, below.

In [9]:
s = akimbo.pandas.PandasAwkwardAccessor._to_output(data)

The dataset in Awkward Array form has three fields: "name", "team" and "goals".

Of these, two are "normal" fields - they can be made into dataframe columns containing no nesting. To unwrap the top record-like structure of the data, we can use ``unmerge``.

In [10]:
df = s.ak.unmerge()
df

| | name | team | goals |
|---|---|---|---|
| 0 | Bob | tigers | [0 0 0 1 2 0 1] |
| 1 | Alice | bears | [3 2 1 0 1] |
| 2 | Jack | bears | [0 0 0 0 0 0 0 0 1] |
| 3 | Jill | bears | [3 0 2] |
| 4 | Ted | tigers | [0 0 0 0 0] |
| 5 | Ellen | tigers | [1 0 0 0 2 0 1] |
| 6 | Dan | bears | [0 0 3 1 0 2 0 0] |
| 7 | Brad | bears | [0 0 4 0 0 1] |
| 8 | Nancy | tigers | [0 0 1 1 1 1 0] |
| 9 | Lance | bears | [1 1 1 1 1] |


We can use pure Pandas to investigate the dataset, but since Pandas doesn't have a built-in ability to handle the nested structure of our `goals` column, we're limited to some coarse information.

For example, we can group by the team and see the average number of goals each player scored per game. Here we use the ``.ak`` accessor _on each group_, to be able to do arithmetic on the variable-length data while maintaining the pandas index.

In [11]:
df.set_index("name") \
  .groupby("team", group_keys=True) \
  .apply(lambda x: x.goals.ak.mean(axis=1)) \
  .sort_values(ascending=False)

team    name 
bears   Jill     1.666667
        Alice    1.400000
        Lance    1.000000
tigers  Sara     1.000000
        Ryan     0.857143
bears   Brad     0.833333
        Dan      0.750000
tigers  Bob      0.571429
        Ellen    0.571429
        Nancy    0.571429
bears   Jack     0.111111
tigers  Ted      0.000000
dtype: double[pyarrow]
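
A few of these per-player means can be cross-checked by hand in plain Python. This is only a sanity check on a subset of the players, with the goal lists copied from the dataset; the dictionary name `goals` is introduced for this sketch.

```python
# Goal lists copied from the dataset for four of the players.
goals = {
    "Jill": [3, 0, 2],
    "Alice": [3, 2, 1, 0, 1],
    "Jack": [0, 0, 0, 0, 0, 0, 0, 0, 1],
    "Ted": [0, 0, 0, 0, 0],
}

# Mean goals per game for each player: total goals divided by games played.
per_game = {name: sum(g) / len(g) for name, g in goals.items()}

# Print highest first, mirroring sort_values(ascending=False).
for name, value in sorted(per_game.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<6} {value:.6f}")
```

This row-by-row version is fine for twelve players; the accessor call does the same arithmetic without a Python-level loop over rows.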

Determining how many games each player has appeared in is simpler, using a direct method:

In [12]:
df["n_games"] = df.goals.ak.num(axis=1)

In [13]:
df

| | name | team | goals | n_games |
|---|---|---|---|---|
| 0 | Bob | tigers | [0 0 0 1 2 0 1] | 7 |
| 1 | Alice | bears | [3 2 1 0 1] | 5 |
| 2 | Jack | bears | [0 0 0 0 0 0 0 0 1] | 9 |
| 3 | Jill | bears | [3 0 2] | 3 |
| 4 | Ted | tigers | [0 0 0 0 0] | 5 |
| 5 | Ellen | tigers | [1 0 0 0 2 0 1] | 7 |
| 6 | Dan | bears | [0 0 3 1 0 2 0 0] | 8 |
| 7 | Brad | bears | [0 0 4 0 0 1] | 6 |
| 8 | Nancy | tigers | [0 0 1 1 1 1 0] | 7 |
| 9 | Lance | bears | [1 1 1 1 1] | 5 |


We can also convert the entire dataframe (any dataframe, in fact) back to a `Series`, which is convenient if we want to drop down to the Awkward library for further operations.

In [14]:
s = df.ak.merge()

In [15]:
s  # look at that complex dtype!

0     {'name': 'Bob', 'team': 'tigers', 'goals': arr...
1     {'name': 'Alice', 'team': 'bears', 'goals': ar...
2     {'name': 'Jack', 'team': 'bears', 'goals': arr...
3     {'name': 'Jill', 'team': 'bears', 'goals': arr...
4     {'name': 'Ted', 'team': 'tigers', 'goals': arr...
5     {'name': 'Ellen', 'team': 'tigers', 'goals': a...
6     {'name': 'Dan', 'team': 'bears', 'goals': arra...
7     {'name': 'Brad', 'team': 'bears', 'goals': arr...
8     {'name': 'Nancy', 'team': 'tigers', 'goals': a...
9     {'name': 'Lance', 'team': 'bears', 'goals': ar...
10    {'name': 'Sara', 'team': 'tigers', 'goals': ar...
11    {'name': 'Ryan', 'team': 'tigers', 'goals': ar...
dtype: struct<name: large_string not null, team: large_string not null, goals: large_list<item: int64 not null> not null, n_games: int64 not null>[pyarrow]

And go back to pure awkward (now with our new `n_games` column) using the accessor:

In [16]:
s.ak.array

In [17]:
s.ak.array.fields

['name', 'team', 'goals', 'n_games']

In [18]:
# as series
s.ak["n_games"]

0     7
1     5
2     9
3     3
4     5
5     7
6     8
7     6
8     7
9     5
10    6
11    7
dtype: int64[pyarrow]

In [19]:
# as awkward
s.ak.array["n_games"]

### Behaviours

Let's take an example from upstream documentation: vectors are made of two fields, `(x, y)`. We know that adding two vectors and finding the size of a vector are easily expressed. Let's encode this in a class and apply it to data in a dataframe.

In [20]:
from akimbo import mixin_class, mixin_class_method, behavior
import akimbo.pandas
import awkward as ak  # needed for ak.zip in point_add
import numpy as np
import pandas as pd


@mixin_class(behavior)
class Point:

    @mixin_class_method(np.abs)
    def point_abs(self):
        return np.sqrt(self.x ** 2 + self.y ** 2)

    @mixin_class_method(np.add, {"Point"})
    def point_add(self, other):
        return ak.zip(
            {"x": self.x + other.x, "y": self.y + other.y}, with_name="Point",
        )

In [21]:
data = [{"x": 1, "y": 2}] * 100000
s = pd.Series(data).ak.to_output()  # store as arrow

In [22]:
# check that the unary method is there; so tab-complete will work
"point_abs" in dir(s.ak.with_behavior("Point"))

True

In [23]:
# call to get vector sizes
s.ak.with_behavior("Point").point_abs()

0        2.236068
1        2.236068
2        2.236068
3        2.236068
4        2.236068
           ...   
99995    2.236068
99996    2.236068
99997    2.236068
99998    2.236068
99999    2.236068
Length: 100000, dtype: double[pyarrow]

In [24]:
# or do the same with numpy ufunc
np.abs(s.ak.with_behavior("Point"))

0        2.236068
1        2.236068
2        2.236068
3        2.236068
4        2.236068
           ...   
99995    2.236068
99996    2.236068
99997    2.236068
99998    2.236068
99999    2.236068
Length: 100000, dtype: double[pyarrow]

In [25]:
import math
%timeit np.abs(s.ak.with_behavior("Point"))
%timeit s.apply(lambda struct: math.sqrt(struct["x"] ** 2 + struct["y"] ** 2))

3.45 ms ± 51.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
60.4 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Of course, we could have done the same by extracting out the arrays (e.g., `s.ak["x"]`) and using numpy directly, which would have been as fast, but this way we have an object-like experience.
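
For reference, the direct numpy calculation could look like the sketch below. It assumes the `x` and `y` columns are already available as plain numpy arrays; here they are built by hand to match the data above, and the extraction step from the series is not shown.

```python
import numpy as np

# Stand-ins for the extracted columns: 100000 points, all (1, 2).
x = np.full(100_000, 1)
y = np.full(100_000, 2)

# Same formula as point_abs, applied to whole arrays at once.
sizes = np.sqrt(x ** 2 + y ** 2)

print(sizes[:3])
```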

Similarly, we defined an overload to add Point arrays together (both operands must be Point type). 
A vector addition is performed. This also happens at obviously vectorized speed - but I am not even sure how you would perform the same thing using python dicts.

In [26]:
%timeit s.ak.with_behavior("Point") + s.ak.with_behavior("Point")
s.ak.with_behavior("Point") + s.ak.with_behavior("Point")

2.63 ms ± 44.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


0        {'x': 2, 'y': 4}
1        {'x': 2, 'y': 4}
2        {'x': 2, 'y': 4}
3        {'x': 2, 'y': 4}
4        {'x': 2, 'y': 4}
               ...       
99995    {'x': 2, 'y': 4}
99996    {'x': 2, 'y': 4}
99997    {'x': 2, 'y': 4}
99998    {'x': 2, 'y': 4}
99999    {'x': 2, 'y': 4}
Length: 100000, dtype: struct<x: int64, y: int64>[pyarrow]
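
For comparison, one row-by-row way to add points held as plain Python dicts might look like the sketch below. The helper `add_points` is hypothetical and written only for this illustration; it loops in Python over every pair, which is exactly the per-row work the vectorized overload avoids.

```python
def add_points(left, right):
    """Add two equal-length lists of {"x": ..., "y": ...} dicts, pair by pair."""
    return [
        {"x": a["x"] + b["x"], "y": a["y"] + b["y"]}
        for a, b in zip(left, right)
    ]


# A small version of the data used above.
points = [{"x": 1, "y": 2}] * 5
summed = add_points(points, points)

print(summed[0])  # {'x': 2, 'y': 4}
```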

### Numba integration

The numpy API is very nice and can do most things you will need. The object-oriented behaviours are very convenient.

However, some algorithms are complex enough that you need to process with a custom function, and the data may be big and complex enough that python iteration over rows is simply not an option. A functional approach may also allow computations in a single pass and without temporaries, which cannot be done with the numpy API.

Enter [`numba`](https://numba.pydata.org/), a JIT compiler for numerical python, turning iterative, loopy functions into their C-language equivalent. Let's take an example.

In [27]:
import numba


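def mean_of_second_biggest(arr):
    # mean of the second-largest value of each row; rows shorter than 2 are skipped
    count = 0
    total = 0
    for row in arr:
        if len(row) < 2:
            continue  # no second-biggest value exists
        # seed the two running values from the first two elements
        biggest = row[0]
        if row[1] > biggest:
            second = biggest
            biggest = row[1]
        else:
            second = row[1]
        # single pass over the rest of the row
        for x in row[2:]:
            if x > biggest:
                second = biggest
                biggest = x
            elif x > second:
                second = x
        count += 1
        total += second
    return total / count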
            


In [28]:
mean_of_second_biggest([[1], [3, 2, 1], [3, 4, 4], [0, 1, 0]])  # mean of 2, 4, and 0

2.0
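
One way to sanity-check the single-pass function is a shorter but slower reference that sorts each row and picks the second-to-last element. The name `mean_of_second_biggest_sorted` is introduced only for this sketch; it is meant for checking small inputs, not as a replacement for the loop that gets compiled below.

```python
def mean_of_second_biggest_sorted(rows):
    """Reference version: sort each row and take its second-largest element."""
    # Rows with fewer than two elements have no second-biggest value and are skipped.
    seconds = [sorted(row)[-2] for row in rows if len(row) >= 2]
    return sum(seconds) / len(seconds)


sample = [[1], [3, 2, 1], [3, 4, 4], [0, 1, 0]]
print(mean_of_second_biggest_sorted(sample))  # 2.0
```

Sorting creates a temporary list per row, which is the kind of overhead the single-pass loop avoids.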

In [29]:
jsecond = numba.njit(mean_of_second_biggest)

In [30]:
s = pd.Series([[1], [3, 2, 1], [3, 4, 4], [0, 1, 0]] * 100000).ak.to_output()

In [31]:
s

0             [1]
1         [3 2 1]
2         [3 4 4]
3         [0 1 0]
4             [1]
           ...   
399995    [0 1 0]
399996        [1]
399997    [3 2 1]
399998    [3 4 4]
399999    [0 1 0]
Length: 400000, dtype: list<item: int64>[pyarrow]

In [32]:
s.ak.apply(jsecond)

2.0

In [33]:
# timings
%timeit s.ak.apply(jsecond)
%timeit mean_of_second_biggest(s)

1.03 ms ± 7.39 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
478 ms ± 2.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Question: could you have done this with vectorized numpy calls?