# Quick Start

This first example uses `pandas`, because it remains the most popular dataframe library for Python. However, exactly the same methods would be called for any of the other supported backends.

Import the packages that we'll use: 

In [1]:
import awkward as ak
import akimbo.pandas
import pandas as pd
import numpy as np

## Vectorizing ragged data

Consider a series made of Python lists. This happens a lot in ``pandas``; it is also possible in other dataframe libraries, but less likely. The data here is only a couple of MB in size, and has the simplest kind of ragged nesting imaginable.

In [2]:
s = pd.Series([[1, 2, 3], [0], [4, 5]] * 100000)

In [3]:
s

0         [1, 2, 3]
1               [0]
2            [4, 5]
3         [1, 2, 3]
4               [0]
            ...    
299995          [0]
299996       [4, 5]
299997    [1, 2, 3]
299998          [0]
299999       [4, 5]
Length: 300000, dtype: object

First let's do a super simple operation: get the maximum of each list. There are a number of different ways to do this; we'll comment on and time several.

The cell below times each approach in turn:

In [None]:
print("\nnumpy function")
%timeit s.map(np.max);
print("\npython function")
%timeit s.map(max);
print("\ncomprehension/iteration")
%timeit [max(_) for _ in s];
print("\nak with conversion")
%timeit s.ak.max(axis=1);
print("\nak after conversion")
s2 = s.ak.to_output()
%timeit s2.ak.max(axis=1)




Some interesting results!
- numpy is terrible at this: most of the cost is converting the lists to arrays, and numpy is not designed for tiny arrays
- using built-in Python functions and iteration is OK when the data size isn't too big; this doesn't scale to millions of elements or lists-of-lists
- sometimes you can shave off runtime when you ignore the index; both ak versions maintain the index
- ak is just as fast even accounting for converting the data; but if the data is already in optimized form (which also uses less memory), ak is **much** faster than any other method. There is no equivalent numpy representation of the data.
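
The `%timeit` lines above are IPython magics, so they only run in a notebook or an IPython shell. As a rough sketch, a similar comparison in a plain script could use the standard-library `timeit` module. The series here is deliberately smaller than the one above, and the names `small` and `candidates` and the repeat count of 5 are arbitrary choices for illustration.

```python
import timeit

import pandas as pd

# Smaller than the 300,000-row series above, to keep the script quick
small = pd.Series([[1, 2, 3], [0], [4, 5]] * 1000)

# Each candidate is a zero-argument callable, as timeit expects
candidates = {
    "python function": lambda: small.map(max),
    "comprehension/iteration": lambda: [max(row) for row in small],
}

for label, func in candidates.items():
    seconds = timeit.timeit(func, number=5)  # total time for 5 calls
    print(f"{label}: {seconds / 5:.6f} s per call")

# Check that both approaches give the same per-list maxima
same = small.map(max).tolist() == [max(row) for row in small]
print("results agree:", same)
```

Absolute timings depend on the machine and library versions, so only the relative ordering is worth reading from the output.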

**NOTE**:
 Pandas supports arrow storage of data such as this, and some IO functions can create it
without intermediate Python objects when given the argument `dtype_backend="pyarrow"`. For dask,
arrow is already the default, but object columns are still common; for polars and cuDF,
arrow is the only storage available, so you are guaranteed fast operations.
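
To make "optimized form" a little more concrete: arrow-style list storage keeps all the values in one flat buffer, plus an array of offsets marking where each list starts and ends. The sketch below builds that layout by hand with numpy for the three example lists and computes the per-list maxima with `np.maximum.reduceat`. It illustrates the layout only, not how `akimbo` or Awkward Array implement `max`, and `reduceat` as used here assumes that no list is empty.

```python
import numpy as np

lists = [[1, 2, 3], [0], [4, 5]]

# Flat buffer of every value, in order
values = np.array([v for row in lists for v in row])

# offsets[i]:offsets[i + 1] is the slice of `values` belonging to list i
offsets = np.concatenate([[0], np.cumsum([len(row) for row in lists])])

# One vectorized reduction per segment; assumes every list is non-empty
maxima = np.maximum.reduceat(values, offsets[:-1])

print(values)   # [1 2 3 0 4 5]
print(offsets)  # [0 3 4 6]
print(maxima)   # [3 0 5]
```

Because the values sit in one contiguous array, the reduction never has to create a Python object per list, which is where the object-dtype approaches spend most of their time.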


### Nested records

Let's look at a tiny example of nested record-oriented data. 

This small fake sports dataset contains some players' names, their team, and how many goals they've scored in some variable number of games that they've appeared in.


The raw data:

In [None]:
text = """- name: Bob\n  team: tigers\n  goals: [0, 0, 0, 1, 2, 0, 1]\n\n- name: Alice\n  team: bears\n  goals: [3, 2, 1, 0, 1]\n\n- name: Jack\n  team: bears\n  goals: [0, 0, 0, 0, 0, 0, 0, 0, 1]\n\n- name: Jill\n  team: bears\n  goals: [3, 0, 2]\n\n- name: Ted\n  team: tigers\n  goals: [0, 0, 0, 0, 0]\n\n- name: Ellen\n  team: tigers\n  goals: [1, 0, 0, 0, 2, 0, 1]\n\n- name: Dan\n  team: bears\n  goals: [0, 0, 3, 1, 0, 2, 0, 0]\n\n- name: Brad\n  team: bears\n  goals: [0, 0, 4, 0, 0, 1]\n\n- name: Nancy\n  team: tigers\n  goals: [0, 0, 1, 1, 1, 1, 0]\n\n- name: Lance\n  team: bears\n  goals: [1, 1, 1, 1, 1]\n\n- name: Sara\n  team: tigers\n  goals: [0, 1, 0, 2, 0, 3]\n\n- name: Ryan\n  team: tigers\n  goals: [1, 2, 3, 0, 0, 0, 0]\n"""

This is in YAML format, so that we can include it in a single line. Notice that YAML lets us see the nesting and the variable lengths of the data clearly.
The data in YAML format:

In [None]:
print(text)

Awkward Array happily deals with this kind of data:

In [None]:
import yaml

dicts = yaml.safe_load(text)
data = ak.Array(dicts)

In [None]:
data

However, we use `akimbo` to transform it into a Series. This will allow us to use dataframe functionality such as groupby, below.

In [None]:
s = akimbo.pandas.PandasAwkwardAccessor._to_output(data)

The dataset in Awkward Array form has three fields: "name", "team" and "goals".

Of these, two are "normal" fields - they can be made into dataframe columns containing no nesting. To unwrap the top record-like structure of the data, we can use ``unmerge``.

In [None]:
df = s.ak.unmerge()
df

We can use pure Pandas to investigate the dataset, but since Pandas doesn't have a builtin ability to handle the nested structure of our `goals` column, we're limited to some coarse information.

For example, we can group by team and see the average number of goals each player scored per game. Here we use the ``.ak`` accessor _on each group_, to be able to do arithmetic on the variable-length data while maintaining the pandas index.

In [None]:
df.set_index("name") \
  .groupby("team", group_keys=True) \
  .apply(lambda x: x.goals.ak.mean(axis=1)) \
  .sort_values(ascending=False)
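
For comparison, a similar per-player average can be computed in plain pandas by exploding the lists into one row per game. This is only a sketch on a three-player subset of the data above, and `df_plain`, `long_form` and `per_player` are hypothetical names. Note that it copies every value into a much longer frame, whereas the `.ak` version works on the lists as they are.

```python
import pandas as pd

# Three of the players from the dataset above
records = [
    {"name": "Bob", "team": "tigers", "goals": [0, 0, 0, 1, 2, 0, 1]},
    {"name": "Alice", "team": "bears", "goals": [3, 2, 1, 0, 1]},
    {"name": "Jill", "team": "bears", "goals": [3, 0, 2]},
]
df_plain = pd.DataFrame(records)

# One row per (player, game); explode leaves an object column, so cast it
long_form = df_plain.explode("goals").astype({"goals": "float64"})

# Mean goals per game for each player, labelled by team and name
per_player = (
    long_form.groupby(["team", "name"])["goals"]
    .mean()
    .sort_values(ascending=False)
)
print(per_player)
```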

Determining how many games each player has appeared in is simpler, using a direct method:

In [None]:
df["n_games"] = df.goals.ak.num(axis=1)

In [None]:
df
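
Counting is cheap when lists are stored as flat values plus offsets (the layout arrow uses for list columns), because the length of each list is just the difference between neighbouring offsets. The following is a hand-built numpy sketch for three of the players, meant to show the idea rather than what `akimbo` does internally.

```python
import numpy as np

# Goals for Bob, Alice and Jill, taken from the dataset above
goals = [[0, 0, 0, 1, 2, 0, 1], [3, 2, 1, 0, 1], [3, 0, 2]]

# Offsets mark the boundaries of each list inside a flat value buffer
offsets = np.concatenate([[0], np.cumsum([len(g) for g in goals])])

# The number of games per player is the gap between consecutive offsets
n_games = np.diff(offsets)
print(n_games)  # [7 5 3]
```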

We can also convert the entire dataframe (any dataframe, in fact) back to a `Series`, which is convenient if we want to drop down to the Awkward library for further operations.

In [None]:
s = df.ak.merge()

In [None]:
s  # look at that complex dtype!

And go back to pure awkward (now with our new `n_games` column) using the accessor:

In [None]:
s.ak.array

In [None]:
s.ak.array.fields

In [None]:
# as series
s.ak["n_games"]

In [None]:
# as awkward
s.ak.array["n_games"]

### Behaviours

Let's take an example from the upstream documentation: vectors are made of two fields, `(x, y)`. We know that adding two vectors and computing the size of a vector are easily expressed. Let's encode this in a class and apply it to data in a dataframe.

In [None]:
import awkward as ak  # needed for ak.zip in point_add
from akimbo import mixin_class, mixin_class_method, behavior
import akimbo.pandas
import numpy as np
import pandas as pd


@mixin_class(behavior)
class Point:

    @mixin_class_method(np.abs)
    def point_abs(self):
        return np.sqrt(self.x ** 2 + self.y ** 2)

    @mixin_class_method(np.add, {"*"})
    def point_add(self, other):
        return ak.zip(
            {"x": self.x + other.x, "y": self.y + other.y}, with_name="Point",
        )
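
For orientation, the mixin above is the array-at-a-time counterpart of an ordinary Python class like the one sketched here. `ScalarPoint` is a hypothetical name used only for illustration. It handles a single point per object, whereas the behaviour class applies the same two operations to a whole column at once.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalarPoint:
    x: float
    y: float

    def __abs__(self):
        # Length of the vector
        return math.sqrt(self.x ** 2 + self.y ** 2)

    def __add__(self, other):
        # Component-wise sum
        return ScalarPoint(self.x + other.x, self.y + other.y)


p = ScalarPoint(1, 2) + ScalarPoint(2, 2)
print(p)       # ScalarPoint(x=3, y=4)
print(abs(p))  # 5.0
```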

In [None]:
data = [{"x": 1, "y": 2}] * 100000
s = pd.Series(data).ak.to_output()  # store as arrow

In [None]:
# check that the unary method is there; so tab-complete will work
"point_abs" in dir(s.ak.with_behavior("Point"))

In [None]:
# call to get vector sizes
s.ak.with_behavior("Point").point_abs()

In [None]:
# or do the same with numpy ufunc
np.abs(s.ak.with_behavior("Point"))

In [None]:
import math
%timeit np.abs(s.ak.with_behavior("Point"))
%timeit s.apply(lambda struct: math.sqrt(struct["x"] ** 2 + struct["y"] ** 2))

Of course, we could have done the same by extracting out the arrays (e.g., `s.ak["x"]`) and using numpy directly, which would have been as fast, but this way we have an object-like experience.
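
The array-extraction alternative mentioned above can be sketched with numpy alone. Here the fields are pulled out of the Python dicts with comprehensions, as a stand-in for `s.ak["x"]` and `s.ak["y"]`, so this shows the arithmetic rather than the accessor itself.

```python
import numpy as np

data = [{"x": 1, "y": 2}] * 100000

# Pull each field into its own flat array
x = np.array([d["x"] for d in data])
y = np.array([d["y"] for d in data])

# Vector sizes, computed for the whole column in one expression
sizes = np.sqrt(x ** 2 + y ** 2)
print(sizes[:3])
```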