# Using dask and awkward together

Some basic tests/examples.

Let's first try with a single data file.

## Write out the data files

In [2]:
import json
from pathlib import Path

# Every file holds the same single record with one field, `x`.
data = {
    'x': [1, 2, 3, 4, 5],
}

num_files = 100

# Create the output directory if it does not exist yet.
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

for i in range(num_files):
    with open(data_dir / f'data{i}.json', 'w') as f:
        json.dump(data, f)
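As a quick sanity check on the step above, here is a minimal sketch that writes a few files and reads one back. The helper name `write_data_files` is hypothetical (not part of the notebook), and a temporary directory is assumed so the sketch does not touch the real `data` directory.

```python
import json
import tempfile
from pathlib import Path


def write_data_files(data_dir: Path, num_files: int) -> list[Path]:
    """Write `num_files` identical JSON files and return their paths."""
    data = {"x": [1, 2, 3, 4, 5]}
    data_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(num_files):
        path = data_dir / f"data{i}.json"
        with open(path, "w") as f:
            json.dump(data, f)
        paths.append(path)
    return paths


# Assumption: a throwaway directory, cleaned up when the block exits.
with tempfile.TemporaryDirectory() as tmp:
    paths = write_data_files(Path(tmp) / "data", num_files=3)
    num_written = len(paths)
    # Read the first file back to confirm the round trip.
    first_payload = json.loads(paths[0].read_text())
```

Each file holds one JSON object rather than a list of objects, which may matter for how the loaders later interpret it.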


## Using awkward 2.0

No new features here: just plain awkward, used to load one of the files.

In [4]:
from pathlib import Path
import awkward as ak

file0 = data_dir / 'data0.json'
x = ak.from_json(file0)
x

In [5]:
x.x

In [6]:
x.x[x.x > 2]

## With dask-awkward

Look at the same thing, but with dask-awkward, where nothing runs until `compute` is called.

In [7]:
import dask_awkward as dak

x = dak.from_json(data_dir / "data*.json")
result = x[x.x > 2]
result

dask.awkward<getitem, npartitions=100>

OK, note that it already knows the number of partitions here: one per input file.

In [8]:
result.compute()

Interesting: the result is a list of items. I suppose that is because this isn't an array. What if I actually access the `x` field?

In [9]:
result2 = x[x.x > 2].x
print(result2)

dask.awkward<x, npartitions=100>


In [10]:
result2.compute()

Still not concatenated. I guess there must be a reducer that does that. But it processed all 100 files without a problem.
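To make the distinction concrete, here is a toy sketch in plain Python of "one result per partition" versus "one concatenated result". It only illustrates the idea of a reduction step. It does not use dask-awkward, and it assumes each partition is one file holding `x = [1, 2, 3, 4, 5]`.

```python
from itertools import chain

# Assumption: 100 partitions, each standing in for one of the identical files.
num_partitions = 100
partitions = [[1, 2, 3, 4, 5] for _ in range(num_partitions)]

# Per-partition step: the same filter as `x.x > 2`, applied file by file.
per_partition = [[v for v in part if v > 2] for part in partitions]

# Reduction step: concatenate the per-partition results into one flat list.
concatenated = list(chain.from_iterable(per_partition))

print(len(per_partition), len(concatenated))
```

The per-partition form keeps one entry per file, while the concatenated form loses the file boundaries. Which of the two a given dask-awkward call returns is not something this sketch establishes.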

In [11]:
len(result2)

100

Also interesting - there is no `shape`. :-) Not at all surprised after everything.

## Getting at the compute graph to see what we can do with it

In [12]:
g = result2.__dask_graph__()

In [13]:
type(g)

dask.highlevelgraph.HighLevelGraph

In [14]:
g.keys()

dict_keys([('getitem-d336e5a0669e9fb71bac5622ee3a3078', 0), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 1), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 2), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 3), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 4), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 5), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 6), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 7), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 8), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 9), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 10), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 11), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 12), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 13), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 14), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 15), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 16), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 17), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 18), ('getitem-d336e5a0669e9fb71bac5622ee3a3078', 19),

OK, that is too hard to read. Let's do it for one file.

In [16]:
import dask_awkward as dak

x = dak.from_json(data_dir / "data0.json")
result3 = x[x.x > 2].x
result3

dask.awkward<x, npartitions=1>

In [17]:
g = result3.__dask_graph__()
g.keys()

dict_keys([('getitem-269115c3a09b1bdb533d0895c8d196d3', 0), ('from-json-acba5d7bdc72da2435e076b415679279', 0), ('x-3e77717131bc0f095721866a03f19fde', 0), ('greater-a0b1915b578dd801441ba6e7b2e46ab8', 0), ('x-a6dee122d47eac04388c613aa3a390dd', 0)])

In [18]:
for k, v in g.items():
    print(k, v)
    print()

('getitem-269115c3a09b1bdb533d0895c8d196d3', 0) (subgraph_callable-9069b17f-5653-48ea-96d0-92b2782417d9, ('from-json-acba5d7bdc72da2435e076b415679279', 0), ('greater-a0b1915b578dd801441ba6e7b2e46ab8', 0))

('from-json-acba5d7bdc72da2435e076b415679279', 0) (subgraph_callable-53dc6c3e-4ed3-4cc2-b4b2-29cd4cd9d75b, 'c:/Users/gordo/Code/iris-hep/awkward-20-testing/notebooks/data/data0.json')

('x-3e77717131bc0f095721866a03f19fde', 0) (subgraph_callable-ac8830ff-f4b2-4578-9ae2-2f0914c6dfd4, ('from-json-acba5d7bdc72da2435e076b415679279', 0), 'x')

('greater-a0b1915b578dd801441ba6e7b2e46ab8', 0) (subgraph_callable-945a6074-49e6-42da-a288-935b6358fd9f, ('x-3e77717131bc0f095721866a03f19fde', 0), 2)

('x-a6dee122d47eac04388c613aa3a390dd', 0) (subgraph_callable-8f2a4333-8cb9-4a6d-861a-902f3105fd01, ('getitem-269115c3a09b1bdb533d0895c8d196d3', 0), 'x')
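The printed tasks are easier to follow once they are put in execution order. The sketch below hand-transcribes the dependencies from the output above into a plain dictionary and sorts them with the standard library's `graphlib`. The short key names are abbreviations introduced here (hash suffixes trimmed, partition index dropped), and the subgraph callables and literal arguments (`'x'`, `2`, the file path) are left out.

```python
from graphlib import TopologicalSorter

# Hand-transcribed from the printed graph: task -> tasks it depends on.
# Key names are shortened for readability; they are not the real dask keys.
dependencies = {
    "from-json": set(),                    # reads data0.json
    "x-3e77": {"from-json"},               # field access: x.x
    "greater": {"x-3e77"},                 # comparison: x.x > 2
    "getitem": {"from-json", "greater"},   # masking: x[x.x > 2]
    "x-a6de": {"getitem"},                 # field access: x[x.x > 2].x
}

# static_order() yields each task only after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
for step, name in enumerate(order, start=1):
    print(step, name)
```

Read this way, the five tasks line up with the expression `x[x.x > 2].x`: load the file, take `x`, compare with 2, apply the mask, take `x` again.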



## Getting a bad variable

Let's see how eager this whole thing is.

In [20]:
import dask_awkward as dak

x = dak.from_json(data_dir / "data0.json")
result4 = x[x.x > 2].y
result4

AttributeError: y not in fields.

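OK, this is terrifically bad: the missing field is rejected as soon as the expression is built, before `compute` is ever called. Otherwise we would have to have everything available up front, and how would we specify methods, etc.?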
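The behaviour seen here, an error while the expression is being built rather than when it is computed, can be mimicked with a toy class. This is only a sketch of the general idea of validating against known field names. `LazyRecord` is a hypothetical name, and nothing here is taken from the dask-awkward implementation.

```python
class LazyRecord:
    """Toy stand-in for a lazy collection that knows its field names up front."""

    def __init__(self, fields, steps=()):
        self._fields = tuple(fields)
        self._steps = tuple(steps)  # recorded operations; nothing is executed

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, i.e. for field access.
        if name.startswith("_"):
            raise AttributeError(name)
        if name not in self._fields:
            # Eager validation: fails while the expression is being built.
            raise AttributeError(f"{name} not in fields.")
        # Record the field access instead of performing it.
        return LazyRecord(fields=(), steps=self._steps + (("getfield", name),))


x = LazyRecord(fields=["x"])
ok = x.x  # known field: recorded, not executed

try:
    x.y  # unknown field: rejected immediately
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(ok._steps, error_message)
```

The trade-off is visible even in the toy: eager checks catch typos early, but they require the field names to be known before any data has been processed.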