# DASK to ServiceX

This demo combines DASK and ServiceX. The work is driven by the observation that `AwkwardInputLayer` does not appear to accept tasks as inputs, so we need a different approach.

## Assumptions:

* We don't start anything until we know the number of files that ServiceX (SX) will produce.
* We are OK with some of the files coming out of SX failing.
* We use one partition per file.
* When we start, we don't necessarily know the names of all the files produced, only how many there will be.
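
The assumptions above can be sketched in plain Python, with no DASK involved. This is only an illustration: `FileSlot`, `make_slots`, `usable` and the file names are hypothetical and are not part of ServiceX or `sx-awk`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileSlot:
    """One expected ServiceX output file, which becomes one partition."""
    index: int
    url: Optional[str] = None   # unknown until ServiceX reports the file
    failed: bool = False        # set if this file never arrives


def make_slots(n_files: int) -> List[FileSlot]:
    # The file count is known up front, so every partition can be allocated now.
    return [FileSlot(index=i) for i in range(n_files)]


def usable(slots: List[FileSlot]) -> List[FileSlot]:
    # Failed or still-unknown files are skipped rather than treated as errors.
    return [s for s in slots if s.url is not None and not s.failed]


slots = make_slots(3)
slots[0].url = "file_0.root"    # hypothetical names, filled in as files arrive
slots[2].url = "file_2.root"
slots[1].failed = True
ready = [s.index for s in usable(slots)]
```

Here `ready` holds the indices of the partitions that can actually be loaded; the failed slot is simply dropped.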

## Design Outline

* A single `dask` task/layer that has a single output per partition. The output is just a string.
* An `AwkwardInputLayer` that looks at the task input and loads that data from `minio`.

This version works only for local files; once it works we can move it to an SX prototype.
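
A minimal sketch of that two-stage shape, using a plain dict graph and DASK's synchronous scheduler (`dask.get`) rather than custom layers. The functions `fetch_file_name` and `load_partition` and the file names are hypothetical stand-ins; nothing here talks to ServiceX or reads a real file.

```python
import dask


def fetch_file_name(i):
    # Stand-in for the ServiceX task: its whole output is a string.
    return f"file_{i}.root"


def load_partition(file_name):
    # Stand-in for the loader: it receives the string produced upstream.
    return f"loaded:{file_name}"


n_files = 3
dsk = {}
for i in range(n_files):
    dsk[("sx", i)] = (fetch_file_name, i)
    # ("sx", i) is a key in the graph, so dask substitutes that task's result.
    dsk[("load", i)] = (load_partition, ("sx", i))

results = dask.get(dsk, [("load", i) for i in range(n_files)])
```

Each `("load", i)` task depends only on its own `("sx", i)` task, which is the one-partition-per-file layout described above.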

This was written before any code below.

## Imports

In [1]:
import dask_awkward as dak
import awkward as ak
import dask
import uproot

from dask.highlevelgraph import Layer, HighLevelGraph
from dask.distributed import Client, LocalCluster
from typing import AbstractSet

import logging

# Make debugging a little easier...
cluster = LocalCluster(processes=False)
client = Client(cluster)

## The `uproot.dask` hack way

### The `awkward.Form` file form

We need the form from the schema so that our hack does not have to open files that do not yet exist. Eventually we'll have to build this from the schema we know exists from the `func_adl` query.

In [2]:
dummy_filename = "0fc6e51a5ea6dea107c195591d20a1b2-15.26710677._000019.pool.root.1"
with uproot.open(dummy_filename) as file:
    file_form = file['treeme'].arrays().layout.form
    metadata = dak.core.typetracer_array(file['treeme'].arrays())

file_form, metadata

(RecordForm([ListOffsetForm('i64', NumpyForm('float64'))], ['JetPt']),
 <Array-typetracer [...] type='## * {JetPt: var * float64}'>)

Next, let's test it.

In [3]:
test_ar = uproot.dask({dummy_filename: "treeme"}, open_files=False, known_base_form=file_form)
test_ar.JetPt.compute()

## Blockwise Approach

Could we start a blockwise approach on its own?

In [4]:
class SXLayerBW(Layer):
    '''Outputs are just the names of the files that we want to open downstream with uproot'''
    def __init__(self, name, n_files):
        super().__init__()
        self.name = name
        self.dependencies = dict()
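        # One task per expected file. `i=i` binds the loop value at definition time.
        self.tasks = {
            f"output_{i}": (lambda i=i: self.get_file(f"output_{i}"),)
            for i in range(n_files)
        }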

    def __getitem__(self, key):
        return self.tasks[key]

    def __iter__(self):
        return iter(self.tasks)

    def __len__(self):
        return len(self.tasks)

    def is_materialized(self):
        return False
    
    def get_output_keys(self) -> AbstractSet[str | bytes | int | float]:
        return set(self.tasks.keys())
    
    def get_file(self, name):
        '''Return the info that is needed by uproot to actually open the file'''
        print(f"Returning info for file {name}: {dummy_filename}")
        return (dummy_filename, 'treeme')


And the layer that will load files from the above.

In [5]:
class URLoaderLayer(Layer):
    def __init__(self, name, sx_layer_name, output_name, n_files):
        super().__init__()
        self.name = name
        self.dependencies = {name: sx_layer_name}
        self.tasks = {
            (name, i): (lambda f_name: self.get_data(f_name), f'output_{i}')
            for i in range(n_files)
        }

    def __getitem__(self, key):
        return self.tasks[key]

    def __iter__(self):
        return iter(self.tasks)

    def __len__(self):
        return len(self.tasks)

    def is_materialized(self):
        return False
    
    def get_output_keys(self) -> AbstractSet[str | bytes | int | float]:
        return set(self.tasks.keys())
    
    def get_data(self, name):
        '''Open the file with uproot and return the arrays of the requested tree.

        `name` is the `(file name, tree name)` tuple produced by the upstream layer.'''
        # TODO: This is swallowed unless we use a dask `LocalCluster`.
        logging.warning(f"Loading data for file {name}")
        with uproot.open(name[0]) as file:
            return file[name[1]].arrays()

OK, let's build up the array.

In [6]:
# the layers

sx_layer = SXLayerBW("sx_fetcher", 1)
loader_layer = URLoaderLayer("uproot_loader", "sx_fetcher", "output", 1)

# Now, the high level graph...
hlg = HighLevelGraph(
    layers={sx_layer.name: sx_layer, loader_layer.name: loader_layer},
    dependencies={loader_layer.name: {sx_layer.name}, sx_layer.name: set()},
)

# And finally the array...
ar = dak.core.new_array_object(hlg, "uproot_loader", meta=metadata, npartitions=1)

In [8]:
ar.JetPt[ar.JetPt > 100.0].compute()



Returning info for file output_0: 0fc6e51a5ea6dea107c195591d20a1b2-15.26710677._000019.pool.root.1


2024-02-24 20:29:12,548 - distributed.worker - ERROR - Scheduler was unaware of this worker 'inproc://192.168.1.16/27104/4'. Shutting down.


## SX Support

The code below belongs in the `sx-awk` library.

This is the ServiceX layer. It is responsible for all communication with ServiceX, and for finding the files (and their URLs) in `minio`.

In [None]:
from collections.abc import Set
from typing import AbstractSet, Any, Dict, Iterator, KeysView, List
import logging

class SXLayer(Layer):
    def __init__(self, sx_query_guid):
        super().__init__()
        self._query_guid = sx_query_guid

        # Create a task that will be executed when the layer is computed,
        # and will fetch the list of files from SX.
        k = f"SX-query-{self._query_guid}"
        self._tasks: Dict[str, Any] = {k: dask.delayed(self._fetch_files, name=k)}

    def _fetch_files(self) -> str:
        # This is where the actual fetching of the files from SX would happen.
        # For now, just return a single hard-coded file name.
        logging.warning("Returning a file")
        return "0fc6e51a5ea6dea107c195591d20a1b2-15.26710677._000019.pool.root.1"

    def __getitem__(self, __key) -> Any:
        return self._tasks[str(__key)]
    
    def keys(self):
        return self._tasks.keys()
    
    def __len__(self) -> int:
        return len(self._tasks)
    
    def get_output_keys(self) -> AbstractSet[str]:
        return {f"SX-query-{self._query_guid}"}

    def __iter__(self):
        return iter(self._tasks)
    
    def is_materialized(self) -> bool:
        return False

In [None]:
l = SXLayer("182382781")
hlg = HighLevelGraph(
    layers={"l1": l},
    dependencies={"l1": set()},
)

In [None]:
r = client.compute(hlg)

In [None]:
type(r)


## The `uproot.dask` way

This is a very simple call - here for reference.

In [None]:
filename = "0fc6e51a5ea6dea107c195591d20a1b2-15.26710677._000019.pool.root.1"
ar = uproot.dask({filename: "treeme"}, open_files=False)

In [None]:
pt = ar.JetPt * 5

Let's also look at the layers for this, for reference.

In [None]:
pt.compute()

In [None]:
graph = pt.__dask_graph__()
graph

Let's look at the input layer here.

In [None]:
from_uproot_key = [k for k in graph.layers.keys() if k.startswith("from")][0]
print(f"From uproot key: {from_uproot_key}")
from_uproot_first_output = list(graph.layers[from_uproot_key].keys())[0]
print(f"From uproot first output: {from_uproot_first_output}")
print(f"The function that gets executed and the arguments: {graph.layers[from_uproot_key][from_uproot_first_output]}")

We want the `('0fc6e51a5ea6dea107c195591d20a1b2-15.26710677._000019.pool.root.1', 'treeme', 0, 1, False)` tuple to arrive as the output of an upstream task instead of being fixed in the graph when it is built.
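
What that would look like can be sketched with `dask.delayed`. This is only an illustration of the data flow: `locate_partition` and `load_partition` are hypothetical stand-ins, the file name is invented, and the tuple is treated as opaque apart from its first two entries.

```python
import dask


@dask.delayed
def locate_partition():
    # Stand-in for the upstream task; "example.root" is a hypothetical name.
    return ("example.root", "treeme", 0, 1, False)


@dask.delayed
def load_partition(args):
    # Stand-in for the uproot read: it only unpacks what it was handed.
    file_name, tree_name, *rest = args
    return f"{file_name}:{tree_name}"


# The loader's argument is a task, so the tuple only exists at compute time.
loaded = load_partition(locate_partition())
result = loaded.compute()
```

The open question in this notebook is how to get the same behaviour from the input layer that `uproot.dask` builds.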

## Toy Demo

This was generated with AI and then corrected by hand: a few syntax errors were fixed and the missing `Client.get` call was added. It shows how to build everything from scratch.

In [None]:
from collections.abc import Set
import typing


class CustomLayer(Layer):
    def __init__(self, name, dependencies, tasks):
        super().__init__()
        self.name = name
        self.dependencies = dependencies
        self.tasks = tasks

    def __getitem__(self, key):
        return self.tasks[key]

    def __iter__(self):
        return iter(self.tasks)

    def __len__(self):
        return len(self.tasks)

    def is_materialized(self):
        return False
    
    def get_output_keys(self) -> AbstractSet[str | bytes | int | float]:
        return set(self.tasks.keys())

# Define the tasks for each layer
tasks1 = {"output1": (lambda: "result1", ), "output2": (lambda: "result2", )}
tasks2 = {"output3": (lambda x: x + " processed", "output1"), "output4": (lambda x: x + " processed", "output2")}

# Create the layers
layer1 = CustomLayer("layer1", [], tasks1)
layer2 = CustomLayer("layer2", ["layer1"], tasks2)

# Create the high level graph
hlg = HighLevelGraph(
    layers={layer1.name: layer1, layer2.name: layer2},
    dependencies={layer2.name: {layer1.name}, layer1.name: set()},
)

In [None]:
client.get(hlg, ["output4", "output3"])