# DASK to ServiceX

In this demo we combine `dask` and ServiceX. The work is driven by the fact that `AwkwardInputLayer` does not appear to accept tasks as inputs, so we need a different approach.

## Assumptions:

* We don't start anything until we know the number of files that ServiceX (SX) will produce. Thus we know the number of partitions up front.
* We are OK with some of the files coming out of SX failing.
* We are going to do one partition per file
* When we start we don't know all the _names_ of the files produced.
* We have to fetch them from minio, which does not return them in any particular time order.

## Design Outline

* A single `dask` task/layer that has a single output per partition. The output is just a string, indicating the file we need to open in uproot.
* A single thread polls ServiceX looking for new files to show up, and when they do, it passes them to a `dask` output. A `dask` task then tries to open the file.

The problem:
* `dask` is built around the idea that any task can be re-run with the exact same results.
* In this implementation we are using a `queue`, which means each time you run, you'll get a different answer (yes, we could fix that easily).
* But the basic problem ends up being: the thread that inserts the items on the `queue` might be in a different process than the `task` that needs to read the `queue`.
* `dask` points this out by refusing to pickle a lock.
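
The pickling problem in the last bullet can be reproduced without `dask` at all. The snippet below is a minimal sketch using only the standard library; `pickle_error` is a hypothetical helper written for this illustration, not part of any library. It shows that a `threading.Lock`, and therefore a `queue.Queue` that holds locks internally, cannot be pickled, while the plain `(filename, treename)` tuple we actually want to pass around can.

```python
import pickle
import queue
import threading


def pickle_error(obj):
    """Hypothetical helper: return the exception raised by pickling obj, or None."""
    try:
        pickle.dumps(obj)
    except Exception as exc:
        return exc
    return None


# A bare lock cannot cross a process boundary via pickle.
lock_error = pickle_error(threading.Lock())

# A queue.Queue keeps a lock in its attributes, so it fails the same way.
queue_error = pickle_error(queue.Queue())

# The payload itself (file name plus tree name) is plain data and pickles fine.
plain_error = pickle_error(("some_file.root", "treeme"))

print(type(lock_error).__name__, lock_error)
print(type(queue_error).__name__, queue_error)
print(plain_error)
```

So the data is not the issue; the synchronisation object that carries it is.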

There is a workaround: the single thread can write a file to some shared file system with the file names in order of discovery, and then everyone can read that in. There may well be a better solution, though, because that feels dirty!
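
The shared-file idea can be sketched with the standard library alone. Everything here is an assumption made for illustration: `append_to_manifest`, `wait_for_entry` and `sx_manifest.txt` are hypothetical names, a temporary directory stands in for the shared file system, and there is no file locking beyond ignoring a partially written last line. The writer appends names in discovery order, and each reader asks for a fixed line index, so re-running a task returns the same name.

```python
import os
import tempfile
import threading
import time


def append_to_manifest(manifest_path, file_names, delay=0.01):
    """Stand-in for the polling thread: append each name as it is 'discovered'."""
    for name in file_names:
        time.sleep(delay)  # simulate waiting on ServiceX
        with open(manifest_path, "a") as manifest:
            manifest.write(name + "\n")


def wait_for_entry(manifest_path, index, timeout=5.0, poll_interval=0.01):
    """Block until line `index` of the manifest is complete, then return it."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(manifest_path):
            with open(manifest_path) as manifest:
                # Drop the last piece: it is empty or a half-written line.
                complete_lines = manifest.read().split("\n")[:-1]
            if index < len(complete_lines):
                return complete_lines[index]
        time.sleep(poll_interval)
    raise TimeoutError(f"manifest entry {index} did not appear in time")


with tempfile.TemporaryDirectory() as shared_dir:
    manifest_path = os.path.join(shared_dir, "sx_manifest.txt")
    # Arrival order, deliberately not alphabetical.
    discovered = ["file_b.root", "file_a.root", "file_c.root"]
    writer = threading.Thread(target=append_to_manifest, args=(manifest_path, discovered))
    writer.start()
    # Readers may ask for partitions in any order.
    results = [wait_for_entry(manifest_path, i) for i in (2, 0, 1)]
    rerun = wait_for_entry(manifest_path, 0)  # same index, same answer
    writer.join()

print(results, rerun)
```

Because a reader only needs a path and an integer, nothing unpicklable has to travel with the task.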

## Imports

In [1]:
import dask_awkward as dak
import awkward as ak
import dask
import uproot

from dask.highlevelgraph import Layer, HighLevelGraph
from dask.distributed import Client, LocalCluster
from typing import AbstractSet

import logging

import threading
import queue
import time

# Make debugging a little easier...
cluster = LocalCluster(processes=False)
client = Client(cluster)

## The `uproot.dask` hack way

### The `awkward.Form` file form

We need the form from the schema to prevent us from having to open files that do not yet exist in our hack. Eventually we'll have to build this from the schema we know exists from the `func_adl` query.

In [2]:
dummy_filename = "0fc6e51a5ea6dea107c195591d20a1b2-15.26710677._000019.pool.root.1"
with uproot.open(dummy_filename) as file:
    file_form = file['treeme'].arrays().layout.form
    metadata = dak.core.typetracer_array(file['treeme'].arrays())

file_form, metadata

(RecordForm([ListOffsetForm('i64', NumpyForm('float64'))], ['JetPt']),
 <Array-typetracer [...] type='## * {JetPt: var * float64}'>)

Next, let's test it.

In [3]:
test_ar = uproot.dask({dummy_filename: "treeme"}, open_files=False, known_base_form=file_form)
test_ar.JetPt.compute()

## Blockwise Approach

Could we start a blockwise approach on its own?

In [4]:
import threading
import multiprocessing
import time

class SXLayerBW(Layer):
    '''Outputs are just the names of the files that we want to open downstream with uproot'''
    def __init__(self, name, n_files):
        super().__init__()
        self.name = name
        self.dependencies = dict()
        manager = multiprocessing.Manager()
        self.queue = manager.Queue()
        self.tasks = {f"output_{i}": (self.get_file, f"output_{i}") for i in range(n_files)}
        self.polling_thread = threading.Thread(target=self.poll_api)
        self.polling_thread.start()

    def __getitem__(self, key):
        return self.tasks[key]

    def __iter__(self):
        return iter(self.tasks)

    def __len__(self):
        return len(self.tasks)

    def is_materialized(self):
        return False
    
    def get_output_keys(self) -> AbstractSet[str | bytes | int | float]:
        return set(self.tasks.keys())
    
    def poll_api(self):
        '''Poll the API and fetch the files'''
        for i in range(len(self.tasks)):
            print(f"Fetching info for file output_{i}: {dummy_filename}")
            time.sleep(2)  # simulate delay
            self.queue.put((dummy_filename, 'treeme'))
    
    def get_file(self, name):
        '''Return the info that is needed by uproot to actually open the file'''
        print(f"Returning info for file {name}")
        return self.queue.get()

And the layer that will load files from the above.

In [5]:
class URLoaderLayer(Layer):
    def __init__(self, name, sx_layer_name, n_files):
        super().__init__()
        self.name = name
        self.dependencies = {name: sx_layer_name}
        self.tasks = {
            (name, i): (lambda f_name: self.get_data(f_name), f'output_{i}')
            for i in range(n_files)
        }

    def __getitem__(self, key):
        return self.tasks[key]

    def __iter__(self):
        return iter(self.tasks)

    def __len__(self):
        return len(self.tasks)

    def is_materialized(self):
        return False
    
    def get_output_keys(self) -> AbstractSet[str | bytes | int | float]:
        return set(self.tasks.keys())
    
    def get_data(self, name):
        '''Open the file with uproot and return the tree's arrays. `name` is a (filename, treename) tuple.'''
        # TODO: This is swallowed unless we use a dask `LocalCluster`.
        logging.warning(f"Returning info for file {name}")
        with uproot.open(name[0]) as file:
            return file[name[1]].arrays()

OK, let's build up the array.

In [6]:
# the layers

sx_layer = SXLayerBW("sx_fetcher", 1)
loader_layer = URLoaderLayer("uproot_loader", "sx_fetcher", 1)

# Now, the high-level graph...
hlg = HighLevelGraph(
    layers={sx_layer.name: sx_layer, loader_layer.name: loader_layer},
    dependencies={loader_layer.name: {sx_layer.name}, sx_layer.name: set()},
)

# And finally the array...
ar = dak.core.new_array_object(hlg, "uproot_loader", meta=metadata, npartitions=1)

Fetching info for file output_0: 0fc6e51a5ea6dea107c195591d20a1b2-15.26710677._000019.pool.root.1


In [7]:
ar.JetPt[ar.JetPt > 100.0].compute()

2024-02-24 22:33:45,028 - distributed.protocol.pickle - ERROR - Failed to serialize <ToPickle: HighLevelGraph with 3 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x25c91d93390>
 0. sx_fetcher
 1. uproot_loader
 2. getitem-917c0def67e4544c397316e729412f33
>.
Traceback (most recent call last):
  File "c:\Users\gordo\Code\iris-hep\awkward-20-testing\.venv\Lib\site-packages\distributed\protocol\pickle.py", line 63, in dumps
    result = pickle.dumps(x, **dump_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: Can't pickle local object 'unpack_collections.<locals>.repack'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\Users\gordo\Code\iris-hep\awkward-20-testing\.venv\Lib\site-packages\distributed\protocol\pickle.py", line 68, in dumps
    pickler.dump(x)
AttributeError: Can't pickle local object 'unpack_collections.<locals>.repack'

During handling of the above exception, another exception occurr

TypeError: ('Could not serialize object of type HighLevelGraph', '<ToPickle: HighLevelGraph with 3 layers.\n<dask.highlevelgraph.HighLevelGraph object at 0x25c91d93390>\n 0. sx_fetcher\n 1. uproot_loader\n 2. getitem-917c0def67e4544c397316e729412f33\n>')
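
The first `AttributeError` in the traceback is the standard `pickle` complaint about a function defined inside another function. The snippet below is a minimal sketch of that behaviour using only the standard library. `make_local_function` and `can_pickle` are hypothetical names for this illustration, and it does not try to reproduce whatever fallback `distributed` attempts in the truncated part of the traceback.

```python
import math
import pickle


def make_local_function():
    """Return a function defined in a local scope, like `repack` in the traceback."""
    def repack(value):
        return value
    return repack


def can_pickle(obj):
    """Hypothetical helper: True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
    except (AttributeError, pickle.PicklingError, TypeError):
        return False
    return True


# Functions are pickled by reference (module plus qualified name), so a
# function that only exists inside another function's scope cannot be found.
local_ok = can_pickle(make_local_function())

# A lambda has no importable name either, so it fails for the same reason.
lambda_ok = can_pickle(lambda f_name: f_name)

# A function that can be imported by name pickles fine.
importable_ok = can_pickle(math.sqrt)

print(local_ok, lambda_ok, importable_ok)
```

Note that the tasks in `URLoaderLayer` are built from a lambda that closes over `self`, so they fall into the same category as far as plain `pickle` is concerned.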