# Combine Dask and RDataFrame worlds

A basic example of how to combine the two worlds. \
The idea is to extract results from two different ROOT files in parallel using a local Dask scheduler. \
Workflow:
* define a function that takes as input the path to a ROOT file and the name of the TTree stored in it, from which we want to get results;
* set up a LocalCluster with two workers;
* feed the scheduler with the required parameters; inside each worker, an RDataFrame is created and the event loop is run;
* the results are returned as futures, so we need to run ```client.gather(futures)``` to get the actual results.

In [None]:
import ROOT
from dask.distributed import Client, LocalCluster
import uproot4

First, let's take a quick look at the TTrees

In [None]:
f = uproot4.open('data/tnp1.root')
t = f['Data_13TeV_All']

t.show()
print('\nEntries: {}'.format(t.num_entries))

In [None]:
f = uproot4.open('data/tnp2.root')
t = f['Data_13TeV_All']

t.show()
print('\nEntries: {}'.format(t.num_entries))

In [None]:
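def get_results(tree_name, root_file):
    # Import ROOT inside the function so that each Dask worker loads it itself
    import ROOT
    rdf = ROOT.RDataFrame(tree_name, root_file)
    # Convert the column names to plain Python strings
    names = [str(name) for name in rdf.GetColumnNames()]
    # Book one histogram per column; actions are lazy, so nothing runs yet
    ptrs = [rdf.Histo1D(name) for name in names]
    # The first GetValue() triggers a single event loop that fills all histograms
    results = [ptr.GetValue() for ptr in ptrs]
    return results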

### Watch out!
* we can't directly feed the function with an instance of RDataFrame, since it cannot be serialized (yet), see https://github.com/root-project/root/issues/6765
* since ```cloudpickle``` is not able to serialize ```ROOTFacade```, it is necessary to re-import ROOT inside the function (see https://github.com/cloudpipe/cloudpickle/issues/397)
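
One way to stay clear of both issues is to pass only plain Python values (strings, numbers) to the submitted function and to build every ROOT object inside it. As a minimal sketch, the hypothetical helper ```is_picklable``` below checks task arguments with the standard ```pickle``` module before they are submitted. This is only an approximation: Dask can also fall back to ```cloudpickle```, which handles more cases, so a failure here does not strictly imply a failure in Dask.

```python
import pickle
import threading


def is_picklable(obj):
    """Return True if obj survives a round trip through the standard pickle module.

    Hypothetical helper, not part of Dask or ROOT.
    """
    try:
        pickle.loads(pickle.dumps(obj))
    except Exception:
        return False
    return True


# Plain strings (tree name, file path) are safe task arguments
print(is_picklable(('Data_13TeV_All', 'data/tnp1.root')))
# A lock stands in for an object that cannot be serialized
print(is_picklable(threading.Lock()))
```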

In [None]:
# Start a local cluster with exactly two workers
cluster = LocalCluster(n_workers=2)
client = Client(cluster)

In [None]:
client

In [None]:
%%time

futures = []
for file_name in ['data/tnp1.root', 'data/tnp2.root']:
    futures.append(client.submit(get_results, 'Data_13TeV_All', file_name))

histos = [histo for sublist in client.gather(futures) for histo in sublist]
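
The flattened list above no longer records which file each histogram came from. A minimal sketch of keeping the results grouped by file follows; ```gather_by_file``` is a hypothetical helper, and it only assumes that ```client``` offers the ```submit``` and ```gather``` methods used in the cell above.

```python
def gather_by_file(client, func, tree_name, file_names):
    """Submit one task per file and return {file_name: result}.

    Hypothetical helper: `client` is expected to be a dask.distributed.Client
    (only its submit() and gather() methods are used).
    """
    futures = [client.submit(func, tree_name, name) for name in file_names]
    # gather() returns the results in the same order as the futures it is given
    results = client.gather(futures)
    return dict(zip(file_names, results))
```

With the objects defined above, ```gather_by_file(client, get_results, 'Data_13TeV_All', ['data/tnp1.root', 'data/tnp2.root'])``` would return a dictionary holding one list of histograms per file.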

In [None]:
client.close()
cluster.close()

## Considerations

As you have probably noticed, extracting the results from the two RDataFrames this way takes *longer* than running them sequentially with multithreading enabled. This example is only meant to show how the two worlds can be combined; on a single machine, running with ```EnableImplicitMT``` activated is probably the best way to reach optimal performance. \
Also, the small size of the events makes this approach inconvenient in this specific case. \
A better application would be to run the same operation on a distributed system.
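
For comparison, here is a sketch of the single-machine alternative mentioned above: the files are processed one after the other, while ROOT's implicit multithreading parallelizes each event loop. ```get_results_mt``` is a hypothetical name, and how the timing compares on a given machine has to be measured rather than assumed.

```python
def get_results_mt(tree_name, root_files):
    """Process the files sequentially with ROOT implicit multithreading.

    Hypothetical counterpart of get_results; requires a ROOT installation.
    """
    import ROOT

    # Let ROOT choose the number of threads; this must happen before
    # any RDataFrame is constructed
    ROOT.EnableImplicitMT()

    ptrs = []
    for root_file in root_files:
        rdf = ROOT.RDataFrame(tree_name, root_file)
        # Book one histogram per column of this file (lazy, nothing runs yet)
        ptrs.extend(rdf.Histo1D(str(name)) for name in rdf.GetColumnNames())

    # Each file's event loop runs when the first of its results is requested
    return [ptr.GetValue() for ptr in ptrs]
```

It would be called as ```get_results_mt('Data_13TeV_All', ['data/tnp1.root', 'data/tnp2.root'])```, without any Dask cluster.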

### If we want to see the histograms (ROOT canvases)

In [None]:
canvases = []

for histo in histos:
    canvas = ROOT.TCanvas(histo.GetName(), histo.GetName())
    histo.Draw()
    canvases.append(canvas)

for canvas in canvases:
    canvas.Draw()