# Illustration of the BulkRead → Numpy interface

For this demo, we only need to import ROOT and load a sample file with a TTree.

In [None]:
import ROOT

f = ROOT.TFile("/mnt/data/DYJetsToLL_M_50_HT_100to200_13TeV_2/DYJetsToLL_M_50_HT_100to200_13TeV_2_0.root")
t = f.Get("Events")

`TTree` has a new, Python-only method: `GetNumpyIterator`. Its arguments are branch names (strings) or PyROOT `TBranch` and `TLeaf` objects.

It returns an iterator over _clusters_ that yields

   * the first entry number (inclusive)
   * the last entry number (exclusive)
   * one Numpy array per requested branch, holding that branch's data for the cluster; the arrays may have different lengths!

In [None]:
it = t.GetNumpyIterator("evtNum", "nPU", "pfMET")

In [None]:
next(it)

In [None]:
for start, stop, evtNum, nPU, pfMET in it:
    print(start, stop, evtNum, nPU, pfMET)

This also works for branches with counters, but you have to request the counter explicitly (here `Muon_`).

In [None]:
it = t.GetNumpyIterator("Muon_", "Muon.pt", "Muon.eta", "Muon.phi")
next(it)

In [None]:
start, stop, counter, pt, eta, phi = next(it)
print("first 15: " + " ".join("{:.1f}".format(x) for x in pt[:15]))
print("last 300: " + " ".join("{}".format(int(x)) for x in pt[-300:]))  # why the zeros?

The array you get for a leaf with a counter needs to be clipped by the counter: only the first N elements are meaningful, where N is the sum of the counter over the cluster.

In [None]:
start, stop, numMuons, pt, eta, phi = next(it)

total = numMuons.sum()   # (vectorized by Numpy)

print(pt[:total])
print(eta[:total])
print(phi[:total])
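
Once clipped, the flat array can be cut back into one piece per event using the counter. The following is a minimal sketch with made-up numbers (not read from the sample file); it assumes, as the zeros above suggest, that everything past the counter total is unused padding.

```python
import numpy as np

# Hypothetical cluster: 4 events with 2, 0, 3 and 1 muons respectively.
num_muons = np.array([2, 0, 3, 1], dtype=np.int32)

# Hypothetical data buffer: 6 real values followed by zero padding.
pt = np.array([41.2, 17.5, 63.0, 25.1, 9.8, 30.4, 0.0, 0.0, 0.0, 0.0])

# Clip by the counter total, as in the cell above.
total = int(num_muons.sum())
flat_pt = pt[:total]

# Cumulative counts give the boundaries between consecutive events.
offsets = np.cumsum(num_muons)[:-1]
per_event_pt = np.split(flat_pt, offsets)

for n, values in zip(num_muons, per_event_pt):
    print(n, values)
```

Events with no muons come out as empty arrays, so `per_event_pt` always has one entry per event.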

This is a low-level interface!

String arguments are first interpreted as branches, and failing that, leaves. For more control, pass PyROOT objects.

In [None]:
leaf = t.GetLeaf("Muon_")         # counter is a TLeaf
branch = t.GetBranch("Muon.pt")   # data is a TBranch (that happens to contain only one TLeaf)

# take only the first cluster
for start, stop, numMuons, pt in t.GetNumpyIterator(leaf, branch):
    break

total = numMuons.sum()
pt[:total]
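
The lookup order described above for string arguments (branch first, then leaf) can be imitated in plain Python. This is only a sketch of that order, not the code `GetNumpyIterator` actually runs; `resolve` is a hypothetical helper, and it relies on `TTree.GetBranch` and `TTree.GetLeaf` returning a null pointer for unknown names.

```python
def resolve(tree, name):
    """Return the TBranch called `name`, else the TLeaf, else raise KeyError.

    Hypothetical helper that mirrors the branch-then-leaf order in the text.
    """
    branch = tree.GetBranch(name)
    if branch:  # a missing branch comes back as a null pointer, which is falsy
        return branch
    leaf = tree.GetLeaf(name)
    if leaf:
        return leaf
    raise KeyError("no branch or leaf named {!r}".format(name))
```

With the tree loaded above, `resolve(t, "Muon.pt")` would hand back a `TBranch`; a name that matches only a leaf would hand back a `TLeaf`.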

What if you want to fill one big array (e.g. to build a Pandas DataFrame)?

A second method, `GetNumpyIteratorInfo`, takes the same arguments and returns, for each one, everything you need to allocate such an array:

   * `TBranch`/`TLeaf` name
   * Numpy data type
   * total `TBranch`/`TLeaf` size (as a Numpy shape, which supports multiple dimensions)
   * name of the counter, or `None`

In [None]:
t.GetNumpyIteratorInfo("nPU", "pfMET", "Muon_", "Muon.pt", "Muon.eta", "Muon.phi")
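
To illustrate how that information might be used, here is a minimal allocation sketch. The `info` list is hand-written and hypothetical: it assumes one `(name, dtype, shape, counter)` tuple per argument, following the four items listed above, and the sizes are invented rather than taken from the sample file.

```python
import numpy as np

# Hypothetical stand-in for the result of t.GetNumpyIteratorInfo(...).
info = [
    ("nPU", np.dtype("int32"), (1000,), None),
    ("pfMET", np.dtype("float32"), (1000,), None),
    ("Muon.pt", np.dtype("float32"), (2500,), "Muon_"),
]

# One preallocated output array per branch, sized for the whole tree.
arrays = {name: np.empty(shape, dtype=dtype)
          for name, dtype, shape, counter in info}

# Remember which branches have to be clipped by a counter.
counters = {name: counter
            for name, dtype, shape, counter in info
            if counter is not None}

print({name: (a.dtype, a.shape) for name, a in arrays.items()})
print(counters)
```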

The higher-level interface would allocate the array (possibly calling external Python libraries) and then iterate over the clusters to fill it, _in Python._

Slow Python statements are therefore executed only once per cluster, i.e. once every ~2000 events.
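
A minimal sketch of such a fill loop, for a single branch without a counter. Since the iterator needs a ROOT file, `fake_clusters` is a hypothetical pure-Python stand-in that yields `(start, stop, array)` the way the iterator is described above; the data and the cluster size are made up.

```python
import numpy as np

def fake_clusters(values, cluster_size):
    """Hypothetical stand-in for t.GetNumpyIterator("pfMET")."""
    for start in range(0, len(values), cluster_size):
        stop = min(start + cluster_size, len(values))
        yield start, stop, values[start:stop]

source = np.arange(10, dtype=np.float32)       # pretend branch contents
out = np.empty(len(source), dtype=np.float32)  # allocated once, up front

num_clusters = 0
for start, stop, pfMET in fake_clusters(source, cluster_size=4):
    # The only Python-level work per cluster: one vectorized slice copy.
    out[start:stop] = pfMET[:stop - start]
    num_clusters += 1

print(num_clusters, out)
```

The loop body runs once per cluster rather than once per event, which is where the speed comes from. A branch with a counter would need its own running offset instead of `start` and `stop`, because its arrays do not have one element per entry.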

What's missing?

   * Handling branches containing subbranches? Probably not; this ought to be solved in the high-level interface.
   * Handling branches with multiple leaves: should produce Numpy record arrays. It looks like this hasn't been handled in the bulk reader yet, either.
   * A high-level interface, which ought to be written in Python. Put it next to `ROOT.py` and `cppyy.py`?
   * Testing `TTree`s with different data, and deciding where to draw the line on what is supported.