# Illustration of the BulkRead → Numpy interface

For this demo, we only need to import ROOT and load a sample file with a TTree.

In [1]:
import ROOT

f = ROOT.TFile("/mnt/data/DYJetsToLL_M_50_HT_100to200_13TeV_2/DYJetsToLL_M_50_HT_100to200_13TeV_2_0.root")
t = f.Get("Events")

Welcome to JupyROOT 6.11/01


`TTree` has a new, Python-only method: `GetNumpyIterator`. As arguments, it takes strings (branch or leaf names) or PyROOT `TBranch` and `TLeaf` objects.

It returns an iterator over _clusters_. For each cluster, it yields a tuple containing

   * the first entry number (inclusive)
   * the last entry number (exclusive)
   * a Numpy array for each branch's cluster data; they may have different lengths!

In [2]:
it = t.GetNumpyIterator("evtNum", "nPU", "pfMET")

In [3]:
next(it)

(0L,
 2008L,
 array([    7,    17,    25, ..., 22210, 22217, 22231], dtype=uint32),
 array([17, 11, 12, ..., 17,  9,  7], dtype=uint32),
 array([ 18.57892227,   4.96862411,  21.03972626, ...,  18.01696777,
         39.97037506,  27.76272392], dtype=float32))

In [4]:
for start, stop, evtNum, nPU, pfMET in it:
    print(start, stop, evtNum, nPU, pfMET)

(2008L, 3904L, array([5565798, 5565816, 5565817, ..., 5661162, 5661178, 5661182], dtype=uint32), array([15, 15, 19, ..., 20, 11, 10], dtype=uint32), array([ 19.87429047,  55.72983932,   4.62053728, ...,  66.39812469,
        23.06248093,  34.19698334], dtype=float32))
(3904L, 5846L, array([211484, 211487, 211492, ..., 191586, 191592, 191595], dtype=uint32), array([ 5, 13, 13, ...,  9, 14,  5], dtype=uint32), array([ 13.88227844,   8.45461082,  32.39862823, ...,  34.93431091,
        35.23172379,   5.10090208], dtype=float32))
(5846L, 7832L, array([323571, 323578, 323579, ..., 477783, 477787, 477794], dtype=uint32), array([14,  8, 22, ..., 23, 10, 11], dtype=uint32), array([  41.2771225 ,   10.54334831,  162.40769958, ...,   18.25775337,
         40.62509155,    3.33152866], dtype=float32))
(7832L, 9805L, array([545372, 545402, 545404, ..., 515940, 515949, 515950], dtype=uint32), array([11, 11, 13, ..., 11, 19,  9], dtype=uint32), array([ 173.39450073,   36.2428627 ,   13.97952175, ...,

This also works for branches with counters, but you have to request the counter explicitly.

In [5]:
it = t.GetNumpyIterator("Muon_", "Muon.pt", "Muon.eta", "Muon.phi")
next(it)

(0L,
 2008L,
 array([2, 0, 2, ..., 0, 0, 0], dtype=int32),
 array([  6.83854828e+01,   2.22999859e+01,   4.49833145e+01, ...,
          1.16027513e-42,   1.16027513e-42,   1.16027513e-42], dtype=float32),
 array([ -1.11078656e+00,  -1.83841538e+00,  -3.87242258e-01, ...,
          1.16167643e-42,   1.16167643e-42,   1.16167643e-42], dtype=float32),
 array([  9.89233196e-01,  -1.49590659e+00,  -2.65068436e+00, ...,
          1.16167643e-42,   1.16167643e-42,   1.16167643e-42], dtype=float32))

In [6]:
start, stop, counter, pt, eta, phi = next(it)
print("first 15: " + " ".join("{:.1f}".format(x) for x in pt[:15]))
print("last 300: " + " ".join("{}".format(int(x)) for x in pt[-300:]))  # why the zeros?

first 15: 24.6 5.0 40.3 4.8 3.7 55.5 4.6 35.0 19.1 6.0 5.2 16.4 3.7 67.6 54.4
last 300: 11 90 38 8 4 13 105 22 64 19 3 9 8 10 5 3 5 9 5 69 6 64 63 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


The array you get for a branch with a counter is longer than the real data: anything past the sum of the counter values is not real data, so the array needs to be clipped to that sum.

In [7]:
start, stop, numMuons, pt, eta, phi = next(it)

total = numMuons.sum()   # (vectorized by Numpy)

print(pt[:total])
print(eta[:total])
print(phi[:total])

[   4.54250288   60.07530212   20.19425011 ...,  106.60784149    9.09115505
   10.407094  ]
[ 2.1142087  -2.24989772 -0.55641216 ...,  0.19803496  2.3346858
  0.72767466]
[-2.8570509   1.65407753 -2.3271699  ...,  1.62292087  0.2934818
 -1.28502655]
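
Once clipped, the flat array can be regrouped into one sub-array per event using the counter alone. The following is a minimal sketch in plain Numpy; the `counter` and `flat` values are made-up stand-ins for one cluster (not data from the file), and the padding is modeled as zeros.

```python
import numpy as np

# Hypothetical stand-ins for one cluster: 4 events with 2, 0, 3 and 1 muons,
# followed by padding beyond the counter total.
counter = np.array([2, 0, 3, 1], dtype=np.int32)
flat = np.array([68.4, 22.3, 45.0, 5.1, 9.9, 31.2, 0.0, 0.0], dtype=np.float32)

total = int(counter.sum())
clipped = flat[:total]                 # drop the padding

# offsets[i] is where event i starts; offsets[i + 1] is where it ends.
offsets = np.zeros(len(counter) + 1, dtype=np.int64)
offsets[1:] = np.cumsum(counter)

# One (possibly empty) slice per event; slices are views, not copies.
per_event = [clipped[offsets[i]:offsets[i + 1]] for i in range(len(counter))]
```

Keeping the flat array plus an offsets array avoids creating one Python object per event; the list of slices is only built here to make the grouping visible.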


This is a low-level interface!

String arguments are first interpreted as branches, and failing that, leaves. For more control, pass PyROOT objects.

In [8]:
leaf = t.GetLeaf("Muon_")         # counter is a TLeaf
branch = t.GetBranch("Muon.pt")   # data is a TBranch (that happens to contain only one TLeaf)

for start, end, numMuons, pt in t.GetNumpyIterator(leaf, branch):
    break

total = numMuons.sum()
pt[:total]

array([ 68.38548279,  22.29998589,  44.98331451, ...,  16.44164467,
        17.07163429,   6.40282822], dtype=float32)

A second method, `GetNumpyIteratorInfo`, takes the same arguments and returns, for each one, everything you need to allocate such an array:

   * `TBranch`/`TLeaf` name
   * Numpy data type
   * Total `TBranch`/`TLeaf` size (as a Numpy shape, which supports multiple dimensions)
   * Name of the counter, or `None`

In [9]:
t.GetNumpyIteratorInfo("nPU", "pfMET", "Muon_", "Muon.pt", "Muon.eta", "Muon.phi")

((u'nPU', dtype('uint32'), (101861L,), None),
 (u'pfMET', dtype('float32'), (101889L,), None),
 (u'Muon', dtype('int32'), (233936L,), None),
 (u'Muon.pt', dtype('float32'), (191095L,), 'Muon_'),
 (u'Muon.eta', dtype('float32'), (190986L,), 'Muon_'),
 (u'Muon.phi', dtype('float32'), (190986L,), 'Muon_'))

This second function gives us enough information to allocate (whole branch) arrays wherever we like and then fill them with a Python loop over clusters.

Slow Python statements are invoked only once per cluster, roughly once every 2000 events.
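
The allocate-then-fill pattern might look like the sketch below. It does not touch ROOT: `fake_cluster_iterator` and `info` are hypothetical stand-ins that mimic the shapes of what `GetNumpyIterator` and `GetNumpyIteratorInfo` return above, with made-up values. It also assumes branches without counters and exactly one value per entry in each cluster; branches with counters would need a running offset instead of the entry numbers.

```python
import numpy as np

def fake_cluster_iterator():
    # Hypothetical stand-in for t.GetNumpyIterator("nPU", "pfMET"):
    # yields (start, stop, nPU_array, pfMET_array) for each cluster.
    boundaries = [0, 3, 5, 9]
    for start, stop in zip(boundaries[:-1], boundaries[1:]):
        yield (start, stop,
               np.arange(start, stop, dtype=np.uint32),
               np.linspace(0.0, 1.0, stop - start).astype(np.float32))

# Hypothetical stand-in for the GetNumpyIteratorInfo result:
# (name, dtype, shape, counter) for each requested branch.
info = (("nPU", np.dtype("uint32"), (9,), None),
        ("pfMET", np.dtype("float32"), (9,), None))

# Allocate each whole-branch array once, up front.
arrays = {}
for name, dtype, shape, counter in info:
    arrays[name] = np.empty(shape, dtype=dtype)

# One Python iteration per cluster; the copy itself is vectorized by Numpy.
for cluster in fake_cluster_iterator():
    start, stop = cluster[0], cluster[1]
    for (name, dtype, shape, counter), data in zip(info, cluster[2:]):
        arrays[name][start:stop] = data
```

Because the destination arrays are allocated by the caller, they could just as well live in shared memory or a memory-mapped file; only the `np.empty` call would change.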

By default, the iterator _allocates_ and _fills_ cluster-sized Numpy arrays in the loop. If we pass `return_new_buffers=False`, it wraps internal data as a read-only view.

This is not the default because it's dangerous: a user could view freed memory by looking at an array object outside the loop.
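
The hazard can be illustrated without ROOT. In the sketch below, `internal` and `fake_view_iterator` are hypothetical stand-ins: a single buffer is overwritten for every cluster and handed out as a read-only view. This models reused memory rather than freed memory (which is worse), but the symptom is similar: an array kept past its loop iteration no longer shows the data it had when it was yielded.

```python
import numpy as np

# Hypothetical stand-in for an internal buffer that is reused for every cluster.
internal = np.zeros(4, dtype=np.float32)

def fake_view_iterator():
    for cluster_index in range(3):
        internal[:] = cluster_index + 1   # buffer overwritten in place
        view = internal.view()
        view.flags.writeable = False      # read-only view of the shared buffer
        yield view

kept_views = []
kept_copies = []
for arr in fake_view_iterator():
    kept_views.append(arr)           # unsafe: still points at the shared buffer
    kept_copies.append(arr.copy())   # safe: owns its own memory

first_view_value = float(kept_views[0][0])    # shows the last cluster's data
first_copy_value = float(kept_copies[0][0])   # still the first cluster's data
```

When views are used, anything that must outlive the current iteration should be copied (or reduced to a result) before the next cluster is requested.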

What's missing?

   * ~~Handling branches containing subbranches?~~ Probably not; this ought to be handled in the high-level interface so that the low-level is only concerned with delivering data quickly.
   * Branches with multiple leaves should yield Numpy record arrays. It looks like this hasn't been handled in the bulk reader yet, either.
   * A high-level interface, which ought to be written in Python. Put it next to ROOT.py and Cppyy.py?
   * Testing `TTree`s containing different data structures, and deciding where to draw the line between what's supported and what's out of scope.
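
For the record-array item, this is roughly what the result could look like on the Numpy side. It is only a sketch of Numpy's structured dtypes: the leaf names `x` and `n`, their types, and the packed byte layout are all assumptions for illustration, and nothing here reflects how the bulk reader actually lays out multi-leaf branches.

```python
import numpy as np

# Hypothetical branch with two leaves: "x" (float32) and "n" (int32).
record_dtype = np.dtype([("x", np.float32), ("n", np.int32)])

# Stand-in for a raw cluster buffer: 3 entries, leaves interleaved per entry.
entries = np.array([(1.5, 10), (2.5, 20), (3.5, 30)], dtype=record_dtype)
raw = entries.tobytes()

# Interpret the bytes as a record array without copying them.
cluster = np.frombuffer(raw, dtype=record_dtype)

xs = cluster["x"]   # all "x" values, as a strided view
ns = cluster["n"]   # all "n" values, as a strided view
```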

Code differences (with respect to Brian's `root-bulkapi-fastread-v2` branch):

   * [GitHub diff](https://github.com/bbockelm/root/compare/root-bulkapi-fastread-v2...jpivarski:root-bulkapi-fastread-v2)