## uproot overview

Uproot reads ROOT files using only Python and NumPy.

   * Without a C++ layer, there are no memory ownership issues between C++ and Python.
   * Different design: instead of delivering event objects, uproot delivers columns of data as (jagged) arrays.
   * Not hampered by slow Python execution: data in ROOT files are already laid out as (jagged) arrays, so they only need to be cast as NumPy arrays.
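
To make the last point concrete, here is a minimal sketch of what "casting" means, assuming NumPy is available. It is not uproot's actual implementation: the byte string `raw` and the `offsets` values are made up for illustration. A jagged array is represented here as flat `content` plus entry boundaries.

```python
import numpy as np

# Made-up stand-in for the decompressed bytes of one column:
# five big-endian 32-bit floats (ROOT stores numbers big-endian).
raw = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=">f4").tobytes()

# "Casting": reinterpret the buffer as numbers without a Python loop.
content = np.frombuffer(raw, dtype=">f4")

# Hypothetical entry boundaries: entry i owns content[offsets[i]:offsets[i + 1]].
offsets = np.array([0, 2, 2, 5])

def entry(i):
    """Return the variable-length slice of values belonging to entry i."""
    return content[offsets[i]:offsets[i + 1]]

counts = np.diff(offsets)          # number of values in each entry
print(counts.tolist())             # [2, 0, 3]
print(entry(2).tolist())           # [3.0, 4.0, 5.0]
```

Nothing is copied or parsed per value: the variable-length structure lives entirely in `offsets`, which is why the second entry can be empty at no cost.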

_(Disclosure: I'm the author of uproot.)_

In uproot, files, directories within files, and TTrees/TBranches behave like Python dicts.

In [1]:
import uproot
file = uproot.open("http://scikit-hep.org/uproot/examples/Event.root")
file.keys()

[b'ProcessID0;1', b'htime;1', b'T;1', b'hstat;1']

In [2]:
file["ProcessID0"]

<TProcessID b'ProcessID0' at 0x7cc1412c57f0>

In [3]:
file["htime"]

<b'TH1F' b'htime' 0x7cc1412daae8>

In [4]:
tree = file["T"]
tree

<TTree b'T' at 0x7cc1412c5c50>

In [5]:
tree.keys()   # allkeys() also lists nested subbranches

[b'event']
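
The file keys shown earlier are bytestrings with a `;1` suffix (the ROOT cycle number), yet lookups such as `file["htime"]` work without it. If you want plain names for display, a small helper along these lines would do. `strip_cycle` is a hypothetical name, not part of uproot, and the key list is copied from the `file.keys()` output.

```python
# Keys as printed by file.keys() above.
keys = [b'ProcessID0;1', b'htime;1', b'T;1', b'hstat;1']

def strip_cycle(key: bytes) -> str:
    """Decode a key and drop a trailing ';<cycle>' suffix if there is one."""
    text = key.decode("utf-8")
    name, sep, _cycle = text.rpartition(";")
    return name if sep else text

names = [strip_cycle(key) for key in keys]
print(names)   # ['ProcessID0', 'htime', 'T', 'hstat']
```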

To get a sense of what a TTree contains, use `show`.

In [6]:
tree.show()

event                      TStreamerInfo              None
TObject                    TStreamerInfo              None
fUniqueID                  TStreamerBasicType         asdtype('>u4')
fBits                      TStreamerBasicType         asdtype('>u4')

fType[20]                  TStreamerBasicType         asdtype("('i1', (20,))")
fEventName                 TStreamerBasicType         asstring(4)
fNtrack                    TStreamerBasicType         asdtype('>i4')
fNseg                      TStreamerBasicType         asdtype('>i4')
fNvertex                   TStreamerBasicType         asdtype('>u4')
fFlag                      TStreamerBasicType         asdtype('>u4')
fTemperature               TStreamerBasicType         asdtype('>f4', 'float64')
fMeasures[10]              TStreamerBasicType         asdtype("('>i4', (10,))")
fMatrix[4][4]              TStreamerBasicType         asdtype("('>f4', (4, 4))", "('<f8', (4, 4))")
fClosestDistance           TStreamerBasicPointer      None
fEv

To read a (jagged) array, call `array` or `arrays`.

In [7]:
tree["fTracks.fMass2"].array()

<JaggedArray [[4.5 4.5 4.5 ... 4.5 4.5 4.5] [4.5 4.5 4.5 ... 4.5 4.5 4.5] [8.90625 8.90625 8.90625 ... 8.90625 8.90625 8.90625] ... [4.5 4.5 4.5 ... 4.5 4.5 4.5] [4.5 4.5 4.5 ... 4.5 4.5 4.5] [8.90625 8.90625 8.90625 ... 8.90625 8.90625 8.90625]] at 0x7cc1401ecb70>

In [8]:
tree.array("fTracks.fMass2")

<JaggedArray [[4.5 4.5 4.5 ... 4.5 4.5 4.5] [4.5 4.5 4.5 ... 4.5 4.5 4.5] [8.90625 8.90625 8.90625 ... 8.90625 8.90625 8.90625] ... [4.5 4.5 4.5 ... 4.5 4.5 4.5] [4.5 4.5 4.5 ... 4.5 4.5 4.5] [8.90625 8.90625 8.90625 ... 8.90625 8.90625 8.90625]] at 0x7cc14013f828>

In [9]:
tree.arrays(["fTracks.fMass2", "fTracks.fCharge"])

{b'fTracks.fMass2': <JaggedArray [[4.5 4.5 4.5 ... 4.5 4.5 4.5] [4.5 4.5 4.5 ... 4.5 4.5 4.5] [8.90625 8.90625 8.90625 ... 8.90625 8.90625 8.90625] ... [4.5 4.5 4.5 ... 4.5 4.5 4.5] [4.5 4.5 4.5 ... 4.5 4.5 4.5] [8.90625 8.90625 8.90625 ... 8.90625 8.90625 8.90625]] at 0x7cc14011ef60>,
 b'fTracks.fCharge': <JaggedArray [[1.0 1.0 1.0 ... 1.0 1.0 0.0] [1.0 0.0 0.0 ... 0.0 1.0 -1.0] [-1.0 1.0 1.0 ... -1.0 1.0 1.0] ... [1.0 1.0 1.0 ... 0.0 -1.0 1.0] [0.0 0.0 1.0 ... 1.0 0.0 1.0] [1.0 -1.0 0.0 ... 0.0 0.0 1.0]] at 0x7cc14011e320>}

## Interpretations

The translation from ROOT data to an array is given by the branch's `interpretation` (if it has one).

In [10]:
tree["fNtrack"].interpretation

asdtype('>i4')

In [11]:
tree["fTemperature"].interpretation

asdtype('>f4', 'float64')

In [12]:
tree["fMatrix[4][4]"].interpretation

asdtype("('>f4', (4, 4))", "('<f8', (4, 4))")
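
An interpretation such as `asdtype('>f4', 'float64')` pairs the on-disk type with the in-memory type. A rough NumPy equivalent, assuming nothing about uproot's internals, is a reinterpretation followed by a conversion. The byte strings here are fabricated for illustration.

```python
import numpy as np

# Made-up on-disk bytes: three big-endian float32 temperatures.
raw = np.array([20.5, 21.0, 19.25], dtype=">f4").tobytes()

# Like asdtype('>f4', 'float64'): view as '>f4', then convert to float64.
temperature = np.frombuffer(raw, dtype=">f4").astype("float64")
print(temperature.dtype, temperature.tolist())   # float64 [20.5, 21.0, 19.25]

# Like asdtype("('>f4', (4, 4))", "('<f8', (4, 4))"): each entry is a fixed 4x4 block.
raw_matrices = np.arange(32, dtype=">f4").tobytes()   # two entries of 16 values each
matrices = np.frombuffer(raw_matrices, dtype=">f4").reshape(-1, 4, 4).astype("<f8")
print(matrices.shape)                                 # (2, 4, 4)
```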

In [13]:
tree["fTracks.fMass2"].interpretation

asjagged(asfloat16(0.0, 0.0, 8, dtype([('exponent', 'u1'), ('mantissa', '>u2')]), dtype('float32')))

In [14]:
tree["fTracks.fCharge"].interpretation

asjagged(asdouble32(-1.0, 1.0, 2, dtype('>u4'), dtype('float64')))
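
The leading arguments of `asdouble32(-1.0, 1.0, 2, ...)` are a lower bound, an upper bound, and a number of bits: values are stored as unsigned integers and rescaled into that range on reading. The sketch below assumes the scaling rule `value = low + stored * (high - low) / 2**numbits`, which should be treated as an assumption rather than a specification. `decode_double32` is a hypothetical helper and the stored integers are invented.

```python
import numpy as np

def decode_double32(stored, low, high, numbits):
    """Rescale stored unsigned integers into floats between low and high.

    Assumed rule: value = low + stored * (high - low) / 2**numbits.
    """
    scale = (high - low) / float(1 << numbits)
    return low + stored.astype("float64") * scale

# Invented stored values, held as big-endian unsigned 32-bit integers.
stored = np.array([0, 2, 4], dtype=">u4")
charges = decode_double32(stored, -1.0, 1.0, 2)
print(charges.tolist())   # [-1.0, 0.0, 1.0]
```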

In [15]:
tree["fH"].interpretation

asgenobj(TH1F)

If a branch has no `interpretation`, it can't be read. Either it's a no-data branch (it exists only for structure) or uproot doesn't yet know how to interpret its type.

In [16]:
print(tree["fTracks.fPointValue"].interpretation)   # as of April 2019, this one has no interpretation

None


The bytes can be read and even divided along entry boundaries, but we don't yet know how to turn the bytes into an array.

In [17]:
uproot.asdebug

asjagged(asdtype('uint8'))

In [18]:
tree["fTracks.fPointValue"].array(uproot.asdebug)

<JaggedArray [[1 85 85 ... 170 170 170] [0 1 85 ... 85 85 85] [0 1 85 ... 85 85 0] ... [0 1 85 ... 170 170 170] [0 0 1 ... 170 170 170] [1 85 85 ... 170 170 170]] at 0x7cc16850f080>

Complex classes are generated based on the ROOT file's self-describing streamers, but they aren't necessarily fast to read (the work happens in Python rather than in NumPy).

In [19]:
tree["fH"].interpretation

asgenobj(TH1F)

In [20]:
histograms = tree["fH"].array()
histograms

<ObjectArray [<b'TH1F' b'hstat' 0x7cc140228188> <b'TH1F' b'hstat' 0x7cc140228318> <b'TH1F' b'hstat' 0x7cc140228778> ... <b'TH1F' b'hstat' 0x7cc140228ef8> <b'TH1F' b'hstat' 0x7cc140228778> <b'TH1F' b'hstat' 0x7cc140228188>] at 0x7cc14026c320>

In [21]:
histograms[0].__dict__

{'_classversion': 1,
 '_fName': b'hstat',
 '_fTitle': b'Event Histogram',
 '_fLineColor': 602,
 '_fLineStyle': 1,
 '_fLineWidth': 1,
 '_fFillColor': 0,
 '_fFillStyle': 1001,
 '_fMarkerColor': 1,
 '_fMarkerStyle': 1,
 '_fMarkerSize': 1.0,
 '_fNcells': 102,
 '_fXaxis': <TAxis b'xaxis' at 0x7cc140105ef0>,
 '_fYaxis': <TAxis b'yaxis' at 0x7cc140105940>,
 '_fZaxis': <TAxis b'zaxis' at 0x7cc140105978>,
 '_fBarOffset': 0,
 '_fBarWidth': 1000,
 '_fEntries': 1.0,
 '_fTsumw': 1.0,
 '_fTsumw2': 1.0,
 '_fTsumwx': 0.28261780738830566,
 '_fTsumwx2': 0.07987282505297344,
 '_fMaximum': -1111.0,
 '_fMinimum': -1111.0,
 '_fNormFactor': 0.0,
 '_fContour': [],
 '_fSumw2': [],
 '_fOption': b'',
 '_fFunctions': [],
 '_fBufferSize': 0,
 '_fBuffer': array([], dtype=float64),
 '_fBinStatErrOpt': 0}

## Fitting into memory constraints

Restricting the range of entries avoids reading too many baskets (chunks on disk).

In [22]:
tree.numentries

1000

In [23]:
tree["fMatrix[4][4]"].numbaskets

5

In [24]:
tree["fMatrix[4][4]"].array(entrystart=600, entrystop=800)

array([[[-1.48842251,  0.22893755,  0.73481667,  0.        ],
        [-0.81405222,  0.32352245,  3.30477071,  0.        ],
        [ 1.07104862,  1.1267221 ,  1.35107005,  0.        ],
        [ 0.        ,  0.        ,  0.        ,  0.        ]],

       [[ 0.72956365,  0.36551571, -0.92489249,  0.        ],
        [-0.08615459,  0.70054299,  0.84200442,  0.        ],
        [-0.27986413,  1.12187469,  2.70830393,  0.        ],
        [ 0.        ,  0.        ,  0.        ,  0.        ]],

       [[ 1.50197184,  0.69582516, -0.31665999,  0.        ],
        [-0.70513338,  1.77296209,  1.1221925 ,  0.        ],
        [-1.39828825,  0.46608233,  3.8889432 ,  0.        ],
        [ 0.        ,  0.        ,  0.        ,  0.        ]],

       ...,

       [[-0.44362187,  0.12667903,  0.78897899,  0.        ],
        [-0.55786729,  0.78306931,  1.66213036,  0.        ],
        [ 1.32299483,  4.01059055,  5.46123695,  0.        ],
        [ 0.        ,  0.        ,  0.        ,  0.
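
To see why an entry range limits how much is read, suppose (hypothetically) that the 5 baskets of this branch each hold 200 of the 1000 entries. The real boundaries may differ. Only baskets that overlap the requested range need to be read and decompressed. `baskets_for_range` is an illustrative helper, not an uproot function.

```python
import numpy as np

# Assumed basket boundaries: basket b holds entries [boundaries[b], boundaries[b + 1]).
boundaries = np.array([0, 200, 400, 600, 800, 1000])

def baskets_for_range(boundaries, entrystart, entrystop):
    """Return indexes of the baskets that overlap [entrystart, entrystop)."""
    if entrystart >= entrystop:
        return []
    first = np.searchsorted(boundaries, entrystart, side="right") - 1
    last = np.searchsorted(boundaries, entrystop, side="left") - 1
    return list(range(int(first), int(last) + 1))

print(baskets_for_range(boundaries, 600, 800))   # [3]
print(baskets_for_range(boundaries, 150, 450))   # [0, 1, 2]
```

Under these assumed boundaries, the range 600 to 800 touches one basket out of five, whereas a range that straddles boundaries pulls in every basket it overlaps.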

Typically, you'd want to read a chunk of entries from all interesting branches, do some work, then move on to the next chunk: use `iterate`.

In [25]:
import numpy
for arrays in tree.iterate(["fTracks.fPx", "fTracks.fPy"], entrysteps=300):
    mag = numpy.sqrt(arrays[b"fTracks.fPx"]**2 + arrays[b"fTracks.fPy"]**2)
    print(len(mag), mag[0][0])

300 2.1687002
300 1.9124396
300 0.6829921
100 0.81746
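
The chunk sizes printed above (300, 300, 300, 100) follow from stepping through 1000 entries 300 at a time. A minimal sketch of that bookkeeping, with the hypothetical helper `entry_ranges` standing in for whatever uproot does internally:

```python
def entry_ranges(numentries, entrysteps):
    """Yield (start, stop) pairs covering [0, numentries) in fixed-size steps."""
    for start in range(0, numentries, entrysteps):
        yield start, min(start + entrysteps, numentries)

# Walk the same 1000 entries in steps of 300; the last chunk is shorter.
for start, stop in entry_ranges(1000, 300):
    print(start, stop, stop - start)
```

Each `(start, stop)` pair could be passed as `entrystart` and `entrystop` to `array`, which is roughly what `iterate` saves you from doing by hand.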


To do the same over a set of files, use `uproot.iterate` (supply the file names, optionally with wildcards, and the tree name).

In [26]:
# wildcards don't work for XRootD and HTTP, so list the files explicitly
filenames = ["http://scikit-hep.org/uproot/examples/HZZ" + x + ".root" for x in ["", "-zlib", "-lz4", "-lzma"]]
for arrays in uproot.iterate(filenames, "events", ["Muon_Px", "Muon_Py"]):
    mag = numpy.sqrt(arrays[b"Muon_Px"]**2 + arrays[b"Muon_Py"]**2)
    print(len(mag), mag[1][0])

2231 24.417913
190 38.49425
2231 24.417913
190 38.49425
2231 24.417913
190 38.49425
2231 24.417913
190 38.49425


## Encodings, outputtypes, and Pandas

In the previous examples, `tree.arrays` returns a dict of arrays. Branch names have no encoding, so the keys of these dicts are bytestrings (a little annoying in Python 3). Here are some things you can do about that.

In [30]:
arrays = tree.arrays(["fTracks.fPx", "fTracks.fPy"], namedecode="utf-8")
arrays.keys()

dict_keys(['fTracks.fPx', 'fTracks.fPy'])

In [33]:
px, py = tree.arrays(["fTracks.fPx", "fTracks.fPy"], outputtype=tuple)
print(px[0][0], py[0][0])

0.8419714 -1.9985858


In [40]:
import collections
arrays = tree.arrays(["fNtrack", "fNseg", "fNvertex"], outputtype=collections.namedtuple)
print(arrays.fNtrack[:5], arrays.fNseg[:5], arrays.fNvertex[:5])

[600 604 603 594 595] [6000 6029 6019 5923 5949] [19 13 14  6 14]
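
If you already have a dict with bytestring keys, much the same result can be had after the fact with plain Python. This sketch uses a stand-in dict (values copied from the output above, cut to three entries) rather than a real `tree.arrays` result, and builds the namedtuple by hand.

```python
import collections

# Stand-in for a dict returned by tree.arrays: bytestring keys, list values.
arrays = {
    b"fNtrack": [600, 604, 603],
    b"fNseg": [6000, 6029, 6019],
    b"fNvertex": [19, 13, 14],
}

# Decode the keys, similar in effect to passing namedecode="utf-8".
decoded = {name.decode("utf-8"): values for name, values in arrays.items()}
print(list(decoded))    # ['fNtrack', 'fNseg', 'fNvertex']

# Pack into a namedtuple for attribute access. This only works when every
# name is a valid Python identifier, so names like "fTracks.fPx" won't do.
Columns = collections.namedtuple("Columns", list(decoded))
columns = Columns(**decoded)
print(columns.fNtrack)  # [600, 604, 603]
```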


In [47]:
import pandas
tree.arrays(["fTracks.fP*"], outputtype=pandas.DataFrame)   # add flatten=True for one row per track

| entry | fTracks.fPx | fTracks.fPy | fTracks.fPz |
|---|---|---|---|
| 0 | [0.8419714, -0.7517185, -1.2051572, -0.427116,... | [-1.9985858, -1.037447, 0.43163738, 1.1485726,... | [2.1687002, 1.2811624, 1.2801229, 1.2254171, 0... |
| 1 | [1.6047764, -1.0758387, -1.3932744, 1.3841597,... | [-1.4657867, 0.98966086, -0.40328622, 2.375182... | [2.1734393, 1.4617994, 1.4504666, 2.7490704, 1... |
| 2 | [-0.07160589, 0.66182005, -1.27176, 0.1688395,... | [-0.96637154, 0.63050216, 0.80289334, -1.44528... | [0.96902084, 0.91407806, 1.5039984, 1.4551162,... |
| 3 | [-0.14304866, -0.72015625, -0.26054904, -1.597... | [-0.52942014, -0.42783213, 0.13375106, -0.6989... | [0.5484055, 0.83765465, 0.29287395, 1.7436261,... |
| 4 | [-0.35032487, 0.25517753, 0.81522655, 3.515412... | [-0.36952665, -0.07806556, 0.8490566, 0.163935... | [0.50919294, 0.26685166, 1.1770691, 3.519233, ... |
| 5 | [0.9705175, -0.2339002, -0.50332534, 0.3029228... | [0.3593279, -0.03988636, -0.0053454647, -1.029... | [1.0349014, 0.23727669, 0.5033537, 1.0735381, ... |
| 6 | [-1.2503744, 0.060332876, -1.322312, -0.742683... | [-0.89010864, -0.13585812, 0.27518225, 0.98975... | [1.5348387, 0.14865223, 1.3506422, 1.2374127, ... |
| 7 | [-1.0393323, -0.8376865, 0.23049998, -1.224901... | [-0.47842094, -1.1893959, -1.1487722, 0.992551... | [1.1441582, 1.4547788, 1.1716689, 1.5765604, 1... |
| 8 | [1.8042065, -1.4835553, -0.48568684, -0.218936... | [-0.3811606, 0.27416706, -0.33988178, -0.40725... | [1.8440294, 1.5086763, 0.59279954, 0.46237206,... |
| 9 | [-1.0214751, -1.0191854, -1.0693055, -0.221288... | [0.23385774, -1.3407346, -0.39071348, 0.367159... | [1.0479031, 1.6841342, 1.1384513, 0.42868945, ... |


If you're outputting to Pandas, you probably want both `namedecode` and `flatten`, so uproot provides the `tree.pandas.df` and `tree.pandas.iterate` methods and the `uproot.pandas.iterate` function for convenience.

In [46]:
filenames = "http://scikit-hep.org/uproot/examples/HZZ.root"
for df in uproot.pandas.iterate(filenames, "events", ["MET_p*", "Muon_P*"]):
    print(df)

                   MET_px     MET_py     Muon_Px    Muon_Py     Muon_Pz
entry subentry                                                         
0     0          5.912771   2.563633  -52.899456 -11.654672   -8.160793
      1          5.912771   2.563633   37.737782   0.693474  -11.307582
1     0         24.765203 -16.349110   -0.816459 -24.404259   20.199968
2     0        -25.785088  16.237131   48.987831 -21.723139   11.168285
      1        -25.785088  16.237131    0.827567  29.800508   36.965191
3     0          8.619896 -22.786547   22.088331 -85.835464  403.848450
      1          8.619896 -22.786547   76.691917 -13.956494  335.094208
4     0          5.393139  -1.310052   45.171322  67.248787  -89.695732
      1          5.393139  -1.310052   39.750957  25.403667   20.115053
5     0         -3.759475 -19.417021    9.228110  40.554379  -14.642164
      1         -3.759475 -19.417021   -5.793715 -30.295189   42.954376
6     0         23.962149  -9.049156   12.538717 -42.548710 -124
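
The `entry`/`subentry` index in the output above comes from flattening: each jagged column contributes one row per item, and per-entry values such as `MET_px` are repeated to match. Here is a NumPy-only sketch of that bookkeeping, using the muon counts and `MET_px` values visible in the first three entries above. It illustrates the idea and is not uproot's code.

```python
import numpy as np

counts = np.array([2, 1, 2])                            # muons per entry
met_px = np.array([5.912771, 24.765203, -25.785088])    # one value per entry

# Outer index: the entry number, repeated once per muon.
entry = np.repeat(np.arange(len(counts)), counts)

# Inner index: position of each muon within its entry.
starts = np.cumsum(counts) - counts
subentry = np.arange(counts.sum()) - np.repeat(starts, counts)

# Per-entry values are repeated to give one row per muon.
met_px_flat = np.repeat(met_px, counts)

print(entry.tolist())      # [0, 0, 1, 2, 2]
print(subentry.tolist())   # [0, 1, 0, 0, 1]
```

One consequence of this scheme is that an entry with zero muons contributes no rows at all, so it disappears from the flattened table.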

## Caching

If you 

## Parallel processing

## Lazy evaluation

## Dask (distributed computing)