## uproot overview

Uproot is a pure Python + Numpy reader of ROOT files.

   * Without a C++ layer, there are no memory ownership issues between C++ and Python.
   * Different design: instead of delivering event objects, uproot delivers columns of data as (jagged) arrays.
   * Not hampered by slow Python execution: data in ROOT files are already laid out as (jagged) arrays, so uproot only needs to cast them as Numpy arrays.
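
To make the "(jagged) arrays" idea concrete, here is a minimal Numpy-only sketch of one common way to represent a jagged array: a flat `content` array plus an `offsets` array marking where each entry starts and stops. The names `content`, `offsets`, and `entry` are illustrative assumptions, not uproot's API, and the numbers are made up.

```python
import numpy

# Flat values for all entries, stored contiguously (hypothetical data).
content = numpy.array([1.1, 2.2, 3.3, 4.4, 5.5])
# Entry i spans content[offsets[i]:offsets[i + 1]]; entry 1 is empty here.
offsets = numpy.array([0, 3, 3, 5])

def entry(i):
    """Return the variable-length slice of values belonging to entry i."""
    return content[offsets[i]:offsets[i + 1]]

counts = numpy.diff(offsets)    # number of values in each entry
jagged = [entry(i).tolist() for i in range(len(offsets) - 1)]
```

Because both pieces are ordinary flat arrays, per-entry lengths and slices come from array operations rather than from a Python loop over event objects.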

_(Disclosure: I'm the author of uproot.)_

In uproot, files, directories within files, and TTrees/TBranches behave like Python dicts.

In [None]:
import uproot
file = uproot.open("http://scikit-hep.org/uproot/examples/Event.root")
file.keys()

In [None]:
file["ProcessID0"]

In [None]:
file["htime"]

In [None]:
tree = file["T"]
tree

In [None]:
tree.keys()   # allkeys() would also list nested branches

To get a sense of what a TTree contains, use `show`.

In [None]:
tree.show()

To read a (jagged) array, call `array` or `arrays`.

In [None]:
tree["fTracks.fMass2"].array()

In [None]:
tree.array("fTracks.fMass2")

In [None]:
tree.arrays(["fTracks.fMass2", "fTracks.fCharge"])

## Interpretations

The translation from ROOT data to an array is given by the branch's `interpretation` (if it has one).

In [None]:
tree["fNtrack"].interpretation

In [None]:
tree["fTemperature"].interpretation

In [None]:
tree["fMatrix[4][4]"].interpretation

In [None]:
tree["fTracks.fMass2"].interpretation

In [None]:
tree["fTracks.fCharge"].interpretation

In [None]:
tree["fH"].interpretation
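
For the simplest case, a fixed-width numeric branch, an interpretation amounts to viewing raw bytes with a given Numpy dtype. The sketch below shows that idea with invented bytes; it assumes big-endian 32-bit integers and is not uproot's internal code.

```python
import numpy

# Four 32-bit big-endian integers as they might sit in a decompressed buffer
# (hypothetical bytes, built here so the example is self-contained).
raw = numpy.array([10, 20, 30, 40], dtype=">i4").tobytes()

# "Interpreting" the bytes is a cast of the whole buffer: no loop over entries.
as_big_endian = numpy.frombuffer(raw, dtype=">i4")

# Optionally convert to the machine's native byte order for later arithmetic.
as_native = as_big_endian.astype("i4")
```

Jagged and object-valued branches need more bookkeeping than a single dtype, which is why their interpretations look different in the cells above.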

If a branch has no `interpretation`, it can't be read. Either it is a no-data branch that exists only for structure, or uproot does not yet know how to interpret its data type.

In [None]:
print(tree["fTracks.fPointValue"].interpretation)   # as of April 2019, this one has no interpretation

The bytes can be read and even divided along entry boundaries, but we don't yet know how to turn the bytes into an array.

In [None]:
uproot.asdebug

In [None]:
tree["fTracks.fPointValue"].array(uproot.asdebug)

Complex classes are generated based on the ROOT file's self-describing streamers, but they aren't necessarily fast to read (more Python than Numpy).

In [None]:
tree["fH"].interpretation

In [None]:
histograms = tree["fH"].array()
histograms

In [None]:
histograms[0].__dict__

## Fitting into memory constraints

Restricting the range of entries avoids reading too many baskets (chunks on disk).

In [None]:
tree.numentries

In [None]:
tree["fMatrix[4][4]"].numbaskets

In [None]:
tree["fMatrix[4][4]"].array(entrystart=600, entrystop=800)
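
To see why an entry range limits how much is read, consider which baskets overlap the requested range. The following sketch uses invented basket boundaries and a hypothetical helper `baskets_for`; it illustrates the bookkeeping only and is not how uproot is implemented.

```python
import numpy

# First entry number of each basket, plus the total entry count at the end
# (hypothetical boundaries for a 1000-entry branch).
basket_edges = numpy.array([0, 250, 500, 750, 1000])

def baskets_for(entrystart, entrystop):
    """Return indexes of baskets that overlap [entrystart, entrystop)."""
    first = numpy.searchsorted(basket_edges, entrystart, side="right") - 1
    last = numpy.searchsorted(basket_edges, entrystop, side="left") - 1
    return list(range(int(first), int(last) + 1))

needed = baskets_for(600, 800)   # only the baskets covering entries 600-799
```

With these boundaries, the range 600 to 800 touches two of the four baskets, so the other two never need to be read or decompressed.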

Typically, you'd want to read a chunk of entries from all interesting branches, do some work, then move on to the next chunk: use `iterate`.

In [None]:
import numpy
for arrays in tree.iterate(["fTracks.fPx", "fTracks.fPy"], entrysteps=300):
    mag = numpy.sqrt(arrays[b"fTracks.fPx"]**2 + arrays[b"fTracks.fPy"]**2)
    print(len(mag), mag[0][0])
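
The chunk-at-a-time pattern can be sketched without any ROOT file. Here `entry_ranges` is a hypothetical helper that splits a dataset into fixed-size steps, and a plain list stands in for the data; only one slice is handled per iteration.

```python
def entry_ranges(numentries, entrysteps):
    """Yield (start, stop) pairs that cover all entries in fixed-size steps."""
    for start in range(0, numentries, entrysteps):
        yield start, min(start + entrysteps, numentries)

# Process a hypothetical 1000-entry dataset 300 entries at a time.
values = list(range(1000))
chunk_sums = []
for start, stop in entry_ranges(len(values), 300):
    chunk = values[start:stop]      # only this slice is worked on at once
    chunk_sums.append(sum(chunk))

ranges = list(entry_ranges(1000, 300))
```

The last chunk is shorter than the others whenever the step size does not divide the number of entries evenly.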

To do the same over a set of files, use `uproot.iterate`, supplying the file names (wildcards are allowed for local files) and the tree name.

In [None]:
# no wildcards for XRootD and HTTP
filenames = ["http://scikit-hep.org/uproot/examples/HZZ" + x + ".root" for x in ["", "-zlib", "-lz4", "-lzma"]]
for arrays in uproot.iterate(filenames, "events", ["Muon_Px", "Muon_Py"]):
    mag = numpy.sqrt(arrays[b"Muon_Px"]**2 + arrays[b"Muon_Py"]**2)
    print(len(mag), mag[1][0])

## Encodings, outputtypes, and Pandas

In the previous examples, `tree.arrays` returns a dict of arrays. Branch names have no encoding, so the keys of these dicts are bytestrings (a little annoying in Python 3). Here are some things you can do about that.

In [None]:
arrays = tree.arrays(["fTracks.fPx", "fTracks.fPy"], namedecode="utf-8")
arrays
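
For intuition, decoding the keys by hand looks roughly like the following. The dict here is a stand-in with placeholder values, not actual output from the file.

```python
# A dict shaped like the result of a read: bytestring keys (placeholder values).
arrays = {b"fTracks.fPx": [1.0, 2.0], b"fTracks.fPy": [3.0, 4.0]}

# Decode every key once so later lookups can use ordinary str names.
decoded = {name.decode("utf-8"): array for name, array in arrays.items()}
```

After this, `decoded["fTracks.fPx"]` works without the `b` prefix.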

In [None]:
px, py = tree.arrays(["fTracks.fPx", "fTracks.fPy"], outputtype=tuple)
print(px)
print(py)

In [None]:
import collections
arrays = tree.arrays(["fNtrack", "fNseg", "fNvertex"], outputtype=collections.namedtuple)
print(arrays.fNtrack[:5], arrays.fNseg[:5], arrays.fNvertex[:5])

In [None]:
import pandas
tree.arrays(["fTracks.fP*"], outputtype=pandas.DataFrame)   # , flatten=True

If you're outputting to Pandas, you probably want both `namedecode` and `flatten`, so the `tree.pandas.df` and `tree.pandas.iterate` methods and the `uproot.pandas.iterate` function are provided for convenience.

In [None]:
filenames = "http://scikit-hep.org/uproot/examples/HZZ.root"
for df in uproot.pandas.iterate(filenames, "events", ["MET_p*", "Muon_P*"], entrysteps=1000):
    print(df)

## Caching

Uproot does not cache the arrays that you read (except raw data in HTTP and XRootD transfers). If you pass through the same data more than once, it might pay to cache it.

Any dict-like object may be used as a cache. The simplest case is a plain dict, which keeps everything forever.

In [None]:
cache = {}
tree.arrays("fH", cache=cache)
list(cache.keys())

In [None]:
tree.arrays("fH", cache=cache)   # gets it from the dict, not the file

To put an upper limit on memory use, use an `ArrayCache`, which evicts the least recently accessed arrays once the limit is reached.

In [None]:
cache = uproot.cache.ArrayCache(limitbytes=1024**3)   # 1 GiB
tree.arrays("fH", cache=cache)
list(cache.keys())
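
Since any dict-like object can serve as a cache, the eviction idea can be sketched in a few lines of standard-library Python. `TinyLRUCache` is a hypothetical class written for this example; it limits the number of items rather than bytes and is not how `ArrayCache` is implemented.

```python
from collections import OrderedDict
from collections.abc import MutableMapping

class TinyLRUCache(MutableMapping):
    """Hypothetical dict-like cache that evicts least recently used items."""

    def __init__(self, limititems):
        self._data = OrderedDict()
        self._limititems = limititems

    def __getitem__(self, key):
        value = self._data[key]           # raises KeyError on a miss
        self._data.move_to_end(key)       # mark as most recently used
        return value

    def __setitem__(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self._limititems:
            self._data.popitem(last=False)    # drop the least recently used

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

cache = TinyLRUCache(limititems=2)
cache["a"] = 1
cache["b"] = 2
cache["a"]             # touch "a", so "b" is now the oldest
cache["c"] = 3         # exceeds the limit and evicts "b"
remaining = sorted(cache)
```

A cache that limits bytes instead would track the size of each stored array and evict until the total falls under the limit.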

## Parallel processing

In rare cases (e.g. when reading is dominated by LZMA decompression), it can be advantageous to read the data in parallel. If your job is dominated by processing rather than reading, just split it into independent jobs.

In [None]:
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(4)    # split work into 4 threads

arrays = tree.arrays(["fTracks.fP*"], executor=executor, blocking=False)
arrays

The optional `blocking=False` argument means "return a `wait` function." Reading and decompressing continue while you do other things; calling the function returns the result, waiting if necessary.

In [None]:
arrays()
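
The "return a `wait` function" pattern can be sketched with the standard library alone. `read_nonblocking` and `make_task` are hypothetical names, and the tasks are stand-ins for reading and decompressing; this is an illustration of the pattern, not uproot's code.

```python
from concurrent.futures import ThreadPoolExecutor

def read_nonblocking(tasks, executor):
    """Submit zero-argument tasks and return a function that waits for results."""
    futures = [executor.submit(task) for task in tasks]

    def wait():
        # Blocks only if some tasks are still running.
        return [future.result() for future in futures]

    return wait

def make_task(n):
    # Stand-in for "read and decompress one basket" (hypothetical work).
    return lambda: sum(range(n))

with ThreadPoolExecutor(4) as executor:
    wait = read_nonblocking([make_task(n) for n in (10, 100, 1000)], executor)
    # ... other work could happen here while the tasks run ...
    results = wait()
```

Results come back in submission order, regardless of which task finishes first.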

## Lazy evaluation

Another common pattern is lazy evaluation: get an array-like object and only read/decompress when you access it. If you supply an `ArrayCache`, you can also limit its memory use.

In [None]:
arrays = tree.lazyarrays(["fTracks.fP*"], namedecode="utf-8")
arrays

In [None]:
arrays["fTracks.fPx"][:10]   # now it reads from the first basket

In [None]:
arrays["fTracks.fPx"][-10:]   # now it reads from the last basket
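
The read-on-access behavior can be illustrated with a small array-like class. `LazyChunks` and `load_chunk` are hypothetical names invented for this sketch, with fixed-size chunks standing in for baskets; uproot's lazy arrays are more general.

```python
class LazyChunks:
    """Hypothetical array-like that loads fixed-size chunks on first access."""

    def __init__(self, length, chunksize, load):
        self._length = length
        self._chunksize = chunksize
        self._load = load       # function: chunk index -> list of values
        self.loaded = {}        # chunk index -> values already read

    def __len__(self):
        return self._length

    def __getitem__(self, i):
        if i < 0:
            i += self._length
        if not 0 <= i < self._length:
            raise IndexError(i)
        index, position = divmod(i, self._chunksize)
        if index not in self.loaded:
            self.loaded[index] = self._load(index)    # read only when needed
        return self.loaded[index][position]

def load_chunk(index):
    # Stand-in for reading and decompressing one basket.
    return list(range(index * 100, (index + 1) * 100))

lazy = LazyChunks(length=1000, chunksize=100, load=load_chunk)
first = lazy[0]        # loads chunk 0 only
last = lazy[-1]        # loads chunk 9 only
touched = sorted(lazy.loaded)
```

Replacing the plain `loaded` dict with a size-limited cache would bound memory use, at the cost of occasionally re-reading a chunk.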

## Dask (parallel processing)

Dask is a parallel processing framework based on lazy evaluation. Similar functions produce Dask arrays and Dask DataFrames.

In [None]:
filenames = "http://scikit-hep.org/uproot/examples/HZZ.root"
arrays = uproot.daskarrays(filenames, "events", ["MET_p*", "Muon_P*"])
arrays

In [None]:
df = uproot.daskframe(filenames, "events", ["MET_p*", "Muon_P*"])
df