# Introduction to uproot

uproot's interface is minimal: open a file with `uproot.open` and extract objects from it by name, as you would from a dictionary. Let's open a NanoAOD file.

In [None]:
import uproot
tree = uproot.open("~/storage/data/nano-TTLHE-2017-09-04-lz4.root")["Events"]
tree

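Incidentally, the one-liner above does not work in PyROOT because ROOT's and Python's notions of object ownership conflict: Python garbage-collects the temporary `TFile` as soon as the expression finishes, and the `TTree` it owns goes with it.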

In [None]:
import ROOT
tree2 = ROOT.TFile("/home/pivarski/storage/data/nano-TTLHE-2017-09-04-lz4.root").Get("Events")
tree2

As with most Python modules, all of the class members and methods that don't start with an underscore are public.

In [None]:
print(", ".join(x for x in dir(tree) if not x.startswith("_")))

In [None]:
help(tree.array)

In [None]:
tree.branchnames

In [None]:
tree.array("Electron_pt")

In [None]:
tree.arrays(["Electron_pt", "Electron_eta", "Electron_phi"])

In methods like `arrays`, the branch/dtype argument can also be a function that takes a `TBranch` and returns a `dtype` or `None`. Branches for which it returns `None` are skipped; the rest are read with the returned Numpy `dtype`. This is a flexible way to select branches and possibly change their `dtype` on the fly.

In [None]:
tree.arrays(lambda branch: branch.dtype if branch.name.startswith("Electron_") else None)
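
To make the selection rule concrete without needing a ROOT file, here is a minimal sketch of how such a function could be applied. `FakeBranch` and `select` are hypothetical stand-ins invented for this illustration (they are not part of uproot), and the dtypes are plain strings rather than Numpy objects.

```python
import collections

# Hypothetical stand-in for a TBranch: only the two attributes used here.
FakeBranch = collections.namedtuple("FakeBranch", ["name", "dtype"])

def select(branches, chooser):
    """Apply chooser to each branch; keep only those for which it returns a dtype."""
    selected = {}
    for branch in branches:
        dtype = chooser(branch)
        if dtype is not None:          # None means "skip this branch"
            selected[branch.name] = dtype
    return selected

branches = [FakeBranch("Electron_pt", ">f4"),
            FakeBranch("Electron_eta", ">f4"),
            FakeBranch("Muon_pt", ">f4")]

# Same shape of lambda as in the cell above: keep Electron_* branches as they are.
chosen = select(branches, lambda branch: branch.dtype if branch.name.startswith("Electron_") else None)
```

Only the two `Electron_` entries end up in `chosen`; `Muon_pt` is dropped because the function returned `None` for it.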

For instance, we can change all numbers from the "big endian" format ROOT stores them in to your machine's native byte order.

In [None]:
tree.arrays(lambda branch: branch.dtype.newbyteorder("=") if branch.name.startswith("Electron_") else None)
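
If byte order is unfamiliar, this small Numpy-only sketch shows what the conversion amounts to. The array below is made up for illustration; it only assumes that the data arrive as big-endian 32-bit floats.

```python
import numpy

# Big-endian 32-bit floats, the layout the text says ROOT uses on disk.
big = numpy.array([1.5, 2.5, 3.5], dtype=">f4")

# "=" asks for the machine's native byte order, whatever that happens to be.
native_dtype = big.dtype.newbyteorder("=")

# astype rewrites the stored bytes; the numerical values are unchanged.
native = big.astype(native_dtype)
```

On a little-endian machine, `big.dtype.isnative` is `False` and `native.dtype.isnative` is `True`, while both arrays compare equal element by element.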

If you already have an array, you can pass it in place of the `dtype` argument. This avoids unnecessary copies.

In [None]:
import numpy
electron_pt = numpy.zeros(tree.array("nElectron").sum(), dtype=numpy.float64)
id(electron_pt)

In [None]:
electron_pt

In [None]:
tree.array("Electron_pt", electron_pt)
id(electron_pt)

In [None]:
electron_pt
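
The `id` calls above are there to show that the same array object is reused. As a rough sketch of that pattern in plain Numpy (the `read_into` function is a hypothetical stand-in for a reader, not an uproot call):

```python
import numpy

def read_into(source, destination):
    """Hypothetical stand-in for a reader that fills a caller-supplied array."""
    numpy.copyto(destination, source)  # writes in place, converting float32 to float64
    return destination

source = numpy.array([10.0, 20.0, 30.0], dtype=numpy.float32)
buffer = numpy.zeros(len(source), dtype=numpy.float64)

result = read_into(source, buffer)
same_object = result is buffer  # the output is the buffer we passed in, not a new array
```

Because the caller owns `buffer`, it can be allocated once and reused across many reads.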

If the arrays are too large to read all at once, you can iterate over them.

In [None]:
for pt, eta, phi in tree.iterator(1000, ["Electron_pt", "Electron_eta", "Electron_phi"], outputtype=tuple):
    print("px = {}".format(pt*numpy.cos(phi)))
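
Assuming the first argument (`1000` here) is the number of entries per step, the loop visits the tree one slice at a time. A minimal sketch of that slicing, with a hypothetical helper `entry_ranges` that is not part of uproot:

```python
def entry_ranges(num_entries, step):
    """Yield (start, stop) pairs covering num_entries in chunks of at most step."""
    for start in range(0, num_entries, step):
        yield start, min(start + step, num_entries)

# A made-up tree of 2500 entries read 1000 at a time; the last chunk is shorter.
ranges = list(entry_ranges(2500, 1000))
```

Only one chunk's worth of arrays needs to be in memory at any moment, which is the point of iterating.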

Or iterate over a collection of files, as you would with ROOT's `TChain`.

In [None]:
for pt, eta, phi in uproot.iterator(1000, "~/storage/data/nano-TTLHE-2017-09-04-*.root", "Events", ["Electron_pt", "Electron_eta", "Electron_phi"], outputtype=tuple):
    print("px = {}".format(pt*numpy.cos(phi)))

uproot uses Python's `concurrent.futures.Executor` interface for parallelism. Parallel processing and caching are never implicit: they happen only if you pass in an executor or a cache object.

In [None]:
import concurrent.futures
four_workers = concurrent.futures.ThreadPoolExecutor(4)

tree.arrays(["Electron_pt", "Electron_eta", "Electron_phi"], executor=four_workers)

In [None]:
# returns immediately, before actually reading
arrays, errors = tree.arrays(["Electron_pt", "Electron_eta", "Electron_phi"],
                             executor=four_workers, block=False)

In [None]:
# evaluate this iterator to wait for arrays to be filled and see if there were any errors
list(errors)
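
The non-blocking pattern can be sketched with the standard library alone. This illustrates the general idea only, not uproot's internals; `fill` is a hypothetical task that stands in for reading one branch into a preallocated slot.

```python
import concurrent.futures

def fill(destination, index, value):
    """Hypothetical stand-in for reading one branch into a preallocated slot."""
    destination[index] = value * 2
    return None  # nothing went wrong

four_workers = concurrent.futures.ThreadPoolExecutor(4)
results = [None] * 3

# submit() returns a Future immediately; the work runs on the worker threads.
futures = [four_workers.submit(fill, results, i, v) for i, v in enumerate([10, 20, 30])]

# result() waits for each task to finish and re-raises any exception it hit.
errors = [future.result() for future in futures]
four_workers.shutdown()
```

Until the futures have been waited on, `results` may be only partially filled, which is why the waiting step matters.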

I'm adding connectors to other libraries, but I need your input about which are the most important.

In [None]:
df = tree.pandas.df(lambda branch: branch.dtype if branch.name.startswith("Electron_") else None)
df

Now you can do that whole Pandas-analysis thing. See StackOverflow for help.

In [None]:
%matplotlib inline

px = df.Electron_pt * numpy.cos(df.Electron_phi)

px.plot.hist(bins=numpy.linspace(-100, 100, 200), edgecolor="none")
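
For reference, the standard conversion from (pt, eta, phi) to Cartesian momentum components can be sketched with nothing but the `math` module. The function name and the input values are made up for illustration.

```python
import math

def momentum_components(pt, eta, phi):
    """Convert transverse momentum, pseudorapidity, and azimuth to (px, py, pz)."""
    px = pt * math.cos(phi)   # transverse plane, x direction
    py = pt * math.sin(phi)   # transverse plane, y direction
    pz = pt * math.sinh(eta)  # along the beam axis
    return px, py, pz

px, py, pz = momentum_components(50.0, 1.2, 0.3)

# Total momentum; it should equal pt*cosh(eta) because 1 + sinh^2 = cosh^2.
p = math.sqrt(px**2 + py**2 + pz**2)
```

The same expressions work elementwise on Numpy arrays or Pandas columns if `math` is swapped for `numpy`.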

## Next steps

By now, you've probably noticed that we're limited to in-memory analytics and flat ntuples.

For a Pandas interface on a large set of files (out-of-memory analytics; what you've come to expect from ROOT's `TChain`), it could be interesting to try Blaze. Ask me about it if you're interested.

For non-flat data (nested classes), let's move on to the next notebook.