# Using Uproot effectively

## How this works as a hands-on tutorial

Even though I don't have formal exercises scattered throughout these notebooks, this session can still be interactive.

   * **You** should open each notebook in Binder (see [GitHub README](https://github.com/jpivarski/2020-06-08-uproot-awkward-columnar-hats)) and evaluate cells, following along with me.
   * **I** should pause frequently and stay open to questions. I'll be monitoring the videoconference chat.
   * **We** should feel free to step off the path and try to answer "What if?" questions in real time.

Not all digressions will lead to an answer—I often realize, "That's why it didn't work!" long after the tutorial is over—but tinkering is how we learn.

Consider this a tour and I'm your guide. The planned route is a suggestion to get things started, but your questions and wayfaring are more important.

<br><br><br>

## Increasingly unnecessary introduction to/motivation for Python

I used to start these tutorials by asking, "Why Python?" but that doesn't seem necessary anymore.

![](img/python-usage.png)

<br><br><br>

<font size="15">Introduction to </font><img src="img/uproot-logo-300px.png" style="vertical-align:middle">

![](img/abstraction-layers.png)

Uproot is an independent implementation of ROOT I/O and only I/O, using standard Python libraries wherever possible.

<br><br><br>

## Why was it written?

![](img/uproot-awkward-timeline.png)

Uproot was originally a part of Femtocode, a query language for calculations on columnar data. I needed an easier way to deploy ROOT I/O.

   * **Uproot 1.x** was released as a Python package "in case anyone finds it useful."
   * Machine learning users did find it useful, so I quickly cleaned it up and made it presentable as **Uproot 2.x**.
   * The way people were using Uproot influenced how I thought about columnar analysis: breaking it out into smaller pieces and eventually exposing the array-at-a-time interface to users, rather than hiding the columnar processing behind a query language.
   * Uproot's "bottom up" JaggedArrays were moved into a new package, **Awkward Array**, replacing the "top down" view of OAMap. This was **Uproot 3.x**.
   * Awkward Array is successful even though it has interface flaws and its pure Python "no for loops!" implementation is hard to maintain.
   * **Awkward 1.x** started last fall with a long development time to "do it right." It is complete, but not very visible because Uproot doesn't produce the new-style arrays yet.
   * **Uproot 4.x** started development in May with a release date of July 1.

Unlike previous version updates (which were more minor), Uproot 3.x and Awkward 0.x will continue to exist as `uproot3` and `awkward0`.

<br><br><br>

Sometime this summer, `uproot4` → `uproot` and `awkward1` → `awkward`. If you need to keep old scripts working, you'll be able to

```python
import uproot3 as uproot
import awkward0 as awkward
```

but new work should use the new libraries. (The old ones will continue to exist, though they won't be actively maintained.)

![](img/Raiders-of-the-Lost-Ark-Chamber.jpg)

<br><br><br>

## Opening a file with Uproot

The read-only interface starts with `uproot.open`.

(`uproot.open` also supports HTTP and XRootD URLs, but I don't cover them in this tutorial.)

In [None]:
import uproot

file = uproot.open("data/nesteddirs.root")
file

A file has a dict-like interface, meaning that you can access objects with square brackets and list them with `keys`.

In [None]:
file.keys()

In [None]:
file["one"]

In [None]:
file["one"].keys()

In [None]:
file.allkeys()

In [None]:
file.allclassnames()

### What's the `b` at the beginning of each file path?

These are bytestrings, not strings, and Python 3 emphasizes the difference.

I was worried that old ROOT files would use strange encodings and thought that presuming everything to be UTF-8 would make hist�gr�m title� l��k like th�s.

But the issue of encodings never came up. Dealing with the Python bytestrings has been more of a nuisance.
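
As a small illustration in plain Python (no Uproot required), the sketch below shows why bytestring keys are a nuisance: a `bytes` key never compares equal to a `str`, so printing or comparing them needs an explicit `decode`. The key names are modeled on the paths used in this notebook, and UTF-8 is an assumption, since the file itself doesn't declare an encoding.

```python
# Keys in the style Uproot 3 reports them: bytestrings, not strings (illustrative values).
keys = [b"one;1", b"one/two/tree;1"]

# A bytes object never equals a str, even when the characters match.
same = (keys[0] == "one;1")
print(same)

# Decoding (assuming UTF-8) gives ordinary strings for printing and comparison.
decoded = [key.decode("utf-8") for key in keys]
print(decoded)
```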

### Technology preview: Uproot 4

Uproot 4 is only half-written and might fail in simple cases. However, we can try it out side-by-side with Uproot 3 because it has a different package name.

In [None]:
import uproot4

file_uproot4 = uproot4.open("data/nesteddirs.root")
file_uproot4

In [None]:
# recursive=True is now the default; there's no 'allkeys'
file_uproot4.keys()

In [None]:
# now a dict, and no bytestrings
file_uproot4.classnames()

In [None]:
file_uproot4.classname_of("one/two/tree")

In [None]:
file_uproot4.classname_of("one/two/tree;1")

No more bytestrings. (Invalid UTF-8 uses the "surrogate escape" method, so a strangely encoded string won't _break_ anything, at least.)
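
To make the "surrogate escape" remark concrete, here is a minimal sketch using only Python's built-in codecs. The byte string is a made-up Latin-1 title, not something read from a ROOT file.

```python
# "histogram" spelled with o-umlaut and a-umlaut, encoded as Latin-1: not valid UTF-8.
raw = b"hist\xf6gr\xe4m"

# Strict UTF-8 decoding would raise UnicodeDecodeError. With surrogate escapes,
# each undecodable byte is kept as a lone surrogate code point instead.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same error handler recovers the original bytes exactly.
roundtrip = text.encode("utf-8", errors="surrogateescape")
print(roundtrip == raw)
```

The decoded string may not display nicely, but nothing is lost and nothing raises an exception.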

### What's the `;1` at the end of the key name?

These are ROOT "cycle numbers," which allow objects with the same name to exist in the same directory. We display them to disambiguate, but you don't have to type them to look up an object. (You'll get the latest one; the one with the highest cycle.)
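
The lookup rule can be sketched in a few lines of plain Python. This is only an illustration of the rule described above, not Uproot's implementation; the `resolve` helper and the directory listing are hypothetical.

```python
def resolve(name, keys):
    """Return the full key for `name`; without an explicit cycle, pick the highest one."""
    if ";" in name:
        if name in keys:
            return name
        raise KeyError(name)
    cycles = [int(key.rpartition(";")[2]) for key in keys if key.rpartition(";")[0] == name]
    if not cycles:
        raise KeyError(name)
    return f"{name};{max(cycles)}"

# Hypothetical directory listing with two cycles of the same object.
keys = ["hist;1", "hist;2", "tree;1"]
print(resolve("hist", keys))     # no cycle given: the highest cycle wins
print(resolve("hist;1", keys))   # explicit cycle: exactly that object
```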

## Exploring a TTree

TTrees also have a dict-like interface, though the `show` method has been very useful.

In [None]:
tree = file["one/two/tree"]
tree

In [None]:
tree.keys()

In [None]:
tree.show()

Left column: branch names, middle column: streamers (which define complex types), right column: how _we_ interpret the branch as an array (Uproot-specific).

In [None]:
tree["Float64"].array()

In [None]:
tree["ArrayInt32"].array()

In [None]:
tree["SliceInt64"].array()

The last of these is a jagged array, which has a variable number of items in each entry.

   * Uproot 3 returns NumPy arrays for scalar and fixed-length per entry types.
   * Uproot 3 returns Awkward 0 JaggedArrays for variable-length per entry types.
   * Uproot 4 (by default) returns Awkward 1 arrays for all branches.
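
If "jagged array" is a new idea, the following plain-Python sketch shows one common way to lay out variable-length data in flat buffers: all items in one `content` list, plus a count per entry. It illustrates the general idea only and is not Awkward Array's actual implementation; the values are made up.

```python
# Three entries with 2, 0, and 3 items each (made-up values).
entries = [[1.1, 2.2], [], [3.3, 4.4, 5.5]]

# Columnar form: one flat buffer of items plus the number of items per entry.
content = [item for entry in entries for item in entry]
counts = [len(entry) for entry in entries]

# Offsets are the running sum of counts; entry i is content[offsets[i]:offsets[i + 1]].
offsets = [0]
for count in counts:
    offsets.append(offsets[-1] + count)

def get_entry(i):
    """Recover entry i from the flat buffers."""
    return content[offsets[i]:offsets[i + 1]]

print(counts, offsets, get_entry(2))
```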

In [None]:
file_uproot4["one/two/tree/Float64"].array()

In [None]:
file_uproot4["one/two/tree/ArrayInt32"].array()

In [None]:
file_uproot4["one/two/tree/SliceInt64"].array()

It's still possible to get NumPy arrays with `library="np"` (i.e. return type depends on what you ask for, not the contents of the file).

In [None]:
file_uproot4["one/two/tree/Float64"].array(library="np")

In [None]:
file_uproot4["one/two/tree/ArrayInt32"].array(library="np")

In [None]:
file_uproot4["one/two/tree/SliceInt64"].array(library="np")

Also, Pandas is a `library`, rather than a special function, as well as CuPy (GPU arrays) and any others we might want to add in the future.

In [None]:
file_uproot4["one/two/tree/SliceInt64"].array(library="pd")

## How ROOT data are organized

Objects in directories are referenced by TKeys—you can ignore these, as they just make the square brackets syntax work.

A TTree's TBranches are either containers of data, convertible to arrays, or placeholders in a hierarchy describing a "split" object (more on that later).

The actual data are broken up into TBaskets, which are the smallest units that can be read from a compressed file. There's no such thing as "reading one event," unless you have one TBasket per event (which would be inefficient when reading many events).

![](img/terminology.png)

Often, you can ignore TBaskets: Uproot treats TBranches as the fundamental unit, with one TBranch → one array.

But if your file compresses poorly or is slow to read, check the TBasket sizes to see that they are at least tens to hundreds of kilobytes each.

In [None]:
events = uproot.open("data/cms_opendata_2012_nanoaod_DoubleMuParked.root")["Events"]
events

In [None]:
for name in events.keys():
    print(f"{name.decode():20} {events[name].numbaskets:2d} baskets {[events[name].basket_uncompressedbytes(i)/1024 for i in range(events[name].numbaskets)]} kB each")

This affects ROOT performance, but it affects Uproot performance _more_.

![](img/root-none-muon.png)

(The TFile-TTree-TBranch-TBasket structure has to be navigated in slow Python, but reading/decompressing/interpreting a TBasket is a NumPy call, about as fast as the hardware allows.)

## Split objects

ROOT TTrees are intended to deliver collections of C++ objects. Strictly speaking, these objects have no equivalent in Python—certainly their C++ methods can't be executed by Python. (The C++ code is not stored in the file with the data, even if we had a runtime C++ compiler. That's why some ROOT scripts require `.L` to load libraries.)

What the ROOT files _do_ provide is a list of each class's private member data and how they are laid out in bytes (called the `TStreamerInfo`). We can use that to generate Python classes and reconstruct the objects. However, that has to run in slow Python, not fast NumPy.

As a storage optimization, ROOT files can be written with each member datum in a separate branch. This is called the "splitLevel" and [you can control it when writing files](https://root.cern.ch/doc/master/classTTree.html#addingacolumnofobjs) (if you have access to the process that writes files).

Split data are

   * less likely to contain unsupported features (data structures that Uproot can't read might be in a branch you don't need to read);
   * often faster because they can be read in a single NumPy call, rather than many Python statements;
   * possible to read one column at a time, without touching the others (in ROOT and Uproot).
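
The difference between the two layouts can be sketched without any ROOT file at all. In this toy example, the field names echo branch names used in this notebook but the values are made up: the "unsplit" form is a list of whole objects and the "split" form is one list per field.

```python
# Row-wise ("unsplit"): each event is one object holding all of its fields.
events = [
    {"Int32": 0, "Float64": 0.0, "Str": "evt-000"},
    {"Int32": 1, "Float64": 1.1, "Str": "evt-001"},
    {"Int32": 2, "Float64": 2.2, "Str": "evt-002"},
]

# Getting one field means visiting every object in a Python loop.
float64_from_rows = [event["Float64"] for event in events]

# Column-wise ("split"): each field is stored contiguously on its own.
columns = {name: [event[name] for event in events] for name in events[0]}

# Getting one field is a single lookup that never touches the other columns.
float64_from_columns = columns["Float64"]
print(float64_from_columns)
```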

Let's look at an example of the same data in unsplit and split form:

In [None]:
unsplit = uproot.open("data/small-evnt-tree-nosplit.root")["tree"]
split = uproot.open("data/small-evnt-tree-fullsplit.root")["tree"]

In [None]:
unsplit.show()

In [None]:
split.show()

We can read the unsplit data, and they are Python objects with attributes.

In [None]:
unsplit_events = unsplit["evt"].array()
unsplit_events

In [None]:
unsplit_events[5]._SliceI64

We could ask for all attributes of one event.

In [None]:
{name: getattr(unsplit_events[5], "_" + name) for name in unsplit_events[5]._fields}

Or we could ask for the same attribute from all events.

In [None]:
list_of_numpy_arrays = [x._SliceI64 for x in unsplit_events]
list_of_numpy_arrays

The above approximates what the split file naturally has: a column representing a single field of all events.

In [None]:
jagged_array = split["SliceI64"].array()
jagged_array

It looks different because it is different:

   * the Python list comprehension over unsplit objects made a list of NumPy arrays;
   * the split data was directly read into a JaggedArray.

The JaggedArray has features that the list of NumPy arrays doesn't (more on Awkward Array in the second hour).

For instance, you can slice the second dimension, which has variable length.

In [None]:
jagged_array[:, :3]

But not in a Python list of NumPy arrays. Python doesn't think of the objects in the list as being part of the list.

In [None]:
list_of_numpy_arrays[:, :3]

With a bit of manual construction, we can build the same kind of array from unsplit data:

In [None]:
import awkward1 as ak

events = ak.Array([{name: getattr(obj, "_" + name) for name in obj._fields if name != "P3"} for obj in unsplit_events])
events

This is now an array of everything; its type shows the full structure.

In [None]:
ak.type(events)

In [None]:
events.SliceI64

In [None]:
ak.from_awkward0(jagged_array)

In [None]:
ak.from_awkward0(jagged_array) == events.SliceI64

In [None]:
ak.all(ak.from_awkward0(jagged_array) == events.SliceI64)

But you shouldn't have to write this manually. Uproot 4 will do that for you (taking advantage of some Awkward 1 features).

## Histograms

Sometimes, though, we want objects to have methods. TTree (also auto-generated from `TStreamerInfo`, like anything else) is a prime example: we want TTrees to have methods that read TBaskets and convert them into arrays.

Uproot has a stable set of "mixin classes," which define methods but no data, as well as the auto-generated "models" that deserialize and store data. Runtime classes inherit from both.

Histograms, for instance, have some analysis methods.

In [None]:
histograms = uproot.open("data/hepdata-example.root")
histograms.classnames()

In [None]:
histograms["hpx"].show()

A shout-out: see [scikit-hep/histoprint](https://github.com/ast0815/histoprint) for a more fully featured package that will take over the job of pretty-printing histograms.

It can do overlays, stacks, and terminal colors (not in Jupyter, though).

In [None]:
import histoprint

histoprint.print_hist(histograms["hpx"])

The histogram methods are convenient ways to access C++ private members. For instance,

In [None]:
histograms["hpx"]._fTitle

In [None]:
histograms["hpx"].title

In [None]:
histograms["hpx"]._fXaxis

In [None]:
histograms["hpx"].bins

In [None]:
histograms["hpx"].edges

In [None]:
print(histograms["hpx"].hepdata())

The `numpy` method turns the ROOT histogram into the same form that `np.histogram` would return.

In [None]:
histograms["hpx"].numpy()

Same for 2-D histograms and `np.histogram2d`.

In [None]:
histograms["hpxpy"].numpy()

Unfortunately, Matplotlib, the predominant Python plotting package, does not like to take prebinned histogram data.

This idea of filling histograms in a separate job from plotting them, which [we've been doing at least since HBOOK was released in 1974](https://indico.cern.ch/event/667648/attachments/1526850/2425425/cern17.pdf), is largely unknown beyond particle physics.

The best you can do is a bar chart.

In [None]:
import matplotlib.pyplot as plt

content, edges = histograms["hpx"].numpy()

plt.bar(edges[:-1], content)

But there are projects in Scikit-HEP that are seeking to address that (another shout-out).

In [None]:
import mplhep as hep

hep.histplot(histograms["hpx"].numpy())

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))

hep.histplot(histograms["hpx"].numpy(), ax=ax1)
ax1.set_xlabel(histograms["hpx"].title)

content, ((xbins, ybins),) = histograms["hpxpy"].numpy()
hep.hist2dplot(content, xbins, ybins, ax=ax2)
ax2.set_xlabel(histograms["hpxpy"].title)

For histogramming, take a look at the following. With the exception of `hist` (a new project in Scikit-HEP), they are all complete, stable, and actively maintained.

   * [Boost.Histogram](https://www.boost.org/doc/libs/1_73_0/libs/histogram/doc/html/index.html): minimal-dependencies, fast-filling, flexible, HEP-style histograms in C++, which has been accepted into Boost.
   * [boost-histogram](https://github.com/scikit-hep/boost-histogram): Python bindings for Boost.Histogram.
   * [mplhep](https://github.com/scikit-hep/mplhep): plotting interface over Matplotlib, providing CMS and ATLAS styles and other HEP conveniences.
   * [histoprint](https://github.com/ast0815/histoprint): histogram renderer for terminals and command-lines.
   * [Physt](https://github.com/janpipek/physt): complete histogramming package with a HEP-like point of view.
   * [hist](https://github.com/scikit-hep/hist): Pythonic "one-stop-shop" for histogramming, pulling in all dependencies to make plotting easier. Filling via boost-histogram, plotting via mplhep, text output via histoprint.
   * [Coffea histograms](https://coffeateam.github.io/coffea/notebooks/histograms.html): intended as an intermediate, presages some of the features of hist.

Of course there's also [PyROOT](https://root.cern.ch/pyroot) and [rootpy](https://pypi.org/project/rootpy). If you're a theorist, you've probably heard of [YODA](https://yoda.hepforge.org/pydoc).

But there are many others, too: [fast-histogram](https://pypi.org/project/fast-histogram), [qhist](https://pypi.org/project/qhist), [hdrhistogram](https://pypi.python.org/pypi/hdrhistogram), [multihist](https://pypi.python.org/pypi/multihist), [matplotlib-hep](https://github.com/ibab/matplotlib-hep), [pyhistogram](https://pypi.python.org/pypi/pyhistogram), [histogram](https://pypi.python.org/pypi/histogram), [SimpleHist](https://pypi.python.org/pypi/SimpleHist), [paida](https://pypi.org/project/paida), [histogramy](https://pypi.python.org/pypi/histogramy), [pypeaks](https://pypi.python.org/pypi/pypeaks), [hierogram](https://pypi.python.org/pypi/hierogram), [histo](https://pypi.python.org/pypi/histo), [python-metrics](https://pypi.python.org/pypi/python-metrics), [statscounter](https://pypi.python.org/pypi/statscounter), [datagram](https://pypi.python.org/pypi/datagram), [histogram](https://github.com/theodoregoetz/histogram) and [dashi](http://www.ifh.de/~middell/dashi/index.html), most of which seem to have been abandoned.

That's not even counting the six I've written: [plothon](http://code.google.com/p/plothon), [svgfig](http://code.google.com/p/svgfig), [cassius](https://github.com/opendatagroup/cassius), [histogrammar](https://github.com/histogrammar), [histbook](https://github.com/scikit-hep/histbook), and [aghast](https://github.com/scikit-hep/aghast) (though this last one is a middleware tool, not user-facing).

**Moral:** starting a histogram package is easy, growing a community around one so that it develops is hard.

Uproot's histogram functionality will defer more to [hist](https://github.com/scikit-hep/hist) as it develops.

## Active objects from TTrees

Above, you've seen how we could extract auto-generated `Event` objects from a TTree, as well as auto-generated histograms from a TDirectory.

Can objects from a TTree have methods?

**Yes!** In fact, they can be histograms. (Thanks to Cédric Hernalsteens for supplying this example in [Uproot issue #399](https://github.com/scikit-hep/uproot/issues/399).)

In [None]:
array_of_histograms = uproot.open("data/issue399.root")["Event/Histos.histograms1D"].array()
array_of_histograms

(In fact, the issue was that these are _lists_ of histograms in each TTree entry.)

In [None]:
fig, axes = plt.subplots(10, 6, figsize=(18, 30))
fig.subplots_adjust(bottom=-0.1, left=-0.2)

i, j = 0, 0
for hists in array_of_histograms:
    for hist in hists:
        hep.histplot(hist.numpy(), ax=axes[i][j])
        axes[i][j].set_xlabel(hist.title)
        j += 1
        if j == 6:
            i += 1
            j = 0

## Lorentz vectors

Perhaps the most important active objects are Lorentz vectors.

In [None]:
without_lorentz = uproot.open("data/HZZ.root")["events"]
with_lorentz = uproot.open("data/HZZ-objects.root")["events"]

In [None]:
without_lorentz.show()

In [None]:
with_lorentz.show()

Despite the fact that this is a jagged array of objects, Lorentz vectors have a fixed-width structure and can be extracted in a fast NumPy call.

In [None]:
lorentz_array = with_lorentz["muonp4"].array()
lorentz_array

Another consequence is that the mix-in methods (e.g. `pt`, `eta`, `phi`) can be attached to the JaggedArray as well as the individual objects.

In [None]:
lorentz_array[32, 3]

In [None]:
lorentz_array[32, 3].pt

In [None]:
lorentz_array[32]

In [None]:
lorentz_array[32].pt

In [None]:
lorentz_array.pt

This idea of "lifting" an operation from scalar → scalar, like Lorentz object → pT, to array → array and even jagged array → jagged array is the heart of columnar analysis.

We'll be seeing more of it in the session on Awkward Array.
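
As a toy picture of "lifting," the sketch below takes a scalar function and applies it at the leaves of nested Python lists, so the same function works on one particle, on a flat array, and on a jagged array. It uses slow Python loops purely to show the shape of the idea; the `lift` helper is hypothetical and is not how Awkward Array works internally, where the operation is vectorized.

```python
import math

def pt(px, py):
    """Scalar -> scalar: transverse momentum of one particle."""
    return math.hypot(px, py)

def lift(function):
    """Apply `function` at the leaves of (possibly nested, possibly ragged) lists."""
    def lifted(*args):
        if isinstance(args[0], list):
            return [lifted(*items) for items in zip(*args)]
        return function(*args)
    return lifted

pt_lifted = lift(pt)

one = pt_lifted(3.0, 4.0)                                         # scalar -> scalar
flat = pt_lifted([3.0, 6.0], [4.0, 8.0])                          # array -> array
jagged = pt_lifted([[3.0], [], [6.0, 5.0]], [[4.0], [], [8.0, 12.0]])  # jagged -> jagged
print(one, flat, jagged)
```

The output keeps the structure of the input: the empty entry stays empty and each inner list keeps its length.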

Nevertheless, this is how we can do a Z mass peak in one line:

In [None]:
plt.hist((lorentz_array[lorentz_array.counts >= 2, 0] + lorentz_array[lorentz_array.counts >= 2, 1]).mass, bins=100, range=(60, 120));

## Writing ROOT files

Reading and writing are asymmetric: they come with different sets of issues.

Uproot was originally intended for reading and it has more reading functionality, but it can do quite a bit of writing, now, too. (Thanks to Pratyush Das!)

In [None]:
output_file = uproot.recreate("tmp.root")
output_file

The interface still works like a Python dict: you add objects to the ROOT file by assigning them.

(The name goes in the square brackets after the file, not in the object.)

Pythonic types, such as a NumPy histogram, are accepted and converted into ROOT histograms.

In [None]:
import numpy as np

output_file["histogram"] = np.histogram(np.random.normal(0, 1, 1000000), bins=100, range=(-3, 3))

Now we are really reading the object back and looking at the C++ member data that was written.

In [None]:
output_file["histogram"].__dict__

To come full-circle, we can convert the ROOT histogram into NumPy form and plot it.

In [None]:
hep.histplot(output_file["histogram"].numpy())

### Writing TTrees

TTrees have a special interface because you'll likely need to write the data incrementally.

In [None]:
output_file["tree"] = uproot.newtree({"branch1": int, "branch2": float, "branch3": np.int32})
output_file["tree"]

In [None]:
output_file["tree"].extend({"branch1": np.random.poisson(3, 10000),
                            "branch2": np.random.normal(0, 1, 10000),
                            "branch3": np.random.poisson(1.2, 10000)})

In [None]:
output_file["tree"].show()

In [None]:
output_file["tree"].numentries

In [None]:
output_file["tree"].extend({"branch1": np.random.poisson(3, 10000),
                            "branch2": np.random.normal(0, 1, 10000),
                            "branch3": np.random.poisson(1.2, 10000)})

In [None]:
output_file["tree"].numentries

Complexity is pay-as-you-go. You can add titles to the branches, though you'll need the `uproot.newbranch` function for that.

In [None]:
output_file["tree2"] = uproot.newtree({"branch1": uproot.newbranch(float, title="snazzy branch")}, title="snazzy tree")
output_file["tree2"]

In [None]:
output_file["tree2"].title

In [None]:
output_file["tree2/branch1"].title

### Writing JaggedArrays to TTrees

This is a very new feature, but you can do it. You just have to set another branch as its `size`.

In [None]:
jagged_array

In [None]:
jagged_array.counts

Setting a `size` creates that branch (there's only one type it can have: int32).

In [None]:
output_file["jagged_tree"] = uproot.newtree({"branch1": uproot.newbranch(np.dtype(">f4"), size="n")})

In [None]:
output_file["jagged_tree"].extend({"branch1": jagged_array, "n": jagged_array.counts})

In [None]:
uproot.open("tmp.root")["jagged_tree/branch1"].array()

## Reading many arrays at once

So far, we've only been using the TBranch.array method to get arrays, but TTree.arrays (plural) is a convenient way to get a whole pack of them.

(In Uproot 4, TTree.arrays can also be more efficient via XRootD vector-reads and HTTP multipart-GETs.)

In [None]:
tree = uproot.open("data/nesteddirs.root")["one/two/tree"]
tree.keys()

Pass a list of branch names to get a dict of names → arrays.

In [None]:
tree.arrays(["Int32", "Int64", "Str"])

Set `outputtype=tuple` to get a tuple...

In [None]:
tree.arrays(["Int32", "Int64", "Str"], outputtype=tuple)

... which is good for unpacking (it preserves order).

In [None]:
Int32, Int64, Str = tree.arrays(["Int32", "Int64", "Str"], outputtype=tuple)

The output type can also be a Pandas DataFrame, though the alternate syntax TTree.pandas.df is more often used.

In [None]:
import pandas as pd

tree.arrays(["Int32", "Int64", "Str"], outputtype=pd.DataFrame)

Use wildcards (same syntax as in a UNIX shell) to match all by name.

In [None]:
tree.arrays("Slice*")

Or surround the string with `/` for a regular expression search (same syntax as Python's `re` module).
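
Both matching rules can be imitated with the standard library, which may help when predicting what a pattern will select. This is a sketch under assumptions: the `select` helper is hypothetical, the branch names are invented in the style of the example file, and anchoring the regular expression at the start of the name (`re.match`) is a choice made here, not a statement about Uproot's behavior.

```python
import fnmatch
import re

def select(pattern, names):
    """Return the names matched by a glob pattern, or by a regex if wrapped in slashes."""
    if len(pattern) >= 2 and pattern.startswith("/") and pattern.endswith("/"):
        regex = re.compile(pattern[1:-1])
        return [name for name in names if regex.match(name)]
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]

# Hypothetical branch names.
names = ["Int32", "Int64", "SliceInt64", "SliceUInt64", "SliceFloat64", "Str"]

by_glob = select("Slice*", names)
by_regex = select(r"/Slice[UF].*/", names)
print(by_glob)
print(by_regex)
```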

In [None]:
tree.arrays(r"/Slice[UF].*/")

### Technology preview: all of the above in Uproot 4

In [None]:
tree_uproot4 = uproot4.open("data/nesteddirs.root")["one/two/tree"]
tree_uproot4.keys()

If `library="np"`, you get a dict of NumPy arrays.

In [None]:
tree_uproot4.arrays(["Int32", "Int64", "Float32"], library="np")

If `library` is the default `"ak"` (Awkward Arrays), you can get a dict of Awkward Arrays with `how=dict`.

In [None]:
tree_uproot4.arrays(["Int32", "Int64", "Float32"], how=dict)

But the _default_ way to get a group of arrays is zipped into an Awkward record array.

In [None]:
tree_uproot4.arrays(["Int32", "Int64", "Float32"])

This is an array of records with field names `Int32`, `Int64`, `Float32`:

In [None]:
ak.type(tree_uproot4.arrays(["Int32", "Int64", "Float32"]))

This is similar to the way that old Uproot combined arrays into a single object when returning a Pandas DataFrame, but now `library` and `how` are decoupled.

   * NumPy's natural grouping is dict, but tuple and list are also allowed.
   * Pandas's natural grouping (of Series) is DataFrame, but dict, tuple, and list are allowed.
   * Awkward's natural grouping is a record array, but dict, tuple, and list are allowed.

In Pandas, `how` also specifies how data with different jaggedness, such as the muons and jets in events, are merged. There isn't a one-to-one relationship between each muon and each jet, but there are JOIN techniques.

A single array is a Pandas Series...

In [None]:
without_lorentz_uproot4 = uproot4.open("data/HZZ.root | events")

without_lorentz_uproot4["Muon_Px"].array(library="pd")

A group of arrays is a DataFrame...

In [None]:
without_lorentz_uproot4.arrays(["Muon_Px", "Muon_Py", "Muon_Pz"], library="pd")

And if they have different jaggedness (e.g. muons and jets), by default you get multiple DataFrames.

In [None]:
without_lorentz_uproot4.arrays(["Muon_Px", "Jet_Px", "Muon_Py", "Jet_Py"], library="pd")

But `how` can be passed to Pandas to define some merging semantics: INNER JOIN, LEFT JOIN, RIGHT JOIN, OUTER JOIN.

In [None]:
without_lorentz_uproot4.arrays(["Muon_Px", "Muon_Py", "Jet_Px", "Jet_Py"], library="pd", how="inner")

In [None]:
without_lorentz_uproot4.arrays(["Muon_Px", "Muon_Py", "Jet_Px", "Jet_Py"], library="pd", how="outer")

Similarly, Awkward's default is to combine arrays in a shallow way...

In [None]:
ak.type(without_lorentz_uproot4.arrays(["Muon_Px", "Muon_Py", "Jet_Px", "Jet_Py"]))

But if `how="zip"`, then it will zip together arrays with the same jaggedness (and not arrays with different jaggedness).

In [None]:
ak.type(without_lorentz_uproot4.arrays(["Muon_Px", "Muon_Py", "Jet_Px", "Jet_Py"], how="zip"))

In [None]:
ak.to_list(without_lorentz_uproot4.arrays(["Muon_Px", "Muon_Py", "Jet_Px", "Jet_Py"], how="zip")[:10])

The name splicing was intended for NanoAOD, though it could use a little work in the example above.

In [None]:
cms_uproot4 = uproot4.open("data/cms_opendata_2012_nanoaod_DoubleMuParked.root | Events")
cms_uproot4.keys()

In [None]:
ak.to_list(cms_uproot4.arrays(["Muon_pt", "Muon_eta", "Muon_phi"], how="zip", entry_stop=10))

The strings passed to TTree.arrays can be mathematical expressions, and they can be indirect (through aliases).

This is to support TTree aliases, which caught me by surprise in old Uproot (I hadn't used them before; I didn't realize they could be expressions).

There's no performance advantage in computing with strings vs computing with Python commands (and the syntax in the strings is Python, for now); this is to support a widely used convenience.

(A change that undermines the assumption that these strings are TBranch names had to wait for a major revision like this.)
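
The mechanism can be pictured with a small stand-alone sketch: look the name up in the aliases, then evaluate the resulting Python expression in a namespace made of branch values and the allowed functions. Everything here is an assumption for illustration: the `compute` helper is hypothetical, the values are made up, and the entry-by-entry loop is for clarity only, whereas a real implementation would evaluate the expression once on whole arrays.

```python
import math

def compute(expression, arrays, aliases=None, functions=None):
    """Evaluate a Python-syntax expression once per entry over equal-length columns."""
    aliases = aliases or {}
    functions = functions or {}
    expression = aliases.get(expression, expression)   # one level of indirection
    code = compile(expression, "<expression>", "eval")
    num_entries = len(next(iter(arrays.values())))
    results = []
    for i in range(num_entries):
        namespace = dict(functions)
        namespace.update({name: column[i] for name, column in arrays.items()})
        # Builtins are emptied so only the supplied names and functions are visible.
        results.append(eval(code, {"__builtins__": {}}, namespace))
    return results

# Made-up vertex positions standing in for the PV_x and PV_y branches.
arrays = {"PV_x": [3.0, 0.0, 6.0], "PV_y": [4.0, 1.0, 8.0]}

pv_xy = compute("PV_xy", arrays,
                aliases={"PV_xy": "sqrt(PV_x**2 + PV_y**2)"},
                functions={"sqrt": math.sqrt})
print(pv_xy)
```

Note that `eval` should never be given expressions from an untrusted source; emptying the builtins, as done here, is not a security boundary.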

In [None]:
import numpy as np

cms_uproot4.arrays("PV_xy", aliases={"PV_xy": "sqrt(PV_x**2 + PV_y**2)"}, functions={"sqrt": np.sqrt})

## New Lorentz vector package

Since Lorentz vectors have turned out to be the most important "object with methods" so far, they're getting proper treatment in a standalone library called [Vector](https://vector.readthedocs.io/en/latest/).

It's in early stages of development, but Awkward 1 is being developed to support it.

## Memory management

One of the first questions that was asked when I introduced Uproot as a package that "loads whole TBranches as arrays" was "won't you run out of memory?"

In practice, Coffea analyses use about 10% of the columns of 2 GB files, so 0.2 GB for the initial arrays × all the derived quantities still fits within the RAM available on most machines. Analyses that process one file at a time are generally okay loading whole TBranches.

However, you'll probably run into a limit at _some_ point, so it's good to know ways around it.

First, note that arrays can be partially read if you use `entrystart` and `entrystop`.

In [None]:
cms = uproot.open("data/cms_opendata_2012_nanoaod_DoubleMuParked.root")["Events"]

**Example 1:** reads the whole TBranch and only _views_ the first 10 events.

In [None]:
cms["event"].array()[:10]

**Example 2:** only reads the TBaskets necessary to get the first 10 events.

In [None]:
cms["event"].array(entrystart=0, entrystop=10)

But since the first TBasket has 243206 entries in it, setting `entrystop=10` means reading 243206 and viewing 10.

Reading less than one TBasket is not possible (because compressed chunks need to be fully decompressed).
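
A toy model makes the cost visible. The sketch below compresses integers in fixed-size groups with `zlib` (standing in for TBaskets; the group size, the data, and both helper functions are made up) and counts how many entries have to be decompressed to serve a request.

```python
import struct
import zlib

ENTRIES_PER_BASKET = 1000

def write_baskets(values):
    """Compress values in fixed-size groups, loosely like TBaskets."""
    baskets = []
    for start in range(0, len(values), ENTRIES_PER_BASKET):
        chunk = values[start:start + ENTRIES_PER_BASKET]
        baskets.append(zlib.compress(struct.pack(f">{len(chunk)}q", *chunk)))
    return baskets

def read_entries(baskets, entrystart, entrystop):
    """Return the requested entries and the number of entries that were decompressed."""
    out, decompressed = [], 0
    for index, basket in enumerate(baskets):
        first = index * ENTRIES_PER_BASKET
        last = first + ENTRIES_PER_BASKET
        if last <= entrystart or first >= entrystop:
            continue                      # no overlap with the request: never touched
        raw = zlib.decompress(basket)     # an overlapping basket is decompressed whole
        chunk = struct.unpack(f">{len(raw) // 8}q", raw)
        decompressed += len(chunk)
        out.extend(chunk[max(entrystart - first, 0):entrystop - first])
    return out, decompressed

baskets = write_baskets(list(range(5000)))
values, decompressed = read_entries(baskets, 0, 10)
print(len(values), decompressed)
```

Asking for 10 entries costs one full basket, and a request that straddles a boundary costs two.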

In [None]:
[cms["event"].basket_numentries(i) for i in range(cms["event"].numbaskets)]

On the other hand, reading the whole thing and slicing after the fact means reading 1000000 and viewing 10.

In [None]:
cms.numentries

So there's an optimal size for TBaskets: small enough to easily fit into memory and large enough to spend more time in NumPy number-crunching than Python book-keeping.

Usually, TBaskets are _too small_. (I would guess that they're tuned for the original NanoAOD file size and TBaskets are not merged when the entries are heavily filtered.)

**Note:** a lot of these parameter names will be "snake_case" in Uproot 4: `entry_start`, `entry_stop`, `num_entries`...

### Iterating through a file

If you're working on a set of TBranches that are too large to fit into memory, you'll probably want to slice it iteratively, such that the `entrystop` of the last batch is the `entrystart` of the next batch.

TTree.iterate does this for you.
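
Done by hand, the bookkeeping amounts to generating contiguous ranges. A minimal sketch (the `entry_ranges` helper is hypothetical; the total of 1000000 entries matches the file used above, and the step size is arbitrary):

```python
def entry_ranges(numentries, entrysteps):
    """Yield (entrystart, entrystop) pairs that tile [0, numentries) without gaps."""
    for entrystart in range(0, numentries, entrysteps):
        yield entrystart, min(entrystart + entrysteps, numentries)

ranges = list(entry_ranges(1000000, 300000))
print(ranges)
```

Each pair could be passed as `entrystart` and `entrystop` to `TBranch.array`, one batch at a time; the last batch is simply shorter.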

In [None]:
for batch in cms.iterate("Muon_*"):
    print({name: len(array) for name, array in batch.items()})

By default, it slices at entry numbers where the TBasket boundaries all line up for the set of TBranches you're looking at.

That way, you get consistent arrays (they're all the same length and represent the same events) and avoid the "slop" of incomplete TBaskets.
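
One way to picture "where the TBasket boundaries all line up" is as a set intersection: compute the entry numbers at which each branch's baskets begin or end, and keep only those shared by every branch. The sketch below does that with made-up basket sizes; it illustrates the idea and is not Uproot's actual algorithm.

```python
from itertools import accumulate

def boundaries(basket_numentries):
    """Entry numbers at which baskets start or end, given entries per basket."""
    return set(accumulate(basket_numentries, initial=0))

def clusters(*branches):
    """Entry ranges between boundaries shared by every branch."""
    common = sorted(set.intersection(*(boundaries(b) for b in branches)))
    return list(zip(common[:-1], common[1:]))

# Made-up basket sizes for three branches covering the same 600 entries.
fine = [100, 100, 100, 100, 100, 100]
coarse = [200, 200, 200]
odd = [150, 150, 300]

aligned = clusters(fine, coarse)
spoiled = clusters(fine, coarse, odd)
print(aligned)
print(spoiled)
```

Adding the third branch, whose boundaries rarely coincide with the others, collapses three clusters into one large one.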

You can do a "dry run" to see where these entry boundaries would be using TTree.clusters.

In [None]:
list(cms.clusters("Muon_*"))

It can depend on which TBranches you're looking at because one TBranch with oddly spaced TBasket boundaries can ruin the alignment of the rest.

Here, we look at just the four kinematic variables, and they're more fine-grained.

In [None]:
list(cms.clusters(["Muon_pt", "Muon_eta", "Muon_phi", "Muon_mass"]))

Which was the offending TBranch? To start with, let's look at the set that we're considering...

In [None]:
cms.keys(filtername=lambda name: name.startswith(b"Muon_"))

... and remove TBranches until we find the ones that changed the spacing.

In [None]:
list(cms.clusters(lambda branch: branch.name.startswith(b"Muon_") and branch.name not in (b"Muon_pfRelIso04_all", b"Muon_tightId")))

But maybe you don't care about that—just pick a fixed number of entries for simplicity.

In [None]:
for batch in cms.iterate("Muon_*", entrysteps=100000):
    print({name: len(array) for name, array in batch.items()})

TBaskets that were only partially used in one step in the iteration are saved for the next step, so they're not re-read/decompressed.

We can see this by the fact that the time-to-read/decompress is not much different between the two cases: partial TBaskets do not cause data to be re-read/decompressed, since that would be very costly.

In [None]:
%%timeit

for batch in cms.iterate("Muon_*"):
    {name: len(array) for name, array in batch.items()}

In [None]:
%%timeit

for batch in cms.iterate("Muon_*", entrysteps=100000):
    {name: len(array) for name, array in batch.items()}

One more complication: "number of entries" is not a great measure of "amount of data."

Some data types use more bytes than others, but there's also the fact that you might quickly switch from needing two TBranches to needing ten TBranches. You don't want to re-tune your number of entries.

Considering that this was motivated by wanting to fit everything in memory, you can scale the size of your batches by a number of bytes.

In [None]:
for batch in cms.iterate("Muon_*", entrysteps="10 MB"):
    print({name: len(array) for name, array in batch.items()})

This way, if you use more or fewer (or different) TBranches, you're still using _about_ the same memory in each step.
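
The arithmetic behind a byte-based step size can be sketched as follows. All numbers are hypothetical: the per-entry sizes are invented averages, `nMuon` is an assumed branch name, and interpreting "10 MB" as 10 × 1024² bytes is an assumption of this sketch rather than a documented convention.

```python
def entries_for_budget(budget_bytes, bytes_per_entry):
    """Number of entries per batch so that the selected branches fit in the budget."""
    total = sum(bytes_per_entry.values())
    return max(1, budget_bytes // total)

# Hypothetical average sizes per entry, in bytes, for a few branches.
sizes = {"Muon_pt": 10, "Muon_eta": 10, "Muon_phi": 10, "Muon_mass": 10, "nMuon": 4}

budget = 10 * 1024**2   # "10 MB" taken as 10 * 1024**2 bytes (an assumption)

two = entries_for_budget(budget, {name: sizes[name] for name in ("Muon_pt", "nMuon")})
five = entries_for_budget(budget, sizes)
print(two, five)
```

Selecting more branches lowers the number of entries per batch, while the memory used per batch stays roughly constant.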

In [None]:
for batch in cms.iterate(lambda branch: not branch.name.startswith(b"Muon_"), entrysteps="10 MB"):
    print({name: len(array) for name, array in batch.items()})

In [None]:
for batch in cms.iterate(["Muon_pfRelIso04_all", "Muon_tightId"], entrysteps="10 MB"):
    print({name: len(array) for name, array in batch.items()})

Just like `TTree.clusters`, you can get the entry boundaries as a dry run with `TTree.mempartitions`.

In [None]:
list(cms.mempartitions("10 MB", ["Muon_pt", "Muon_eta", "Muon_phi", "Muon_mass"]))
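
To get a feel for how a memory target turns into entry boundaries, here's a back-of-the-envelope sketch. It assumes that every entry costs the same average number of bytes, which is my simplification and not necessarily what `TTree.mempartitions` does; `mem_partitions` and the byte counts are hypothetical.

```python
def mem_partitions(num_entries, bytes_per_branch, target_bytes):
    """Split [0, num_entries) into ranges holding roughly target_bytes each."""
    # Average cost of one entry, summed over all selected branches
    bytes_per_entry = sum(bytes_per_branch.values()) / num_entries
    entries_per_step = max(1, int(target_bytes // bytes_per_entry))
    return [
        (start, min(start + entries_per_step, num_entries))
        for start in range(0, num_entries, entries_per_step)
    ]

num_entries = 1_000_000

# Two hypothetical branches of 4-byte values: 8 bytes per entry in total
two_branches = {"a": 4_000_000, "b": 4_000_000}
# Ten such branches: 40 bytes per entry in total
ten_branches = {f"branch{i}": 4_000_000 for i in range(10)}

few = mem_partitions(num_entries, two_branches, target_bytes=2_000_000)
many = mem_partitions(num_entries, ten_branches, target_bytes=2_000_000)
print(len(few), few[0])
print(len(many), many[0])
```

Going from two branches to ten shrinks the number of entries per step by a factor of five without touching the memory target, so there is nothing to re-tune by hand.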

## Accessing many files (TChain)

ROOT's TChain makes the interface for iterating over _entries_ in many TTrees the same as iterating over _entries_ in only one TTree.

Uproot either gives you all entries in a TTree (the `arrays` method) or batches of entries in a TTree (the `iterate` method).

   * Extending `arrays` to many files would run you out of memory fast: imagine an interface that loaded all requested TBranches of a large set of files and concatenated them. You can concatenate arrays manually ([np.concatenate](https://numpy.org/doc/1.18/reference/generated/numpy.concatenate.html) or [ak.concatenate](https://awkward-array.readthedocs.io/en/latest/_auto/ak.concatenate.html)), but it's probably not what you want.
   * Maybe it would be useful to have arrays that look and act like "all the files" but only read on demand: these would be `lazyarrays`.
   * Extending `iterate` to many files makes sense: you can use the same interface that would step over batches within a file when accessing many files.
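
To make the contrast between the first and last options concrete, here's a small sketch in which NumPy arrays stand in for one TBranch in each of three files. The `iterate_many` function is hypothetical and only mimics the shape of the iterative interface; in this sketch, batches never cross a file boundary.

```python
import numpy as np

def iterate_many(files, step):
    """Yield batches of at most `step` entries, one file at a time."""
    for entries in files:
        for start in range(0, len(entries), step):
            yield entries[start:start + step]

# Hypothetical stand-ins for the same TBranch in three files
files = [np.arange(0, 5), np.arange(5, 12), np.arange(12, 14)]

# "arrays"-style: everything in memory at once
everything = np.concatenate(files)

# "iterate"-style: only one small batch in memory at a time
batch_lengths = [len(batch) for batch in iterate_many(files, step=4)]
print(len(everything), batch_lengths)
```

The concatenated array grows with the number of files, whereas the largest batch is capped by `step` no matter how many files are in the list.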

Uproot 3 has all of these interfaces:

<table width="100%" style="font-size: 1.25em"><tr style="background: white;">
    <td width="33%" style="vertical-align: top">
        <p style="font-weight: bold; font-size: 1.5em; margin-bottom: 0.5em">Direct</p>
        <p>Read the file and return an array.</p>
        <ul>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/ttree-handling.html#id11">TBranch.array</a></li>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/ttree-handling.html#array">TTree.array</a></li>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/ttree-handling.html#arrays">TTree.arrays</a></li>
        </ul>
    </td><td width="33%" style="vertical-align: top">
        <p style="font-weight: bold; font-size: 1.5em; margin-bottom: 0.5em">Lazy</p>
        <p>Get an object that reads on demand.</p>
        <ul>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/ttree-handling.html#id13">TBranch.lazyarray</a></li>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/ttree-handling.html#lazyarray">TTree.lazyarray</a></li>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/ttree-handling.html#lazyarrays">TTree.lazyarrays</a></li>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/opening-files.html#uproot-lazyarray-and-lazyarrays">uproot.lazyarray</a>*</li>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/opening-files.html#uproot-lazyarray-and-lazyarrays">uproot.lazyarrays</a>*</li>
        </ul>
    </td><td width="33%" style="vertical-align: top">
        <p style="font-weight: bold; font-size: 1.5em; margin-bottom: 0.5em">Iterative</p>
        <p>Read arrays in batches of entries.</p>
        <ul>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/ttree-handling.html#iterate">TTree.iterate</a></li>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/opening-files.html#uproot-iterate">uproot.iterate</a>*</li>
        </ul>
    </td>
</tr></table>

<p>* Applies to sets of files, like TChain.</p>

**But be warned:** lazy arrays rely on features of Awkward 0 that were hard to get right and still have outstanding bugs.

Awkward 1 has a better (more airtight) implementation of lazy arrays, so this will probably be better in Uproot 4.

However, laziness is not always helpful. Your array "loads" right away, but you may pay a performance penalty in every calculation, because

```python
px**2 + py**2
```

walks over _all files_ to compute `px**2`, then walks over _all files_ to compute `py**2`, and maybe doesn't have enough memory to add `px**2` to `py**2`.

At some level, we need to define a batch size and do all computations in that batch, including derived quantities, before moving on to the next. The `uproot.iterate` interface naturally does that, fencing your calculations within a `for` block, but maybe we can define something similar with lazy arrays and Dask.
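
Here's a sketch of what "do all computations in that batch" looks like, with in-memory NumPy arrays standing in for data that would really come from many files. The `batches` helper, the random `px` and `py`, and the histogram binning are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
px = rng.normal(size=1000)  # stand-ins for TBranches spread over many files
py = rng.normal(size=1000)

def batches(arrays, step):
    """Yield dicts of equal-length slices, `step` entries at a time."""
    num_entries = len(next(iter(arrays.values())))
    for start in range(0, num_entries, step):
        yield {name: array[start:start + step] for name, array in arrays.items()}

edges = np.linspace(0, 5, 26)
counts = np.zeros(len(edges) - 1, dtype=np.int64)

for batch in batches({"px": px, "py": py}, step=100):
    # Temporaries such as px**2 are never longer than one batch here
    pt = np.sqrt(batch["px"]**2 + batch["py"]**2)
    counts += np.histogram(pt, bins=edges)[0]

print(counts.sum())
```

Only the small `counts` array survives from one batch to the next, so the memory used by derived quantities is set by the batch size and not by the total number of entries.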

**Or maybe if you're in this situation, you should move on from bare Uproot and look into the scale-out mechanisms Coffea, IRIS-HEP, and others are developing.**