# Using Uproot effectively

## Increasingly unnecessary introduction to/motivation for Python

I used to start these tutorials by asking, "Why Python?" but that doesn't seem necessary anymore.

![](img/python-usage.png)

## What is Uproot?

![](img/abstraction-layers.png)

Uproot is an independent implementation of ROOT I/O and only I/O, using standard Python libraries wherever possible.

It is widely used:

![](img/awkward-0-popularity.png)

## Why was Uproot written?

![](img/uproot-awkward-timeline.png)

Uproot was originally a part of Femtocode, a query language for calculations on columnar data. I needed an easier way to deploy ROOT I/O.

   * **Uproot 1.x** was released as a Python package "in case anyone finds it useful."
   * Machine learning users did find it useful, so I quickly cleaned it up and made it presentable as **Uproot 2.x**.
   * The way people were using Uproot influenced how I thought about columnar analysis: breaking it out into smaller pieces and eventually exposing the array-at-a-time interface to users, rather than hiding the columnar processing behind a query language.
   * Uproot's "bottom up" JaggedArrays were moved into a new package, **Awkward Array**, replacing the "top down" view of OAMap. This was **Uproot 3.x**.
   * Awkward Array is successful even though it has interface flaws and its pure Python "no for loops!" implementation is hard to maintain.
   * **Awkward 1.x** started last fall with a long development time to "do it right." It is complete, but not very visible because Uproot doesn't produce the new-style arrays yet.
   * **Uproot 4.x** started development in May with a release date of July 1.

Unlike previous version updates, which were smaller changes, Uproot 3.x and Awkward 0.x will continue to exist as `uproot3` and `awkward0`.

Sometime this summer, `uproot4` → `uproot` and `awkward1` → `awkward`. If you need to keep old scripts working, you'll be able to

```python
import uproot3 as uproot
import awkward0 as awkward
```

but new work should use the new libraries. (The old ones will continue to exist, but won't be actively maintained.)

![](img/Raiders-of-the-Lost-Ark-Chamber.jpg)

## Opening a file with Uproot

The read-only interface starts with `uproot.open`.

In [1]:
import uproot

file = uproot.open("data/nesteddirs.root")
file

<ROOTDirectory b'tests/nesteddirs.root' at 0x7fb678296090>

A file has a dict-like interface, meaning that you can access objects with square brackets and list them with `keys`.

In [2]:
file.keys()

[b'one;1', b'three;1']

In [3]:
file["one"]

<ROOTDirectory b'one' at 0x7fb6782be610>

In [4]:
file["one"].keys()

[b'two;1', b'tree;1']

In [5]:
file.allkeys()

[b'one;1',
 b'one/two;1',
 b'one/two/tree;1',
 b'one/tree;1',
 b'three;1',
 b'three/tree;1']

In [6]:
file.allclassnames()

[(b'one;1', 'TDirectory'),
 (b'one/two;1', 'TDirectory'),
 (b'one/two/tree;1', 'TTree'),
 (b'one/tree;1', 'TTree'),
 (b'three;1', 'TDirectory'),
 (b'three/tree;1', 'TTree')]
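
The difference between `keys` (one level) and `allkeys` (the whole tree) can be pictured with plain nested dicts. This is only a toy model of the directory listing above, not how Uproot stores anything; `toy_file` and `all_keys` are made-up names for illustration, and cycle numbers are left out.

```python
# Toy model of the directory tree shown above: a dict is a directory,
# a string is the class name of a non-directory object.
toy_file = {
    "one": {
        "two": {"tree": "TTree"},
        "tree": "TTree",
    },
    "three": {"tree": "TTree"},
}

def all_keys(directory, prefix=""):
    """Yield every path below `directory`, depth first."""
    for name, obj in directory.items():
        path = prefix + name
        yield path
        if isinstance(obj, dict):            # descend into subdirectories
            yield from all_keys(obj, path + "/")

top_level = list(toy_file)                   # like keys(): one level only
recursive = list(all_keys(toy_file))         # like allkeys(): the whole tree

print(top_level)   # ['one', 'three']
print(recursive)   # ['one', 'one/two', 'one/two/tree', 'one/tree', 'three', 'three/tree']
```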

### What's the `b` at the beginning of each file path?

These are bytestrings, not strings, and Python 3 emphasizes the difference.

(I was worried that old ROOT files would use strange encodings and thought that presuming everything to be UTF-8 would make hist�gr�m title� l��k like th�s. But the issue of encodings never came up. Dealing with the Python bytestrings has been more of a nuisance.)
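
In practice, the nuisance looks something like the following. The snippet uses only built-in Python types, with a key name copied from the listing above; nothing in it calls Uproot, and the `classnames` dict is a made-up stand-in.

```python
key = b"one;1"                       # the kind of key Uproot 3 returns

same_as_str = (key == "one;1")       # bytes never compare equal to str
print(same_as_str)                   # False

name = key.decode("utf-8")           # explicit conversion to str
print(name == "one;1")               # True

# A plain dict keyed by bytes cannot be looked up with a str
classnames = {b"one;1": "TDirectory"}
found_with_str = "one;1" in classnames
found_with_bytes = "one;1".encode("utf-8") in classnames
print(found_with_str, found_with_bytes)   # False True
```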

### Technology preview: Uproot 4

Uproot 4 is only half-written and might fail in simple cases. However, we can try it out side-by-side with Uproot 3 because of the different package name.

In [7]:
import uproot4

file_uproot4 = uproot4.open("data/nesteddirs.root")
file_uproot4

<ReadOnlyDirectory '/' at 0x7fb68136fcd0>

In [8]:
# recursive=True is now the default; there's no allkeys
file_uproot4.keys()

['one;1',
 'one/two;1',
 'one/two/tree;1',
 'one/tree;1',
 'three;1',
 'three/tree;1']

In [9]:
file_uproot4.classnames()

{'one': 'TDirectory',
 'one/two': 'TDirectory',
 'one/two/tree': 'TTree',
 'one/tree': 'TTree',
 'three': 'TDirectory',
 'three/tree': 'TTree'}

In [10]:
file_uproot4.classname_of("one/two/tree")

'TTree'

In [11]:
file_uproot4.classname_of("one/two/tree;1")

'TTree'

No more bytestrings. (Invalid UTF-8 uses the "surrogate escape" method, so a strangely encoded string won't _break_ anything, at least.)
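
For reference, this is how Python's built-in `surrogateescape` error handler behaves on bytes that are not valid UTF-8. It is a sketch of the standard-library mechanism only; the byte string is made up, and how Uproot 4 applies the handler internally is not shown here.

```python
raw = b"hist\xf6gram"     # Latin-1 bytes for an o-umlaut; not valid UTF-8

# Strict decoding would raise UnicodeDecodeError. With surrogateescape, each
# undecodable byte is mapped to a lone surrogate code point instead.
text = raw.decode("utf-8", errors="surrogateescape")
print(repr(text))         # 'hist\udcf6gram'

# Nothing is lost: encoding with the same handler restores the original bytes
round_trip = text.encode("utf-8", errors="surrogateescape")
print(round_trip == raw)  # True
```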

### What's the `;1` at the end of the key name?

These are ROOT "cycle numbers," which allow objects with the same name to exist in the same directory. We display them to disambiguate, but you don't have to type them to look up an object. (You'll get the latest one; the one with the highest cycle.)
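
The lookup rule can be sketched in a few lines. The `resolve` helper and the directory listing below are hypothetical, written only to illustrate "no cycle given means highest cycle"; this is not Uproot's implementation.

```python
def resolve(keys, name):
    """Return the full key for `name`; pick the highest cycle if none is given."""
    if ";" in name:                                   # explicit cycle requested
        return name if name in keys else None
    cycles = [int(k.rpartition(";")[2]) for k in keys
              if k.rpartition(";")[0] == name]
    return f"{name};{max(cycles)}" if cycles else None

keys = ["hist;1", "hist;2", "tree;1"]   # made-up directory listing

latest = resolve(keys, "hist")
explicit = resolve(keys, "hist;1")
print(latest)      # hist;2
print(explicit)    # hist;1
```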

## Exploring a TTree

TTrees also have a dict-like interface, though the `show` method is often the quickest way to get an overview.

In [12]:
tree = file["one/two/tree"]
tree

<TTree b'tree' at 0x7fb678239590>

In [13]:
tree.keys()

[b'Int32',
 b'Int64',
 b'UInt32',
 b'UInt64',
 b'Float32',
 b'Float64',
 b'Str',
 b'ArrayInt32',
 b'ArrayInt64',
 b'ArrayUInt32',
 b'ArrayUInt64',
 b'ArrayFloat32',
 b'ArrayFloat64',
 b'N',
 b'SliceInt32',
 b'SliceInt64',
 b'SliceUInt32',
 b'SliceUInt64',
 b'SliceFloat32',
 b'SliceFloat64']

In [14]:
tree.show()

Int32                      (no streamer)              asdtype('>i4')
Int64                      (no streamer)              asdtype('>i8')
UInt32                     (no streamer)              asdtype('>u4')
UInt64                     (no streamer)              asdtype('>u8')
Float32                    (no streamer)              asdtype('>f4')
Float64                    (no streamer)              asdtype('>f8')
Str                        (no streamer)              asstring()
ArrayInt32                 (no streamer)              asdtype("('>i4', (10,))")
ArrayInt64                 (no streamer)              asdtype("('>i8', (10,))")
ArrayUInt32                (no streamer)              asdtype("('>u4', (10,))")
ArrayUInt64                (no streamer)              asdtype("('>u8', (10,))")
ArrayFloat32               (no streamer)              asdtype("('>f4', (10,))")
ArrayFloat64               (no streamer)              asdtype("('>f8', (10,))")
N                          (no streamer) 

Left column: branch names, middle column: streamers (which define complex types), right column: how _we_ interpret the branch as an array (Uproot-specific).
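
The strings in the right column are NumPy dtype specifications: `'>i4'` means big-endian (`>`), signed integer (`i`), 4 bytes. To show what that implies for the raw bytes, here is a sketch using the standard-library `struct` module on made-up data; with NumPy, `np.frombuffer(raw, dtype=">i4")` would do the same decoding.

```python
import struct

# Four entries of a hypothetical Int32 branch, as raw big-endian bytes
raw = bytes([0, 0, 0, 0,   0, 0, 0, 1,   0, 0, 0, 2,   0, 0, 0, 3])

# struct spells NumPy's ">i4" as ">i"; the 4 here is the number of values
big_endian = struct.unpack(">4i", raw)
print(big_endian)       # (0, 1, 2, 3)

# Interpreting the same bytes as little-endian gives the wrong numbers,
# which is why the interpretation has to record the byte order
little_endian = struct.unpack("<4i", raw)
print(little_endian)    # (0, 16777216, 33554432, 50331648)
```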

In [15]:
tree["Float64"].array()

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.,
       13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25.,
       26., 27., 28., 29., 30., 31., 32., 33., 34., 35., 36., 37., 38.,
       39., 40., 41., 42., 43., 44., 45., 46., 47., 48., 49., 50., 51.,
       52., 53., 54., 55., 56., 57., 58., 59., 60., 61., 62., 63., 64.,
       65., 66., 67., 68., 69., 70., 71., 72., 73., 74., 75., 76., 77.,
       78., 79., 80., 81., 82., 83., 84., 85., 86., 87., 88., 89., 90.,
       91., 92., 93., 94., 95., 96., 97., 98., 99.])

In [16]:
tree["ArrayInt32"].array()

array([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 3,  3,  3,  3,  3,  3,  3,  3,  3,  3],
       [ 4,  4,  4,  4,  4,  4,  4,  4,  4,  4],
       [ 5,  5,  5,  5,  5,  5,  5,  5,  5,  5],
       [ 6,  6,  6,  6,  6,  6,  6,  6,  6,  6],
       [ 7,  7,  7,  7,  7,  7,  7,  7,  7,  7],
       [ 8,  8,  8,  8,  8,  8,  8,  8,  8,  8],
       [ 9,  9,  9,  9,  9,  9,  9,  9,  9,  9],
       [10, 10, 10, 10, 10, 10, 10, 10, 10, 10],
       [11, 11, 11, 11, 11, 11, 11, 11, 11, 11],
       [12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
       [13, 13, 13, 13, 13, 13, 13, 13, 13, 13],
       [14, 14, 14, 14, 14, 14, 14, 14, 14, 14],
       [15, 15, 15, 15, 15, 15, 15, 15, 15, 15],
       [16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
       [17, 17, 17, 17, 17, 17, 17, 17, 17, 17],
       [18, 18, 18, 18, 18, 18, 18, 18, 18, 18],
       [19, 19, 19, 19, 19, 19, 19, 19, 19, 19],
       [20, 20, 20, 

In [17]:
tree["SliceInt64"].array()

<JaggedArray [[] [1] [2 2] ... [97 97 97 ... 97 97 97] [98 98 98 ... 98 98 98] [99 99 99 ... 99 99 99]] at 0x7fb678252690>

The last of these is a jagged array, which has a variable number of items in each entry.

   * Uproot 3 returns NumPy arrays for scalar and fixed-length per entry types.
   * Uproot 3 returns Awkward 0 JaggedArrays for variable-length per entry types.
   * Uproot 4 (by default) returns Awkward 1 arrays for all branches.
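
One common columnar layout for variable-length data, and the general idea behind a jagged array, is a flat content buffer plus offsets marking where each entry starts and stops. The sketch below uses plain Python lists and the first five entries of `SliceInt64`; it illustrates the layout only and is not the Awkward Array implementation.

```python
# First five entries of SliceInt64: [], [1], [2, 2], [3, 3, 3], [4, 4, 4, 4]
content = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]   # all items, concatenated
offsets = [0, 0, 1, 3, 6, 10]              # entry i is content[offsets[i]:offsets[i + 1]]

def entry(i):
    """Slice one entry out of the flat content buffer."""
    return content[offsets[i]:offsets[i + 1]]

num_entries = len(offsets) - 1
jagged = [entry(i) for i in range(num_entries)]
print(jagged)    # [[], [1], [2, 2], [3, 3, 3], [4, 4, 4, 4]]

# The number of items per entry falls out of the offsets without touching content
counts = [offsets[i + 1] - offsets[i] for i in range(num_entries)]
print(counts)    # [0, 1, 2, 3, 4]
```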

In [18]:
file_uproot4["one/two/tree/Float64"].array()

<Array [0, 1, 2, 3, 4, ... 95, 96, 97, 98, 99] type='100 * float64'>

In [19]:
file_uproot4["one/two/tree/ArrayInt32"].array()

<Array [[0, 0, 0, 0, 0, ... 99, 99, 99, 99]] type='100 * 10 * int32'>

In [20]:
file_uproot4["one/two/tree/SliceInt64"].array()

<Array [[], [1], [2, ... 99, 99, 99, 99, 99]] type='100 * var * int64'>

It's still possible to get NumPy arrays with `library="np"` (i.e. return type depends on what you ask for, not the contents of the file).

In [21]:
file_uproot4["one/two/tree/Float64"].array(library="np")

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.,
       13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25.,
       26., 27., 28., 29., 30., 31., 32., 33., 34., 35., 36., 37., 38.,
       39., 40., 41., 42., 43., 44., 45., 46., 47., 48., 49., 50., 51.,
       52., 53., 54., 55., 56., 57., 58., 59., 60., 61., 62., 63., 64.,
       65., 66., 67., 68., 69., 70., 71., 72., 73., 74., 75., 76., 77.,
       78., 79., 80., 81., 82., 83., 84., 85., 86., 87., 88., 89., 90.,
       91., 92., 93., 94., 95., 96., 97., 98., 99.])

In [22]:
file_uproot4["one/two/tree/ArrayInt32"].array(library="np")

array([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 3,  3,  3,  3,  3,  3,  3,  3,  3,  3],
       [ 4,  4,  4,  4,  4,  4,  4,  4,  4,  4],
       [ 5,  5,  5,  5,  5,  5,  5,  5,  5,  5],
       [ 6,  6,  6,  6,  6,  6,  6,  6,  6,  6],
       [ 7,  7,  7,  7,  7,  7,  7,  7,  7,  7],
       [ 8,  8,  8,  8,  8,  8,  8,  8,  8,  8],
       [ 9,  9,  9,  9,  9,  9,  9,  9,  9,  9],
       [10, 10, 10, 10, 10, 10, 10, 10, 10, 10],
       [11, 11, 11, 11, 11, 11, 11, 11, 11, 11],
       [12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
       [13, 13, 13, 13, 13, 13, 13, 13, 13, 13],
       [14, 14, 14, 14, 14, 14, 14, 14, 14, 14],
       [15, 15, 15, 15, 15, 15, 15, 15, 15, 15],
       [16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
       [17, 17, 17, 17, 17, 17, 17, 17, 17, 17],
       [18, 18, 18, 18, 18, 18, 18, 18, 18, 18],
       [19, 19, 19, 19, 19, 19, 19, 19, 19, 19],
       [20, 20, 20, 

In [23]:
file_uproot4["one/two/tree/SliceInt64"].array(library="np")

array([array([], dtype=int64), array([1]), array([2, 2]),
       array([3, 3, 3]), array([4, 4, 4, 4]), array([5, 5, 5, 5, 5]),
       array([6, 6, 6, 6, 6, 6]), array([7, 7, 7, 7, 7, 7, 7]),
       array([8, 8, 8, 8, 8, 8, 8, 8]),
       array([9, 9, 9, 9, 9, 9, 9, 9, 9]), array([], dtype=int64),
       array([11]), array([12, 12]), array([13, 13, 13]),
       array([14, 14, 14, 14]), array([15, 15, 15, 15, 15]),
       array([16, 16, 16, 16, 16, 16]),
       array([17, 17, 17, 17, 17, 17, 17]),
       array([18, 18, 18, 18, 18, 18, 18, 18]),
       array([19, 19, 19, 19, 19, 19, 19, 19, 19]),
       array([], dtype=int64), array([21]), array([22, 22]),
       array([23, 23, 23]), array([24, 24, 24, 24]),
       array([25, 25, 25, 25, 25]), array([26, 26, 26, 26, 26, 26]),
       array([27, 27, 27, 27, 27, 27, 27]),
       array([28, 28, 28, 28, 28, 28, 28, 28]),
       array([29, 29, 29, 29, 29, 29, 29, 29, 29]),
       array([], dtype=int64), array([31]), array([32, 32]),
       arr

Pandas output is also selected with the `library` argument, rather than through a special function, as are CuPy (GPU arrays) and any other libraries we might want to add in the future.

In [24]:
file_uproot4["one/two/tree/SliceInt64"].array(library="pd")

entry  subentry
1      0            1
2      0            2
       1            2
3      0            3
       1            3
                   ..
99     4           99
       5           99
       6           99
       7           99
       8           99
Length: 450, dtype: int64
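
The `entry`/`subentry` index in that output can be reproduced by hand. The sketch below assumes the pattern visible in the earlier outputs (entry `i` holds `i % 10` copies of `i`) holds for all 100 entries, and it builds plain tuples rather than a real Pandas object.

```python
# Rebuild the SliceInt64 pattern: entry i holds (i % 10) copies of i
jagged = [[i] * (i % 10) for i in range(100)]

# One row per item, labeled by (entry, subentry); empty entries produce no rows
rows = [(entry, subentry, value)
        for entry, items in enumerate(jagged)
        for subentry, value in enumerate(items)]

print(rows[:3])    # [(1, 0, 1), (2, 0, 2), (2, 1, 2)]
print(rows[-1])    # (99, 8, 99)
print(len(rows))   # 450
```

Entries 0, 10, 20, ... are empty, so they do not appear in the index at all, which is why the output starts at entry 1.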

## How ROOT data are organized

Objects in directories are referenced by keys—you can ignore these, as they just make the square brackets syntax work.

A TTree's TBranches are either containers of data, convertible to arrays, or placeholders in a hierarchy describing a "split" object (more on that later).

The actual data are broken up into TBaskets, which are the smallest unit that can be read from a compressed file. There's no such thing as "reading one event," unless you have one TBasket per event (which would be inefficient when reading many events).
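
A toy model makes the point concrete. Here each "basket" is an independently zlib-compressed block of big-endian float64 values; the data, the basket size, and the `read_entry` helper are all assumptions for illustration, and the real ROOT file format is considerably more involved.

```python
import struct
import zlib

values = [float(i) for i in range(100)]
basket_size = 25   # entries per basket; an arbitrary choice for this sketch

# Write: each basket is compressed on its own
baskets = []
for start in range(0, len(values), basket_size):
    chunk = values[start:start + basket_size]
    baskets.append(zlib.compress(struct.pack(f">{len(chunk)}d", *chunk)))

def read_entry(i):
    """Fetch one entry; the whole basket containing it must be decompressed."""
    raw = zlib.decompress(baskets[i // basket_size])
    chunk = struct.unpack(f">{len(raw) // 8}d", raw)
    return chunk[i % basket_size], len(chunk)

value, entries_decoded = read_entry(42)
print(value, entries_decoded)    # 42.0 25
```

Asking for a single entry still costs a full basket's worth of decompression, which is why basket size matters more than event count for read performance.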

![](img/terminology.png)

Often, you can ignore TBaskets: Uproot treats TBranches as the fundamental unit, with one TBranch → one array.

But if your file compresses poorly or is slow to read, check that the TBaskets are at least tens to hundreds of kilobytes each.

In [27]:
events = uproot.open("data/cms_opendata_2012_nanoaod_DoubleMuParked.root")["Events"]
events

<TTree b'Events' at 0x7fb6780d7fd0>

In [45]:
for name in events.keys():
    branch = events[name]
    sizes_kb = [branch.basket_uncompressedbytes(i)/1024 for i in range(branch.numbaskets)]
    print(f"{name.decode():20} {branch.numbaskets:2d} baskets {sizes_kb} kB each")

run                   5 baskets [950.0234375, 950.0234375, 950.0234375, 950.0234375, 106.15625] kB each
luminosityBlock       5 baskets [950.0234375, 950.0234375, 950.0234375, 950.0234375, 106.15625] kB each
event                 5 baskets [1900.046875, 1900.046875, 1900.046875, 1900.046875, 212.3125] kB each
PV_npvs               5 baskets [950.0234375, 950.0234375, 950.0234375, 950.0234375, 106.15625] kB each
PV_x                  5 baskets [950.0234375, 950.0234375, 950.0234375, 950.0234375, 106.15625] kB each
PV_y                  5 baskets [950.0234375, 950.0234375, 950.0234375, 950.0234375, 106.15625] kB each
PV_z                  5 baskets [950.0234375, 950.0234375, 950.0234375, 950.0234375, 106.15625] kB each
nMuon                 5 baskets [950.0234375, 950.0234375, 950.0234375, 950.0234375, 106.15625] kB each
Muon_pt              12 baskets [2486.1875, 709.421875, 1502.57421875, 1502.59375, 193.62890625, 1502.984375, 1502.25, 192.41015625, 1502.4765625, 1502.4140625, 192.5039

This affects ROOT performance, but it affects Uproot performance _more_.

![](img/root-none-muon.png)

(The TFile-TTree-TBranch-TBasket structure has to be navigated in slow Python, but reading/decompressing/interpreting a TBasket is a NumPy call, about as fast as the hardware allows.)
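
Both effects of basket size, compression efficiency and the number of Python-level steps, show up even in a toy model. The sketch below compresses the same made-up data as many small blocks and as a few large ones, using zlib as a stand-in for whatever compression a real file uses; `write_baskets` is a hypothetical helper and the exact byte counts will differ from any real ROOT file.

```python
import struct
import zlib

# 100,000 float64 values with a repetitive pattern (made-up data)
values = [float(i % 10) for i in range(100_000)]

def write_baskets(entries_per_basket):
    """Compress the values in independent blocks, as a stand-in for TBaskets."""
    out = []
    for start in range(0, len(values), entries_per_basket):
        chunk = values[start:start + entries_per_basket]
        out.append(zlib.compress(struct.pack(f">{len(chunk)}d", *chunk)))
    return out

small = write_baskets(10)        # 80 bytes uncompressed per basket
large = write_baskets(10_000)    # about 78 kB uncompressed per basket

for label, baskets in (("small", small), ("large", large)):
    total_kb = sum(len(b) for b in baskets) / 1024
    # len(baskets) is also the number of Python-level loop iterations
    # a reader would need to get through the branch
    print(f"{label}: {len(baskets):5d} baskets, {total_kb:8.1f} kB compressed in total")
```

With small baskets, every block carries its own compression overhead and the compressor never sees enough data to find long repeats, and the reader has a thousand times as many blocks to step through.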