In [None]:
%%HTML
<style> div.output {flex-direction: row} div.output > div:only-child {width: 100%} </style>

In [None]:
import numpy
numpy.set_printoptions(linewidth=numpy.nan)

<img style="margin-left: auto; margin-right: auto; width: 50%" src="uproot-3.png"></img>

What's new?

   * more modularization
   * jagged array operations
   * writing files

## More modularization

uproot 2 was a single library (depending on Numpy and lz4).

<img style="margin-left: auto; margin-right: auto; width: 80%" src="abstraction-layers-before.png"></img>

## More modularization

uproot 3 splits out everything that is not I/O. We'll see the advantage in a moment.

<img style="margin-left: auto; margin-right: auto; width: 80%" src="abstraction-layers.png"></img>

## Jagged array operations

Jagged arrays are a minimal unit of nested structure: a list containing lists of varying lengths.

In [None]:
import uproot
import numpy
f = uproot.open("HZZ-objects.root")
t = f["events"]

In [None]:
a = t.array("muoniso")     # muon isolation variable; multiple per event
a

The implementation is a façade: these are not millions of list objects in memory but two arrays with methods to make them _behave like_ nested lists.

In [None]:
a.offsets

In [None]:
a.content

In [None]:
for i, x in enumerate(a):
    if i == 20:
        break
    print(i, x)
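
The two-array layout can be sketched in plain Numpy. This is only an illustration of the idea, not awkward-array's implementation; the data and the `getitem` helper are made up for this example.

```python
import numpy

# Hypothetical data standing for the nested list [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
content = numpy.array([1.1, 2.2, 3.3, 4.4, 5.5])   # all values, concatenated
offsets = numpy.array([0, 3, 3, 5])                # where each inner list starts and stops

def getitem(i):
    # inner list i is a slice (a view, not a copy) of the content
    return content[offsets[i]:offsets[i + 1]]

counts = offsets[1:] - offsets[:-1]                # lengths of the inner lists
nested = [getitem(i).tolist() for i in range(len(offsets) - 1)]
print(counts, nested)
```

No Python list objects exist until we ask for them: the "lists" are just slices bounded by neighboring offsets.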

Introducing additional types, like "table" as a struct of arrays presented as an array of structs, allows us to make tables of jagged arrays or jagged arrays of tables.

In [None]:
a = t.array("muonp4")
a

In [None]:
a.content.content

In [None]:
a.content.content.columns

In [None]:
a[2]["fX"]       # subscript commutativity: hide the AoS ↔ SoA distinction

In [None]:
a["fX"][2]
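
Why the two subscript orders agree can be seen in a toy version. `ToyTable` below is a hypothetical class written for this example (it is not awkward-array's table type): it stores columns, and only builds a row when asked for one.

```python
import numpy

class ToyTable:
    """Hypothetical struct of arrays that can be indexed like an array of structs."""
    def __init__(self, **columns):
        self.columns = columns

    def __getitem__(self, where):
        if isinstance(where, str):
            return self.columns[where]             # a column: one whole array
        # a row: one value taken from every column
        return {name: col[where] for name, col in self.columns.items()}

table = ToyTable(fX=numpy.array([1.0, 2.0, 3.0]), fY=numpy.array([10.0, 20.0, 30.0]))
row_then_column = table[2]["fX"]
column_then_row = table["fX"][2]
print(row_then_column, column_then_row)
```

Either order reaches the same number, so the user does not need to know which layout is in memory.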

We can also mix-in methods from **uproot-methods** to make physics-aware objects, arrays, and jagged arrays:

In [None]:
one = a[0][0]; two = a[0][1]; one, two

In [None]:
one + two

In [None]:
(one + two).mass

In [None]:
hastwo = (a.counts >= 2); ones = a[hastwo, 0]; twos = a[hastwo, 1]

In [None]:
ones + twos                # the plus operation "commutes" through the array

In [None]:
(ones + twos).mass         # the mass operation "commutes" through the array
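
A rough picture of how an operation can "commute" through an array: if the 4-vector components are stored as columns, addition and mass are ordinary Numpy expressions on whole columns. `ToyLorentzArray` is a hypothetical stand-in written for this example; it is not the uproot-methods class, and the numbers are invented.

```python
import numpy

class ToyLorentzArray:
    """Hypothetical array of 4-vectors stored as four Numpy columns."""
    def __init__(self, x, y, z, t):
        self.x, self.y, self.z, self.t = (numpy.asarray(c, dtype=float) for c in (x, y, z, t))

    def __add__(self, other):
        # one vectorized addition per component, for all pairs at once
        return ToyLorentzArray(self.x + other.x, self.y + other.y,
                               self.z + other.z, self.t + other.t)

    @property
    def mass(self):
        m2 = self.t**2 - self.x**2 - self.y**2 - self.z**2
        return numpy.sqrt(numpy.maximum(m2, 0.0))   # clip negative values (a choice made for this sketch)

# two events, each with a back-to-back pair of particles
ones = ToyLorentzArray(x=[3, 0], y=[0, 4], z=[0, 0], t=[3, 5])
twos = ToyLorentzArray(x=[-3, 0], y=[0, -4], z=[0, 0], t=[3, 5])
masses = (ones + twos).mass
print(masses)
```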

<img src="apl-timeline.png" align="right" style="margin-left: 50px; width: 40%"></img>

### Array programming

Expresses regular operations over rectangular data structures in shorthand.

   * Multidimensional slices: `rgb_pixels[0, 50:100, ::3]`
   * Elementwise operations: `all_pz = all_pt * sinh(all_eta)`
   * Broadcasting: `all_phi - 2*pi`
   * Masking: `data[trigger & (pt > 40)]`
   * Fancy indexing: `all_eta[argsort(all_pt)]`
   * Array reduction: `array.sum()` → scalar

Our data are not rectangular, but the syntax can be extended by defining rules for jaggedness.
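
For reference, here are the rectangular versions of those six idioms in plain Numpy. All of the arrays are small invented examples, named after the snippets in the list above.

```python
import numpy

rgb_pixels = numpy.arange(2 * 100 * 3).reshape(2, 100, 3)   # hypothetical image stack
red_strip = rgb_pixels[0, 50:100, ::3]                      # multidimensional slice

all_pt  = numpy.array([10.0, 50.0, 30.0, 70.0])
all_eta = numpy.array([0.0, 1.0, -1.0, 2.0])
trigger = numpy.array([True, True, False, True])

all_pz   = all_pt * numpy.sinh(all_eta)          # elementwise operation
shifted  = all_eta - 2 * numpy.pi                # broadcasting a scalar to every element
selected = all_pt[trigger & (all_pt > 40)]       # masking
by_pt    = all_eta[numpy.argsort(all_pt)]        # fancy indexing
total    = all_pt.sum()                          # reduction to a scalar
print(red_strip.shape, selected, by_pt, total)
```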

In [None]:
a2 = a[hastwo]
a2[::2, 0]                                                      # Multidimensional slices

In [None]:
pt = a.pt; eta = a.eta
pt * numpy.sinh(eta)                                            # Elementwise operations

In [None]:
multi_per_event = a.phi; one_per_event = t.array("MET").phi
multi_per_event - one_per_event                                 # Broadcasting

In [None]:
a[a.pt > 40]                                                    # Masking by jagged (selects particles)

In [None]:
a[a.pt.max() > 40]                                              # Masking by flat (selects events)

In [None]:
i = abs(a.eta).argmax()
i

In [None]:
a[i]                                                            # Fancy indexing

In [None]:
abs(a.eta).max()                                                # Jagged reduction

In [None]:
import awkward                                                                 # Simple, synthetic examples
a = awkward.JaggedArray.fromiter([[   1,     2,    3], [], [    4,    5]])     # to illustrate the idea
b = awkward.JaggedArray.fromiter([[  10,    20,   30], [], [   40,   50]])
m = awkward.JaggedArray.fromiter([[True, False, True], [], [False, True]])
flat  = numpy.array([  100,  200,  300])
mflat = numpy.array([False, True, True])
i = awkward.JaggedArray.fromiter([[2, 1], [], [1, 1, 0, 1]])

In [None]:
a + b                                                                          # Elementwise operations

In [None]:
a + flat                                                                       # Broadcasting

In [None]:
a[m]                                                                           # Masking by jagged (selects particles)

In [None]:
a[mflat]                                                                       # Masking by flat (selects events)

In [None]:
a[i]                                                                           # Fancy indexing

In [None]:
a.sum()                                                                        # Jagged reduction
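
One way to think about these rules is in terms of the flat content and the per-event counts. The sketch below reproduces four of the synthetic examples with plain Numpy; it is my own illustration of the rules, not necessarily how awkward-array implements them, and `to_lists` is a hypothetical helper used only to display results.

```python
import numpy

# Each jagged array is a flat content plus the length of every inner list.
counts    = numpy.array([3, 0, 2])
a_content = numpy.array([1, 2, 3, 4, 5])                     # [[1, 2, 3], [], [4, 5]]
b_content = numpy.array([10, 20, 30, 40, 50])                # [[10, 20, 30], [], [40, 50]]
m_content = numpy.array([True, False, True, False, True])    # [[True, False, True], [], [False, True]]
flat  = numpy.array([100, 200, 300])
mflat = numpy.array([False, True, True])

def to_lists(content, counts):
    # rebuild Python lists, for display only
    return [x.tolist() for x in numpy.split(content, numpy.cumsum(counts)[:-1])]

offsets = numpy.concatenate([[0], numpy.cumsum(counts)])

# Elementwise: same structure on both sides, so just combine the contents.
elementwise = to_lists(a_content + b_content, counts)

# Broadcasting: repeat each per-event value once per particle in that event.
broadcast = to_lists(a_content + numpy.repeat(flat, counts), counts)

# Masking by jagged: filter the content, then recount survivors per event.
running = numpy.concatenate([[0], numpy.cumsum(m_content)])
kept = running[offsets[1:]] - running[offsets[:-1]]
by_jagged = to_lists(a_content[m_content], kept)

# Masking by flat: keep or drop whole events.
by_flat = to_lists(a_content[numpy.repeat(mflat, counts)], counts[mflat])
print(elementwise, broadcast, by_jagged, by_flat)
```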

<img src="logscales.png" style="margin-left: auto; margin-right: auto; width: 90%"></img>

Okay, but what about nested for loops? We want something like a jagged "cross join."

In [None]:
import awkward
a = awkward.JaggedArray.fromiter([[1, 2, 3], [], [4, 5], [6], [7, 8, 9]])
b = awkward.JaggedArray.fromiter([[100, 200], [300], [400], [500, 600, 700], [800, 900]])

In [None]:
a.cross(b)   # .tolist()

In [None]:
print(a.cross(b)._0)
print(a.cross(b)._1)
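
For comparison, this is the nested for loop that a cross join stands for, written with ordinary Python lists. `cross_per_event` is a hypothetical helper for this example, and the order of pairs within an event is my assumption about what `cross` produces.

```python
a = [[1, 2, 3], [], [4, 5], [6], [7, 8, 9]]
b = [[100, 200], [300], [400], [500, 600, 700], [800, 900]]

def cross_per_event(lefts, rights):
    out = []
    for left, right in zip(lefts, rights):                    # loop over events
        out.append([(x, y) for x in left for y in right])     # nested loop within one event
    return out

crossed = cross_per_event(a, b)
print(crossed[0])                    # all (a, b) combinations in the first event
print([len(x) for x in crossed])     # number of pairs per event: len(a[i]) * len(b[i])
```

Note that pairs are only formed within the same event, never across events.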

In [None]:
leptoquarks = t.array("muonp4").cross(t.array("jetp4"))
leptoquarks                                                          # all muon-jet pairs in each event

In [None]:
leptoquarks._0 + leptoquarks._1                                      # the muon in each pair plus the jet in each pair

In [None]:
(leptoquarks._0 + leptoquarks._1).mass                               # the mass of each pair

In [None]:
import physt                                                         # histogram library, first used here
physt.h1((leptoquarks._0 + leptoquarks._1).mass.flatten()).plot();   # a one-line search for leptoquarks

What about nested for loops _without duplicates?_

In [None]:
import awkward
a = awkward.JaggedArray.fromiter([[], [1], [1, 2], [1, 2, 3], [1, 2, 3, 4]])

In [None]:
a.pairs().tolist()               # same=False
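
The equivalent loop over plain Python lists can be written with `itertools`. `pairs_per_event` is a hypothetical helper for this example; I am assuming that the `same` flag decides whether an element may be paired with itself, with `same=True` as the default.

```python
import itertools

a = [[], [1], [1, 2], [1, 2, 3], [1, 2, 3, 4]]

def pairs_per_event(events, same=True):
    # same=True keeps (x, x) pairs; same=False only pairs distinct elements
    combine = itertools.combinations_with_replacement if same else itertools.combinations
    return [list(combine(event, 2)) for event in events]

with_self = pairs_per_event(a, same=True)
without_self = pairs_per_event(a, same=False)
print([len(x) for x in with_self])      # n*(n + 1)/2 pairs per event
print([len(x) for x in without_self])   # n*(n - 1)/2 pairs per event
```

In both cases each unordered pair appears once, unlike the cross join of an array with itself.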

In [None]:
zcandidates = t.array("muonp4").pairs(same=False)
(zcandidates._0 + zcandidates._1).mass

In [None]:
physt.h1((zcandidates._0 + zcandidates._1).mass.flatten(), bins=100).plot();

In [None]:
charges = t.array("muonq").pairs(same=False)
cut = (charges._0 * charges._1 < 0)

In [None]:
physt.h1((zcandidates[cut]._0 + zcandidates[cut]._1).mass.flatten(), bins=100).plot();

<img style="float: right; width: 10%" src="jaydeep.jpg"></img>

## Credit

Broadcasting, cross, pairs, and a vectorized jagged reduction algorithm were developed by Jaydeep Nandi, a Google Summer of Code student.

None of them involve for loops (not even for loops in C), which makes them good candidates for GPU acceleration.
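
To give a flavor of what "no for loops" means, here is one simple way to sum every inner list using only whole-array Numpy operations. This cumulative-sum trick is my own illustration; it is not Jaydeep's algorithm, and for floating point data it can lose precision that a more careful method would keep.

```python
import numpy

content = numpy.array([1, 2, 3, 4, 5])      # [[1, 2, 3], [], [4, 5]]
offsets = numpy.array([0, 3, 3, 5])

# running[k] is the sum of the first k content values
running = numpy.concatenate([[0], numpy.cumsum(content)])

# each inner list's sum is a difference of two running totals;
# empty lists have equal start and stop, so they come out as zero
sums = running[offsets[1:]] - running[offsets[:-1]]
print(sums)
```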

<img style="margin-left: auto; margin-right: auto; width: 35%" src="sum_rates_logy.png"></img>

<div style="margin-left: auto; margin-right: auto; width: 70%">
<p>Are there other looping constructs that can't be expressed like this, which would force you to write a for loop?</p>

<p style="font-weight: bold">Probably.</p>

<p>But when you encounter such instances, let me know and we'll think about new primitives beyond "cross" and "pairs" for those cases.</p>
</div>

Purely for streamlined expression (syntactic sugar), a few higher-order functions have been defined.

   * `array.apply(function)` performs `function(array)`
   * `array.filter(function)` performs `array[function(array)]`
   * `array.maxby(function)` performs `array[function(array).argmax()]`
   * `array.minby(function)` performs `array[function(array).argmin()]`
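
Spelled out as free functions on a flat Numpy array, the four definitions look like this. The function names mirror the methods above, but the functions themselves are hypothetical and written only for this example (with a trailing underscore on `filter_` to avoid shadowing the Python builtin); on a jagged array, `argmax` and `argmin` would act per event rather than on the whole array.

```python
import numpy

def apply(array, function):
    return function(array)

def filter_(array, function):
    return array[function(array)]

def maxby(array, function):
    return array[function(array).argmax()]

def minby(array, function):
    return array[function(array).argmin()]

eta = numpy.array([-2.5, 0.3, 1.7, -0.8])       # invented values
doubled      = apply(eta, lambda x: 2 * x)
central      = filter_(eta, lambda x: abs(x) < 1)
most_forward = maxby(eta, lambda x: abs(x))
most_central = minby(eta, lambda x: abs(x))
print(doubled, central, most_forward, most_central)
```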

In [None]:
physt.h1(t.array("muonp4")                          # get the muon 4-vectors
          .filter(lambda muon: abs(muon.eta) < 1)   # select central muons (select particles, not events)
          .pairs(same=False)                        # form all non-duplicate pairs
          .apply(lambda a, b: a + b)                # compute Z candidates as 4-vector sums
          .maxby(lambda z: z.pt)                    # select one per event, the highest pT
          .flatten()                                # flatten [x] → x and [] → nothing (ignore empty events)
          .mass,                                    # compute the masses of what remains
         bins=100).plot();

## Where is this headed?

**awkward-array** is distinct from **uproot**, with potential uses on data beyond ROOT files.

In the next few months, I hope to...

   * add support for other "awkward" array types: chunked, masked, indexed
   * add Pandas extensions so that Pandas columns can be "awkward"
   * add Numba extensions so you can write fast for loops if you need to
   * add Dask extensions so you can distribute work across a cluster
   * use Apache Arrow as input, which could allow efficient processing of nested data in PySpark (depending on Spark developments to provide Arrow buffers)

## Writing files

uproot can now write histograms to files. It has the same dict-like interface as reading:

In [None]:
f = uproot.recreate("tmp.root")                                  # instead of uproot.open
f["name"] = numpy.histogram(numpy.random.normal(0, 1, 100000))   # any kind of histogram

In [None]:
f["name"].show()                                                 # read it back out
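
For readers who haven't used it, `numpy.histogram` returns a plain 2-tuple of bin counts and bin edges; that tuple is what gets assigned to `f["name"]` above. A small example with invented values:

```python
import numpy

values = numpy.array([-1.5, -0.5, -0.2, 0.1, 0.4, 1.2, 5.0])
counts, edges = numpy.histogram(values, bins=4, range=(-2, 2))

print(counts)    # one count per bin
print(edges)     # one more edge than there are bins
# values outside the range (5.0 here) are not counted by numpy.histogram
```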

In [None]:
import ROOT
c = ROOT.TCanvas()

In [None]:
f = ROOT.TFile("tmp.root")                    # ROOT can read it, too
h = f.Get("name")
h.Draw()
c.Draw()

In [None]:
f = ROOT.TFile("tmp.root", "UPDATE")          # ROOT can add to the same file, too
h = ROOT.TH1D("another", "", 10, -5, 5)
for x in numpy.random.normal(0, 1, 100000):
    h.Fill(x)
h.Write()
f.Close()

In [None]:
f = uproot.open("tmp.root")
f["another"].show()

uproot could become a clearinghouse for histograms from different libraries.

In [None]:
%matplotlib inline
import physt                       # physt is a pure Python histogram library: https://physt.readthedocs.io

In [None]:
h = physt.h1(numpy.random.normal(0, 1, 100000), bins=16, range=(-4, 4), name="physt histogram")
h.plot();

In [None]:
f = uproot.recreate("tmp.root")   # save the physt histogram as a TH1D (making the necessary translations)
f["name"] = h

In [None]:
f = uproot.open("tmp.root")       # read the ROOT histogram back and convert it to physt
f["name"].physt().plot();

In [None]:
f["name"].numpy()                  # or Numpy, a format Matplotlib recognizes

In [None]:
f = ROOT.TFile("tmp.root")         # but look, it's really a ROOT file; ROOT recognizes it as a histogram
h = f.Get("name")
h.Draw()
c.Draw()

This can also include new ways of representing histograms.

In [None]:
f = uproot.open("tmp.root")
print(f["name"].hepdata())      # YAML format for the HEPData archival site

Including an idea I've been working on: Pandas DataFrames with an interval index _are_ histograms.

In [None]:
f = uproot.open("tmp.root")
h = f["name"].pandas()              # read the histogram as a DataFrame with interval index
h

In [None]:
f = uproot.recreate("tmp.root")     # write DataFrames in the same format as ROOT histograms
f["another"] = h

Given the way Pandas handles indexes (interval indexes in particular), Pandas-as-histograms are sparse histograms.

In [None]:
f = uproot.recreate("tmp.root")
f["one"]   = numpy.histogram(numpy.random.normal(1, 0.6, 10000), bins=8, range=(0, 8))
f["two"]   = numpy.histogram(numpy.random.normal(3, 0.4, 10000), bins=8, range=(0, 8))
f["three"] = numpy.histogram(numpy.random.normal(9, 0.6, 100000), bins=8, range=(0, 8))
one   = f["one"].pandas()
two   = f["two"].pandas()
three = f["three"].pandas()

In [None]:
from IPython.display import display
display(one); display(two); display(three);

Adding DataFrames matches up intervals and fills in missing values with NaN (or with 0 if we explicitly pass `fill_value=0`).

In [None]:
import functools
def add(*args):
    return functools.reduce(lambda x, y: x.add(y, fill_value=0), args)

display(add(one, two)); display(add(two, three)); display(add(one, two, three))
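
The same behavior can be reproduced without a ROOT file. In this sketch, `as_frame` is a hypothetical helper, the `"count"` column name is my own choice, and dropping empty bins is an assumption made to get a sparse example; none of these are claimed to match what uproot's `.pandas()` returns.

```python
import numpy
import pandas

def as_frame(counts, edges):
    # one row per bin, indexed by the bin's interval
    index = pandas.IntervalIndex.from_breaks(edges, closed="left")
    frame = pandas.DataFrame({"count": counts}, index=index)
    return frame[frame["count"] != 0]            # keep only filled bins (sparse)

edges = numpy.array([0.0, 1.0, 2.0, 3.0, 4.0])
one = as_frame(numpy.array([5, 3, 0, 0]), edges)
two = as_frame(numpy.array([0, 2, 7, 0]), edges)

nan_sum  = one + two                     # bins missing on either side become NaN
zero_sum = one.add(two, fill_value=0)    # missing bins are treated as empty
print(nan_sum)
print(zero_sum)
```

Only bins present in at least one input appear in the result, which is what makes the representation sparse.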

In [None]:
f["all"] = add(one, two, three)
f["all"].physt().plot();

## Modularized file-writing and histogram-conversion

   * all of the histogram-writing code (what bytes go to the file) is in **uproot**
   * all of the code that recognizes different histogram libraries and converts them is in **uproot-methods**

**uproot-methods** can be updated independently from (and more rapidly than) **uproot**.

<img style="margin-left: auto; margin-right: auto; width: 70%" src="abstraction-layers.png"></img>

<img style="float: right; width: 40%" src="pratyush.jpg"></img>

## Credit

The ROOT-writing feature was developed by Pratyush Das, a DIANA-HEP undergraduate fellow.

(Most of the work was writing _anything_ to a ROOT file; histograms were done in the last week!)

In the meantime, try it out!

<div style="display: block; width: 80%; margin-left: auto; margin-right: auto; margin-top: 100px; margin-bottom: 100px">
    <tt>pip install uproot</tt>
</div>

Live tutorials (Binder) are available on uproot's GitHub site.

**Thanks!**