<br><br><br><br><br>

# Columnar data analysis

<br><br><br><br><br>

<br><br><br><br>

<p style="font-size: 1.25em">Array programming is a programming language paradigm like Object-Oriented Programming (OOP) and Functional Programming (FP).</p>

<br>

<p style="font-size: 1.25em">As physicists, we are mostly familiar with <i>imperative, procedural, structured, object-oriented programming</i> (see <a href="https://en.wikipedia.org/wiki/Comparison_of_programming_paradigms#Main_paradigm_approaches">this list</a>).</p>

<br><br><br><br>

In [None]:
from IPython.display import IFrame
IFrame("http://zoom.it/6rJp", width="100%", height="440")

<br>

<p style="font-size: 1.25em">Array programming is common to languages and systems designed for interactive data analysis.</p>

<img src="img/apl-timeline.png" width="100%">

<br>

<br>

<table align="left" width="33%" style="margin-right: 50px">
<tr style="background: white"><td><img src="img/apl-keyboard.jpg" width="100%"></td></tr>
<tr style="background: white"><td style="text-align: center"><i>Special keyboard for all the symbols.</i></td></tr>
<tr style="background: white"><td align="center"><img src="img/tshirt.jpg" width="50%"></td></tr>
<tr style="background: white"><td style="text-align: center"><i>A program was a struggle to write, but T-shirt fodder when it worked.</i></td></tr>
</table>

<br>

<p style="font-size: 1.25em">APL (1963) pioneered programming language conciseness—and discovered the mistake of being too concise.</p>

| APL | <br> | Numpy |
|:---:|:----:|:-----:|
| <tt>ι4</tt> | <br> | <tt>numpy.arange(4)</tt> |
| <tt>(3+ι4)</tt> | <br> | <tt>numpy.arange(4) + 3</tt> |
| <tt>+/(3+ι4)</tt> | <br> | <tt>(numpy.arange(4) + 3).sum()</tt> |
| <tt>m ← +/(3+ι4)</tt> | <br> | <tt>m = (numpy.arange(4) + 3).sum()</tt> |

(The other extreme is writing for loops for each of the above.)
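
For comparison, here is the last row of the table written both ways: as an explicit loop and as a single array expression. This is only a sketch; the names `m_loop` and `m_array` are made up for the illustration.

```python
import numpy

# The "other extreme": an explicit for loop over each element.
m_loop = 0
for i in range(4):
    m_loop += i + 3

# Array-at-a-time: the Numpy expression from the last row of the table.
m_array = (numpy.arange(4) + 3).sum()

print(m_loop, m_array)
```

Both compute the same number; the difference is whether the loop is spelled out by the user or implied by the array operations.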

<br>

<br><br><br><br>

<p style="font-size: 1.25em">The fundamental data type in this world is an array. (Some array languages don't even have non-arrays.)</p>

<br>

<p style="font-size: 1.25em">Unlike the others (APL, IDL, MATLAB, R), Numpy is a library, not a language, though its predecessor, Numeric, dates back to Python's early years (1995) and significantly influenced Python's grammar.</p>

<br><br><br><br>

In [None]:
# Assortment of ways to make Numpy arrays

import numpy, uproot
print(numpy.arange(20),                                        end="\n\n")
print(numpy.linspace(-5, 5, 21),                               end="\n\n")
print(numpy.empty(10000, numpy.float16),                       end="\n\n")
print(numpy.full((2, 7), 999),                                 end="\n\n")
print(numpy.random.normal(-1, 0.0001, 10000),                  end="\n\n")
print(uproot.open("data/Zmumu.root")["events"]["E1"].array(),  end="\n\n")

<br><br>

<center><img src="img/numpy-memory-layout.png" width="90%"></center>

<br><br>

In [None]:
a = numpy.array([2**30, 2**30 + 2**26, -1, 0, 2**30 + 2**24, 2**30 + 2**20], numpy.int32)
# a = a.view(numpy.float32)
# a = a.reshape((2, 3))
# a = a.astype(numpy.int64)

print("data:\n", a, end="\n\n")
print("type:", type(a), end="\n\n")
print("dtype (type of the data it contains):", a.dtype, end="\n\n")
print("shape (size of each dimension):", a.shape, end="\n\n")
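
The commented-out lines in the cell above do quite different things to the same array. As a small sketch (using a shortened version of the array, with made-up variable names), `view` reinterprets the existing bytes without copying them, while `astype` converts the values into a new buffer:

```python
import numpy

a = numpy.array([2**30, 2**30 + 2**26, -1, 0], numpy.int32)

# view: same bytes in memory, reinterpreted as 32-bit floats (no copy)
as_float = a.view(numpy.float32)

# astype: values converted to 64-bit integers (new buffer)
as_int64 = a.astype(numpy.int64)

print("view:  ", as_float, "shares memory:", numpy.shares_memory(a, as_float))
print("astype:", as_int64, "shares memory:", numpy.shares_memory(a, as_int64))
```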

In [None]:
# Any mathematical function that would map scalar arguments to a scalar result
#                                      maps array arguments to an array result.

a_array = numpy.random.uniform(5, 10, 10000);     a_scalar = a_array[0]
b_array = numpy.random.uniform(10, 20, 10000);    b_scalar = b_array[0]
c_array = numpy.random.uniform(-0.1, 0.1, 10000); c_scalar = c_array[0]

def quadratic_formula(a, b, c):
    return (-b + numpy.sqrt(b**2 - 4*a*c)) / (2*a)

print("scalar:\n", quadratic_formula(a_scalar, b_scalar, c_scalar), end="\n\n")
print("array:\n",  quadratic_formula(a_array,  b_array,  c_array), end="\n\n")

In [None]:
# Each step in the calculation is performed over whole arrays before moving on to the next.

a, b, c = a_array, b_array, c_array

roots1 = (-b + numpy.sqrt(b**2 - 4*a*c)) / (2*a)

tmp1 = numpy.negative(b)            # -b
tmp2 = numpy.square(b)              # b**2
tmp3 = numpy.multiply(4, a)         # 4*a
tmp4 = numpy.multiply(tmp3, c)      # tmp3*c
tmp5 = numpy.subtract(tmp2, tmp4)   # tmp2 - tmp4
tmp6 = numpy.sqrt(tmp5)             # sqrt(tmp5)
tmp7 = numpy.add(tmp1, tmp6)        # tmp1 + tmp6
tmp8 = numpy.multiply(2, a)         # 2*a
roots2 = numpy.divide(tmp7, tmp8)   # tmp7 / tmp8

roots1, roots2

In [None]:
# Even comparison operators are element-by-element.

roots1 == roots2

In [None]:
# So use a reducer (e.g. sum, max, min, any, all) to turn the array into a scalar.

(roots1 == roots2).all()

<br><br><br><br><br>

<p style="font-size: 1.25em">Just as a Numpy array is a common data type, this is a common function type: "universal functions" or "ufuncs."</p>
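
To make the term concrete, the following sketch checks that `numpy.sqrt` and `numpy.add` are instances of `numpy.ufunc` and uses two of the methods that binary ufuncs carry. The array `x` is an arbitrary example.

```python
import numpy

x = numpy.array([1.0, 4.0, 9.0])

# Functions like sqrt and add are instances of numpy.ufunc.
print(isinstance(numpy.sqrt, numpy.ufunc), isinstance(numpy.add, numpy.ufunc))

roots   = numpy.sqrt(x)             # applied element by element
total   = numpy.add.reduce(x)       # same result as x.sum()
running = numpy.add.accumulate(x)   # same result as numpy.cumsum(x)

print(roots, total, running)
```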

<br><br><br><br><br>

In [None]:
px, py, pz = uproot.open("data/Zmumu.root")["events"].arrays("p[xyz]1", outputtype=tuple)

p = numpy.sqrt(px**2 + py**2 + pz**2)
p

In [None]:
# But what if there are multiple values per event?

uproot.open("data/HZZ.root")["events"].array("Muon_Px")

In [None]:
# JaggedArray can be used in place of a Numpy array in some contexts,
# such as array-at-a-time math. Functions like numpy.sqrt recognize it.

px, py, pz = uproot.open("data/HZZ.root")["events"].arrays(["Muon_P[xyz]"], outputtype=tuple)

numpy.sqrt(px**2 + py**2 + pz**2)

<br><br>

<center><img src="img/numpy-memory-broadcasting.png" width="75%"></center>

<br><br>

In [None]:
E, px, py, pz = uproot.open("data/Zmumu.root")["events"].arrays(["E1", "p[xyz]1"], outputtype=tuple)

# Numpy arrays
#                   array   array   array   scalar
energy = numpy.sqrt(px**2 + py**2 + pz**2 + 0.1056583745**2)
energy, E

In [None]:
E, px, py, pz = uproot.open("data/HZZ.root")["events"].arrays(["Muon_E", "Muon_P[xyz]"], outputtype=tuple)

# JaggedArrays
#                   array   array   array   scalar
energy = numpy.sqrt(px**2 + py**2 + pz**2 + 0.1056583745**2)
energy, E

In [None]:
import awkward  # the library that defines JaggedArrays and other "awkward" arrays

scalar = 1000
flat   = numpy.array([100, 200, 300])
jagged = awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])

# With JaggedArrays, there are more broadcasting cases:
print(f"scalar + flat:   {scalar + flat}")
print(f"\nscalar + jagged: {scalar + jagged}")
print(f"\n  flat + jagged: {flat + jagged}")
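
One way to picture the `flat + jagged` case, using only Numpy: store the jagged array as its flattened `content` plus the number of items in each inner list (`counts`), then repeat each flat value once per item. This is an illustration of the idea under those assumptions, not necessarily how awkward-array implements it.

```python
import numpy

# The jagged array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] as content + counts
content = numpy.array([1.1, 2.2, 3.3, 4.4, 5.5])
counts  = numpy.array([3, 0, 2])

flat = numpy.array([100, 200, 300])

# Repeat each flat value once per item in the corresponding inner list.
# The 200 disappears because the second inner list is empty.
expanded = numpy.repeat(flat, counts)

# Now it is ordinary element-by-element addition.
result_content = expanded + content
print(expanded)
print(result_content)
```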

In [None]:
# Using jagged broadcasting in physics

jetx, jety, metx, mety = uproot.open("data/HZZ.root")["events"].arrays(
    ["Jet_P[xy]", "MET_p[xy]"], outputtype=tuple)

jet_phi = numpy.arctan2(jety, jetx)
met_phi = numpy.arctan2(mety, metx)

print(f"multi per event: {jet_phi}")
print(f"one per event:   {met_phi}")

print(f"\ndifference:      {jet_phi - met_phi}")

In [None]:
# Q: What about ensuring that each delta-phi is between -pi and pi without if/then?
# A: You start to pick up tricks, like this:

raw_diff = jet_phi - met_phi

bounded_diff = (raw_diff + numpy.pi) % (2*numpy.pi) - numpy.pi

# Should dphi be a library function? That's the kind of question we think about...

raw_diff, bounded_diff
# bounded_diff.flatten().min(), bounded_diff.flatten().max()
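
The modulo trick can be checked on a handful of plain Numpy values. The sketch below uses made-up angles and verifies that the bounded result lies in [-π, π) and points in the same direction as the raw difference.

```python
import numpy

raw = numpy.array([-2*numpy.pi, -3.5, -0.5, 0.0, 0.5, 3.5, 2*numpy.pi])

# Shift by pi, wrap into [0, 2*pi), then shift back.
bounded = (raw + numpy.pi) % (2*numpy.pi) - numpy.pi

print(bounded)
print("in range:      ", ((bounded >= -numpy.pi) & (bounded < numpy.pi)).all())
print("same direction:", numpy.allclose(numpy.sin(bounded), numpy.sin(raw)) and
                          numpy.allclose(numpy.cos(bounded), numpy.cos(raw)))
```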

<br><br><br><br><br>

<p style="font-size: 1.25em; text-align: center"><b>Reducers:</b> any, all, count, count_nonzero, sum, prod, min, max</p>

<br><br><br><br><br>

In [None]:
# Another way JaggedArrays extend Numpy arrays:

# Reducers, like sum, min, max, turn flat arrays into scalars.

met_phi.min(), met_phi.max()

In [None]:
# Another way JaggedArrays extend Numpy arrays:

# Reducers, like sum, min, max, turn jagged arrays into flat arrays.

jet_phi.min(), jet_phi.max()

In [None]:
# The meaning of flat.sum() is "sum of all elements of the flat array."
# The meaning of jagged.sum() is "sum of all elements in each inner array."

jagged = awkward.fromiter([[1.0, 2.0, 3.0], [], [4.0, 5.0]])
jagged.sum()   # min, max

In [None]:
# jagged.sum().sum() completes the process, resulting in a scalar. But,
# jagged.flatten().sum() does the same thing. Why?

jagged.sum().sum(), jagged.flatten().sum()
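
The two agree because addition is associative: grouping the terms by inner list before adding does not change the total. A Numpy-only sketch of the same jagged array, assuming it is stored as `content` with `starts` and `stops` boundaries (the cumulative-sum trick is just one convenient way to get per-list sums, not awkward-array's implementation):

```python
import numpy

# [[1.0, 2.0, 3.0], [], [4.0, 5.0]] as content + starts/stops
content = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])
starts  = numpy.array([0, 3, 3])
stops   = numpy.array([3, 3, 5])

# Per-list sums as differences of a running total (handles empty lists).
running = numpy.concatenate(([0.0], numpy.cumsum(content)))
per_list = running[stops] - running[starts]

print(per_list)                        # one sum per inner list
print(per_list.sum(), content.sum())   # grouped total vs. flattened total
```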

In [None]:
# mean, var, std are also available, just like Numpy, but these aren't associative.

# "Don't do a mean of means unless you mean it!"

jet_phi.mean()
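
A tiny numerical illustration of the warning, with made-up values: the mean of per-list means differs from the mean of all items unless each list is weighted by its length.

```python
import numpy

# [[1.0, 2.0, 3.0], [4.0, 5.0]] as content + counts (no empty lists here)
content = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])
counts  = numpy.array([3, 2])
starts  = numpy.cumsum(counts) - counts

per_list_mean = numpy.add.reduceat(content, starts) / counts

mean_of_means = per_list_mean.mean()
overall_mean  = content.mean()
weighted_mean = numpy.average(per_list_mean, weights=counts)

print(per_list_mean, mean_of_means, overall_mean, weighted_mean)
```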

In [None]:
# Also worth noting that any and all are reducers... of booleans.

same_hemicircle = (abs(bounded_diff) < numpy.pi/2)

print(f"same_hemicircle:             {same_hemicircle}")
print(f"same_hemicircle.any():       {same_hemicircle.any()}")
print(f"same_hemicircle.any().any(): {same_hemicircle.any().any()}")
print(f"same_hemicircle.any().all(): {same_hemicircle.any().all()}")
print(f"same_hemicircle.all():       {same_hemicircle.all()}")
print(f"same_hemicircle.all().any(): {same_hemicircle.all().any()}")
print(f"same_hemicircle.all().all(): {same_hemicircle.all().all()}")

<br><br><br><br><br>

<p style="font-size: 1.25em; text-align: center"><b>Slicing:</b> single-item extraction, filtering (cuts), rearrangement</p>

<br><br><br><br><br>

In [None]:
# Basic array slicing is the same as Python list slicing

a = numpy.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

for expr in ["a[3]      ", "a[3:]     ", "a[:3]     ",
             "a[3:7]    ", "a[3:7:2]  ", "a[::2]    "]:
    print(expr, "=", eval(expr))

print()
for expr in ["a[-3]     ", "a[-3:]    ", "a[:-3]    ",
             "a[-7:-3]  ", "a[-7:-3:2]", "a[::-1]   "]:
    print(expr, "=", eval(expr))

In [None]:
# But multidimensional arrays can be sliced with an extension of list slicing.
a = numpy.array([[ 0,  1,  2,  3,  4,  5],
                 [10, 11, 12, 13, 14, 15],
                 [20, 21, 22, 23, 24, 25],
                 [30, 31, 32, 33, 34, 35]])
for expr in "a[2:, 1:]", "a[:, 1:-1]", "a[::2, ::2]", "a[:, 3]":
    print(expr, " =\n", eval(expr), sep="", end="\n\n")

<br>

<center><img src="img/numpy-slicing.png" width="40%"></center>

<br>

In [None]:
# Masking: using an array of booleans as a slice

a    = numpy.array([  1.1,   2.2,   3.3,   4.4,  5.5,   6.6,  7.7,   8.8,  9.9])
mask = numpy.array([False, False, False, False, True, False, True, False, True])
#                                                5.5          7.7          9.9

for expr in "a[mask]", "a < 5", "a[a < 5]":
    print(expr, " =\n", eval(expr), sep="", end="\n\n")
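
Masks can be combined before they are applied. The sketch below uses an arbitrary integer array; note the parentheses, which are needed because `&` and `|` bind more tightly than comparisons in Python.

```python
import numpy

a = numpy.arange(10)

both    = (a > 2) & (a % 2 == 0)   # element-by-element "and"
either  = (a < 2) | (a > 7)        # element-by-element "or"
neither = ~both                    # element-by-element "not"

print("a[both]    =", a[both])
print("a[either]  =", a[either])
print("a[neither] =", a[neither])
```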

In [None]:
# Five-minute exercise: plot masses with (1) opposite charges and
#                                        (2) both muon abs(eta) < 1
arrays = uproot.open("data/Zmumu.root")["events"].arrays(namedecode="utf-8")
print(arrays.keys())
for n in arrays:
    exec(f"{n} = arrays['{n}']")

import matplotlib.pyplot
matplotlib.pyplot.hist(M, bins=100);

In [None]:
# What if the boolean mask is jagged?

E, px, py, pz, q = uproot.open("data/HZZ.root")["events"].arrays(
    ["Muon_E", "Muon_P[xyz]", "Muon_Charge"], outputtype=tuple)

print(f"q:        {q}")
print(f"\nq > 0:    {q > 0}")
print(f"\nE:        {E}")
print(f"\nE[q > 0]: {E[q > 0]}")

In [None]:
# JaggedArray slicing does what Numpy does in the cases that overlap...

x = awkward.fromiter([[1.1, 2.2, 3.3, 4.4], [5.5, 6.6], [7.7, 8.8, 9.9]])
print(f"x                      = {x}")

# take the first two inner arrays
print(f"\nx[:2]                  = {x[:2]}")

# take the first two of each inner array
print(f"\nx[:, :2]               = {x[:, :2]}")

# mask outer lists
print(f"\nx[[True, False, True]] = {x[[True, False, True]]}")

In [None]:
# ... and naturally extend it in the new cases.

x      = awkward.fromiter([[ 1.1,   2.2,  3.3], [  4.4,   5.5], [ 6.6,  7.7,  8.8]])
mask   = awkward.fromiter([        True,             False,             True      ])
jmask  = awkward.fromiter([[True, False, True], [False, False], [True, True, True]])

print(f"x[mask]  = {x[mask]}")       # mask outer array
print(f"\nx[jmask] = {x[jmask]}")    # mask inner arrays

In [None]:
# In Numpy, arrays of integers can also be used as indexes.

a = numpy.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

print("selects elements, possibly out of order")
index = numpy.array([3, 5, 0, 9])
print("a[[3, 5, 0, 9]] =", a[index])

print("\nmay use negative indexing, just like single integers and slices")
index = numpy.array([3, 5, 0, -1, -2, -3])
print("a[[3, 5, 0, -1, -2, -3]] =", a[index])

print("\nmay include repetitions(!)")
index = numpy.array([3, 5, 0, 9, 9, 9, 3, 5, 0])
print("a[[3, 5, 0, 9, 9, 9, 3, 5, 0]] =", a[index])

In [None]:
# What is integer indexing good for?

permutation = eta1.argsort()                   # also try abs(eta1).argsort()

print(f"permutation:\n{permutation}")

print(f"\n\nsorted eta1:\n{eta1[permutation]}")

print(f"\n\nE1 sorted by eta1:\n{E1[permutation]}")

In [None]:
# Integer indexes with JaggedArrays:

x      = awkward.fromiter([[ 1.1, 2.2, 3.3, 4.4], [5.5, 6.6], [7.7, 8.8, 9.9]])
index  = awkward.fromiter([-1, 0, 0])
jindex = awkward.fromiter([[0, 0, -1], [0, 0, -1], [0, 0, -1]])

print(f"x[index]  = {x[index]}")       # rearrange outer array
print(f"\nx[jindex] = {x[jindex]}")    # rearrange inner arrays

In [None]:
# Use case for jagged indexing: argmin and argmax

E, px, py, pz, q = uproot.open("data/HZZ.root")["events"].arrays(
    ["Muon_E", "Muon_P[xyz]", "Muon_Charge"], outputtype=tuple)

eta = numpy.arctanh(pz / numpy.sqrt(px**2 + py**2 + pz**2))
print(f"eta:            {eta}")

maxabseta = abs(eta).argmax()
print(f"\nmaxabseta:      {maxabseta}")

print(f"\neta[maxabseta]: {eta[maxabseta]}")   # eta with max |eta| per event

print(f"\nE[maxabseta]:   {E[maxabseta]}")     # energy with max |eta| per event

In [None]:
# Array indexing is useful in surprising ways because it's a basic mathematical
# operation: thinking of f[x] as a function, array indexing is function composition.

# Take any two non-negative functions of integers...
def f(x):
    return x**2 - 5*x + 10
def g(y):
    return max(0, 2*y - 10) + 3

# ... and sample them as arrays
F   = numpy.array([f(i) for i in numpy.arange(10)])     # F is f at 10 elements
G   = numpy.array([g(i) for i in numpy.arange(100)])    # G is g at enough elements to include max(f)
GoF = numpy.array([g(f(i)) for i in numpy.arange(10)])  # GoF is g∘f at 10 elements

print("G\u2218F =", G[F])   # integer indexing
print("g\u2218f =", GoF)    # array of the composed functions

In [None]:
# Consider the following application:

text = """Four score and seven years ago our fathers brought forth on this continent, a new nation,
conceived in Liberty, and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and
so dedicated, can long endure. We are met on a great battle-field of that war. We have come to
dedicate a portion of that field, as a final resting place for those who here gave their lives that
that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate—we can not consecrate—we can not hallow—this ground. The
brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add
or detract. The world will little note, nor long remember what we say here, but it can never forget
what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which
they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the
great task remaining before us—that from these honored dead we take increased devotion to that cause
for which they gave the last full measure of devotion—that we here highly resolve that these dead
shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that
government of the people, by the people, for the people, shall not perish from the earth."""

words = text.replace(".", " ").replace(",", " ").replace("-", " ").replace("\u2014", " ").split()

In [None]:
# Dictionary encoding: for compression or textual analysis

words = numpy.array(words)
dictionary, index = numpy.unique(words, return_inverse=True)

print("len(words) =", len(words), "\nwords[:25] =\n" + str(words[:25]))
print("\nlen(dictionary) =", len(dictionary), "\ndictionary[:25] =\n" + str(dictionary[:25]))
print("\nlen(index) =", len(index), "\nindex[:25] =\n" + str(index[:25]))

In [None]:
# Recovering the original text is function composition:
# 
# index             : positions in corpus → integer codes
# dictionary        : integer codes       → words

dictionary[index]
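
The same round trip on a word list short enough to check by eye. The words are made up for the example; as a bonus, the integer codes make counting occurrences a one-liner with `numpy.bincount`.

```python
import numpy

words = numpy.array(["the", "people", "by", "the", "people", "for", "the", "people"])

# dictionary: sorted unique words; index: position of each word in dictionary
dictionary, index = numpy.unique(words, return_inverse=True)

recovered = dictionary[index]       # function composition recovers the text
counts = numpy.bincount(index)      # occurrences of each dictionary entry

print(dictionary)
print(index)
print(dict(zip(dictionary.tolist(), counts.tolist())))
```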

In [None]:
# Five-minute exercise: dense array → sparse array → dense array.
dense1 = 1.1 * numpy.array(
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 4, 1, 0, 3, 0,
     1, 2, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

sparse_indexes = numpy.nonzero(dense1)[0]
sparse_values  = dense1[sparse_indexes]
print("sparse indexes:", sparse_indexes, "\nsparse values: ", sparse_values)

dense2 = numpy.zeros(len(dense1))
???
print("recovered dense:", dense2, sep="\n")

<br><br>

<p style="font-size: 1.25em">Summary of slicing: what <tt>a[X]</tt> does depends on the type of <tt>X</tt></p>

   * **if X is an integer:** selects an individual element;
   * **if X is a slice:** selects a contiguous or regularly strided subrange (strides can be backward);
   * **if X is a tuple** (any commas between square brackets): applies selections to multiple dimensions;
   * **if X is a boolean array:** filters arbitrarily chosen elements (preserving order);
   * **if X is an integer array:** applies a function of integers, arbitrarily chosen, in any order, and may have duplicates.
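
A compact sketch with one line per case in the list above; the arrays `a` and `b` are arbitrary examples.

```python
import numpy

a = numpy.arange(10) * 10          # [0, 10, 20, ..., 90]
b = numpy.arange(12).reshape(3, 4)

one_item    = a[3]                 # integer
strided     = a[3:7:2]             # slice
multi_dim   = b[1:, ::2]           # tuple of slices, one per dimension
filtered    = a[a > 65]            # boolean array
rearranged  = a[[3, 5, 0, 9, 9]]   # integer array, with a duplicate

print(one_item, strided, filtered, rearranged, multi_dim, sep="\n")
```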

<br>

See [Numpy's advanced indexing documentation](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#advanced-indexing) for more (e.g. slicing by a tuple of arrays).

<br><br><br><br><br>

<p style="font-size: 1.25em; text-align: center"><b>Awkward arrays:</b> extensions to Numpy for particle physics</p>

<br><br><br><br><br>

<br>

<p style="font-size: 1.25em">We've seen some examples of jagged arrays and how they extend Numpy.</p>

<p style="font-size: 1.25em"><tt>JaggedArray</tt> is one of the classes in awkward-array that provide the kinds of data structures needed by particle physics in array form.</p>

<br>

<center><img src="img/abstraction-layers.png" width="80%"></center>

<br>

In [None]:
# PyROOT's jagged array is an RVec object for every event (viewed through PyROOT).

import ROOT
rdf = ROOT.RDataFrame("events", "data/HZZ.root")
rdf.AsNumpy(columns=["Muon_E"])

In [None]:
# root_numpy's jagged array is a Numpy array for every event.

import root_numpy
root_numpy.root2array("data/HZZ.root", "events", branches=["Muon_E"])

In [None]:
# Awkward/uproot's jagged array consists of three arrays: starts/stops and content.
#
# The number of Python objects does not scale with the number of events.

import uproot
array = uproot.open("data/HZZ.root")["events"].array("Muon_E")

print(array.layout)

array.starts, array.stops, array.content
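
To see why three flat arrays are enough, here is a Numpy-only sketch that builds `starts`, `stops`, and `content` from a small list of lists and reads one inner list back out. The helper `get` is hypothetical and only for illustration; it is not part of awkward-array.

```python
import numpy

lists = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]

counts  = numpy.array([len(x) for x in lists])
stops   = numpy.cumsum(counts)      # where each inner list ends
starts  = stops - counts            # where each inner list begins
content = numpy.array([item for x in lists for item in x])

def get(i):
    """Return inner list i as a slice of content (hypothetical helper)."""
    return content[starts[i]:stops[i]]

print(starts, stops, content)
print(get(0), get(1), get(2))
```

However many inner lists there are, the data live in the same three Numpy arrays.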

<img src="img/arrow-website.png" width="100%">

In [None]:
# Recent projects outside of particle physics (Arrow, Parquet, Zarr, XND, and
# TensorFlow) also have jagged arrays represented as contiguous flat arrays.
# 
# Using a similar format lets us easily (and quickly!) convert between them.

awkward.toparquet("tmp.parquet", array)
awkward.toarrow(array)

<br><br>

<p style="font-size: 2em; margin-bottom: 0px">In particular,</p>

<center style="margin-top: 0px"><img src="img/pandas-logo.png" width="35%" style="margin-top: 0px"></center>

<br><br>

In [None]:
# Pandas is a data analysis environment built around in-memory tables.
# 
# "Numpy with an index" ... "Programmatic Excel" ... "SQL with an ordering"

uproot.open("data/Zmumu.root")["events"].pandas.df()

In [None]:
# Pandas deals with jaggedness by putting structure in an index, not the values.

df = uproot.open("data/HZZ.root")["events"].pandas.df(["Muon_E", "Muon_P[xyz]"])
df

In [None]:
# This seems a little odd (to me), but you can definitely work with it.

df.unstack()

In [None]:
# This has some interesting features: nested objects become multi-level columns...

array = awkward.fromiter([{"x": i, "y": {"y1": i, "y2": i}, "z": {"z1": {"z2": i}}}
                          for i in range(10)])
print(array[:2].tolist())
awkward.topandas(array, flatten=True)

In [None]:
# ... and nested lists become multi-level rows.

f = lambda i: [{"x": i, "y": i}] * i
array = awkward.fromiter([[f(1), f(2)], [f(3)], [f(4), f(5), f(6)]])
print(array[:2].tolist())
awkward.topandas(array, flatten=True)

In [None]:
# One-per-event data must be duplicated for each particle, and are inaccessible
# when there are no particles.

# (Switch between flatten=False and flatten=True.)
uproot.open("data/HZZ.root")["events"].pandas.df(["MET_*", "Jet_P[xyz]"],
                                                 flatten=False)

In [None]:
# And there isn't a way to deal with different jaggedness in the same table.

# (Switch between flatten=False and flatten=True.)
uproot.open("data/HZZ.root")["events"].pandas.df(["Muon_P[xyz]", "Jet_P[xyz]"],
                                                 flatten=False)

<br><br><br><br>

<center><img src="img/awkward-logo.png" width="40%"></center>

<br><br><br><br>

In [None]:
# Awkward-array is designed to handle arbitrary data structures in a way that
# fits both ROOT and Arrow/Parquet.

array = awkward.Table(uproot.open("data/HZZ-objects.root")["events"].arrays(
    ["MET", "muonp4", "muonq", "jetp4"], namedecode="utf-8"))

array[:10].tolist()

In [None]:
# ROOT has objects like TLorentzVector, but they translate to generic Tables
# in Arrow/Parquet.

awkward.toarrow(array)
# awkward.fromarrow(awkward.toarrow(array))[:10].tolist()

In [None]:
# You can iterate over these objects in for loops, like PyROOT...

for i, event in enumerate(array):
    print("new event", event.MET)
    for muon in event.muonp4:
        print("    muon", muon)
    for jet in event.jetp4:
        print("    jet ", jet)
    if i > 10:
        break

In [None]:
# ... but if you need to scale up, use array-at-a-time operations.

mu1 = array.muonp4[array.muonp4.counts >= 2, 0]
mu2 = array.muonp4[array.muonp4.counts >= 2, 1]
(mu1 + mu2).mass

In [None]:
# The "combinatorics" we need for particle physics requires a few new operations.

# Take any two muons from events that have them, not necessarily the first two.
pairs = array.muonp4.choose(2)
pairs

In [None]:
# Get the first and second element of each pair.

first, second = pairs.unzip()
first, second
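
For a single event, "choose 2" means every distinct pair of items. A Numpy-only sketch of that idea uses `numpy.triu_indices` on made-up values; awkward-array does this for all events at once, and its internal method and pair ordering are not assumed to match this.

```python
import numpy

values = numpy.array([10.0, 20.0, 30.0])   # e.g. three muons in one event

# Index pairs (i, j) with i < j: the upper triangle, excluding the diagonal.
i, j = numpy.triu_indices(len(values), k=1)

first, second = values[i], values[j]        # analogous to unzipping the pairs
print(list(zip(first.tolist(), second.tolist())))
```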

In [None]:
# Compute the mass and plot.
# 
# ("flatten" because Matplotlib needs a flat array, not a jagged array.)

matplotlib.pyplot.hist((first + second).mass.flatten(), bins=100, range=(0, 150));

In [None]:
# Five-minute exercise: plot masses with (1) opposite charges and
#                                        (2) both muon abs(eta) < 1
# This time, it's jagged.

array.muonq, array.muonp4.eta

# first, second = array.muonp4.choose(2).unzip()
# matplotlib.pyplot.hist((first + second).mass.flatten(), bins=100, range=(0, 150));

In [None]:
# Advanced combinatorics: muons that are close to jets

# Step 1: jet-muon pairs with a doubly-jagged structure,
# so that every jet has its own list of candidate muons
jets, muons = array.jetp4.cross(array.muonp4, nested=True).unzip()
jets, muons

In [None]:
# Advanced combinatorics: muons that are close to jets

# Step 2: ΔR between each jet and muon
distance = jets.delta_r(muons)
distance
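
For reference, ΔR is conventionally the distance in the (η, φ) plane, with Δφ wrapped into [-π, π). The function below is a standalone Numpy sketch of that formula with made-up inputs, not the library's `delta_r` method.

```python
import numpy

def delta_r(eta1, phi1, eta2, phi2):
    """Sketch of Delta R = sqrt(deta**2 + dphi**2) with dphi wrapped to [-pi, pi)."""
    deta = eta1 - eta2
    dphi = (phi1 - phi2 + numpy.pi) % (2*numpy.pi) - numpy.pi
    return numpy.sqrt(deta**2 + dphi**2)

jet_eta,  jet_phi  = numpy.array([0.0,  1.0]), numpy.array([ 3.0, 0.0])
muon_eta, muon_phi = numpy.array([0.0, -2.0]), numpy.array([-3.0, 0.0])

# The first pair is close in phi only because of the wrap-around at +-pi.
dr = delta_r(jet_eta, jet_phi, muon_eta, muon_phi)
print(dr)
```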

In [None]:
# Advanced combinatorics: muons that are close to jets

# Step 3: mask those that have any within ΔR < 1.0
mask = (distance < 1.0).any()
print(f"mask:  {mask}")

# Step 4: index of the closest one
index = distance.argmin()
print(f"index: {index}")

In [None]:
# Advanced combinatorics: muons that are close to jets

# Step 5: select those jets
jets_near_muons = jets[index][mask]
jets_near_muons

# (Use this to see just the events that have one.)
# jets_near_muons[jets_near_muons.counts > 0]

In [None]:
# Advanced combinatorics: muons that are close to jets

# Choice A: we want just those jets. Need to flatten the inner arrays so that
# the result is singly jagged, like the original jets.

array["jets_near_muons"] = jets_near_muons.flatten(axis=1)

for i, event in enumerate(array):
    if mask[i].any():
        print(event.jetp4)
        print(event.jets_near_muons)
        print()
    if i > 100:
        break

In [None]:
# Advanced combinatorics: muons that are close to jets

# Choice B: we want to link to the relevant muons, with the ΔR distance

array["nearest_muon"] = muons[index].pad(1, axis=1).flatten(axis=1)
array["distance"]     = distance[index].pad(1, axis=1).flatten(axis=1)

# Set link to None if nearest_muon or distance doesn't pass the cut
array.nearest_muon.content.mask |= ~mask.flatten()
array.distance.content.mask     |= ~mask.flatten()

for i, event in enumerate(array):
    if mask[i].any():
        print(event.jetp4)
        print(event.nearest_muon)
        print(event.distance)
        print()
    if i > 100:
        break

In [None]:
# Apologies for using functions that have not yet been introduced: just as with
# Numpy, working with awkward arrays means learning a vocabulary of single-step
# functions and putting them together.

a = awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]])

# "pad" means fill each inner array with None until it has at least N elements.
a.pad(3)

In [None]:
a = awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]])

# You can use it with "fillna" and "regular" to make a regular Numpy array.
a.pad(3, clip=True).fillna(999).regular()
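
The same pad, clip, and fill result can be sketched with Numpy alone, which may help to see what the chain produces. The `starts`/`counts`/`content` layout and all variable names here are assumptions made for the illustration; this is not how the awkward-array methods are implemented.

```python
import numpy

# [[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]] as flat arrays
content = numpy.array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
counts  = numpy.array([3, 0, 2, 4])
starts  = numpy.cumsum(counts) - counts

n = 3                                        # target length of every row
column = numpy.arange(n)
valid  = column < counts[:, numpy.newaxis]   # which slots hold real data
where  = starts[:, numpy.newaxis] + column   # where that data is in content
where  = numpy.minimum(where, len(content) - 1)   # keep indexes in bounds

regular = numpy.where(valid, content[where], 999.0)
print(regular)
```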

In [None]:
a = awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]])

# In the previous example, we used it with argmin; like argmax here, that makes
# inner arrays of length 0 or 1, and pad ensures that they're always length 1.
a.argmax().pad(1)

In [None]:
a = awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]])

# Once we've done that, we don't need the inner structure anymore and can flatten
# it to get a non-jagged array.
a.argmax().pad(1).flatten()

In [None]:
a = awkward.fromiter([[[1.1, 2.2, 3.3]], [[], [4.4, 5.5]], [[6.6, 7.7, 8.8, 9.9]]])

# But all of that happened inside a doubly-jagged array, in which we wanted to
# collapse the inner dimension, so we used axis=1. (Same meaning as in Numpy.)
a.argmax().pad(1, axis=1).flatten(axis=1)

<br><br>

<p style="font-size: 1.25em">Although we've only talked about variable-length lists, objects, and <tt>None</tt>, awkward-array types form a complete type system:</p>

<ul>
    <li style="font-size: 1.25em"><b>Primitive types:</b> numbers, booleans, and fixed-size binary blobs via Numpy,
    <li style="font-size: 1.25em"><b>Lists:</b> variable-length lists via <tt>JaggedArray</tt>,
    <li style="font-size: 1.25em"><b>Union (sum) types:</b> heterogeneous lists via <tt>UnionArray</tt>,
    <li style="font-size: 1.25em"><b>Record (product) types:</b> objects (<tt>Table</tt>), implicitly in our previous examples,
    <li style="font-size: 1.25em"><b>Pointers:</b> cross-references and circular references via <tt>IndexedArray</tt>,
    <li style="font-size: 1.25em"><b>Non-contiguous data:</b> via <tt>ChunkedArray</tt>,
    <li style="font-size: 1.25em"><b>Lazy data:</b> via <tt>VirtualArray</tt>.
</ul>

<br><br>

In [None]:
# Just to demonstrate, let's make a tree...

tree = awkward.fromiter([
    {"value": 1.23, "left":    1, "right":    2},     # node 0
    {"value": 3.21, "left":    3, "right":    4},     # node 1
    {"value": 9.99, "left":    5, "right":    6},     # node 2
    {"value": 3.14, "left":    7, "right": None},     # node 3
    {"value": 2.71, "left": None, "right":    8},     # node 4
    {"value": 5.55, "left": None, "right": None},     # node 5
    {"value": 8.00, "left": None, "right": None},     # node 6
    {"value": 9.00, "left": None, "right": None},     # node 7
    {"value": 0.00, "left": None, "right": None},     # node 8
])

left = tree.contents["left"].content
right = tree.contents["right"].content
left[(left < 0) | (left > 8)] = 0
right[(right < 0) | (right > 8)] = 0

tree.contents["left"].content = awkward.IndexedArray(left, tree)
tree.contents["right"].content = awkward.IndexedArray(right, tree)

In [None]:
print("Physical layout:")
print("------------------------------------------------------------------")
for i, x in tree.layout.items():
    if x.cls == numpy.ndarray:
        print("{0:10s} {1}".format(repr(i), x.array))

import json
print("\nLogical meaning:")
print("------------------------------------------------------------------")
print(json.dumps(tree[0].tolist(), indent=4))
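
The "pointers are just integer indexes" idea can also be sketched with plain Numpy arrays. The version below holds the same nine nodes, uses -1 as a stand-in for `None` (an assumption made for this sketch, different from the masking used above), and walks the tree depth-first; the function `preorder` is hypothetical.

```python
import numpy

value = numpy.array([1.23, 3.21, 9.99, 3.14, 2.71, 5.55, 8.00, 9.00, 0.00])
left  = numpy.array([   1,    3,    5,    7,   -1,   -1,   -1,   -1,   -1])
right = numpy.array([   2,    4,    6,   -1,    8,   -1,   -1,   -1,   -1])

def preorder(root):
    """Visit a node, then its left subtree, then its right subtree."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push right first so that left is popped (visited) first.
        for child in (int(right[node]), int(left[node])):
            if child >= 0:
                stack.append(child)
    return order

order = preorder(0)
print(order)
print(value[order])     # integer indexing follows the links
```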

In [None]:
# For those of you who were here yesterday, do you remember this?
import sklearn.datasets, matplotlib.pyplot
X1, y1 = sklearn.datasets.make_gaussian_quantiles(
    cov=2.0, n_samples=200, n_features=2, n_classes=2, random_state=1)
X2, y2 = sklearn.datasets.make_gaussian_quantiles(
    mean=(3, 3), cov=1.5, n_samples=400, n_features=2, n_classes=2, random_state=1)
X = numpy.concatenate((X1, X2))
y = numpy.concatenate((y1, -y2 + 1))
matplotlib.pyplot.scatter(X[y == 0, 0], X[y == 0, 1], c="deepskyblue", edgecolor="black");
matplotlib.pyplot.scatter(X[y == 1, 0], X[y == 1, 1], c="orange", edgecolor="black");

In [None]:
# We made a decision tree using Scikit-Learn...
import sklearn.tree
decision_tree = sklearn.tree.DecisionTreeClassifier(max_depth=8)
decision_tree.fit(X, y)
xx, yy = numpy.meshgrid(numpy.arange(-5, 8, 0.02), numpy.arange(-5, 8, 0.02))
Z = decision_tree.predict(numpy.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
matplotlib.pyplot.contourf(xx, yy, Z);
matplotlib.pyplot.scatter(X[y == 0, 0], X[y == 0, 1], c="deepskyblue", edgecolor="black", alpha=0.2);
matplotlib.pyplot.scatter(X[y == 1, 0], X[y == 1, 1], c="orange", edgecolor="black", alpha=0.2);
matplotlib.pyplot.xlim(-5, 8); matplotlib.pyplot.ylim(-5, 8);

In [None]:
# Scikit-Learn is already using columnar trees: we can just cast it.
mask = decision_tree.tree_.children_left < 0
left = decision_tree.tree_.children_left.copy()
right = decision_tree.tree_.children_right.copy()
left[mask] = 0
right[mask] = 0

tree = awkward.Table()
tree["feature"]   = awkward.MaskedArray(mask, decision_tree.tree_.feature)
tree["threshold"] = awkward.MaskedArray(mask, decision_tree.tree_.threshold)
tree["left"]      = awkward.MaskedArray(mask, awkward.IndexedArray(left, tree))
tree["right"]     = awkward.MaskedArray(mask, awkward.IndexedArray(right, tree))
tree["value"]     = decision_tree.tree_.value[:, 0, 0] - decision_tree.tree_.value[:, 0, 1]

tree[0].tolist()

<br>

<p style="font-size: 1.25em">Columnar data structures are more general than the array-at-a-time programming paradigm. There are ongoing efforts to use the same awkward arrays in several programming paradigms:</p>

<ul>
    <li style="font-size: 1.25em"><b>conventional imperative programming</b> in Numba (compiled Python, using awkward arrays as data types),
    <li style="font-size: 1.25em"><b>truly vectorized programming</b> on GPUs with CuPy and Numba,
    <li style="font-size: 1.25em"><b>declarative languages</b> (user specifies the <i>what</i>, not the <i>how</i>). Examples: LINQ, SQL-per-event, combinatorial pattern-matching...
</ul>

<br>

<p style="font-size: 1.25em">Columnar data structures provide a zero-copy medium between all of these paradigms.</p>
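
<p style="font-size: 1.25em">What "zero-copy" means can be illustrated with plain Numpy. This is only a sketch and is not tied to any of the libraries listed above: two arrays are created as views of one buffer, so a write through one view is visible through the other. The variable names are hypothetical.</p>

```python
import numpy

buffer = bytearray(8 * 4)                                   # raw memory for four float64 values
as_floats = numpy.frombuffer(buffer, dtype=numpy.float64)   # view the bytes as floats
as_bytes = numpy.frombuffer(buffer, dtype=numpy.uint8)      # view the same bytes as bytes

as_floats[:] = [1.0, 2.0, 3.0, 4.0]                         # write through one view
shares = numpy.shares_memory(as_floats, as_bytes)           # both views use the same memory

evens = as_floats[::2]                                      # slicing also returns a view
evens[0] = 10.0                                             # so this changes as_floats too

print(shares, as_floats)
```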

<br>

<br><br><br><br>

<center><img src="img/uproot-logo.png" width="40%"></center>

<br><br><br><br>

In [None]:
# We've been using uproot for many of our examples so far.
# 
# As a re-write of ROOT I/O in Python, uproot presents the data in a Pythonic way:

file = uproot.open("http://scikit-hep.org/uproot/examples/nesteddirs.root")

print("file is a read-only dict, from object names to objects:\n")
print(f"file.keys()                  → {file.keys()}\n")
print(f"file['one'].keys()           → {file['one'].keys()}\n")
print(f"file['one']['two'].classes() → {dict(file['one']['two'].classes())}\n")
print(f"file['one']['two']['tree']   → {file['one']['two']['tree']}\n")
print(f"file['one/two/tree']         → {file['one/two/tree']}")

In [None]:
# TBranches of TTrees are also presented as dicts.
events = uproot.open("data/Zmumu.root")["events"]
events.keys()

In [None]:
# Get an array with TBranch.array().

events["E1"].array()

In [None]:
# Or TTree.array(branchname).

events.array("E1")

In [None]:
# The plural form, arrays, returns a dict from branch names to arrays.

events.arrays("E1")

In [None]:
# You get the arrays you ask for.

events.arrays(["E1", "px1", "py1", "pz1"])

In [None]:
# With wildcards.

events.arrays(["E1", "p[xyz]1"])

In [None]:
# These are the same wildcard (glob) patterns used to match filenames in UNIX.

events.arrays(["E1", "p*1"])

In [None]:
# Or with slashes, they become regular expressions.

events.arrays(["E1", "/p.*[0-1]/"])
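
<p style="font-size: 1.25em">The two kinds of patterns can be mimicked with the standard library. This is a sketch of the idea, not uproot's implementation: the helper <tt>select</tt> and the list of branch names are hypothetical, and whether uproot anchors regular expressions the way <tt>re.match</tt> does is an assumption.</p>

```python
import fnmatch
import re

# Hypothetical branch names, for illustration only.
branch_names = ["E1", "px1", "py1", "pz1", "E2", "px2", "py2", "pz2", "M"]

def select(names, patterns):
    """Return the names matching any pattern: /.../ is a regex, anything else a glob."""
    selected = []
    for name in names:
        for pattern in patterns:
            if len(pattern) > 1 and pattern.startswith("/") and pattern.endswith("/"):
                matched = re.match(pattern[1:-1], name) is not None
            else:
                matched = fnmatch.fnmatchcase(name, pattern)
            if matched:
                selected.append(name)
                break
    return selected

glob_result = select(branch_names, ["E1", "p[xyz]1"])
regex_result = select(branch_names, ["E1", "/p.*[0-1]/"])
print(glob_result)
print(regex_result)
```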

In [None]:
# The "b" before each string (for bytestring) can be removed in Python 3 by
# specifying an encoding (strings in ROOT have no default encoding).

events.arrays(["E1", "px1", "py1", "pz1"], namedecode="utf-8")

In [None]:
# And we can change the container from a dict to something else by passing a
# class name; tuple is useful because it lets us assign each array.

E, px, py, pz = events.arrays(["E1", "px1", "py1", "pz1"], outputtype=tuple)

In [None]:
# outputtype=pandas.DataFrame is a synonym for TTree.pandas.df.

import pandas
events.arrays(["E1", "px1", "py1", "pz1"], outputtype=pandas.DataFrame)

In [None]:
# Use an explicit cache to avoid reading many times from the same file.

uproot.asdtype.debug_reading = True

print("asking for array...")
events.array("E1")

mycache = {}    # or maybe uproot.ArrayCache("1 GB")

print("asking for it with a cache...")
events.array("E1", cache=mycache)

print("asking for it again...")
events.array("E1", cache=mycache)

uproot.asdtype.debug_reading = False
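
<p style="font-size: 1.25em">The caching pattern itself is simple enough to sketch without uproot. In the example below, <tt>read_from_file</tt> and <tt>get_array</tt> are hypothetical stand-ins; uproot's actual cache keys and the eviction policy of <tt>uproot.ArrayCache</tt> are not reproduced.</p>

```python
reads = []   # records every simulated file read

def read_from_file(name):
    """Stand-in for an expensive read and decompression."""
    reads.append(name)
    return [0.0, 1.0, 2.0]

def get_array(name, cache=None):
    """Return the array, consulting a dict-like cache first if one is given."""
    if cache is not None and name in cache:
        return cache[name]
    array = read_from_file(name)
    if cache is not None:
        cache[name] = array
    return array

mycache = {}
get_array("E1")                   # no cache: reads the "file"
get_array("E1", cache=mycache)    # reads again and fills the cache
get_array("E1", cache=mycache)    # served from the cache, no read
print(f"number of reads: {len(reads)}")
```

<p style="font-size: 1.25em">Because the cache only needs <tt>in</tt>, item lookup and item assignment, any dict-like object can be passed in its place.</p>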

<br>

<p style="font-weight: bold; font-size: 1.875em; color: gray">Three ways to get data:</p>

<table width="100%" style="font-size: 1.25em"><tr style="background: white;">
    <td width="33%" style="vertical-align: top">
        <p style="font-weight: bold; font-size: 1.5em; margin-bottom: 0.5em">Direct</p>
        <p>Read the file and return an array.</p>
        <ul>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/ttree-handling.html#id11">TBranch.array</a></li>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/ttree-handling.html#array">TTree.array</a></li>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/ttree-handling.html#arrays">TTree.arrays</a></li>
        </ul>
    </td><td width="33%" style="vertical-align: top">
        <p style="font-weight: bold; font-size: 1.5em; margin-bottom: 0.5em">Lazy</p>
        <p>Get an object that reads on demand.</p>
        <ul>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/ttree-handling.html#id13">TBranch.lazyarray</a></li>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/ttree-handling.html#lazyarray">TTree.lazyarray</a></li>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/ttree-handling.html#lazyarrays">TTree.lazyarrays</a></li>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/opening-files.html#uproot-lazyarray-and-lazyarrays">uproot.lazyarray</a>*</li>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/opening-files.html#uproot-lazyarray-and-lazyarrays">uproot.lazyarrays</a>*</li>
        </ul>
    </td><td width="33%" style="vertical-align: top">
        <p style="font-weight: bold; font-size: 1.5em; margin-bottom: 0.5em">Iterative</p>
        <p>Read arrays in batches of entries.</p>
        <ul>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/ttree-handling.html#iterate">TTree.iterate</a></li>
            <li style="margin-bottom: 0.3em"><a href="https://uproot.readthedocs.io/en/latest/opening-files.html#uproot-iterate">uproot.iterate</a>*</li>
        </ul>
    </td>
</tr></table>

<p>*Lazy arrays or iteration over sets of files.</p>

In [None]:
# Direct:

events.array("E1")

In [None]:
# Lazy:

uproot.asdtype.debug_reading = True

print("getting lazy array...")
lazyarray = events.lazyarray("E1", entrysteps=500)
print(f"len(lazyarray.chunks) = {len(lazyarray.chunks)}")

print("before looking at the array...")
print(f"lazyarray = {lazyarray}")
print(f"chunks read = {[x.ismaterialized for x in lazyarray.chunks]}")

print("before computing a value...")
print(f"numpy.sqrt(lazyarray) = {numpy.sqrt(lazyarray)}")

print("before computing another value...")
print(f"lazyarray**2 = {lazyarray**2}")

uproot.asdtype.debug_reading = False

In [None]:
# Iterative:

for arrays in events.iterate("E1", entrysteps=500):
    print(arrays)

<br><br>

<p style="font-weight: bold; font-size: 1.875em; color: gray">Advantages and disadvantages of each:</p>

<table width="100%" style="font-size: 1.25em"><tr style="background: white;">
    <td width="33%" style="vertical-align: top">
        <p style="font-weight: bold; font-size: 1.5em; margin-bottom: 0.5em">Direct</p>
        <p>Simple; returns pure Numpy arrays if possible.</p>
    </td><td width="33%" style="vertical-align: top">
        <p style="font-weight: bold; font-size: 1.5em; margin-bottom: 0.5em">Lazy</p>
        <p>Transparently work on data too large to fit into memory.</p>
    </td><td width="33%" style="vertical-align: top">
        <p style="font-weight: bold; font-size: 1.5em; margin-bottom: 0.5em">Iterative</p>
        <p>Control the loading of data into and out of memory.</p>
    </td>
</tr></table>
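
<p style="font-size: 1.25em">The trade-offs can be sketched with Numpy alone. Everything here is a hypothetical stand-in: <tt>data</tt> plays the role of a branch on disk, <tt>read</tt> plays the role of file access, and <tt>LazyChunk</tt> only imitates the <tt>ismaterialized</tt> flag seen earlier.</p>

```python
import numpy

data = numpy.arange(2000, dtype=numpy.float64)   # stand-in for a branch on disk

def read(start, stop):
    """Stand-in for reading entries [start, stop) from a file."""
    return data[start:stop].copy()

# Direct: everything is in memory at once.
direct_sum = read(0, len(data)).sum()

# Iterative: only one batch is in memory at a time; the loop is explicit.
def iterate(entrysteps):
    for start in range(0, len(data), entrysteps):
        yield read(start, min(start + entrysteps, len(data)))

iterative_sum = sum(chunk.sum() for chunk in iterate(500))

# Lazy: chunks are read the first time they are touched.
class LazyChunk:
    def __init__(self, start, stop):
        self.start, self.stop, self._array = start, stop, None

    @property
    def ismaterialized(self):
        return self._array is not None

    @property
    def array(self):
        if self._array is None:
            self._array = read(self.start, self.stop)
        return self._array

chunks = [LazyChunk(s, min(s + 500, len(data))) for s in range(0, len(data), 500)]
first_value = chunks[0].array[0]                 # touches only the first chunk
materialized = [c.ismaterialized for c in chunks]
print(direct_sum, iterative_sum, materialized)
```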

In [None]:
# Controlling the chunk size:

print("Lazy or iteration steps as a fixed number of entries:")
for arrays in events.iterate(entrysteps=500):
    print(len(arrays[b"E1"]))

print("\nLazy or iteration steps as a fixed memory footprint:")
for arrays in events.iterate(entrysteps="100 kB"):
    print(len(arrays[b"E1"]))
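
<p style="font-size: 1.25em">A memory-based step size ultimately has to be turned into a number of entries. The sketch below shows one simple way to do that arithmetic for fixed-width branches. The helper <tt>entries_per_step</tt> is hypothetical, treating 1 kB as 1000 bytes is an assumption, and uproot's own calculation may differ.</p>

```python
import numpy

def entries_per_step(memory_bytes, dtypes):
    """Entries that fit in memory_bytes when each entry holds one value per dtype."""
    bytes_per_entry = sum(numpy.dtype(d).itemsize for d in dtypes)
    return max(1, memory_bytes // bytes_per_entry)

# Four float64 branches take 32 bytes per entry.
step = entries_per_step(100 * 1000, [numpy.float64] * 4)
print(step)
```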

In [None]:
# Reading complex data: mostly simplified by the fact that C++ classes are "split"
# into TBranches, and most TBranches are simple arrays.

tree = uproot.open("http://scikit-hep.org/uproot/examples/Event.root")["T"]
tree.show()

# branch name              streamer type, if any      uproot's interpretation

In [None]:
# In this view, class attributes are NOT special types; they're just numbers.

tree.array("fTemperature", entrystop=20)

In [None]:
# Fixed-width matrices are multidimensional arrays,

tree.array("fMatrix[4][4]", entrystop=6)

In [None]:
# branches with multiple leaves ("leaf-list") are Numpy record arrays,

uproot.open("http://scikit-hep.org/uproot/examples/"
                                    "leaflist.root")["tree"]["leaflist"].array()

In [None]:
# and anything in variable-length lists is a JaggedArray,

tree.array("fTracks.fMass2", entrystop=6)

In [None]:
# even when each item in the variable-length list is itself a fixed-width array.

tree.array("fTracks.fTArray[3]", entrystop=6)
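
<p style="font-size: 1.25em">The three layouts just described can be imitated in plain Numpy. This is a sketch with made-up values: the offsets-plus-content representation of variable-length lists shows the general idea behind a jagged array, not the exact internals of <tt>JaggedArray</tt>.</p>

```python
import numpy

# Fixed-width matrix per entry: one extra dimension per fixed-size axis.
matrices = numpy.zeros((6, 4, 4))                 # 6 entries, each a 4x4 matrix

# Leaf-list: several named fields per entry, stored as a record array.
records = numpy.array([(1, 1.1), (2, 2.2), (3, 3.3)],
                      dtype=[("x", numpy.int32), ("y", numpy.float64)])

# Variable-length lists: one flat content array plus offsets marking the boundaries.
content = numpy.array([1.1, 2.2, 3.3, 4.4, 5.5])
offsets = numpy.array([0, 3, 3, 5])               # lists: [1.1 2.2 3.3], [], [4.4 5.5]

counts = offsets[1:] - offsets[:-1]               # length of each list
last_list = content[offsets[2]:offsets[3]]        # the third list, as a view of content

print(matrices.shape, records["y"], counts, last_list)
```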

In [None]:
# There are some types that ROOT does not split because they are too complex.
# For example, *histograms* inside a TTree:

tree.array("fH", entrystop=6)

In [None]:
# Uproot can read objects like this because ROOT describes their layout in
# "streamers"; uproot reads the (most common types of) streamers and generates
# Python classes, some of which have specialized, high-level methods.

for histogram in tree.array("fH", entrystop=3):
    print(histogram.title)
    print(histogram.values)
print("\n...\n")
for histogram in tree.array("fH", entrystart=-3):
    print(histogram.title)
    print(histogram.values)

In [None]:
# As we've seen, histograms have some convenience methods.
# They're mostly for conversion to other formats, like Numpy.
# 
# Numpy "histograms" are a 2-tuple of counts and edges.

uproot.open("http://scikit-hep.org/uproot/examples/"
                                        "hepdata-example.root")["hpx"].numpy()

In [None]:
# Similarly for 2-dimensional histograms.

uproot.open("http://scikit-hep.org/uproot/examples/hepdata-example.root")["hpxpy"].numpy()
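
<p style="font-size: 1.25em">For reference, this is what Numpy's own histogram functions return, using a few made-up values. Only Numpy's conventions are shown here; how uproot groups the edges of a 2-dimensional histogram is not assumed.</p>

```python
import numpy

# 1-dimensional: a 2-tuple of counts and edges, with one more edge than counts.
counts, edges = numpy.histogram([0.5, 1.5, 1.7, 2.2], bins=3, range=(0.0, 3.0))
centers = (edges[:-1] + edges[1:]) / 2            # bin centers, if needed for plotting

# 2-dimensional: a 2-d array of counts plus separate x and y edges.
counts2d, xedges, yedges = numpy.histogram2d(
    [0.5, 1.5, 1.5], [0.5, 0.5, 2.5],
    bins=(2, 3), range=((0.0, 2.0), (0.0, 3.0)))

print(counts, edges)
print(counts2d.shape, xedges, yedges)
```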

In [None]:
# It can also be useful to turn histograms into Pandas DataFrames (note the IntervalIndex).

uproot.open("http://scikit-hep.org/uproot/examples/Event.root")["htime"].pandas()

In [None]:
# Or HEPData's YAML format. Because the histograms are Python objects,
# converting them to other formats takes only a little work.

print(uproot.open("http://scikit-hep.org/uproot/examples/Event.root")["htime"].hepdata())

In [None]:
# At the moment, only two kinds of objects can be *written* to ROOT files:
# TObjString and histograms.
# 
# To write, open a file for writing (create/recreate/update) and assign to it
# like a dict:

file = uproot.recreate("tmp.root", compression=uproot.ZLIB(4))
file["name"] = "Some object, like a TObjString."

In [None]:
import ROOT

pyroot_file = ROOT.TFile("tmp.root")
pyroot_file.Get("name")

In [None]:
# During assignment, uproot recognizes Pythonic types, such as Numpy histograms.

file["from_numpy"] = numpy.histogram(numpy.random.normal(0, 1, 10000))

In [None]:
pyroot_file = ROOT.TFile("tmp.root")           # refresh the PyROOT file
pyroot_hist = pyroot_file.Get("from_numpy")

canvas = ROOT.TCanvas("canvas", "", 400, 300)
pyroot_hist.Draw("hist")
canvas.Draw()

In [None]:
# 2-dimensional Numpy histograms.

file["from_numpy2d"] = numpy.histogram2d(numpy.random.normal(0, 1, 10000), numpy.random.normal(0, 1, 10000))

In [None]:
pyroot_file = ROOT.TFile("tmp.root")           # refresh the PyROOT file
pyroot_hist = pyroot_file.Get("from_numpy2d")

pyroot_hist.Draw()
canvas.Draw()