# Lesson 3: Array-oriented programming

Our analysis of Higgs data with NumPy arrays didn't use many `if` or `for` statements at all.

<br>

The general pattern consists of a single Python extension call that operates on many data values (_similar to_ "SIMD": Single Instruction, Multiple Data).

<br>

This pattern can be called a programming language paradigm, contrasted with "imperative," "functional," "object-oriented," etc.

In [3]:
import numpy as np

<br>

**Imperative programming:**

In [4]:
%%time
input_data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
output_data = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0])
for i in range(len(input_data)):                                  # you say what happens to each element
    output_data[i] = input_data[i]**2                             # in an exactly specified order (for loop)
output_data

CPU times: user 18 µs, sys: 10 µs, total: 28 µs
Wall time: 29.3 µs


array([ 1,  4,  9, 16, 25, 36, 49, 64, 81])

<br>

**Functional programming:**

In [3]:
input_data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
output_data = np.fromiter(map(lambda x: x**2, input_data), int)   # you provide a function to be applied to
output_data                                                       # each element; may run in any order

array([ 1,  4,  9, 16, 25, 36, 49, 64, 81])

<br>

**Array-oriented programming:**

In [5]:
%%time
input_data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
output_data = input_data**2                                       # implicit indexes, no reference to individual
output_data                                                       # elements; function is hard-coded in C

CPU times: user 0 ns, sys: 72 µs, total: 72 µs
Wall time: 320 µs


array([ 1,  4,  9, 16, 25, 36, 49, 64, 81])
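
The timings shown above are for nine elements, where fixed per-call overhead dominates, so they say little about how the two styles scale. Here is a minimal sketch (not part of the original notebook) that times both on a larger array with the standard-library `timeit`. The array size and the helper names `square_with_loop` and `square_with_array` are arbitrary choices for illustration, and the numbers will vary by machine.

```python
import timeit

import numpy as np

input_data = np.arange(100_000, dtype=np.int64)


def square_with_loop(data):
    # imperative: visit each element explicitly, in order
    out = np.zeros_like(data)
    for i in range(len(data)):
        out[i] = data[i] ** 2
    return out


def square_with_array(data):
    # array-oriented: one call, the loop runs in compiled code
    return data**2


loop_result = square_with_loop(input_data)
array_result = square_with_array(input_data)

# time three repetitions of each version
loop_time = timeit.timeit(lambda: square_with_loop(input_data), number=3)
array_time = timeit.timeit(lambda: square_with_array(input_data), number=3)
print(f"loop:  {loop_time:.4f} s for 3 runs")
print(f"array: {array_time:.4f} s for 3 runs")
```

Both functions return the same values; only the way the loop is expressed (and where it runs) differs.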

Most array-oriented programming languages have been interactive, intended for data analysis or simulation.

(This is a list of _all_ array-oriented languages that I know about.)

<br>

<center>
<img src="img/apl-timeline.svg" width="100%">
</center>

They have also tended to be concise (for quick typing, reduced screen clutter).

<br>

The original, APL, was way too concise! (Needed a special keyboard for all the math symbols.)

<br>

<center>
<div style="display: inline-block">

| APL | <br> | NumPy |
|:---:|:----:|:-----:|
| <tt>ι4</tt> | <br> | <tt>np.arange(4)</tt> |
| <tt>(3+ι4)</tt> | <br> | <tt>np.arange(4) + 3</tt> |
| <tt>+/(3+ι4)</tt> | <br> | <tt>(np.arange(4) + 3).sum()</tt> |
| <tt>m ← +/(3+ι4)</tt> | <br> | <tt>m = (np.arange(4) + 3).sum()</tt> |

</div>

<img src="img/apl-keyboard.jpg" width="35%" style="display: inline-block; margin-left: 5%">

</center>

Ordinary development/debugging interaction pattern: step through instructions on each _value_ in a debugger (breakpoints, etc.).

<br>

Data analysis interaction pattern: stop after key _operations_ and look at _distributions_ of all values.

<br>

<br>

Example: suppose you have a million data points.

In [6]:
import matplotlib.pyplot as plt
import hist

<br>

In [7]:
dataset = np.random.normal(0, 1, 1000000)
dataset

array([-0.26221554, -1.4188783 , -1.5109553 , ..., -0.31220267,
       -1.32222738, -1.0613871 ])

<br>

(Seeing 6 numerical values doesn't tell us about the other 999994.)

"What does the distribution look like?"

<br>

In [8]:
hist.Hist.new.Regular(100, -5, 5, name=" ").Double().fill(dataset)

<br>

Of course, it's Gaussian/normal-distributed. (That's what we had asked for with `np.random.normal`.)

"What does its square look like?"

In [9]:
dataset2 = dataset**2

<br>

In [10]:
hist.Hist.new.Regular(100, -3, 13, name=" ").Double().fill(dataset2)

<br>

"Of course. It's always positive, peaks at 0, and falls off to 9, rather than 3."
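
The "9, rather than 3" remark follows from $|x| < 3 \iff x^2 < 9$. A quick numerical check, sketched with plain NumPy (the fixed seed and the names `rng`, `sample`, and `squared` are assumptions for illustration, not from the notebook):

```python
import numpy as np

rng = np.random.default_rng(12345)   # fixed seed so the check is repeatable
sample = rng.normal(0, 1, 1_000_000)
squared = sample**2

# squaring never produces a negative value
print(squared.min() >= 0)

# the fraction within 3 standard deviations should match the fraction of
# squares below 9 (up to floating-point edge cases right at the boundary)
frac_within_3 = np.mean(np.abs(sample) < 3)
frac_below_9 = np.mean(squared < 9)
print(frac_within_3, frac_below_9)
```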

"What does this crazy combination look like?"

In [11]:
dataset3 = np.sin(1/dataset2)

<br>

In [12]:
hist.Hist.new.Regular(100, -1.2, 1.2, name=" ").Double().fill(dataset3)

<br>

I couldn't have guessed that shape: having the computer do it revealed something non-trivial.

History of paradigm-related words in CHEP titles & abstracts (Computing in HEP conferences from 1985 through present).

"Arrays" (originally, Fortran arrays) are making a comeback.

<br>

<center>
<img src="img/chep-papers-paradigm.svg" width="75%">
</center>

## Awkward Arrays

<br>

In exercise-1, we saw that particle physics analyses rely heavily on combinatorics.

<br>

In exercise-2, we saw that NumPy arrays and operations don't provide enough structure (in the data or operations).

<br><br><br>

The Awkward Array library was created to fill that gap.

Load Higgs data as an Awkward Array.

In [6]:
import awkward as ak

<br>

In [7]:
events = ak.from_parquet("data/SMHiggsToZZTo4L.parquet")
events

View the first event as Python lists and dicts (like JSON).

In [None]:
events[0].to_list()

Get one numeric field (also known as "column").

In [None]:
events.electron.pt

Compute something ($p_z = p_T \sinh\eta$).

In [None]:
events.electron.pt * np.sinh(events.electron.eta)

To plot it, we need numbers without structure, so [ak.flatten](https://awkward-array.org/doc/main/reference/generated/ak.flatten.html) it.

In [None]:
hist.Hist.new.Regular(100, 0, 100, name=" ").Double().fill(
    ak.flatten(events.electron.pt)
).plot();

Each event has a different number of electrons and muons ([ak.num](https://awkward-array.org/doc/main/reference/generated/ak.num.html) to check).

In [None]:
ak.num(events.electron), ak.num(events.muon)

<br>

So what happens if we try to compute something with the electrons' $p_T$ and the muons' $\eta$?

In [None]:
events.electron.pt * np.sinh(events.muon.eta)

This is data structure-aware, array-oriented programming.

Before moving on, I should point out that we can get these data from ROOT files:

In [None]:
import uproot

<br>

In [None]:
file = uproot.open("data/SMHiggsToZZTo4L.root")
file

<br>

In [None]:
tree = file["Events"]
tree

<br>

Uproot has several methods to read arrays (NumPy/Awkward/Pandas), but [uproot.TTree.arrays](https://uproot.readthedocs.io/en/latest/uproot.behaviors.TTree.TTree.html#arrays) is a general one.

In [None]:
tree.arrays(filter_name="Electron_*")

## Basic operations of Awkward Array

Illustrated with a small array.

In [None]:
array = ak.from_iter([[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}], [], [{"x": 3.3, "y": [1, 2, 3]}]])
array

<br>

We've seen some selections (single item, by field name), but here they are for the small array:

In [None]:
array[0]

<br>

In [None]:
array[0, "y", -1]

We've seen mapped operations (NumPy ufuncs):

In [None]:
np.square(array["x"])

<br>

In [None]:
np.sqrt(array["y"])

Reducers ([ak.sum](https://awkward-array.org/doc/main/reference/generated/ak.sum.html), [ak.min](https://awkward-array.org/doc/main/reference/generated/ak.min.html), [ak.max](https://awkward-array.org/doc/main/reference/generated/ak.max.html), [ak.any](https://awkward-array.org/doc/main/reference/generated/ak.any.html), [ak.all](https://awkward-array.org/doc/main/reference/generated/ak.all.html), etc.) apply to lists of variable length, including zero length.

In [None]:
ak.sum(array["x"])

<br>

In [None]:
ak.sum(array["y"], axis=-1)

The `axis` parameter has the same meaning as in NumPy, but extended to allow for non-rectilinear data.

In [None]:
array2 = ak.from_iter([[   1,    2,    4],
                       [                ],
                       [None,    8      ],
                       [  16            ]])

<br>

In [None]:
ak.sum(array2, axis=0)

<br>

In [None]:
ak.sum(array2, axis=1)

Slicing with boolean or integer arrays.

In [None]:
array

<br>

In [None]:
array[[False, False, True]]

<br>

In [None]:
array[[1, 1, 1, 2]]

Slicing with arrays of _lists_ of booleans or integers.

In [None]:
array.y

<br>

In [None]:
array.y[[[[True], [False, True]], [], [[False, True, False]]]]

<br>

In [None]:
array.y[[[[], [-1, -1, -1]], [], [[0, 1, 1, 1, 1, 1, 2]]]]

**Application:** Filtering events with an array of booleans.

In [None]:
events.MET.pt, events.MET.pt > 20

<br>

In [None]:
len(events), len(events[events.MET.pt > 20])

<br>

**Application:** Filtering particles with an array of lists of booleans.

In [None]:
events.electron.pt, events.electron.pt > 30

<br>

In [None]:
ak.num(events.electron), ak.num(events.electron[events.electron.pt > 30])

**Quizlet:** Using the reducer [ak.any](https://awkward-array.org/doc/main/reference/generated/ak.any.html), how would we select _events_ in which any electron has $p_T > 30$ GeV/$c$?

In [None]:
events.electron[events.electron.pt > 30]   # keeps electrons with pT > 30 in every event; the number of events is unchanged

**Bonus:** How would you do it with [ak.min](https://awkward-array.org/doc/main/reference/generated/ak.min.html)?

Awkward Array has two combinatorial primitives:

<table style="width: 50%">
    <tr style="background: white"><td style="font-size: 1.75em; font-weight: bold; text-align: center"><a href="https://awkward-array.org/doc/main/reference/generated/ak.cartesian.html">ak.cartesian</a></td><td style="font-size: 1.75em; font-weight: bold; text-align: center"><a href="https://awkward-array.org/doc/main/reference/generated/ak.combinations.html">ak.combinations</a></td></tr>
    <tr style="background: white"><td><img src="img/cartoon-cartesian.svg" width="100%"></td><td><img src="img/cartoon-combinations.svg" width="100%"></td></tr>
</table>

[ak.cartesian](https://awkward-array.org/doc/main/reference/generated/ak.cartesian.html) takes a [Cartesian product](https://en.wikipedia.org/wiki/Cartesian_product) of lists from $N$ different arrays, producing an array of lists of $N$-tuples.

[ak.combinations](https://awkward-array.org/doc/main/reference/generated/ak.combinations.html) takes all combinations of $N$ items, [sampled without replacement](http://prob140.org/sp18/textbook/notebooks-md/5_04_Sampling_Without_Replacement.html), from each list of a single array, producing an array of lists of $N$-tuples.

In [None]:
numbers = ak.Array([[1, 2, 3], [], [4]])
letters = ak.Array([["a", "b"], ["c"], ["d", "e"]])

<br>

In [None]:
ak.cartesian([numbers, letters])

<br>

In [None]:
values = ak.Array([[1.1, 2.2, 3.3, 4.4], [], [5.5, 6.6]])

<br>

In [None]:
ak.combinations(values, 2)
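
For intuition, the same pairings can be written with the standard library, one list (one "event") at a time. This is a plain-Python sketch of the semantics, not how Awkward Array implements them. I expect the Awkward results above to contain the same tuples in the same order, but treat that as an assumption to check.

```python
from itertools import combinations, product

numbers = [[1, 2, 3], [], [4]]
letters = [["a", "b"], ["c"], ["d", "e"]]
values = [[1.1, 2.2, 3.3, 4.4], [], [5.5, 6.6]]

# like ak.cartesian: pair every item of one list with every item of the other,
# separately for each position in the outer lists
cartesian_per_list = [list(product(n, l)) for n, l in zip(numbers, letters)]

# like ak.combinations(values, 2): every unordered pair within one list
combinations_per_list = [list(combinations(v, 2)) for v in values]

print(cartesian_per_list)
print(combinations_per_list)
```

An empty list on either side of a Cartesian product gives an empty list of pairs, and a list with fewer than two items has no pairs to combine.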

Often, it's useful to separate the left-hand sides and right-hand sides of these pairs with [ak.unzip](https://awkward-array.org/doc/main/reference/generated/ak.unzip.html), so they can be used in mathematical expressions.

<br>

In [None]:
electron_muon_pairs = ak.cartesian([events.electron, events.muon])
# electron_muon_pairs.type.show()

<br>

In [None]:
electron_in_pair, muon_in_pair = ak.unzip(electron_muon_pairs)
# electron_in_pair.type.show()

<br>

In [None]:
electron_in_pair.pt, muon_in_pair.pt

<br>

In [None]:
ak.num(electron_in_pair), ak.num(muon_in_pair)

The Vector library [can be applied to Awkward Arrays](https://vector.readthedocs.io/en/latest/usage/intro.html#Awkward-Arrays-of-vectors), and the easiest way to do that is by calling `register_awkward` after importing it.

<br>

In [None]:
import vector
vector.register_awkward()

<br>

Now all Awkward data structures named "`Momentum4D`" can compute `px`, `py`, `pz`, etc. from `pt`, `phi`, `eta`, etc.

In [None]:
events.electron.px, events.electron.py, events.electron.pz
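
The conversions behind those properties are the standard ones: $p_x = p_T \cos\phi$, $p_y = p_T \sin\phi$, and $p_z = p_T \sinh\eta$. Here is a sketch with plain NumPy on made-up values, to show the arithmetic only; it is not how the Vector library is implemented.

```python
import numpy as np

# hypothetical kinematics for three particles
pt = np.array([41.2, 17.5, 28.9])
phi = np.array([0.5, -2.0, 3.0])
eta = np.array([0.3, -1.1, 2.0])

px = pt * np.cos(phi)
py = pt * np.sin(phi)
pz = pt * np.sinh(eta)

# round trip: the transverse momentum and eta are recovered
print(np.allclose(np.hypot(px, py), pt))
print(np.allclose(np.arcsinh(pz / pt), eta))
```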

Other useful functions, like $\Delta R = \sqrt{\Delta\phi^2 + \Delta\eta^2}$, can be applied to combinations of particles.

In [None]:
electron_in_pair, muon_in_pair = ak.unzip(ak.cartesian([events.electron, events.muon]))

<br>

In [None]:
electron_in_pair.deltaR(muon_in_pair)
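
One detail the formula hides is that $\Delta\phi$ is an angle difference, so it has to be wrapped into a single turn before squaring. To my understanding, Vector's `deltaR` does this for you. The function below is a hand-written NumPy sketch of the idea; `delta_r` is a hypothetical helper, not part of any library.

```python
import numpy as np


def delta_r(eta1, phi1, eta2, phi2):
    # wrap the azimuthal difference into [-pi, pi) so that angles on
    # either side of the +pi/-pi seam count as close together
    dphi = (phi1 - phi2 + np.pi) % (2 * np.pi) - np.pi
    deta = eta1 - eta2
    return np.sqrt(dphi**2 + deta**2)


# two particles at the same eta, just across the seam from each other
across_seam = delta_r(0.0, 3.0, 0.0, -3.0)
print(across_seam)   # small, not 6
```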

In [None]:
first_electron_in_pair, second_electron_in_pair = ak.unzip(ak.combinations(events.electron, 2))

<br>

In [None]:
first_electron_in_pair.deltaR(second_electron_in_pair)

<br>

**Quizlet:** What's this?

In [None]:
(first_electron_in_pair + second_electron_in_pair).mass

The next exercise contains solutions because it's not an easy problem, especially if you're new to array programming on structures.

<br>

