# CoDaS-HEP Columnar Data Analysis, part 1

This is the first of two sessions on [columnar data analysis](https://indico.cern.ch/event/1151367/timetable/#41-columnar-data-analysis), presented at CoDaS-HEP at 12:30pm on August 3, 2022 by Jim Pivarski and Ioana Ifrim.

See the [GitHub repo](https://github.com/jpivarski-talks/2022-08-03-codas-hep-columnar-tutorial#readme) for instructions on how to run it.

<br><br><br><br><br>

## Programming paradigms

Programming languages are for humans, not computers.

<br><br>

They bridge the gap between patterns of flowing electrons and the way humans conceptualize logical necessity (i.e. math).

<br><br>

Humans don't all think the same way about math, and one person doesn't always think about it the same way in different problems. That's why we need different **programming languages**.

<br><br>

But there are only a few ways that programming languages can be fundamentally different from each other, and even these have overlaps. They're called **programming paradigms**.

<br><br>

You might have heard of a few, or at least recognize them when you see them.

<br><br><br><br><br>

### Sample problem

Let's illustrate a few in the context of a single problem.

Suppose that we have a 1D array `A` and a 2D array `B`, and we want to

   * multiply `A[i]` and `B[i, j]` for each horizontal index `i`,
   * sum over those products along vertical index `j`,
   * resulting in a 1D array, indexed only by `i`.

<img src="img/paradigms-problem.svg" width="500">

In [1]:
import numpy as np

In [2]:
A = np.array([10, 20, 30])

B = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])

We want to get an output that is

In [3]:
np.array([
    A[0]*B[0, 0] + A[0]*B[0, 1] + A[0]*B[0, 2] + A[0]*B[0, 3],
    A[1]*B[1, 0] + A[1]*B[1, 1] + A[1]*B[1, 2] + A[1]*B[1, 3],
    A[2]*B[2, 0] + A[2]*B[2, 1] + A[2]*B[2, 2] + A[2]*B[2, 3],
])

array([  60,  440, 1140])

<br><br><br><br><br>

### Imperative

The imperative paradigm is probably the most familiar: you tell the computer exactly what to do for each element, step by step.

In [4]:
output = np.zeros(len(A), dtype=int)

for i in range(len(A)):
    ai = A[i]
    bi = B[i]
    for bij in bi:
        output[i] += ai * bij

output

array([  60,  440, 1140])

If this had been written in another language, like C++, it would look fairly similar.

```c++
std::vector<int> output(A.size(), 0);

for (std::size_t i = 0; i < A.size(); i++) {
    int                     ai = A[i];
    const std::vector<int>& bi = B[i];   // reference: avoids copying row i
    for (int bij : bi) {
        output[i] += ai * bij;
    }
}
```

Imperative programs involve explicit instructions, loop blocks, and if/then/else blocks (indented in Python and in curly brackets in C).

For most of us today, this is just "normal programming," but it wasn't a part of the first programming languages. It was introduced by ALGOL (1958) and only became mainstream after it was [proven capable of universal computation](https://en.wikipedia.org/wiki/Structured_program_theorem) (1966), such that programming languages [wouldn't need a GOTO statement anymore](https://doi.org/10.1145%2F362929.362947) (1968). Most physicists have been using this style [since Fortran 77](https://en.wikipedia.org/wiki/Fortran#FORTRAN_77) (1977), and some [a little earlier](https://onlinelibrary.wiley.com/doi/10.1002/spe.4380050408) (1975).

The downsides of imperative programming are:

   * it can be _too_ prescriptive, preventing compilers from finding faster algorithms than the one you wrote,
   * some languages, like Python, have considerable overhead for each statement.

Imperative programming in Python is slow.
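
To get a feeling for that overhead, here is a rough timing sketch. The larger inputs `A_big` and `B_big` and the function names are made up for this illustration (they are not part of the tutorial), and the numbers will vary from machine to machine. It compares the nested loop above with a NumPy expression (the one used in the array-oriented section below) whose loops run in compiled code.

```python
import timeit

import numpy as np

# Hypothetical larger inputs (not from the tutorial), so the timing is measurable.
rng = np.random.default_rng(12345)
A_big = rng.integers(0, 10, size=1000)
B_big = rng.integers(0, 10, size=(1000, 100))

def imperative(A, B):
    # Same nested loop as above: one Python-level step per element of B.
    output = np.zeros(len(A), dtype=np.int64)
    for i in range(len(A)):
        ai = A[i]
        for bij in B[i]:
            output[i] += ai * bij
    return output

def compiled(A, B):
    # Same result, but the loops run inside NumPy's compiled code.
    return np.sum(A[:, np.newaxis] * B, axis=1)

t_loop = timeit.timeit(lambda: imperative(A_big, B_big), number=3) / 3
t_compiled = timeit.timeit(lambda: compiled(A_big, B_big), number=3) / 3

print(f"Python loops:  {t_loop:.6f} s per call")
print(f"compiled code: {t_compiled:.6f} s per call")
```

Both functions compute the same array; only the time per call should differ.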

<br><br><br><br><br>

### Functional

The functional paradigm replaces blocks of code—that which is indented in Python and between curly brackets in C—with functions.

Instead of `if` and loop syntax like `for` and `while`, functional languages have "functors" (more commonly called higher-order functions) that take functions as arguments, and the functors do that work.

Most functional programming languages use the same names for these common functors: `map`, `reduce`, `filter`, `scan`, `fold`, `flatten`, `flatmap`, ... ([in Python](https://web.mit.edu/6.005/www/fa15/classes/25-map-filter-reduce/), [in LISP](https://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node143.html), [in Java](https://belief-driven-design.com/functional-programm-with-java-map-filter-reduce-77e479bd73e/), [in Swift](https://abhimuralidharan.medium.com/higher-order-functions-in-swift-filter-map-reduce-flatmap-1837646a63e8)...)

If you're familiar with [ROOT's RDataFrame](https://root.cern/doc/master/classROOT_1_1RDataFrame.html), you have already seen an example of functional programming in physics.

Here is a functional solution to the sample problem:

In [5]:
def fun(args):
    i, (ai, bi) = args
    return sum(map(lambda bij: ai * bij, bi))

np.fromiter(map(fun, enumerate(zip(A, B))), dtype=int)

array([  60,  440, 1140])

In the above,

   * `lambda` makes an inline user-defined function and `def` makes one capable of multiple statements,
   * `zip` makes a sequence by pairing elements of `A` with elements of `B`,
   * `enumerate` pairs indexes with the sequence,
   * `map` applies a function to every element of a sequence: `x[i] → f(x[i])`,
   * `sum` is a reducer: it combines the elements of a sequence with the function "`+`", carrying a running total from one element to the next, which turns the sequence into a scalar.
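
The pieces in this list can be looked at one at a time. The sketch below rebuilds the first row of the sample problem; the variable names (`pairs`, `indexed`, `products`, `total`) are only for illustration. It also spells out `sum` as an explicit `functools.reduce` with `operator.add`, to show what "reducer" means.

```python
from functools import reduce
import operator

import numpy as np

A = np.array([10, 20, 30])
B = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])

# zip pairs each element of A with the corresponding row of B
pairs = list(zip(A, B))             # [(10, row 0), (20, row 1), (30, row 2)]

# enumerate pairs an index with each item of that sequence
indexed = list(enumerate(pairs))    # [(0, (10, row 0)), (1, (20, row 1)), ...]

# map applies a function to every element of the first row
products = list(map(lambda bij: int(A[0] * bij), B[0]))
print(products)                     # [0, 10, 20, 30]

# reduce with "+" is what sum does: (((0 + 0) + 10) + 20) + 30
total = reduce(operator.add, products, 0)
print(total, sum(products))         # 60 60
```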

In Python, list comprehensions are more common than explicit functions and `map` because the syntax is more streamlined. The above could be written as

In [6]:
np.array([sum(ai * bij for bij in bi) for i, (ai, bi) in enumerate(zip(A, B))])

array([  60,  440, 1140])

The advantage of functional programming is that these functional primitives—`map`, `reduce`, `filter`, etc.—can be parallelized, as long as the user-supplied function can be called on different arguments in any order. This can be guaranteed with [pure functions](https://en.wikipedia.org/wiki/Pure_function), which do not modify any variables outside of themselves.
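
As a small sketch of that idea, the standard library's `concurrent.futures` has a `map` that hands a function to a pool of workers. The function name `row_product_sum` is made up for this example. Because the function is pure, it doesn't matter which worker handles which row or in what order they finish. (Threads in Python generally won't make this particular calculation faster; the point is only that the functional form can be handed to a scheduler unchanged.)

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

A = np.array([10, 20, 30])
B = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])

def row_product_sum(pair):
    # Pure function: the result depends only on the argument,
    # and nothing outside the function is modified.
    ai, bi = pair
    return sum(int(ai * bij) for bij in bi)

# Ordinary, sequential map
sequential = list(map(row_product_sum, zip(A, B)))

# Same function, same data, but the calls may run in any order on 3 threads;
# executor.map still returns the results in input order.
with ThreadPoolExecutor(max_workers=3) as executor:
    parallel = list(executor.map(row_product_sum, zip(A, B)))

print(sequential)   # [60, 440, 1140]
print(parallel)     # [60, 440, 1140]
```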

[Google MapReduce](https://research.google/pubs/pub62/)/Hadoop started the "Big Data" sensation by implementing `map+filter` and `reduce` on distributed datasets (2004). Today, it's how distributed computing in Apache Spark works, as well as RDataFrame in ROOT.

The disadvantage _in Python_ is that the functional primitives are implemented imperatively, so it's still slow.

<br><br><br><br><br>

### Object-oriented

Object-oriented programming is another big one; you've probably heard of it.

It has more to do with organizing the large-scale structure of a program into units, so while it can be applied to a small problem like this one, it doesn't help much.

In [7]:
class ABProductSum:
    def __init__(self):
        self._value = 0

    def accumulate(self, ai, bij):
        self._value += int(ai * bij)

    def __int__(self):
        return self._value

output = [ABProductSum(), ABProductSum(), ABProductSum()]

for i, (ai, bi) in enumerate(zip(A, B)):
    for bij in bi:
        output[i].accumulate(ai, bij)

np.array(output, dtype=int)

array([  60,  440, 1140])

Above, each `ABProductSum` is only responsible for its own sum. It changes values in place, but in a localized way.

Object-oriented programming is often contrasted with functional programming because objects are often designed around modifying their own state, whereas pure functions avoid any mutable state.

However, they're not incompatible: objects don't _need_ to have mutable state, and in some languages, class definitions are the only way to pass functions as arguments.

In [8]:
class Map:
    def apply(self, function, sequence):
        out = []
        for x in sequence:
            out.append(function.apply(x))   # expects the function object to have an 'apply' method
        return out

class Sum:
    def apply(self, sequence):
        out = 0
        for x in sequence:
            out += x
        return out

class UserFun1:
    def __init__(self, ai):   # UserFun1 knows about 'ai'; in functional programming, this is called a closure
        self._ai = ai

    def apply(self, bij):     # the actual function only depends on 'bij'
        return self._ai * bij

class UserFun2:
    def apply(self, args):
        i, (ai, bi) = args
        return Sum().apply(Map().apply(UserFun1(ai), bi))

np.fromiter(Map().apply(UserFun2(), enumerate(zip(A, B))), dtype=int)

array([  60,  440, 1140])
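
The other half of that claim, that objects don't need mutable state, can be sketched with a frozen dataclass. `FrozenProductSum` is a hypothetical variant of `ABProductSum`, written only for this illustration: instead of changing `self`, its `accumulate` method returns a new object.

```python
from dataclasses import dataclass
from functools import reduce

import numpy as np

A = np.array([10, 20, 30])
B = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])

@dataclass(frozen=True)          # frozen: assigning to attributes raises an error
class FrozenProductSum:          # hypothetical name, for illustration only
    value: int = 0

    def accumulate(self, ai, bij):
        # Returns a new object instead of modifying this one.
        return FrozenProductSum(self.value + int(ai * bij))

# Each row starts from an empty FrozenProductSum and folds in one product at a time.
output = [
    reduce(lambda acc, bij: acc.accumulate(ai, bij), bi, FrozenProductSum())
    for ai, bi in zip(A, B)
]

result = np.array([x.value for x in output])
print(result)   # [  60  440 1140]
```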

The advantage of object-oriented programming is that it is verbose: naming things and limiting scope within names helps large-scale programming.

The disadvantage of object-oriented programming is that it is verbose: this scaffolding is not always needed and can get in the way of solving small problems.

Also, it's implemented imperatively in Python, so no speed-up.

<br><br><br><br><br>

### Array-oriented

Array-oriented (or "columnar") programming consists of operations on arrays. Most statements do something to a whole array.

Here's an array-oriented solution to the sample problem in one line:

In [9]:
np.sum(A[:, np.newaxis] * B, axis=1)

array([  60,  440, 1140])

Did you catch that? Array-oriented programming is usually concise, which is only good when you know what's going on.

Breaking down the above,

In [10]:
A

array([10, 20, 30])

In [11]:
B

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

**Step 1:** NumPy has a rich slicing syntax. "`:`" means "pass a dimension through unchanged" and "`np.newaxis`" means "make a length-1 dimension here, at this depth of the slice."

In [12]:
A[:, np.newaxis]

array([[10],
       [20],
       [30]])

**Step 2:** Binary operations, such as "`*`", _broadcast_ arrays of different shapes.

Each row of `A[:, np.newaxis]` has 1 element and each row of `B` has 4, so the 1 element is duplicated for each of the 4.

In [13]:
A[:, np.newaxis] * B

array([[  0,  10,  20,  30],
       [ 80, 100, 120, 140],
       [240, 270, 300, 330]])
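
The duplication in Step 2 can be made visible with `np.broadcast_to`, which expands an array to a given shape in the same way a binary operation would. This is only a sketch to show what broadcasting does; the names `column` and `expanded` are for illustration. It also shows why the `np.newaxis` is needed: without it, the shapes `(3,)` and `(3, 4)` don't line up.

```python
import numpy as np

A = np.array([10, 20, 30])
B = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])

column = A[:, np.newaxis]                     # shape (3, 1)

# What broadcasting does implicitly: repeat the length-1 dimension 4 times.
expanded = np.broadcast_to(column, B.shape)   # shape (3, 4), a read-only view
print(expanded)
# [[10 10 10 10]
#  [20 20 20 20]
#  [30 30 30 30]]

# Multiplying the expanded array gives the same result as letting "*" broadcast.
print(np.array_equal(expanded * B, column * B))   # True

# Without np.newaxis, the last dimensions (3 and 4) are compared and don't match.
try:
    A * B
except ValueError as err:
    print("ValueError:", err)
```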

**Step 3:** Reducers, such as `np.sum`, apply at a given `axis` (dimension).

We want to sum over the inner lists, not the outer lists, so that's `axis=1`, not `axis=0`.

In [14]:
np.sum(A[:, np.newaxis] * B, axis=1)

array([  60,  440, 1140])

More on array-oriented programming later (the rest of this whole tutorial), but

   * the advantages are that it's concise (short expression) and explicit (you say exactly what happens to each array),
   * the disadvantages are that it's concise (too short to understand?) and explicit (interpreter can't optimize intermediate steps).
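
The remark about intermediate steps can be illustrated on the sample problem. In the sketch below (variable names are for illustration only), the one-liner first builds a full `(3, 4)` array of products and then sums it. Since `A[i]` is a common factor in each row, summing `B` first gives the same answer with only a length-3 intermediate. NumPy won't make that rearrangement for you; you have to write it yourself.

```python
import numpy as np

A = np.array([10, 20, 30])
B = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])

# As written in the one-liner: a temporary array as large as B is created first.
intermediate = A[:, np.newaxis] * B
print(intermediate.shape)              # (3, 4)
via_full = np.sum(intermediate, axis=1)

# Rearranged by hand: sum over j first, then multiply by A[i].
row_sums = np.sum(B, axis=1)           # [ 6 22 38], shape (3,)
via_small = A * row_sums

print(via_full)                        # [  60  440 1140]
print(via_small)                       # [  60  440 1140]
```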

Array-oriented programming is how most scientific Python libraries manage to be _expressive_ and _fast_ (because loops over arrays are implemented in compiled code).

<br><br><br><br><br>

### Declarative

Most often, physicists use the term "declarative programming" to mean functional programming or array-oriented programming.

I try to avoid being the Semantics Police, but conflating "declarative programming" with either of the above makes it impossible to talk about declarative programming in its own right.

Here is a solution to the sample problem in a truly declarative language:

In [15]:
np.einsum("i,ij -> i", A, B)

array([  60,  440, 1140])

[NumPy's einsum language](https://ajcr.net/Basic-guide-to-einsum/) can only multiply values from `A` with values from `B`, sum over axes, and transpose axes, but it does so in a way that

   * never says what order they should be performed in (as imperative programming would),
   * does not take arbitrary functions as arguments (as functional programming would),
   * doesn't refer to any intermediate arrays (as array-oriented programming would).

It just associates letters on the left of `->` with axes of the input arrays, letters on the right of `->` with axes of the output array, repeated letters with a shared index, and letters omitted from the output with summed-over axes.
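
A few more expressions in the same mini-language may help to see these rules at work. They are extra illustrations, not part of the sample problem, using the same `A` and `B`.

```python
import numpy as np

A = np.array([10, 20, 30])
B = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])

# Letters reordered in the output: transpose.
transposed = np.einsum("ij->ji", B)
print(transposed.shape)               # (4, 3)

# "i" omitted from the output: sum over axis 0.
column_sums = np.einsum("ij->j", B)
print(column_sums)                    # [12 15 18 21]

# Both letters omitted: sum over everything.
total = np.einsum("ij->", B)
print(total)                          # 66

# Repeated letter "i" in both inputs, omitted from the output: dot product.
dot = np.einsum("i,i->", A, A)
print(dot)                            # 1400

# Different letters, both kept: outer product.
outer = np.einsum("i,j->ij", A, A)
print(outer.shape)                    # (3, 3)
```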

#### Minor aside: the name "einsum"

The language in `np.einsum` is more general than [the notation Einstein invented](https://en.wikipedia.org/wiki/Einstein_notation), and our sample problem needs that generality.

However, without the `->`,

In [16]:
np.einsum("i,ij", A, B)

array([320, 380, 440, 500])

is the same as $A_i \, B^i_j$, as Einstein would write it.

That solves a different problem, though: one in which the sum is over `axis=0`, rather than `axis=1`:

In [17]:
np.sum(A[:, np.newaxis] * B, axis=0)

array([320, 380, 440, 500])

I don't know of a way to solve the sample problem in classical Einstein notation.

#### Other declarative languages

The most common declarative language you're likely to encounter is regular expressions:

In [18]:
import re

In [19]:
for match in re.finditer(r"[aeiou]\b", "Where do we see a vowel at the end of a word?"):
    print(match)

<re.Match object; span=(4, 5), match='e'>
<re.Match object; span=(7, 8), match='o'>
<re.Match object; span=(10, 11), match='e'>
<re.Match object; span=(14, 15), match='e'>
<re.Match object; span=(16, 17), match='a'>
<re.Match object; span=(29, 30), match='e'>
<re.Match object; span=(38, 39), match='a'>


Other examples are configuration files in YAML, templates in C++, SQL, etc.

They're usually _mini-languages_.

The advantages are that they can be very suited to their task—concise and easy to read—and they can be implemented in a highly optimized way.

The disadvantage is that they rarely generalize, and when they do, they lose their declarativeness.
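
Since SQL is on that list, here is a sketch of the sample problem as an SQL query, run through Python's built-in `sqlite3` module. The table layout (`A(i, value)` and `B(i, j, value)`) is an assumption made for this illustration. The query says which rows to match and what to sum, but not how to loop over them.

```python
import sqlite3

import numpy as np

A = np.array([10, 20, 30])
B = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])

# Hypothetical table layout: one row per array element, with explicit indexes.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE A (i INTEGER, value INTEGER)")
db.execute("CREATE TABLE B (i INTEGER, j INTEGER, value INTEGER)")
db.executemany(
    "INSERT INTO A VALUES (?, ?)",
    [(i, int(ai)) for i, ai in enumerate(A)],
)
db.executemany(
    "INSERT INTO B VALUES (?, ?, ?)",
    [(i, j, int(bij)) for i, bi in enumerate(B) for j, bij in enumerate(bi)],
)

# Match rows with the same i, multiply, and sum within each i.
rows = db.execute("""
    SELECT A.i, SUM(A.value * B.value)
    FROM A JOIN B ON A.i = B.i
    GROUP BY A.i
    ORDER BY A.i
""").fetchall()
db.close()

result = np.array([total for _, total in rows])
print(result)   # [  60  440 1140]
```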

<br><br><br><br><br>

### ~~Which is best?~~ Which problems are each best suited for?

| | Good for... | Bad for... |
|:-|:-|:-|
| **Imperative** | General-purpose programming that you know how to optimize. | Verbosity. Letting the compiler optimize it for you. |
| **Functional** | General-purpose programming that you want a framework to distribute or delay for you. | Verbosity. Following the chain of which functions call which. |
| **Object-oriented** | Organizing the large-scale structure of a program. | Ultra-verbosity in small problems. Localizing mutable state isn't as good as eliminating it. |
| **Array-oriented** | Concise numerical processing, interactivity, fast (compiled) operations. | Iterative algorithms. Large or many intermediate arrays. |
| **Declarative** | Ultra-concise expressions for specific tasks. | General-purpose programming. |

History of programming paradigms mentioned at CHEP (Computing in HEP) conferences, from 1985 through the present:

<img src="img/chep-papers-paradigm.svg" width="700">

<br><br><br><br><br>

## Array-oriented programming

<br><br><br><br><br>

### History

<br><br><br><br><br>

### Synergy with data analysis

<br><br><br><br><br>

### Array-oriented in Python: NumPy and everything else

<br><br><br><br><br>

### Limitations of array-oriented programming

<br><br><br><br><br>

### Accelerating code in Python

<br><br><br><br><br>

## From ROOT files into arrays

<br><br><br><br><br>

### Uproot

<br><br><br><br><br>

### Awkward Array

<br><br><br><br><br>

## Project: H → ZZ → 4ℓ

<br><br><br><br><br>

### 4 leptons of the same flavor

<br><br><br><br><br>

### Opposite charges

<br><br><br><br><br>

### On your own: the H → ZZ → 2μ2e case

<br><br><br><br><br>

### Hint!