## Setup

Before we can compare columnar analysis with event loops, we need to generate some sample data! For this example, we will use a file that has two columns (or branches, for people familiar with ROOT tree structures): the prime factors of every integer from 0 up to a number n, and the unique divisors of each of those integers. For example, 12 has prime factors 2, 2, 3 and unique divisors 1, 2, 3, 4, 6, 12.
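To make the two columns concrete, here is a minimal pure-Python sketch of how one entry could be computed. The function names `prime_factors` and `unique_divisors` are chosen here for illustration; this is not the code in `gen_primes`, which relies on primefac instead.

```python
def prime_factors(n):
    """Return the prime factors of n with multiplicity, in ascending order."""
    factors = []
    divisor = 2
    # Trial division: strip out each divisor as many times as it divides n
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 1
    # Whatever remains above 1 is itself prime
    if n > 1:
        factors.append(n)
    return factors


def unique_divisors(n):
    """Return every positive integer that divides n, in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


print(prime_factors(12))    # [2, 2, 3]
print(unique_divisors(12))  # [1, 2, 3, 4, 6, 12]
```

Trial division is slow for large numbers, but it is more than enough for a hundred small integers.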

I've included a module to generate this data. You can increase the value of n below to generate more data, but n = 100 is sufficient for our purposes.

For this, we need primefac (used by the generator to compute prime factors) and Python's built-in json module (the data is written to a JSON file).

In [None]:
!pip install primefac

In [2]:
import json

In [3]:
from utilities import gen_primes
n = 100
gen_primes.writer(n, dir='utilities')

## Event Loops

Now we'll be looking at our file "manually." This corresponds to the event loop method of investigating ROOT file contents. We open our file and look at its structure:

In [4]:
with open('utilities/prime_factors.json') as file:
    data = json.load(file)

Note that the structure is jagged: each subarray has a different (and arbitrary) length. This is in contrast to rectangular arrays, which have the same shape across all axes (e.g., every subarray is the same length).
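As a quick illustration of the distinction, the sketch below checks whether a nested list is rectangular by comparing the lengths of its sublists. The helper `is_rectangular` is a hypothetical name introduced only for this example.

```python
def is_rectangular(rows):
    """Return True if every sublist in rows has the same length."""
    return len({len(row) for row in rows}) <= 1


jagged = [[], [], [2], [3], [2, 2]]      # first five entries of prime_factors
rectangular = [[1, 2], [3, 4], [5, 6]]   # every row has length 2

print(is_rectangular(jagged))       # False
print(is_rectangular(rectangular))  # True
```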

In [5]:
data

{'prime_factors': [[],
  [],
  [2],
  [3],
  [2, 2],
  [5],
  [2, 3],
  [7],
  [2, 2, 2],
  [3, 3],
  [2, 5],
  [11],
  [2, 2, 3],
  [13],
  [2, 7],
  [3, 5],
  [2, 2, 2, 2],
  [17],
  [2, 3, 3],
  [19],
  [2, 2, 5],
  [3, 7],
  [2, 11],
  [23],
  [2, 2, 2, 3],
  [5, 5],
  [2, 13],
  [3, 3, 3],
  [2, 2, 7],
  [29],
  [2, 3, 5],
  [31],
  [2, 2, 2, 2, 2],
  [3, 11],
  [2, 17],
  [5, 7],
  [2, 2, 3, 3],
  [37],
  [2, 19],
  [3, 13],
  [2, 2, 2, 5],
  [41],
  [2, 3, 7],
  [43],
  [2, 2, 11],
  [3, 3, 5],
  [2, 23],
  [47],
  [2, 2, 2, 2, 3],
  [7, 7],
  [2, 5, 5],
  [3, 17],
  [2, 2, 13],
  [53],
  [2, 3, 3, 3],
  [5, 11],
  [2, 2, 2, 7],
  [3, 19],
  [2, 29],
  [59],
  [2, 2, 3, 5],
  [61],
  [2, 31],
  [3, 3, 7],
  [2, 2, 2, 2, 2, 2],
  [5, 13],
  [2, 3, 11],
  [67],
  [2, 2, 17],
  [3, 23],
  [2, 5, 7],
  [71],
  [2, 2, 2, 3, 3],
  [73],
  [2, 37],
  [3, 5, 5],
  [2, 2, 19],
  [7, 11],
  [2, 3, 13],
  [79],
  [2, 2, 2, 2, 5],
  [3, 3, 3, 3],
  [2, 41],
  [83],
  [2, 2, 3, 7],
  [5, 17]

And now, if we want to make any selections, we loop through the data explicitly to apply our cut. Let's say we want to select the numbers which have 3 as a prime factor and more than four unique divisors. Then we'd do something like:

In [16]:
# Select only numbers that have 3 as a prime factor and more than four unique divisors.
cut = []
for i in range(n):
    # `and` is the logical operator for plain Python booleans (`&` is bitwise)
    if 3 in data['prime_factors'][i] and len(data['unique_divisors'][i]) > 4:
        cut.append((i, data['unique_divisors'][i]))
cut

[(12, [1, 2, 3, 4, 6, 12]),
 (18, [1, 2, 3, 6, 9, 18]),
 (24, [1, 2, 3, 4, 6, 8, 12, 24]),
 (30, [1, 2, 3, 5, 6, 10, 15, 30]),
 (36, [1, 2, 3, 4, 6, 9, 12, 18, 36]),
 (42, [1, 2, 3, 6, 7, 14, 21, 42]),
 (45, [1, 3, 5, 9, 15, 45]),
 (48, [1, 2, 3, 4, 6, 8, 12, 16, 24, 48]),
 (54, [1, 2, 3, 6, 9, 18, 27, 54]),
 (60, [1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, 60]),
 (63, [1, 3, 7, 9, 21, 63]),
 (66, [1, 2, 3, 6, 11, 22, 33, 66]),
 (72, [1, 2, 3, 4, 6, 8, 9, 12, 18, 24, 36, 72]),
 (75, [1, 3, 5, 15, 25, 75]),
 (78, [1, 2, 3, 6, 13, 26, 39, 78]),
 (81, [1, 3, 9, 27, 81]),
 (84, [1, 2, 3, 4, 6, 7, 12, 14, 21, 28, 42, 84]),
 (90, [1, 2, 3, 5, 6, 9, 10, 15, 18, 30, 45, 90]),
 (96, [1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 96]),
 (99, [1, 3, 9, 11, 33, 99])]

This isn't too complicated, but with more columns and higher dimensions (as we have in high-energy physics!), it gets more burdensome.

## Columns

Columnar analysis is a simpler alternative that is increasingly popular throughout data science. Packages like **numpy** and **pandas** are the standards for columnar analysis, but they are designed for rectangular data, and our data is jagged. In HEP, we have a package called **awkward** that can handle jagged arrays.

In [17]:
import awkward as ak
data = ak.from_json('utilities/prime_factors.json')

In the columnar mode of analysis, our data is interpreted in a slightly different fashion. Instead of reading it number by number as in the event loop, we read it by its columnar keys (here, prime_factors and unique_divisors). These keys become the fields of our awkward array.
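The difference between the two views can be sketched in plain Python. The tiny `rows` list below is illustrative data for the numbers 4, 5, and 6, not the contents of the generated file; it shows the same information stored once as one record per number and once as one list per column.

```python
# Row-wise view: one record per number (here 4, 5, and 6)
rows = [
    {"prime_factors": [2, 2], "unique_divisors": [1, 2, 4]},
    {"prime_factors": [5], "unique_divisors": [1, 5]},
    {"prime_factors": [2, 3], "unique_divisors": [1, 2, 3, 6]},
]

# Column-wise view: one list per key, each holding that key's value for every number
columns = {key: [row[key] for row in rows] for key in rows[0]}

print(list(columns))             # ['prime_factors', 'unique_divisors']
print(columns["prime_factors"])  # [[2, 2], [5], [2, 3]]
```

In the columnar layout, a selection acts on a whole column at once instead of visiting each record in turn.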

We look at this mode below:

In [60]:
data

<Record ... 1, 2, 4, 5, 10, 20, 25, 50, 100]]} type='{"prime_factors": var * var...'>

In [61]:
data.fields

['prime_factors', 'unique_divisors']

In [62]:
data.prime_factors, data.unique_divisors

(<Array [[], [], [2], [3, ... 11], [2, 2, 5, 5]] type='101 * var * int64'>,
 <Array [[], [1], [1, ... 10, 20, 25, 50, 100]] type='101 * var * int64'>)

To do the same selection as above (numbers with 3 as a prime factor and more than 4 unique divisors), we now need only one line. There is **no** explicit loop. Awkward (like numpy) has its own syntax that replaces the standard Python loops and conditionals. The loops are performed implicitly in compiled code rather than in the Python interpreter, which makes both packages more efficient than normal Python code.

In [47]:
cut = data[(ak.sum(data.prime_factors == 3, axis=1) > 0) & (ak.num(data.unique_divisors) > 4)]
cut.unique_divisors

<Array [[1, 2, 3, 4, 6, ... 3, 9, 11, 33, 99]] type='20 * var * int64'>

Truncation of the printed array makes it hard to see whether this is identical to our output above. We can loop through the result and print it line by line to bypass truncation, but this isn't necessary to actually use the data (it's only an aesthetic issue!).

In [58]:
# Iterating over the array yields one sublist of divisors per selected number
for divisors in cut.unique_divisors:
    print(divisors)

[1, 2, 3, 4, 6, 12]
[1, 2, 3, 6, 9, 18]
[1, 2, 3, 4, 6, 8, 12, 24]
[1, 2, 3, 5, 6, 10, 15, 30]
[1, 2, 3, 4, 6, 9, 12, 18, 36]
[1, 2, 3, 6, 7, 14, 21, 42]
[1, 3, 5, 9, 15, 45]
[1, 2, 3, 4, 6, 8, 12, 16, 24, 48]
[1, 2, 3, 6, 9, 18, 27, 54]
[1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, 60]
[1, 3, 7, 9, 21, 63]
[1, 2, 3, 6, 11, 22, 33, 66]
[1, 2, 3, 4, 6, 8, 9, 12, 18, 24, 36, 72]
[1, 3, 5, 15, 25, 75]
[1, 2, 3, 6, 13, 26, 39, 78]
[1, 3, 9, 27, 81]
[1, 2, 3, 4, 6, 7, 12, 14, 21, 28, 42, 84]
[1, 2, 3, 5, 6, 9, 10, 15, 18, 30, 45, 90]
[1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 96]
[1, 3, 9, 11, 33, 99]


The result is indeed identical to the event loop output!
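To see what the one-line selection does conceptually, here is a plain-Python sketch of the same steps: build one boolean mask per column, combine the masks element by element, and keep the entries where the combined mask is true. It uses only the entries for 0 through 12, and the names `has_three`, `many_divisors`, and `mask` are introduced here for illustration. Awkward performs the equivalent steps in compiled code rather than with Python list comprehensions.

```python
# Columns for the integers 0 through 12
columns = {
    "prime_factors": [
        [], [], [2], [3], [2, 2], [5], [2, 3], [7],
        [2, 2, 2], [3, 3], [2, 5], [11], [2, 2, 3],
    ],
    "unique_divisors": [
        [], [1], [1, 2], [1, 3], [1, 2, 4], [1, 5], [1, 2, 3, 6], [1, 7],
        [1, 2, 4, 8], [1, 3, 9], [1, 2, 5, 10], [1, 11], [1, 2, 3, 4, 6, 12],
    ],
}

# One boolean per number for each condition
has_three = [3 in factors for factors in columns["prime_factors"]]
many_divisors = [len(divs) > 4 for divs in columns["unique_divisors"]]

# Combine the masks element by element, then keep entries where the mask is True
mask = [a and b for a, b in zip(has_three, many_divisors)]
selected = [divs for divs, keep in zip(columns["unique_divisors"], mask) if keep]

print(selected)  # [[1, 2, 3, 4, 6, 12]]
```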