<br><br><br><br><br>

# Columnar data analysis

<br><br><br><br><br>

<br><br><br><br>

<p style="font-size: 1.25em">Array programming is a programming paradigm, like Object-Oriented Programming (OOP) and Functional Programming (FP).</p>

<br>

<p style="font-size: 1.25em">As physicists, we are mostly familiar with <i>imperative, procedural, structured, object-oriented programming</i> (see <a href="https://en.wikipedia.org/wiki/Comparison_of_programming_paradigms#Main_paradigm_approaches">this list</a>).</p>

<br><br><br><br>

In [12]:
from IPython.display import IFrame    
IFrame("http://zoom.it/6rJp", width="100%", height="440")

<br>

<p style="font-size: 1.25em">Array programming is common to languages and systems designed for interactive data analysis.</p>

<img src="img/apl-timeline.png" width="100%">

<br>

<br><br>

<table align="left" width="32%" style="margin-right: 50px">
<tr style="background: white"><td align="center"><img src="img/tshirt.jpg" width="50%"></td></tr>
<tr style="background: white"><td><img src="img/apl-keyboard.jpg" width="100%"></td></tr>
</table>

<br>

<p style="font-size: 1.25em">APL (1963) pioneered conciseness in programming languages, and in doing so discovered the mistake of being too concise.</p>

| APL | <br> | Numpy |
|:---:|:----:|:-----:|
| <tt>ι4</tt> | <br> | <tt>numpy.arange(4)</tt> |
| <tt>(3+ι4)</tt> | <br> | <tt>numpy.arange(4) + 3</tt> |
| <tt>+/(3+ι4)</tt> | <br> | <tt>(numpy.arange(4) + 3).sum()</tt> |
| <tt>m ← +/(3+ι4)</tt> | <br> | <tt>m = (numpy.arange(4) + 3).sum()</tt> |

(The other extreme is writing a for loop for each of the above.)
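
To make that other extreme concrete, here is a minimal sketch of the last row of the table, written once with explicit loops and once with Numpy. The variable names (`values`, `m_loop`, `m_array`) are illustrative and do not come from the original.

```python
import numpy

# Loop version: build 0..3, add 3 to each element, then accumulate the sum.
values = []
for i in range(4):
    values.append(i + 3)

m_loop = 0
for v in values:
    m_loop += v

# Array version: the same three steps, each applied to a whole array at once.
m_array = (numpy.arange(4) + 3).sum()

print(m_loop, m_array)
```

Both versions compute the same number; the array version simply names each whole-array step instead of spelling out the iteration.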

<br><br>

<br><br><br><br>

<p style="font-size: 1.25em">The fundamental data type in this world is an array. (Some array languages don't have non-arrays.)</p>

<br>

<p style="font-size: 1.25em">Unlike the others (APL, IDL, MATLAB, R), Numpy is a library, not a language. Its predecessor, Numeric, dates back to the early days of Python (1995) and significantly influenced Python's grammar.</p>

<br><br><br><br>

In [14]:
import numpy, uproot
print(numpy.arange(20),                                        end="\n\n")
print(numpy.linspace(-5, 5, 21),                               end="\n\n")
print(numpy.empty(10000, numpy.float16),                       end="\n\n")
print(numpy.full((2, 7), 999),                                 end="\n\n")
print(numpy.random.normal(-1, 0.0001, 10000),                  end="\n\n")
print(uproot.open("data/Zmumu.root")["events"]["E1"].array(),  end="\n\n")

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]

[-5.  -4.5 -4.  -3.5 -3.  -2.5 -2.  -1.5 -1.  -0.5  0.   0.5  1.   1.5
  2.   2.5  3.   3.5  4.   4.5  5. ]

[0.007812 0.       0.       ... 0.006435      nan 0.      ]

[[999 999 999 999 999 999 999]
 [999 999 999 999 999 999 999]]

[-1.00000817 -1.00010952 -1.00009081 ... -1.00002829 -0.9999803
 -1.00003187]

[82.20186639 62.34492895 62.34492895 ... 81.27013558 81.27013558
 81.56621735]



<br><br>

<center><img src="img/numpy-memory-layout.png" width="90%"></center>

<br><br>

In [3]:
a = numpy.array([2**30, 2**30 + 2**26, -1, 0, 2**30 + 2**24, 2**30 + 2**20], numpy.int32)
# a = a.view(numpy.float32)
# a = a.reshape((2, 3))
# a = a.astype(numpy.int64)

print("data:\n", a, end="\n\n")
print("type:", type(a), end="\n\n")
print("dtype (type of the data it contains):", a.dtype, end="\n\n")
print("shape (size of each dimension):", a.shape, end="\n\n")

data:
 [1073741824 1140850688         -1          0 1090519040 1074790400]

type: <class 'numpy.ndarray'>

dtype (type of the data it contains): int32

shape (size of each dimension): (6,)
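
The three commented-out lines in the cell above each change how the same data is interpreted or stored. The sketch below applies them separately (rather than one after another) so that their effects can be compared; the names `as_float`, `as_matrix`, and `as_int64` are introduced here for illustration only.

```python
import numpy

a = numpy.array([2**30, 2**30 + 2**26, -1, 0, 2**30 + 2**24, 2**30 + 2**20], numpy.int32)

# view: reinterpret the same bytes as 32-bit floats; nothing is copied or converted.
as_float = a.view(numpy.float32)

# reshape: same bytes and same dtype, but indexed as 2 rows of 3 columns.
as_matrix = a.reshape((2, 3))

# astype: convert each value to a new dtype, which allocates a new buffer.
as_int64 = a.astype(numpy.int64)

print(as_float)
print(as_matrix.shape, numpy.shares_memory(a, as_matrix))
print(as_int64.dtype, numpy.shares_memory(a, as_int64))
```

The integers were chosen so that their bit patterns are simple floats: for instance, `2**30` has the bit pattern of the 32-bit float 2.0, and -1 (all bits set) is a NaN.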



In [5]:
# Any mathematical function that would map scalar arguments to a scalar result
#                                      maps array arguments to an array result.

a_array = numpy.random.uniform(5, 10, 10000);     a_scalar = a_array[0]
b_array = numpy.random.uniform(10, 20, 10000);    b_scalar = b_array[0]
c_array = numpy.random.uniform(-0.1, 0.1, 10000); c_scalar = c_array[0]

def quadratic_formula(a, b, c):
    return (-b + numpy.sqrt(b**2 - 4*a*c)) / (2*a)

print("scalar:\n", quadratic_formula(a_scalar, b_scalar, c_scalar), end="\n\n")
print("array:\n",  quadratic_formula(a_array,  b_array,  c_array), end="\n\n")

scalar:
 0.000676354845602843

array:
 [ 0.00067635 -0.00478878  0.0043468  ... -0.00593298  0.00644923
  0.00500946]



In [6]:
# Each step in the calculation is performed over whole arrays before moving on to the next.

a, b, c = a_array, b_array, c_array

roots1 = (-b + numpy.sqrt(b**2 - 4*a*c)) / (2*a)

tmp1 = numpy.negative(b)            # -b
tmp2 = numpy.square(b)              # b**2
tmp3 = numpy.multiply(4, a)         # 4*a
tmp4 = numpy.multiply(tmp3, c)      # tmp3*c
tmp5 = numpy.subtract(tmp2, tmp4)   # tmp2 - tmp4
tmp6 = numpy.sqrt(tmp5)             # sqrt(tmp5)
tmp7 = numpy.add(tmp1, tmp6)        # tmp1 + tmp6
tmp8 = numpy.multiply(2, a)         # 2*a
roots2 = numpy.divide(tmp7, tmp8)   # tmp7 / tmp8

roots1, roots2

(array([ 0.00067635, -0.00478878,  0.0043468 , ..., -0.00593298,
         0.00644923,  0.00500946]),
 array([ 0.00067635, -0.00478878,  0.0043468 , ..., -0.00593298,
         0.00644923,  0.00500946]))

In [7]:
# Even comparison operators are element-by-element.

roots1 == roots2

array([ True,  True,  True, ...,  True,  True,  True])

In [8]:
# So use a reducer (e.g. sum, max, min, any, all) to turn the array into a scalar.

(roots1 == roots2).all()

True

In [20]:
px, py, pz = uproot.open("data/Zmumu.root")["events"].arrays("p[xyz]1", outputtype=tuple)

p = numpy.sqrt(px**2 + py**2 + pz**2)
p

array([82.20179848, 62.34483942, 62.34483942, ..., 81.27006689,
       81.27006689, 81.56614892])

In [23]:
# But what if there are multiple values per event?

uproot.open("data/HZZ.root")["events"].array("Muon_Px")

<JaggedArray [[-52.899456 37.73778] [-0.81645936] [48.98783 0.8275667] ... [-29.756786] [1.1418698] [23.913206]] at 0x7c4d00a22630>

In [22]:
# JaggedArrays are designed to act like Numpy arrays; they even work in Numpy functions like numpy.sqrt.

px, py, pz = uproot.open("data/HZZ.root")["events"].arrays(["Muon_P[xyz]"], outputtype=tuple)

numpy.sqrt(px**2 + py**2 + pz**2)

<JaggedArray [[54.7794 39.401554] [31.69027] [54.739685 47.48874] ... [62.395073] [174.2086] [69.55613]] at 0x7c4d007b8f98>

<br><br>

<center><img src="img/numpy-memory-broadcasting.png" width="75%"></center>

<br><br>

In [24]:
E, px, py, pz = uproot.open("data/Zmumu.root")["events"].arrays(["E1", "p[xyz]1"], outputtype=tuple)

# Numpy arrays
#                   array   array   array   scalar
energy = numpy.sqrt(px**2 + py**2 + pz**2 + 0.1056583745**2)
energy, E

(array([82.20186639, 62.34492895, 62.34492895, ..., 81.27013558,
        81.27013558, 81.56621735]),
 array([82.20186639, 62.34492895, 62.34492895, ..., 81.27013558,
        81.27013558, 81.56621735]))

In [25]:
E, px, py, pz = uproot.open("data/HZZ.root")["events"].arrays(["Muon_E", "Muon_P[xyz]"], outputtype=tuple)

# JaggedArrays
#                   array   array   array   scalar
energy = numpy.sqrt(px**2 + py**2 + pz**2 + 0.1056583745**2)
energy, E

(<JaggedArray [[54.7795 39.401695] [31.690447] [54.739788 47.488857] ... [62.39516] [174.20863] [69.55621]] at 0x7c4d007b8908>,
 <JaggedArray [[54.7795 39.401695] [31.690445] [54.739788 47.488857] ... [62.39516] [174.20863] [69.55621]] at 0x7c4d00890e10>)

In [29]:
import awkward
scalar = 1000
flat   = numpy.array([100, 200, 300])
jagged = awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])

# With JaggedArrays, there are more broadcasting cases:
print(f"scalar + flat:   {scalar + flat}")
print(f"scalar + jagged: {scalar + jagged}")
print(f"  flat + jagged: {flat + jagged}")

scalar + flat:   [1100 1200 1300]
scalar + jagged: [[1001.1 1002.2 1003.3] [] [1004.4 1005.5]]
  flat + jagged: [[101.1 102.2 103.3] [] [304.4 305.5]]
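
One way to picture the `flat + jagged` case is to store the jagged array as a flat `content` array plus a `counts` array giving the length of each inner list, and to repeat each element of `flat` once per item in the corresponding inner list. This is a sketch in plain Numpy under that assumed representation, not a description of how awkward-array implements it; `content`, `counts`, and `expanded` are names chosen here.

```python
import numpy

flat = numpy.array([100, 200, 300])

# Assumed representation of [[1.1, 2.2, 3.3], [], [4.4, 5.5]]:
content = numpy.array([1.1, 2.2, 3.3, 4.4, 5.5])   # all inner values, concatenated
counts = numpy.array([3, 0, 2])                     # length of each inner list

# Repeat each per-event value once for every item in that event:
# 100 three times, 200 zero times, 300 twice.
expanded = numpy.repeat(flat, counts)

# Now both operands are flat and equally long, so ordinary addition applies.
result_content = expanded + content

# Split the flat result back into per-event pieces for display.
offsets = numpy.cumsum(counts)[:-1]
result = [piece.tolist() for piece in numpy.split(result_content, offsets)]
print(result)
```

The empty middle event contributes nothing to `expanded`, which is why `200` never appears in the result.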


In [39]:
jetx, jety, metx, mety = uproot.open("data/HZZ.root")["events"].arrays(
    ["Jet_P[xy]", "MET_p[xy]"], outputtype=tuple)

jet_phi = numpy.arctan2(jety, jetx)
met_phi = numpy.arctan2(mety, metx)

print(f"multi per event: {jet_phi}")
print(f"one per event:   {met_phi}")

print(f"\ndifference:      {jet_phi - met_phi}")

multi per event: [[] [2.669215] [] ... [-1.6703207] [2.8687775 -2.0823672] []]
one per event:   [ 0.40911174 -0.58348763  2.5796134  ...  1.2252938  -0.58017296
 -0.18039851]

difference:      [[] [3.2527027] [] ... [-2.8956146] [3.4489505 -1.5021942] []]


In [44]:
# Q: What about ensuring that each delta-phi is between -pi and pi without if/then?
# A: You start to pick up tricks, like this:

raw_diff = jet_phi - met_phi

bounded_diff = (raw_diff + numpy.pi) % (2*numpy.pi) - numpy.pi

# Should dphi be a library function? That's the kind of question we think about...

bounded_diff.flatten().min(), bounded_diff.flatten().max()

(-3.14096, 3.137677)
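
If such a helper did exist, it might look like the sketch below. The name `delta_phi` is hypothetical (it is not taken from Numpy, uproot, or awkward-array); it wraps the modulo trick from the cell above so that the result lies in the half-open interval from -pi (inclusive) to pi (exclusive).

```python
import numpy

def delta_phi(phi1, phi2):
    """Hypothetical helper: phi1 - phi2 wrapped into [-pi, pi)."""
    return (phi1 - phi2 + numpy.pi) % (2 * numpy.pi) - numpy.pi

# Works on scalars and arrays alike, because it only uses element-by-element operations.
wrapped = delta_phi(numpy.array([3.0, -3.0, 0.5]), numpy.array([-3.0, 3.0, 0.25]))
print(wrapped)
```

The first two inputs differ by 6 radians in raw terms, but the wrapped values are small because the two angles sit close together on either side of the ±pi boundary.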

In [45]:
# Another way JaggedArrays extend Numpy arrays:

# Reducers, like sum, min, max, turn flat arrays into scalars.

met_phi.min(), met_phi.max()

(-3.141034, 3.1297169)

In [46]:
# Another way JaggedArrays extend Numpy arrays:

# Reducers, like sum, min, max, turn jagged arrays into flat arrays.

jet_phi.min(), jet_phi.max()

(array([       inf,  2.669215 ,        inf, ..., -1.6703207, -2.0823672,
               inf], dtype=float32),
 array([      -inf,  2.669215 ,       -inf, ..., -1.6703207,  2.8687775,
              -inf], dtype=float32))
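
The `inf` and `-inf` entries come from events with no jets: the reduction over an empty inner array returns the identity of the operation, that is, the value that leaves any other value unchanged under `min` or `max`. Plain Numpy refuses to take the minimum of an empty array unless an `initial` value is supplied, so the sketch below uses `initial` to reproduce the same convention; the variable names are illustrative.

```python
import numpy

no_jets = numpy.array([], dtype=numpy.float32)

# Without `initial`, numpy.min and numpy.max raise ValueError on an empty array.
lowest = numpy.min(no_jets, initial=numpy.inf)
highest = numpy.max(no_jets, initial=-numpy.inf)
print(lowest, highest)

# inf and -inf are identities: combining them with a real value returns that value.
print(min(lowest, 2.5), max(highest, 2.5))
```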

In [47]:
# The meaning of flat.sum() is "sum of all elements of the flat array."
# The meaning of jagged.sum() is "sum of all elements in each inner array of the jagged array."

jagged = awkward.fromiter([[1.0, 2.0, 3.0], [], [4.0, 5.0]])
jagged.sum()   # min, max

array([6., 0., 9.])
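
The per-event meaning of `jagged.sum()` can be imitated in plain Numpy by labeling each value with the index of the inner array it belongs to and summing by label. This is only a sketch under an assumed content-plus-counts representation; `content`, `counts`, and `parents` are illustrative names, and this is not necessarily how awkward-array computes it.

```python
import numpy

# Assumed representation of [[1.0, 2.0, 3.0], [], [4.0, 5.0]]:
content = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])   # all inner values, concatenated
counts = numpy.array([3, 0, 2])                     # length of each inner list

# Label every value with the index of the inner list it came from.
parents = numpy.repeat(numpy.arange(len(counts)), counts)

# Sum the values that share a label; minlength keeps a slot for trailing empty lists.
per_event_sum = numpy.bincount(parents, weights=content, minlength=len(counts))
print(per_event_sum)
```

The empty inner list has no values carrying its label, so its sum is 0, the identity of addition.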

In [49]:
# jagged.sum().sum() completes the process, resulting in a scalar.
# But jagged.flatten().sum() gives the same result. Why?

jagged.sum().sum(), jagged.flatten().sum()

(15.0, 15.0)

In [53]:
# mean, var, and std are also available, just as in Numpy, but unlike sum, min, and max, they are not associative.

# "Don't do a mean of means unless you mean it!"

jet_phi.mean()

array([        nan,  2.66921496,         nan, ..., -1.67032075,
        0.39320517,         nan])
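
The warning about a mean of means can be checked with a small example. The sketch below uses plain Python lists and Numpy rather than a JaggedArray, and the data are made up for illustration.

```python
import numpy

events = [[1.0, 2.0, 3.0], [4.0, 5.0]]   # made-up data: two events of different lengths

# Sums can be combined in stages: per-event sums, then a sum of those.
sum_of_sums = numpy.array([numpy.sum(e) for e in events]).sum()
overall_sum = numpy.sum(numpy.concatenate(events))

# Means cannot: a mean of per-event means weights each event equally,
# whereas the mean of the flattened data weights each value equally.
mean_of_means = numpy.array([numpy.mean(e) for e in events]).mean()
overall_mean = numpy.mean(numpy.concatenate(events))

print(sum_of_sums, overall_sum)
print(mean_of_means, overall_mean)
```

The two sums agree, but the two means do not, because the shorter event gets as much weight as the longer one in the mean of means.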

In [67]:
# Also worth noting that any and all are reducers... of booleans.

same_hemicircle = (abs(bounded_diff) < numpy.pi/2)

print(f"same_hemicircle:             {same_hemicircle}")
print(f"same_hemicircle.any():       {same_hemicircle.any()}")
print(f"same_hemicircle.any().any(): {same_hemicircle.any().any()}")
print(f"same_hemicircle.any().all(): {same_hemicircle.any().all()}")
print(f"same_hemicircle.all():       {same_hemicircle.all()}")
print(f"same_hemicircle.all().any(): {same_hemicircle.all().any()}")
print(f"same_hemicircle.all().all(): {same_hemicircle.all().all()}")

same_hemicircle:             [[] [False] [] ... [False] [False True] []]
same_hemicircle.any():       [False False False ... False  True False]
same_hemicircle.any().any(): True
same_hemicircle.any().all(): False
same_hemicircle.all():       [ True False  True ... False False  True]
same_hemicircle.all().any(): True
same_hemicircle.all().all(): False
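
The events with no jets explain the `True` entries in `same_hemicircle.all()`: `all` over an empty collection is `True` and `any` over an empty collection is `False`, in plain Python and Numpy as well. A short check, with an illustrative variable name:

```python
import numpy

no_jets = numpy.array([], dtype=bool)

# Numpy reducers on an empty boolean array
empty_any = bool(no_jets.any())
empty_all = bool(no_jets.all())
print(empty_any, empty_all)

# Python's built-ins follow the same convention
print(any([]), all([]))
```

So an event passing an `all` requirement may simply have nothing to test; combining it with a check on the number of items per event avoids that surprise.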
