# Lesson 2: Number Crunching in Python

Roughly speaking, there are two kinds of languages, fast and slow, and Python is one of the slow kind.

<br>

<center>
<img src="img/benchmark-games-2023.svg" width="75%">
</center>

That is, it was designed with convenience in mind, rather than speed.

<br><br>

Data analysis frequently involves calculations on large datasets, so speed and memory use are important!

<br><br>

How did Python come to be such a popular data analysis language with this against it?

## The why and the how

<br>

Let's reload the Higgs dataset to get a single list of numbers.

In [1]:
import json
with open("data/SMHiggsToZZTo4L.json") as file:   # closes the file when done
    dataset_python = json.load(file)

<br>

In [2]:
pt_python = []
for event in dataset_python:
    for electron in event["electron"]:
        pt_python.append(electron["pt"])
    for muon in event["muon"]:
        pt_python.append(muon["pt"])

<br>

In [7]:
len(pt_python)

28809

<br>

Look at the first few by slicing it.

In [6]:
pt_python[0:4]

[63.04386901855469, 38.12034606933594, 4.04868745803833, 21.902679443359375]

How much memory is this list using?

<br>

In [8]:
import sys

num_bytes_python = 0
num_bytes_python += sys.getsizeof(pt_python)   # size of the list, not including the numbers

for x in pt_python:
    num_bytes_python += sys.getsizeof(x)       # size of each number
    
num_bytes_python

937904

<br>

How many bytes per value? Is it more than the 8 bytes that a double-precision floating point number needs?

<br>

In [9]:
num_bytes_python / len(pt_python)

32.55593738067965
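
A rough breakdown of that number, as a sketch that assumes 64-bit CPython (the exact sizes are implementation details): each `float` is a full Python object, and the list stores a pointer to it. Any leftover fraction of a byte per value is consistent with the spare capacity a list reserves as it grows by `append`.

```python
import sys

# Assumption: 64-bit CPython; these sizes are implementation details.
float_object_bytes = sys.getsizeof(1.0)   # object header plus an 8-byte double, typically 24

# The list holds one pointer per element; estimate it by comparing list sizes.
pointer_bytes = (sys.getsizeof([0.0] * 1000) - sys.getsizeof([])) / 1000   # typically 8

print(float_object_bytes, pointer_bytes)
```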

Get the same data as an array (from an HDF5 file).

<br>

In [11]:
import h5py

dataset_hdf5 = h5py.File("data/SMHiggsToZZTo4L.h5")

pt_numpy = dataset_hdf5["particles"]["pt"][:]
pt_numpy

array([63.04387  , 38.120346 ,  4.0486875, ..., 60.098644 ,  3.7663147,
       21.205685 ], dtype=float32)

<br>

Do the list and the array contain the same values?

In [12]:
assert len(pt_python) == len(pt_numpy)

for list_x, array_x in zip(pt_python, pt_numpy):
    assert list_x == array_x

How does their memory use compare?

In [13]:
sys.getsizeof(pt_numpy) / len(pt_numpy)

4.003887673990767

<br>

In [14]:
num_bytes_python / len(pt_python)

32.55593738067965
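
The array's own attributes tell the same story. This sketch uses a hypothetical stand-in with the same length and dtype as `pt_numpy` (not the real data), so it runs without the HDF5 file.

```python
import sys
import numpy as np

# Hypothetical stand-in for pt_numpy: same length and dtype, different values.
example = np.zeros(28809, dtype=np.float32)

print(example.itemsize)   # bytes per element: 4 for float32
print(example.nbytes)     # data buffer only: 28809 * 4 = 115236

# Whatever remains is a fixed per-array overhead, independent of the length.
print(sys.getsizeof(example) - example.nbytes)
```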

<br>

How does the speed of computation compare? (Note the units.)

In [15]:
%%timeit

pt2_python = []
for x in pt_python:
    pt2_python.append(x**2)

3.49 ms ± 400 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


<br>

In [16]:
%%timeit

pt2_numpy = pt_numpy**2

4.67 µs ± 154 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
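
Outside of IPython, where `%%timeit` is not available, a similar comparison can be sketched with the standard `timeit` module. The data here are random stand-ins (an assumption, not the Higgs dataset), and the timings will vary by machine.

```python
import timeit
import numpy as np

# Hypothetical stand-in data, same length as the lesson's list.
values_numpy = np.random.uniform(0, 100, 28809).astype(np.float32)
values_python = values_numpy.tolist()

def square_with_loop():
    return [x**2 for x in values_python]

def square_with_numpy():
    return values_numpy**2

# Both compute the same thing (up to float32 rounding in the NumPy version).
assert np.allclose(square_with_loop(), square_with_numpy())

loop_seconds = timeit.timeit(square_with_loop, number=20) / 20
numpy_seconds = timeit.timeit(square_with_numpy, number=20) / 20
print(f"loop: {loop_seconds * 1e6:.0f} µs, NumPy: {numpy_seconds * 1e6:.0f} µs")
```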


Memory layout of a Python list.

<br>

<center>
<img src="img/python-list-layout.svg" width="75%">
</center>

Memory layout of a NumPy array.

<br>

<center>
<img src="img/python-array-layout.svg" width="75%">
</center>

This also hints at a limitation of NumPy: an array can't mix data types.

<br>

Python list: data type is a property of each element.

In [None]:
type(pt_python[0])

<br>

NumPy array: data type is a property of the whole array.

In [None]:
pt_numpy.dtype

In [None]:
pt_numpy.dtype.type

<br>

(Caveat: actually, NumPy has a `dtype('object')` to store Python objects, but that has no advantage over Python lists.)
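
A small illustration of both points, using made-up values: mixing numeric types makes NumPy convert everything to one common type, and `dtype=object` keeps the original Python objects at the cost of the compact layout.

```python
import numpy as np

mixed_numbers = np.array([1, 2.5])   # int and float: everything becomes float64
print(mixed_numbers.dtype)           # float64

as_objects = np.array([1, 2.5, "three"], dtype=object)   # keeps the Python objects as they are
print([type(x).__name__ for x in as_objects])            # ['int', 'float', 'str']
```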

## NumPy

NumPy is a third-party (but fundamental!) library for data and computations _in_ C, _from_ Python.

<br>

In [17]:
import numpy as np

<br>

<center>
<img src="img/Numpy_Python_Cheat_Sheet.svg" width="60%">
</center>

We've already seen that NumPy has a different syntax than ordinary Python:

<br>

In [18]:
pt_numpy**2

array([3974.5295  , 1453.1608  ,   16.39187 , ..., 3611.847   ,
         14.185126,  449.68106 ], dtype=float32)

<br>

computes the square of _every_ element in the array, returning another array.

<br><br><br>

This is fast because the single Python operator (`**`) invokes a loop that has been compiled in C, which runs at the speed of C.

<br><br><br>

C is fast because it doesn't have to stop and check data types before each operation (among other things).

What about whole expressions? Consider

In [19]:
def quadratic_formula(a, b, c):
    return (-b + np.sqrt(b**2 - 4*a*c)) / (2*a)

<br>

given

In [20]:
a = 5
b = 10
c = -0.1

quadratic_formula(a, b, c)

0.009950493836207741

<br>

**Quizlet:** Before running it, what will happen in the following?

In [21]:
a = np.random.uniform(5, 10, 1000000)
b = np.random.uniform(10, 20, 1000000)
c = np.random.uniform(-0.1, 0.1, 1000000)

quadratic_formula(a, b, c)

array([-0.0036705 , -0.00320118,  0.00233061, ...,  0.00666895,
       -0.00512143,  0.00239149])

NumPy is a big step forward in performance, but it's still not as fast as a single loop in compiled code.

<br>

Consider that each step in

In [22]:
def quadratic_formula(a, b, c):
    return (-b + np.sqrt(b**2 - 4*a*c)) / (2*a)

<br>

is a separate loop that makes a separate array in each step.

<br>

In [23]:
def quadratic_formula_equivalent(a, b, c):
    tmp1 = np.negative(b)            # -b
    tmp2 = np.square(b)              # b**2
    tmp3 = np.multiply(4, a)         # 4*a
    tmp4 = np.multiply(tmp3, c)      # tmp3*c
    del tmp3
    tmp5 = np.subtract(tmp2, tmp4)   # tmp2 - tmp4
    del tmp2, tmp4
    tmp6 = np.sqrt(tmp5)             # sqrt(tmp5)
    del tmp5
    tmp7 = np.add(tmp1, tmp6)        # tmp1 + tmp6
    del tmp1, tmp6
    tmp8 = np.multiply(2, a)         # 2*a
    return np.divide(tmp7, tmp8)     # tmp7 / tmp8

NumPy is a _little_ smarter than this (it can sometimes reuse the memory of a temporary array instead of allocating a new one), but it's limited.

In [24]:
%%timeit

quadratic_formula(a, b, c)

8.8 ms ± 902 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


<br>

In [25]:
%%timeit

quadratic_formula_equivalent(a, b, c)

11.9 ms ± 644 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
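
One way to cut down on temporary arrays by hand is the `out=` argument that NumPy's element-wise functions accept. The function below is a hypothetical variant, not part of the lesson, and it assumes `a`, `b`, and `c` are floating point arrays of the same shape. It still runs one loop per step, so it is a sketch of reducing allocations, not of fusing loops.

```python
import numpy as np

def quadratic_formula_in_place(a, b, c):
    # Hypothetical variant: only two arrays are allocated, then reused via out=.
    disc = np.square(b)                  # b**2            (new array)
    tmp = np.multiply(a, c)              # a*c             (new array)
    np.multiply(tmp, 4, out=tmp)         # 4*a*c           (in place)
    np.subtract(disc, tmp, out=disc)     # b**2 - 4*a*c    (in place)
    np.sqrt(disc, out=disc)              # sqrt(...)       (in place)
    np.subtract(disc, b, out=disc)       # -b + sqrt(...)  (in place)
    np.multiply(a, 2, out=tmp)           # 2*a             (in place)
    np.divide(disc, tmp, out=disc)       # final result    (in place)
    return disc

a = np.random.uniform(5, 10, 1000)
b = np.random.uniform(10, 20, 1000)
c = np.random.uniform(-0.1, 0.1, 1000)
expected = (-b + np.sqrt(b**2 - 4*a*c)) / (2*a)
```

The inputs are left unmodified; only the two internal buffers are overwritten.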


<br>

Other libraries actually run the whole expression in one loop.

In [26]:
import numexpr

numexpr.evaluate("(-b + sqrt(b**2 - 4*a*c)) / (2*a)")   # quoted, to be sent to a compiler

ModuleNotFoundError: No module named 'numexpr'

<br>

In [None]:
%%timeit

numexpr.re_evaluate()

Anyway, let's focus on NumPy for now because the array concept is a convenient tool for fast-enough interactive computation.

<br>

It gets more interesting when you combine array-at-a-time logic with slicing and other tricks.

<br>

**Quizlet:** Given an array of numbers, like

<br>

In [27]:
array = np.array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
array

array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

<br>

How would you compute the distances between the elements?

<br>

<br>

<details>
    <summary><b>Hint...</b></summary>

How many elements does it have?

</details>

<details>
    <summary><b>Answer... Don't look!</b></summary>

Direct subtraction of the two arrays:

<center>
<img src="img/flat-operation.svg" width="40%">
</center>

Subtraction with slices:

<center>
<img src="img/shifted-operation.svg" width="40%">
</center>
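
In code, the slice-based subtraction can be sketched as follows, with np.diff (an existing NumPy function) as a cross-check.

```python
import numpy as np

array = np.array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

# Drop the first element from one copy and the last from the other, then subtract.
distances = array[1:] - array[:-1]   # 8 differences from 9 elements

# np.diff computes the same adjacent differences.
assert np.allclose(distances, np.diff(array))
```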

</details>

### Slicing

Python has a very concise slicing syntax:

```python
container[start:stop:step]
```

Any of `start`, `stop`, and `step` can be left out to get a default. Negative `start` and `stop` values count backward from the end.

In [1]:
container = [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]

In [2]:
container[2:]

[2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]

<br>

In [3]:
container[:5]

[0.0, 1.1, 2.2, 3.3, 4.4]

<br>

In [4]:
container[-6:-2]

[4.4, 5.5, 6.6, 7.7]

<br>

In [5]:
container[4:10:2]

[4.4, 6.6, 8.8]

<br>

In [6]:
container[::3]

[0.0, 3.3, 6.6, 9.9]
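
A few more cases with the same list, as a supplementary sketch: a negative `step` walks backward, so `start` has to be larger than `stop`.

```python
container = [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]

print(container[::-1])     # whole list, reversed
print(container[-3:])      # last three elements: [7.7, 8.8, 9.9]
print(container[8:2:-2])   # from index 8 down to (not including) 2: [8.8, 6.6, 4.4]
```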

<img src="img/numpy-slicing.png" width="25%">

NumPy goes beyond ordinary slicing by allowing slices in multiple dimensions.

In [28]:
arr = np.array([[1.1, 2.2, 3.3],
                [4.4, 5.5, 6.6],
                [7.7, 8.8, 9.9]])
arr

array([[1.1, 2.2, 3.3],
       [4.4, 5.5, 6.6],
       [7.7, 8.8, 9.9]])

In [31]:
arr[:2, 1:]

array([[2.2, 3.3],
       [5.5, 6.6]])

<br>

In [32]:
arr[2:, :]

array([[7.7, 8.8, 9.9]])

<br>

In [None]:
arr[:, :2]

<br>

In [None]:
arr[1:2, :2]

<center>
<img src="img/numpy-memory-layout.png" width="75%">
</center>

<center>
<img src="img/numpy-memory-reshape.png" width="75%">
</center>

<center>
<img src="img/numpy-memory-slice.png" width="75%">
</center>
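
Assuming the figures above illustrate that reshaping and slicing reinterpret one underlying buffer instead of copying it, here is a small sketch of that behavior with made-up arrays.

```python
import numpy as np

flat = np.arange(12)
grid = flat.reshape((3, 4))   # same buffer, viewed as 3 rows of 4
corner = grid[:2, 1:]         # same buffer again, with a different offset and shape

print(np.shares_memory(flat, grid), np.shares_memory(flat, corner))   # True True

corner[0, 0] = 99             # writing through the slice...
print(flat[1])                # ...changes the original: 99
```

Use `.copy()` on a slice when an independent array is needed.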

**Quizlet:** Slice `three_dimensional` such that it looks like

```python
[[ 4  9]
 [24 29]]
```

In [33]:
three_dimensional = np.arange(30).reshape((3, 2, 5))
three_dimensional

array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9]],

       [[10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]],

       [[20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29]]])

<br>

In [34]:
three_dimensional[ : : ]

array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9]],

       [[10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]],

       [[20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29]]])

<br>

### Advanced slicing

Not only that, but NumPy can slice arrays with arrays.

In [35]:
arr  = np.array([  0.0,   1.1,   2.2,   3.3,   4.4,  5.5,   6.6,  7.7,   8.8,  9.9])
mask = np.array([False, False, False, False, False, True, False, True, False, True])
#                                                    5.5          7.7          9.9

<br>

In [36]:
arr[mask]

array([5.5, 7.7, 9.9])
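
In practice, masks are rarely typed out by hand; they usually come from a comparison on the array itself. A minimal sketch with the same values:

```python
import numpy as np

arr = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

mask = arr > 5    # array of booleans, one per element
print(mask.sum()) # number of True values: 5
print(arr[mask])  # [5.5 6.6 7.7 8.8 9.9]

# Masks combine with & (and), | (or), ~ (not); the parentheses are required.
print(arr[(arr > 2) & (arr < 5)])   # [2.2 3.3 4.4]
```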

<br>

An array of integers picks out elements by index.

In [37]:
arr[np.array([5, 7, -1])]

array([5.5, 7.7, 9.9])

<br>

They can be out of order.

In [38]:
arr[np.array([-1, 7, 5])]

array([9.9, 7.7, 5.5])

<br>

They can even include duplicates.

In [None]:
arr[np.array([-1, -1, -1, 7, 7, 5])]
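
The two kinds of array index are related: `np.flatnonzero` turns a boolean mask into the equivalent integer positions. The sketch below also shows that the result takes its length from the index array, not from the array being sliced.

```python
import numpy as np

arr  = np.array([  0.0,   1.1,   2.2,   3.3,   4.4,  5.5,   6.6,  7.7,   8.8,  9.9])
mask = np.array([False, False, False, False, False, True, False, True, False, True])

indices = np.flatnonzero(mask)   # positions of the True values: [5 7 9]
assert (arr[indices] == arr[mask]).all()

# Duplicated indices repeat elements, so the output has as many items as the index.
repeated = arr[np.array([-1, -1, -1, 7, 7, 5])]
print(repeated)                  # [9.9 9.9 9.9 7.7 7.7 5.5]
```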

In [None]:
text = """
WOULD YOU LIKE GREEN EGGS AND HAM?

I DO NOT LIKE THEM, SAM-I-AM.
I DO NOT LIKE GREEN EGGS AND HAM.

WOULD YOU LIKE THEM HERE OR THERE?

I WOULD NOT LIKE THEM HERE OR THERE.
I WOULD NOT LIKE THEM ANYWHERE.
I DO NOT LIKE GREEN EGGS AND HAM.
I DO NOT LIKE THEM, SAM-I-AM.

WOULD YOU LIKE THEM IN A HOUSE?
WOULD YOU LIKE THEM WITH A MOUSE?

I DO NOT LIKE THEM IN A HOUSE.
I DO NOT LIKE THEM WITH A MOUSE.
I DO NOT LIKE THEM HERE OR THERE.
I DO NOT LIKE THEM ANYWHERE.
I DO NOT LIKE GREEN EGGS AND HAM.
I DO NOT LIKE THEM, SAM-I-AM.

WOULD YOU EAT THEM IN A BOX?
WOULD YOU EAT THEM WITH A FOX?

NOT IN A BOX. NOT WITH A FOX.
NOT IN A HOUSE. NOT WITH A MOUSE.
I WOULD NOT EAT THEM HERE OR THERE.
I WOULD NOT EAT THEM ANYWHERE.
I WOULD NOT EAT GREEN EGGS AND HAM.
I DO NOT LIKE THEM, SAM-I-AM.
"""

In [None]:
words = np.array(text.replace(",", " ").replace(".", " ").replace("?", " ").replace("!", " ").replace("-", " ").split())
dictionary, index = np.unique(words, return_inverse=True)

<br>

In [None]:
dictionary

<br>

In [None]:
index

**Quizlet:** What's going to happen?

<br>

In [None]:
dictionary[index]

<br>

<details>
    <summary><b>Hint...</b></summary>

<br>

```
index             : positions in corpus (0, 1, 2, ...) → integer codes
dictionary        : integer codes                      → words

dictionary[index] : positions in corpus (0, 1, 2, ...) → words
```
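
The same round trip on a shorter, made-up corpus, small enough to check by eye:

```python
import numpy as np

words = np.array("I DO NOT LIKE THEM SAM I AM".split())
dictionary, index = np.unique(words, return_inverse=True)

print(dictionary)   # sorted unique words: ['AM' 'DO' 'I' 'LIKE' 'NOT' 'SAM' 'THEM']
print(index)        # code of each word in the corpus: [2 1 4 3 6 5 2 0]

# Looking up every code in the dictionary rebuilds the original corpus.
assert (dictionary[index] == words).all()
```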

</details>

### Reductions

We've seen operations that apply to each element of an<br>array, producing a new array of the same length ("map").

<br>

NumPy also has operations that turn $n$-dimensional<br>arrays into $(n-1)$-dimensional arrays ("reduce").

<br>

In [None]:
arr = np.array([[  1,   2,   3,   4],
                [ 10,  20,  30,  40],
                [100, 200, 300, 400]])

<img src="img/example-reducer-2d.svg" width="50%">

<br>

In [None]:
np.sum(arr, axis=0)

<br>

In [None]:
np.sum(arr, axis=1)
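
For reference, a sketch of what the `axis` argument does on this array: the named axis is the one that disappears from the result. Leaving `axis` out reduces over all axes at once.

```python
import numpy as np

arr = np.array([[  1,   2,   3,   4],
                [ 10,  20,  30,  40],
                [100, 200, 300, 400]])

print(np.sum(arr, axis=0))   # collapse the rows:    [111 222 333 444]
print(np.sum(arr, axis=1))   # collapse the columns: [  10  100 1000]
print(np.sum(arr))           # no axis: reduce everything to one number, 1110
print(np.max(arr, axis=0))   # other reducers take axis too: [100 200 300 400]
```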

This is the end of the NumPy section.

<br>
