In [93]:
%%html

<style>
.jp-MarkdownOutput {
    font-size: 2.5em !important;
}
.jp-MarkdownOutput table {
    font-size: 1em !important;
}
.jp-OutputArea-output pre {
    font-size: 2em !important;
}
.cm-content {
    font-size: 2em !important;
}
.page-id-xx, html {
    scrollbar-width: none; /* FF */
}
::-webkit-scrollbar {
    width: 0px; /* Chrome & Edge */
}
.jp-Notebook.jp-mod-commandMode .jp-Cell.jp-mod-selected {
    background: none;
}
</style>

<img src="img/title.svg" style="width: 90%; border: 1px solid black; margin: 3em auto;">

## Outline

<img src="img/swallows-coconut.jpg" style="width: 30%; float: right; margin-top: 50px;">

* Who wants high performance and why?
* Understanding why Python is slow
* Python escape hatches:
  * NumPy
  * Awkward Array
  * JIT-compilation
* Special topics:
  * The Python garbage collector
  * The Python GIL (Global Interpreter Lock)
* No conclusions: we'll stop when we run out of time

## Who wants high performance and why?

* "I'm shepherding a computation that runs for months and 5% faster means 5% less electricity, operating costs, and $CO_2$."

* "I'm building an interactive app, and if each user-initiated action completes in less than human reaction time (100 ms), the app will feel snappy."

* "I'm analyzing a large dataset, and if the computation completes over a lunch break instead of overnight, I'll be able to run it more often to do more in-depth investigations."

* "I'm trying to do &lt;anything at all&gt;, and I can't because I run out of RAM, open file handles, TTYs, ..."

This is me:

* "I'm analyzing a large dataset, and if the computation completes over a lunch break instead of overnight, I'll be able to run it more often to do more in-depth investigations."
* "I'm trying to do &lt;anything at all&gt;, and I can't because I run out of RAM, open file handles, TTYs, ..."

<br><br><br>

Therefore, I usually only care about **speed** when it's an order of magnitude gain and **memory** when necessary.

<img src="img/clock-rate-timeline-1.svg" style="width: 90%; margin: 3em auto;">

<img src="img/clock-rate-timeline-2.svg" style="width: 90%; margin: 3em auto;">

<img src="img/clock-rate-timeline-3.svg" style="width: 90%; margin: 3em auto;">

<img src="img/clock-rate-timeline-4.svg" style="width: 90%; margin: 3em auto;">

There are now only three ways to speed up code:

1. parallel processing

2. fixing bloopers: what you thought it was doing is not what it's doing, or there's a faster method you just didn't know about

3. turning off dynamic features that you don't need

<img src="img/dynamic-features.svg" style="width: 60%; margin: 3em auto;">

"Dynamic feature":

* decisions are made in the loop that scales with dataset size

<br>

"that you don't need":

* information is known before starting that loop; it doesn't need to be re-derived
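
As a minimal sketch of this distinction (the function names and the unit table are made up for illustration), the first function below re-makes a decision on every iteration, while the second makes it once before the loop starts:

```python
def scale_dynamic(values, unit):
    """Decide the conversion factor again for every element."""
    out = []
    for v in values:
        # This branch runs once per element, although `unit` never changes.
        if unit == "cm":
            factor = 0.01
        elif unit == "mm":
            factor = 0.001
        else:
            raise ValueError(f"unknown unit: {unit}")
        out.append(v * factor)
    return out


def scale_hoisted(values, unit):
    """Decide the conversion factor once, before the loop over the data."""
    factor = {"cm": 0.01, "mm": 0.001}[unit]
    return [v * factor for v in values]


values = [1, 2, 3, 4]
dynamic_result = scale_dynamic(values, "cm")
hoisted_result = scale_hoisted(values, "cm")
```

Both functions return the same numbers; only the amount of work inside the loop differs.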

Dynamic features are often built into the language

| | dynamic allocation | reference count | garbage collector | runtime evaluation | virtual machine | type reflection | parallel scheduling |
|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| Fortran77 | | | | | | | |
| C | ✓ | | | | | | |
| C++ | ✓ | `shared_ptr<T>` | | | | vtable | stdlib |
| Rust | ✓ | `Rc<T>` | | | | vtable | ✓ |
| Swift | ✓ | ✓ | | | | vtable | ✓ |
| Julia | ✓ | | ✓ | ✓ | | ✓ | stdlib |
| Go | ✓ | | ✓ | | | vtable | ✓ |
| JVM | ✓ | | ✓ | | ✓ | ✓ | stdlib |
| Lua | ✓ | | ✓ | ✓ | ✓ | ✓ | |
| Python | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | stdlib |

<img src="img/benchmark-games-2023.svg" style="width: 80%; margin: 3em auto;">

## Understanding why Python is slow

How do dynamic features slow down a calculation?

<br>

In [None]:
import numpy as np

<br>

In [None]:
million_integers = np.random.normal(0, 10, 1_000_000).astype(np.int32)
million_integers

<br>

In [None]:
def addthem_python(data):
    out = 0
    for x in data:
        out += x
    return out

In [None]:
%%writefile addthem.c

int run(int* data) {
    int out = 0;
    for (int i = 0;  i < 1000000;  i++) {
        out += data[i];
    }
    return out;
}

<br>

In [None]:
!cc -O0 -shared addthem.c -o libaddthem_attempt1.so

<br>

In [None]:
import ctypes

<br>

In [None]:
libaddthem = ctypes.cdll.LoadLibrary("./libaddthem_attempt1.so")
pointer_to_ints = ctypes.POINTER(ctypes.c_int)
libaddthem.run.argtypes = (pointer_to_ints,)
libaddthem.run.restype = ctypes.c_int

In [None]:
addthem_python(million_integers)

<br>

In [None]:
libaddthem.run(million_integers.ctypes.data_as(pointer_to_ints))

<br>

In [None]:
%%timeit

addthem_python(million_integers)

<br>

In [None]:
%%timeit

libaddthem.run(million_integers.ctypes.data_as(pointer_to_ints))

To see what's slowing Python down, let's look at a toy language with similar dynamic features

<br>

In [None]:
!c++ -std=c++11 -O3 baby-python.cpp -o baby-python

<br>

In [None]:
%%bash

./baby-python <<EOF
123
add(3, 5)
square_them = def(x) mul(x, x)
map(square_them, [1, 2, 3, 4, 5])
EOF

In [None]:
million_integers.tofile("million_integers.int32")

<br>

In [None]:
%%bash

./baby-python data=million_integers.int32 <<EOF
reduce(add, data)
reduce(add, data)
reduce(add, data)
EOF

What baby-python is doing:

* variables are in an `unordered_map<string, shared_ptr<Object>>`
* `Object` is an abstract class; C++ has to maintain a vtable to keep track of which type each concrete instance is (runtime polymorphism)
* instructions are represented by a data structure that must be traversed (abstract syntax tree, or AST)
* parsing and data-loading happen before the loop over data and are not slowing it down
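
To make the "tree that must be traversed" point concrete, here is a rough Python sketch of a tree-walking evaluator. It is not the code in `baby-python.cpp`; the tuple-based tree format and the names `BUILTINS` and `evaluate` are assumptions made for this illustration.

```python
import operator

# Hypothetical built-in functions of the toy language.
BUILTINS = {"add": operator.add, "mul": operator.mul}


def evaluate(node, variables):
    """Walk an expression tree; every node requires a type check and a lookup."""
    if isinstance(node, (int, float)):   # literal number
        return node
    if isinstance(node, str):            # variable name: dictionary lookup
        return variables[node]
    name, *args = node                   # function call: (name, arg1, arg2, ...)
    func = BUILTINS[name]
    return func(*(evaluate(arg, variables) for arg in args))


three_plus_five = evaluate(("add", 3, 5), {})
x_squared = evaluate(("mul", "x", "x"), {"x": 4})
```

Every evaluation repeats the `isinstance` checks and dictionary lookups, even when the same tree is applied to a million values.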

<br>

In [None]:
from pygments import highlight
from pygments.lexers import CppLexer
from pygments.formatters import HtmlFormatter
from IPython.display import display, HTML

<br>

In [None]:
with open("baby-python.cpp") as file:
    display(HTML(highlight(file.read(), CppLexer(), HtmlFormatter())))

Similarly in Python:

* variables are in a `dict[str, object]`

<br>

In [None]:
globals()
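
A small sketch of what that implies (the variable name `made_via_dict` is invented for this example): writing a key into the dictionary creates a module-level variable, and reading the variable is a dictionary lookup by name.

```python
namespace = globals()

# Assigning through the dictionary is equivalent to `made_via_dict = 42`.
namespace["made_via_dict"] = 42

# Reading the name looks up the string "made_via_dict" in that same dictionary.
looked_up = made_via_dict
```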

Similarly in Python:

* all data share an overloaded struct type (implemented in C, rather than using C++ to generate vtables automatically)

```c
struct PyObject {
    Py_ssize_t ob_refcnt;    // reference count for garbage collection
    PyTypeObject* ob_type;   // the object's type (itself a Python object)
                            // more fields that depend on type
};
```

In [None]:
class PyObject(ctypes.Structure):
    pass

PyObject._fields_ = [
    ("ob_refcnt", ctypes.c_ssize_t),   # Py_ssize_t is signed
    ("ob_type", ctypes.POINTER(PyObject)),
]

<br>

In [None]:
some_string = "This is a nice string."

<br>

In [None]:
c_some_string = PyObject.from_address(id(some_string))

<br>

In [None]:
c_some_string.ob_refcnt

<br>

In [None]:
list_of_string = [some_string] * 100

<br>

In [None]:
ctypes.cast(c_some_string.ob_type, ctypes.c_void_p).value == id(str)
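
The reference count in that header can be watched as references are created. The sketch below assumes a default (GIL-enabled) CPython build, where the object header starts with the reference count. That layout is an implementation detail, and `PyObjectHead`, `thing`, and `holders` are names chosen for this example.

```python
import ctypes


class PyObjectHead(ctypes.Structure):
    # Assumed layout of the start of every object in default CPython builds.
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
    ]


thing = object()
head = PyObjectHead.from_address(id(thing))

count_before = head.ob_refcnt
holders = [thing] * 100          # 100 new references to the same object
count_after = head.ob_refcnt

added_references = count_after - count_before
```

Only the difference is meaningful here; the absolute values depend on the Python version and on how many temporary references exist at the time.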

Similarly in Python:

* instructions are represented by a data structure that must be traversed (an array of bytecodes)

<br>

In [None]:
import dis

<br>

In [None]:
dis.dis(addthem_python)
# lineno|offset|opcode_name      |argument|description
# ------+------+-----------------+--------+------------

<br>

None of the above instructions specify data types. The code that adds (`BINARY_OP 13`) has to determine that `x` is an integer, over and over, for a million values.

By contrast, the instructions generated by a compiler are passed directly to the CPU, so the compiler has to emit different instructions for different data types.
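
As a rough sketch of the decision hidden inside an untyped "add" instruction, the function below imitates it in Python. This is a simplification, not CPython's actual implementation (which also gives priority to subclasses, among other rules), and the name `dynamic_add` is made up.

```python
def dynamic_add(left, right):
    """Approximate what has to be decided for `left + right`, on every call."""
    # Step 1: inspect the type of the left operand and ask it to add.
    method = getattr(type(left), "__add__", None)
    if method is not None:
        result = method(left, right)
        if result is not NotImplemented:
            return result
    # Step 2: if that failed, inspect the right operand and try the reflection.
    reflected = getattr(type(right), "__radd__", None)
    if reflected is not None:
        result = reflected(right, left)
        if result is not NotImplemented:
            return result
    raise TypeError(
        f"unsupported operand types: {type(left).__name__} and {type(right).__name__}"
    )


int_sum = dynamic_add(1, 2)          # int.__add__ handles it
mixed_sum = dynamic_add(1, 2.5)      # int.__add__ declines, float.__radd__ handles it
joined = dynamic_add("a", "b")       # str.__add__ handles it
```

In the loop over `million_integers`, the answer to these questions is the same a million times in a row.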

<br>

In [None]:
!cc -O0 -c addthem.c

<br>

In [None]:
!objdump -d addthem.o

## Python escape hatches: NumPy

<br>

"Making Python go fast" usually means "using something other than Python in hot loops."

NumPy provides an array type and suite of functions to do that.
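
One way to see the difference is to compare memory footprints. This is a rough sketch: `sys.getsizeof` reports sizes for CPython specifically, and the per-object sizes vary by version, so only the array's size is exact.

```python
import sys

import numpy as np

as_list = list(range(1000))
as_array = np.arange(1000, dtype=np.int32)

# A list stores pointers; each integer is a separate object with its own header.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)

# An array stores the raw 4-byte values contiguously, with one shared dtype.
array_bytes = as_array.nbytes
```

The array needs exactly 4 bytes per value, and a compiled loop can step through it without inspecting any object headers.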

<img src="img/python-list-layout.svg" style="width: 80%; margin: 3em auto;">

<img src="img/python-array-layout.svg" style="width: 80%; margin: 3em auto;">

In [None]:
%%timeit

addthem_python(million_integers)

<br>

In [None]:
%%timeit

np.sum(million_integers)

<br>

But you know about NumPy (and Pandas, and everything else that uses NumPy).

## Python escape hatches: Awkward Array

<br>

Like NumPy, but for irregularly shaped data structures.

<img src="img/awkward-motivation-venn-diagram.svg" style="width: 30%; margin: 1em auto;">

In [None]:
import awkward as ak

In [None]:
ragged = ak.Array([
    [
      [[1.84, 0.324]],
      [[-1.609, -0.713, 0.005], [0.953, -0.993, 0.011, 0.718]],
      [[0.459, -1.517, 1.545], [0.33, 0.292]],
      [[-0.376, -1.46, -0.206], [0.65, 1.278]],
      [[], [], [1.617]],
      []
    ],
    [
      [[-0.106, 0.611]],
      [[0.118, -1.788, 0.794, 0.658], [-0.105]]
    ],
    [
      [[-0.384], [0.697, -0.856]],
      [[0.778, 0.023, -1.455, -2.289], [-0.67], [1.153, -1.669, 0.305, 1.517, -0.292]]
    ],
    [
      [[0.205, -0.355], [-0.265], [1.042]],
      [[-0.004], [-1.167, -0.054, 0.726, 0.213]],
      [[1.741, -0.199, 0.827]]
    ]
])

In [None]:
print(ragged[3, 1, -1, 2])

<br>

In [None]:
print(ragged[3, 1:, -1, 1:3])

<br>

In [None]:
print(ragged[[False, False, True, True], [0, -1, 0, -1], 0, -1])

<br>

In [None]:
print(ragged[ragged > 0])

<br>

In [None]:
print(ak.sum(ragged))

<img src="img/example-reducer-2d.svg" style="width: 40%; margin: 1em auto;">

In [None]:
regular = np.array([
    [  1,   2,   3,   4],
    [ 10,  20,  30,  40],
    [100, 200, 300, 400],
])

<br>

In [None]:
np.sum(regular, axis=0)

<br>

In [None]:
np.sum(regular, axis=1)

<img src="img/example-reduction-sum.svg" style="width: 40%; margin: 1em auto;">

In [None]:
irregular = ak.Array([
    [   1,    2,    4],
    [                ],
    [None,    8      ],
    [  16            ],
])

<br>

In [None]:
print(ak.sum(irregular, axis=0))

<br>

In [None]:
print(ak.sum(irregular, axis=1))

In [None]:
ragged.layout

<br>

All operations are implemented as compiled functions on `<NumpyArray>` and `<Index>` arrays, which don't need to know a node's `<content>` type.
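
A simplified sketch of that idea, using plain NumPy rather than Awkward Array's actual kernels: a list of variable-length lists is stored as one flat `content` array plus an `offsets` array marking where each list starts and stops. The helper name `sum_axis1` is invented for this example, and the data mirror `irregular` without its missing value.

```python
import numpy as np

# [[1, 2, 4], [], [8], [16]] stored as flat content plus list boundaries.
content = np.array([1, 2, 4, 8, 16])
offsets = np.array([0, 3, 3, 4, 5])


def sum_axis1(offsets, content):
    """Sum each variable-length list without a Python loop over the lists."""
    # running[i] is the sum of content[:i], so each list's sum is a difference.
    running = np.concatenate(([0], np.cumsum(content)))
    return running[offsets[1:]] - running[offsets[:-1]]


list_sums = sum_axis1(offsets, content)
```

The function only touches flat arrays, so it works the same way whether `content` holds numbers directly or is itself part of a deeper structure.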

In [None]:
million_records = ak.from_parquet("data/million_records.parquet")
million_records

In [None]:
%%timeit

np.square(million_records["y", ..., 1:])

<br>

In [None]:
million_dicts = ak.to_list(million_records)

<br>

In [None]:
%%timeit -n1 -r1

output = []
for sublist in million_dicts:
    tmp1 = []
    for record in sublist:
        tmp2 = []
        for number in record["y"][1:]:
            tmp2.append(np.square(number))
        tmp1.append(tmp2)
    output.append(tmp1)

<br>

Also, `million_records` uses 10× less memory than `million_dicts`.

## Python escape hatches: JIT-compilation

<br>

Despite being much faster than pure Python, array computations are still a few times slower than they could be.

In [None]:
def quadratic_formula(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> np.ndarray:
    return (-b + np.sqrt(b**2 - 4*a*c)) / (2*a)

<br>

The function above runs whole arrays through each operation before moving on to the next, like this:

<br>

In [None]:
def pedantic_quadratic_formula(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> np.ndarray:
    tmp1 = np.negative(b)            # -b
    tmp2 = np.square(b)              # b**2
    tmp3 = np.multiply(4, a)         # 4*a
    tmp4 = np.multiply(tmp3, c)      # tmp3*c
    del tmp3
    tmp5 = np.subtract(tmp2, tmp4)   # tmp2 - tmp4
    del tmp2, tmp4
    tmp6 = np.sqrt(tmp5)             # sqrt(tmp5)
    del tmp5
    tmp7 = np.add(tmp1, tmp6)        # tmp1 + tmp6
    del tmp1, tmp6
    tmp8 = np.multiply(2, a)         # 2*a
    return np.divide(tmp7, tmp8)     # tmp7 / tmp8

In [None]:
a = np.random.uniform(5, 10, 5_000_000)
b = np.random.uniform(10, 20, 5_000_000)
c = np.random.uniform(-0.1, 0.1, 5_000_000)

<br>

In [None]:
%%timeit

quadratic_formula(a, b, c)

<br>

In [None]:
%%timeit

pedantic_quadratic_formula(a, b, c)

But if we compile the formula, we can evaluate it in a single loop over the arrays.

<br>

In [None]:
%%writefile quadratic_formula_pybind11.cpp

#include <cmath>
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
namespace py = pybind11;

void run(py::array_t<double, py::array::c_style | py::array::forcecast> a_numpy,
         py::array_t<double, py::array::c_style | py::array::forcecast> b_numpy,
         py::array_t<double, py::array::c_style | py::array::forcecast> c_numpy,
         py::array_t<double> output_numpy) {
    const double* a = a_numpy.data();
    const double* b = b_numpy.data();
    const double* c = c_numpy.data();
    double* output = output_numpy.mutable_data();
    // one pass over the data: all of the math happens per element
    for (py::ssize_t i = 0;  i < output_numpy.size();  i++) {
        output[i] = (-b[i] + std::sqrt(b[i]*b[i] - 4*a[i]*c[i])) / (2*a[i]);
    }
}

PYBIND11_MODULE(quadratic_formula_pybind11, m) {
    m.def("run", &run);
}

In [None]:
import os
import sys
from pybind11 import get_include

inc = "-I " + get_include()
plat = "-undefined dynamic_lookup" if "darwin" in sys.platform else "-fPIC"
pyinc = !python3-config --cflags

<br>

In [None]:
!c++ -std=c++11 quadratic_formula_pybind11.cpp -shared {inc} {pyinc.s} -o quadratic_formula_pybind11.so {plat}

<br>

In [None]:
import quadratic_formula_pybind11

<br>

In [None]:
output = np.zeros(len(a), dtype=np.float64)
quadratic_formula_pybind11.run(a, b, c, output)
output

In [None]:
%%timeit

quadratic_formula(a, b, c)

<br>

In [None]:
%%timeit

output = np.zeros(len(a), dtype=np.float64)
quadratic_formula_pybind11.run(a, b, c, output)

<br>

Accessing memory is slower than most mathematical operations, so doing a lot of math in one pass is better than doing a little math in many passes.

Many, many libraries try to make it easy to integrate compilation into Python workflows.

<img src="img/history-of-bindings-2.svg" style="width: 80%; margin: 1em auto;">

Numba is a Just-In-Time (JIT) compiler for Python

<br>

In [None]:
import numba as nb

<br>

In [None]:
@nb.jit
def quadratic_formula_numba(a, b, c):
    output = np.empty(len(a), dtype=np.float64)
    for i, (a_i, b_i, c_i) in enumerate(zip(a, b, c)):
        output[i] = (-b_i + np.sqrt(b_i**2 - 4*a_i*c_i)) / (2*a_i)
    return output

quadratic_formula_numba(a, b, c)

<br>

In [None]:
%%timeit

quadratic_formula_numba(a, b, c)

Numba _replaces_ the Python code with LLVM Intermediate Representation, then asks the LLVM toolchain to compile it. The result is essentially the same as compiled C code, except that it interfaces with _some_ Python data types, such as NumPy arrays.

<br>

In [None]:
print(quadratic_formula_numba.inspect_asm(quadratic_formula_numba.signatures[0]))

JAX is a JIT-compiler for Python that uses array syntax, rather than `for` loops

<br>

In [None]:
import jax
jax.config.update("jax_platform_name", "cpu")
jax.config.update("jax_enable_x64", True)

<br>

In [None]:
@jax.jit
def quadratic_formula_jax(a, b, c):
    return (-b + jax.numpy.sqrt(b**2 - 4*a*c)) / (2*a)

quadratic_formula_jax(a, b, c)

<br>

In [None]:
%%timeit

quadratic_formula_jax(a, b, c).block_until_ready()

I showed both Numba and JAX because of their different programming styles.

<br>

<img src="img/slow-fast-imperative-vectorized.svg" style="width: 60%; margin: 1em auto;">

<img src="img/mandelbrot.png" style="width: 400px; float: right;">

For a deep comparison of Python accelerators, see [mandelbrot-on-all-accelerators.ipynb](https://drive.google.com/file/d/1J0l5e0NZm5kEm5BEUDG4neN5EN0VVCnt/view?usp=sharing) on Google Colab.

<br clear="all">

The bottom line is:

<img src="img/plot-mandelbrot-on-all-accelerators.svg"  style="width: 80%; margin: 1em auto;">

## Special topics: the Python garbage collector

<br>

Different languages take different approaches to dynamic memory management

* C doesn't do it at all

* Swift uses reference counts and no garbage collector

* Julia, Go, JVM languages, and Lua use a garbage collector and no reference counts

* Python (CPython) uses both

In [94]:
import sys

<br>

In [101]:
x = object()
sys.getrefcount(x)

2

<br>

In [102]:
y = x
sys.getrefcount(x)

3

<br>

In [103]:
z = [x, x, x, x, x]
sys.getrefcount(x)

8

<br>

In [104]:
del x, z
sys.getrefcount(y)

2

In [105]:
class HasDestructor:
    def __del__(self):
        print("Goodbye, world")

<br>

In [106]:
x = HasDestructor()
del x

Goodbye, world


<br>

In [107]:
y = HasDestructor()
y.self = y
del y

<br>

In [108]:
import gc

<br>

In [109]:
gc.collect()

Goodbye, world


1086858
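
The same effect can be checked without a destructor by holding a weak reference to an object in a cycle. This is a sketch for CPython; the class name `Node` is made up, and automatic collection is switched off temporarily so that nothing is collected early.

```python
import gc
import weakref


class Node:
    pass


was_enabled = gc.isenabled()
gc.disable()                     # make sure no automatic collection interferes

node = Node()
node.self = node                 # reference cycle: the count can never reach zero
watcher = weakref.ref(node)      # a weak reference does not keep the object alive

del node
alive_after_del = watcher() is not None      # reference counting could not free it

gc.collect()
alive_after_collect = watcher() is not None  # the garbage collector found the cycle

if was_enabled:
    gc.enable()
```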

<img src="img/mark-and-sweep-1.png" style="width: 80%; margin: 3em auto;">

<img src="img/mark-and-sweep-2.png" style="width: 80%; margin: 3em auto;">

<img src="img/mark-and-sweep-3.png" style="width: 80%; margin: 3em auto;">

<img src="img/mark-and-sweep-4.png" style="width: 80%; margin: 3em auto;">

<img src="img/mark-and-sweep-5.png" style="width: 80%; margin: 3em auto;">

<img src="img/mark-and-sweep-6.png" style="width: 80%; margin: 3em auto;">

<img src="img/mark-and-sweep-7.png" style="width: 80%; margin: 3em auto;">

Experiments on garbage collectors:

```python
shuffleA = [7, 6, 4, 10, 0, 15, 9, 8, 13, 5, 12, 14, 3, 11, 2, 1]
shuffleB = [3, 8, 0, 15, 11, 2, 6, 7, 12, 9, 1, 14, 5, 13, 4, 10]
shuffleC = [2, 13, 6, 7, 4, 5, 10, 3, 12, 15, 8, 9, 14, 1, 0, 11]
shuffleD = [7, 5, 9, 15, 4, 2, 13, 12, 0, 8, 11, 6, 3, 1, 10, 14]
shuffleE = [14, 11, 10, 8, 0, 6, 5, 1, 13, 9, 7, 4, 2, 12, 3, 15]
array = np.empty(16**5, dtype=object)
for iA in shuffleA:
  for iB in shuffleB:
    for iC in shuffleC:
      for iD in shuffleD:
        for iE in shuffleE:
          array[(((iA*16 + iB)*16 + iC)*16 + iD)*16 + iE] = []
  for iB in shuffleB:
    for iC in shuffleC:
      for iD in shuffleD:
        for iE in shuffleE:
          array[(((iA*16 + iB)*16 + iC)*16 + iD)*16 + iE] = []
```

Adding or removing the second set of `for` loops varies the lifespan of data.

<img src="img/gcmeasure-java21.svg" style="width: 100%; margin: 3em auto;">

<img src="img/gcmeasure-python.svg" style="width: 100%; margin: 3em auto;">

<img src="img/gcmeasure-python-zoom.svg" style="width: 100%; margin: 3em auto;">

<img src="img/gcmeasure-pypy.svg" style="width: 100%; margin: 3em auto;">

<img src="img/gcmeasure-pypy-zoom.svg" style="width: 100%; margin: 3em auto;">

<img src="img/gcmeasure-julia.svg" style="width: 100%; margin: 3em auto;">

<img src="img/gcmeasure-julia-zoom.svg" style="width: 100%; margin: 3em auto;">

Surprising fact about Python's garbage collector:

* it runs after a fixed number of allocations
* _not_ when available memory is low!

<br>

In [110]:
gc.get_threshold()

(700, 10, 10)

<br>

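The first generation is collected once allocations of container objects exceed deallocations by 700; the second generation is collected after 10 collections of the first, and the third after 10 collections of the second.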
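
A small sketch of allocation-triggered collection, assuming CPython with its default thresholds: `gc.callbacks` lets us count how many collections start while we do nothing but allocate lists. The names `runs`, `count_collection`, and `keep` are chosen for this example.

```python
import gc

runs = {"count": 0}


def count_collection(phase, info):
    # Called by the garbage collector at the start and end of every collection.
    if phase == "start":
        runs["count"] += 1


was_enabled = gc.isenabled()
gc.enable()
gc.callbacks.append(count_collection)
try:
    keep = []
    for _ in range(100_000):
        keep.append([])          # each new list is a tracked container allocation
finally:
    gc.callbacks.remove(count_collection)
    if not was_enabled:
        gc.disable()

automatic_collections = runs["count"]
```

Every list stays alive in `keep`, so none of these collections frees anything: they are triggered by the allocation count alone, not by memory pressure.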

Experiment: put Python in a limited-memory box

```bash
systemd-run --user --scope -p MemoryMax=1000M -p MemorySwapMax=0M python
```

<img src="img/gcbox-python.svg" style="width: 100%; margin: 3em auto;">

<img src="img/gcbox-python-10x.svg" style="width: 100%; margin: 3em auto;">

<img src="img/gcbox-julia.svg" style="width: 100%; margin: 3em auto;">

## Special topics: the Python GIL (Global Interpreter Lock)

In [136]:
!python3.13 parallel-task.py 1

1 worker finished in 5.551621675491333 seconds


In [None]:
!python3.13 parallel-task.py 2

In [None]:
!python3.13 parallel-task.py 4