In [1]:
%%html

<style>
.jp-MarkdownOutput {
    font-size: 2.5em !important;
}
.jp-MarkdownOutput table {
    font-size: 1em !important;
}
.jp-OutputArea-output pre {
    font-size: 2em !important;
}
.cm-content {
    font-size: 2em !important;
}
.page-id-xx, html {
    scrollbar-width: none; /* FF */
}
::-webkit-scrollbar {
    width: 0px; /* Chrome & Edge */
}
.jp-Notebook.jp-mod-commandMode .jp-Cell.jp-mod-selected {
    background: none;
}
</style>

<img src="img/title.svg" style="width: 90%; border: 1px solid black; margin: 3em auto;">

## Outline

* Who wants high performance and why?
* Understanding why Python is slow
* Python escape hatches:
  * NumPy
  * Awkward Array
  * JIT-compilation
* Special topics:
  * JIT-compilation for GPUs
  * The Python garbage collector
  * The Python GIL (Global Interpreter Lock)
  * Memory-mapping
* No conclusions: we'll stop when we run out of time

## Who wants high performance and why?

* "I'm shepherding a computation that runs for months and 5% faster means 5% less electricity, operating costs, and $CO_2$."

* "I'm building an interactive app, and if each user-initiated action completes in less than human reaction time (100 ms), the app will feel snappy."

* "I'm analyzing a large dataset, and if the computation completes over a lunch break instead of overnight, I'll be able to run it more often to do more in-depth investigations."

* "I'm trying to do &lt;anything at all&gt;, and I can't because I run out of RAM, open file handles, TTYs, ..."

This is me:

* "I'm analyzing a large dataset, and if the computation completes over a lunch break instead of overnight, I'll be able to run it more often to do more in-depth investigations."
* "I'm trying to do &lt;anything at all&gt;, and I can't because I run out of RAM, open file handles, TTYs, ..."

<br><br><br>

Therefore, I usually care about **speed** only when the gain is an order of magnitude, and about **memory** only when necessary.

<img src="img/clock-rate-timeline-1.svg" style="width: 90%; margin: 3em auto;">

<img src="img/clock-rate-timeline-2.svg" style="width: 90%; margin: 3em auto;">

<img src="img/clock-rate-timeline-3.svg" style="width: 90%; margin: 3em auto;">

<img src="img/clock-rate-timeline-4.svg" style="width: 90%; margin: 3em auto;">

There are now only three ways to speed up code:

1. parallel processing

2. fixing bloopers: what you thought it was doing is not what it's doing, or there's a faster method you just didn't know about

3. turning off dynamic features that you don't need

<img src="img/dynamic-features.svg" style="width: 60%; margin: 3em auto;">

"Dynamic feature":

* decisions are made in the loop that scales with dataset size

<br>

"that you don't need":

* information is known before starting that loop; it doesn't need to be re-derived
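
As a minimal sketch of these two bullets (the function names `scale_dynamic` and `scale_hoisted` are hypothetical, made up for illustration): the first version re-makes the same decision for every element, while the second makes it once, before the loop that scales with dataset size. No timing is claimed here; it only shows where the decision sits.

```python
def scale_dynamic(data, mode):
    out = []
    for x in data:
        # this decision is re-made for every element, although `mode` never changes
        if mode == "double":
            out.append(x * 2)
        elif mode == "square":
            out.append(x * x)
        else:
            raise ValueError(mode)
    return out


def scale_hoisted(data, mode):
    # the decision is made once; each loop body contains no branching on `mode`
    if mode == "double":
        return [x * 2 for x in data]
    elif mode == "square":
        return [x * x for x in data]
    raise ValueError(mode)


print(scale_dynamic([1, 2, 3], "square"))
print(scale_hoisted([1, 2, 3], "square"))
```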

Dynamic features are often built into the language:

| | dynamic allocation | reference count | garbage collector | runtime evaluation | virtual machine | type reflection | parallel scheduling |
|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| Fortran77 | | | | | | | |
| C | ✓ | | | | | | |
| C++ | ✓ | `shared_ptr<T>` | | | | vtable | stdlib |
| Rust | ✓ | `Rc<T>` | | | | vtable | ✓ |
| Swift | ✓ | ✓ | | | | vtable | ✓ |
| Julia | ✓ | | ✓ | ✓ | | ✓ | stdlib |
| Go | ✓ | | ✓ | | | vtable | ✓ |
| JVM | ✓ | | ✓ | | ✓ | ✓ | stdlib |
| Lua | ✓ | | ✓ | ✓ | ✓ | ✓ | |
| Python | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | stdlib |

<img src="img/benchmark-games-2023.svg" style="width: 80%; margin: 3em auto;">

## Understanding why Python is slow

How do dynamic features slow down a calculation?

<br>

In [2]:
import numpy as np

<br>

In [3]:
million_integers = np.random.normal(0, 10, 1_000_000).astype(np.int32)
million_integers

array([ 13,   0,  10, ..., -18,   0,  22], shape=(1000000,), dtype=int32)

<br>

In [4]:
def addthem_python(data):
    out = 0
    for x in data:
        out += x
    return out

In [5]:
%%writefile addthem.c

int run(int* data) {
    int out = 0;
    for (int i = 0;  i < 1000000;  i++) {
        out += data[i];
    }
    return out;
}

Writing addthem.c


<br>

In [6]:
!cc -O0 -shared addthem.c -o libaddthem_attempt1.so

<br>

In [7]:
import ctypes

<br>

In [8]:
libaddthem = ctypes.cdll.LoadLibrary("./libaddthem_attempt1.so")
pointer_to_ints = ctypes.POINTER(ctypes.c_int)
libaddthem.run.argtypes = (pointer_to_ints,)
libaddthem.run.restype = ctypes.c_int

In [9]:
addthem_python(million_integers)

np.int32(4328)

<br>

In [10]:
libaddthem.run(million_integers.ctypes.data_as(pointer_to_ints))

4328

<br>

In [11]:
%%timeit

addthem_python(million_integers)

52.2 ms ± 810 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


<br>

In [12]:
%%timeit

libaddthem.run(million_integers.ctypes.data_as(pointer_to_ints))

853 μs ± 7.89 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
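
For scale, the two `%%timeit` results above can be turned into a ratio and a per-element cost. This is only arithmetic on the numbers printed in this notebook; they will differ on another machine, and the C library here was compiled with `-O0`.

```python
python_seconds = 52.2e-3    # addthem_python, from the %%timeit output above
c_seconds = 853e-6          # libaddthem.run, from the %%timeit output above
n_items = 1_000_000

speedup = python_seconds / c_seconds
python_ns_per_item = python_seconds / n_items * 1e9
c_ns_per_item = c_seconds / n_items * 1e9

print(f"speedup: about {speedup:.0f}x")
print(f"Python: {python_ns_per_item:.1f} ns per item")
print(f"C:      {c_ns_per_item:.3f} ns per item")
```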


To see what's slowing Python down, let's look at a toy language with similar dynamic features

<br>

In [13]:
!c++ -std=c++11 -O3 baby-python.cpp -o baby-python

<br>

In [14]:
%%bash

./baby-python <<EOF
123
add(3, 5)
square_them = def(x) mul(x, x)
map(square_them, [1, 2, 3, 4, 5])
EOF

                     num = -123        add(x, x)   get(lst, i)   map(f, lst)
               oo    lst = [1, 2, 3]   mul(x, x)   len(lst)      reduce(f, lst)
. . . __/\_/\_/`'    f = def(x) single-expr   f = def(x, y) { ... ; last-expr }

>> 123
(1.816e-06 seconds)
>> 8
(3.352e-06 seconds)
>> <user-defined function>
(1.467e-06 seconds)
>> [1, 4, 9, 16, 25]
(7.543e-06 seconds)
>> 


In [15]:
million_integers.tofile("million_integers.int32")

<br>

In [16]:
%%bash

./baby-python data=million_integers.int32 <<EOF
reduce(add, data)
reduce(add, data)
reduce(add, data)
EOF

                     num = -123        add(x, x)   get(lst, i)   map(f, lst)
               oo    lst = [1, 2, 3]   mul(x, x)   len(lst)      reduce(f, lst)
. . . __/\_/\_/`'    f = def(x) single-expr   f = def(x, y) { ... ; last-expr }

>> 4328
(0.0898215 seconds)
>> 4328
(0.0905186 seconds)
>> 4328
(0.0892013 seconds)
>> 


What baby-python is doing:

* variables are in an `unordered_map<string, shared_ptr<Object>>`
* `Object` is an abstract class; C++ has to maintain a vtable to keep track of which type each concrete instance is (runtime polymorphism)
* instructions are represented by a data structure that must be traversed (abstract syntax tree, or AST)
* parsing and data-loading happen before the loop over data and are not slowing it down
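
The same ingredients can be sketched in a few lines of Python. This is not the code in `baby-python.cpp`; the tuple-based node format and the `evaluate` function are assumptions made up for illustration. It shows variables living in a dictionary and an expression being evaluated by walking a tree, with the kind of each node checked at runtime.

```python
import operator

# variables live in a dictionary and are looked up by name at runtime
variables = {"add": operator.add, "mul": operator.mul}


def evaluate(node, variables):
    # hypothetical node format: ("num", value), ("var", name),
    # or ("call", function_node, [argument_nodes])
    kind = node[0]
    if kind == "num":
        return node[1]
    elif kind == "var":
        return variables[node[1]]
    elif kind == "call":
        function = evaluate(node[1], variables)
        arguments = [evaluate(arg, variables) for arg in node[2]]
        return function(*arguments)
    else:
        raise ValueError(f"unknown node kind: {kind!r}")


# the tree for add(3, 5), the expression used in the session above
tree = ("call", ("var", "add"), [("num", 3), ("num", 5)])
result = evaluate(tree, variables)
print(result)
```

Every call to `evaluate` inspects `node[0]` and does a dictionary lookup, even when the same tree is evaluated a million times.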

<br>

In [17]:
from pygments import highlight
from pygments.lexers import CppLexer
from pygments.formatters import HtmlFormatter
from IPython.display import display, HTML

<br>

In [18]:
with open("baby-python.cpp") as file:
    display(HTML(highlight(file.read(), CppLexer(), HtmlFormatter())))

Similarly in Python:

* variables are in a `dict[str, object]`
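
A small sketch of what that means, using an explicit dictionary (the name `namespace` is arbitrary) in place of the notebook's own globals: code run with `exec` stores its variables as dictionary entries, and entries added to the dictionary by hand become variables.

```python
namespace = {}

# assignments inside the executed code become dictionary entries
exec("x = 3\ny = x + 2", namespace)
user_names = sorted(name for name in namespace if not name.startswith("__"))
print(user_names, namespace["y"])

# and an entry added to the dictionary is visible as a variable
namespace["z"] = 10
exec("w = z * 2", namespace)
print(namespace["w"])
```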

<br>

In [19]:
globals()

{'__name__': '__main__',
 '__doc__': 'Automatically created module for IPython interactive environment',
 '__package__': None,
 '__loader__': None,
 '__spec__': None,
 '__builtin__': <module 'builtins' (built-in)>,
 '__builtins__': <module 'builtins' (built-in)>,
 '_ih': ['',
  "get_ipython().run_cell_magic('html', '', '\\n<style>\\n.jp-MarkdownOutput {\\n    font-size: 2.5em !important;\\n}\\n.jp-MarkdownOutput table {\\n    font-size: 1em !important;\\n}\\n.jp-OutputArea-output pre {\\n    font-size: 2em !important;\\n}\\n.cm-content {\\n    font-size: 2em !important;\\n}\\n.page-id-xx, html {\\n    scrollbar-width: none; /* FF */\\n}\\n::-webkit-scrollbar {\\n    width: 0px; /* Chrome & Edge */\\n}\\n.jp-Notebook.jp-mod-commandMode .jp-Cell.jp-mod-selected {\\n    background: none;\\n}\\n</style>\\n')",
  'import numpy as np',
  'million_integers = np.random.normal(0, 10, 1_000_000).astype(np.int32)\nmillion_integers',
  'def addthem_python(data):\n    out = 0\n    for x in data:\n 

Similarly in Python:

* all data share a common struct header that each concrete type extends (implemented by hand in C, rather than using C++ to generate vtables automatically)

```c
struct PyObject {
    Py_ssize_t ob_refcnt;   // reference count for garbage collection
    PyObject* ob_type;      // the object's type (also a PyObject)
                            // more fields that depend on type
};
```

In [20]:
class PyObject(ctypes.Structure):
    pass

PyObject._fields_ = [
    ("ob_refcnt", ctypes.c_ssize_t),   # signed, matching Py_ssize_t in C
    ("ob_type", ctypes.POINTER(PyObject)),
]

<br>

In [21]:
some_string = "This is a nice string."

<br>

In [22]:
c_some_string = PyObject.from_address(id(some_string))

<br>

In [23]:
c_some_string.ob_refcnt

1

<br>

In [24]:
list_of_string = [some_string] * 100

<br>

In [25]:
ctypes.cast(c_some_string.ob_type, ctypes.c_void_p).value == id(str)

True
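
The reference count can also be watched without `ctypes`, through `sys.getrefcount`. A minimal sketch, assuming CPython: it uses a fresh `object()` because some built-in objects are immortal in recent CPython versions and do not report a meaningful count.

```python
import sys

thing = object()

before = sys.getrefcount(thing)
references = [thing] * 100        # 100 new references to the same object
after = sys.getrefcount(thing)
print(after - before)

del references                    # dropping the list releases those references
restored = sys.getrefcount(thing)
print(restored - before)
```

The absolute numbers include temporary references (such as the argument passed to `sys.getrefcount` itself), so only the differences are meaningful.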

Similarly in Python:

* instructions are represented by a data structure that must be traversed (an array of bytecodes)

<br>

In [26]:
import dis

<br>

In [27]:
dis.dis(addthem_python)
# lineno|offset|opcode_name      |argument|description
# ------+------+-----------------+--------+------------

  1           0 RESUME                   0

  2           2 LOAD_CONST               1 (0)
              4 STORE_FAST               1 (out)

  3           6 LOAD_FAST                0 (data)
              8 GET_ITER
        >>   10 FOR_ITER                 7 (to 28)
             14 STORE_FAST               2 (x)

  4          16 LOAD_FAST                1 (out)
             18 LOAD_FAST                2 (x)
             20 BINARY_OP               13 (+=)
             24 STORE_FAST               1 (out)
             26 JUMP_BACKWARD            9 (to 10)

  3     >>   28 END_FOR

  5          30 LOAD_FAST                1 (out)
             32 RETURN_VALUE


<br>

None of the instructions above specify data types. The code that adds (`BINARY_OP 13`) has to determine at runtime that `out` and `x` are integers, over and over, for a million values.

By contrast, the instructions generated by a compiler are executed directly by the CPU, so the compiler has to pick different instructions for different data types ahead of time.
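
To illustrate the bytecode half of this comparison, here is a small sketch (the function `add` is made up for the example): one function, one sequence of bytecodes, and four different meanings of `+` that can only be resolved by looking at the operands at runtime. The exact opcode names depend on the Python version.

```python
import dis


def add(a, b):
    return a + b


# the bytecode names an addition, but not what kind of addition
opnames = [instruction.opname for instruction in dis.get_instructions(add)]
print(opnames)

# the same bytecode handles integers, floats, strings, and lists
results = [add(1, 2), add(1.5, 2.5), add("ab", "cd"), add([1], [2])]
print(results)
```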

<br>

In [28]:
!cc -O0 -c addthem.c

<br>

In [29]:
!objdump -d addthem.o


addthem.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <run>:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	48 89 7d e8          	mov    %rdi,-0x18(%rbp)
   8:	c7 45 fc 00 00 00 00 	movl   $0x0,-0x4(%rbp)
   f:	c7 45 f8 00 00 00 00 	movl   $0x0,-0x8(%rbp)
  16:	eb 1d                	jmp    35 <run+0x35>
  18:	8b 45 f8             	mov    -0x8(%rbp),%eax
  1b:	48 98                	cltq
  1d:	48 8d 14 85 00 00 00 	lea    0x0(,%rax,4),%rdx
  24:	00 
  25:	48 8b 45 e8          	mov    -0x18(%rbp),%rax
  29:	48 01 d0             	add    %rdx,%rax
  2c:	8b 00                	mov    (%rax),%eax
  2e:	01 45 fc             	add    %eax,-0x4(%rbp)
  31:	83 45 f8 01          	addl   $0x1,-0x8(%rbp)
  35:	81 7d f8 3f 42 0f 00 	cmpl   $0xf423f,-0x8(%rbp)
  3c:	7e da                	jle    18 <run+0x18>
  3e:	8b 45 fc             	mov    -0x4(%rbp),%eax
  41:	5d                   	pop    %rbp
  42:	c3                  

## Python escape hatches: NumPy

## Python escape hatches: Awkward Array

## Python escape hatches: JIT-compilation

## Special topics: JIT-compilation for GPUs

## Special topics: the Python garbage collector

## Special topics: the Python GIL (Global Interpreter Lock)

## Special topics: memory mapping