# Python Programming for Scientists - Day 5

On the fifth (and last!) day we will consider performance, i.e. "how to make my code faster"!

* `multiprocessing`
* `multithreading`
* `numba` - "just in time" compilation of Python code
* `pybind11` - using external C/C++/Fortran code

# [0] Parallel Programming

One way to make a given task execute faster is to use **parallel programming**, i.e. having more than one worker working on the task at once, reducing the overall time to completion.

There are two important concepts, **processes** and **threads**:
* If you start a program on a computer, its running instance is called a process.
* A given process can start one, or many, threads.
* Both processes and threads can be thought of as workers which execute a series of commands (e.g. a block of python code).

The main difference is:
* Threads (of the same process) run in a shared memory space, i.e. they can easily share variables and memory.
* Processes run in separate memory spaces: they cannot share variables or memory, unless they explicitly "communicate" (i.e. send and receive) information between themselves.
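
To make the difference concrete, here is a minimal sketch using the standard-library `threading` and `multiprocessing` modules. The names `shared`, `append_item`, `run_in_thread` and `run_in_process` are made up for this illustration, and it assumes the code is run as a script (or in a notebook on a system where child processes are started with `fork`).

```python
import multiprocessing as mp
import threading

shared = []  # a module-level list, visible to every thread of this process

def append_item(item):
    """Append an item to the module-level list of whoever runs this."""
    shared.append(item)

def run_in_thread(item):
    """Run append_item in a new thread and wait for it to finish."""
    t = threading.Thread(target=append_item, args=(item,))
    t.start()
    t.join()

def run_in_process(item):
    """Run append_item in a new process and wait for it to finish."""
    p = mp.Process(target=append_item, args=(item,))
    p.start()
    p.join()

if __name__ == "__main__":
    run_in_thread("from thread")
    print(shared)  # ['from thread'] : the thread changed our list

    run_in_process("from process")
    print(shared)  # still ['from thread'] : the child changed its own copy
```

The thread modifies the very same list, while the child process only modifies its own private copy, which is discarded when the process exits.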

You can either use multiple processes, or multiple threads (or both!) to make a program parallel.

By definition all threads (of a process) run together on a single computer. On the other hand, multiple processes can run on the same computer (called "shared-memory parallel"), or they can actually be spread across different computers (called "distributed-memory parallel"), in which case communication must occur across a network.

We will focus just on the first case (one computer). If you want to write a program that uses more than one computer, the package to learn is [mpi4py](https://mpi4py.readthedocs.io/en/stable/).

# [1] Multiprocessing

A straightforward way to speed up a computation in Python is to have multiple Python processes working together. In this case we use the built-in `multiprocessing` library. This approach is good if:
* there is lots of computation to do, and not much data

In [None]:
import multiprocessing as mp

Imagine we have a function which is very expensive to run:

In [None]:
def f(x):
    return x*x

If we want to run it for a list of input values, and obtain the outputs, we can just do a loop:

In [None]:
for i in [1,2,3]:
    print(f(i))

The three calls to `f()` run **in serial** (one after another). We can instead run them **in parallel**:

In [None]:
with mp.Pool() as p:
    args = [1,2,3]
    print(p.map(f, args))

Note: We use the `with` syntax, just like opening a file, to let Python automatically clean up all the internal aspects of `Pool` when we are done with it.

The `Pool.map` function is a helper function which:
* (i) starts a number of independent "child" Python processes
* (ii) distributes ("maps") the set of arguments between these processes
* (iii) runs the function `f` with the argument(s) which each process is responsible for
* (iv) collects the results by sending them back to the "parent" process
* (v) shuts down all the child processes
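
One way to see these steps in action is to have each call report the ID of the process that ran it. This is only a sketch: `square_with_pid` is a hypothetical helper, `os.getpid()` is the standard-library call that returns the current process ID, and the guard `if __name__ == "__main__":` is there so the example also works when child processes are started by re-importing the script.

```python
import multiprocessing as mp
import os

def square_with_pid(x):
    """Return x squared together with the ID of the process that computed it."""
    return x * x, os.getpid()

if __name__ == "__main__":
    print("parent process:", os.getpid())

    with mp.Pool(2) as p:                           # (i) start 2 child processes
        results = p.map(square_with_pid, range(6))  # (ii)-(iv) distribute, run, collect
    # (v) the child processes are shut down when the `with` block ends

    for value, pid in results:
        print(value, "computed by process", pid)
```

The values come back in the same order as the inputs, and the reported process IDs differ from the parent's. Which child handles which argument is not fixed and may change from run to run.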

Note: Initializing `mp.Pool()` will use all the available CPU cores on the current machine (1 process per core). Instead, `mp.Pool(4)` will start only 4 processes, so these will occupy only 4 cores.

> If you are sharing a system (non-exclusive), you will want to avoid using all the CPU cores.
>
> If you are on a system with only 1 CPU core, there is no point in using multiprocessing.
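
If you would rather leave some cores free, one possible approach is to derive the pool size from `os.cpu_count()`. The helper `pick_n_workers` and its `reserve` parameter are hypothetical names for this sketch. Note that `os.cpu_count()` can return `None` when the count cannot be determined, which the code treats as a single core.

```python
import multiprocessing as mp
import os

def pick_n_workers(reserve=1):
    """Number of worker processes, leaving `reserve` cores free (at least 1 worker)."""
    n_cores = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, n_cores - reserve)

def f(x):
    return x * x

if __name__ == "__main__":
    n_workers = pick_n_workers()
    print(f"using {n_workers} worker(s) on a machine with {os.cpu_count()} core(s)")
    with mp.Pool(n_workers) as p:
        print(p.map(f, [1, 2, 3]))  # [1, 4, 9]
```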

Note: the `Pool.starmap` function is similar, except that each element of `args` is itself an iterable (e.g. a tuple), which is unpacked and passed to `f` as separate positional arguments.

> For example, an iterable of `args = [(1,2), (3,4)]` results in `[f(1,2), f(3,4)]`.
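
As a small sketch of `starmap`, assume a hypothetical two-argument function `add`. The list comprehension shows the serial equivalent of what the pool computes.

```python
import multiprocessing as mp

def add(a, b):
    return a + b

pairs = [(1, 2), (3, 4)]

# serial equivalent: unpack each tuple into positional arguments
serial = [add(*pair) for pair in pairs]

if __name__ == "__main__":
    with mp.Pool(2) as p:
        parallel = p.starmap(add, pairs)
    print(serial, parallel)  # [3, 7] [3, 7]
```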

`Pool.map` is a convenient helper which abstracts away much of the complexity. For instance, what if we want to run `f()` on 10 different arguments, but only have 4 cores (and so 4 processes)? This is handled automatically:

In [None]:
import numpy as np  # needed for np.arange below

with mp.Pool(4) as p:
    args = np.arange(10)
    print(p.map(f, args))

One important note to keep in mind: data passed between the processes (the arguments and the return values) is pickled, i.e. serialized using the same `pickle` library we saw earlier to save/load an arbitrary Python object. This is **fine for small data sizes**, but extremely slow or problematic for large data (e.g. more than a GB).
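
To get a feeling for this cost, one can measure how large an object becomes once pickled and how long a round trip takes. The sketch below only uses the standard-library `pickle` and `time` modules; `pickle_cost` is a hypothetical helper, and the actual numbers depend on the machine and on the data.

```python
import pickle
import time

def pickle_cost(obj):
    """Return (size in MB, seconds) for pickling and unpickling obj once."""
    start = time.perf_counter()
    payload = pickle.dumps(obj)  # roughly what the sending process has to do
    pickle.loads(payload)        # roughly what the receiving process has to do
    elapsed = time.perf_counter() - start
    return len(payload) / 1e6, elapsed

small = list(range(10))
large = [float(i) for i in range(1_000_000)]

for name, data in [("small", small), ("large", large)]:
    size_mb, seconds = pickle_cost(data)
    print(f"{name}: {size_mb:.3f} MB, {seconds:.4f} s")
```

If the round trip for your real arguments takes a noticeable fraction of the time `f` itself needs, the parallel version is unlikely to pay off.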

Is it actually faster? Let's make a benchmark.

In [None]:
def f(x):
    j = 0
    for i in range(10000000):
        j += x*x
    return j

In [None]:
import time

# perf_counter() is a clock intended for measuring short time intervals
start_time = time.perf_counter()

args = [1,2,3]
for i in args:
    print(f(i))

print(f'Took {time.perf_counter() - start_time:.2f} seconds.')

In [None]:
start_time = time.perf_counter()

with mp.Pool() as p:
    print(p.map(f, args))

print(f'Took {time.perf_counter() - start_time:.2f} seconds.')

# [2] Multithreading