![IE](../img/ie.png)

# Sessions 9 & 10: Multi-process and asynchronous programming

### Juan Luis Cano Rodríguez <jcano@faculty.ie.edu> - Master in Business Analytics and Big Data (2019-04-24)

## First rule of performance analysis

> "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
>
> — Donald E. Knuth

> **premature** (adj.) pre·ma·ture : happening, arriving, existing, or performed before the proper, usual, or intended time
>
> Merriam-Webster dictionary

There are several techniques to make slow programs go faster. However, no matter what we do, the first step is to **analyze** _where the slowdown comes from_.

Among the most straightforward tools to analyze performance in Python are `cProfile` and `line_profiler`, both of which are easy to use from the notebook. `cProfile` is part of the standard library, whereas `line_profiler` must be installed with pip:

In [None]:
#!pip install line_profiler  # Python <= 3.6
!pip install cython  # Required to install line_profiler from source
!pip install https://github.com/rkern/line_profiler/archive/master.zip  # Required in Python 3.7

Let's use a simple example: a function to estimate the value of $\pi$ using a Monte Carlo simulation:

![Monte Carlo pi](../img/monte_carlo_pi.png)

(Source: https://towardsdatascience.com/speed-up-jupyter-notebooks-20716cbe2025)

In [None]:
import numpy as np


def estimate_pi(num_sim=1e7):
    """Estimate pi with monte carlo simulation.
    
    Parameters
    ----------
    num_sim: int
        Number of simulations.

    """
    in_circle = 0
    ii = 0
    while ii < num_sim:
        prec_x = np.random.rand()
        prec_y = np.random.rand()

        # Let's use pow here instead of **
        # to see it in the cProfile report
        if pow(prec_x, 2) + pow(prec_y, 2) <= 1:
            in_circle += 1  # inside the circle

        ii += 1
        
    return 4 * in_circle / num_sim

Running it through `cProfile` tells us how many times each function or method is called and how much time is spent in it. To do that, we write `%prun` in front of the code to execute, adding `-s cumtime` to sort by cumulative time, for example `%prun -s cumtime estimate_pi(1e6)`. Notice that this only works in IPython and the Jupyter notebook.

This opens a separate panel with information similar to this:

```
         4000004 function calls in 3.821 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    3.821    3.821 {built-in method builtins.exec}
        1    0.000    0.000    3.821    3.821 <string>:1(<module>)
        1    2.048    2.048    3.821    3.821 <ipython-input-87-d37e0e33fe4b>:4(estimate_pi)
  2000000    1.296    0.000    1.296    0.000 {method 'rand' of 'mtrand.RandomState' objects}
  2000000    0.478    0.000    0.478    0.000 {built-in method builtins.pow}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
```
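Outside IPython, where `%prun` is not available, a similar report can be produced with the standard library modules `cProfile` and `pstats`. The following is a minimal sketch, assuming a smaller, hypothetical stand-in function `estimate_pi_small` that uses `random.random` instead of NumPy so that the block is self-contained:

```python
import cProfile
import io
import pstats
import random


def estimate_pi_small(num_sim=100_000):
    """Hypothetical stand-in for estimate_pi, using only the standard library."""
    in_circle = 0
    for _ in range(num_sim):
        x = random.random()
        y = random.random()
        if pow(x, 2) + pow(y, 2) <= 1:
            in_circle += 1  # inside the circle
    return 4 * in_circle / num_sim


# Profile a single call to the function
profiler = cProfile.Profile()
profiler.enable()
pi_estimate = estimate_pi_small()
profiler.disable()

# Sort by cumulative time, like `%prun -s cumtime`, and show the top 5 entries
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The exact timings depend on the machine, but the layout of the report matches the one shown above.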

Sometimes the information shown by `cProfile` gives a useful hint because it summarizes the whole execution. Other times it's better to profile the program line by line and visually identify the hotspots. For that, we load the `line_profiler` IPython extension with `%load_ext line_profiler` and run the code with `%lprun`, adding `-f <my_function>` for each function we want to display in the report, for example `%lprun -f estimate_pi estimate_pi(1e6)`.

This again opens a separate panel with information similar to this:

```
Timer unit: 1e-06 s

Total time: 8.75213 s
File: <ipython-input-2-d37e0e33fe4b>
Function: estimate_pi at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           def estimate_pi(num_sim=1e7):
     5                                               """Estimate pi with monte carlo simulation.
     6                                               
     7                                               Parameters
     8                                               ----------
     9                                               num_sim: int
    10                                                   Number of simulations.
    11                                           
    12                                               """
    13         1          9.0      9.0      0.0      in_circle = 0
    14         1          6.0      6.0      0.0      ii = 0
    15   1000001    1074014.0      1.1     12.3      while ii < num_sim:
    16   1000000    1916723.0      1.9     21.9          prec_x = np.random.rand()
    17   1000000    1812770.0      1.8     20.7          prec_y = np.random.rand()
    18                                           
    19                                                   # Let's use pow here instead of **
    20                                                   # to see it in the cProfile report
    21   1000000    1947163.0      1.9     22.2          if pow(prec_x, 2) + pow(prec_y, 2) <= 1:
    22    785268     897605.0      1.1     10.3              in_circle += 1  # inside the circle
    23                                           
    24   1000000    1103842.0      1.1     12.6          ii += 1
    25                                                   
    26         1          3.0      3.0      0.0      return 4 * in_circle / num_sim
```

## Types of parallelism

In Python there are two parallelism models:

* **Multithreading**: There is one single process that can have several execution _threads_ at the same time. The (theoretical) advantage is that all these threads share the same memory and are lightweight.
* **Multiprocessing**: Several processes are launched at the same time. The disadvantage is that they have more overhead and it's more difficult to share data between them.

Since threads are lightweight and share memory, multithreading might look like the obvious choice, but in Python that is often not the case.

We can broadly classify computer programs in two groups:

* **CPU-bound**: The bottleneck is the CPU. The faster the CPU, the faster the program will complete. Basically any calculation the computer does with data that resides in RAM.
* **I/O-bound**: The bottleneck is the input/output, either from disk, from the network, or any other sources. The faster the I/O, the faster the program will complete. This includes downloading data from the Internet, reading files from disk, and so forth.

The canonical Python implementation (CPython, the one almost everybody uses) has a limitation called the Global Interpreter Lock (GIL), which means that **only one thread can execute Python code at a time**. Therefore, as a rule of thumb, **multithreading should not be used in Python for CPU-bound code**.
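The effect of the GIL can be observed with a small experiment. This is only a sketch, assuming a hypothetical pure-Python workload `count_down` and arbitrary values for `N` and `NUM_TASKS`: it runs the same CPU-bound function several times, first sequentially and then in a thread pool, and prints both timings.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_down(n):
    """Pure-Python CPU-bound loop (hypothetical workload)."""
    while n > 0:
        n -= 1
    return n


N = 2_000_000  # iterations per task (arbitrary)
NUM_TASKS = 4  # number of tasks (arbitrary)

# Run the tasks one after another
start = time.perf_counter()
sequential = [count_down(N) for _ in range(NUM_TASKS)]
sequential_time = time.perf_counter() - start

# Run the same tasks in a pool of threads
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=NUM_TASKS) as executor:
    threaded = list(executor.map(count_down, [N] * NUM_TASKS))
threaded_time = time.perf_counter() - start

print(f"Sequential: {sequential_time:.2f} s")
print(f"Threaded:   {threaded_time:.2f} s")
```

On a standard CPython build the two timings are usually similar, because the threads take turns holding the GIL instead of running on several CPUs at once.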

For the rest of this course, even though multithreading can be used in Python for I/O-bound code, we will skip it entirely, since it's also more complex to synchronize and can lead to subtle errors.

## Multiprocessing

The number of processes we run in parallel shouldn't be higher than the number of CPUs we have, which can be obtained with the `multiprocessing.cpu_count` function:
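For instance, a minimal call looks like this (the variable name `num_cpus` is just illustrative; the value counts logical CPUs, so it can be higher than the number of physical cores):

```python
import multiprocessing

# Number of CPUs reported by the operating system
num_cpus = multiprocessing.cpu_count()
print(f"This machine reports {num_cpus} CPUs")
```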

One way to leverage `multiprocessing` is to create a `Pool` with a pre-defined number of processes that will run in parallel on our computer. The pool has convenience methods that allow us to call a function with a sequence of arguments in parallel:
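A minimal sketch of this pattern, written as a standalone script, could look as follows. The function `slow_square`, the helper `run_in_pool` and the input values are hypothetical: each call sleeps for one second to simulate a slow task, and the script compares a sequential run with a pool of four processes.

```python
import multiprocessing
import time


def slow_square(x):
    """Hypothetical slow task: sleep for one second, then square the input."""
    time.sleep(1)
    return x * x


def run_in_pool(values, processes=4):
    """Apply slow_square to every value using a pool of worker processes."""
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(slow_square, values)


if __name__ == "__main__":
    values = [1, 2, 3, 4]

    # One call after another: roughly one second per value
    start = time.perf_counter()
    sequential = [slow_square(v) for v in values]
    print(f"Sequential: {time.perf_counter() - start:.2f} s")

    # All calls at once, one per worker process
    start = time.perf_counter()
    parallel = run_in_pool(values)
    print(f"Pool of 4:  {time.perf_counter() - start:.2f} s")
```

`Pool.map` returns the results in the same order as the inputs, so `parallel` and `sequential` contain the same values.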

<div class="alert alert-warning">Under certain circumstances, directly applying <code>multiprocessing.Pool</code> to some code in the notebook might not work; see <a href="https://medium.com/@grvsinghal/speed-up-your-python-code-using-multiprocessing-on-windows-and-jupyter-or-ipython-2714b49d6fac">this explanation</a>. In that case, it should be enough to move the function we want to parallelize to a <code>.py</code> module, or wrap the execution inside an <code>if __name__ == "__main__":</code> block.</div>

It's _almost_ four times faster! But why doesn't it take exactly 1 second? The reason is mostly **process overhead**. If the task is very fast, the time it takes to spawn a new process becomes the bottleneck. See what happens if we remove the `sleep`:
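A sketch of such an experiment, again as a standalone script with a hypothetical function `square` that returns immediately, compares a plain list comprehension with a pool of four processes:

```python
import multiprocessing
import time


def square(x):
    """Trivial task with no sleep, so the work itself is almost free."""
    return x * x


if __name__ == "__main__":
    values = [1, 2, 3, 4]

    # Plain sequential execution in the current process
    start = time.perf_counter()
    sequential = [square(v) for v in values]
    sequential_time = time.perf_counter() - start

    # Same work through a pool: starting the processes dominates the runtime
    start = time.perf_counter()
    with multiprocessing.Pool(processes=4) as pool:
        parallel = pool.map(square, values)
    parallel_time = time.perf_counter() - start

    print(f"Sequential: {sequential_time:.6f} s")
    print(f"Pool of 4:  {parallel_time:.6f} s")
```

The exact ratio between the two timings depends on the operating system and on how it starts new processes.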

*It's actually 30 000 times slower!* That gives us an idea of how large the process overhead is.

### Exercise

Create a Monte Carlo simulation that leverages `multiprocessing` to accelerate the estimation of $\pi$.

## Asynchronous programming

Asynchronous programming is different from parallelism, and is in fact a form of **concurrency**. The differences between parallelism and concurrency are subtle - for example, from [this Stack Overflow answer](https://stackoverflow.com/a/36604522/554319):

> An application can be concurrent – but not parallel, which means that it processes more than one task at the same time, but no two tasks are executing at same time instant.

> Concurrency is about _dealing_ with lots of things at once. Parallelism is about _doing_ lots of things at once. [Emphasis mine]

Before going any further, let's write a simple I/O-bound program, which we will then accelerate using `asyncio`:
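As a minimal sketch of such a program, assume a hypothetical function `fetch` that simulates a blocking I/O call with `time.sleep`. Each call must finish before the next one starts, so the total time is the sum of the individual delays:

```python
import time


def fetch(name, delay):
    """Hypothetical blocking I/O call, simulated with time.sleep."""
    time.sleep(delay)
    return f"{name} done after {delay} s"


start = time.perf_counter()
# The calls run one after another, each blocking until it completes
results = [fetch("task-1", 0.3), fetch("task-2", 0.2), fetch("task-3", 0.1)]
elapsed = time.perf_counter() - start

print(results)
print(f"Total: {elapsed:.2f} s")
```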

In asynchronous programming there is an **event loop** that runs **coroutines**. Every time a coroutine **_awaits_** something that is not ready yet, it hands control back to the event loop, which can then run another coroutine. It's a way of saying "this will take some time, so start doing it and I will continue with other things".

Coroutines are defined by adding `async` before `def`, and they are scheduled with `asyncio.create_task`:
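The following sketch illustrates both steps with a hypothetical coroutine `fetch` that simulates I/O with `asyncio.sleep`. It is written as a standalone script, so it starts its own event loop with `asyncio.run`; in a notebook, the tasks could instead be created directly in a cell.

```python
import asyncio
import time


async def fetch(name, delay):
    """Hypothetical non-blocking I/O call, simulated with asyncio.sleep."""
    await asyncio.sleep(delay)
    return f"{name} done after {delay} s"


async def main():
    # Scheduling the coroutines as tasks lets them wait concurrently
    tasks = [
        asyncio.create_task(fetch("task-1", 0.3)),
        asyncio.create_task(fetch("task-2", 0.2)),
        asyncio.create_task(fetch("task-3", 0.1)),
    ]
    # Collect the results in the order the tasks were created
    return [await task for task in tasks]


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start

print(results)
print(f"Total: {elapsed:.2f} s")
```

Because the three tasks wait at the same time, the total is close to the longest delay rather than the sum of all of them.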

Notice how, after the last line of the cell, the kernel was not blocked anymore and all the coroutines started at once, so the total running time was shorter than the sum of running times. This is possible because Jupyter has an event loop already running:
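One way to check this is `asyncio.get_running_loop`, which raises `RuntimeError` when no loop is running in the current thread. The sketch below wraps it in a hypothetical helper `loop_is_running`. It is written as a standalone script, where no loop runs at top level, so it also calls the helper from inside a coroutine:

```python
import asyncio


def loop_is_running():
    """Return True if called while an asyncio event loop is running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


async def check_inside_coroutine():
    # A coroutine always executes inside a running event loop
    return loop_is_running()


print("At top level:", loop_is_running())
print("Inside a coroutine:", asyncio.run(check_inside_coroutine()))
```

In a Jupyter cell, calling `loop_is_running()` directly is enough to see whether the kernel already has a loop running.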

However, the disadvantage is that asynchronous code written in Jupyter will behave very differently outside it. In particular, the "hello world" example from the asyncio documentation does not work in Jupyter, because `asyncio.run` cannot be called while an event loop is already running:

In [None]:
# https://docs.python.org/3.7/library/asyncio.html
import asyncio

async def main():
    print('Hello ...')
    await asyncio.sleep(1)
    print('... World!')

# Python 3.7+
asyncio.run(main())

Therefore, avoid developing async code in Jupyter unless you know exactly what you're doing. The snippet above works fine in a normal Python or IPython shell, and in a `.py` script. For more details, see [the technical explanation by an IPython maintainer](https://blog.jupyter.org/ipython-7-0-async-repl-a35ce050f7f7).

## Big Data Science in Python with Dask