# Distributing code on a single machine using `joblib`

`joblib` is a pure-Python package that allows you to run `Python` code in parallel.

Scientific numerical experiments often require running the **same function** **multiple times**, each time with a different set of parameters (well-known examples include cross-validation and hyper-parameter selection).

### A simple example: executing multiple function calls in serial mode

In [None]:
import time


# Here is the function that I want to parallelize -- a more realistic
# example would be named ``fit_model`` for instance.
def my_function(i):
    
    # Simulate a long-running computation. A real-life code could be np.linalg.svd(...).
    time.sleep(i)
    
    return i


# The set of arguments I want to run my function on. Each item in this list could
# be the value of a regularization parameter
my_args = [0.5, 1, 1.5, 2, 2.5]

In [None]:
%%time

results = []
for arg in my_args:
    results.append(my_function(arg))
print(results)

The cell above took about 7.5 seconds to run: it ran all of these function calls one after the other,
which is suboptimal when a computer has multiple processing units (CPUs), as pretty much all modern computers do.

### Executing multiple function calls in parallel using `joblib`

In [None]:
from joblib import Parallel, delayed

In [None]:
%%time
results = Parallel(n_jobs=2)(delayed(my_function)(arg) for arg in my_args)
print(results)

The **exact** same function calls now run in about 4.5 seconds, which is roughly 60% of the time they took in the serial case!
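To see where the 4.5 seconds comes from, here is a minimal sketch of a simplified scheduling model. It assumes each task is handed, in order, to whichever worker becomes free first, and it ignores all start-up and communication overhead. The helper `simulated_wall_time` is a hypothetical name used for illustration only; it is not part of `joblib` and does not reproduce `joblib`'s actual dispatching logic.

```python
import heapq


def simulated_wall_time(durations, n_workers):
    """Wall time when each task, in order, goes to the first free worker."""
    # Each heap entry is the time at which one worker becomes free.
    free_at = [0.0] * n_workers
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + duration)
    # The run ends when the last worker finishes.
    return max(free_at)


my_args = [0.5, 1, 1.5, 2, 2.5]
serial_time = simulated_wall_time(my_args, n_workers=1)
parallel_time = simulated_wall_time(my_args, n_workers=2)
print(serial_time, parallel_time)  # 7.5 4.5
```

Under this model the two workers do not finish at the same time (one ends at 3 s, the other at 4.5 s), which is why the speed-up with 2 workers falls short of a factor of 2 for this particular list of arguments.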

Advanced note: on standard CPU architectures, code is run in parallel using either **threads** or **processes**.
By default, `joblib` relies on **process**-based parallelism, because **thread**-based parallelism has limitations in `Python` (in the standard CPython interpreter, the global interpreter lock prevents several threads from executing `Python` bytecode at the same time). You can choose which kind of parallelism `joblib` uses with the `backend` option of the `Parallel` constructor:

In [None]:
%%time
results = Parallel(n_jobs=2, backend='threading')(delayed(my_function)(arg) for arg in my_args)
print(results)
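One way to observe the difference between the two kinds of workers is to ask each call which operating-system process it runs in. The sketch below assumes `joblib` is installed and that the default backend is able to start worker processes in your environment; `worker_pid` is a hypothetical helper introduced only for this illustration.

```python
import os

from joblib import Parallel, delayed


def worker_pid(_):
    # Report the ID of the process that actually executes the call.
    return os.getpid()


main_pid = os.getpid()

# Thread-based workers live inside the main process.
thread_pids = set(
    Parallel(n_jobs=2, backend='threading')(delayed(worker_pid)(i) for i in range(4))
)

# Process-based workers (joblib's default backend) normally run in separate processes.
process_pids = set(
    Parallel(n_jobs=2)(delayed(worker_pid)(i) for i in range(4))
)

print("threads share the main process:", thread_pids == {main_pid})
print("main process did the work itself:", main_pid in process_pids)
```

Note that `time.sleep` releases the global interpreter lock, so the sleeping `my_function` above also speeds up with threads; a function doing heavy pure-`Python` computation generally would not.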

### Automatic caching using `joblib.Memory`

Among the main pain points of numerical experiments, one stands out: a long-running experiment erroring out partway through its execution. Usually, intermediate computations are lost, which can mean days of computing and human work wasted.

`joblib` provides a way to cache the results of a function on disk:

In [None]:
from joblib import Memory
memory = Memory('joblib-cache-directory')

# caching functions defined inside notebooks is tricky. For maximum
# robustness, my_function should be moved to a python file when using
# joblib.Memory
from my_module import my_function
my_function_cached = memory.cache(my_function)

In [None]:
%%time

results = Parallel(n_jobs=2)(delayed(my_function_cached)(arg) for arg in my_args)
print(results)

In [None]:
%%time
results = Parallel(n_jobs=2)(delayed(my_function_cached)(arg) for arg in my_args)
print(results)

Notice the drop in execution time between the two cells above? In the first cell, `my_function` did not have
any cached results yet; executing those function calls took the same time as in the non-cached case.

But in the second cell, the exact same `Parallel` call took only about 10 ms: `joblib` did not re-execute any of these function calls, but only loaded the already-computed results!

**Important remark**: each time the source code of a `joblib.Memory`-cached function changes, the cache associated with this function is cleared.
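The caching behaviour can also be checked without timing anything, by recording when the function body really runs. This is a minimal sketch, assuming `joblib` is installed and that the code is run as a plain script (the caveat above about functions defined inside notebooks still applies). The names `slow_square` and `executed_with` are hypothetical and exist only for this illustration, and the cache goes to a temporary directory rather than `joblib-cache-directory`.

```python
import tempfile

from joblib import Memory

# A throwaway cache directory, so the sketch does not touch your project folder.
cache_dir = tempfile.mkdtemp()
memory = Memory(cache_dir, verbose=0)

executed_with = []  # records every argument the function body really ran on


def slow_square(x):
    executed_with.append(x)
    return x * x


slow_square_cached = memory.cache(slow_square)

first = slow_square_cached(3)   # no cached result yet: the body runs
second = slow_square_cached(3)  # same argument: the result is loaded from disk
third = slow_square_cached(4)   # new argument: the body runs again

print(first, second, third)
print(executed_with)
```

Only the calls with a previously unseen argument should appear in `executed_with`, since the cache is keyed on the function and its arguments.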

#### From single-machine parallelism to multi-machine parallelism

- Usually, `joblib` can yield improvements proportional to the number of CPUs of the machine you are using. If your machine has 4 CPUs, you can expect the total running time of your function calls to be divided by up to 4 compared to the serial case. Laptops usually have between 2 and 12 CPUs.

- `joblib` provides a way to quickly parallelize code on a single machine. But scientific institutions such as ours often have access to a larger set of computational resources, such as a slurm cluster. Running code on several nodes of a slurm cluster in parallel can be done in various ways (submitting `batch` scripts, for instance). But doing so is error-prone and outside the area of expertise of most researchers.
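As a rough illustration of the "up to" in the first point, the sketch below reads the CPU count with the standard library and computes a best-case running time for the sleep durations used earlier. It is an idealized bound that ignores all overhead, not a prediction of what `joblib` will achieve on your machine.

```python
import os

# Number of logical CPUs reported by the operating system
# (os.cpu_count() may return None on some platforms, hence the fallback).
n_cpus = os.cpu_count() or 1

task_durations = [0.5, 1, 1.5, 2, 2.5]  # the sleep times used earlier, in seconds
serial_time = sum(task_durations)

# No schedule can beat either of these two limits:
#   - the work shared perfectly evenly across all CPUs,
#   - the single longest task, which cannot be split.
best_case = max(serial_time / n_cpus, max(task_durations))

print(f"{n_cpus} CPUs: serial {serial_time:.1f} s, best case {best_case:.1f} s")
```

To let `joblib` use every available CPU, pass `n_jobs=-1` to `Parallel`.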


# Natively scaling python code using `dask`

- The canonical package in the `Python` data science ecosystem for running `Python` code on a cluster of machines is `dask`. As opposed to the `slurm` command-line utilities, `dask` scales your `Python` code **natively**: no need to leave your `jupyter` notebook!
- `joblib` integrates with `dask`, making scaling from a single machine to an HPC cluster as seamless as possible: the only additional code you need to add is the specification of the slurm nodes you want to use: