# Parallelisation

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lukeconibear/swd6_hpp/blob/main/docs/06_parallelisation.ipynb)

In [1]:
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    !pip install dask[dataframe] joblib ray

## What is it?

Parallelisation divides a large problem into many smaller ones and solves them *simultaneously*.
- *Divides up the time/space complexity across workers.*
- Tasks centrally managed by a scheduler.
- Multi-processing (cores)
    - Useful for compute-bound problems.
    - Overcomes the [Global Interpreter Lock, GIL](https://wiki.python.org/moin/GlobalInterpreterLock) (which prevents Python bytecode from running on multiple threads simultaneously).  
    - Lower performance when data need to be exchanged or aggregated between processes.
- Multi-threading (parts of processes)
    - Useful for memory-bound problems.  
    
Parallelised code often introduces overheads. So, the speed-up benefits are more pronounced with bigger jobs, rather than some of the small examples used in this tutorial.
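
As a minimal sketch of the divide, solve, and aggregate pattern described above, the example below splits a sum of squares into chunks and hands each chunk to a worker from the standard library's `concurrent.futures`. The helper names (`sum_of_squares`, `chunked`) are illustrative and not part of this tutorial. Threads are used here only to keep the sketch portable; because of the GIL, a compute-bound pure-Python job like this one would need a process pool to see a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """Solve one small sub-problem: the sum of squares of a chunk."""
    return sum(x * x for x in chunk)


def chunked(values, n_chunks):
    """Split a list into roughly equal, contiguous chunks."""
    size = -(-len(values) // n_chunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


values = list(range(100_000))
chunks = chunked(values, n_chunks=4)

# Each worker solves one chunk; the executor plays the role of the scheduler.
with ThreadPoolExecutor(max_workers=4) as executor:
    partial_sums = list(executor.map(sum_of_squares, chunks))

# Aggregate the partial results into the final answer.
total = sum(partial_sums)
print(total == sum_of_squares(values))
```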

## Parallelising Python

Python itself is not designed for massive scalability and controls threads preemptively using the GIL. This has led many libraries to work around the limitation using C/C++ backends.  

Some options include:  

[multiprocessing](https://docs.python.org/3/library/multiprocessing.html) for creating a pool of asynchronous workers. 

In [2]:
from multiprocessing import Pool

def my_function(x):
    return x * x

with Pool(3) as workers:
    print(workers.map(my_function, [1, 2, 3]))

[1, 4, 9]


[joblib](https://joblib.readthedocs.io/en/latest/) for creating lightweight pipelines that help with "embarrassingly parallel" tasks.

In [3]:
import joblib
import math

In [4]:
joblib.Parallel(n_jobs=2)(  # n_jobs sets the number of parallel workers
    joblib.delayed(math.sqrt)(i**2) for i in range(10)
)

[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
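
A slightly fuller sketch, assuming a hypothetical function `scaled_sqrt` that takes more than one argument: `joblib.delayed` captures the function together with its arguments, and `n_jobs` sets how many workers run the resulting tasks. `prefer='threads'` is used here only to keep the example lightweight; the default process-based backend is usually the better choice for compute-bound work.

```python
import math

from joblib import Parallel, delayed


def scaled_sqrt(value, scale=1.0):
    """Hypothetical example function with a positional and a keyword argument."""
    return scale * math.sqrt(value)


# delayed(f)(args) records the function and its arguments without calling f.
tasks = (delayed(scaled_sqrt)(i**2, scale=2.0) for i in range(5))

# Results come back in the same order as the tasks were submitted.
results = Parallel(n_jobs=2, prefer='threads')(tasks)
print(results)
```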

[asyncio](https://docs.python.org/3/library/asyncio.html) for concurrent programs, especially ones that are IO-bound.  

```python
import asyncio

async def main():
    print('Hello ...')
    await asyncio.sleep(1)
    print('... World!')
    
asyncio.run(main())
```
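
The benefit for IO-bound work shows up once several coroutines wait at the same time. The sketch below uses `asyncio.gather` with a hypothetical `fetch` coroutine, where `asyncio.sleep` stands in for a network or disk wait. Inside a Jupyter notebook, which already runs an event loop, `asyncio.run` raises an error, so `await main()` would be used there instead.

```python
import asyncio
import time


async def fetch(name, delay):
    """Pretend to wait on IO (hypothetical stand-in for a real request)."""
    await asyncio.sleep(delay)
    return f'{name} done'


async def main():
    # The three waits overlap, so the total time is close to the longest
    # single delay rather than the sum of all three.
    return await asyncio.gather(
        fetch('a', 0.2),
        fetch('b', 0.2),
        fetch('c', 0.2),
    )


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start

print(results)
print(f'{elapsed:.2f} s')
```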

These options work well for the CPU cores on your machine, though not really beyond that.  

## [Vectorisation](https://jakevdp.github.io/PythonDataScienceHandbook/02.03-computation-on-arrays-ufuncs.html)
[Vectorised operations](https://en.wikipedia.org/wiki/Automatic_vectorization) effectively "parallelise" the code by operating on multiple array elements at once, rather than looping through them one at a time.  
This is what happens under the hood of NumPy arrays, functions, and aggregations (e.g., mean, sum).

In [5]:
import numpy as np

In [6]:
nums = np.arange(1_000_000)

In [7]:
%%timeit
[num + 2 for num in nums]

200 ms ± 30.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
%%timeit
nums + 2  # adds 2 to every element by overloading the + operator

580 µs ± 3.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [9]:
%%timeit
np.add(nums, 2)

583 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
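
The same applies to aggregations. A minimal sketch comparing a pure-Python loop with the vectorised `mean` method (the array here is a smaller, illustrative stand-in for `nums` above):

```python
import numpy as np

values = np.arange(10_000, dtype=np.float64)

# Pure-Python approach: visit the elements one at a time.
total = 0.0
for value in values:
    total += value
loop_mean = total / len(values)

# Vectorised approach: the loop runs inside NumPy's compiled code.
vectorised_mean = values.mean()

print(loop_mean, vectorised_mean)
```

Both approaches give the same answer; wrapping each in `%%timeit` in a notebook shows the difference in run time.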


[Broadcasting](https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html) applies operations to arrays of different shapes (see the [NumPy](https://numpy.org/doc/stable/user/basics.broadcasting.html) and [xarray](https://xarray.pydata.org/en/v0.16.2/computation.html?highlight=Broadcasting#broadcasting-by-dimension-name) documentation).

![broadcasting.png](images/broadcasting.png)  

*[Image source](https://mathematica.stackexchange.com/questions/99171/how-to-implement-the-general-array-broadcasting-method-from-numpy)*

In [10]:
nums_col = np.array([0, 10, 20, 30]).reshape(4, 1)
nums_row = np.array([0, 1, 2])

nums_col + nums_row

array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22],
       [30, 31, 32]])
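
A common use of broadcasting is centring data. In the sketch below (the array values are made up for illustration), the column means have shape `(3,)` and are stretched across the four rows of a `(4, 3)` array. `np.broadcast_shapes`, available in NumPy 1.20 and later, reports the resulting shape without doing the calculation.

```python
import numpy as np

data = np.array([
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
    [4.0, 40.0, 400.0],
])

column_means = data.mean(axis=0)  # shape (3,)

# Shapes are compared from the right: (4, 3) with (3,) gives (4, 3).
print(np.broadcast_shapes(data.shape, column_means.shape))

# Subtract each column's mean from every row, with no explicit loop.
centred = data - column_means
print(centred.mean(axis=0))
```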

In [11]:
import xarray as xr

In [12]:
nums_col = xr.DataArray([0, 10, 20, 30], [('col', [0, 10, 20, 30])])
nums_row = xr.DataArray([0, 1, 2], [('row', [0, 1, 2])])

nums_col + nums_row

NumPy [ufuncs](https://numpy.org/doc/stable/reference/ufuncs.html) (universal functions).
- *Optimised in C (statically typed and compiled).*
- Arbitrary Python function to NumPy ufunc:
    - [`np.frompyfunc`](https://numpy.org/doc/stable/reference/generated/numpy.frompyfunc.html). 
    - [`np.vectorize`](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html) (keeps the documentation).  

In [13]:
random_array = np.random.rand(5, 5)
random_array

array([[0.78046373, 0.02082283, 0.66327924, 0.96713496, 0.04232469],
       [0.10403072, 0.92562836, 0.59985643, 0.21840083, 0.31095524],
       [0.11727009, 0.52451126, 0.61290502, 0.098795  , 0.62898025],
       [0.83474084, 0.15081632, 0.73288648, 0.4118689 , 0.52985103],
       [0.02298189, 0.89237702, 0.3305164 , 0.37169666, 0.69242908]])

In [14]:
def my_function(array, threshold):
    """Compare an array to a threshold."""
    if array > threshold:
        return round(array - threshold, 1)
    else:
        return round(array + threshold, 1)

In [15]:
frompyfunc_function = np.frompyfunc(
    my_function, 
    2, # number of input arguments 
    1) # number of returned objects
frompyfunc_function(random_array, 0.5)

array([[0.3, 0.5, 0.2, 0.5, 0.5],
       [0.6, 0.4, 0.1, 0.7, 0.8],
       [0.6, 0.0, 0.1, 0.6, 0.1],
       [0.3, 0.7, 0.2, 0.9, 0.0],
       [0.5, 0.4, 0.8, 0.9, 0.2]], dtype=object)

In [16]:
frompyfunc_function.__doc__

"my_function (vectorized)(x1, x2, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])\n\ndynamic ufunc based on a python function"

In [17]:
vectorized_function = np.vectorize(my_function)
vectorized_function(random_array, 0.5)

array([[0.3, 0.5, 0.2, 0.5, 0.5],
       [0.6, 0.4, 0.1, 0.7, 0.8],
       [0.6, 0. , 0.1, 0.6, 0.1],
       [0.3, 0.7, 0.2, 0.9, 0. ],
       [0.5, 0.4, 0.8, 0.9, 0.2]])

In [18]:
vectorized_function.__doc__

'Compare an array to a threshold.'
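
Note that, according to the NumPy documentation, `np.vectorize` is provided mainly for convenience and is essentially a `for` loop, so it does not bring the speed of a compiled ufunc. Where the logic allows, the same result can often be expressed with built-in array operations. A sketch using `np.where` for the threshold function above (the small input array and the name `compare_to_threshold` are illustrative):

```python
import numpy as np


def compare_to_threshold(value, threshold):
    """Scalar version, following the logic of my_function above."""
    if value > threshold:
        return round(value - threshold, 1)
    return round(value + threshold, 1)


values = np.array([[0.78, 0.02], [0.66, 0.97]])

# Element-by-element Python calls, wrapped to accept arrays.
looped = np.vectorize(compare_to_threshold)(values, 0.5)

# The same branching done with whole-array operations in compiled code.
vectorised = np.where(values > 0.5, values - 0.5, values + 0.5).round(1)

print(looped)
print(vectorised)
```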

## [Dask](https://docs.dask.org/en/latest/)

- Great features.
- Helpful documentation.
- Familiar API.
- Under the hood for many libraries e.g. [xarray](http://xarray.pydata.org/en/stable/dask.html), [iris](https://scitools.org.uk/iris/docs/v2.4.0/userguide/real_and_lazy_data.html), [scikit-learn](https://ml.dask.org/).

### [Single machine](https://docs.dask.org/en/latest/setup/single-distributed.html)

See the excellent video from Dask creator, Matthew Rocklin, below.

In [19]:
from IPython.display import IFrame
IFrame(src='https://www.youtube.com/embed/ods97a5Pzw0', width='560', height='315')

In [20]:
if not IN_COLAB:
    from dask.distributed import Client
    client = Client()
    client 

Perhaps you already have a cluster running?
Hosting the HTTP server on port 40179 instead


If you want multiple threads instead, pass keyword arguments when creating the `Client`:
```python
client = Client(processes=False, threads_per_worker=4, n_workers=1)
```

Importantly, always remember to close the client when you have finished:
```python
client.close()
```

### Dask behind the scenes

In [21]:
import xarray as xr

In [22]:
ds = xr.tutorial.open_dataset(
    'air_temperature',
    chunks={'time': 'auto'} # dask chunks
)

In [23]:
ds_mean = ds.mean()
ds_mean # a dask.array (an unexecuted task graph)

| | Array | Chunk |
|---|---|---|
| Bytes | 4 B | 4.0 B |
| Shape | () | () |
| Count | 4 Tasks | 1 Chunks |
| Type | float32 | numpy.ndarray |


In [24]:
ds_mean.compute()

In [25]:
ds.close()
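
The same lazy behaviour can be seen with plain Python functions. As a minimal sketch (the functions `double` and `add` are made up for illustration), `dask.delayed` records calls in a task graph instead of running them, and nothing executes until `.compute()` is called.

```python
import dask


@dask.delayed
def double(x):
    return 2 * x


@dask.delayed
def add(x, y):
    return x + y


# Building the graph is cheap: no function has run yet.
doubled = [double(i) for i in range(4)]
total = add(add(doubled[0], doubled[1]), add(doubled[2], doubled[3]))
print(type(total).__name__)  # a Delayed object, not a number

# compute() hands the graph to a scheduler, which can run the
# independent double() calls in parallel.
result = total.compute()
print(result)
```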

### [dask.array](https://examples.dask.org/array.html) (NumPy)
See the excellent video from Dask creator, Matthew Rocklin, below.

In [26]:
from IPython.display import IFrame
IFrame(src='https://www.youtube.com/embed/ZrP-QTxwwnU', width='560', height='315')

In [27]:
import dask.array as da

In [28]:
my_array = da.random.random(
    (5_000, 5_000),
    chunks=(500, 500) # dask chunks
)
result = my_array + my_array.T
result

| | Array | Chunk |
|---|---|---|
| Bytes | 190.73 MiB | 1.91 MiB |
| Shape | (5000, 5000) | (500, 500) |
| Count | 300 Tasks | 100 Chunks |
| Type | float64 | numpy.ndarray |


In [29]:
if not IN_COLAB:
    result.compute()
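
To see how chunks relate to the final answer, a smaller sketch (sizes chosen arbitrarily) wraps an existing NumPy array with `da.from_array` and checks that the chunked, lazy result matches plain NumPy.

```python
import numpy as np
import dask.array as da

numpy_array = np.arange(16).reshape(4, 4)

# Split the 4 x 4 array into four 2 x 2 chunks.
dask_array = da.from_array(numpy_array, chunks=(2, 2))
print(dask_array.numblocks)  # number of chunks along each axis

# Still lazy: this only extends the task graph.
lazy_result = (dask_array + dask_array.T).sum(axis=0)

# compute() runs the graph and returns an ordinary NumPy array.
result = lazy_result.compute()
print(type(result).__name__, result)
```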

### [dask.dataframe](https://examples.dask.org/dataframe.html) (Pandas)
See the excellent video from Dask creator, Matthew Rocklin, below.

In [30]:
IFrame(src='https://www.youtube.com/embed/6qwlDc959b0', width='560', height='315')

In [31]:
import dask

In [32]:
df = dask.datasets.timeseries()
df

| | id | name | x | y |
|---|---|---|---|---|
| npartitions=30 | | | | |
| 2000-01-01 | int64 | object | float64 | float64 |
| 2000-01-02 | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| 2000-01-30 | ... | ... | ... | ... |
| 2000-01-31 | ... | ... | ... | ... |


In [33]:
type(df)

dask.dataframe.core.DataFrame

In [34]:
result = df.groupby('name').x.std()
result

Dask Series Structure:
npartitions=1
    float64
        ...
Name: x, dtype: float64
Dask Name: sqrt, 67 tasks

In [35]:
# result.visualize()

In [36]:
result_computed = result.compute()

In [37]:
type(result_computed)

pandas.core.series.Series

### [dask.bag](https://examples.dask.org/bag.html)
Iterate over a bag of independent objects (embarrassingly parallel).

In [38]:
import numpy as np
import dask.bag as db

In [39]:
nums = np.random.randint(low=0, high=100, size=(5_000))
nums

array([ 2, 92,  6, ..., 74, 70, 73])

In [40]:
def function(nums):
    return chr(nums)

In [41]:
if not IN_COLAB:
    bag = db.from_sequence(nums)
    bag = bag.map(function)
    
    bag.visualize()
    
    result = bag.compute()
    
    client.close()
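
A smaller, deterministic sketch of the same idea, assuming nothing beyond `dask.bag` itself: the sequence is split into partitions, and each lazy step (`map`, `filter`) is applied to every partition independently before the results are gathered.

```python
import dask.bag as db

bag = db.from_sequence(range(10), npartitions=2)

# Each step is lazy and is applied to every partition independently.
even_squares = bag.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

# The threaded scheduler is chosen here only to keep the sketch simple.
result = even_squares.compute(scheduler='threads')
print(result)
```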

### [Dask on HPC](https://docs.dask.org/en/latest/setup/hpc.html)

- Create/edit the [`dask_on_hpc.py`](https://github.com/lukeconibear/swd6_hpp/blob/main/docs/dask_on_hpc.py) file.
- Submit to the queue using [`qsub dask_on_hpc.bash`](https://github.com/lukeconibear/swd6_hpp/blob/main/docs/dask_on_hpc.bash).

If you need to share memory across chunks:  
- Use [shared memory](https://docs.dask.org/en/latest/shared.html) (commonly OpenMP, Open Multi-Processing).
- `-pe smp np` on ARC4

Otherwise:  
- Use [message passing interface, MPI](https://docs.dask.org/en/latest/setup/hpc.html?highlight=mpi#using-mpi) (commonly OpenMPI).
- `-pe ib np` on ARC4

## [Ray](https://www.ray.io/)
Ray will automatically detect the available GPUs and CPUs on the machine.
- Can also [specify required resources](https://docs.ray.io/en/latest/walkthrough.html#specifying-required-resources).  

First, initialise Ray.

In [42]:
import ray
ray.init()

{'node_ip_address': '129.11.87.102',
 'raylet_ip_address': '129.11.87.102',
 'redis_address': '129.11.87.102:27892',
 'object_store_address': '/tmp/ray/session_2021-12-20_17-32-09_242176_105537/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-12-20_17-32-09_242176_105537/sockets/raylet',
 'webui_url': None,
 'session_dir': '/tmp/ray/session_2021-12-20_17-32-09_242176_105537',
 'metrics_export_port': 59763,
 'node_id': '82d29a72a86915913fc10e583c4494dbc3ab17c1ece79b64f311d9e9'}

### Functions become Tasks
- Parallelise functions by adding `@ray.remote` decorator  
- Then instead of calling it normally, use the `.remote()` method  
- This yields a future object reference that you can retrieve with `ray.get(object)` 

In [43]:
@ray.remote
def f(x):
    return x * x

In [44]:
# asynchronously run a task
futures = [f.remote(i) for i in range(4)]
print(ray.get(futures))

[0, 1, 4, 9]


### Classes become Actors
- Parallelise classes the same way
- These actors maintain their internal state  

In [45]:
@ray.remote
class Counter(object):
    def __init__(self):
        self.value = 0
        
    def increment(self):
        self.value += 1
    
    def read(self):
        return self.value

In [46]:
# construct an actor instance using .remote()
counters = [Counter.remote() for i in range(4)]

In [47]:
# asynchronously run actor methods
[counter.increment.remote() for counter in counters]
futures = [counter.read.remote() for counter in counters]
print(ray.get(futures))

[1, 1, 1, 1]


Other key API methods:
- `ray.put()`
    - Put a value in the distributed object store.
    - `put_id = ray.put(my_object)`
- `ray.get()`
    - Get an object from the distributed object store, either placed there by `ray.put()` explicitly or by a task or actor method, blocking until object is available.
    - `thing = ray.get(put_id)`
- `ray.wait()`
    - Wait on a list of ids until one of the corresponding objects is available (e.g., the task completes). Return two lists, one with ids for the available objects and the other with ids for the still-running tasks or method calls.
    - `finished, running = ray.wait([train_id, track_id])`

### Ray's [`multiprocessing`](https://docs.ray.io/en/latest/multiprocessing.html)
To scale beyond one machine and generally manage a pool of processes.  

Replace:
```python
from multiprocessing.pool import Pool
```

With:

```python
from ray.util.multiprocessing.pool import Pool
```


In [48]:
from ray.util.multiprocessing.pool import Pool

In [49]:
def my_function(x):
    return x * x

In [50]:
with Pool(2) as workers:
    print(workers.map(my_function, [1, 2, 3]))

[1, 4, 9]


### Ray's [`joblib`](https://docs.ray.io/en/latest/joblib.html)
The underpinnings of [scikit-learn](https://scikit-learn.org/stable/), which Ray can scale to a cluster.

Import and call `register_ray`, which registers Ray as a `joblib` backend for `scikit-learn`:  
```python
import joblib
from ray.util.joblib import register_ray
register_ray()
```

Then run your original `scikit-learn` code within a Ray/`joblib` backend:
```python
with joblib.parallel_backend('ray'):
    pass  # replace with the original scikit-learn code
```

For example, here's some parallel hyperparameter tuning:
```python
import joblib
from ray.util.joblib import register_ray
register_ray()

import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV

digits = load_digits()
param_space = {
    'C': np.logspace(-6, 6, 30),
    'gamma': np.logspace(-8, 8, 30),
    'tol': np.logspace(-4, -1, 30),
    'class_weight': [None, 'balanced'],
}
model = SVC(kernel='rbf')
# RandomizedSearchCV is imported directly above, so no sklearn. prefix is needed
search = RandomizedSearchCV(
    model, param_space, cv=5, n_iter=300, verbose=10)

with joblib.parallel_backend('ray'):
    search.fit(digits.data, digits.target)
```

When finished, remember to shut down the Ray connection.

In [51]:
ray.shutdown()

Please see this [repository](https://github.com/lukeconibear/distributed_deep_learning) for examples of how to do distributed deep learning using Ray Train with TensorFlow, PyTorch, and Horovod.

## [Dask on Ray](https://docs.ray.io/en/latest/data/dask-on-ray.html)
Use Ray as a backend for Dask tasks.  
Dask dispatches tasks to Ray for scheduling and execution.

In [52]:
import ray
import dask
import dask.dataframe as dd 
import pandas as pd
import numpy as np
from ray.util.dask import ray_dask_get

In [53]:
dask.config.set(scheduler=ray_dask_get) 
ray.init()

{'node_ip_address': '129.11.87.102',
 'raylet_ip_address': '129.11.87.102',
 'redis_address': '129.11.87.102:57522',
 'object_store_address': '/tmp/ray/session_2021-12-20_17-32-21_571729_105537/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-12-20_17-32-21_571729_105537/sockets/raylet',
 'webui_url': None,
 'session_dir': '/tmp/ray/session_2021-12-20_17-32-21_571729_105537',
 'metrics_export_port': 62666,
 'node_id': 'ae2f6013df29314db76b2d234d8c1fca5206cc59e3b159c23e25b29f'}

In [54]:
df = pd.DataFrame(np.random.randint(0, 100, size=(2**10, 2**8)))
df = dd.from_pandas(df, npartitions=10)
df.head(10)

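| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 53 | 72 | 0 | 7 | 83 | 74 | 86 | 49 | 81 | 42 | ... | 77 | 81 | 65 | 45 | 35 | 2 | 22 | 45 | 30 | 33 |
| 1 | 45 | 65 | 41 | 25 | 62 | 29 | 69 | 77 | 46 | 91 | ... | 56 | 42 | 85 | 31 | 69 | 49 | 8 | 75 | 9 | 36 |
| 2 | 64 | 14 | 42 | 25 | 95 | 4 | 79 | 80 | 8 | 36 | ... | 5 | 48 | 91 | 33 | 63 | 93 | 30 | 98 | 8 | 20 |
| 3 | 21 | 30 | 95 | 17 | 64 | 12 | 97 | 69 | 5 | 75 | ... | 45 | 77 | 79 | 54 | 60 | 15 | 7 | 55 | 94 | 93 |
| 4 | 50 | 25 | 15 | 27 | 3 | 77 | 0 | 15 | 30 | 68 | ... | 67 | 74 | 80 | 29 | 78 | 69 | 97 | 0 | 7 | 9 |
| 5 | 64 | 73 | 66 | 77 | 72 | 20 | 57 | 70 | 41 | 19 | ... | 97 | 70 | 54 | 29 | 1 | 28 | 15 | 98 | 40 | 65 |
| 6 | 11 | 84 | 8 | 81 | 73 | 12 | 38 | 82 | 57 | 34 | ... | 60 | 59 | 2 | 76 | 38 | 57 | 19 | 23 | 9 | 53 |
| 7 | 21 | 76 | 6 | 13 | 54 | 95 | 17 | 50 | 15 | 82 | ... | 93 | 35 | 47 | 22 | 97 | 96 | 10 | 18 | 54 | 89 |
| 8 | 82 | 31 | 3 | 95 | 74 | 82 | 71 | 77 | 25 | 29 | ... | 37 | 27 | 81 | 26 | 26 | 33 | 32 | 31 | 80 | 31 |
| 9 | 46 | 28 | 57 | 72 | 47 | 40 | 93 | 93 | 76 | 89 | ... | 28 | 66 | 78 | 17 | 69 | 7 | 25 | 39 | 24 | 24 |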


In [55]:
ray.shutdown()

## [Modin](https://modin.readthedocs.io/en/latest/)
Modin uses Ray or Dask to easily speed up your Pandas code.  
To use Modin, simply replace the import and use Pandas API as normal.

In [56]:
import os
os.environ['MODIN_ENGINE'] = 'ray'
# os.environ['MODIN_ENGINE'] = 'dask'

In [57]:
if not IN_COLAB:
    # initialise Ray before any dataframe operations
    import ray
    ray.init()

    # import pandas as pd
    import modin.pandas as pd

    frame_data = np.random.randint(0, 100, size=(5_000, 1_000))
    df = pd.DataFrame(frame_data)
    df.head(10)





## [Mars](https://docs.pymars.org/en/latest/)
Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and many other libraries.  
Swap out the library import, use the same API, and add `.execute()`.

```python
import mars
mars.new_session()
```

### [Mars Tensor](https://docs.pymars.org/en/latest/getting_started/tensor.html) for NumPy

```python
# import numpy as np
# np.random.rand(10)

import mars.tensor as mt
mt.random.rand(10).execute()
```

### [Mars DataFrame](https://docs.pymars.org/en/latest/getting_started/dataframe.html) for Pandas

```python
# import pandas as pd
# df = pd.DataFrame(
#     np.random.rand(10),
#     columns=['random_numbers']
# )

import mars.dataframe as md
df = md.DataFrame(
    np.random.rand(10),
    columns=['random_numbers']
).execute()
```

And remember to stop the server when you're finished.

```python
mars.stop_server()
```

Mars can also use Ray as the backend ([instructions](https://docs.ray.io/en/latest/data/mars-on-ray.html)).

## Further information
- [Using IPython for parallel computing](https://ipyparallel.readthedocs.io/en/latest/)
- [Spark on Ray](https://docs.ray.io/en/latest/data/raydp.html): RayDP combines your Spark and Ray clusters, making it easy to do large-scale data processing using the PySpark API and seamlessly use that data to train your models using TensorFlow and PyTorch.
- [Concurrency](https://youtu.be/18B1pznaU1o) can also run different tasks together, but work is not done at the same time.  
- [Asynchronous](https://youtu.be/iG6fr81xHKA) (multi-threading), useful for massive scaling, threads controlled explicitly.  