# Parallel Computing

In [1]:
import os
cpus = os.cpu_count()
cpus

4

## Test Functions

In [2]:
import time

In [3]:
def long_io(i, res_dict): # a function simulating IO, user input, etc. (no computation)
    print(f'function {i} started at {time.ctime()}')
    time.sleep(2)
    res = i*2
    res_dict[i] = res
    print(f'function {i} finished at {time.ctime()}')
    return res

In [4]:
res = {}
[long_io(i, res) for i in range(4)]
res

function 0 started at Tue Jul  2 14:10:03 2019
function 0 finished at Tue Jul  2 14:10:05 2019
function 1 started at Tue Jul  2 14:10:05 2019
function 1 finished at Tue Jul  2 14:10:07 2019
function 2 started at Tue Jul  2 14:10:07 2019
function 2 finished at Tue Jul  2 14:10:09 2019
function 3 started at Tue Jul  2 14:10:09 2019
function 3 finished at Tue Jul  2 14:10:11 2019


{0: 0, 1: 2, 2: 4, 3: 6}

In [5]:
def long_comp(i, res_dict): # a function involving heavy computation
    print(f'function {i} started at {time.ctime()}')
    x = i
    for y in range(5000000):
        if x % 2 == 0:
            x = (x-y)**2
        else:
            x = (x+y)**0.5
    res_dict[i] = x
    print(f'function {i} finished at {time.ctime()}')
    return x

In [6]:
res = {}
[long_comp(i, res) for i in range(4)]
res

function 0 started at Tue Jul  2 14:10:11 2019
function 0 finished at Tue Jul  2 14:10:14 2019
function 1 started at Tue Jul  2 14:10:14 2019
function 1 finished at Tue Jul  2 14:10:16 2019
function 2 started at Tue Jul  2 14:10:16 2019
function 2 finished at Tue Jul  2 14:10:18 2019
function 3 started at Tue Jul  2 14:10:18 2019
function 3 finished at Tue Jul  2 14:10:20 2019


{0: 2236.5678097446853,
 1: 2236.5678097446853,
 2: 2236.5678097446853,
 3: 2236.5678097446853}

Time-intensive functions are either CPU-bound (calculations) or limited by other factors such as I/O, network response times, or user interaction. The two test functions above, `long_io` and `long_comp`, cover both cases.
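The runs above are timed by reading `time.ctime()` stamps, which resolve only to whole seconds. As a minimal sketch, a small helper could measure elapsed wall-clock time directly with `time.perf_counter`. The names `timed` and `short_io` are hypothetical and not part of the notebook:

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds). Hypothetical helper."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

def short_io(i):
    """Shortened stand-in for an I/O-bound function: waits, then returns i*2."""
    time.sleep(0.1)
    return i * 2

result, elapsed = timed(short_io, 3)
print(result, f"{elapsed:.2f} s")
```

The same helper could wrap any of the test functions to get sub-second timings.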

## Multithreading

In [7]:
import threading

In [8]:
# I/O-bound function (not limited by computation)
threads = []
res = {}
for i in range(4):
    threads.append(threading.Thread(target=long_io, args=(i,res)))
    threads[i].start()
print(f'all threads started at {time.ctime()}')
for i in range(4):
    threads[i].join()
print(f'all threads finished at {time.ctime()}')
res

function 0 started at Tue Jul  2 14:10:21 2019
function 1 started at Tue Jul  2 14:10:21 2019
function 2 started at Tue Jul  2 14:10:21 2019
function 3 started at Tue Jul  2 14:10:21 2019
all threads started at Tue Jul  2 14:10:21 2019
function 0 finished at Tue Jul  2 14:10:23 2019
function 1 finished at Tue Jul  2 14:10:23 2019
function 3 finished at Tue Jul  2 14:10:23 2019function 2 finished at Tue Jul  2 14:10:23 2019

all threads finished at Tue Jul  2 14:10:23 2019


{0: 0, 1: 2, 3: 6, 2: 4}

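With multithreading, executing all 4 functions takes about as long as executing a single one (2 s instead of 8 s), because the threads spend their time waiting in `time.sleep` rather than computing.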
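The overlap of waiting times can also be measured rather than read from the timestamps. The following is a sketch under the assumption that `time.sleep` is an adequate stand-in for real I/O; `wait_io` is a hypothetical name:

```python
import threading
import time

def wait_io(seconds):
    time.sleep(seconds)  # stands in for waiting on I/O

start = time.perf_counter()
threads = [threading.Thread(target=wait_io, args=(0.3,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# Run one after another, the four waits would add up to about 1.2 s.
print(f"4 threads, 0.3 s each: {elapsed:.2f} s in total")
```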


Threads share the memory space of the main process, so data structures such as dictionaries can be used to pass information to and from the threads.
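In the example above each thread writes to its own dictionary key. When several threads update the same entry, a lock is commonly used to keep the read-modify-write step consistent. A minimal sketch, with the hypothetical names `counter` and `add_many`:

```python
import threading

counter = {"value": 0}   # shared between all threads
lock = threading.Lock()

def add_many(n):
    for _ in range(n):
        with lock:                 # only one thread updates at a time
            counter["value"] += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["value"])  # 40000
```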

In [9]:
# CPU-bound function
threads = []
res = {}
for i in range(4):
    threads.append(threading.Thread(target=long_comp, args=(i,res)))
    threads[i].start()
print(f'all threads started at {time.ctime()}')
for i in range(4):
    threads[i].join()
print(f'all threads finished at {time.ctime()}')
res

function 0 started at Tue Jul  2 14:10:23 2019
function 1 started at Tue Jul  2 14:10:23 2019
function 2 started at Tue Jul  2 14:10:23 2019
function 3 started at Tue Jul  2 14:10:23 2019
all threads started at Tue Jul  2 14:10:23 2019
function 0 finished at Tue Jul  2 14:10:29 2019
function 3 finished at Tue Jul  2 14:10:31 2019
function 2 finished at Tue Jul  2 14:10:32 2019
function 1 finished at Tue Jul  2 14:10:32 2019
all threads finished at Tue Jul  2 14:10:32 2019


{0: 2236.5678097446853,
 3: 2236.5678097446853,
 2: 2236.5678097446853,
 1: 2236.5678097446853}

For the computationally intensive function, running 4 instances as threads takes about 4 times as long as running a single instance (9 s in total, the same as the sequential run above).
This holds even on a machine with multiple cores.

Reason: the Global Interpreter Lock (GIL) in CPython allows only one thread per process to execute Python bytecode at a time.
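The effect can be checked with a smaller workload. The sketch below assumes a standard CPython build with the GIL enabled; `busy`, `worker`, and the workload size `N` are hypothetical choices. The two timings are expected to be similar, but the exact numbers depend on the machine:

```python
import threading
import time

def busy(n):
    """CPU-bound stand-in: sum of squares computed in pure Python."""
    total = 0
    for k in range(n):
        total += k * k
    return total

N = 200_000  # arbitrary workload size

# Four calls one after another
start = time.perf_counter()
serial = [busy(N) for _ in range(4)]
serial_time = time.perf_counter() - start

# The same four calls, one thread each
threaded = [None] * 4

def worker(idx):
    threaded[idx] = busy(N)

start = time.perf_counter()
threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded_time = time.perf_counter() - start

print(f"serial: {serial_time:.3f} s, threaded: {threaded_time:.3f} s")
```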

## Multiprocessing

In [10]:
import multiprocessing

In [11]:
# I/O-bound function (not limited by computation)
pool = multiprocessing.Pool(processes=cpus)
processes = []
res = {}
for i in range(4):
    processes.append(pool.apply_async(long_io, args=(i,res)))
print(f'all processes started at {time.ctime()}')
pool.close() # close pool so that it does not accept further submissions
pool.join() # wait until all processes are finished
print(f'all processes finished at {time.ctime()}')
res

function 0 started at Tue Jul  2 14:10:32 2019
function 1 started at Tue Jul  2 14:10:32 2019
function 2 started at Tue Jul  2 14:10:32 2019
function 3 started at Tue Jul  2 14:10:32 2019
all processes started at Tue Jul  2 14:10:32 2019
function 1 finished at Tue Jul  2 14:10:34 2019
function 0 finished at Tue Jul  2 14:10:34 2019
function 2 finished at Tue Jul  2 14:10:34 2019
function 3 finished at Tue Jul  2 14:10:34 2019
all processes finished at Tue Jul  2 14:10:34 2019


{}

Note that the dictionary passed as a function argument is not updated by the processes (in contrast to the threads shown above). The worker processes do not share memory with each other or with the main process; each one receives its own pickled copy of the arguments.

The return value of the functions can be obtained using:

In [12]:
[process.get() for process in processes]

[0, 2, 4, 6]

In [13]:
# CPU-bound function
pool = multiprocessing.Pool(processes=cpus)
processes = []
res = {}
for i in range(4):
    processes.append(pool.apply_async(long_comp, args=(i,res)))
print(f'all processes started at {time.ctime()}')
pool.close() # close pool so that it does not accept further submissions
pool.join() # wait until all processes are finished
print(f'all processes finished at {time.ctime()}')
res

function 3 started at Tue Jul  2 14:10:34 2019
function 2 started at Tue Jul  2 14:10:34 2019
function 0 started at Tue Jul  2 14:10:34 2019
function 1 started at Tue Jul  2 14:10:34 2019
all processes started at Tue Jul  2 14:10:34 2019
function 0 finished at Tue Jul  2 14:10:37 2019
function 1 finished at Tue Jul  2 14:10:37 2019
function 3 finished at Tue Jul  2 14:10:37 2019
function 2 finished at Tue Jul  2 14:10:37 2019
all processes finished at Tue Jul  2 14:10:37 2019


{}

In [14]:
{i: process.get() for i, process in enumerate(processes)}

{0: 2236.5678097446853,
 1: 2236.5678097446853,
 2: 2236.5678097446853,
 3: 2236.5678097446853}

Using multiprocessing, the instances of the computationally intensive function were executed in parallel (here on 4 cores), so all four finished in about the time of a single instance (3 s).


Processes do not share memory with each other or with the main process (in contrast to threads). This may sound like a disadvantage compared to threads, but in most cases it is an advantage:

* Each process has its own interpreter and its own GIL (the GIL exists to protect the interpreter's shared memory, which separate processes do not share), so CPU-bound functions can run in parallel on multiple cores.
* Pure functions, where all input is given as function arguments and all output is in the return value, work fine with multiprocessing.
* Side effects through global variables or mutable data types are avoided. This forces the code to be cleaner and more modular.
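The difference between a function that relies on a side effect and a pure one can be sketched as follows. The names `impure_double` and `pure_double` are illustrative, and the built-in `map` stands in for an executor's `map`, which accepts the same `(func, iterable)` arguments:

```python
def impure_double(i, res_dict):
    """Writes into a caller-supplied dict; in a child process this write is lost."""
    res_dict[i] = i * 2

def pure_double(i):
    """Everything the caller needs is in the return value."""
    return i, i * 2

# Collect the (key, value) pairs from the return values only.
res = dict(map(pure_double, range(4)))
print(res)  # {0: 0, 1: 2, 2: 4, 3: 6}
```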

## Concurrent.Futures

A high-level API for both multithreading and multiprocessing, introduced in Python 3.2.
It is essentially an abstraction over the `threading` and `multiprocessing` modules.

In [15]:
import concurrent.futures

### Multithreading

In [16]:
futures = []
res = {}
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    for i in range(4):
        futures.append(executor.submit(long_comp, i, res_dict=res))
futures, res

function 0 started at Tue Jul  2 14:10:37 2019
function 1 started at Tue Jul  2 14:10:37 2019
function 2 started at Tue Jul  2 14:10:37 2019
function 3 started at Tue Jul  2 14:10:37 2019
function 0 finished at Tue Jul  2 14:10:43 2019
function 2 finished at Tue Jul  2 14:10:46 2019
function 1 finished at Tue Jul  2 14:10:46 2019
function 3 finished at Tue Jul  2 14:10:46 2019


([<Future at 0x7f79b4069748 state=finished returned float>,
  <Future at 0x7f79b405f128 state=finished returned float>,
  <Future at 0x7f79b4069630 state=finished returned float>,
  <Future at 0x7f79b4069ef0 state=finished returned float>],
 {0: 2236.5678097446853,
  2: 2236.5678097446853,
  1: 2236.5678097446853,
  3: 2236.5678097446853})

In [17]:
{i: future.result() for i, future in enumerate(futures)}

{0: 2236.5678097446853,
 1: 2236.5678097446853,
 2: 2236.5678097446853,
 3: 2236.5678097446853}

### Multiprocessing

In [18]:
futures = []
res = {}
with concurrent.futures.ProcessPoolExecutor() as executor: 
    # when no max_workers are defined, the number of CPU cores is used
    for i in range(4):
        futures.append(executor.submit(long_comp, i, res_dict=res))
futures, res

function 0 started at Tue Jul  2 14:10:46 2019
function 1 started at Tue Jul  2 14:10:46 2019
function 2 started at Tue Jul  2 14:10:46 2019
function 3 started at Tue Jul  2 14:10:46 2019
function 0 finished at Tue Jul  2 14:10:49 2019
function 1 finished at Tue Jul  2 14:10:49 2019
function 3 finished at Tue Jul  2 14:10:49 2019
function 2 finished at Tue Jul  2 14:10:49 2019


([<Future at 0x7f79b40830f0 state=finished returned float>,
  <Future at 0x7f79b490a8d0 state=finished returned float>,
  <Future at 0x7f79b490a278 state=finished returned float>,
  <Future at 0x7f79b490a7b8 state=finished returned float>],
 {})

In [19]:
{i: future.result() for i, future in enumerate(futures)}

{0: 2236.5678097446853,
 1: 2236.5678097446853,
 2: 2236.5678097446853,
 3: 2236.5678097446853}

The syntax for multithreading and multiprocessing is identical.
Note that only multithreading shares the memory space of the main process, which is why `res` is filled in the thread example but stays empty in the process example. Pure functions work with both multithreading and multiprocessing.
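Because the two executors share one interface, the executor class can be passed in as a parameter. The sketch below runs with `ThreadPoolExecutor` and collects results with `concurrent.futures.as_completed`; `slow_double` and `run_all` are hypothetical names. Swapping in `ProcessPoolExecutor` would additionally require the function to be picklable, and in a script the call is usually placed under an `if __name__ == "__main__":` guard:

```python
import concurrent.futures
import time

def slow_double(i):
    time.sleep(0.05 * (4 - i))  # later inputs finish first
    return i * 2

def run_all(executor_cls, inputs):
    """Submit slow_double for every input and gather results as they complete."""
    results = {}
    with executor_cls(max_workers=4) as executor:
        future_to_input = {executor.submit(slow_double, i): i for i in inputs}
        for future in concurrent.futures.as_completed(future_to_input):
            results[future_to_input[future]] = future.result()
    return results

res = run_all(concurrent.futures.ThreadPoolExecutor, range(4))
print(sorted(res.items()))  # [(0, 0), (1, 2), (2, 4), (3, 6)]
```

Keying the results by input, as done here, avoids depending on the completion order.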

## Dask

See the [Dask notebook](dask.ipynb).

## Conclusion

* For tasks that are not CPU-bound (I/O, user interaction, network responses), use threading: threads are lightweight and create less overhead than processes.
* For CPU-bound tasks, use multiprocessing. In CPython, multithreading brings no benefit in this case because of the GIL.

More information is given here:
https://medium.com/@bfortuner/python-multithreading-vs-multiprocessing-73072ce5600b

It is recommended to use the *concurrent.futures* API in Python 3 for both threading and multiprocessing where possible. The functions to be parallelized should be pure functions.
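For a pure function applied to a sequence of inputs, `executor.map` is often the shortest form. It yields results in input order even when the calls finish in a different order. A minimal sketch with threads, where `slow_square` is a hypothetical example function:

```python
import concurrent.futures
import time

def slow_square(x):
    time.sleep(0.01 * (5 - x))  # later inputs finish earlier
    return x * x

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    squares = list(executor.map(slow_square, range(5)))

print(squares)  # [0, 1, 4, 9, 16]
```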

### Example for Pandas DataFrame

In [20]:
import numpy as np
import pandas as pd
import concurrent.futures
import os

#### Vectorized Operations

In [21]:
n_rows = 100000
df = pd.DataFrame({i: np.random.randn(n_rows) for i in ['a', 'b', 'c', 'd']})
df.head()

| | a | b | c | d |
|---|---|---|---|---|
| 0 | 1.1696 | 0.100671 | 0.435771 | -0.866237 |
| 1 | 0.140947 | -0.423965 | -1.67793 | 1.186315 |
| 2 | -0.967017 | -2.106315 | 0.19316 | -0.275644 |
| 3 | -0.773805 | 1.496136 | -0.092818 | 0.730723 |
| 4 | -0.648026 | 0.996726 | -0.470211 | 0.416909 |


In [22]:
def func_on_df(a, b, c, d):
    return (a-b)/(c+d)**2

In [23]:
df['res1'] = func_on_df(df.a, df.b, df.c, df.d) # "classical" single process
df.head()

| | a | b | c | d | res1 |
|---|---|---|---|---|---|
| 0 | 1.1696 | 0.100671 | 0.435771 | -0.866237 | 5.768629 |
| 1 | 0.140947 | -0.423965 | -1.67793 | 1.186315 | 2.337387 |
| 2 | -0.967017 | -2.106315 | 0.19316 | -0.275644 | 167.454255 |
| 3 | -0.773805 | 1.496136 | -0.092818 | 0.730723 | -5.578323 |
| 4 | -0.648026 | 0.996726 | -0.470211 | 0.416909 | -578.918974 |


In [24]:
%timeit df['res1'] = func_on_df(df.a, df.b, df.c, df.d)

6.97 ms ± 46 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [25]:
df['res1b'] = df.apply(lambda x: func_on_df(x.a, x.b, x.c, x.d), axis=1)
df.head()

| | a | b | c | d | res1 | res1b |
|---|---|---|---|---|---|---|
| 0 | 1.1696 | 0.100671 | 0.435771 | -0.866237 | 5.768629 | 5.768629 |
| 1 | 0.140947 | -0.423965 | -1.67793 | 1.186315 | 2.337387 | 2.337387 |
| 2 | -0.967017 | -2.106315 | 0.19316 | -0.275644 | 167.454255 | 167.454255 |
| 3 | -0.773805 | 1.496136 | -0.092818 | 0.730723 | -5.578323 | -5.578323 |
| 4 | -0.648026 | 0.996726 | -0.470211 | 0.416909 | -578.918974 | -578.918974 |


In [26]:
%timeit df['res1b'] = df.apply(lambda x: func_on_df(x.a, x.b, x.c, x.d), axis=1)

17.3 s ± 23.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


The vectorized function working on NumPy arrays is quite fast (6.97 ms), although it uses only a single core. The row-wise `apply` (17.3 s) is over 2000 times slower.

What about multiprocessing?

In [27]:
cpus = os.cpu_count()
chunksize = int(n_rows / cpus)
chunksize

25000
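`int(n_rows / cpus)` rounds down, so when the row count is not divisible by the number of cores, the leftover rows form additional small chunks. The sketch below illustrates how consecutive batches of `chunksize` items would be formed; `chunk_bounds` is a hypothetical helper, and rounding up with `math.ceil` is one possible alternative, not something the notebook does:

```python
import math

def chunk_bounds(n_items, chunksize):
    """Return (start, stop) index pairs for consecutive chunks of at most chunksize items."""
    return [(start, min(start + chunksize, n_items))
            for start in range(0, n_items, chunksize)]

n_rows, cpus = 100_000, 4
print(chunk_bounds(n_rows, int(n_rows / cpus)))
# [(0, 25000), (25000, 50000), (50000, 75000), (75000, 100000)]

# With 10 items on 4 cores, rounding down gives 5 chunks, rounding up gives 4.
print(len(chunk_bounds(10, int(10 / 4))))        # 5
print(len(chunk_bounds(10, math.ceil(10 / 4))))  # 4
```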

In [28]:
with concurrent.futures.ProcessPoolExecutor(max_workers=cpus) as executor:
    df['res2'] = pd.Series(executor.map(
        func_on_df, df.a, df.b, df.c, df.d, chunksize=chunksize))
df.head()

| | a | b | c | d | res1 | res1b | res2 |
|---|---|---|---|---|---|---|---|
| 0 | 1.1696 | 0.100671 | 0.435771 | -0.866237 | 5.768629 | 5.768629 | 5.768629 |
| 1 | 0.140947 | -0.423965 | -1.67793 | 1.186315 | 2.337387 | 2.337387 | 2.337387 |
| 2 | -0.967017 | -2.106315 | 0.19316 | -0.275644 | 167.454255 | 167.454255 | 167.454255 |
| 3 | -0.773805 | 1.496136 | -0.092818 | 0.730723 | -5.578323 | -5.578323 | -5.578323 |
| 4 | -0.648026 | 0.996726 | -0.470211 | 0.416909 | -578.918974 | -578.918974 | -578.918974 |


In [29]:
%%timeit
with concurrent.futures.ProcessPoolExecutor(max_workers=cpus) as executor:
    df['res2'] = list(executor.map(
        func_on_df, df.a, df.b, df.c, df.d, chunksize=chunksize))

276 ms ± 7.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


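Running the function through `executor.map` on 4 cores (276 ms) is actually much slower than the native single-core vectorized call on NumPy arrays (6.97 ms): `executor.map` iterates over the Series and calls `func_on_df` once per row with scalar arguments, and the values have to be pickled and sent between processes. It is still much faster than the row-wise `apply` (17.3 s).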
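The relative cost of the three variants follows directly from the timings reported above. A small sketch of the arithmetic, where the dictionary keys are just labels chosen here:

```python
# Mean timings from the %timeit cells above, in seconds
timings = {
    "vectorized": 6.97e-3,
    "row-wise apply": 17.3,
    "process pool": 0.276,
}

baseline = timings["vectorized"]
ratios = {name: t / baseline for name, t in timings.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:,.0f}x the vectorized time")
```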

#### Non-Vectorized Function

In [30]:
import hashlib

In [31]:
def calc_hash(*args):
    str_args = (str(arg) for arg in args)
    return hashlib.md5('_'.join(str_args).encode()).hexdigest()

In [32]:
calc_hash(42, 2.233, 23, 'hello')

'de8b962de5f8b353cdc9cc1fc89f0b65'

Applying the hash-function row-wise to the DataFrame is rather slow:

In [33]:
df['hash'] = df.apply(lambda x: calc_hash(x.a, x.b), axis=1)
df.head()

| | a | b | c | d | res1 | res1b | res2 | hash |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.1696 | 0.100671 | 0.435771 | -0.866237 | 5.768629 | 5.768629 | 5.768629 | 5c2ae2b9a14b690d77aef13a454dede7 |
| 1 | 0.140947 | -0.423965 | -1.67793 | 1.186315 | 2.337387 | 2.337387 | 2.337387 | bdbc62e86cd028a16168e2d82dd8476e |
| 2 | -0.967017 | -2.106315 | 0.19316 | -0.275644 | 167.454255 | 167.454255 | 167.454255 | 66fec3541d4d2a356a5d0a124ad42a89 |
| 3 | -0.773805 | 1.496136 | -0.092818 | 0.730723 | -5.578323 | -5.578323 | -5.578323 | 47396abdb6f71c2a4bbe4fb83f307dd8 |
| 4 | -0.648026 | 0.996726 | -0.470211 | 0.416909 | -578.918974 | -578.918974 | -578.918974 | d91da602d10bb8f42accd1562cc0cff2 |


In [34]:
%timeit df['hash'] = df.apply(lambda x: calc_hash(x.a, x.b), axis=1)

10.9 s ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [36]:
%timeit df['hash1b'] = calc_hash(df.a, df.b)

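8.99 ms ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Note that this call is not a vectorized equivalent of the row-wise version: `calc_hash` converts each whole Series to its string representation and returns a single hash, which is then assigned to every row (see the identical `hash1b` values below). Its timing is therefore not comparable to the other two.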
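The difference between hashing once and hashing per row can be reproduced without pandas. In the sketch below, plain lists stand in for the DataFrame columns (an assumption made to keep the example self-contained), and `calc_hash` is the function defined above:

```python
import hashlib

def calc_hash(*args):
    str_args = (str(arg) for arg in args)
    return hashlib.md5('_'.join(str_args).encode()).hexdigest()

col_a = [1.5, 2.5, 3.5]
col_b = [0.1, 0.2, 0.3]

# Passing whole columns: str() is applied to each list, yielding ONE hash
whole = calc_hash(col_a, col_b)

# Passing values pairwise: one hash per row
per_row = [calc_hash(a, b) for a, b in zip(col_a, col_b)]

print(whole)
print(per_row)
```

Only the pairwise form corresponds to what the row-wise `apply` and `executor.map` variants compute.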


In [37]:
with concurrent.futures.ProcessPoolExecutor(max_workers=cpus) as executor:
    df['hash2'] = pd.Series(executor.map(
        calc_hash, df.a, df.b, chunksize=int(n_rows/cpus)))
df.head()

| | a | b | c | d | res1 | res1b | res2 | hash | hash1b | hash2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.1696 | 0.100671 | 0.435771 | -0.866237 | 5.768629 | 5.768629 | 5.768629 | 5c2ae2b9a14b690d77aef13a454dede7 | 1961a528354a86a84a214d674af8a4a1 | 5c2ae2b9a14b690d77aef13a454dede7 |
| 1 | 0.140947 | -0.423965 | -1.67793 | 1.186315 | 2.337387 | 2.337387 | 2.337387 | bdbc62e86cd028a16168e2d82dd8476e | 1961a528354a86a84a214d674af8a4a1 | bdbc62e86cd028a16168e2d82dd8476e |
| 2 | -0.967017 | -2.106315 | 0.19316 | -0.275644 | 167.454255 | 167.454255 | 167.454255 | 66fec3541d4d2a356a5d0a124ad42a89 | 1961a528354a86a84a214d674af8a4a1 | 66fec3541d4d2a356a5d0a124ad42a89 |
| 3 | -0.773805 | 1.496136 | -0.092818 | 0.730723 | -5.578323 | -5.578323 | -5.578323 | 47396abdb6f71c2a4bbe4fb83f307dd8 | 1961a528354a86a84a214d674af8a4a1 | 47396abdb6f71c2a4bbe4fb83f307dd8 |
| 4 | -0.648026 | 0.996726 | -0.470211 | 0.416909 | -578.918974 | -578.918974 | -578.918974 | d91da602d10bb8f42accd1562cc0cff2 | 1961a528354a86a84a214d674af8a4a1 | d91da602d10bb8f42accd1562cc0cff2 |


In [39]:
all(df.hash == df.hash2)

True

In [38]:
%%timeit
with concurrent.futures.ProcessPoolExecutor(max_workers=cpus) as executor:
    df['hash2'] = pd.Series(executor.map(
        calc_hash, df.a, df.b, chunksize=int(n_rows/cpus)))

509 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


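In this case, there is a significant speed-up using multiprocessing (509 ms instead of 10.9 s).

Open question: the speed-up is about a factor of 20, not the factor of 4 expected from 4 cores. One likely contributor is that the baseline uses row-wise `df.apply`, which adds per-row overhead beyond the hash computation itself, whereas `executor.map` iterates over the column values directly.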