## Overview

This tutorial goes over the following:

* How to structure the code and understand the syntax to enable parallel processing using multiprocessing
* How to implement synchronous and asynchronous parallel processing
* How to parallelize a Pandas DataFrame
* How to solve 3 different use cases with the multiprocessing.Pool() interface

In [1]:
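# Check how many parallel processes you can run on your system.
# multiprocess is a third-party fork of the standard multiprocessing module
# that serializes with dill (install with: pip install multiprocess).
import multiprocess as mp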

print("Number of processors: ", mp.cpu_count())

Number of processors:  16


## Synchronous vs. Asynchronous

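Synchronous execution blocks the main program until the submitted work has finished, and results come back in the order in which the tasks were submitted.

Asynchronous execution does not block the main program. Each call returns immediately and tasks may finish in any order, so results collected as they complete (for example through a callback) can arrive out of order. Because the main program does not wait between submissions, this is often the faster approach when tasks are submitted one at a time.

There are two main classes in the multiprocessing library: Pool and Process. The Pool methods fall into two groups: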

1. Synchronous execution
 * Pool.map() and Pool.starmap()
 * Pool.apply()
 
2. Asynchronous execution
 * Pool.map_async() and Pool.starmap_async()
 * Pool.apply_async()
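
The difference between the two groups can be sketched with a small, self-contained script. This is an illustration rather than part of the notebook: it uses the standard-library `multiprocessing` module (assumed here to expose the same Pool API as the `multiprocess` fork), and `slow_square`, `run_sync`, and `run_async` are hypothetical names invented for the example.

```python
import multiprocessing as mp  # standard library; the notebook uses the `multiprocess` fork
import time


def slow_square(x):
    """Hypothetical worker: sleep briefly, then return x squared."""
    time.sleep(0.1)
    return x * x


def run_sync(values, workers=4):
    """Blocking: each pool.apply() call waits for its result before the next one starts."""
    with mp.Pool(workers) as pool:
        return [pool.apply(slow_square, args=(v,)) for v in values]


def run_async(values, workers=4):
    """Non-blocking: submit every task first, then collect the results."""
    with mp.Pool(workers) as pool:
        pending = [pool.apply_async(slow_square, args=(v,)) for v in values]
        # .get() blocks until that particular task is done; reading the
        # AsyncResult objects in submission order keeps the output ordered.
        return [p.get() for p in pending]


if __name__ == "__main__":
    values = list(range(8))
    for runner in (run_sync, run_async):
        start = time.time()
        out = runner(values)
        print(runner.__name__, out, round(time.time() - start, 2), "seconds")
```

Both functions return the same list. The asynchronous version should finish noticeably sooner, because up to four of the sleeps overlap instead of running back to back.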
 
 
## Test with number counting problem

Given a 2D matrix, count how many numbers in each row fall within a given range.
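
To make the expected output concrete, here is a tiny worked example on a hand-checkable matrix. The matrix `small` and the helper `count_in_range` are hypothetical and exist only for this illustration; the range 4 to 8 is inclusive at both ends.

```python
# Hypothetical 3x5 matrix, small enough to verify by hand.
small = [
    [1, 4, 9, 6, 3],
    [0, 2, 3, 9, 9],
    [5, 5, 8, 4, 7],
]


def count_in_range(row, minimum, maximum):
    """Count the values in `row` that satisfy minimum <= value <= maximum."""
    return sum(minimum <= n <= maximum for n in row)


counts = [count_in_range(row, 4, 8) for row in small]
print(counts)  # [2, 0, 5]
```

Each entry of `counts` corresponds to one row: two values of the first row lie in the range, none of the second, and all five of the third.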

In [2]:
# Generate 2D matrix
import numpy as np
from time import time

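# Caution: this creates a seeded RandomState object but discards it, so the
# np.random.randint() call below is not actually seeded.
np.random.RandomState(100)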
arr = np.random.randint(0,10, size=[200000, 5])
data = arr.tolist()
data[:5]

[[7, 6, 9, 6, 3],
 [9, 9, 6, 3, 7],
 [7, 8, 0, 9, 8],
 [7, 3, 9, 9, 4],
 [7, 2, 4, 9, 5]]

### Solve without parallelization

Define a function that counts how many numbers in a row lie within a given range and returns the count.

In [3]:
def howmany_within_range(row, minimum, maximum):
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return count

results = []

start = time()

for row in data:
    results.append(howmany_within_range(row, minimum = 4, maximum = 8))

end = time()
print(end - start, "seconds")
print(results[:10])

0.154587984085083 seconds
[3, 2, 3, 2, 3, 0, 4, 2, 3, 3]


## Parallelizing using Pool.apply()

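We will parallelize the howmany_within_range() function using multiprocessing.Pool(). Note that Pool.apply() blocks until each call returns, so the loop below runs one task at a time and pays the inter-process communication cost for every row.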

In [4]:
import multiprocess as mp

# Initiate multiprocessing.Pool() with the total number of processors we have
pool = mp.Pool(mp.cpu_count())

start = time()
print("start")
results = [pool.apply(howmany_within_range, args=(row, 4, 8)) for row in data]

# Must close the pool after use
pool.close()

end = time()

print(end - start, "seconds")
print(results[:10])

start
99.01706647872925 seconds
[3, 2, 3, 2, 3, 0, 4, 2, 3, 3]
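
The apply() loop above makes one blocking round trip to the pool per row. One possible alternative, sketched here under assumptions, is Pool.starmap(), which takes all argument tuples in a single call and unpacks each tuple into the function's parameters. The sketch uses the standard-library `multiprocessing` module rather than `multiprocess`, and `count_with_starmap` is a hypothetical wrapper, not something from the notebook.

```python
import multiprocessing as mp  # standard library; assumed to match the multiprocess Pool API


def howmany_within_range(row, minimum, maximum):
    """Same counting logic as the notebook's function."""
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count += 1
    return count


def count_with_starmap(rows, minimum, maximum, workers=4):
    """Submit all rows in one call; starmap unpacks each tuple into arguments."""
    tasks = [(row, minimum, maximum) for row in rows]
    with mp.Pool(workers) as pool:
        return pool.starmap(howmany_within_range, tasks)


if __name__ == "__main__":
    # First three rows of the matrix shown earlier in the notebook.
    rows = [[7, 6, 9, 6, 3], [9, 9, 6, 3, 7], [7, 8, 0, 9, 8]]
    print(count_with_starmap(rows, 4, 8))  # [3, 2, 3]
```

Like map(), starmap() returns results in input order, so the output lines up with the rows that were passed in.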


## Parallelizing using Pool.map()

In [None]:
import multiprocess as mp

# Redefine with only one mandatory argument, because pool.map() passes
# a single item from the iterable to each call.
def howmany_within_range_rowonly(row, minimum=4, maximum=8):
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return count

pool = mp.Pool(mp.cpu_count())

results = pool.map(howmany_within_range_rowonly, data)

pool.close()

print(results[:10])
# These counts should match the serial results above, because pool.map()
# returns results in the same order as the input rows.
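
Redefining the function with default arguments is one way to satisfy map()'s single-argument requirement. Another option, shown here as a sketch rather than as the notebook's method, is to keep the original three-argument function and bind the bounds with `functools.partial`. The wrapper name `count_with_map` is hypothetical, and the standard-library `multiprocessing` module stands in for `multiprocess`.

```python
import multiprocessing as mp  # standard library; assumed to match the multiprocess Pool API
from functools import partial


def howmany_within_range(row, minimum, maximum):
    """Same counting logic as the notebook's function."""
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count += 1
    return count


def count_with_map(rows, minimum, maximum, workers=4):
    """Bind the bounds once, then let pool.map() supply one row per call."""
    # partial() freezes the keyword arguments, leaving `row` as the only
    # argument that pool.map() has to provide.
    worker = partial(howmany_within_range, minimum=minimum, maximum=maximum)
    with mp.Pool(workers) as pool:
        return pool.map(worker, rows)


if __name__ == "__main__":
    rows = [[7, 6, 9, 6, 3], [9, 9, 6, 3, 7], [7, 8, 0, 9, 8]]
    print(count_with_map(rows, 4, 8))  # [3, 2, 3]
```

This keeps the bounds configurable at call time instead of hard-coding them as defaults in a second copy of the function.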

