## Multiprocessing

An alternative to Python-level threading, where concurrency is managed by the Python interpreter, is to use processes managed by the operating system.  There are advantages and disadvantages to processes versus threads.  Process-based parallelism in Python is provided by the `multiprocessing` module.

In Python there is a third approach as well, called *coroutines*, which are managed by the standard library module `asyncio` or by one of several third-party coordinating modules.  We look at those in a separate INE course, but the key idea is that they require explicitly writing "control release" code rather than allowing the Python interpreter or the operating system to preempt execution.

The advantage of multiprocessing is that it enables just as much actual parallelism as your operating system and hardware support.  If you have multiple cores, a different Python interpreter can run on each core.  Even if you have more processes than cores, the operating system will handle preemption and time-slicing.
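
How much parallelism is available depends on the machine.  As a rough sketch (the variable names here are illustrative, not part of the lesson's code), the standard library can report how many logical CPUs exist and, on Linux, how many this process is allowed to use:

```python
import os

# Logical CPUs visible to the operating system; cpu_count() may return None
logical_cpus = os.cpu_count() or 1

# On Linux the CPUs this process may actually run on can be fewer
# (containers, taskset), so prefer the affinity mask when it is available.
if hasattr(os, "sched_getaffinity"):
    usable_cpus = len(os.sched_getaffinity(0))
else:
    usable_cpus = logical_cpus

print(f"{logical_cpus} logical CPUs, {usable_cpus} usable by this process")
```

Launching more worker processes than the usable count is allowed, but the extra processes will be time-sliced rather than run in parallel.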

The disadvantage of multiprocessing follows directly from the advantage.  Since each process runs a different interpreter, **nothing** is shared between them after process creation.  The only way data can be communicated is with explicit *interprocess communication* (IPC) or shared memory.  IPC means `Queue`s or `Pipe`s.  Shared memory can often be faster than IPC, but it is more specialized and more difficult to work with.  It is available through `Value`s and `Array`s.
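
To make the shared-memory option concrete, here is a minimal sketch using `Value` and `Array`.  The helper names `tally()` and `run_shared_demo()` are hypothetical, and the code assumes a Unix-like system where the `fork` start method is available, as in the notebook environment.

```python
import multiprocessing as mp

# Assumption: a Unix-like system where the "fork" start method is available
ctx = mp.get_context("fork")

def tally(counter, squares, start, stop):
    # Each worker writes only to its own slice of the shared array ...
    for i in range(start, stop):
        squares[i] = i * i
        # ... and bumps the shared counter under its lock, because
        # `counter.value += 1` is a read-modify-write, not an atomic step.
        with counter.get_lock():
            counter.value += 1

def run_shared_demo(n_items=8, n_workers=4):
    counter = ctx.Value("i", 0)        # one shared C int
    squares = ctx.Array("l", n_items)  # shared array of C longs, zero-filled
    step = n_items // n_workers
    workers = [
        ctx.Process(target=tally,
                    args=(counter, squares, k * step, (k + 1) * step))
        for k in range(n_workers)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # Slicing a synchronized array copies its contents into a plain list
    return counter.value, squares[:]

count, values = run_shared_demo()
print(count, values)
```

The parent sees every write made by the children without any explicit message passing, but only fixed C types are supported, which is part of why shared memory is harder to work with than a queue.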

## A Word about Pipes

The main interface of `multiprocessing` is very close to the interface to `threading`.  Let us import some capabilities we work with in this lesson.

In [25]:
import os, sys
from time import time, sleep
from pprint import pprint
from multiprocessing import Process, Queue, Pipe, Pool
from queue import Empty
from itertools import product
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

In the bulk of this lesson, we will use `Queue`s to communicate.  As we saw with `threading`, this is a structure that allows concurrent access, but with a very simplified `.put()` / `.get()` interface, not random access or other sophisticated access patterns.  Note that `multiprocessing.Queue` is a completely distinct object from `queue.Queue` and the two cannot be interchanged, despite sharing most interfaces.
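
As a small illustration of that `.put()` / `.get()` interface across a process boundary, consider the following sketch.  The names `double_all()` and `run_queue_demo()` are made up for this example, and the `fork` start method is assumed (a Unix-like system).

```python
import multiprocessing as mp
from queue import Empty

ctx = mp.get_context("fork")  # assumption: Unix-like system

def double_all(inbox, outbox):
    # Keep reading until the inbox has been quiet for one second
    try:
        while True:
            outbox.put(inbox.get(timeout=1) * 2)
    except Empty:
        pass

def run_queue_demo(values):
    inbox, outbox = ctx.Queue(), ctx.Queue()
    worker = ctx.Process(target=double_all, args=(inbox, outbox))
    worker.start()
    for v in values:
        inbox.put(v)
    # Read exactly as many results as were queued, then wait for the worker
    doubled = [outbox.get(timeout=5) for _ in values]
    worker.join()
    return doubled

doubled = run_queue_demo([1, 2, 3])
print(doubled)
```

Note that the timeout exception is `queue.Empty` even for a `multiprocessing.Queue`, which is why the lesson imports `Empty` from `queue`.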

Sometimes you do not want a single shared data structure, but rather a private communication between two particular processes.  When that is the case, you should use a `Pipe`.  `.send()` and `.recv()` deal with arbitrary pickled objects, while `.send_bytes()` and `.recv_bytes()` (or `.recv_bytes_into()`) avoid the pickling overhead, but are limited to raw bytes.

In [2]:
sender, recipient = Pipe()

def process_one(pipe):
    print(f"I am process_one; process id is {os.getpid()}.\n", flush=True, end='')
    pipe.send("Hello from process_one")
    sleep(0.01)
    print(f"Exiting {os.getpid()}\n", flush=True, end='')
    
def process_two(pipe):
    print(f"I am process_two ({os.getpid()}); waiting...\n", flush=True, end='')
    greeting = pipe.recv()
    print(f"Happy someone said '{greeting}' to me.\n", flush=True, end='')
    print(f"Exiting {os.getpid()}\n", flush=True, end='')

processes = [
    Process(target=process_one, args=(sender,)),
    Process(target=process_two, args=(recipient,))
]
for p in processes:
    p.start()
for p in processes:
    p.join()  # Wait for both processes to exit

I am process_one; process id is 3008.
I am process_two (3011); waiting...
Happy someone said 'Hello from process_one' to me.
Exiting 3008
Exiting 3011
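
The byte-oriented methods can be sketched in the same way.  In this hypothetical example (`byte_echo()` and `run_bytes_demo()` are illustrative names), the two ends exchange packed floating point numbers rather than pickled objects, and the default duplex pipe lets the child reply on the same connection.  The `fork` start method is assumed.

```python
import multiprocessing as mp
import struct

ctx = mp.get_context("fork")  # assumption: Unix-like system

def byte_echo(conn):
    # Receive raw bytes, decode three little-endian doubles, reply with the sum
    a, b, c = struct.unpack("<3d", conn.recv_bytes())
    conn.send_bytes(struct.pack("<d", a + b + c))
    conn.close()

def run_bytes_demo():
    # Pipe() is duplex by default: both ends can send and receive
    parent_end, child_end = ctx.Pipe()
    worker = ctx.Process(target=byte_echo, args=(child_end,))
    worker.start()
    parent_end.send_bytes(struct.pack("<3d", 1.5, 2.25, 4.0))
    (total,) = struct.unpack("<d", parent_end.recv_bytes())
    worker.join()
    return total

total = run_bytes_demo()
print(total)
```

The cost of avoiding pickle is that both sides must agree on the byte layout, here the `struct` format strings.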


## Producer/Consumer Parallelism

For the bulk of this lesson, let's revisit the Mandelbrot set generation from the last lesson.  It was moderately slow when done sequentially, and throwing threads at the *embarrassingly parallel* CPU-bound problem made it *slightly* worse.  Remember our `mandelbrot()` function that takes a complex coordinate, and returns the "orbit" at which the point "escapes" in the iterative calculation.

In [3]:
def mandelbrot(z0: complex, orbits: int = 255) -> int:
    z = z0
    for n in range(orbits):
        if abs(z) > 2.0:
            return n
        z = z * z + z0
    return orbits

# How many iterations for sample point to "escape"?
mandelbrot(0.0965-0.638j)

17

In order to recast this problem in a producer/consumer pattern, which you saw in the first lesson on `concurrent.futures`, we can break down the pieces.  First, we need a function that describes the collection of complex points to work with, and feeds them into a `multiprocessing.Queue` named `Q_todo`.  Batching the data is vastly more efficient than queueing points one at a time, since every `.put()` pickles its argument and pushes it through a pipe.

In [4]:
def produce_points(Q_todo: Queue, pixdim: int, 
                   escape=255, x=0.1015, y=-0.633, size=0.01):
    # Generate the complex coords and queue them
    batch, npoints = [], pixdim**2
    for row, col in product(range(pixdim), range(pixdim)):
        real = x - (size/2) + (size * col/pixdim)
        imag = y - (size/2) + (size * row/pixdim)
        data = ((row, col), complex(real, imag))
        batch.append(data)
        if len(batch) == 1000:
            Q_todo.put_nowait(batch)
            batch = []
    if batch:
        # Queue the final partial batch (pixdim**2 not a multiple of 1000)
        Q_todo.put_nowait(batch)

    print(f"Queued {npoints:,} points in process {os.getpid()}", flush=True)
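
The batching step on its own can be factored into a small generator.  This is only a sketch, with `batched_items()` as a hypothetical helper name; it yields fixed-size lists and also the shorter final batch when the total is not a multiple of the batch size.

```python
from itertools import islice

def batched_items(iterable, size=1000):
    # Yield lists of up to `size` items; the last list may be shorter
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

batches = list(batched_items(range(2500), size=1000))
print([len(b) for b in batches])   # [1000, 1000, 500]
```

Each yielded list could then be handed to a single `Q_todo.put()` call.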

The next thing we need is a function that will read complex `z0` points off the TODO queue, call the `mandelbrot()` function, and put the answer onto the RESULTS queue.  We batch these as well.

In [5]:
def process_points(Q_todo: Queue, Q_result: Queue, escape: int = 255) -> None:
    # Each item in Q_todo looks like: `((row, col), coord)`
    # After processing, push to Q_result like: `((row, col), orbit)`
    start = time()
    try:
        i = 0
        while batch := Q_todo.get(timeout=1):
            done = []
            for point in batch:
                (row, col), z0 = point
                done.append( ((row, col), mandelbrot(z0, escape)) )
            Q_result.put(done)
            i += len(batch)
    except Empty as err:
        duration = time() - start
        print(f"Processed {i:,} points in process {os.getpid()} "
              f"({duration:0.2f} seconds)\n", end='', flush=True)

The final step is not part of the computation per se, but it allows us to utilize it.  We can take data off the RESULTS queue, and put it into the array `canvas`.

In [6]:
def fill_canvas(Q_result, canvas):
    try:
        i = 0
        while batch := Q_result.get(timeout=1):
            for result in batch:
                (row, col), orbit = result
                canvas[row, col] = orbit
            i += len(batch)
    except Empty as err:
        print(f"Filled canvas with {i:,} points", flush=True)

Let us run these functions purely sequentially, starting by producing the points.

In [10]:
%%time
Q_todo, Q_result = Queue(), Queue()
pixdim = 1600
canvas = np.zeros(shape=(pixdim, pixdim), dtype=np.uint8)

produce_points(Q_todo, pixdim)

Queued 2,560,000 points in process 2983
CPU times: user 2.19 s, sys: 188 ms, total: 2.38 s
Wall time: 2.37 s


In [11]:
Q_todo.qsize()

2560

Now we process those points that have been placed on the TODO queue.

In [None]:
%%time
process_points(Q_todo, Q_result)

Pull the array of escape orbits into the local array `canvas` from `Q_result`.

In [None]:
%%time
fill_canvas(Q_result, canvas)
canvas

In [None]:
# Visualize the canvas/array
fig, ax = plt.subplots(figsize=(8, 8))
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
ax.imshow(canvas);

## Trying Processes

With the nicely modularized functions, we might try to run the actual computations in parallel across multiple processes.  First, let us check the status of the queues, and what process we are running inside now.

In [None]:
print("   TODO queue items:", Q_todo.qsize())
print("RESULTS queue items:", Q_result.qsize())
print(" Current process ID:", os.getpid())

Here is the actual work.  We launch separate OS processes for `produce_points()` and for multiple copies of `process_points()`.

In [None]:
%%time
processes = []

producer = Process(target=produce_points, args=(Q_todo, pixdim))
producer.start()
processes.append(producer)
sleep(0.5)  # Allow producer to start on filling queue

for _ in range(6):
    p = Process(target=process_points, args=(Q_todo, Q_result))
    p.start()
    processes.append(p)

while Q_todo.qsize() > 0:
    sleep(0.5)

We want to empty the queues to fill the canvas.  This *could* be done within a process, but we would need to orchestrate some sort of shared memory to actually use it within this notebook.  Just reading the queue locally is easier in this case.

In [None]:
%%time
fill_canvas(Q_result, canvas)
print(f"Queue sizes: {Q_todo.qsize()}, {Q_result.qsize()}")
canvas
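
The workers above stop when a `.get()` times out, and the processes are never explicitly joined.  One alternative, sketched here under stated assumptions rather than as the lesson's own method, is to send each worker a sentinel and to join only after the results queue has been drained.  The names `STOP`, `square_batches()`, and `run_pipeline()` are hypothetical, the work is a stand-in (squaring integers), and the `fork` start method is assumed.

```python
import multiprocessing as mp

ctx = mp.get_context("fork")  # assumption: Unix-like system
STOP = None                   # hypothetical sentinel meaning "no more work"

def square_batches(todo, results):
    # Process batches until the sentinel arrives, then pass it downstream
    while (batch := todo.get()) is not STOP:
        results.put([n * n for n in batch])
    results.put(STOP)

def run_pipeline(batches, n_workers=3):
    todo, results = ctx.Queue(), ctx.Queue()
    workers = [ctx.Process(target=square_batches, args=(todo, results))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for batch in batches:
        todo.put(batch)
    for _ in workers:
        todo.put(STOP)  # one sentinel per worker

    # Drain results *before* joining: a process that still has data
    # buffered for a queue may not exit until that data is consumed.
    collected, finished = [], 0
    while finished < n_workers:
        item = results.get()
        if item is STOP:
            finished += 1
        else:
            collected.extend(item)
    for w in workers:
        w.join()
    return sorted(collected)

squares = run_pipeline([[1, 2], [3], [4, 5, 6]])
print(squares)
```

Results arrive in whatever order the workers finish, so the sketch sorts them; the Mandelbrot code avoids this issue by carrying `(row, col)` along with every result.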

## Process Pools

The `multiprocessing` module contains a very useful abstraction called a `Pool` that lets you describe operations to be performed in processes at a higher level.  A `Pool` creates several worker processes, typically within a context manager, and its methods simply hand each task to the "next available" process.

To demonstrate, let's create a small collection of points that were in our Mandelbrot rendering.  Processing 15 points takes a negligible time, but we could use this organization for larger collections.

In [14]:
points = [0.102125-0.63780625j, 0.098375-0.6378j, 0.104625-0.6378j,
          0.100875-0.63779375j, 0.097125-0.6377875j, 0.103375-0.6377875j, 
          0.099625-0.63778125j, 0.105875-0.63778125j, 0.102125-0.637775j,
          0.098375-0.63776875j, 0.104625-0.63776875j, 0.100875-0.6377625j,
          0.097125-0.63775625j, 0.103375-0.63775625j, 0.099625-0.63775j]

[mandelbrot(c) for c in points]

[49, 23, 47, 43, 18, 39, 59, 69, 60, 23, 47, 43, 18, 39, 57]

Using a `Pool`, we can define work to be performed, but the various processes will complete that work asynchronously.  There exists an `.apply()` method along with `.apply_async()`, but it is fairly worthless since it blocks until the result is ready, defeating the purpose of multiprocessing.

In [26]:
with Pool(processes=8) as pool:
    results = [pool.apply_async(mandelbrot, (c,)) for c in points]
    pprint(results[:3])
    print("Ready?", [r.ready() for r in results[:3]])
    print("Sleeping..."); 
    sleep(0.1)
    print("Ready?", [r.ready() for r in results[:3]])
    print("Results:", [r.get() for r in results])

[<multiprocessing.pool.ApplyResult object at 0x79d525d40d60>,
 <multiprocessing.pool.ApplyResult object at 0x79d525d40eb0>,
 <multiprocessing.pool.ApplyResult object at 0x79d525d408b0>]
Ready? [False, False, False]
Sleeping...
Ready? [True, True, True]
Results: [49, 23, 47, 43, 18, 39, 59, 69, 60, 23, 47, 43, 18, 39, 57]
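
Instead of polling `.ready()`, `.apply_async()` also accepts a `callback` that runs in the parent process as each result arrives.  The following is a sketch only: `escape_orbit()` repeats the lesson's `mandelbrot()` logic so the block runs on its own, `run_callback_demo()` is a hypothetical name, and the `fork` start method is assumed.

```python
import multiprocessing as mp

ctx = mp.get_context("fork")  # assumption: Unix-like system

def escape_orbit(z0, orbits=255):
    # Same iteration as mandelbrot(), repeated so this block is self-contained
    z = z0
    for n in range(orbits):
        if abs(z) > 2.0:
            return n
        z = z * z + z0
    return orbits

def run_callback_demo(points):
    collected = []
    with ctx.Pool(processes=4) as pool:
        handles = [
            pool.apply_async(escape_orbit, (c,), callback=collected.append)
            for c in points
        ]
        for h in handles:
            h.wait(timeout=10)  # block until each task has finished
    return collected

orbits_seen = run_callback_demo([0j, 3 + 0j, 1 + 0j])
print(sorted(orbits_seen))
```

The callback receives results in completion order, not submission order, which is why the example sorts before printing.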


We might simplify this pattern of using a `Pool` even more by `.map()`ing data values to a single function.  Like `.apply()`, `.map()` blocks, but in this case it is not pointless since it will allocate the first N data items to the N processes launched before blocking.  As each result becomes available, that process is used to process the next data value.  You can also use `.map_async()` instead.  Some slightly more exotic (but useful) variations like `.starmap()`, `.imap()`, `.imap_unordered()`, and `.starmap_async()` are also available.

In [28]:
with Pool(processes=10) as pool:
    results = pool.map(mandelbrot, points)
    print("Results:", results)

Results: [49, 23, 47, 43, 18, 39, 59, 69, 60, 23, 47, 43, 18, 39, 57]


In [32]:
with Pool(processes=10) as pool:
    results = pool.map_async(mandelbrot, points, chunksize=3)
    print("Results:", results)
    print("Values (blocking):", results.get())

Results: <multiprocessing.pool.MapResult object at 0x79d525d683d0>
Values (blocking): [49, 23, 47, 43, 18, 39, 59, 69, 60, 23, 47, 43, 18, 39, 57]
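
Two of the variations mentioned above can be sketched briefly.  As before, `escape_orbit()` is a stand-in that repeats the `mandelbrot()` logic, `run_variants()` is a hypothetical name, and the `fork` start method is assumed.

```python
import multiprocessing as mp

ctx = mp.get_context("fork")  # assumption: Unix-like system

def escape_orbit(z0, orbits=255):
    # Same iteration as mandelbrot(), repeated so this block is self-contained
    z = z0
    for n in range(orbits):
        if abs(z) > 2.0:
            return n
        z = z * z + z0
    return orbits

def run_variants(points, limits):
    with ctx.Pool(processes=4) as pool:
        # starmap unpacks each tuple into positional arguments, keeping order
        capped = pool.starmap(escape_orbit, zip(points, limits))
        # imap_unordered yields results as they finish, in no guaranteed order
        any_order = list(pool.imap_unordered(escape_orbit, points))
    return capped, any_order

capped, any_order = run_variants([0j, 3 + 0j, 1 + 0j], [10, 10, 1])
print(capped, sorted(any_order))
```

`.starmap()` is the natural choice when the function takes more than one argument, while `.imap_unordered()` suits cases where results can be consumed as soon as any task finishes.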


## Summary

In this lesson we have only scratched the surface of the `multiprocessing` module.  This lesson has hinted at or briefly mentioned most of what is in that module, but much of it was, of necessity, cursory.

Working with multiple processes has different challenges than working with threads.  We mostly avoid the issues of race conditions and deadlocks, but not entirely since `multiprocessing.Lock` is also available, and you do need to use it sometimes.  But comparatively, queues are much slower in multiprocessing than in multithreading, and process creation is much slower than thread creation.

The times when you really want multiprocessing are when you have substantial CPU-bound work that is parallelizable.  For the embarrassingly parallel cases, a producer/consumer model is a good choice.  For more complex structures, use of dedicated processes and pipes between them is often better.