<div style="float:right; padding-top: 15px; padding-right: 15px">
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="250">
        </a>
    </div>
</div>

# multiprocessing & parallelization

## 1. introduction

* a difficult module to apply (it's hard to see where it can be used effectively)
* Python is single-threaded and sequential by default

## 2. take advantage of multiple cpus

In [None]:
import multiprocessing

In [None]:
multiprocessing.cpu_count()
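`multiprocessing.cpu_count()` reports the number of CPUs in the system, which can be more than the current process is allowed to use (for example when it is pinned with `taskset`). A minimal sketch of how one might check both numbers; the helper name `usable_cpus` is hypothetical, and the `os.sched_getaffinity` branch only exists on some Unix systems such as Linux:

```python
import multiprocessing
import os


def usable_cpus():
    """Return how many CPUs this process may run on (hypothetical helper)."""
    # os.sched_getaffinity is only available on some Unix systems (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback: the number of CPUs in the system.
    return multiprocessing.cpu_count()


print("CPUs in the system:", multiprocessing.cpu_count())
print("CPUs usable by this process:", usable_cpus())
```

Either number is a reasonable starting point for the `processes` argument of a pool.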

## 3. processing pool

Let's create a square function and a lot of random numbers...

In [3]:
import numpy as np

def square(x):
    return x * x

data = list(np.random.randint(0, 
                              high=10000, 
                              size=int(1e3)))

Our data is a list of 1000 random integers between 0 and 9999 (the `high` bound is exclusive)...

In [4]:
data[:10]

[4557, 9957, 5183, 877, 7103, 5864, 66, 7956, 5132, 5725]

Let's benchmark different ways of processing this data. We will use the `%%timeit` Jupyter magic; you can read more about it here: https://ipython.readthedocs.io/en/stable/interactive/magics.html

Processing it sequentially looks like this:

In [5]:
%%timeit

seq_square = list(map(square, data))

139 µs ± 2.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Now, let's use the `multiprocessing` module to square all these numbers in parallel...

In [6]:
%%timeit

pool = multiprocessing.Pool(processes=8)
result = pool.map(square, data)
pool.terminate()
pool.join()

141 ms ± 3.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


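What is happening? The parallel version is about 1000 times slower (141 ms vs. 139 µs). The cost comes from inter-process communication: starting the worker processes and sending every argument and result between them takes far longer than squaring a number (https://stackoverflow.com/a/35862168)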
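One way to get a feel for the serialization part of that cost is to pickle and unpickle every argument and result by hand. This is only a rough simulation under stated assumptions: a real pool batches items into chunks, and it also pays for pipe transfers and process start-up, which are ignored here. The helper name `roundtrip` is made up for illustration:

```python
import pickle
import time


def square(x):
    return x * x


def roundtrip(obj):
    """Serialize and deserialize obj, as happens when it crosses a process boundary."""
    return pickle.loads(pickle.dumps(obj))


data = list(range(1000))

# Plain sequential processing, no serialization involved.
start = time.perf_counter()
plain = [square(x) for x in data]
plain_time = time.perf_counter() - start

# Same work, but every argument and every result is pickled and unpickled.
start = time.perf_counter()
shipped = [roundtrip(square(roundtrip(x))) for x in data]
shipped_time = time.perf_counter() - start

print(f"plain: {plain_time * 1e6:.0f} µs, with pickling: {shipped_time * 1e6:.0f} µs")
```

The results are identical; only the time spent moving data around changes.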

Let's repeat the example adding a 0.01 second sleep to each call...

In [7]:
import time

def square(x):
    time.sleep(0.01)
    return x * x

With the sequential approach...

In [8]:
%%timeit

seq_square = list(map(square, data))

10.1 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


With multiprocessing, it runs much faster...

In [9]:
%%timeit

pool = multiprocessing.Pool(processes=8)
result = pool.map(square, data)
pool.terminate()
pool.join()

1.35 s ± 42.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
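The 1.35 s measured above is close to what simple arithmetic predicts: 1000 calls of 0.01 s spread over 8 workers. A rough sketch of that estimate, together with the same pool written as a `with` block. The names `slow_square` and `ideal_parallel_seconds` are made up for illustration, and the estimate ignores process start-up and communication costs:

```python
import math
import multiprocessing
import time


def slow_square(x):
    time.sleep(0.01)
    return x * x


def ideal_parallel_seconds(n_items, seconds_per_item, n_workers):
    """Lower bound: each worker handles ceil(n_items / n_workers) items in turn."""
    return math.ceil(n_items / n_workers) * seconds_per_item


if __name__ == "__main__":
    numbers = list(range(80))  # smaller than the benchmark, to keep the demo short
    start = time.perf_counter()
    # Leaving the with-block calls pool.terminate() for us.
    with multiprocessing.Pool(processes=8) as pool:
        result = pool.map(slow_square, numbers)
    elapsed = time.perf_counter() - start
    print(f"measured: {elapsed:.2f} s, ideal: {ideal_parallel_seconds(80, 0.01, 8):.2f} s")
    print(f"ideal for 1000 items: {ideal_parallel_seconds(1000, 0.01, 8):.2f} s")
```

The gap between the ideal figure and the measured one is the overhead of creating the pool and moving data between processes.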


But for plain arithmetic like the original `square` (without the sleep), the fastest way is using a vectorized implementation (NumPy)...

In [10]:
data = np.array(data)

In [11]:
%%timeit

np.square(data)

1.04 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


## 4. managing filesystem faster with multiprocessing

In [12]:
import os
import itertools

In [13]:
root_path = '/home/david/Nextcloud/repos/'

directories = [os.path.join(root_path, path)
               for path in os.listdir(root_path)]

directories

['/home/david/Nextcloud/repos/pragsis',
 '/home/david/Nextcloud/repos/freelance',
 '/home/david/Nextcloud/repos/infrastructure',
 '/home/david/Nextcloud/repos/learning',
 '/home/david/Nextcloud/repos/products',
 '/home/david/Nextcloud/repos/rrss']

Let's find Python files the manual way, walking each directory tree with `os.walk`:

In [14]:
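def find_py_file(path):
    """Return the paths of all .py files found below path."""
    files_list = []
    # os.walk yields each directory (root) with its subdirectories and files
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith('.py'):
                # root already starts with path, so joining root and file is enough
                files_list.append(os.path.join(root, file))
    return files_list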

In [15]:
directories[1]

'/home/david/Nextcloud/repos/freelance'

Let's test our function...

In [16]:
find_py_file(directories[1])[:10]

['/home/david/Nextcloud/repos/freelance/teaching/ironhack/ironhack-data-analytics-gurus/materials/week_7/supervised_learning_more_models/pytorch_mnist.py',
 '/home/david/Nextcloud/repos/freelance/teaching/ironhack/dataptmad-0420-classes/week_8/data_pipelines/data_pipelines_project/main_script.py',
 '/home/david/Nextcloud/repos/freelance/teaching/ironhack/dataptmad-0420-classes/week_8/data_pipelines/data_pipelines_project/p_wrangling/__init__.py',
 '/home/david/Nextcloud/repos/freelance/teaching/ironhack/dataptmad-0420-classes/week_8/data_pipelines/data_pipelines_project/p_wrangling/m_wrangling.py',
 '/home/david/Nextcloud/repos/freelance/teaching/ironhack/dataptmad-0420-classes/week_8/data_pipelines/data_pipelines_project/p_acquisition/m_acquisition.py',
 '/home/david/Nextcloud/repos/freelance/teaching/ironhack/dataptmad-0420-classes/week_8/data_pipelines/data_pipelines_project/p_acquisition/__init__.py',
 '/home/david/Nextcloud/repos/freelance/teaching/ironhack/dataptmad-0420-classes/

In [17]:
%%timeit

py_files = map(find_py_file, directories)
py_files_flatten = list(itertools.chain.from_iterable(py_files))

80.9 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [18]:
%%timeit

pool = multiprocessing.Pool()
py_files = pool.map(find_py_file, directories)
pool.terminate()
pool.join()
py_files_flatten = list(itertools.chain.from_iterable(py_files))

141 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## 5. the best way to do it...

The Pythonic way to do this 😎 (maybe not the fastest...):

In [19]:
from pathlib import Path

In [20]:
my_directory = Path(root_path)

In [21]:
%%timeit
py_files_flatten = list(my_directory.glob('**/*.py'))

134 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [22]:
py_files_flatten = list(my_directory.glob('**/*.py'))
py_files_flatten[:10]

[PosixPath('/home/david/Nextcloud/repos/freelance/teaching/ironhack/ironhack-data-analytics-gurus/materials/week_7/supervised_learning_more_models/pytorch_mnist.py'),
 PosixPath('/home/david/Nextcloud/repos/freelance/teaching/ironhack/dataptmad-0420-classes/week_8/data_pipelines/data_pipelines_project/main_script.py'),
 PosixPath('/home/david/Nextcloud/repos/freelance/teaching/ironhack/dataptmad-0420-classes/week_8/data_pipelines/data_pipelines_project/p_wrangling/__init__.py'),
 PosixPath('/home/david/Nextcloud/repos/freelance/teaching/ironhack/dataptmad-0420-classes/week_8/data_pipelines/data_pipelines_project/p_wrangling/m_wrangling.py'),
 PosixPath('/home/david/Nextcloud/repos/freelance/teaching/ironhack/dataptmad-0420-classes/week_8/data_pipelines/data_pipelines_project/p_acquisition/m_acquisition.py'),
 PosixPath('/home/david/Nextcloud/repos/freelance/teaching/ironhack/dataptmad-0420-classes/week_8/data_pipelines/data_pipelines_project/p_acquisition/__init__.py'),
 PosixPath('/ho

In [23]:
py_files_flatten[0].stat()

os.stat_result(st_mode=33204, st_ino=4854457, st_dev=2053, st_nlink=1, st_uid=1000, st_gid=1000, st_size=4867, st_atime=1571736617, st_mtime=1571736617, st_ctime=1588267543)
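The `os.walk` approach and the `pathlib` approach are expected to find the same files. A small sketch that checks this on a throwaway directory tree; the file layout and the helper names `find_py_walk` and `find_py_glob` are hypothetical, created only for this comparison:

```python
import os
import tempfile
from pathlib import Path


def find_py_walk(path):
    """Collect .py files below path with os.walk."""
    found = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            if name.endswith(".py"):
                found.append(Path(root) / name)
    return found


def find_py_glob(path):
    """Collect .py files below path with pathlib."""
    # rglob('*.py') is equivalent to glob('**/*.py')
    return list(Path(path).rglob("*.py"))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: three Python files and one text file.
    for relative in ["a.py", "notes.txt", "pkg/b.py", "pkg/sub/c.py"]:
        target = Path(tmp) / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    walk_result = sorted(find_py_walk(tmp))
    glob_result = sorted(find_py_glob(tmp))

print(len(walk_result), walk_result == glob_result)
```

Both functions return `Path` objects here, so methods such as `.stat()` are available on every result.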

## 6. other important concepts about multiprocessing

* spawn:

The parent process starts a fresh Python interpreter process. The child process will only inherit those resources necessary to run the process object's `run()` method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver.

Available on Unix and Windows. The default on Windows and macOS.
    
* fork:

The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.

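Available on Unix only. The default on Unix other than macOS in Python versions before 3.14 (from 3.14 on, the default on those platforms is forkserver).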

* forkserver:

When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process. The fork server process is single threaded so it is safe for it to use os.fork(). No unnecessary resources are inherited.

Available on Unix platforms which support passing file descriptors over Unix pipes.
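The start methods described above can be inspected and selected from code. A minimal sketch; it only queries the methods and builds a context object, it does not start any process:

```python
import multiprocessing

# Start methods supported on this platform ('spawn' is available everywhere).
available = multiprocessing.get_all_start_methods()
print("available:", available)

# The method that Pool() and Process() use when none is chosen explicitly.
print("default:", multiprocessing.get_start_method())

# A context object exposes the same API as the multiprocessing module,
# but bound to one start method, without changing the global default.
ctx = multiprocessing.get_context("spawn")
print("context uses:", ctx.get_start_method())
# ctx.Pool(...) or ctx.Process(...) would now start children with 'spawn'.
```

Using a context keeps the choice local to one piece of code, which is usually safer than changing the default for the whole program.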

## 7. process and queues

In [24]:
from multiprocessing import Process, Queue

def f(q, number):
    q.put(number)

q = Queue()

for i in range(10):
    p = Process(target=f, args=(q, np.random.randint(128)))
    p.start()
    p.join()

In [25]:
q.get()

88

In [26]:
q.put(50)
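In the loop above each process is joined right after it starts, so the ten processes run one after another. A sketch of a variant that starts all of them first, reads the queue, and only then joins. The names `put_number` and `collect` are hypothetical, and the order in which results arrive is not guaranteed:

```python
from multiprocessing import Process, Queue


def put_number(q, number):
    """Worker: send one number back to the parent through the queue."""
    q.put(number)


def collect(numbers):
    """Run one process per number and return what they put on the queue."""
    q = Queue()
    processes = [Process(target=put_number, args=(q, n)) for n in numbers]
    for p in processes:
        p.start()
    # Read the results before joining: a child that still has buffered
    # queue data may not exit until that data has been consumed.
    results = [q.get() for _ in processes]
    for p in processes:
        p.join()
    return results


if __name__ == "__main__":
    print(sorted(collect(range(10))))
```

Sorting the results makes the output stable, since the processes may finish in any order.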

<div style="padding-top: 25px; float: right">
    <div>    
        <i>&nbsp;&nbsp;© Copyright by</i>
    </div>
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="125">
        </a>
    </div>
</div>