# Compare threading and multiprocessing

In [None]:
import time

import multiprocessing as mp
from multiprocessing import Pool as ProcessPool
from multiprocessing.pool import ThreadPool

import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
sns.set_context('talk')

## Introduction

There are various choices when trying to run code in parallel.

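The `threading` module runs all threads inside a single process, which requires less overhead and allows for efficient sharing of memory. However, in standard CPython builds the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time, so CPU-bound threads are not truly parallel: a thread mainly makes progress while the others are idle, for example while they wait on I/O.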

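The `multiprocessing` module runs separate processes, each with its own interpreter, which the operating system can schedule on multiple CPU cores. It can thus execute code at the same time, at the cost of higher startup overhead and of having to copy data between processes.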
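To make the "threads live in one process" point concrete, here is a minimal sketch that asks each pool worker where it is running. The helper name `where_am_i` and the pool size of 4 are illustrative assumptions, not part of the benchmark below.

```python
import os
import threading
from multiprocessing.pool import ThreadPool


def where_am_i(_):
    """Return the process id and thread id of the caller (hypothetical helper)."""
    return os.getpid(), threading.get_ident()


# Run the helper a few times on a small thread pool.
with ThreadPool(4) as pool:
    locations = pool.map(where_am_i, range(8))

pids = {pid for pid, _ in locations}
thread_ids = {tid for _, tid in locations}

print(f"parent pid:  {os.getpid()}")
print(f"worker pids: {pids}")  # a single pid: the parent's
print(f"thread ids:  {len(thread_ids)} distinct")
```

All thread workers report the parent's process id. Swapping `ThreadPool` for `multiprocessing.Pool` would instead report the pids of the worker processes, which differ from the parent's.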

## Preparations

To investigate the differences between threading and multiprocessing,
we will simulate work for each data point and measure when it was executed.

In [None]:
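def worker(data):
    """Simulate CPU-bound work per data point and return the completion times."""
    timestamps = []
    for _point in data:
        # simulate CPU load with a busy loop
        for _ in range(1_000_000):
            pass

        # store the time at which this data point finished
        timestamps.append(time.time())
    return timestamps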

In [None]:
data = list(range(20))
num = mp.cpu_count()

## Computations

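We run the `worker` function once per pool worker, each time over the full dataset, first in a thread pool and then in a process pool. Both pools have `num` workers, so every job can start immediately.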

In [None]:
%%time
with ThreadPool(num) as p:
    thread_result = p.map(worker, [data] * num)

In [None]:
%%time
with ProcessPool(num) as p:
    process_result = p.map(worker, [data] * num)

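Next, we collect the timestamps in a long-format dataframe, with one row per job and data point, and shift them so that each pool type starts at zero.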

In [None]:
df_thread = pd.melt(pd.DataFrame(thread_result, index=[f'job {i:2}' for i in range(num)]).T)
df_thread['type'] = 'thread'

df_process = pd.melt(pd.DataFrame(process_result, index=[f'job {i:2}' for i in range(num)]).T)
df_process['type'] = 'process'

df = pd.concat([df_thread, df_process], ignore_index=True)
df['value'] = df['value'] - df.groupby('type')['value'].transform('min')  # normalize time per pool type
df.head()
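The reshaping step is easier to follow on a tiny input. The sketch below uses made-up timestamps (`toy_result` is hypothetical data, not a measurement) and applies the same transpose, melt, and per-type shift, using `transform` for the shift so that the result keeps the original row index.

```python
import pandas as pd

# Hypothetical timestamps: two jobs with three data points each.
toy_result = [[10.0, 10.5, 11.0], [10.2, 10.7, 11.2]]

# One column per job, one row per data point.
wide = pd.DataFrame(toy_result, index=["job 0", "job 1"]).T

# Long format: the job name lands in 'variable', the timestamp in 'value'.
long = pd.melt(wide)
long["type"] = "toy"

# Shift so that the earliest timestamp of each type becomes zero.
long["value"] = long["value"] - long.groupby("type")["value"].transform("min")
print(long)
```

The result has one row per (job, data point) pair, which is the shape the plotting code in the next section expects.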

## Investigation

There are two main observations:

* `multiprocessing` finishes in less total runtime than `threading`

* `multiprocessing` jobs progress at the same time, while `threading` jobs take turns, so their combined work is effectively serial

In [None]:
g = sns.FacetGrid(df, row='type', aspect=2)
g.map_dataframe(sns.scatterplot, x='value', y='variable')
g.set_axis_labels('Time [s]', 'Job')

The decision whether to use threading or multiprocessing depends on the use case.

As a general rule of thumb, one should use threading if the problem is I/O-bound, since a thread that waits on I/O releases the GIL for the others, and multiprocessing if it is CPU-bound.
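The I/O-bound half of this rule can be sketched with `time.sleep` as a stand-in for waiting on a disk or network. This is an assumption made for illustration: real I/O has its own costs. The helper `fake_io` and the delay of 0.2 s are likewise hypothetical.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_io(delay):
    """Stand-in for an I/O wait; `time.sleep` releases the GIL while waiting."""
    time.sleep(delay)
    return delay


delays = [0.2] * 4

# Serial baseline: the waits add up.
start = time.perf_counter()
serial = [fake_io(d) for d in delays]
serial_elapsed = time.perf_counter() - start

# Thread pool: the waits overlap.
start = time.perf_counter()
with ThreadPool(len(delays)) as pool:
    threaded = pool.map(fake_io, delays)
threaded_elapsed = time.perf_counter() - start

print(f"serial:   {serial_elapsed:.2f} s")
print(f"threaded: {threaded_elapsed:.2f} s")
```

Here the threaded run should take roughly as long as a single wait, whereas the busy loop in `worker` above gains nothing from threads because it never releases the GIL voluntarily.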