### Multiprocessing over Pandas Dataframe Columns

<br>

#### Objective:
Iterate over a pandas DataFrame and perform column-wise operations in separate processes that share the parent's memory through _copy-on-write_ (a consequence of _fork()_)

#### Steps:
  * Monitor used and available memory
  * Allocate large dataframe
  * Do row counts in multiple processes on columns of that dataframe


In [1]:
import multiprocessing
from functools import partial
import numpy as np
import pandas as pd
import time
import psutil

System memory check-in

In [2]:
print('used: {}% free: {:.2f}GB'.format(psutil.virtual_memory().percent, float(psutil.virtual_memory().free)/1024**3))

used: 46.5% free: 7.66GB
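
The check above calls `psutil.virtual_memory()` twice, so the two numbers come from slightly different snapshots. A minimal sketch of a reusable helper that takes one snapshot and also reports `available`, which psutil documents as the better estimate of what processes can still claim (`free` excludes reclaimable cache). The name `memory_summary` is hypothetical, not part of the original notebook.

```python
import psutil

def memory_summary():
    """Return a one-line summary of system-wide memory use."""
    vm = psutil.virtual_memory()  # single snapshot, so all fields agree
    gib = 1024 ** 3
    return 'used: {}% free: {:.2f}GB available: {:.2f}GB'.format(
        vm.percent, vm.free / gib, vm.available / gib)

print(memory_summary())
```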


Allocation of a reasonably large frame (for `cols=6` and `rows=10^8` this results in `4.47GB` of data)

In [3]:
cols = 6
columns = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
rows = 10**8
df = pd.DataFrame(np.random.randn(rows,cols), columns=columns[0:cols])
print('{0:.2f}GB'.format(float(df.memory_usage(index=True).sum())/1024**3))

4.47GB
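
The reported size can be checked by hand, assuming every value is an 8-byte `float64` (what `np.random.randn` produces) and ignoring the small overhead of the index. Note that "GB" here means 1024^3 bytes.

```python
rows, cols = 10**8, 6
bytes_per_value = 8  # float64

# total payload in units of 1024**3 bytes, matching the notebook's "GB"
expected_gib = rows * cols * bytes_per_value / 1024**3
print('{:.2f}GB'.format(expected_gib))
```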


In [4]:
# Another memory check-in
print('used: {}% free: {:.2f}GB'.format(psutil.virtual_memory().percent, float(psutil.virtual_memory().free)/1024**3))

used: 74.2% free: 3.46GB


Run a worker function on each column in a pool of processes

In [6]:
from multiprocessing import Pool, cpu_count

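def parallelise_cols(func, df, num_processes=None):
    # Default to one process per column, capped at the number of CPUs
    if num_processes is None:
        num_processes = min(df.shape[1], cpu_count())

    with Pool(num_processes) as pool:
        # df.items() yields (column name, Series) pairs; iteritems() was removed in pandas 2.0.
        # func is called for its side effects only, so the results are discarded.
        pool.map(func, df.items())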

def action(data):
    print('name: {}; row.count: {}; used: {}% free: {:.2f}GB'.format(data[0], 
                                                                     len(data[1].index),
                                                                     psutil.virtual_memory().percent, 
                                                                     float(psutil.virtual_memory().free)/1024**3))
    time.sleep(2)
    print('done sleeping {}'.format(multiprocessing.current_process()))
    return None

In [7]:
parallelise_cols(action, df, num_processes=2)


name: a; row.count: 100000000; used: 85.5% free: 1.74GB
done sleeping <ForkProcess(ForkPoolWorker-1, started daemon)>
name: b; row.count: 100000000; used: 85.7% free: 1.71GB
done sleeping <ForkProcess(ForkPoolWorker-2, started daemon)>
name: c; row.count: 100000000; used: 83.2% free: 2.09GB
done sleeping <ForkProcess(ForkPoolWorker-1, started daemon)>
name: d; row.count: 100000000; used: 84.3% free: 1.92GB
done sleeping <ForkProcess(ForkPoolWorker-2, started daemon)>
name: e; row.count: 100000000; used: 90.7% free: 0.96GB
done sleeping <ForkProcess(ForkPoolWorker-1, started daemon)>
name: f; row.count: 100000000; used: 79.2% free: 2.69GB
done sleeping <ForkProcess(ForkPoolWorker-2, started daemon)>
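
The worker above only prints. If the per-column results are needed in the parent, the worker can return a small value and the list from `pool.map` can be handed back. A minimal sketch on a tiny frame; `column_summary` and `summarise_cols` are hypothetical names, not part of the original notebook.

```python
from multiprocessing import Pool
import pandas as pd

def column_summary(item):
    """Worker: receives a (name, Series) pair and returns a small result."""
    name, series = item
    return name, len(series.index), float(series.mean())

def summarise_cols(df, num_processes=2):
    # df.items() yields (column name, Series) pairs
    with Pool(num_processes) as pool:
        # pool.map preserves input order, so results line up with the columns
        return pool.map(column_summary, df.items())

if __name__ == '__main__':
    small = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})
    print(summarise_cols(small))
```

Returning only small values keeps the traffic back to the parent cheap; returning a whole Series would pickle it on the way back.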


In [8]:
parallelise_cols(action, df, num_processes=4)


name: a; row.count: 100000000; used: 86.3% free: 1.62GB
done sleeping <ForkProcess(ForkPoolWorker-3, started daemon)>
name: b; row.count: 100000000; used: 87.3% free: 1.46GB
done sleeping <ForkProcess(ForkPoolWorker-4, started daemon)>
name: c; row.count: 100000000; used: 86.3% free: 1.62GB
done sleeping <ForkProcess(ForkPoolWorker-5, started daemon)>
name: d; row.count: 100000000; used: 88.5% free: 1.28GB
done sleeping <ForkProcess(ForkPoolWorker-6, started daemon)>
name: e; row.count: 100000000; used: 90.7% free: 0.96GB
done sleeping <ForkProcess(ForkPoolWorker-3, started daemon)>
name: f; row.count: 100000000; used: 79.6% free: 2.64GB
done sleeping <ForkProcess(ForkPoolWorker-4, started daemon)>
