Parallel Map on Files
------------------------

For each file in a set of JSON files, we parse the contents, load the data into a Pandas DataFrame, and then write the result to another file in a more convenient binary format, HDF5.

Profiling shows that parsing JSON is the slow step, so we parallelize the process using the [concurrent.futures](https://docs.python.org/3/library/concurrent.futures.html) module to do this work in multiple processes.

### Objectives

*  Profile code to find the bottleneck
*  Use `concurrent.futures` to `map` a function across many inputs in parallel


### Requirements

*  Pandas
*  concurrent.futures (standard in Python 3, `pip install futures` in Python 2)
*  snakeviz (for profile visualization, `pip install snakeviz`)


    pip install snakeviz
    pip install futures
    
### Extra Exercise

Try out alternative binary formats.  Perhaps try [feather](https://github.com/wesm/feather).

## Before we start

We need to get some data to work with.
We generate some fake stock data by adding a bunch of points between real stock data points. This will take a few minutes the first time we run it.  If you already ran `python prep.py` when going through the README then you can skip this step.

In [1]:
%run ../prep.py

Finished CSV: aet
Finished CSV: afl
Finished CSV: aig
Finished CSV: al
Finished CSV: amgn
Finished CSV: avy
Finished CSV: b
Finished CSV: bwa
Finished CSV: ge
Finished CSV: hal
Finished CSV: hp
Finished CSV: hpq
Finished CSV: ibm
Finished CSV: jbl
Finished CSV: jpm
Finished CSV: luv
Finished CSV: met
Finished CSV: pcg
Finished CSV: tgt
Finished CSV: usb
Finished CSV: xom
Finished JSON: hpq
Finished JSON: ibm
Finished JSON: avy
Finished JSON: usb
Finished JSON: afl
Finished JSON: jbl
Finished JSON: hal
Finished JSON: jpm
Finished JSON: bwa
Finished JSON: pcg
Finished JSON: aet
Finished JSON: xom
Finished JSON: tgt
Finished JSON: met
Finished JSON: hp
Finished JSON: ge
Finished JSON: amgn
Finished JSON: al
Finished JSON: luv
Finished JSON: aig
Finished JSON: b


## Sequential Execution

In [2]:
%load_ext snakeviz

In [3]:
from glob import glob
import json
import pandas as pd
import os

In [4]:
filenames = sorted(glob(os.path.join('..', 'data', 'json', '*.json')))  # ../data/json/*.json
filenames[:5]

['../data/json/aet.json',
 '../data/json/afl.json',
 '../data/json/aig.json',
 '../data/json/al.json',
 '../data/json/amgn.json']
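
The loops in this notebook build each output name by slicing five characters off the filename (`fn[:-5] + '.h5'`), which assumes every name ends in `.json`. A slightly more defensive sketch, using the standard library's `pathlib`, might look like the following. The helper name `hdf5_path` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def hdf5_path(json_filename):
    """Return the sibling .h5 path for a .json file (hypothetical helper)."""
    path = Path(json_filename)
    if path.suffix != ".json":
        raise ValueError(f"expected a .json file, got {json_filename!r}")
    # with_suffix swaps only the final extension and keeps the directory.
    return str(path.with_suffix(".h5"))

print(hdf5_path("../data/json/aet.json"))
```

The slicing approach is fine for the files generated by `prep.py`; the explicit suffix check mainly guards against a stray non-JSON file matching a looser glob pattern.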

In [5]:
%%snakeviz

for fn in filenames:
    print(fn)
    with open(fn) as f:
        data = [json.loads(line) for line in f]
        
    df = pd.DataFrame(data)
    
    out_filename = fn[:-5] + '.h5'
    df.to_hdf(out_filename, key='/data')

../data/json/aet.json
../data/json/afl.json
../data/json/aig.json
../data/json/al.json
../data/json/amgn.json
../data/json/avy.json
../data/json/b.json
../data/json/bwa.json
../data/json/ge.json
../data/json/hal.json
../data/json/hp.json
../data/json/hpq.json
../data/json/ibm.json
../data/json/jbl.json
../data/json/jpm.json
../data/json/luv.json
../data/json/met.json
../data/json/pcg.json
../data/json/tgt.json
../data/json/usb.json
../data/json/xom.json
 
*** Profile stats marshalled to file '/tmp/tmp48h9ucti'. 
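
Outside a notebook, a similar profile can be collected with the standard library's `cProfile` and `pstats` modules, which produce the same kind of statistics that snakeviz visualizes. The sketch below is an illustration under assumptions: it profiles a small in-memory stand-in for one JSON-lines file rather than the real data, and the record fields (`time`, `price`) and the function name `parse_lines` are hypothetical.

```python
import cProfile
import io
import json
import pstats

# Hypothetical stand-in for the lines of one JSON-lines file.
lines = [json.dumps({"time": i, "price": i * 0.5}) for i in range(20_000)]

def parse_lines(lines):
    """Parse every line as JSON, mirroring the loop body above."""
    return [json.loads(line) for line in lines]

# Record only the calls made between enable() and disable().
profiler = cProfile.Profile()
profiler.enable()
records = parse_lines(lines)
profiler.disable()

# Render the ten most expensive calls, sorted by cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

In the printed table, the rows with the largest cumulative time point at the bottleneck, just as the widest blocks do in the snakeviz view.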


Parallel Execution
--------------------

We can process each file independently and in parallel.  To accomplish this we'll transform the body of our for loop into a function and then use the [concurrent.futures.ProcessPoolExecutor](https://docs.python.org/3/library/concurrent.futures.html#executor-objects) to apply that function across all of the filenames in parallel using multiple processes.

### Before

Whenever we have code like the following:

```python
results = []
for x in L:
    results.append(f(x))
```

or the following:

```python
results = [f(x) for x in L]
```

or the following:

```python
results = list(map(f, L))
```

### After

We can instead write it as the following:

```python
from concurrent.futures import ProcessPoolExecutor
e = ProcessPoolExecutor()

results = list(e.map(f, L))
```
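
A common variation, sketched here as an assumption rather than something the notebook itself does, is to use the executor as a context manager so that its workers are shut down once the results are collected. `Executor.map` yields results in the same order as the inputs. The helper `parallel_map` and the function `square` are hypothetical names used only for illustration.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def parallel_map(func, items, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Apply func to every item and return the results in input order."""
    # Leaving the with-block calls executor.shutdown(), so no worker
    # processes or threads are left running afterwards.
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(func, items))

def square(x):
    return x * x

# ThreadPoolExecutor shares the Executor interface and is used here only so
# the demo runs without starting processes; drop the executor_cls argument
# to get the process-based default.
squares = parallel_map(square, range(5), executor_cls=ThreadPoolExecutor)
print(squares)
```

When the process-based default is used from a script rather than a notebook, the call should generally sit under an `if __name__ == "__main__":` guard, because worker processes may re-import the main module on some platforms.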

### Example

In [6]:
%%time

### Sequential code

import time

results = []

for i in range(8):
    time.sleep(1)
    results.append(i + 1)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 8.01 s


In [None]:
#!pip install loky  # Windows users will need to install loky

In [7]:
%%time

### Parallel code

from concurrent.futures import ProcessPoolExecutor
# from loky import ProcessPoolExecutor  # for Windows users

e = ProcessPoolExecutor()

def slowinc(x):
    time.sleep(1)
    return x + 1

results = list(e.map(slowinc, range(8)))

CPU times: user 16 ms, sys: 24 ms, total: 40 ms
Wall time: 2.04 s
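
The drop from about 8 s to about 2 s is consistent with the eight one-second tasks being spread over four worker processes. By default, `ProcessPoolExecutor` starts up to as many workers as the machine has processors. The core count of the machine that produced these timings is not stated, so four workers is an assumption. A back-of-the-envelope estimate that ignores process start-up overhead (the function name is hypothetical):

```python
import math

def estimated_wall_time(n_tasks, n_workers, seconds_per_task=1.0):
    """Estimate wall time when equal-length tasks run in waves of n_workers."""
    waves = math.ceil(n_tasks / n_workers)
    return waves * seconds_per_task

for workers in (1, 2, 4, 8):
    print(workers, "workers:", estimated_wall_time(8, workers), "s")
```

The estimate only holds for tasks of equal duration; with uneven tasks, the longest-running worker determines the wall time.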


### Exercise:  Convert JSON data to HDF5 in parallel using `concurrent.futures.Executor.map`

In [None]:
import json

In [10]:
%%time

### Sequential code

for fn in filenames:
    with open(fn) as f:
        data = [json.loads(line) for line in f]
        
    df = pd.DataFrame(data)
    
    out_filename = fn[:-5] + '.h5'
    df.to_hdf(out_filename, key='/data')

CPU times: user 12.5 s, sys: 372 ms, total: 12.9 s
Wall time: 12.8 s


In [15]:
%%snakeviz
# try replacing %%time with %%snakeviz when everything's working
# to get a profile

### Parallel code

def json_to_hdf5(fname):
    # Parse one JSON record per line
    with open(fname) as f:
        data = [json.loads(line) for line in f]
    df = pd.DataFrame(data)
    # Write next to the input file, swapping the .json suffix for .h5
    out_filename = fname[:-5] + '.h5'
    df.to_hdf(out_filename, key='/data')

e = ProcessPoolExecutor()
list(e.map(json_to_hdf5, filenames))


 
*** Profile stats marshalled to file '/tmp/tmptogunz2e'. 


In [16]:
%%time
# %load solutions/map-1.py
# Parallel code

from concurrent.futures import ProcessPoolExecutor
e = ProcessPoolExecutor()

def load_parse_store(fn):
    with open(fn) as f:
        data = [json.loads(line) for line in f]

    df = pd.DataFrame(data)

    out_filename = fn[:-5] + '.h5'
    df.to_hdf(out_filename, key='/data')

list(e.map(load_parse_store, filenames))


CPU times: user 20 ms, sys: 36 ms, total: 56 ms
Wall time: 7.43 s


Try visualizing your parallel version with `%%snakeviz`. Where does it look like it's spending all its time?
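
A profiler attached to the parent process only sees that process, which mostly waits for results from the workers. One way to recover some visibility, sketched here under assumptions, is to measure the time inside each task and send it back alongside the result. The names `timed_call` and `slow_double` are hypothetical, and a thread pool is used only so the demo runs without starting processes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat

def timed_call(func, arg):
    """Run func(arg) in the worker and report how long it took there."""
    start = time.perf_counter()
    result = func(arg)
    elapsed = time.perf_counter() - start
    return arg, result, elapsed

def slow_double(x):
    # Hypothetical stand-in for an expensive task such as parsing a file.
    time.sleep(0.05)
    return 2 * x

# ProcessPoolExecutor accepts the same map call, provided that func and
# timed_call are defined at module level so they can be pickled.
with ThreadPoolExecutor(max_workers=4) as executor:
    timings = list(executor.map(timed_call, repeat(slow_double), range(4)))

for arg, result, elapsed in timings:
    print(f"task {arg}: result={result}, {elapsed:.3f} s inside the worker")
```

Comparing the sum of the per-task times with the overall wall time gives a rough sense of how much of the run is actual work and how much is overhead or waiting.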

Parallelism isn't everything
--------------------------------

Using multiple processes gives a moderate speedup: wall time drops from 12.8 s to 7.43 s.  However, parallelism isn't the only way to accelerate this computation.  Recall that the bulk of the cost comes from the `json.loads` function.  A quick internet search for "fast json parsing in python" yields the [ujson](https://pypi.python.org/pypi/ujson) library as the top hit.

Simply importing the optimized `ujson` library is about as effective as multi-core execution: the sequential loop below runs in 6.35 s.

In [17]:
import ujson

In [18]:
%%time
filenames = sorted(glob(os.path.join('..', 'data', 'json', '*.json')))

for fn in filenames:
    with open(fn) as f:
        data = [ujson.loads(line) for line in f]
        
    df = pd.DataFrame(data)
    
    out_filename = fn[:-5] + '.h5'
    df.to_hdf(out_filename, key='/data')

CPU times: user 6.04 s, sys: 304 ms, total: 6.35 s
Wall time: 6.35 s
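
Because `ujson` is a third-party package, code that has to run where it may not be installed can guard the import and fall back to the standard library. This is a minimal sketch of that pattern, not something the original notebook does; the alias `fast_json` and the record fields are hypothetical.

```python
import json

try:
    import ujson as fast_json  # optional third-party dependency
except ImportError:
    fast_json = json  # fall back to the standard library parser

# Hypothetical record in the one-JSON-object-per-line layout used above.
line = '{"symbol": "aet", "close": 12.5}'
record = fast_json.loads(line)
print(record)
```

Both modules expose a `loads` function that accepts a string, so the loop body stays the same whichever parser is picked up.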


History: multiprocessing.Pool
--------------------------------

Previously, people did multi-process computations with the `multiprocessing.Pool` object, which behaves almost identically.

However, today most library designers are coordinating around the `concurrent.futures` interface, so it's wise to move over.

In [None]:
%%time 

from multiprocessing import Pool
p = Pool()

list(p.map(load_parse_store, filenames))
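
One small difference between the two interfaces is that `Pool.map` returns a list, while `Executor.map` returns a lazy iterator, which is why the earlier cells wrap it in `list(...)`. The sketch below shows this with the thread-based counterparts of both classes (`multiprocessing.pool.ThreadPool` and `ThreadPoolExecutor`), chosen here only so that it runs without starting processes. The function `inc` is a hypothetical stand-in.

```python
from concurrent.futures import ThreadPoolExecutor
from multiprocessing.pool import ThreadPool

def inc(x):
    return x + 1

# Pool-style map blocks until every task is done and returns a list.
with ThreadPool(4) as pool:
    pool_results = pool.map(inc, range(4))

# Executor-style map returns an iterator; list() collects the results.
with ThreadPoolExecutor(max_workers=4) as executor:
    lazy_results = executor.map(inc, range(4))
    executor_results = list(lazy_results)

print(type(pool_results).__name__, pool_results)
print(executor_results)
```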

Conclusion
-----------

*  Used `snakeviz` to profile code
*  Used `concurrent.futures.ProcessPoolExecutor` for simple parallelism across many files
    *  Gained some speed boost (but not as much as expected)
    *  Lost ability to diagnose performance within parallel code
*  Describing each task as a function call makes it easy to use tools like `map` for parallelism
*  Saw that parallelism is not the only way to speed up code; switching to a faster library such as `ujson` can help just as much.
*  Making your tasks fast is often at least as important as parallelizing your tasks.