This example shows how to run the [proper noun counting example](/GetAllProperNouns.ipynb) in parallel using IPython Parallel (`ipyparallel`).

To start, you'll want to install IPython Parallel,

`pip install ipyparallel`

and start a cluster (a controller plus its engines) in the same folder as this notebook. This example starts 4 engines:

`ipcluster start -n 4`

In [None]:
from htrc_features import FeatureReader
import glob
import pandas as pd

In [None]:
paths = glob.glob('../data/PZ-volumes/*.basic.json.bz2')

## Timing the normal, single-thread code

In [None]:
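idx = pd.IndexSlice
def get_proper_nouns_normal(vol):
    # Token counts for the whole volume rather than per page
    tl = vol.tokenlist(pages=False)
    # Drop the outermost index level, leaving (token, pos)
    tl.index = tl.index.droplevel(0)
    # Add the volume's year as an index level: (date, token, pos)
    tl['date'] = vol.year
    tl = tl.set_index('date', append=True).reorder_levels(['date', 'token', 'pos'])
    try:
        # Keep only singular and plural proper nouns, then drop the pos level
        proper_nouns = tl.loc[idx[:,:,('NNP', 'NNPS')],]
        proper_nouns.index = proper_nouns.index.droplevel(2)
        return proper_nouns[proper_nouns['count'] > 1]
    except Exception:
        # The lookup failed (e.g. no matching tokens): return an empty frame.
        # pandas is imported as `pd` in this notebook, so `pandas.DataFrame` would be a NameError.
        return pd.DataFrame()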
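
The core of the function is the `pd.IndexSlice` lookup on the `(date, token, pos)` index. A minimal sketch on a hand-made token list may make the selection easier to follow; the tokens, counts, and year below are invented for illustration and do not come from the HTRC data.

```python
import pandas as pd

idx = pd.IndexSlice

# Toy token list with the same index layout as `tl` above: (date, token, pos).
toy_index = pd.MultiIndex.from_tuples(
    [
        (1900, "London", "NNP"),
        (1900, "Alps", "NNPS"),
        (1900, "Paris", "NNP"),
        (1900, "walk", "VB"),
    ],
    names=["date", "token", "pos"],
)
# Sorting the index keeps MultiIndex slicing well-defined.
toy_tl = pd.DataFrame({"count": [3, 1, 2, 5]}, index=toy_index).sort_index()

# Select every date and token, but only the proper-noun tags.
proper_nouns = toy_tl.loc[idx[:, :, ["NNP", "NNPS"]], :]
# The tag is no longer needed once the rows are filtered.
proper_nouns.index = proper_nouns.index.droplevel("pos")
# Keep proper nouns that occur more than once.
result = proper_nouns[proper_nouns["count"] > 1]
print(result)
```

Here "walk" is dropped because of its tag and "Alps" because its count is 1, leaving "London" and "Paris".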

In [None]:
def test():
    fr = FeatureReader(paths)
    nnp_dfs = []
    for vol in fr.volumes():
        nnp_dfs.append(get_proper_nouns_normal(vol))
    all_nnp = pd.concat(nnp_dfs)
%timeit test()

1 loops, best of 3: 26.1 s per loop


## Timing the ipyparallel code (4 engines, one volume at a time)

In [None]:
from ipyparallel import Client, require
# Create the client
c = Client()
# Create a load-balanced view
lview = c.load_balanced_view()

            Controller appears to be listening on localhost, but not on this machine.
            If this is true, you should specify Client(...,sshserver='you@172.31.99.245')
            or instruct your controller to listen on an external IP.


The function takes a path, creates a `FeatureReader` for just that path, gets the volume, and processes it:

In [None]:
@lview.parallel()
@require('pandas', 'htrc_features')
def get_proper_nouns(path):
    idx = pandas.IndexSlice
    vol = next(htrc_features.FeatureReader(path).volumes())
    tl = vol.tokenlist(pages=False)
    tl.index = tl.index.droplevel(0)
    tl['date'] = vol.year
    tl = tl.set_index('date', append=True).reorder_levels(['date', 'token', 'pos'])
    try:
        proper_nouns = tl.loc[idx[:,:,('NNP', 'NNPS')],]
        proper_nouns.index = proper_nouns.index.droplevel(2)
        return proper_nouns[proper_nouns['count'] > 1]
    except:
        return pandas.DataFrame()

In [None]:
%timeit pd.concat(get_proper_nouns.map(paths))

1 loops, best of 3: 4.65 s per loop


About 5.6 times faster.

## Timing the ipyparallel code (4 engines, 4 volumes at a time)

Importing the required libraries on the engines is the main time bottleneck for this approach. I'm not sure if ipyparallel does anything fancy to mitigate the `import` time, but ideally we wouldn't send paths to an engine one at a time. If we were processing lots of data, I would probably send ~100 volumes at a time. Since we're testing with just 15 volumes, let's see if there's a speed improvement from sending 4 paths at a time.
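
One piece of this can be checked locally: within a single Python process, a module is loaded once and later imports reuse the cached entry in `sys.modules`. The sketch below uses the standard-library `json` module as a stand-in and a hypothetical helper, `timed_import`. Whether the same caching applies between tasks on an ipyparallel engine is an assumption here, not something this notebook verifies.

```python
import importlib
import sys
import time


def timed_import(name):
    """Import a module by name and return (module, seconds taken)."""
    start = time.perf_counter()
    module = importlib.import_module(name)
    return module, time.perf_counter() - start


# `json` is only a stand-in; a heavier module such as pandas shows a larger gap
# when it has not been imported yet in the current process.
first_module, first_seconds = timed_import("json")
second_module, second_seconds = timed_import("json")

# Both calls return the same cached module object from sys.modules.
same_object = first_module is second_module
print(same_object, "json" in sys.modules)
print(f"first: {first_seconds:.6f} s, second: {second_seconds:.6f} s")
```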

In [None]:
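# Despite its name, this is the slice stride, i.e. the number of chunks
paths_per_engine = 4
# Start the range at 0 so that paths[0] is not skipped
multipaths = [paths[i::paths_per_engine] for i in range(paths_per_engine)]
multipaths[2]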

['../data/PZ-volumes/hvd.hwrevu.basic.json.bz2',
 '../data/PZ-volumes/njp.32101068970662.basic.json.bz2',
 '../data/PZ-volumes/uc2.ark+=13960=t0tq5v13m.basic.json.bz2',
 '../data/PZ-volumes/uiuo.ark+=13960=t72v2t63s.basic.json.bz2']
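
The strided slicing used to build `multipaths` can be checked in isolation. This is a minimal sketch with invented file names and a hypothetical helper, `strided_chunks`. It starts the range at 0 so that every path lands in exactly one chunk.

```python
def strided_chunks(items, n_chunks):
    """Split items into n_chunks lists by taking every n_chunks-th element."""
    return [items[i::n_chunks] for i in range(n_chunks)]


# Hypothetical stand-ins for the 15 volume paths used in this notebook.
example_paths = [f"volume_{i:02d}.basic.json.bz2" for i in range(15)]
chunks = strided_chunks(example_paths, 4)

# Chunk sizes follow from the list length and the stride.
print([len(chunk) for chunk in chunks])  # [4, 4, 4, 3]

# Flatten the chunks again to confirm nothing was lost or duplicated.
flattened = [path for chunk in chunks for path in chunk]
print(sorted(flattened) == sorted(example_paths))  # True
```

Note that the stride sets the number of chunks; the number of paths in each chunk depends on how many paths there are in total.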

The same function as before, but it now takes a list of paths and returns results for multiple volumes.

In [None]:
@lview.parallel()
@require('pandas', 'htrc_features')
def get_proper_nouns_multi(paths):
    idx = pandas.IndexSlice
    fr = htrc_features.FeatureReader(paths)
    dfs = []
    for vol in fr.volumes():
        tl = vol.tokenlist(pages=False)
        tl.index = tl.index.droplevel(0)
        tl['date'] = vol.year
        tl = tl.set_index('date', append=True).reorder_levels(['date', 'token', 'pos'])
        try:
            proper_nouns = tl.loc[idx[:,:,('NNP', 'NNPS')],]
            proper_nouns.index = proper_nouns.index.droplevel(2)
            dfs.append(proper_nouns[proper_nouns['count'] > 1])
        except:
            pass
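    try:
        # Use `pandas` here: the `pd` alias exists only in the notebook, not on the engines
        return pandas.concat(dfs)
    except ValueError:
        # pandas.concat raises ValueError when there is nothing to concatenate
        return pandas.DataFrame()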

In [None]:
%timeit pd.concat(get_proper_nouns_multi.map(multipaths))

1 loops, best of 3: 4.33 s per loop


A modest improvement (about 7% less time than sending one volume at a time), though the gain may be larger on bigger datasets.
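
For reference, the ratios between the three timings reported above can be worked out directly. The sketch only restates the `%timeit` results from this notebook; the variable names are made up for illustration.

```python
# Best-of-3 timings reported by %timeit above, in seconds.
single_thread = 26.1
one_volume_per_task = 4.65
four_chunks = 4.33

# Speedup relative to the single-thread run.
speedup_one = single_thread / one_volume_per_task
speedup_chunks = single_thread / four_chunks

# Share of time saved by chunking compared with one volume per task.
saving = (one_volume_per_task - four_chunks) / one_volume_per_task

print(f"one volume per task: {speedup_one:.1f}x")   # 5.6x
print(f"chunked paths:       {speedup_chunks:.1f}x")  # 6.0x
print(f"time saved by chunking: {saving:.1%}")        # 6.9%
```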