# Multiprocessing example

This example shows how to use the multiprocessing support built into the Feature Reader.

In [None]:
from htrc_features import FeatureReader
import glob
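# Collect the paths of the bzip2-compressed feature files
paths = glob.glob('../data/PZ-volumes/*basic.json.bz2')
# Use only the first two volumes for this demonstration
fr = FeatureReader(paths[:2])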

`FeatureReader.multiprocessing(map_func)` creates an iterator. For each path, the map function is called with a single tuple containing a feature reader and the volume path.

The mapping function can use this tuple to create a volume object, do some processing, and send a response back. Here's an example that simply returns `vol.tokens`, tagged with a key:

In [None]:
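def printTokenList(args):
    # Each call receives one (feature reader, volume path) tuple
    fr, path = args
    vol = fr.create_volume(path)
    # Return a (key, value) pair so the caller can unpack each result
    return ('tokens', vol.tokens)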
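
To make the data flow concrete without needing the library or the data files, here is a minimal sketch of the same map pattern using only the standard library. Everything in it is an assumption for illustration: `FakeReader`, `token_list`, and `run_mapper` are hypothetical names, and this is not necessarily how `FeatureReader.multiprocessing` works internally. It uses threads via `ThreadPoolExecutor` so that it runs anywhere; `ProcessPoolExecutor` offers the same `map` method with separate processes.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeReader:
    """Hypothetical stand-in for FeatureReader; it only mimics create_volume()."""

    def create_volume(self, path):
        # Pretend the "volume" is a dict whose tokens come from the path name.
        return {"tokens": path.split("-")}


def token_list(args):
    # Same contract as the map function above: one (reader, path) tuple in,
    # one (key, value) tuple out.
    reader, path = args
    vol = reader.create_volume(path)
    return ("tokens", vol["tokens"])


def run_mapper(reader, paths, map_func, workers=2):
    """Call map_func once per path on a pool of workers and collect the results."""
    tasks = [(reader, path) for path in paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the tasks
        return list(pool.map(map_func, tasks))


results = run_mapper(FakeReader(), ["red-green", "green-blue"], token_list)
for key, tokens in results:
    print(key, tokens)
```

With real processes, the map function and its arguments must be picklable, which is one reason to keep the map function a plain top-level `def` like the one above.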

In [None]:
mapper = fr.multiprocessing(printTokenList)

# Reduce
all_tokens = []
for key, result in mapper:
    all_tokens = all_tokens + result

# Print all unique
set(all_tokens)

{u'funereal',
 u'systematic',
 u'raining',
 u'and\u2014some',
 u'Debts',
 u'yellow',
 u'four',
 u'prices',
 u'Does',
 u'hanging',
 u'ringlets',
 u'woody',
 u'Hello!',
 u'marching',
 u'looking',
 u'self-pity',
 u'kens',
 u'rupture',
 u'wheeled',
 u'Western',
 u'lord',
 u'Municipio',
 u'sinking',
 u'swivel',
 u'bile',
 u'powders',
 u'Evanses',
 u'bathing-place',
 u'strictest',
 u'bringing',
 u'disturb',
 u'recollections',
 u'internally',
 u'scholar',
 u'buttonhole',
 u'persisted',
 u'woods',
 u'Paul',
 u'reliable',
 u'specially',
 u'tired',
 u'ornate',
 u'precocity',
 u'pulse',
 u'270',
 u'elegant',
 u'second',
 u'273',
 u'chaffingly',
 u'277',
 u'278',
 u'shrugging',
 u'Burned',
 u'haughtily',
 u'inanimate',
 u'admire',
 u'errors',
 u'relieving',
 u'thunder',
 u'Involuntarily',
 u'fingers',
 u'\u2014which',
 u'hostile',
 u'Hamilton',
 u'hull',
 u'increasing',
 u'succumb',
 u'exclamations',
 u'practi227',
 u'insinuation',
 u'misjudged',
 u'Logging',
 u'avert',
 u'reporter',
 u'herb',
 ...
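
The reduce step concatenates lists and only removes duplicates at the end. As a hedged alternative, the sketch below merges each result straight into a set, which avoids copying a growing list on every iteration. `unique_tokens` and `fake_results` are hypothetical names; the fake data stands in for the `(key, result)` pairs that the mapper yields.

```python
def unique_tokens(results):
    """Merge (key, tokens) pairs into a single set of unique tokens."""
    unique = set()
    for key, tokens in results:
        # update() adds every token in place; duplicates are dropped
        unique.update(tokens)
    return unique


# Hypothetical mapper output: one (key, tokens) pair per volume
fake_results = [
    ("tokens", ["yellow", "four", "prices"]),
    ("tokens", ["four", "woods"]),
]

merged = unique_tokens(fake_results)
print(sorted(merged))
```

The same loop body should work on the real `mapper` as long as each result is an iterable of tokens.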

Let's compare run times on a personal computer, processing 20 volumes first in a single process and then with multiprocessing.

In [None]:
def time_regular(paths):
    # Baseline: read the volumes one at a time in a single process
    fr = FeatureReader(paths)
    all_tokens = []
    for vol in fr.volumes():
        all_tokens = all_tokens + vol.tokens
    return set(all_tokens)

%timeit -n 3 time_regular(paths[:20])

3 loops, best of 3: 20.2 s per loop


In [None]:
def time_multi_proc(paths):
    # Same work, but the volumes are handled by the multiprocessing mapper
    fr = FeatureReader(paths)
    all_tokens = []
    mapper = fr.multiprocessing(printTokenList)
    for key, result in mapper:
        all_tokens = all_tokens + result
    return set(all_tokens)

%timeit -n 3 time_multi_proc(paths[:20])

3 loops, best of 3: 10.1 s per loop
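
`%timeit` is an IPython magic, so it is only available in a notebook or IPython shell. For a plain Python script, a rough equivalent can be sketched with `time.perf_counter`. The helper `best_of` and the workload `sample_workload` are hypothetical names; the workload is a stand-in for a call such as `time_regular(paths[:20])`, and the helper is simpler than `%timeit` (it times one call per repeat rather than several loops).

```python
import time


def best_of(func, repeats=3):
    """Return the fastest wall-clock time, in seconds, over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def sample_workload():
    # Hypothetical stand-in for time_regular(paths[:20])
    # or time_multi_proc(paths[:20])
    return set(str(i) for i in range(10000))


elapsed = best_of(sample_workload)
print("best of 3: {:.4f} s per call".format(elapsed))
```

Taking the minimum rather than the mean reduces the influence of other activity on the machine, which matters on a personal computer.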
