Sometimes it is useful to output some additional records after all lines are processed.
For example, you maintain some data structure and update it on every call of the __call__() method. After the mapper/reducer/combiner finishes processing its input, you can iterate through the structure and output some additional records.
As far as I know, Java Hadoop supports this feature. Nothing in streaming prevents it either, so it could easily be implemented in dumbo.
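    def mapper(_, line):
        for word in line.strip().split():
            yield word, 1

    class Reducer:
        def __init__(self):
            self.nwords = 0

        def __call__(self, word, values):
            s = sum(values)
            self.nwords += s
            yield word, s

        def cleanup(self):
            # Called once, after all input has been processed.
            yield 'Total words', self.nwords

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper, Reducer, combiner=dumbo.sumreducer)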
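To make the intended semantics concrete, here is a rough sketch in plain Python of how a driver could call such a hook once the input is exhausted. `run_reducer` is a hypothetical helper written only for illustration, not dumbo's actual code, and it assumes the hook is a method named `cleanup` that yields extra key/value pairs.

```python
from itertools import groupby
from operator import itemgetter


class Reducer:
    def __init__(self):
        self.nwords = 0

    def __call__(self, word, values):
        s = sum(values)
        self.nwords += s
        yield word, s

    def cleanup(self):
        # Assumed hook name: emits extra records after the last key.
        yield 'Total words', self.nwords


def run_reducer(reducer_cls, pairs):
    """Hypothetical local driver: feeds (key, value) pairs to a reducer."""
    reducer = reducer_cls()
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        for out in reducer(key, (value for _, value in group)):
            yield out
    # All input has been consumed: call the optional cleanup hook.
    cleanup = getattr(reducer, 'cleanup', None)
    if cleanup is not None:
        for out in cleanup():
            yield out


pairs = [('b', 1), ('a', 1), ('b', 1)]
result = list(run_reducer(Reducer, pairs))
```

The hook is looked up with `getattr`, so reducers without a `cleanup` method would keep working unchanged.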
Fix #60 Added cleanup functionality
Is it possible to add this example to the short tutorial? It took me a while to find it. Thanks.
I just added it to the "further reading" section at the end. Might be possible to integrate it more somehow I guess, but now it should at least be easier to find...
Java Hadoop also supports setup methods that run before the mapper/reducer starts processing lines (e.g., to open and read a file). Should this be done in the __init__ method or in a separate method?
You should implement a configure(self) method for initialization routines.
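A minimal sketch of what that might look like, in plain Python so it can be tried without a cluster. `run_mapper` is a hypothetical stand-in for the framework and only illustrates the assumed call order (construct, then `configure()`, then one call per input line); the stopword set is made-up example data, where a real job might read a side file instead.

```python
class Mapper:
    def configure(self):
        # One-time setup before any input is seen.
        self.stopwords = set(['the', 'a'])

    def __call__(self, _, line):
        for word in line.strip().split():
            if word not in self.stopwords:
                yield word, 1


def run_mapper(mapper_cls, lines):
    """Hypothetical local driver, only to illustrate the call order."""
    mapper = mapper_cls()
    configure = getattr(mapper, 'configure', None)
    if configure is not None:
        configure()
    for offset, line in enumerate(lines):
        for out in mapper(offset, line):
            yield out


output = list(run_mapper(Mapper, ['the cat', 'a dog sat']))
```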