Add cleanup functionality #60

Closed
mshevelev opened this Issue · 5 comments

5 participants

@mshevelev

Sometimes it is useful to output some additional records after all lines are processed.
For example, you maintain some data structure and update it on every call of the __call__() method. After the mapper/reducer/combiner finishes processing its input, you can iterate through the structure and output some additional records.

As far as I know, Hadoop's Java API supports this feature. Nothing in Hadoop Streaming prevents it, so it should be easy to implement in dumbo.
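
To illustrate the idea, here is a minimal sketch of how a streaming-style driver loop could support an optional cleanup() method. It is not dumbo's actual implementation: run_reducer and CountingReducer are hypothetical names, and the input is assumed to be an in-memory list of (key, value) pairs already sorted by key.

```python
from itertools import groupby
from operator import itemgetter


def run_reducer(reducer, pairs):
    """Hypothetical driver: feed key-sorted (key, value) pairs to a reducer,
    then emit whatever its optional cleanup() method yields."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = (value for _, value in group)
        for record in reducer(key, values):
            yield record
    # After the last input record, give the reducer a chance to emit extras.
    cleanup = getattr(reducer, "cleanup", None)
    if callable(cleanup):
        for record in cleanup():
            yield record


class CountingReducer:
    def __init__(self):
        self.nwords = 0  # state kept across calls

    def __call__(self, word, values):
        total = sum(values)
        self.nwords += total
        yield word, total

    def cleanup(self):
        yield "Total words", self.nwords


pairs = sorted([("a", 1), ("b", 1), ("a", 1)])
output = list(run_reducer(CountingReducer(), pairs))
```

The driver only looks for cleanup() with getattr, so plain function reducers without that method would keep working unchanged.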

@mshevelev

Example usage:

import dumbo

def mapper(_, line):
    for word in line.strip().split():
        yield word, 1

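class Reducer:

    def __init__(self):
        # Running total of all word occurrences seen by this reducer.
        self.nwords = 0

    def __call__(self, word, values):
        s = sum(values)
        self.nwords += s
        yield word, s

    def cleanup(self):
        # Runs after all input has been processed; emits the extra record.
        yield 'Total words', self.nwords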

dumbo.run(mapper, Reducer, combiner=dumbo.sumreducer)

@klbostee pushed a commit that closed this issue
Mikhail Shevelev Fix #60 Added cleanup functionality 4f7b037
@klbostee closed this in 4f7b037
@kzhai

Is it possible to add this example to the short tutorial? It took me a while to find it. Thanks.

@klbostee
Owner

I just added it to the "further reading" section at the end. Might be possible to integrate it more somehow I guess, but now it should at least be easier to find...

@scottkwong

Java Hadoop also supports setup methods that run before the mapper/reducer starts processing lines (e.g., to open and read a file). Should this be done in the __init__ method or in a separate method?

@a4tunado

You should implement a configure(self) method for initialization routines.
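
A rough sketch of what that could look like, assuming the framework calls configure() once before the first record and cleanup() once after the last one. StopwordMapper, run_mapper and the hard-coded stop-word set are hypothetical; run_mapper only stands in for the framework's driver loop so the example runs locally.

```python
class StopwordMapper:
    def configure(self):
        # One-time setup; a real job might open and read a side file here.
        self.stopwords = {"the", "a"}
        self.skipped = 0

    def __call__(self, _, line):
        for word in line.strip().split():
            if word in self.stopwords:
                self.skipped += 1
            else:
                yield word, 1

    def cleanup(self):
        # Emitted once, after all lines have been processed.
        yield "Skipped words", self.skipped


def run_mapper(mapper, lines):
    """Hypothetical stand-in for the framework's driver loop."""
    configure = getattr(mapper, "configure", None)
    if callable(configure):
        configure()
    for offset, line in enumerate(lines):
        for record in mapper(offset, line):
            yield record
    cleanup = getattr(mapper, "cleanup", None)
    if callable(cleanup):
        for record in cleanup():
            yield record


records = list(run_mapper(StopwordMapper(), ["the cat", "a dog sat"]))
```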
