Add cleanup functionality #60

Closed
mshevelev opened this Issue Nov 1, 2012 · 5 comments

Sometimes it is useful to output additional records after all input has been processed.
For example, you maintain a data structure and update it on every call of the __call__() method. After the mapper/reducer/combiner finishes processing its input, you can iterate through the structure and output some additional records.

As far as I know, Java Hadoop supports this feature, and nothing in streaming prevents it. It can be easily implemented in dumbo.

mshevelev Nov 1, 2012

Example usage:

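import dumbo

def mapper(_, line):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in line.strip().split():
        yield word, 1

class Reducer:

    def __init__(self):
        # Running total of all words seen by this reducer.
        self.nwords = 0

    def __call__(self, word, values):
        s = sum(values)
        self.nwords += s
        yield word, s

    def cleanup(self):
        # Called once after all input is processed; emits the extra record.
        yield 'Total words', self.nwords

dumbo.run(mapper, Reducer, combiner=dumbo.sumreducer)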
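
To see what the proposed cleanup() hook would add to the output without running a Hadoop job, the reduce loop can be imitated locally. The sketch below is only an illustration: run_reducer_locally is a hypothetical helper written for this example and is not part of dumbo, and the input pairs are made up.

```python
from itertools import groupby
from operator import itemgetter


class Reducer:
    def __init__(self):
        self.nwords = 0

    def __call__(self, word, values):
        s = sum(values)
        self.nwords += s
        yield word, s

    def cleanup(self):
        yield 'Total words', self.nwords


def run_reducer_locally(reducer, pairs):
    """Hypothetical stand-in for a framework's reduce loop (not dumbo code)."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        # Hand the reducer the key and an iterator over that key's values.
        for record in reducer(key, (value for _, value in group)):
            yield record
    # After the last key, emit whatever cleanup() yields, if it is defined.
    cleanup = getattr(reducer, 'cleanup', None)
    if cleanup is not None:
        for record in cleanup():
            yield record


# Made-up mapper output for the demonstration.
pairs = [('b', 1), ('a', 1), ('b', 1)]
output = list(run_reducer_locally(Reducer(), pairs))
print(output)
```

The per-word counts come out first and the 'Total words' record is appended once at the end, which is the behaviour the issue asks for.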

@klbostee klbostee closed this in 4f7b037 Nov 13, 2012

klbostee added a commit that referenced this issue Nov 13, 2012

Merge pull request #61 from mshevelev/cleanup
Fix #60 Added cleanup functionality

kzhai Dec 1, 2012

Is it possible to add this example to the short tutorial? It took me a while to find it. Thanks.

klbostee Jan 8, 2013

Owner

I just added it to the "further reading" section at the end. Might be possible to integrate it more somehow I guess, but now it should at least be easier to find...

scottkwong Jan 26, 2014

Java Hadoop also supports setup methods that run before the mapper/reducer starts processing lines (e.g., to open and read a file). Should this be done in the __init__ method or as a separate method?

a4tunado Mar 1, 2014

You should implement a configure(self) method for initialization routines.
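
A minimal sketch of how configure() and cleanup() might pair up as setup and teardown hooks. Only the method names configure and cleanup come from this thread. FilteringReducer, its stopword set, and the run_locally driver are hypothetical and exist only so the example runs without Hadoop. The driver assumes the framework calls configure() once before the first key and cleanup() once after the last.

```python
class FilteringReducer:
    def configure(self):
        # Setup step: a real job might read a side file here.
        # The stopword set is hard-coded for this illustration.
        self.stopwords = {'the', 'a'}
        self.skipped = 0

    def __call__(self, word, values):
        total = sum(values)
        if word in self.stopwords:
            # Count the occurrences we drop, but emit nothing for this key.
            self.skipped += total
            return
        yield word, total

    def cleanup(self):
        # Teardown step: report what was filtered out.
        yield 'Skipped words', self.skipped


def run_locally(reducer, grouped):
    """Hypothetical driver (not dumbo code): configure, reduce, cleanup."""
    if hasattr(reducer, 'configure'):
        reducer.configure()
    for key, values in grouped:
        for record in reducer(key, values):
            yield record
    if hasattr(reducer, 'cleanup'):
        for record in reducer.cleanup():
            yield record


# Made-up, already-grouped input for the demonstration.
grouped = [('a', [1, 1]), ('cat', [1]), ('the', [1, 1, 1])]
result = list(run_locally(FilteringReducer(), grouped))
print(result)
```

Keeping the setup in configure() rather than __init__ means the object can be constructed cheaply and the expensive initialization happens only when the task actually starts. Whether dumbo invokes it at exactly that point is an assumption of this sketch.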
