backend abstraction #8

klbostee opened this Issue Feb 21, 2010 · 10 comments




At some point we should add a proper interface between the dumbo core and the unix and hadoop streaming backends we currently have (both for the code that runs mapreduce iterations and the code that executes commands). In addition to cleaning up the code, this would also make it possible to easily add additional backends.


I'd be a bit reluctant to integrate such a big change before we have a more extensive suite of unit tests though...


Adam Hadani started some work on this:

If you have time please take a look at my fork. The basic idea is that I stripped
apart a lot of what used to be at The basic classes all reside in
now, and I created a new sub-package, 'backends', that should hold each backend
implementation as a module ( currently '', '' ).

I think it could also be nice in the future to make backends loadable
as plugins, e.g. via entry points or egg files, but that's for later.

Mostly, I think a lot of the 'mess' is around the command-line/configuration
processing, so I tried to formalize this using a 'Configurable' interface that the
backends implement. This way they can do their own option parsing/processing for
themselves, but in a structured way, and separate from the actual running code.

If this makes sense to you on the whole, I'd be happy to continue working on it or modify it.
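
To make the 'Configurable' idea concrete, here is a minimal sketch, not the code from the fork. The method name set_options comes from this thread, but its signature, the return value, the UnixBackend class, and the "tmpdir" option are all assumptions made up for illustration.

```python
class Configurable(object):
    """Hypothetical interface: each backend parses its own options here."""

    def set_options(self, opts):
        raise NotImplementedError


class UnixBackend(Configurable):
    """Illustrative local backend; the option name 'tmpdir' is made up."""

    known = ("tmpdir",)

    def __init__(self):
        self.settings = {}

    def set_options(self, opts):
        # Keep the options this backend understands and hand back the rest,
        # so the core never needs to know about backend-specific flags.
        remaining = []
        for key, value in opts:
            if key in self.known:
                self.settings[key] = value
            else:
                remaining.append((key, value))
        return remaining

    def run(self, command):
        # The running code only reads settings that were parsed earlier.
        return "%s (tmpdir=%s)" % (command, self.settings.get("tmpdir"))


backend = UnixBackend()
leftover = backend.set_options([("tmpdir", "/tmp/x"), ("input", "a.txt")])
```

Option handling and execution stay in separate methods, which is the separation described above.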

I think Adam's work is definitely a great start. Here are my comments:

  1. Most of the command implementations in dumbo.cmd depend on the backend, so I'd like to see that be part of the backend abstraction too.
  2. I would call the hadoop streaming backend "streaming" instead of "hadoop", since we might want to build other backends that run on hadoop (but not via hadoop streaming) in the future.
  3. I don't really like the new dumbo.base submodule. I can see why you wanted to split things, but I think this should be done by making dumbo.core a package instead of a module and making it look like it's still all in one module to the "outside world". So instead of creating dumbo.base, I'd create dumbo.core.base and import everything from it in dumbo.core. The "dumbo/" file shouldn't need any changes then, and it has the main advantage that you don't break programs that directly import something from dumbo.core. Also, I'm not really sure about the name "base" itself, there must be a better alternative. Maybe dumbo.core.objects or dumbo.core.obj? Those names aren't overly informative either, but at least they are a bit better than dumbo.core.base, in my opinion.
  4. I think I would rename Configurable.set_options to Configurable.configure, since that just sounds better and makes a bit more sense maybe.

Think these are my main comments for now. Please do continue work on this! :)
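
A rough sketch of what comment 3 proposes, using a throwaway package so it can run anywhere. The package name corepkg and the class Job are placeholders, not the actual dumbo layout; the point is only that a package's __init__ can re-export names from a submodule so that old imports keep resolving.

```python
import importlib
import os
import sys
import tempfile

# Build a tiny package on disk: corepkg/ stands in for a package-ified core.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "corepkg")
os.mkdir(pkg_dir)

# corepkg/base.py: where the split-off classes actually live.
with open(os.path.join(pkg_dir, "base.py"), "w") as f:
    f.write("class Job(object):\n    pass\n")

# corepkg/__init__.py: re-export everything, so the package still looks
# like a single module to the outside world.
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("from corepkg.base import *\n")

sys.path.insert(0, root)
importlib.invalidate_caches()

from corepkg import Job   # the old-style import still works
import corepkg.base

same_object = Job is corepkg.base.Job
```

Programs that import directly from the top-level name are unaffected, while the code itself can be split across submodules.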


Also, I really like the idea of pluggable backends in the form of eggs, but I agree that we should keep that for later and focus this effort solely on establishing an interface for backends.


Agree on pretty much all points. I usually try to defer splitting modules into sub-packages to keep the project tree as flat as makes sense, but I reckon .core would make a good subpackage that includes a module for base classes ('obj' / 'adt' / 'base' ..?), a module for main runnables, etc.
I will probably commit some of these changes a bit later today or tomorrow and continue working to make the whole thing runnable asap.


Committed most of the changes we talked about, as well as the following:
1. Updated most tests. Currently testexamples still fails, and will continue to until the framework is runnable in the new format.
2. Moved (which now only includes main() and run()) to dumbo.core.executors. Any suggestions for a better name are appreciated :) Generally I was thinking of moving the contents to, however I'm thinking of doing away completely with once the backend implementation of most of the command-line options is in. Also, having under dumbo.core gives us backward compatibility (from dumbo.core import run/main works).
3. Added interface to backends for the command-line invocations so we can delegate these to backend-specific implementation. In particular, I split this off to a FileSysBackend interface that supports file system like operations (ls, rm, cat, exists, ..) and which the Backend derives from.
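
As an illustration of item 3, the sketch below shows one way command-level functions could delegate to a backend. Only the names FileSysBackend, Backend, and the operations ls, rm, exists come from the comment above; the matches() hook, the InMemoryBackend class, and the rm() signature are assumptions for the example, not the committed code.

```python
class FileSysBackend(object):
    """File-system-like operations that every backend provides."""

    def ls(self, path):
        raise NotImplementedError

    def rm(self, path):
        raise NotImplementedError

    def exists(self, path):
        raise NotImplementedError


class Backend(FileSysBackend):
    """A full backend also says whether it handles a given set of options."""

    def matches(self, opts):  # hypothetical selection hook
        raise NotImplementedError


class InMemoryBackend(Backend):
    """Toy backend that keeps 'files' in a dict so the sketch runs anywhere."""

    def __init__(self, files):
        self.files = dict(files)

    def matches(self, opts):
        return True

    def ls(self, path):
        return sorted(name for name in self.files if name.startswith(path))

    def rm(self, path):
        for name in self.ls(path):
            del self.files[name]

    def exists(self, path):
        return bool(self.ls(path))


def rm(path, opts, backends):
    """Command-level function: pick a backend, then delegate to it."""
    for backend in backends:
        if backend.matches(opts):
            return backend.rm(path)
    raise ValueError("no backend accepts these options")


store = InMemoryBackend({"out/part-0": "a", "out/part-1": "b", "in/data": "c"})
rm("out/", [], [store])
remaining = store.ls("")
```

The command function contains no backend-specific logic; adding a backend means adding a class, not editing each command.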

Next steps are splitting off the code to per-backend implementations, and rounding off the actual run()'s.
This should put us in a good place - I think immediate goals should be
1. get the whole thing running on the unix and streaming backends
2. address backward compatibility
3. extend unit tests to cover most code ( use coverage python package? )

Once all that is done, we can do some polishing and perhaps introduce a couple of new features that may now be easier to implement, if only to make it more release-worthy for end users.


Sounds and looks good.

Regarding backwards compatibility:

  1. Using abstract base classes (ABCs) will break backwards compatibility in the sense that we currently support python 2.5, while ABCs require 2.6. We should think very carefully if the benefits of using ABCs justify this.
  2. Importing commands from and running them from python code is a fairly common use case, so we can't do away with completely and we should also make sure that the functions that are currently there can be called in exactly the same way. In particular, specifying the options as a list of tuples should still be possible.
  3. Do we gain anything significant from using optparse instead of the custom parsing code? If not, the switch to optparse might not be worth the risk of maybe introducing backwards incompatibilities. (And if we're going to change things anyway, it might maybe make sense to abstract away the notion of a list of options in an object?)
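
To illustrate the last point, here is a hedged sketch of an options object that still behaves like a list of tuples when iterated. The Options class and its method names are hypothetical and not part of dumbo; legacy_command stands in for any existing function that takes opts as a list of (key, value) tuples.

```python
class Options(object):
    """Hypothetical wrapper around the legacy list of (key, value) tuples."""

    def __init__(self, pairs=()):
        self._pairs = list(pairs)

    def add(self, key, value):
        self._pairs.append((key, value))

    def get(self, key):
        """Return every value given for key (an option may repeat)."""
        return [v for k, v in self._pairs if k == key]

    def pop(self, key):
        """Remove all entries for key and return their values."""
        values = self.get(key)
        self._pairs = [(k, v) for k, v in self._pairs if k != key]
        return values

    def __iter__(self):
        # Iterating yields plain tuples, so code written for lists still works.
        return iter(self._pairs)


def legacy_command(opts):
    """Stand-in for an existing function that expects a list of tuples."""
    return dict(opts)


opts = Options([("input", "a.txt"), ("input", "b.txt"), ("hadoop", "/usr/lib/hadoop")])
inputs = opts.pop("input")
opts.add("output", "out")
as_dict = legacy_command(opts)
```

Callers that pass a plain list of tuples would keep working unchanged, since the wrapper only adds convenience on top of the same data.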

Other comments:

  1. The submodule cmd should only expose commands that can be executed via the dumbo command, so I wouldn't move main and run there.
  2. How can the FileSysBackend methods get the arguments and options for the command? Guess you just haven't added the method parameters yet?
  3. I think I would simply call the FileSysBackend mixin "FileSystem".
  4. The backend interfaces seem a bit asymmetrical to me. Maybe we should split off an "Executor" and let a Backend be a configurable object that is both a filesystem and an executor?
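
The symmetry suggested in comment 4 could look roughly like the sketch below. The role names follow the thread; the method signatures and the LocalBackend class are assumptions chosen only to keep the example small and runnable.

```python
class Configurable(object):
    def configure(self, opts):
        raise NotImplementedError


class FileSystem(object):
    def exists(self, path):
        raise NotImplementedError


class Executor(object):
    def run(self, output):
        raise NotImplementedError


class Backend(Configurable, FileSystem, Executor):
    """A backend is simply the combination of the three roles."""


class LocalBackend(Backend):
    """Illustrative backend that only records which outputs were produced."""

    def __init__(self):
        self.opts = []
        self.outputs = set()

    def configure(self, opts):
        self.opts = list(opts)

    def exists(self, path):
        return path in self.outputs

    def run(self, output):
        # A real executor would run a mapreduce iteration here.
        self.outputs.add(output)


backend = LocalBackend()
backend.configure([("input", "a.txt")])
backend.run("myprog/final")
```

Each role can then be tested or replaced on its own, and Backend carries no logic of its own.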

We should maybe add another backend before doing a release to show off the possibilities of these changes a bit yeah, but that should not be part of this and should be handled as a separate issue in my opinion.

  • python2.5 support - Agreed that this is pretty important. I'll probably remove abc for now (and go with 'raise NotImplementedError' or so for base classes).
  • cmd module - I wasn't aware that this gets imported a lot from python code. Can you give some example use cases? If so, I'll keep it as is for now.
  • Backend / names stuff - Mostly agree. I'm a little reluctant to call an object 'FileSystem'; maybe FileSystemI or so to signify an interface? Not a big deal either way.
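
For reference, a small sketch of the practical difference between the two styles mentioned in the first bullet. The class names are made up, and the ABC variant is written in Python 3 syntax (abc.ABC), so it is not what 2.6-era code would have looked like.

```python
import abc


class PlainBase(object):
    """2.5-compatible style: the error only appears when the method is called."""

    def run(self):
        raise NotImplementedError("subclasses must implement run()")


class AbstractBase(abc.ABC):
    """ABC style: an incomplete subclass cannot even be instantiated."""

    @abc.abstractmethod
    def run(self):
        """Run one iteration."""


class Incomplete(PlainBase):
    pass


class AlsoIncomplete(AbstractBase):
    pass


obj = Incomplete()  # instantiation succeeds
try:
    obj.run()
    plain_failure = None
except NotImplementedError:
    plain_failure = "at call time"

try:
    AlsoIncomplete()
    abc_failure = None
except TypeError:
    abc_failure = "at instantiation"
```

So dropping abc keeps older interpreters working at the cost of catching a missing method later.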

Hopefully will have some time to chip away at it later on today/tomorrow



Possible use cases for accessing functions from cmd are removing the output from previous runs and starting other dumbo jobs, e.g.:

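# Assumed import: rm and start are the command functions from dumbo.cmd.
from dumbo.cmd import rm, start

def starter(prog):
    # Reuse the job's hadoop option for the helper commands.
    cmdopts = {"hadoop": prog.getopt("hadoop")}
    # Remove the output of previous runs.
    rm("myprog/", opts=cmdopts.items())

    # Start a preprocessing job on the original input.
    cmdopts["input"] = prog.delopt("input")
    cmdopts["output"] = "myprog/preprocessed"
    start("", opts=cmdopts.items())

    # Feed the preprocessed output into the main job.
    prog.addopt("input", cmdopts["output"])
    prog.addopt("output", "myprog/final")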

And the name "FileSystem" isn't ideal either I guess, yeah. I wonder if maybe "Accessor" and "Executor" could be good names? They're not very specific, but I like the symmetry and they do capture the meaning quite nicely.


I've taken this up again and committed an (evolved) implementation to my master branch. Comments are welcome. If nobody screams, it will probably be part of the release I'm planning to do in the next few days.

This issue was closed.