Building and installing

On the machine from which you want to run your Dumbo programs, do:

$ wget -O ez_setup.py http://bit.ly/ezsetup
$ python ez_setup.py -z dumbo

or, if you already have easy_install installed, simply:

$ easy_install -z dumbo

The -z option is important since Dumbo only works when it’s installed as a self-contained, zipped egg.
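To confirm that an installation really ended up as a zipped egg, you can look at where the module was loaded from. The snippet below is a minimal sketch and not part of Dumbo: the helper name loaded_from_zipped_egg is hypothetical, and it assumes that a zipped egg appears in the module's path as a regular file ending in .egg, while an unzipped egg appears as a directory.

```python
import os


def loaded_from_zipped_egg(module_file):
    """Return True if module_file lies inside a zipped .egg archive.

    Hypothetical helper: walks up the path looking for the nearest
    component ending in '.egg' and reports whether it is a regular file
    (a zip archive) rather than a directory (an unzipped egg).
    """
    path = os.path.abspath(module_file)
    while True:
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root, no egg found
            return False
        if path.endswith(".egg"):
            return os.path.isfile(path)
        path = parent
```

Under these assumptions, calling loaded_from_zipped_egg(dumbo.__file__) after importing dumbo should return True for an install done with -z.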

Alternatively, you can install Dumbo in a virtual Python environment:

$ wget -O virtualenv.py http://bit.ly/virtualenv
$ python virtualenv.py env
$ env/bin/easy_install -z dumbo

Once you have completed the steps above, you can move on to Running programs. The recommended Hadoop distribution to run your Dumbo programs on is Cloudera’s Hadoop distribution, which supports Dumbo out of the box from version 2 (CDH2) onwards.

Old Hadoop Versions

If you use an old version of Hadoop, you first have to apply a few patches. More precisely, download the patches for HADOOP-1722, HADOOP-5450, and MAPREDUCE-764, apply them in the order shown here (the order is important!), and then rebuild Hadoop:

$ cd /path/to/hadoop
$ patch -p0 < /path/to/HADOOP-1722.patch
$ patch -p0 < /path/to/HADOOP-5450.patch
$ patch -p0 < /path/to/MAPREDUCE-764.patch
$ ant package

If you want to use Dumbo’s convenient joining abstraction, you need to apply HADOOP-5528 as well.
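Because the order of the three mandatory patches matters, it can help to script the step so that it stops at the first failure. The following is a rough sketch under stated assumptions, not an official tool: the function name apply_patches and the patch file names are hypothetical, and it relies only on the standard patch options -p0 and -i.

```python
import os
import subprocess

# Order taken from the instructions above; the file names are assumptions.
PATCHES = ["HADOOP-1722.patch", "HADOOP-5450.patch", "MAPREDUCE-764.patch"]


def apply_patches(hadoop_dir, patch_dir, patches=PATCHES,
                  run=subprocess.check_call):
    """Apply each patch with `patch -p0` inside hadoop_dir, in list order.

    With the default runner, check_call raises CalledProcessError on the
    first failing patch, so later patches are not applied on top of a
    partially patched tree. Returns the names that were applied.
    """
    applied = []
    for name in patches:
        patch_path = os.path.join(patch_dir, name)
        run(["patch", "-p0", "-i", patch_path], cwd=hadoop_dir)
        applied.append(name)
    return applied
```

A call such as apply_patches("/path/to/hadoop", "/path/to") would then be followed by running ant package in the Hadoop directory, as shown above.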

Old Dumbo Versions

Dumbo 0.21 has been around for quite a while now and can be considered very stable. If you still want to use 0.20 or lower for some reason, follow these instructions instead.

As part of Hadoop (mandatory)

To build Dumbo, you just have to add it to the src/contrib directory of Hadoop (version 0.18) and build Hadoop:

$ wget http://github.com/klbostee/dumbo/tarball/release-0.20.28 -O dumbo.tar.gz
$ tar zxvf dumbo.tar.gz
$ mv klbostee-dumbo* $HADOOP_HOME/src/contrib/dumbo
$ cd $HADOOP_HOME
$ ant package

This should generate a Hadoop build in build/ that contains a contrib/dumbo directory:

$ ls build/hadoop-*/contrib/dumbo
bin  examples  lib

The shell script example in the subdirectory bin/ runs the wordcount.py example on Hadoop.

As a Python module (optional)

You can also install Dumbo as a Python module on your system:

$ cd $HADOOP_HOME/src/contrib/dumbo
$ sudo ant install_pymod

This additional installation step is not required, but we do recommend it because it allows you to run programs locally using UNIX pipes, which can be very useful for debugging. The dumbo command that gets added to /usr/bin by this optional installation step can be used in the same way as $HADOOP_HOME/build/hadoop-*/contrib/dumbo/bin/dumbo. The only difference is that it requires an additional -hadoop <path_to_hadoop_dir> option. Hence, this same command can be used to run programs on different Hadoop clusters, and by omitting the -hadoop option you can run a Dumbo program locally using UNIX pipes.
