The instructions below apply to recent Dumbo versions only. If you’re still using Dumbo 0.20 and only installed it as part of Hadoop (as described under Building and installing), then you cannot run programs locally on UNIX, and you have to omit the -hadoop option when executing contrib/dumbo/bin/dumbo to start a distributed run of a program.

Locally on UNIX

If you completed both the mandatory and optional installation steps described in Building and installing, then you can run a Dumbo program program.py locally as follows:

$ dumbo start program.py -input input.txt -output output.txt
$ dumbo cat output.txt | more
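
For reference, program.py can be any Dumbo program. A minimal word count, for example, looks like this (the combiner is optional but cheap to add since the reducer can double as one here):

def mapper(key, value):
    # emit a count of 1 for every word in the input value
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # sum up the counts emitted for each word
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)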

Other useful options for local runs are (see the combined example after this list):

  • -inputformat <name of inputformat> (“text” by default)
  • -input <additional input path>
  • -python <python command to use> (“python” by default)
  • -libegg <path to egg> (this egg gets put in the Python path)
  • -cmdenv <env var name>=<value>
  • -pv yes (use “pv” to display progress info)
  • -addpath yes (replace each input key by a tuple consisting of the path of the corresponding input file and the original key)
  • -fake yes (fake run, only prints the underlying shell commands but does not actually execute them)
  • -memlimit <number of bytes> (set an upper limit on the amount of memory that can be used)
  • -overwrite yes (remove output path before running the job)
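
To illustrate, a local run that combines several of these options could look as follows (the input paths, the DEBUG environment variable, and the python2.6 command are made-up examples):

$ dumbo start program.py -input input.txt -input extra.txt \
-output output.txt -overwrite yes -pv yes \
-python python2.6 -cmdenv DEBUG=1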

Distributed on Hadoop

To run a program on Hadoop, you basically just add the -hadoop option:

$ dumbo start program.py -hadoop <path to local hadoop> \
-input <DFS input path> -output <DFS output path>
$ dumbo cat <DFS output path> -hadoop <path to local hadoop> | more

The path given via the -hadoop option is where Dumbo will look for both the hadoop command (in the bin/ subdirectory) and the Hadoop Streaming jar. Since these might not always be in the same directory, you can also specify additional Hadoop jar search paths via the -hadooplib option. In case of CDH4, for instance, you’ll typically want to use the hadoop command from /usr/bin/ and the Hadoop Streaming jar from /usr/lib/hadoop-0.20-mapreduce/, which can easily be achieved by specifying -hadoop /usr and -hadooplib /usr/lib/hadoop-0.20-mapreduce.
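
For the CDH4 setup described above, a complete start command would then look like this:

$ dumbo start program.py -hadoop /usr \
-hadooplib /usr/lib/hadoop-0.20-mapreduce \
-input <DFS input path> -output <DFS output path>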

Other useful options are (see also the Hadoop streaming page and wiki, and the combined example after this list):

  • -inputformat <name of inputformat (class)> (“auto” by default)
  • -outputformat <name of outputformat (class)> (“sequencefile” by default)
  • -input <additional DFS input path>
  • -python <python command to use on nodes> (“python” by default)
  • -name <job name> (“program.py” by default)
  • -nummaptasks <number>
  • -numreducetasks <number> (no sorting or reducing will take place if this is 0)
  • -priority <priority value> (“NORMAL” by default)
  • -libjar <path to jar> (this jar gets put in the class path)
  • -libegg <path to egg> (this egg gets put in the Python path)
  • -file <local file> (this file will be put in the working directory where the Python program gets executed)
  • -cachefile hdfs://<host>:<fs_port>/<path to file>#<link name> (a link “<link name>” to the given file will be created in the working directory)
  • -cachearchive hdfs://<host>:<fs_port>/<path to jar>#<link name> (the link points to a directory containing the files from the given jar)
  • -cmdenv <env var name>=<value>
  • -pypath <colon-delimited list of paths> (these paths are appended to the PYTHONPATH environment variable and are included in sys.path)
  • -hadoopconf <property name>=<value>
  • -addpath yes (replace each input key by a tuple consisting of the path of the corresponding input file and the original key)
  • -getpath yes (writes multiple output files into <outputdir>/<path> during the final reduce phase, as long as the output key is a (<path>, <key>) tuple; requires feathers)
  • -fake yes (fake run, only prints the underlying shell commands but does not actually execute them)
  • -memlimit <number of bytes> (set an upper limit on the amount of memory that can be used)
  • -overwrite yes (remove output path before running the job)
  • -preoutputs yes (don’t delete intermediate output results, which can be useful for debugging when running multiple map/reduce iterations)
  • -queue <name of queue to run in>
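
As an illustration, a Hadoop run that combines a few of these options might look as follows (the job name, queue name, and the mapred.child.java.opts value are made-up examples):

$ dumbo start program.py -hadoop /usr \
-input <DFS input path> -output <DFS output path> \
-numreducetasks 16 -priority HIGH -queue default \
-name wordcount -overwrite yes \
-hadoopconf mapred.child.java.opts=-Xmx512m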