This adds a -hadooplib command-line switch to tell dumbo where
hadoop-streaming.jar is stored, along with an addition to the jar search path.
It also uses 'hadoop fs' instead of 'hdfs dfs', since the hadoop executable is easier to locate than hdfs.
This may not be the best long-term solution, but it seems to work for now. I have not tested with MRv1 or anything off the beaten path. Addresses Issue #53.
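To illustrate the idea of extending the jar search path, here is a minimal Python sketch. It is not the code from this patch: the function name find_streaming_jar, the glob pattern, and the fallback directories under the hadoop home are all assumptions made for the example.

```python
import glob
import os


def find_streaming_jar(hadoop_home, hadooplibs=()):
    """Return the first streaming jar found, or None.

    Hypothetical helper for illustration only; not dumbo's actual code.
    """
    # Look in any explicitly supplied lib directories first, then fall back
    # to locations under the hadoop home (assumed layout).
    search_dirs = list(hadooplibs) + [
        hadoop_home,
        os.path.join(hadoop_home, "contrib", "streaming"),
    ]
    for directory in search_dirs:
        # Assumed naming pattern; real jar names may carry version suffixes.
        pattern = os.path.join(directory, "hadoop*streaming*.jar")
        matches = sorted(glob.glob(pattern))
        if matches:
            return matches[0]
    return None
```

With the CDH4 paths mentioned in this thread, a call might look like find_streaming_jar("/usr/lib/hadoop", ["/usr/lib/hadoop-mapreduce"]). Leaving out the second argument falls back to searching under the hadoop home only.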
Quick fix for CDH4 / MRv2.
For people who want to try this out, the paths for CDH4 on Red Hat would look like this:
$ dumbo start examples/wordcount.py -input brian.txt -output wordcounts -hadoop /usr/lib/hadoop -hadooplib /usr/lib/hadoop-mapreduce
Don't require -hadooplib.
This should still work for CDH3 where you don't need a separate -hadooplib.
Sounds good. Will have a look at this soonish hopefully.
Just had a closer look. I ended up generalizing things a bit so that several -hadooplib options can be specified. I'll do some final testing and commit the code on Monday.
With my changes you can then run Dumbo scripts as follows:
$ dumbo start wordcount.py -input brian.txt -output wc -hadoop /usr -hadooplib /usr/lib/hadoop-0.20-mapreduce
Passing -hadoop /usr seems to be a slightly better way of doing it, because CDH4 apparently has some JAVA_HOME auto-detection in /usr/bin/hadoop that isn't available in /usr/lib/hadoop/bin/hadoop.
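The reason the -hadoop value matters can be sketched as follows, assuming the launcher is resolved as bin/hadoop under the given directory. That convention is inferred from the two paths named in this thread, and the helper name hadoop_executable is hypothetical.

```python
import os


def hadoop_executable(hadoop_home):
    """Build the launcher path for a given -hadoop value.

    Hypothetical helper; assumes the launcher lives at <hadoop_home>/bin/hadoop.
    """
    return os.path.join(hadoop_home, "bin", "hadoop")
```

Under that assumption, -hadoop /usr selects /usr/bin/hadoop, while -hadoop /usr/lib/hadoop selects /usr/lib/hadoop/bin/hadoop.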
This is in release 0.21.35.