What is this? Is it is awesome?
Componentize is an implementation of the connected component MapReduce algorithm found in Graph Twiddling in a MapReduce World.
It is my understanding that it is pretty cool. I don't know about "awesome", yet. We'll see.
(The why and wherefore: Componentize is written in Scala, uses Hadoop 0.20.1 to distribute the work, and Buildr for building the source code.)
Doing the Thing
Building the Jarfile
To run this pretty little creature, you'll need to create it's jar file.
To do that, you'll run
buildr package (or
jbuildr package if you are on OS X and
have to use the JRuby version). However! However, you will need in
CLASSPATH the Hadoop 0.20.1 core jar, all of the Hadoop's
depedencies (found in the
lib directory of the 0.20.1 tarball,
natch), and, of course, the Scala jarfiles.
A couple of nice things:
Buildr will go out of its way to find the Scala jarfiles and add them for you. I know for a fact, in fact, that if you install Scala through MacPorts on OS X, the
jbuildrcommand will find your Scala jarfiles and add them for you to the classpath Componentize will be built with.
If you create a
libdirectory in your checkout of Componentize and copy all of the Hadoop and Hadoop dependency jarfiles to it, you will magically have them added to classpath used to build Componentize.
In any case, the jarfile will appear in the
target directory as
Running the Hadoop job
The jarfile created in the
target directory by
buildr package can
then be added to your
HADOOP_CLASSPATH, or just run with the usual
single-core, single-machine Hadoop command like so:
$HADOOP_HOME/bin/hadoop jar target/componentize-$someversion.jar \ componentize.Main -edgedir edges/
Before you do that, you'll need some input data! In the
slyly mentioned in the above command, you'll need one or more files
with data formatted as:
nodeid1 nodeid2 nodeid56 nodeid1000
and so on. Let's be clear here, those are big fat TAB characters in the middle there. TAB as in "\t"; as in that key over "Caps Lock". They are not spaces.
You can find some example input data in the
The output of Componentize is ... verbose. And I don't just mean
what's printed to
/dev/stdout. In the output directory, you'll have
3 directories of output for each iteration of Componentize plus one
for the set up. This is not a lot of fun.
However, when Componentize finishes it will be a good sport and log
what directory you can find your output. This directory will be a
directory starting with
zonefiles and ending in a number. It will
also be the
zonefiles directory with the greatest number at the end.
This output directory will be filled with one or more tab and newline delimited files where the first column is the component label (or "zone") of the node in the second. As in:
zone1 nodeid73 zone2 nodeid100
Things That Would Be Nice
We should use
Combinersto reduce our load.
There is probably a better way of storing the transient data between jobs than just simply writing out another output directory with
There is no way to do only some of the processing. It is all of the set up and phases or none at all. Though, we could just make it easier for someone to configure the
On build, we use Scala X-Ray to
build a nice HTML version of the code. Check it out in
target/classes.sxr/index.html after you compile Componentize.