-inputformat not handled in local mode #42

Closed
jmesnil opened this Issue May 26, 2011 · 1 comment

@jmesnil

hi,

I want to run Dumbo with a specific input format (to read from Avro files).
It seems that Dumbo ignores the input format given by '-inputformat' when it runs locally (that is, without '-hadoop') and falls back to its default input format instead.

To check this, I passed an unknown class with '-inputformat foo.bar.UnknownClass'. The job fails on Hadoop but succeeds in local mode.

Hadoop mode:

$ dumbo start cat.py \
-input word-count.avro \
-output tmp \
-libjar avro-1.4.1.jar \
-libjar avro-utils-1.5.3-SNAPSHOT.jar \
-inputformat foo.bar.UnknownClass \
-python /home/sites/sci-env/0.0.5/bin/python \
-hadoop /usr/lib/hadoop
...
-inputformat : class not found : foo.bar.UnknownClass
Streaming Command Failed!

Local mode:

$ dumbo start cat.py \
-input word-count.avro \
-output tmp \
-libjar avro-1.4.1.jar \
-libjar avro-utils-1.5.3-SNAPSHOT.jar \
-inputformat foo.bar.UnknownClass \
-python /home/sites/sci-env/0.0.5/bin/python
INFO: buffersize = 168960

=> No error: tmp was created, but it contains the raw content of the binary Avro file, because the input was read as text.

Is it a limitation of Dumbo that '-inputformat' only works in Hadoop mode, or is it a bug?

thanks,
jeff

@klbostee
Owner

It's a limitation. Dumbo's local mode relies only on UNIX pipes and doesn't use Hadoop in any way, so specifying a Java class as the input format for a local run simply cannot work. If you want to test Hadoop helper classes locally, you have to install a Hadoop build on your machine that is configured to run in local mode (which is the default configuration) and pass its location with '-hadoop'.
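
To make the pipe-based model concrete, here is a rough sketch of what a map | sort | reduce run over plain text looks like when no JVM is involved. This is not Dumbo's actual implementation: `run_local`, `identity_mapper` and `identity_reducer` are hypothetical names, and using the line number as the input key is an assumption made only for illustration. The point is that the input is consumed line by line as text, so there is no step at which a Java input format class could be loaded.

```python
from itertools import groupby
from operator import itemgetter


def identity_mapper(key, value):
    # Hypothetical stand-in for a "cat"-style job: pass records through.
    yield key, value


def identity_reducer(key, values):
    # Emit every value unchanged under its key.
    for value in values:
        yield key, value


def run_local(lines, mapper, reducer):
    """Simulate a local map | sort | reduce run over plain text lines."""
    # Map phase: each line is treated as text. The key is the line number
    # (an assumption for this sketch); no Java InputFormat is consulted.
    mapped = []
    for lineno, line in enumerate(lines):
        mapped.extend(mapper(lineno, line.rstrip("\n")))

    # Sort phase: plays the role of the UNIX `sort` between the two pipes.
    mapped.sort(key=itemgetter(0))

    # Reduce phase: group consecutive records that share a key.
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output
```

Feeding the lines of a binary Avro file through `run_local` with the identity functions would hand back those same bytes split at newlines, which matches the behaviour reported above.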

@klbostee klbostee closed this May 26, 2011