additer not working since 0.21.33 #57

Closed
jmeynet opened this Issue · 3 comments

2 participants

@jmeynet

It seems that multi-iteration jobs have been failing since 0.21.33.
In version 0.21.32, the input and output formats between iterations were sequence files (the 'code' shortcut in core.py).
In version 0.21.33, the first iteration's output format is set to text, while the second iteration's input format remains sequence file or text, depending on the -inputformat passed to the main dumbo program.

To reproduce the problem:

Run the example script itertwice:

$ dumbo start itertwice.py -hadoop /usr/lib/hadoop -python `which python` -name 'toto' -inputformat text -output /tmp/toto -outputformat text -input /benchmark/wc_dbpedia/part-r-00000
EXEC: HADOOP_CLASSPATH=":$HADOOP_CLASSPATH" /usr/lib/hadoop/bin/hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u4.jar -outputformat 'org.apache.hadoop.mapred.TextOutputFormat' -inputformat 'org.apache.hadoop.mapred.TextInputFormat' -mapper '/home/sites/sci-env/1.0.0/bin/python -m itertwice map 0 262144000' -reducer '/home/sites/sci-env/1.0.0/bin/python -m itertwice red 0 262144000' -file '/home/blaurencin/work/svn_repository/trunk/science/bom_dumbo/examples/itertwice.py' -file '/home/sites/sci-env/1.0.0/lib/python2.6/site-packages/dumbo-0.21.33-py2.6.egg' -file '/home/sites/sci-env/1.0.0/lib/python2.6/site-packages/typedbytes-0.3.8-py2.6.egg' -input '/benchmark/wc_dbpedia/part-r-00000' -jobconf 'mapred.job.name=toto (1/2)' -jobconf 'stream.map.input=typedbytes' -jobconf 'stream.map.output=typedbytes' -jobconf 'stream.reduce.input=typedbytes' -jobconf 'stream.reduce.output=typedbytes' -output '/tmp/toto_pre1' -cmdenv 'PYTHONPATH=dumbo-0.21.33-py2.6.egg:typedbytes-0.3.8-py2.6.egg' -cmdenv 'dumbo_jk_class=dumbo.backends.common.JoinKey' -cmdenv 'dumbo_mrbase_class=dumbo.backends.common.MapRedBase' -cmdenv 'dumbo_runinfo_class=dumbo.backends.streaming.StreamingRunInfo'
/usr/lib/hadoop/bin/hadoop: line 301: /usr/lib/jvm/java-6-sun/bin/java: No such file or directory
12/07/03 10:31:21 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/home/blaurencin/work/svn_repository/trunk/science/bom_dumbo/examples/itertwice.py, /home/sites/sci-env/1.0.0/lib/python2.6/site-packages/dumbo-0.21.33-py2.6.egg, /home/sites/sci-env/1.0.0/lib/python2.6/site-packages/typedbytes-0.3.8-py2.6.egg, /tmp/hadoop-blaurencin/hadoop-unjar307213919339965419/] [] /tmp/streamjob6460042582272007332.jar tmpDir=null
12/07/03 10:38:18 INFO streaming.StreamJob:  map 100%  reduce 100%
12/07/03 10:38:19 INFO streaming.StreamJob: Job complete: job_201207021700_0214
12/07/03 10:38:19 INFO streaming.StreamJob: Output: /tmp/toto_pre1

The output format of the first iteration is -outputformat 'org.apache.hadoop.mapred.TextOutputFormat'.
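To see at a glance which formats an iteration was launched with, the EXEC line can be split into tokens and the format options pulled out. This is only a minimal sketch: `streaming_formats` is a hypothetical helper (not part of dumbo), and `first_iter` is a shortened stand-in for the first EXEC line above with most options omitted.

```python
import shlex


def streaming_formats(exec_line):
    """Return the -inputformat / -outputformat values found in a command line.

    Hypothetical helper for reading dumbo's EXEC output; not part of dumbo.
    """
    tokens = shlex.split(exec_line)  # handles the single-quoted values
    formats = {}
    # Walk over (flag, value) pairs of adjacent tokens.
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-inputformat", "-outputformat"):
            formats[flag.lstrip("-")] = value
    return formats


# Shortened stand-in for the first EXEC line (assumption: other options dropped).
first_iter = (
    "hadoop jar hadoop-streaming.jar "
    "-outputformat 'org.apache.hadoop.mapred.TextOutputFormat' "
    "-inputformat 'org.apache.hadoop.mapred.TextInputFormat' "
    "-output '/tmp/toto_pre1'"
)

print(streaming_formats(first_iter))
```

Running the same helper on the EXEC line of each iteration makes it easy to compare what the intermediate output was written as with what the next iteration reads.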

Second iteration:

EXEC: HADOOP_CLASSPATH=":$HADOOP_CLASSPATH" /usr/lib/hadoop/bin/hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u4.jar -outputformat 'org.apache.hadoop.mapred.TextOutputFormat' -inputformat 'org.apache.hadoop.mapred.TextInputFormat' -mapper '/home/sites/sci-env/1.0.0/bin/python -m itertwice map 1 262144000' -reducer '/home/sites/sci-env/1.0.0/bin/python -m itertwice red 1 262144000' -file '/home/blaurencin/work/svn_repository/trunk/science/bom_dumbo/examples/itertwice.py' -file '/home/sites/sci-env/1.0.0/lib/python2.6/site-packages/dumbo-0.21.33-py2.6.egg' -file '/home/sites/sci-env/1.0.0/lib/python2.6/site-packages/typedbytes-0.3.8-py2.6.egg' -input '/tmp/toto_pre1' -jobconf 'mapred.job.name=toto (2/2)' -jobconf 'stream.map.input=typedbytes' -jobconf 'stream.map.output=typedbytes' -jobconf 'stream.reduce.input=typedbytes' -jobconf 'stream.reduce.output=typedbytes' -output '/tmp/toto' -cmdenv 'PYTHONPATH=dumbo-0.21.33-py2.6.egg:typedbytes-0.3.8-py2.6.egg' -cmdenv 'dumbo_jk_class=dumbo.backends.common.JoinKey' -cmdenv 'dumbo_mrbase_class=dumbo.backends.common.MapRedBase' -cmdenv 'dumbo_runinfo_class=dumbo.backends.streaming.StreamingRunInfo'
/usr/lib/hadoop/bin/hadoop: line 301: /usr/lib/jvm/java-6-sun/bin/java: No such file or directory
12/07/03 10:38:21 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/home/blaurencin/work/svn_repository/trunk/science/bom_dumbo/examples/itertwice.py, /home/sites/sci-env/1.0.0/lib/python2.6/site-packages/dumbo-0.21.33-py2.6.egg, /home/sites/sci-env/1.0.0/lib/python2.6/site-packages/typedbytes-0.3.8-py2.6.egg, /tmp/hadoop-blaurencin/hadoop-unjar3196862767270890733/] [] /tmp/streamjob1331402886963010625.jar tmpDir=null

Second iteration fails with:

File "/home/hdfs/mapred/local/taskTracker/blaurencin/jobcache/job_201207021700_0216/attempt_201207021700_0216_m_000013_0/work/itertwice.py", line 9, in mapper2
    for letter in key: yield letter,1
TypeError: 'int' object is not iterable

The mapper receives the line number as the key instead of the expected word string.
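The failure can be illustrated without Hadoop. The sketch below assumes `mapper2` has the shape shown in the traceback (it iterates over the letters of the key); the sample records and the `run_mapper` helper are made up for illustration and are not taken from itertwice.py.

```python
def mapper2(key, value):
    # Same shape as the line in the traceback: iterate over the key's letters.
    for letter in key:
        yield letter, 1


def run_mapper(records):
    """Hypothetical driver: feed (key, value) records to mapper2."""
    return [pair for key, value in records for pair in mapper2(key, value)]


# What the second iteration expects: (word, count) pairs from iteration one.
expected_input = [("ab", 2)]

# What it appears to receive when the intermediate output is read back as
# plain text: an integer key and the raw line as the value (illustrative data).
actual_input = [(0, "ab\t2")]

print(run_mapper(expected_input))

try:
    run_mapper(actual_input)
except TypeError as exc:
    print("TypeError:", exc)
```

With a string key the mapper emits one pair per letter; with an integer key it raises the same TypeError as in the task log.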

Thanks,
Julien.

@klbostee
Owner

Ouch, not good :( I seem to be able to reproduce the problem, but not sure why it's occurring yet. I'll try to figure it out by the end of today hopefully.

@klbostee closed this issue in commit 9937c92 (Klaas Bosteels): "preserve order of option values (fixes #57)"
@klbostee
Owner

So should be fixed in 0.21.34...

@jmeynet

Great, thanks for the quick response!
