The filenames don't get escaped in output #59

Open

poison opened this Issue Sep 6, 2012

poison commented Sep 6, 2012

I was running a job that wrote its output to 'twoo/flowanalysis/2012/09/*', and this causes problems: when dumbo runs the HDFS (re)move operations (with overwrite="yes", for instance), it doesn't escape the path properly, so the glob characters get expanded and the job fails with an error.
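For illustration, here is a minimal sketch (not dumbo's actual code) of how a path could be quoted before being interpolated into a shell command, so that characters like '*' survive intact; the helper name build_rmr_command is hypothetical:

```python
import shlex  # on Python 2.6 (dumbo's era), pipes.quote plays the same role


def build_rmr_command(path):
    # Without quoting, a path like "twoo/flowanalysis/2012/09/*" would be
    # glob-expanded by the shell before hadoop ever sees it, potentially
    # matching (and removing) files the caller never intended to touch.
    return "hadoop fs -rmr %s" % shlex.quote(path)


print(build_rmr_command("twoo/flowanalysis/2012/09/*"))
# The '*' stays inside single quotes instead of being expanded by the shell.
```

Something along these lines, applied wherever dumbo builds its HDFS move/remove commands, would avoid the accidental glob expansion shown in the log below.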

See the output below:

12/09/06 14:00:00 INFO streaming.StreamJob: map 100% reduce 100%
12/09/06 14:00:32 INFO streaming.StreamJob: Job complete: job_201208201604_77368
12/09/06 14:00:32 INFO streaming.StreamJob: Output: twoo/flowanalysis/2012/09/__pre1
Moved to trash: hdfs://hadoopname02/user/poison/twoo/flowanalysis/2012/09/__pre1
Moved to trash: hdfs://hadoopname02/user/poison/twoo/flowanalysis/2012/09/04
EXEC: HADOOP_CLASSPATH="/home/jeroen/mm.metrics/jars/tusks.jar:$HADOOP_CLASSPATH" /usr/lib/hadoop/bin/hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u4.jar -outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat' -inputformat 'org.apache.hadoop.streaming.AutoInputFormat' -mapper 'python -m base64_users map 1 629145600' -reducer 'python -m base64_users red 1 629145600' -numReduceTasks '60' -file '/home/poison/scripts/base64_users.py' -file '/usr/lib/dumbo/eggs/ctypedbytes-0.1.8-py2.6-linux-x86_64.egg' -file '/usr/lib/dumbo/lib/python2.6/site-packages/dumbo-0.21.34-py2.6.egg' -file '/usr/lib/dumbo/lib/python2.6/site-packages/typedbytes-0.3.8-py2.6.egg' -file '/home/jeroen/mm.metrics/jars/tusks.jar' -output 'twoo/flowanalysis/2012/09/_' -jobconf 'stream.map.input=typedbytes' -jobconf 'stream.reduce.input=typedbytes' -jobconf 'stream.map.output=typedbytes' -jobconf 'stream.reduce.output=typedbytes' -jobconf 'mapred.job.name=base64_users.py (2/2)' -input 'twoo/flowanalysis/2012/09/__pre1' -cmdenv 'dumbo_mrbase_class=dumbo.backends.common.MapRedBase' -cmdenv 'dumbo_jk_class=dumbo.backends.common.JoinKey' -cmdenv 'dumbo_runinfo_class=dumbo.backends.streaming.StreamingRunInfo' -cmdenv 'PYTHON_EGG_CACHE=/tmp/eggcache' -cmdenv 'PYTHONPATH=ctypedbytes-0.1.8-py2.6-linux-x86_64.egg:dumbo-0.21.34-py2.6.egg:typedbytes-0.3.8-py2.6.egg'
12/09/06 14:00:33 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/home/poison/scripts/base64_users.py, /usr/lib/dumbo/eggs/ctypedbytes-0.1.8-py2.6-linux-x86_64.egg, /usr/lib/dumbo/lib/python2.6/site-packages/dumbo-0.21.34-py2.6.egg, /usr/lib/dumbo/lib/python2.6/site-packages/typedbytes-0.3.8-py2.6.egg, /home/jeroen/mm.metrics/jars/tusks.jar, /tmp/hadoop-poison/hadoop-unjar6857496811056060599/] [] /tmp/streamjob4826297620725394446.jar tmpDir=null
12/09/06 14:00:33 INFO mapred.JobClient: Cleaning up the staging area hdfs://hadoopname02/tmp/hadoop-mapred/mapred/staging/poison/.staging/job_201208201604_77492
12/09/06 14:00:33 ERROR security.UserGroupInformation: PriviledgedActionException as:poison (auth:SIMPLE) cause:org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://hadoopname02/user/poison/twoo/flowanalysis/2012/09/__pre1 matches 0 files
12/09/06 14:00:33 ERROR streaming.StreamJob: Error Launching job : Input Pattern hdfs://hadoopname02/user/poison/twoo/flowanalysis/2012/09/__pre1 matches 0 files
Streaming Command Failed!

By the way: thanks a lot for this great contribution. I use it almost every day and it works like a charm. I really like using Hadoop streaming from Python!

Nicolas
