dumbo cat /hdfs/path/part* silently fails to concatenate all part files #1

klbostee opened this Issue Feb 21, 2010 · 2 comments



As originally reported by Zak Stone:

It appears that

dumbo cat /hdfs/path/part*

does not actually concatenate all of the parts in an HDFS directory. Instead, it silently emits only the key-value pairs from the first part.

Since the normal Dumbo syntax without the final star chokes on the _logs directory that Hadoop creates by default, people may be using this part* syntax frequently, and they may not realize that it yields incorrect results.

Current workarounds include using dumbo cat without the star, after either manually deleting the _logs directory or configuring Hadoop not to create it. Alternatively, it may be more convenient to use the HDFS ls command to iterate through the part files in a directory explicitly, ensuring that each one is processed as expected.
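The ls-based workaround could be sketched roughly as follows. This is not part of Dumbo: the helper names `list_part_files` and `cat_all_parts` are hypothetical, and the sketch assumes that `hadoop fs -ls` prints one entry per line with the path as the last whitespace-separated column (so paths containing spaces would not be handled). Any extra options your `dumbo cat` invocation needs are passed through `extra_args` unchanged.

```python
import posixpath
import subprocess


def list_part_files(ls_output):
    """Extract part-file paths from the text printed by `hadoop fs -ls`.

    Assumes each entry line has at least 8 whitespace-separated columns
    with the path last; shorter lines (such as the "Found N items"
    header) are skipped.
    """
    paths = []
    for line in ls_output.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue
        path = fields[-1]
        # Keep only part files; this skips _logs and anything else.
        if posixpath.basename(path).startswith("part"):
            paths.append(path)
    return sorted(paths)


def cat_all_parts(hdfs_dir, extra_args=()):
    """Run `dumbo cat` once per part file so that none is silently dropped."""
    ls_output = subprocess.check_output(["hadoop", "fs", "-ls", hdfs_dir]).decode()
    for path in list_part_files(ls_output):
        # One invocation per file sidesteps the part* glob entirely.
        subprocess.check_call(["dumbo", "cat", path] + list(extra_args))
```

Calling `cat_all_parts("/hdfs/path")` would then emit the key-value pairs of every part file in turn, at the cost of starting one process per file.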


Closed by ac4262f


This fix made dumbo cat a bit slow in some cases. See #67 for more info.

This issue was closed.