As originally reported by Zak Stone:
It appears that
dumbo cat /hdfs/path/part*
does not actually concatenate all of the parts in an HDFS directory — instead, it silently emits only the key-value pairs from the first part.
Since the normal Dumbo syntax without the final star chokes on the _logs directory that Hadoop creates by default, people may be using this part* syntax frequently, and they may not realize that it yields incorrect results.
Current workarounds are to run dumbo cat without the star after either deleting the _logs directory manually or configuring Hadoop not to create it. It may be more convenient to list the part files explicitly with the HDFS ls command and process each one in turn, which ensures that every part is handled as expected.
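A rough sketch of the explicit-listing workaround, in case it is useful. This is not part of Dumbo: the helper names part_files and cat_all_parts are made up for illustration, and the sketch assumes that both hadoop and dumbo are on the PATH, that `hadoop fs -ls` prints one entry per line with the path as the last field, and that paths contain no whitespace. Any extra options your dumbo cat invocation needs can be passed through the hypothetical extra_args parameter.

```python
import posixpath
import subprocess


def part_files(ls_output):
    """Return the part-file paths found in `hadoop fs -ls` output, sorted.

    Assumes each entry line has eight whitespace-separated fields
    (permissions, replication, owner, group, size, date, time, path).
    """
    paths = []
    for line in ls_output.splitlines():
        fields = line.split()
        # Skip blank lines and the "Found N items" header.
        if len(fields) < 8:
            continue
        path = fields[-1]
        # Keep part files only; this also drops the _logs directory.
        if posixpath.basename(path).startswith("part"):
            paths.append(path)
    return sorted(paths)


def cat_all_parts(directory, extra_args=()):
    """Run `dumbo cat` once per part file so that no part is skipped."""
    listing = subprocess.check_output(["hadoop", "fs", "-ls", directory])
    for path in part_files(listing.decode("utf-8")):
        # One invocation per file avoids relying on the part* glob.
        subprocess.check_call(["dumbo", "cat", path] + list(extra_args))
```

Calling cat_all_parts("/hdfs/path") would then write the contents of each part to standard output in turn. Starting one dumbo process per part is slower than a single call, but it makes it obvious which files were read.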
Closed by ac4262f
This fix made dumbo cat a bit slow in some cases. See #67 for more info.