This repository has been archived by the owner on Oct 30, 2020. It is now read-only.
Fix camus-hourly mapper skew and improve mapper logging #209
There were 4 issues in Camus that could cause mapper skew:

1. Camus stores `avgRecordSize` as a `long` and recomputes the running total size as `currentAvgSize * currentCount`. This is simply wrong, because integer division throws away the fractional part of the average on every update. For example, suppose we have pulled 10,000 records with `avgSize = 100`, so the total size is 1 million. Even if we then pull an infinite number of records of size 200 each, `avgRecordSize` will stay at 100, since each update truncates back down to 100; and if we pull a single record of size < 100, `avgRecordSize` drops to 99. So `avgRecordSize` can only ever decrease. The fix is to track the actual running total size instead of reconstructing it from the truncated average; see `EtlMultiOutputCommitter.addOffset()`.
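A minimal sketch of the truncation bug and the running-total fix (the class and method names here are illustrative, not Camus's actual code):

```java
public class AvgSizeDemo {
    // Buggy approach: reconstruct the running total from the truncated
    // long average, losing the fractional remainder on every update.
    static long buggyAvg(long startAvg, long startCount, long recordSize, int pulls) {
        long avg = startAvg, count = startCount;
        for (int i = 0; i < pulls; i++) {
            long total = avg * count;           // e.g. 100 * 10,000 = 1,000,000
            count++;
            avg = (total + recordSize) / count; // integer division truncates
        }
        return avg;
    }

    // Fixed approach: keep the true running total and derive the
    // average from it only when needed.
    static long fixedAvg(long startAvg, long startCount, long recordSize, int pulls) {
        long total = startAvg * startCount, count = startCount;
        for (int i = 0; i < pulls; i++) {
            total += recordSize;
            count++;
        }
        return total / count;
    }

    public static void main(String[] args) {
        // After 10,000 records of avg size 100, pull 1,000,000 records of size 200:
        System.out.println(buggyAvg(100, 10_000, 200, 1_000_000)); // prints 100
        System.out.println(fixedAvg(100, 10_000, 200, 1_000_000)); // prints 199
    }
}
```

No matter how many size-200 records arrive, the buggy average never moves off 100, while the running-total version converges toward the true mean.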
2. If Camus times out before reading a partition, it emits an `(EtlKey, ExceptionWritable)` pair, and `getMessageSize()` on the `ExceptionWritable` returns the size of the last read record. That bogus pair is then used to calculate the `avgMsgSize` of the partition. The fix is to force each mapper to pull at least 5 records from each partition assigned to it; see `EtlRecordReader.nextKeyValue()` line 204.

3. Fixed by only pulling up to `latestOffset`; see `KafkaReader.hasNext()`.

4. Fixed by checking for timeout; see `EtlRecordReader.nextKeyValue()` line 302.

Tested on Nertz: the skew has largely disappeared. The remaining occasional skew is due to a single large partition, which has to be assigned to a single mapper.
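Fixes 2–4 all affect the mapper's pull loop. A hypothetical sketch of the resulting control flow (`MIN_RECORDS_PER_PARTITION`, `shouldContinue`, and the parameter names are illustrative, not the actual Camus API; the real logic lives in `EtlRecordReader.nextKeyValue()` and `KafkaReader.hasNext()`):

```java
public class PullLoopSketch {
    static final int MIN_RECORDS_PER_PARTITION = 5; // fix 2: pull at least 5 records

    static boolean shouldContinue(long recordsPulled, long currentOffset,
                                  long latestOffset, long elapsedMs, long timeoutMs) {
        // Fix 2: always pull a minimum number of records per partition, so a
        // timed-out partition cannot contribute a bogus avgMsgSize from a
        // single (EtlKey, ExceptionWritable) pair.
        if (recordsPulled < MIN_RECORDS_PER_PARTITION) {
            return currentOffset < latestOffset;
        }
        // Fix 4: stop once the mapper has run out of time.
        if (elapsedMs >= timeoutMs) {
            return false;
        }
        // Fix 3: never pull past the latest offset recorded at job setup.
        return currentOffset < latestOffset;
    }

    public static void main(String[] args) {
        System.out.println(shouldContinue(3, 10, 100, 999_999, 1_000)); // true: below minimum
        System.out.println(shouldContinue(10, 10, 100, 2_000, 1_000)); // false: timed out
        System.out.println(shouldContinue(10, 100, 100, 0, 1_000));    // false: at latestOffset
        System.out.println(shouldContinue(10, 10, 100, 0, 1_000));     // true: keep pulling
    }
}
```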
Also improved mapper logging: removed the log message "Received message with null message.key()", which dominated the mapper log and caused other important messages to be truncated, and added the time spent on each partition, the total bytes read, and the average record size to the mapper log. See `EtlRecordReader.nextKeyValue()`.
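The added per-partition summary could look something like the following (a hypothetical illustration only; the actual log lines are in `EtlRecordReader.nextKeyValue()`):

```java
public class PartitionLogSketch {
    // Hypothetical helper: format the one-line per-partition summary that
    // the mapper logs instead of a per-record null-key message.
    static String summarize(String topic, int partition, long bytesRead,
                            long recordCount, long elapsedMs) {
        long avgRecordSize = recordCount == 0 ? 0 : bytesRead / recordCount;
        return String.format(
            "topic=%s partition=%d timeMs=%d bytesRead=%d avgRecordSize=%d",
            topic, partition, elapsedMs, bytesRead, avgRecordSize);
    }

    public static void main(String[] args) {
        System.out.println(summarize("events", 3, 1_000_000, 10_000, 4_200));
        // topic=events partition=3 timeMs=4200 bytesRead=1000000 avgRecordSize=100
    }
}
```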