Store broker offset history #338

toddpalino · 2018-01-29T22:24:58Z

For partitions that receive data very slowly, it is possible for Burrow to falsely alert for stopped partitions. This case is described in #303 (the second case). This is the starting state:

* There has been no data in a long time
* The currentLag is zero
* The partition would be considered stopped by timestamps (if currentLag was not zero)
* The consumer is running properly

Then the following things happen:

1. A new message comes in
2. The broker offset is updated in Burrow
3. A status check is performed in Burrow for the consumer - this reports stopped for this partition, as the lag is non-zero and the timestamps say it is stopped

To get around this, we need to wait a little while to give the consumer a chance to see the new message and commit an updated offset. We can do this by storing the recent broker offsets that Burrow has fetched for the partition. If the consumer would have had zero lag against any one of those recent offsets, we don't consider the partition to be stopped. If the intervals config is 10, and the cluster offset-refresh is 30 seconds, this would delay the partition being marked as stopped by 5 minutes (10 × 30 seconds).

This does not expose the broker offset history to the user, or use it in any other way except when the partition might be considered stopped.
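The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `offsetHistory`, `hadZeroLag`, and `isStopped` are hypothetical, and it assumes the history is kept in a `container/ring` from the Go standard library with one slot per stored broker offset.

```go
package main

import (
	"container/ring"
	"fmt"
)

// offsetHistory is a hypothetical fixed-size history of broker offsets
// (log end offsets) for a single partition.
type offsetHistory struct {
	r *ring.Ring
}

func newOffsetHistory(size int) *offsetHistory {
	return &offsetHistory{r: ring.New(size)}
}

// add records the newest broker offset, overwriting the oldest slot.
func (h *offsetHistory) add(offset int64) {
	h.r.Value = offset
	h.r = h.r.Next()
}

// hadZeroLag reports whether the committed offset had caught up to any
// broker offset still held in the history.
func (h *offsetHistory) hadZeroLag(committed int64) bool {
	caughtUp := false
	h.r.Do(func(v interface{}) {
		// Unfilled slots hold nil; the checked assertion skips them.
		if leo, ok := v.(int64); ok && committed >= leo {
			caughtUp = true
		}
	})
	return caughtUp
}

// isStopped applies the rule from the description: a partition is stopped
// only if it has lag, the timestamps say it is stopped, AND the consumer
// did not have zero lag against any recent broker offset.
func isStopped(stoppedByTimestamps bool, currentLag, committed int64, h *offsetHistory) bool {
	if currentLag == 0 || !stoppedByTimestamps {
		return false
	}
	return !h.hadZeroLag(committed)
}

func main() {
	h := newOffsetHistory(10)
	h.add(100) // quiet partition, consumer committed offset 100
	h.add(101) // a new message arrives on the broker

	// The consumer was at zero lag against the earlier offset: not stopped.
	fmt.Println(isStopped(true, 1, 100, h))

	// Ten more refreshes with no new commit push offset 100 out of the history.
	for i := 0; i < 10; i++ {
		h.add(101)
	}
	fmt.Println(isStopped(true, 1, 100, h))
}
```

In this sketch the first check prints `false` and the second prints `true`, since by then every slot in the history holds an offset the consumer never reached.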

In order to account for the race condition described in linkedin#303 (second part), we need to delay alerting for a stopped partition for a short period of time. We do that by only marking the partition stopped if the timestamps show it is stopped AND if the partition did not have zero lag against any recent broker LEO. In the case where the intervals config for storage is 10 and the cluster offset-refresh is 30 seconds, this would give the consumer 5 minutes to consume from a slow partition and commit an offset before it gets marked as stopped.

toddpalino added 8 commits January 29, 2018 15:18

Add a ring of broker offsets for each partition to keep a short histo…

15c973d

…ry of partition LEO

Clean up repetitive casting in tests

ec01b5d

Add asserts to check for broker offset history

97211c9

Update current tests for the evaluator to handle the new recent lag c…

722fdd9

…heck

Reformat test objects so they're readable

f41d1f7

Add a test to cover a slow data partition as described in linkedin#303

e7775ed

gofmt fixes

5f5f428

toddpalino merged commit 389ef47 into linkedin:master Jan 30, 2018

toddpalino mentioned this pull request Jan 30, 2018

SIGSEGV from ringval update in inmemory.go. #339

Closed

toddpalino added a commit to toddpalino/Burrow that referenced this pull request Jan 30, 2018

Fix an incorrect cast from linkedin#338 and add a test to cover it

9f3a698

toddpalino mentioned this pull request Jan 30, 2018

Fix an incorrect cast from #338 #340

Merged

toddpalino added a commit that referenced this pull request Jan 30, 2018

Fix an incorrect cast from #338 and add a test to cover it (#340)

a91cf4d

grumbler pushed a commit to grumbler/Burrow that referenced this pull request Feb 2, 2018

Merge remote-tracking branch 'upstream/master' into github-master

a574454

* upstream/master: Fix an incorrect cast from linkedin#338 and add a test to cover it (linkedin#340)
