Store broker offset history #338

Merged
merged 8 commits into from
Jan 30, 2018

@toddpalino toddpalino commented Jan 29, 2018

For partitions that receive data very slowly, it is possible for Burrow to falsely alert for stopped partitions. This case is described in #303 (the second case). This is the starting state:

  1. There has been no data in a long time
  2. The currentLag is zero
  3. The partition would be considered stopped by timestamps (if currentLag were not zero)
  4. The consumer is running properly

Then the following things happen:

  1. A new message comes in
  2. The broker offset is updated in Burrow
  3. A status check is performed in Burrow for the consumer - this reports stopped for this partition, as the lag is non-zero and the timestamps say it is stopped
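
To make the sequence above concrete, here is a minimal sketch of the race. It is not Burrow's evaluator code: `naiveStopped` and the offset values are hypothetical, and the timestamp check is reduced to a boolean that is assumed to already say "stopped".

```go
package main

import "fmt"

// naiveStopped is a hypothetical stand-in for a check that has no broker
// offset history: any non-zero lag combined with stale timestamps is
// reported as a stopped partition.
func naiveStopped(currentLag int64, stoppedByTimestamps bool) bool {
	return currentLag > 0 && stoppedByTimestamps
}

func main() {
	// Starting state: no data in a long time, consumer fully caught up.
	brokerOffset, consumerOffset := int64(100), int64(100)
	stoppedByTimestamps := true // assumed result of the timestamp check

	fmt.Println(naiveStopped(brokerOffset-consumerOffset, stoppedByTimestamps))

	// A new message arrives and the broker offset is refreshed before the
	// consumer has had a chance to commit.
	brokerOffset++

	fmt.Println(naiveStopped(brokerOffset-consumerOffset, stoppedByTimestamps))
}
```

The first check passes because the lag is zero; the second reports stopped even though the consumer is healthy and simply has not committed yet.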

To get around this, we need to wait a little bit to give the consumer a chance to see the new message and commit an updated offset. We can do this by storing the recent broker offsets that Burrow has gotten for the partition. If one of those recent offsets would have had the consumer at zero lag, we don't consider the partition to be stopped. If the intervals config is 10, and the cluster offset-refresh is 30 seconds, this would delay the partition being marked as stopped by 5 minutes (10 stored broker offsets, 30 seconds apart).

This does not expose the broker offset history to the user, or use it in any other way except when the partition might be considered stopped.

In order to account for the race condition described in linkedin#303 (second part), we need to delay alerting for a stopped partition for a short period of time. We do that by only marking the partition stopped if the timestamps show it is stopped AND if the partition did not have zero lag against any recent broker LEO. In the case where the intervals config for storage is 10 and the cluster offset-refresh is 30 seconds, this would give the consumer 5 minutes to consume from a slow partition and commit an offset before it gets marked as stopped.
@toddpalino toddpalino merged commit 389ef47 into linkedin:master Jan 30, 2018
toddpalino added a commit to toddpalino/Burrow that referenced this pull request Jan 30, 2018
grumbler pushed a commit to grumbler/Burrow that referenced this pull request Feb 2, 2018
* upstream/master:
  Fix an incorrect cast from linkedin#338 and add a test to cover it (linkedin#340)