Allow rewind detection when previous zero lag #716

Merged

Conversation

jyates
Contributor

@jyates jyates commented Sep 29, 2021

Rewinds can occur even when the consumer has no lag, and it's often an indication
that something has gone quite wrong. This change checks for a rewind when there
is any current lag. If a rewind is found in the history of the consumer offsets,
it then checks to see if the consumer has 'recovered' from the rewind. If not,
it is marked as a rewind. If it has recovered, we fall back to the usual
status evaluation rules.
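
As a rough illustration of the flow described above (not the code in this PR), the check could be sketched as follows. The `Commit` type, the function names, and the `OTHER` status are hypothetical, and the definition of 'recovered' used here (the latest offset has reached the highest offset committed before the rewind) is an assumption.

```go
package main

import "fmt"

// Commit is a hypothetical record of one offset commit in the evaluation window.
type Commit struct {
	Offset int64 // committed consumer offset
	Lag    int64 // lag observed at the time of the commit
}

// Status is a simplified stand-in for a consumer status.
type Status string

const (
	StatusOK     Status = "OK"
	StatusRewind Status = "REWIND"
	StatusOther  Status = "OTHER" // placeholder for the remaining rules, not modeled here
)

// lastRewind returns the index of the most recent commit whose offset is
// lower than the commit before it, or -1 if offsets never move backwards.
func lastRewind(window []Commit) int {
	idx := -1
	for i := 1; i < len(window); i++ {
		if window[i].Offset < window[i-1].Offset {
			idx = i
		}
	}
	return idx
}

// recovered reports whether the latest commit has caught back up to the
// highest offset seen before the rewind (an assumed definition of recovery).
func recovered(window []Commit, rewindAt int) bool {
	peak := window[0].Offset
	for _, c := range window[:rewindAt] {
		if c.Offset > peak {
			peak = c.Offset
		}
	}
	return window[len(window)-1].Offset >= peak
}

// usualRules stands in for the existing evaluation and models only the
// zero-lag rule: any zero lag in the window means OK.
func usualRules(window []Commit) Status {
	for _, c := range window {
		if c.Lag == 0 {
			return StatusOK
		}
	}
	return StatusOther
}

// evaluate checks for an unrecovered rewind only when there is current lag,
// and otherwise falls back to the usual rules.
func evaluate(window []Commit) Status {
	if len(window) == 0 {
		return usualRules(window)
	}
	if window[len(window)-1].Lag > 0 {
		if at := lastRewind(window); at >= 0 && !recovered(window, at) {
			return StatusRewind
		}
	}
	return usualRules(window)
}

func main() {
	rewound := []Commit{{100, 0}, {200, 0}, {10, 190}, {20, 180}}
	caughtUp := []Commit{{100, 0}, {200, 0}, {10, 190}, {210, 5}}
	fmt.Println(evaluate(rewound), evaluate(caughtUp))
}
```

In this sketch the first window is reported as a rewind because the consumer is still behind its earlier position, while the second falls through to the usual rules because it has moved past that position.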

Currently the wiki doesn't say anything about REWIND status, but does say:

If any lag within the window is zero, the status is considered to be OK.

However, that misses the case where lag can be zero for a while and then the
offsets jump back in time. This can occur during a rebalance as consumers drain existing
work (for those that keep a pipeline of work) or due to a broker bug that throws an
exception when attempting to retrieve the current offsets, causing many consumers
to revert to the beginning of the topic (auto.offset.reset = earliest). The
latter can occur periodically and is unfortunately a recurring class of
bug (see: KAFKA-9824, KAFKA-9543, KAFKA-9807, KAFKA-9835, etc.)
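
To make the gap concrete, here is a small sketch with made-up numbers: a window that has zero lag for several commits and then resets to the start of the topic. The `Sample` type and both helper functions are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

// Sample is a hypothetical (offset, lag) pair for one commit in the window.
type Sample struct {
	Offset int64
	Lag    int64
}

// anyZeroLag applies only the quoted wiki rule: any zero lag in the window.
func anyZeroLag(window []Sample) bool {
	for _, s := range window {
		if s.Lag == 0 {
			return true
		}
	}
	return false
}

// offsetWentBackwards reports whether any commit has a lower offset than
// the commit immediately before it.
func offsetWentBackwards(window []Sample) bool {
	for i := 1; i < len(window); i++ {
		if window[i].Offset < window[i-1].Offset {
			return true
		}
	}
	return false
}

func main() {
	// Zero lag for three commits, then the consumer restarts from offset 0.
	window := []Sample{{5000, 0}, {5100, 0}, {5200, 0}, {0, 5300}, {40, 5260}}
	fmt.Println("zero-lag rule says OK:", anyZeroLag(window))
	fmt.Println("offsets went backwards:", offsetWentBackwards(window))
}
```

Both checks are true for this window, so the zero-lag rule on its own would report OK while the consumer is reprocessing the topic from the beginning.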

Addresses #714

@jyates jyates requested a review from bai as a code owner September 29, 2021 12:12
@jyates jyates force-pushed the allow-rewind-detection-when-previous-zero-lag branch from 06e0c3c to 6f71208 Compare September 29, 2021 12:15
@jyates jyates force-pushed the allow-rewind-detection-when-previous-zero-lag branch from 6f71208 to d69a8ec Compare September 29, 2021 13:42
@bai
Collaborator

bai commented Sep 30, 2021

Thanks for your contribution!

@bai bai merged commit eab12c5 into linkedin:master Sep 30, 2021