Broker deploy triggers brief REWIND state. #150

Closed
rgevers opened this issue Nov 3, 2016 · 6 comments

@rgevers

rgevers commented Nov 3, 2016

I haven't had a chance to dig into the code and verify this yet, but during one of our broker deploys a large number of partitions and consumer groups very briefly reported REWIND errors. I don't believe we actually committed earlier offsets. Some messages were definitely replayed because offset commits failed while the consumer group was rebalancing, but Burrow shouldn't have been aware of any of that if the commits didn't succeed, and the consumer shouldn't have replayed work that it did commit successfully.

The other flag that made us think this might be an edge case in the Burrow implementation is that the starting and ending offsets don't look right for a REWIND. For example:

{
    topic: "TopicName",
    partition: 11,
    status: "REWIND",
    start: {
        offset: 6413708,
        timestamp: 1478147779590,
        lag: 0
    },
    end: {
        offset: 6413737,
        timestamp: 1478147856000,
        lag: 0
    }
}

No lag at either time and the ending offset is larger than the starting offset. Unless there is some intermediate sample that is hidden here, I am not sure how that could happen.

@toddpalino
Contributor

The problem is going to be that there's an intermediate sample that we can't view using the current interface. I've been meaning to add an endpoint that dumps all the data for a group for debugging things like this.
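
To illustrate how a hidden intermediate sample could produce the status above, here is a minimal Python sketch. It is not Burrow's evaluation code: the `Sample` type, the `find_rewind` helper, and the rule "any offset lower than the one before it counts as a rewind" are assumptions made for illustration. The first and last samples are the start and end values from the report; the middle sample is invented.

```python
from typing import List, NamedTuple, Optional


class Sample(NamedTuple):
    """One stored offset commit for a partition (hypothetical structure)."""
    offset: int
    timestamp: int  # milliseconds since the epoch
    lag: int


def find_rewind(window: List[Sample]) -> Optional[int]:
    """Return the index of the first sample whose offset is lower than the
    sample before it, or None if offsets never go backwards."""
    for i in range(1, len(window)):
        if window[i].offset < window[i - 1].offset:
            return i
    return None


window = [
    Sample(offset=6413708, timestamp=1478147779590, lag=0),  # reported start
    Sample(offset=6413700, timestamp=1478147820000, lag=0),  # invented intermediate sample
    Sample(offset=6413737, timestamp=1478147856000, lag=0),  # reported end
]

rewind_index = find_rewind(window)
endpoints_increase = window[-1].offset > window[0].offset
print(rewind_index, endpoints_increase)
```

Under these assumptions, the check flags the middle sample even though the end offset is larger than the start offset, which is the combination a start/end-only view cannot explain.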

@rgevers
Author

rgevers commented Nov 24, 2016

I have been planning to recreate the scenario to troubleshoot further but haven't had a chance, and it has occurred only once in the last handful of deploys. Having a view of all of the samples would definitely help; that is something I also planned to look at but haven't yet.

@rkling01

rkling01 commented Aug 1, 2017

Thought I'd add to this, since I've been seeing fairly frequent REWIND statuses lately on the MirrorMaker consumer groups we monitor. The conditions are the same as described above: the ending offset is larger than the starting offset, so it doesn't appear to be a real issue. I'm looking into this further as well, since we don't see any specific problems in those processes when the REWIND status is triggered.

@toddpalino
Contributor

I'm going to add that more detailed endpoint in 1.0, which will make it easier to debug what's actually going on. We see periodic rewinds as well, especially on heavily loaded clusters.

@toddpalino toddpalino self-assigned this Sep 28, 2017
@toddpalino toddpalino added this to the burrow-1.0 milestone Sep 28, 2017
@rgevers
Author

rgevers commented Sep 29, 2017

We've continued to tune things for our cluster and seem to see fewer rewinds where the offsets appear to be growing (as in my example). We still see rewinds, but the most common case now appears to involve an actual decrease in committed offsets. Extra details would definitely help in pinning that down, but it does seem like we are also dealing with legitimate rewinds when brokers are deployed.

@toddpalino
Contributor

Resolved by 1.0 rewrite

3 participants