Distributors not leaving the ring on shutdown when memberlist is used #2401

Closed
pracucci opened this issue Jul 13, 2022 · 3 comments · Fixed by #2418 or grafana/loki#6773

Comments

@pracucci
Collaborator

I noticed that when we roll out distributors, some of them are left in the ring until the auto-forget (introduced in #2154) triggers. We run distributors with -distributor.ring.heartbeat-timeout=4m, so the auto-forget triggers after 40 minutes (10x 4m).

The screenshot below shows 3 consecutive rollouts. In all cases, the number of distributors in the ring increases during the rollout and starts decreasing 40m after the rollout started, which is when the auto-forget triggers.

[Image: Screenshot 2022-07-13 at 14 18 34]

@pracucci
Collaborator Author

Some hypotheses I've considered so far.

[Discarded] Distributors are not configured to leave the ring on shutdown

I suspected that in #2154 we hadn't configured the BasicLifecycler to leave the ring on shutdown, but we did:
https://github.com/grafana/mimir/pull/2154/files?w=1#diff-2f9976bddbfa07b6aca5575607f9aef06f1e4c19638a3ab9a7b24aa323e581cdR85

Distributors are getting killed after terminationGracePeriodSeconds

The distributor has `terminationGracePeriodSeconds: 30`. Is it possible that we can't gracefully shut down distributors within 30s? I doubt it, but it's worth double-checking.

[Unlikely] Gossip messages are lost because too many distributors shutdown at once

I also thought it could be caused by shutting down too many distributors at once, causing the gossip messages about the distributor leaving the ring to be lost. However, we configure distributors with maxUnavailable: 1, so this looks unlikely:

  strategy:
    rollingUpdate:
      maxSurge: 5
      maxUnavailable: 1

[Discarded] Gossip messages are lost because distributor termination is too quick

What if distributors shut down so quickly that the gossip messages removing the distributor from the ring don't leave the distributor itself before it exits?

However, we try to wait until the broadcast queue is empty when shutting down the memberlist KV service: https://github.com/grafana/dskit/blob/99f3d0043c23665adadad60d53bd8be5a3a3e354/kv/memberlist/memberlist_client.go#L599

I checked distributor logs and there are no occurrences of "broadcast messages left in queue".

@pracucci
Collaborator Author

We found the issue. It's a bug in the memberlist library that causes our LEFT messages (used to remove an instance from the ring) to be dropped in a specific scenario. We're working on a fix.

pstibrany mentioned this issue Jul 14, 2022
@pstibrany
Member

The fix is available in hashicorp/memberlist#263 and has been merged into our fork as grafana/memberlist@09ffed8.
