Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Hash ring only with the in-ready-status replicas of Statefulset #70

Closed
wants to merge 2 commits into from

Conversation

spaparaju
Copy link

@spaparaju spaparaju commented Mar 22, 2021

Currently Hash ring is getting updated based on the .spec of the statefulset that is being watched. There are few edge cases where Hash ring is updated pre maturely (eg: replicas take some time to come up) / Hash ring is updated with incorrect replicas if the scaling of statefulset would not succeed (eg: not enough resources on the cluster). Downside of this behaviour is that requests to statefulsets like Thanos-default-receive result in temporary (with successful scaling up of a statefulset) / permanent (with unsuccessful scaling up of a statefulset) HTTP 500s.

This fix update Hash ring only with the replicas of the statefulset which are in 'Ready' status.

@spaparaju
Copy link
Author

As this PR updates hash ring based on replicas in the 'Ready state', understandably tests are failing as the current tests are performed against a 'fake cluster'.

Copy link
Member

@kakkoyun kakkoyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should retrofit the tests to support this case.

Also what happens while we rollout? The number of ready replicas changing constantly, do we react to that? Or do we only react to the configmap changes?

@bwplotka
Copy link
Member

And let's think about the typical scenario of single node getting down for a while (restart,crash). We might not want to retrigger whole hashring update and causing cascading error 🤗

Maybe there is something in between we could do?

@bwplotka
Copy link
Member

cc @brancz

@metalmatze
Copy link
Contributor

As part of an effort to make auto scaling more reliable I happened to work on the same stuff. PR incoming.

@metalmatze
Copy link
Contributor

I've commented on that topic on the CNCF Slack in #thanos-dev. Trying not to repeat the findings please allow me to simply link you there: https://cloud-native.slack.com/archives/CL25937SP/p1616512341031300

@metalmatze
Copy link
Contributor

I've opened my attempt at solving the problem in #71 Seems like we've accidentally work on this at the same time. Sorry.

@spaparaju
Copy link
Author

closing this PR in favor of #75 to address the scenario of Hashring contain endpoints of replicas in ready status even if scaling up statefulsets do not reach intentended # of replicas

@spaparaju spaparaju closed this Apr 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants