Downscale webhook fails when currently upscaling #72

Open
jhalterman opened this issue Jul 18, 2023 · 2 comments

Comments

@jhalterman
Member

jhalterman commented Jul 18, 2023

If a pod is still being created for a statefulset, the downscale webhook rejects any change to the resource that would reduce its replica count:

level=error ts=2023-07-18T01:23:33.406992007Z name=ingester-zone-a resource=statefulsets namespace=mimir-dev-11 request_gvk="apps/v1, Kind=StatefulSet" old_replicas=225 new_replicas=5 msg="downscale not allowed due to error" err="Post "http://ingester-zone-a-218.ingester-zone-a.mimir-dev-11.svc.cluster.local:80/ingester/prepare-shutdown": dial tcp: lookup ingester-zone-a-218.ingester-zone-a.mimir-dev-11.svc.cluster.local on 10.188.0.10:53: no such host"

This was discovered when an HPA was scaling up too aggressively. The attempt to revert the change that caused it was rejected by the downscale webhook because the statefulset was still upscaling.

@jhalterman jhalterman changed the title from "Downscale webhook fails if any pod is being created" to "Downscale webhook fails when currently upscaling" on Jul 18, 2023
@56quarters
Contributor

What would be the correct behavior in this case? Did the prepare-shutdown call eventually succeed once the pod started?

@jhalterman
Member Author

> What would be the correct behavior in this case?

We could ignore "no such host" errors when performing this check, since they imply the pod was never running in the first place. This might not be a perfect solution, but it would at least be an improvement.

> Did the prepare-shutdown call eventually succeed once the pod started?

Yes, it would eventually succeed for a given pod. But in this scenario the HPA was regularly creating new pods, so the next attempt to change the resource would hit the same error on a newer pod.
