
bug- race condition due to unused resource generation metadata/spec #10859

Closed
muang0 opened this issue Apr 14, 2022 · 6 comments · Fixed by #10920

Comments

@muang0 (Contributor) commented on Apr 14, 2022

I would like to propose a new 'ReadinessCheckDelay' argument for the helm 'install', 'upgrade', and 'rollback' commands, based on a race condition we are seeing between helm and a Kubernetes controller. We run helm install with the 'wait' and 'atomic' arguments, but helm does not correctly track resource readiness (we see this in roughly half of our installs that contain a StatefulSet). Reviewing the api-server logs, we found that helm issues both the patch call and the subsequent readiness get call before the StatefulSet controller has had a chance to update the resource. This produces a false positive in the readiness check on the StatefulSet, and helm incorrectly marks the install as completed successfully. The proposed 'ReadinessCheckDelay' argument would add a configurable delay between helm patching the resources and running the readiness checks (see the sketch at the end of this comment). Unfortunately I can't share logs because this is happening in my cluster at work (highly restrictive fintech). I'm happy to contribute this myself, assuming the community is open to the change. Thanks for the feedback :)

Output of helm version: 3.8.0

Output of kubectl version: 1.21.5

Cloud Provider/Platform (AKS, GKE, Minikube etc.): EKS
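
For illustration, here is a minimal Go sketch of the proposed delay. The names `readinessCheckDelay` and `waitForResourcesReady` are hypothetical stand-ins for Helm's real wait machinery, not existing Helm APIs.

```go
// Sketch only: the names below are hypothetical stand-ins, not Helm APIs.
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForResourcesReady stands in for Helm's existing --wait readiness polling.
func waitForResourcesReady(ctx context.Context) error {
	// ... poll the API server until every release resource reports ready ...
	return nil
}

// waitWithDelay sleeps for the proposed ReadinessCheckDelay before starting the
// readiness polling, giving controllers time to observe the patch and refresh
// the resource status.
func waitWithDelay(ctx context.Context, readinessCheckDelay time.Duration) error {
	select {
	case <-time.After(readinessCheckDelay):
	case <-ctx.Done():
		return ctx.Err()
	}
	return waitForResourcesReady(ctx)
}

func main() {
	if err := waitWithDelay(context.Background(), 10*time.Second); err != nil {
		fmt.Println("wait failed:", err)
	}
}
```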

@joejulian (Contributor)

If I'm understanding correctly, this affects the upgrade and rollback processes because, when a StatefulSet or Deployment is updated, it is still considered ready: its subresources (pods, ReplicaSets, etc.) have not been updated yet, so the resource's expectations are still fulfilled (3 of 3 pods ready, for instance), causing the wait to consider the resource ready.

Have you considered filing a Kubernetes bug for this? It feels like this should be handled in the API: when a Deployment, StatefulSet, etc. is patched, it seems like the readiness should perhaps be reset. I'd be really interested in seeing what sig-api-machinery thinks.
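
To make the race concrete, here is a minimal sketch of a replica-count-only readiness check, assuming the standard `k8s.io/api/apps/v1` types; the function name is hypothetical and this is not Helm's actual implementation.

```go
package readiness

import (
	appsv1 "k8s.io/api/apps/v1"
)

// naiveStatefulSetReady only compares replica counts from status. Immediately
// after helm patches the StatefulSet, the status can still describe the
// previous revision (the controller has not observed the new spec yet), so
// this check can return true before the rollout has even started.
func naiveStatefulSetReady(sts *appsv1.StatefulSet) bool {
	want := int32(1) // Kubernetes defaults spec.replicas to 1 when unset
	if sts.Spec.Replicas != nil {
		want = *sts.Spec.Replicas
	}
	return sts.Status.ReadyReplicas == want && sts.Status.UpdatedReplicas == want
}
```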

@joejulian (Contributor)

As for a proposal, that would start as a HIP (Helm Improvement Proposal).

@muang0 (Contributor, Author) commented on Apr 14, 2022

Thanks for the response! I just opened an issue against Kubernetes and tagged sig-api-machinery. Let's see what their analysis is and go from there 😄

@muang0 (Contributor, Author) commented on Apr 19, 2022

@joejulian Check out Jordan's response on the issue I opened against Kubernetes. Incorporating generation checks into the readiness logic seems like a great suggestion. It might also be worth reviewing whether additional resource types could benefit from generation checks. What are your thoughts on this idea and its potential as a HIP? Thanks for the help again!
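
For comparison with the earlier sketch, a generation-aware version of that check might look like the following. Again, the function name is hypothetical and this only illustrates the suggestion; it is not Helm's actual readiness code.

```go
package readiness

import (
	appsv1 "k8s.io/api/apps/v1"
)

// generationAwareStatefulSetReady first confirms that the StatefulSet
// controller has observed the current spec (status.observedGeneration has
// caught up with metadata.generation) before trusting the replica counts,
// closing the race window described earlier in this thread.
func generationAwareStatefulSetReady(sts *appsv1.StatefulSet) bool {
	if sts.Status.ObservedGeneration < sts.Generation {
		return false // status is stale; the patch has not been processed yet
	}
	want := int32(1) // Kubernetes defaults spec.replicas to 1 when unset
	if sts.Spec.Replicas != nil {
		want = *sts.Spec.Replicas
	}
	return sts.Status.ReadyReplicas == want && sts.Status.UpdatedReplicas == want
}
```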

@muang0 changed the title from "Proposal- 'ReadinessCheckDelay' argument" to "bug- race condition due to unused resource generation metadata/spec" on Apr 28, 2022
@github-actions (bot)

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

github-actions bot added the Stale label on Jul 28, 2022
@muang0 (Contributor, Author) commented on Jul 28, 2022

Bumping; this isn't stale.
