-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] helm chart: add readinessProbe and let probes be configurable #1242
[MRG] helm chart: add readinessProbe and let probes be configurable #1242
Conversation
enabled: true | ||
initialDelaySeconds: 0 | ||
periodSeconds: 5 | ||
failureThreshold: 1000 # we rely on the liveness probe to resolve issues if needed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my education: what does the comment mean/try to say? Why not use a startup probe which seems to be made for exactly our case of "pod can take time to start up"?
I think setting the failure threshold to 1000 means that the readiness probe will never(*) become "not ready" again after it becomes ready for the first time.
Why not use an initial delay similar to the liveness probe to cover the expected time the process needs to start up (I think the initial delay only starts counting after the container image has been pulled but couldn't confirm that).
(*) in reality it would become "not ready" if the probe fails for ~5000s, so about 1.3hours
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think setting the failure threshold to 1000 means that the readiness probe will never(*) become "not ready" again after it becomes ready for the first time.
Yepp!
Why not use an initial delay similar to the liveness probe to cover the expected time the process needs to start up (I think the initial delay only starts counting after the container image has been pulled but couldn't confirm that).
An initial delay of the readinessProbe would delay the first probe. If you know that the pod startup time is between 1-2 seconds, then it could make sense to set the initial delay to 2 seconds for example in order to optimize the speed to become ready. If you would set the startup time to 1 second, and it typically takes 1-2 seconds to startup the container, then you delayed the speed of getting the pod Ready initially.
readinessProbe
- Required to make a pod Ready.
- A failure of this -> the container is becoming Unready
- Ready is required for the pod to receive traffic from k8s Services
- Ready is required for
kubectl rollout status --wait
checks of a deployment resource. - A readinessProbe with a very lrage failureThreshold makes it act as a delay to become Ready during startup, but will then always remain Ready.
- To let a readinessProbe fail following having become ready, the purpose would be to temporarily remove it from receiving traffic from k8s Services, perhaps to recover or similar, but this is only useful in specific situations I don't think is applicable to our Binder pod.
livenessProbe
- A failure of this -> the container is getting restarted
startupProbe
- k8s 1.16+ required
- doesn't influence the pod's Ready status
- serves the purpose only to complement the livenessProbe for slow startups, see this docs
Meaning of comment?
I meant to convey that we rely on the livenessProbe's functionality to recover from bad states, not by simply making the pod Unready which can be unhelpful.
Reference
startupProbe and such is discussed also in jupyterhub/zero-to-jupyterhub-k8s#1732 (comment)
Merging to close #1237 and test if the updated henchbot (henchbot/mybinder.org-upgrades#5) works as it should. |
jupyterhub/binderhub#1242 Merge pull request #1242 from consideRatio/pr/readiness-probe
My diagnosis of #1237 is that for whatever reason, the ci test fails because it is
curl
ing the /version endpoint before it is ready. Further, from testing in various dummy PRs, that this is an outcome influenced quite reliably based on if the binder image has been recently rebuilt or not.When the binder image has been rebuilt recently, it means that it will be available without a download step that risks queing behind other image downloads etc, which makes pods startup in sequence. When pods startup in sequence, its plausible that the binder pod starts late and won't get a chance to startup properly.
With this PR, we introduce a readinessProbe that will never make the pod unready before the livenessProbe triggers a restart. Due to this, the purpose becomes solely to ensure that the pod is started properly before accepting network traffic.
Benefits:
Drawbacks:
periodSeconds: 5
of actual readiness. This value is a compromise of not cluttering logs and not waiting too long for readiness on startup.