[MRG] helm chart: add readinessProbe and let probes be configurable #1242

Merged · 1 commit · Jan 9, 2021

Conversation

@consideRatio (Member) commented Jan 6, 2021

My diagnosis of #1237 is that the CI test fails because it curls the /version endpoint before the binder pod is ready to respond. Further, from testing in various dummy PRs, this outcome depends quite reliably on whether or not the binder image has been rebuilt recently.

When the binder image has been rebuilt recently, it will be available without a download step; such a download step risks queuing behind other image downloads etc., which makes pods start up in sequence. When pods start up in sequence, it's plausible that the binder pod starts late and won't get a chance to start up properly in time.

With this PR, we introduce a readinessProbe that will never make the pod Unready before the livenessProbe triggers a restart. Its purpose is therefore solely to ensure that the pod has started up properly before it accepts network traffic.

Benefits:

  • Provides a clear criterion for knowing whether the pod has started up properly, which is useful in a CI system; closes ci system following two merged PRs get stuck #1237.
  • Helps when multiple replicas are used, by ensuring traffic isn't directed to a recently started but not yet responsive binder pod.

Drawbacks:

  • Can clutter logs, but only if debug logging is used, because an exception is made for /health and /version so that successful requests to them are only logged in debug mode.
  • Can delay reported readiness by up to 5 seconds after actual readiness, because of periodSeconds: 5. This value is a compromise between not cluttering logs and not waiting too long for readiness on startup.
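
To make this concrete, here is a rough sketch of how such a probe could end up rendered into the binder Deployment's container spec. The HTTP path (/health) and the port name (binder) are assumptions for illustration; the chart's template and configurable values are the source of truth.

readinessProbe:
  httpGet:
    path: /health   # assumed endpoint; BinderHub also serves /version
    port: binder    # assumed named container port
  initialDelaySeconds: 0
  periodSeconds: 5
  failureThreshold: 1000  # rely on the livenessProbe to restart a pod stuck in a bad state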

@consideRatio changed the title from "helm chart: add readinessProbe and let probes be configurable" to "[MRG] helm chart: add readinessProbe and let probes be configurable" on Jan 6, 2021
enabled: true
initialDelaySeconds: 0
periodSeconds: 5
failureThreshold: 1000 # we rely on the liveness probe to resolve issues if needed
Member


For my education: what does the comment mean/try to say? Why not use a startup probe which seems to be made for exactly our case of "pod can take time to start up"?

I think setting the failure threshold to 1000 means that the readiness probe will never(*) become "not ready" again after it becomes ready for the first time.

Why not use an initial delay similar to the liveness probe, to cover the expected time the process needs to start up? (I think the initial delay only starts counting after the container image has been pulled, but I couldn't confirm that.)

(*) in reality it would become "not ready" if the probe kept failing for ~5000 s, so about 1.4 hours

@consideRatio (Member Author) commented Jan 7, 2021


I think setting the failure threshold to 1000 means that the readiness probe will never(*) become "not ready" again after it becomes ready for the first time.

Yepp!

Why not use an initial delay similar to the liveness probe, to cover the expected time the process needs to start up? (I think the initial delay only starts counting after the container image has been pulled, but I couldn't confirm that.)

An initial delay on the readinessProbe would delay the first probe. If you know that the pod's startup time is between 1 and 2 seconds, it could make sense to set the initial delay to, say, 2 seconds in order to optimize the time to become Ready. But if the container then happens to start up in 1 second, that initial delay has instead delayed the pod becoming Ready.
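
To illustrate that trade-off, a sketch with illustrative numbers (only the first block matches what this PR configures):

# probe immediately and retry every periodSeconds (what this PR does)
readinessProbe:
  initialDelaySeconds: 0
  periodSeconds: 5   # worst case: Ready is reported up to ~5s after actual readiness

# alternative: cover the expected startup time with an initial delay
readinessProbe:
  initialDelaySeconds: 2
  periodSeconds: 5   # a container that started after 1s still waits the full 2s for its first probe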


readinessProbe

  • Required to make a pod Ready.
  • A failure of this -> the container becomes Unready
  • Ready is required for the pod to receive traffic from k8s Services
  • Ready is required for kubectl rollout status --wait checks of a deployment resource.
  • A readinessProbe with a very large failureThreshold acts only as a gate on becoming Ready during startup; after that, the pod will in practice always remain Ready.
  • Letting a readinessProbe fail after the pod has become Ready would only serve to temporarily stop it from receiving traffic from k8s Services, perhaps to let it recover or similar, but that is only useful in specific situations which I don't think apply to our Binder pod.

livenessProbe

  • A failure of this -> the container gets restarted

startupProbe

  • k8s 1.16+ required
  • doesn't influence the pod's Ready status
  • serves only to complement the livenessProbe for slow startups; see the Kubernetes docs on startup probes (and the combined sketch after this list)
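
Putting the three probes side by side, a hedged sketch of how they could be combined in a pod spec; the endpoint, port name, and the liveness/startup numbers here are illustrative, not the chart's actual values:

containers:
  - name: binder
    # gates traffic from k8s Services and `kubectl rollout status --wait` on startup;
    # the large failureThreshold means it effectively never flips back to Unready
    readinessProbe:
      httpGet: {path: /health, port: binder}   # assumed endpoint and port name
      periodSeconds: 5
      failureThreshold: 1000
    # restarts the container if it stops responding after it has started
    livenessProbe:
      httpGet: {path: /health, port: binder}
      initialDelaySeconds: 10
      periodSeconds: 10
    # k8s 1.16+: holds back the livenessProbe until the app has started once (slow startups)
    startupProbe:
      httpGet: {path: /health, port: binder}
      periodSeconds: 5
      failureThreshold: 60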

Meaning of comment?
I meant to convey that we rely on the livenessProbe's functionality to recover from bad states, rather than on simply making the pod Unready, which can be unhelpful.

Reference
startupProbe and related settings are also discussed in jupyterhub/zero-to-jupyterhub-k8s#1732 (comment)

@consideRatio (Member Author)

Merging to close #1237 and test if the updated henchbot (henchbot/mybinder.org-upgrades#5) works as it should.
