[MRG] helm chart: add readinessProbe and let probes be configurable #1242

Merged · 1 commit · Jan 9, 2021

Conversation

@consideRatio (Member) commented Jan 6, 2021

My diagnosis of #1237 is that the CI test fails because it curls the /version endpoint before the binder pod is ready to respond. Further, from testing in various dummy PRs, this outcome depends quite reliably on whether or not the binder image has been rebuilt recently.

When the binder image has been rebuilt recently, it will be available without a download step; such a download step risks queuing behind other image downloads etc., which makes pods start up in sequence. When pods start up in sequence, it's plausible that the binder pod starts late and won't get a chance to start up properly in time.

With this PR, we introduce a readinessProbe that will never make the pod Unready before the livenessProbe triggers a restart. Its purpose is therefore solely to ensure that the pod has started up properly before it accepts network traffic.

Benefits:

  • Provides a clear criterion for knowing whether the pod has started up properly, which is useful in a CI system; closes ci system following two merged PRs get stuck #1237.
  • Helps when multiple replicas are used, by ensuring traffic isn't directed to a recently started but not yet responsive binder pod.

Drawbacks:

  • Can clutter logs, but only if debug logging is used, because an exception is made for /health and /version so that successful requests to them are only logged in debug mode.
  • Can delay reported readiness by up to 5 seconds after actual readiness, because of periodSeconds: 5. This value is a compromise between not cluttering logs and not waiting too long for readiness on startup.
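
To make this concrete, here is a rough sketch of how such a probe could end up rendered into the binder Deployment's container spec. The HTTP path (/health) and the port name (binder) are assumptions for illustration; the chart's template and configurable values are the source of truth.

readinessProbe:
  httpGet:
    path: /health   # assumed endpoint; BinderHub also serves /version
    port: binder    # assumed named container port
  initialDelaySeconds: 0
  periodSeconds: 5
  failureThreshold: 1000  # rely on the livenessProbe to restart a pod stuck in a bad state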

@consideRatio changed the title from "helm chart: add readinessProbe and let probes be configurable" to "[MRG] helm chart: add readinessProbe and let probes be configurable" on Jan 6, 2021
enabled: true
initialDelaySeconds: 0
periodSeconds: 5
failureThreshold: 1000 # we rely on the liveness probe to resolve issues if needed
Member


For my education: what does the comment mean/try to say? Why not use a startup probe which seems to be made for exactly our case of "pod can take time to start up"?

I think setting the failure threshold to 1000 means that the readiness probe will never(*) become "not ready" again after it becomes ready for the first time.

Why not use an initial delay similar to the liveness probe, to cover the expected time the process needs to start up? (I think the initial delay only starts counting after the container image has been pulled, but I couldn't confirm that.)

(*) in reality it would become "not ready" if the probe kept failing for ~5000 s, so about 1.4 hours

@consideRatio (Member Author) commented Jan 7, 2021


I think setting the failure threshold to 1000 means that the readiness probe will never(*) become "not ready" again after it becomes ready for the first time.

Yepp!

Why not use an initial delay similar to the liveness probe, to cover the expected time the process needs to start up? (I think the initial delay only starts counting after the container image has been pulled, but I couldn't confirm that.)

An initial delay on the readinessProbe would delay the first probe. If you know that the pod's startup time is between 1 and 2 seconds, it could make sense to set the initial delay to, say, 2 seconds in order to optimize the time to become Ready. But if the container then happens to start up in 1 second, that initial delay has instead delayed the pod becoming Ready.
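
To illustrate that trade-off, a sketch with illustrative numbers (only the first block matches what this PR configures):

# probe immediately and retry every periodSeconds (what this PR does)
readinessProbe:
  initialDelaySeconds: 0
  periodSeconds: 5   # worst case: Ready is reported up to ~5s after actual readiness

# alternative: cover the expected startup time with an initial delay
readinessProbe:
  initialDelaySeconds: 2
  periodSeconds: 5   # a container that started after 1s still waits the full 2s for its first probe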


readinessProbe

  • Required to make a pod Ready.
  • A failure of this -> the container becomes Unready
  • Ready is required for the pod to receive traffic from k8s Services
  • Ready is required for kubectl rollout status --wait checks of a deployment resource.
  • A readinessProbe with a very large failureThreshold acts only as a gate on becoming Ready during startup; after that, the pod will in practice always remain Ready.
  • Letting a readinessProbe fail after the pod has become Ready would only serve to temporarily stop it from receiving traffic from k8s Services, perhaps to let it recover or similar, but that is only useful in specific situations which I don't think apply to our Binder pod.

livenessProbe

  • A failure of this -> the container gets restarted

startupProbe

  • k8s 1.16+ required
  • doesn't influence the pod's Ready status
  • serves only to complement the livenessProbe for slow startups; see the Kubernetes docs on startup probes (and the combined sketch after this list)
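
Putting the three probes side by side, a hedged sketch of how they could be combined in a pod spec; the endpoint, port name, and the liveness/startup numbers here are illustrative, not the chart's actual values:

containers:
  - name: binder
    # gates traffic from k8s Services and `kubectl rollout status --wait` on startup;
    # the large failureThreshold means it effectively never flips back to Unready
    readinessProbe:
      httpGet: {path: /health, port: binder}   # assumed endpoint and port name
      periodSeconds: 5
      failureThreshold: 1000
    # restarts the container if it stops responding after it has started
    livenessProbe:
      httpGet: {path: /health, port: binder}
      initialDelaySeconds: 10
      periodSeconds: 10
    # k8s 1.16+: holds back the livenessProbe until the app has started once (slow startups)
    startupProbe:
      httpGet: {path: /health, port: binder}
      periodSeconds: 5
      failureThreshold: 60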

Meaning of comment?
I meant to convey that we rely on the livenessProbe's functionality to recover from bad states, rather than on simply making the pod Unready, which can be unhelpful.

Reference
startupProbe and related settings are also discussed in jupyterhub/zero-to-jupyterhub-k8s#1732 (comment)

@consideRatio (Member Author)

Merging to close #1237 and test if the updated henchbot (henchbot/mybinder.org-upgrades#5) works as it should.
