Default settings of the Hub pod's readinessProbe are problematic #1732
This allows one to set failureThreshold and timeoutSeconds for the hub probes. Related: jupyterhub#1732
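A configuration like the one referenced could look roughly like this in the chart's `values.yaml` (a sketch only — the field names and defaults here are assumptions, not the chart's actual schema):

```yaml
hub:
  readinessProbe:
    enabled: true
    periodSeconds: 2
    timeoutSeconds: 1
    failureThreshold: 1000   # effectively never flip back to unready
  livenessProbe:
    enabled: true
    initialDelaySeconds: 300
    periodSeconds: 10
    timeoutSeconds: 3
```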
lololol, I just made #1800 and came here to file this exact issue :D
I agree! I think we should get rid of the readiness probe, and instead use a startupProbe
:D @yuvipanda I agree, but that will require k8s 1.16+, so we need to either introduce a requirement on that or make it conditional until it's resolved. +1 for adding a startupProbe Helm chart configuration that is enabled by default, but can be disabled for k8s clusters that don't support a startupProbe.

I'm not sure what I think about deleting the livenessProbe. It can serve a purpose in restarting the hub pod if it gets stuck after startup, though I've never seen that happen.

I think we should still keep the readinessProbe running, as I believe it serves a function separate from both the livenessProbe and the startupProbe. We want the hub pod to become k8s Pod Ready quickly when it is ready, but we also don't want it to go unready at any time after it has become ready, because that doesn't make sense for a single-replica deployment that doesn't even do rolling upgrades. However, if we remove the readinessProbe, I think the pod will be Ready from second 0, which makes us unable to use its status as an indicator of the hub pod's successful startup. Perhaps the startupProbe also influences the readiness state, but I don't think so — pods without a readinessProbe are considered Ready as soon as they start running, I believe.
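Making the startupProbe conditional on cluster version could be sketched in the deployment template like this (an illustrative fragment, not the chart's actual code; the `/hub/health` path and threshold values are assumptions):

```yaml
# Only render a startupProbe on clusters that support it (k8s 1.16+),
# using Helm's built-in Capabilities object for the version check.
{{- if semverCompare ">=1.16-0" .Capabilities.KubeVersion.Version }}
startupProbe:
  httpGet:
    path: /hub/health
    port: http
  periodSeconds: 2
  failureThreshold: 150   # tolerate up to ~5 minutes of startup time
{{- end }}
```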
I just learned about startupProbes a minute ago and came here to check if we already have an Issue for it and happy to see you folks are on it! ❤️ I agree 100% that we should use a startupProbe to solve the hard problem, whether that's version-checking or waiting until we can bump minimum supported k8s to 1.16 (probably soon?).
It used to happen quite a bit, and this is part of what drove the Hub's internal consecutive failure limit, which prompts a restart (essentially the Hub's own liveness check, and it still causes daily restarts on mybinder.org). So I think keeping a simple livenessProbe around that restarts a stuck hub once in a while is still a good thing. It is important that the liveness check doesn't fail just because the Hub is busy, though, since that's the last time we want to be restarting the Hub! For that reason, I'd be okay with disabling it by default, too.
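A livenessProbe tuned along those lines might look like this (values are illustrative assumptions — the point is that a merely busy Hub should never trip it, only one stuck for minutes):

```yaml
livenessProbe:
  httpGet:
    path: /hub/health   # assumed health endpoint
    port: http
  periodSeconds: 10
  timeoutSeconds: 3     # generous, so a slow response isn't a failure
  failureThreshold: 30  # ~5 minutes of consecutive failures before restart
```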
From the docs I was just reading:
I think readinessProbes are not very useful when we have a startupProbe for the Hub, in the absence of replicas and rolling upgrades. Readiness is only useful then if we want a Hub pod to go back into an unready state at some point after becoming ready following the startupProbe's success. I think we want either a readinessProbe or a startupProbe, but not both. The only use I can see for readiness then is if there's a flood of traffic resulting in a DOS of the Hub. If that causes a readiness failure, traffic might be stopped by kubernetes rather than continuing to pile up on the Hub. It's still a failure mode, but it might behave better sometimes? I'm not sure.
It is my understanding that the startupProbe won't influence readiness at all, only delay the livenessProbe. In other words, the startupProbe won't do anything for us other than help us avoid potential issues during startup from a livenessProbe.
To conclude, I think using all three makes sense, and that we should increase the readinessProbe's failureThreshold to a big number.

Notes on startupProbe not being related to readinessProbe

This is documentation from the kubernetes docs, and I see no indication that a startupProbe would influence readiness.
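Using all three together could be sketched like this (an illustrative fragment; the paths and numbers are assumptions, not agreed-upon defaults):

```yaml
startupProbe:           # gates the other probes until the hub has started
  httpGet: {path: /hub/health, port: http}
  periodSeconds: 2
  failureThreshold: 150
readinessProbe:         # marks the pod Ready quickly once it responds
  httpGet: {path: /hub/health, port: http}
  periodSeconds: 2
  failureThreshold: 1000  # effectively never flips back to unready
livenessProbe:          # restarts only a hub that is stuck for a long time
  httpGet: {path: /hub/health, port: http}
  periodSeconds: 10
  failureThreshold: 30
```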
Action points to close this issue
While there is an action point above to add a startupProbe in order to reduce/remove the livenessProbe's initialDelaySeconds, I don't find that important at all, and it doesn't have anything to do with the readinessProbe issues. There is feature creep and issue creep, I guess :) This was a bit of issue creep — closing!
We have a k8s readinessProbe defined, which means that we declare the hub to be "Ready" to receive traffic only when it has passed the test of responding to an HTTP request. Currently this is only used as an indicator of successful startup. For example, the CI tests should not try to send traffic to the hub pod before it has started.
Making the pod not-ready at a later time doesn't make much sense — we don't have another pod to fall back to, and we cannot run multiple instances. Because of this, the readinessProbe should be configured so that the hub pod becomes ready when it should, but never fails just because the hub was slow for a while.
Instead of disabling the readinessProbe entirely, I think we should try to make it not fail. This can be done using failureThreshold, for example: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#probe-v1-core
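For example, a readinessProbe along these lines (a sketch with assumed values) would report readiness quickly at startup but take an extremely long run of consecutive failures before ever flipping the pod back to unready:

```yaml
readinessProbe:
  httpGet:
    path: /hub/health   # assumed health endpoint
    port: http
  periodSeconds: 2
  failureThreshold: 1000  # ~33 minutes of back-to-back failures required
```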
503 (Service unavailable) errors
These happen when traffic is routed to a k8s Service that directs traffic to pods, but no pod is in a Ready status — which is influenced by the readinessProbe.
Current default configuration and its use
zero-to-jupyterhub-k8s/jupyterhub/values.yaml
Lines 90 to 97 in 0c88d30
zero-to-jupyterhub-k8s/jupyterhub/templates/hub/deployment.yaml
Lines 224 to 246 in d080b1d