
Default settings of the Hub pod's readinessProbe are problematic #1732

Closed
consideRatio opened this issue Jul 24, 2020 · 6 comments

consideRatio (Member) commented Jul 24, 2020

We have a k8s readinessProbe defined, which means we declare the hub to be "Ready" to receive traffic only after it has passed the test of responding to an HTTP request. Currently this is only used as an indicator of a successful startup. For example, the CI tests should not try to send traffic to the hub pod before the hub pod has started.

Note that there is also a livenessProbe; if it fails, it triggers a restart of the container in the pod.

Making a pod not-ready at a later time doesn't make much sense - we don't have a pod to fall back to, and we cannot have multiple instances running. Because of this, the readinessProbe should be configured so that the hub pod becomes ready when it should, but never fails just because the hub was slow for a while.

Instead of disabling the readinessProbe entirely, I think we should try to make it not fail. This can be done using failureThreshold, for example: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#probe-v1-core
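For illustration, a readinessProbe that effectively never fails after first succeeding could look like this in the rendered pod spec (the failureThreshold value is illustrative, not an existing chart default):

```yaml
readinessProbe:
  httpGet:
    path: /hub/health
    port: http
  initialDelaySeconds: 0
  periodSeconds: 2
  # Illustrative value: the kubelet would need ~1000 consecutive failed
  # probes before flipping the pod back to a not-ready state.
  failureThreshold: 1000
```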

503 (Service Unavailable) errors

These happen when traffic is routed to a k8s Service that directs traffic to pods, but no pod is in a Ready status - which is influenced by the readinessProbe.

(screenshot: a 503 Service Unavailable error page)

Current default configuration and its use

```yaml
livenessProbe:
  enabled: false
  initialDelaySeconds: 60
  periodSeconds: 10
readinessProbe:
  enabled: true
  initialDelaySeconds: 0
  periodSeconds: 2
```

```yaml
{{- if .Values.hub.livenessProbe.enabled }}
# livenessProbe notes:
# We don't know how long hub database upgrades could take,
# so having a liveness probe could be a bit risky unless we put
# an initialDelaySeconds value with long enough margin for that
# to not be an issue. If it is too short, we could end up aborting
# database upgrades midway or ending up in an infinite restart
# loop.
livenessProbe:
  initialDelaySeconds: {{ .Values.hub.livenessProbe.initialDelaySeconds }}
  periodSeconds: {{ .Values.hub.livenessProbe.periodSeconds }}
  httpGet:
    path: {{ .Values.hub.baseUrl | trimSuffix "/" }}/hub/health
    port: http
{{- end }}
{{- if .Values.hub.readinessProbe.enabled }}
readinessProbe:
  initialDelaySeconds: {{ .Values.hub.readinessProbe.initialDelaySeconds }}
  periodSeconds: {{ .Values.hub.readinessProbe.periodSeconds }}
  httpGet:
    path: {{ .Values.hub.baseUrl | trimSuffix "/" }}/hub/health
    port: http
{{- end }}
```

@consideRatio consideRatio changed the title Default settings of the Hub pod's readinessProbe is problematic Default settings of the Hub pod's readinessProbe are problematic Jul 24, 2020
rmoe added a commit to rmoe/zero-to-jupyterhub-k8s that referenced this issue Sep 3, 2020
This allows one to set failureThreshold and timeoutSeconds
for the hub probes.

Related: jupyterhub#1732
@yuvipanda (Collaborator)

lololol, I just made #1800 and came here to file this exact issue :D

> Making a pod not-ready at a later time doesn't make much sense - we don't have a pod to fall back to, and we cannot have multiple instances running. Because of this, the readinessProbe should be configured so that the hub pod becomes ready when it should, but never fails just because the hub was slow for a while.

I agree! I think we should get rid of the readiness probe, and instead use a startupProbe
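For reference, a startupProbe (which requires k8s 1.16+) could look roughly like this; the timing values here are placeholders, not agreed-upon defaults:

```yaml
startupProbe:
  httpGet:
    path: /hub/health
    port: http
  periodSeconds: 10
  # Placeholder: allows up to 60 * 10s = 600s for the hub to start before
  # the container is killed and restarted per the pod's restartPolicy.
  failureThreshold: 60
```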

consideRatio (Member, Author) commented Oct 5, 2020

:D @yuvipanda I agree, but that will require k8s 1.16+, so we need to introduce a requirement on that or make it conditional until that's resolved.

+1 for adding a startupProbe Helm chart configuration with it enabled, but letting it be disabled for k8s clusters not supporting a startupProbe.

I'm not sure what I think about deleting the livenessProbe; it can serve a purpose to reset the hub pod if it gets stuck after startup, but I don't ever see that happening.

I think we should still have the readinessProbe running, as I believe it serves a separate function from both the livenessProbe and the startupProbe. We want the hub pod to quickly become k8s Pod Ready when it is, but we also don't want it to become not-ready after it has been ready at any time in the past, because that doesn't make sense for a single-replica deployment that doesn't even do rolling upgrades. But if we remove it, I think the pod will be Ready from second 0, which makes us unable to wait on its status as an indicator of the hub pod's successful startup. Perhaps the startupProbe also influences the readiness state, but I don't think so - I believe pods without a readinessProbe are considered Ready whenever they start running.

minrk (Member) commented Oct 6, 2020

I just learned about startupProbes a minute ago and came here to check if we already have an Issue for it and happy to see you folks are on it! ❤️

I agree 100% that we should use a startupProbe to solve the hard problem, whether that's version-checking or waiting until we can bump minimum supported k8s to 1.16 (probably soon?).

> I'm not sure what I think about deleting the livenessProbe; it can serve a purpose to reset the hub pod if it gets stuck after startup, but I don't ever see that happening.

It used to happen quite a bit, and this is part of what drove the Hub's internal consecutive-failure limit that prompts a restart (essentially the Hub's own liveness check, which still causes daily restarts on mybinder.org). So I think keeping a simple livenessProbe around that restarts a stuck hub once in a while is still a good thing. It is important that the liveness check doesn't fail just because the Hub is busy, though, since that's the last time we want to be restarting the Hub! For that reason, I'd be okay with disabling it by default, too.

> I think we should have the readinessProbe running still though, as I believe that is a separate function to both livenessProbe and startupProbe

From the docs I was just reading:

> Sometimes, applications are temporarily unable to serve traffic. For example, an application might need to load large data or configuration files during startup, or depend on external services after startup. In such cases, you don't want to kill the application, but you don't want to send it requests either. Kubernetes provides readiness probes to detect and mitigate these situations. A pod with containers reporting that they are not ready does not receive traffic through Kubernetes Services.

I think readinessProbes are not very useful when we have a startupProbe for the Hub, in the absence of replicas and rolling upgrades. Readiness is only useful then if we want a Hub pod to go back into an unready state at some point after the startupProbe has succeeded. I think we want either a readinessProbe or a startupProbe, but not both. The only use I can see for readiness then is if a flood of traffic results in a DOS of the Hub: if that causes a readiness failure, traffic might be stopped in kubernetes rather than continuing to pile up on the Hub. It's still a failure mode, but might behave better sometimes? I'm not sure.

@consideRatio (Member, Author)

It is my understanding that the startupProbe won't influence readiness at all, only delay the livenessProbe. In other words, the startupProbe won't do anything for us other than help us avoid potential issues during startup from a livenessProbe.

  1. It can be nice to still have a livenessProbe to recover from some unknown failure event.
  2. It can be nice to have a startupProbe as a complement to the livenessProbe configuration, allowing us to have separate logic during startup.
  3. It can be nice to have a readinessProbe so we can be informed by the readiness state, which for example lets us run kubectl rollout status deploy/hub, which acts like "sleep until hub is ready".
  4. It may make sense to not have a readinessProbe configured if we only have one replica anyhow, as there is no fallback. At the same time, it could be good to rely on the "service unavailable" response that a k8s Service returns when it has no ready pods to send traffic to, as a fallback that reduces the load so the hub can recover. If we want, we can set failureThreshold to 1000 and successThreshold to 1, which makes us almost never become unready after having been ready once.

To conclude, I think using all three makes sense, and that we should increase the readinessProbe's failureThreshold to a big number.
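A sketch of what that combined configuration could look like in the rendered pod spec (all values are illustrative, not final chart defaults):

```yaml
# startupProbe: gives the hub generous time to start (e.g. for database
# upgrades) and gates the livenessProbe until it has succeeded once.
startupProbe:
  httpGet:
    path: /hub/health
    port: http
  periodSeconds: 10
  failureThreshold: 60
# livenessProbe: restarts a hub that gets stuck after startup.
livenessProbe:
  httpGet:
    path: /hub/health
    port: http
  periodSeconds: 10
# readinessProbe: signals successful startup quickly, but with a huge
# failureThreshold it almost never flips back to not-ready afterwards.
readinessProbe:
  httpGet:
    path: /hub/health
    port: http
  periodSeconds: 2
  failureThreshold: 1000
  successThreshold: 1
```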

Notes on startupProbe not being related to readinessProbe

This is from the Kubernetes documentation, and I see no indication that a startupProbe would influence readiness.

> Thanks to the startup probe, the application will have a maximum of 5 minutes (30 * 10 = 300s) to finish its startup. Once the startup probe has succeeded once, the liveness probe takes over to provide a fast response to container deadlocks. If the startup probe never succeeds, the container is killed after 300s and subject to the pod's restartPolicy.
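The configuration the quoted passage describes corresponds roughly to the following (the path and port names are from the Kubernetes docs' example, not this chart):

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  # 30 allowed failures * 10s period = 300s maximum startup time
  failureThreshold: 30
  periodSeconds: 10
```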


consideRatio commented Oct 8, 2020

Action points to close this issue

@consideRatio (Member, Author)

Regarding the action point above to add a startupProbe in order to reduce or remove the livenessProbe's initialDelaySeconds: I don't find that important at all, and it doesn't have anything to do with the readinessProbe issues etc.

There is feature creep and issue creep, I guess :) This was a bit of issue creep - closing!
