
Default settings of the Hub pod's readinessProbe are problematic #1732

Closed
consideRatio opened this issue Jul 24, 2020 · 6 comments

consideRatio (Member) commented Jul 24, 2020

We have a k8s readinessProbe defined, which means we declare the hub to be "Ready" to receive traffic only after it has passed the test of responding to an HTTP request. Currently this is only used as an indicator of a successful startup. For example, the CI tests should not try to send traffic to the hub pod before the hub pod has started.

Note that there is also a livenessProbe; if it fails, it triggers a restart of the container in the pod.

Making a pod not-ready at a later time doesn't make much sense - we don't have a pod to fall back to, and we cannot have multiple instances running. Because of this, the readinessProbe should be configured so that the hub pod becomes ready when it should, but never fails just because the hub was slow for a while.

Instead of disabling the readinessProbe entirely, I think we should try to make it not fail. This can be done using failureThreshold, for example: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#probe-v1-core
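For illustration, a readinessProbe that effectively never fails after first succeeding could look like this in the rendered pod spec (the failureThreshold value is illustrative, not an existing chart default):

```yaml
readinessProbe:
  httpGet:
    path: /hub/health
    port: http
  initialDelaySeconds: 0
  periodSeconds: 2
  # Illustrative value: the kubelet would need ~1000 consecutive failed
  # probes before flipping the pod back to a not-ready state.
  failureThreshold: 1000
```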

503 (Service Unavailable) errors

These happen when traffic is routed to a k8s Service that directs traffic to pods, but no pod is in a Ready status - which is influenced by the readinessProbe.

(screenshot: a 503 Service Unavailable error page)

Current default configuration and its use

```yaml
livenessProbe:
  enabled: false
  initialDelaySeconds: 60
  periodSeconds: 10
readinessProbe:
  enabled: true
  initialDelaySeconds: 0
  periodSeconds: 2
```

```yaml
{{- if .Values.hub.livenessProbe.enabled }}
# livenessProbe notes:
# We don't know how long hub database upgrades could take,
# so having a liveness probe could be a bit risky unless we put
# an initialDelaySeconds value with long enough margin for that
# to not be an issue. If it is too short, we could end up aborting
# database upgrades midway or ending up in an infinite restart
# loop.
livenessProbe:
  initialDelaySeconds: {{ .Values.hub.livenessProbe.initialDelaySeconds }}
  periodSeconds: {{ .Values.hub.livenessProbe.periodSeconds }}
  httpGet:
    path: {{ .Values.hub.baseUrl | trimSuffix "/" }}/hub/health
    port: http
{{- end }}
{{- if .Values.hub.readinessProbe.enabled }}
readinessProbe:
  initialDelaySeconds: {{ .Values.hub.readinessProbe.initialDelaySeconds }}
  periodSeconds: {{ .Values.hub.readinessProbe.periodSeconds }}
  httpGet:
    path: {{ .Values.hub.baseUrl | trimSuffix "/" }}/hub/health
    port: http
{{- end }}
```

@consideRatio consideRatio changed the title Default settings of the Hub pod's readinessProbe is problematic Default settings of the Hub pod's readinessProbe are problematic Jul 24, 2020
rmoe added a commit to rmoe/zero-to-jupyterhub-k8s that referenced this issue Sep 3, 2020
This allows one to set failureThreshold and timeoutSeconds
for the hub probes.

Related: jupyterhub#1732
@yuvipanda (Collaborator)

lololol, I just made #1800 and came here to file this exact issue :D

> Making a pod not-ready at a later time doesn't make much sense - we don't have a pod to fall back to, and we cannot have multiple instances running. Because of this, the readinessProbe should be configured so that the hub pod becomes ready when it should, but never fails just because the hub was slow for a while.

I agree! I think we should get rid of the readiness probe, and instead use a startupProbe
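For reference, a startupProbe (which requires k8s 1.16+) could look roughly like this; the timing values here are placeholders, not agreed-upon defaults:

```yaml
startupProbe:
  httpGet:
    path: /hub/health
    port: http
  periodSeconds: 10
  # Placeholder: allows up to 60 * 10s = 600s for the hub to start before
  # the container is killed and restarted per the pod's restartPolicy.
  failureThreshold: 60
```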

consideRatio (Member, Author) commented Oct 5, 2020

:D @yuvipanda I agree, but that will require k8s 1.16+, so we need to introduce a requirement on that or make it conditional until that's resolved.

+1 for adding a startupProbe Helm chart configuration with it enabled, but letting it be disabled for k8s clusters not supporting a startupProbe.

I'm not sure what I think about deleting the livenessProbe; it can serve a purpose to reset the hub pod if it gets stuck after startup, but I don't ever see that happening.

I think we should still have the readinessProbe running, as I believe it serves a separate function from both the livenessProbe and the startupProbe. We want the hub pod to quickly become k8s Pod Ready when it is, but we also don't want it to become not-ready after it has been ready at any time in the past, because that doesn't make sense for a single-replica deployment that doesn't even do rolling upgrades. But if we remove it, I think the pod will be Ready from second 0, which makes us unable to wait on its status as an indicator of the hub pod's successful startup. Perhaps the startupProbe also influences the readiness state, but I don't think so - I believe pods without a readinessProbe are considered Ready whenever they start running.

minrk (Member) commented Oct 6, 2020

I just learned about startupProbes a minute ago and came here to check if we already have an Issue for it and happy to see you folks are on it! ❤️

I agree 100% that we should use a startupProbe to solve the hard problem, whether that's version-checking or waiting until we can bump minimum supported k8s to 1.16 (probably soon?).

> I'm not sure what I think about deleting the livenessProbe; it can serve a purpose to reset the hub pod if it gets stuck after startup, but I don't ever see that happening.

It used to happen quite a bit, and this is part of what drove the Hub's internal consecutive-failure limit that prompts a restart (essentially the Hub's own liveness check, which still causes daily restarts on mybinder.org). So I think keeping a simple livenessProbe around that restarts a stuck hub once in a while is still a good thing. It is important that the liveness check doesn't fail just because the Hub is busy, though, since that's the last time we want to be restarting the Hub! For that reason, I'd be okay with disabling it by default, too.

> I think we should have the readinessProbe running still though, as I believe that is a separate function to both livenessProbe and startupProbe

From the docs I was just reading:

> Sometimes, applications are temporarily unable to serve traffic. For example, an application might need to load large data or configuration files during startup, or depend on external services after startup. In such cases, you don't want to kill the application, but you don't want to send it requests either. Kubernetes provides readiness probes to detect and mitigate these situations. A pod with containers reporting that they are not ready does not receive traffic through Kubernetes Services.

I think readinessProbes are not very useful when we have a startupProbe for the Hub, in the absence of replicas and rolling upgrades. Readiness is only useful then if we want a Hub pod to go back into an unready state at some point after the startupProbe has succeeded. I think we want either a readinessProbe or a startupProbe, but not both. The only use I can see for readiness then is if a flood of traffic results in a DOS of the Hub: if that causes a readiness failure, traffic might be stopped in kubernetes rather than continuing to pile up on the Hub. It's still a failure mode, but might behave better sometimes? I'm not sure.

@consideRatio (Member, Author)

It is my understanding that the startupProbe won't influence readiness at all, only delay the livenessProbe. In other words, the startupProbe won't do anything for us other than help us avoid potential issues during startup from a livenessProbe.

  1. It can be nice to still have a livenessProbe to recover from some unknown failure event.
  2. It can be nice to have a startupProbe as a complement to the livenessProbe configuration, allowing us to have separate logic during startup.
  3. It can be nice to have a readinessProbe so we can be informed by the readiness state, which for example lets us run kubectl rollout status deploy/hub, which acts like "sleep until hub is ready".
  4. It may make sense to not have a readinessProbe configured if we only have one replica anyhow, as there is no fallback. At the same time, it could be good to rely on the "service unavailable" response that a k8s Service returns when it has no ready pods to send traffic to, as a fallback that reduces the load so the hub can recover. If we want, we can set failureThreshold to 1000 and successThreshold to 1, which makes us almost never become unready after having been ready once.

To conclude, I think using all three makes sense, and that we should increase the readinessProbe's failureThreshold to a big number.
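A sketch of what that combined configuration could look like in the rendered pod spec (all values are illustrative, not final chart defaults):

```yaml
# startupProbe: gives the hub generous time to start (e.g. for database
# upgrades) and gates the livenessProbe until it has succeeded once.
startupProbe:
  httpGet:
    path: /hub/health
    port: http
  periodSeconds: 10
  failureThreshold: 60
# livenessProbe: restarts a hub that gets stuck after startup.
livenessProbe:
  httpGet:
    path: /hub/health
    port: http
  periodSeconds: 10
# readinessProbe: signals successful startup quickly, but with a huge
# failureThreshold it almost never flips back to not-ready afterwards.
readinessProbe:
  httpGet:
    path: /hub/health
    port: http
  periodSeconds: 2
  failureThreshold: 1000
  successThreshold: 1
```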

Notes on startupProbe not being related to readinessProbe

This is from the Kubernetes documentation, and I see no indication that a startupProbe would influence readiness.

> Thanks to the startup probe, the application will have a maximum of 5 minutes (30 * 10 = 300s) to finish its startup. Once the startup probe has succeeded once, the liveness probe takes over to provide a fast response to container deadlocks. If the startup probe never succeeds, the container is killed after 300s and subject to the pod's restartPolicy.
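The configuration the quoted passage describes corresponds roughly to the following (the path and port names are from the Kubernetes docs' example, not this chart):

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  # 30 allowed failures * 10s period = 300s maximum startup time
  failureThreshold: 30
  periodSeconds: 10
```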


consideRatio commented Oct 8, 2020

Action points to close this issue

@consideRatio (Member, Author)

Regarding the action point above to add a startupProbe in order to reduce or remove the livenessProbe's initialDelaySeconds: I don't find that important at all, and it doesn't have anything to do with the readinessProbe issues etc.

There is feature creep and issue creep, I guess :) This was a bit of issue creep - closing!
