Readiness probe fails if the container restarts after a liveness probe fail #12571

Shashankft9 · 2022-01-31T12:38:25Z

What version of Knative?

v1.2

Expected Behavior

After a liveness probe fail, the container should restart and ideally should start serving the traffic again, just like how it would happen for a k8s deployment.

Actual Behavior

After the restart from liveness probe fail, the pod starts to fail with readiness probe, and serves no traffic.

Steps to Reproduce the Problem

ksvc:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: healthprobes-go
spec:
  template:
    spec:
      containers:
        - image: shashankft/healthprobes@sha256:cc1cb4323e4d1cdee62e3ae7623ff6a213db72e89c17ad4772c39fe841b98fb9
          env:
            - name: TARGET
              value: "Go Sample v1"
          livenessProbe:
            httpGet:
              path: /healthz/liveness
              port: 0
            periodSeconds: 15
            failureThreshould: 1

the image here is produced from the code here: https://github.com/knative/serving/blob/db4c85b641703f94a700ada8cc074d28b423b5eb/test/test_images/healthprobes/health_probes.go

Hit the /start-failing path

relevant events of the ksvc pod:

  Normal   Created    2m                 kubelet            Created container queue-proxy
  Normal   Started    2m                 kubelet            Started container queue-proxy
  Normal   Killing    60s                kubelet            Container user-container failed liveness probe, will be restarted
  Warning  Unhealthy  60s (x3 over 90s)  kubelet            Liveness probe failed: HTTP probe failed with statuscode: 500
  Normal   Started    9s (x2 over 2m)    kubelet            Started container user-container
  Normal   Created    9s (x2 over 2m)    kubelet            Created container user-container
  Normal   Pulled     9s                 kubelet            Container image "shashankft/healthprobes@sha256:cc1cb4323e4d1cdee62e3ae7623ff6a213db72e89c17a1b98fb9" already present on machine
  Warning  Unhealthy  0s (x6 over 50s)   kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503

I tried to run a k8s deployment with the same image and liveness probe configuration, and it seemed to behave as I hoped, whenever I explicitly hit the /start-failing path, in some seconds liveness probe will report failure and the container will restart and start serving the traffic again.

The text was updated successfully, but these errors were encountered:

skonto · 2022-02-02T12:41:34Z

@Shashankft9 could you check the kubelet logs about the failure? Reason of request failure should be recorded in there. You need to set log level>=4 btw and restart the kubelet service.

Shashankft9 · 2022-02-08T07:05:51Z

I can see the following logs in kubelet:

I0208 01:54:37.906694   12942 http.go:137] Probe failed for http://10.233.65.50:8012/ with request headers map[Accept:[*/*] K-Network-Probe:[queue] User-Agent:[kube-probe/1.20]], response body: shutting down
I0208 01:54:37.906752   12942 prober.go:117] Readiness probe for "healthprobes-go-00001-deployment-55cf484684-dhgct_default(b0570c26-41ea-4156-9367-372f5775415e):queue-proxy" failed (failure): HTTP probe failed with statuscode: 503

skonto · 2022-02-09T10:53:03Z

@Shashankft9 it seems your are hitting this line:

serving/vendor/knative.dev/pkg/network/handlers/drain.go

Line 102 in 52f07b0

http.Error(w, "shutting down", http.StatusServiceUnavailable)

StatusServiceUnavailable            = 503 // RFC 7231, 6.6.4

Due to the user container restart queue proxy goes in draining mode, as described here:

serving/vendor/knative.dev/pkg/network/handlers/drain.go

Lines 51 to 57 in 52f07b0

    
           // Drainer wraps an inner http.Handler to support responding to kubelet 
        
           // probes and KProbes with a "200 OK" until the handler is told to Drain, 
        
           // or Drainer will optionally run the HealthCheck if it is defined. 
        
           // When the Drainer is told to Drain, it will immediately start to fail 
        
           // probes with a "500 shutting down", and the call will block until no 
        
           // requests have been received for QuietPeriod (defaults to 
        
           // network.DefaultDrainTimeout).

In the comments it should be: "503 shutting down". DefaultDrainTimeout = 45s. K8s defaults for readiness probes are here.

".Spec.Containers[0].ReadinessProbe.FailureThreshold": 3

Knative applies some revision defaults but not in this case (in the ksvc above only liveness probes are set):

serving/pkg/apis/serving/v1/revision_defaults.go

Lines 136 to 158 in 52f07b0

    
           func (*RevisionSpec) applyProbes(container *corev1.Container) { 
        
           	if container.ReadinessProbe == nil { 
        
           		container.ReadinessProbe = &corev1.Probe{} 
        
           	} 
        
           	if container.ReadinessProbe.TCPSocket == nil && 
        
           		container.ReadinessProbe.HTTPGet == nil && 
        
           		container.ReadinessProbe.Exec == nil { 
        
           		container.ReadinessProbe.TCPSocket = &corev1.TCPSocketAction{} 
        
           	} 
        
           	if container.ReadinessProbe.SuccessThreshold == 0 { 
        
           		container.ReadinessProbe.SuccessThreshold = 1 
        
           	} 
        
           	// Apply k8s defaults when ReadinessProbe.PeriodSeconds property is set 
        
           	if container.ReadinessProbe.PeriodSeconds != 0 { 
        
           		if container.ReadinessProbe.FailureThreshold == 0 { 
        
           			container.ReadinessProbe.FailureThreshold = 3 
        
           		} 
        
           		if container.ReadinessProbe.TimeoutSeconds == 0 { 
        
           			container.ReadinessProbe.TimeoutSeconds = 1 
        
           		} 
        
           	}

One thing to consider although it does not seem to be the case is that:

Caution: Liveness probes do not wait for readiness probes to succeed. If you want to wait before executing a liveness probe you should use initialDelaySeconds or a startupProbe.

What happens if you increase the readiness failure threshold to 5? Right now it seems Kubelet will try for 3*10=30 seconds and then fail which is lower than the draining period. That probably explains what you see in the events: 6 failures and 2 container restarts. My rough guess.

Shashankft9 · 2022-02-09T13:59:08Z

hey, thanks for digging in, just to be sure - are you suggesting that I should try changing the readiness failure threshold of the queue-proxy container? I can see what you are saying if so - basically since the 3 checks end before the draining period, queue-proxy keeps on giving 503 for the whole duration of those 3 checks, correct?

Another scenario I tested, might be relevant - this time I added a readiness probe in the ksvc:

          readinessProbe:
            httpGet:
              path: /healthz/readiness
              port: 0
            periodSeconds: 15
            failureThreshould: 5

and I expected the same failure log in kubelet, but I got this:

I0209 08:54:41.457686   12942 http.go:137] Probe failed for http://10.233.65.149:8012/ with request headers map[Accept:[*/*] K-Network-Probe:[queue] User-Agent:[kube-probe/1.20]], response body:
I0209 08:54:41.457730   12942 prober.go:117] Readiness probe for "healthprobes-go-00001-deployment-6877485457-cw7lz_default(2f04f8af-f9aa-4bb9-938e-a2a1d5b8cb28):queue-proxy" failed (failure): HTTP probe failed with statuscode: 503

I will try changing the default queue-proxy readiness probe's failureThreshold in serving code.

skonto · 2022-02-10T23:42:43Z

@Shashankft9 I meant the user container let me explain though because eventually the issue is different. I debug the scenario
with the following settings (using your code):

          readinessProbe:
            httpGet:
              path: /healthz/readiness
              port: 0
            periodSeconds: 15
            failureThreshould: 5
          livenessProbe:
            httpGet:
              path: /healthz/liveness
              port: 0
            periodSeconds: 15
            failureThreshould: 1

Here is what happens after a while:

$ kubectl get pods
NAME                                                READY   STATUS    RESTARTS   AGE
healthprobes-go-00001-deployment-74f6787d8f-sz6mq   1/2     Running   1          3m43s

pod descr

...
  Normal   Scheduled  5m48s                 default-scheduler  Successfully assigned default/healthprobes-go-00001-deployment-74f6787d8f-sz6mq to minikube
  Normal   Pulling    5m46s                 kubelet            Pulling image "shashankft/healthprobes@sha256:cc1cb4323e4d1cdee62e3ae7623ff6a213db72e89c17ad4772c39fe841b98fb9"
  Normal   Pulled     5m28s                 kubelet            Successfully pulled image "shashankft/healthprobes@sha256:cc1cb4323e4d1cdee62e3ae7623ff6a213db72e89c17ad4772c39fe841b98fb9" in 17.522624817s
  Normal   Pulling    5m27s                 kubelet            Pulling image "docker.io/skonto/queue-39be6f1d08a095bd076a71d288d295b6@sha256:c75f19f83cddb57026535d0069689760554e867c32aea20e51850cc9b894fbd1"
  Normal   Pulled     5m15s                 kubelet            Successfully pulled image "docker.io/skonto/queue-39be6f1d08a095bd076a71d288d295b6@sha256:c75f19f83cddb57026535d0069689760554e867c32aea20e51850cc9b894fbd1" in 12.338262103s
  Normal   Created    5m14s                 kubelet            Created container queue-proxy
  Normal   Started    5m14s                 kubelet            Started container queue-proxy
  Normal   Killing    4m33s                 kubelet            Container user-container failed liveness probe, will be restarted
  Warning  Unhealthy  4m33s (x3 over 5m3s)  kubelet            Liveness probe failed: HTTP probe failed with statuscode: 500
  Normal   Pulled     4m3s                  kubelet            Container image "shashankft/healthprobes@sha256:cc1cb4323e4d1cdee62e3ae7623ff6a213db72e89c17ad4772c39fe841b98fb9" already present on machine
  Normal   Started    4m2s (x2 over 5m27s)  kubelet            Started container user-container
  Normal   Created    4m2s (x2 over 5m28s)  kubelet            Created container user-container
  Warning  Unhealthy  33s (x19 over 5m3s)   kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503

So if you check the link above, implementation makes sure readiness is set only to queue container on behalf of the user container while liveness is set on the user container, although all http probe calls at first go through to queue proxy anyway.
This is important because the only container that is restarted is the user container that after the initial call to /start-failing path
it will be restarted due to the liveness failure. Once it is restarted liveness check will not fail anymore. However, the queue container is stuck in the draining phase and readiness probes will fail until the pod is unhealthy and eventually it will be removed. Here is why (btw here are the latest entries of the kubelet log):

the drain function will only reset the timer if a normal call is made, otherwise it will keep returning "shutting down" to probes (that cant happen because the unhealthy container cannot be accessed).

serving/vendor/knative.dev/pkg/network/handlers/drain.go

Lines 88 to 111 in 52f07b0

    
           func (d *Drainer) ServeHTTP(w http.ResponseWriter, r *http.Request) { 
        
           	// Respond to probes regardless of path. 
        
           	if d.isHealthCheckRequest(r) { 
        
           		if d.draining() { 
        
           			http.Error(w, "shutting down", http.StatusServiceUnavailable) 
        
           		} else if d.HealthCheck != nil { 
        
           			d.HealthCheck(w, r) 
        
           		} else { 
        
           			w.WriteHeader(http.StatusOK) 
        
           		} 
        
           		return 
        
           	} 
        
           	if isKProbe(r) { 
        
           		if d.draining() { 
        
           			http.Error(w, "shutting down", http.StatusServiceUnavailable) 
        
           		} else { 
        
           			serveKProbe(w, r) 
        
           		} 
        
           		return 
        
           	} 
        
           	d.reset() 
        
           	d.Inner.ServeHTTP(w, r) 
        
           }

I modified the func above by adding some comments and as you can see from the queue proxy logs it is stuck.

Btw readiness probes are managed by queue proxy which every 50ms will try poll the user container trying the readiness url passed. It respects the readiness probe settings passed and these are passed via an env var to the queue container which is later used to set up the logic of the prober:

      SERVING_READINESS_PROBE:           {"httpGet":{"path":"/healthz/readiness","port":8080,"host":"127.0.0.1","scheme":"HTTP","httpHeaders":[{"name":"K-Kubelet-Probe","value":"queue"}]},"timeoutSeconds":1,"periodSeconds":15,"successThreshold":1,"failureThreshold":3}

/cc @julz is this behavior expected (by design)?

skonto · 2022-02-21T14:15:22Z

@Shashankft9 @julz gentle ping

Shashankft9 · 2022-03-11T07:19:47Z

It seems to be the case that

serving/vendor/knative.dev/pkg/network/handlers/drain.go

Lines 165 to 170 in 52f07b0

    
           // draining returns whether we are draining the handler. 
        
           func (d *Drainer) draining() bool { 
        
           	d.RLock() 
        
           	defer d.RUnlock() 
        
           	return d.timer != nil 
        
           }

never returns false, since once the d.timer is set, its never being set to nil again and so the queue-proxy is staying in drain mode forever because this condition is true always.

If the idea of drain mode is to be there for some QuietPeriod which by default is 45s, would it make sense to actually add d.timer = nil after the time period has elapsed? after this line:

serving/vendor/knative.dev/pkg/network/handlers/drain.go

Line 129 in 52f07b0

<-t.tickChan()

skonto · 2022-03-11T14:54:15Z

The Quiet period is overriden at the queue proxy side:

The knative/pkg sets this by default to be 45 here:

serving/vendor/knative.dev/pkg/network/handlers/drain.go

Line 123 in 52f07b0

d.QuietPeriod = network.DefaultDrainTimeout

However queue main sets that to 30s

serving/cmd/queue/main.go

Line 64 in 52f07b0

drainSleepDuration = 30 * time.Second

https://github.com/knative/serving/blob/52f07b02254b42f18fa0f71dbb0462410a5bc7b1/cmd/queue/main.go

Some more context:

The drain call here is what we experience.

serving/cmd/queue/main.go

Line 354 in 52f07b0

drain()

Line 16 in the queue proxy logs:
https://gist.github.com/skonto/749b5a34b464a786d2d3f473da0453d2
This drain is called because we use preStop handlers on the user container.

serving/pkg/reconciler/revision/resources/deploy.go

Line 80 in 9fb7372

userLifecycle = &corev1.Lifecycle{

The user container was restarted.

Kubernetes sends the preStop event immediately before the Container is terminated. Kubernetes' management of the Container blocks until the preStop handler completes, unless the Pod's grace period expires.
So we have a blocking situation because preStop returns at some point (as described next) actually but does not set the condition to be non draining
or timer set to nil.

Now when we drain for the first time we set a timer, that timer is continously being reset if there is
any user request in the mean time. This is the key idea:
// When the Drainer is told to Drain, it will immediately start to fail
// probes with a "500 shutting down", and the call will block until no
// requests have been received for QuietPeriod (defaults to
// network.DefaultDrainTimeout).

Now if we set that timer to nil after draining is done I suspect that we will be able to move on with readiness probes.
The invariant is that pods should finish the requests before shutting down and in our case it should be draining all requests before restarting and not accepting any new before container is ready.
I am wondering if there is a design doc for the behavior.

pramenn · 2022-03-18T17:41:23Z

@skonto I am running into this issue on knative v1.1.0, wondering if there is a fix WIP.

skonto · 2022-03-21T10:50:22Z

@pramenn afaik @Shashankft9 was planning to do a PR to restart the timer as a candidate fix.

Shashankft9 · 2022-03-21T11:37:06Z

hey yeah, I have done the change in my dev and the readiness probe seems to work fine with it - waiting on @dprotaso's opinions on the suggested solution couple of comments above and also (same solution) here: https://knative.slack.com/archives/CA4DNJ9A4/p1646983509263459

marc-laroche · 2022-04-12T12:14:21Z

We are also running on this issue, and were wondering if there was any recent update ? ( @Shashankft9 )

Thanks.

dprotaso · 2022-05-10T22:50:25Z

/triage accepted

Thanks for the deep breakdown everyone - I think what we really need is a clear signal to distinguish between the Pod shutting down and either container (queue-proxy/user-container) restarting.

I'm not sure that's straightforward. Will dig into this for the next release.

I worry about the setting the timer to nil - because I believe in the case the Pod is shutting down it may let a request through rather than returning 503 as expected

/assign @dprotaso

langelini-photosi · 2022-07-11T14:45:21Z

Hi, we are also facing this issue, is there any workaround?

faxioman · 2022-07-12T08:19:55Z

I see @dprotaso changed the milestone for this issue to v1.7.0 but I was wondering if anyone is aware of a workaround, as @langelini-photosi is also asking. I think this issue makes impossible to use all the 1.x versions in a production environment.

pbenini · 2022-07-12T14:31:56Z

It would be very interesting to know how google cloud got around this issue.

I tried a quick setup of cloud run for anthos and I have seen that it does not present this issue.

Anyway, on cloud run for anthos they provide their own queue-proxy: gke.gcr.io/knative/queue@sha256:2148fc4af5f5cda0594d97c3027b316bcba3c8f400d3d57087807f90e80641d3

I tried to use this queue-proxy image on my knative 1.3.1 in self managed cluster ad it seems to solve the issue.

I hope this could helps

dprotaso · 2022-07-12T15:24:44Z

Hey @faxioman - the 1.6 release is today. I'll be taking a look at this for 1.7.

It would be very interesting to know how google cloud got around this issue.

Google's Cloud Run is a separate implementation and they've stopped contributing to upstream

faxioman · 2022-07-12T15:36:36Z

Thanks @dprotaso!
Until then ... if anyone knows a workaround is well accepted.
Currently we found only a "workaround": to convince the developers to build services where the liveness never fails ... 😅

**What this PR does / why we need it**:  Update end to end test to use knative 1.7.4 to test its compatibility against Merlin and kserve 0.9.0. The PR also update the k8s version that is used for testing from 1.21 to 1.22. We need to update to knative 1.7.4 because current version (1.3.3) suffers from issue knative/serving#12571 **Does this PR introduce a user-facing change?**:  None **Checklist** - [X] Added unit test, integration, and/or e2e tests - [X] Tested locally

Shashankft9 added the kind/bug Categorizes issue or PR as related to a bug. label Jan 31, 2022

Shashankft9 mentioned this issue Jan 31, 2022

Adding tests for Liveness probes #12497

Closed

skonto mentioned this issue Feb 14, 2022

Not possible to define livenessProbes or readinessProbes on multi-container #12615

Closed

pramenn mentioned this issue Mar 15, 2022

knative-serving + contour: knative svc endpoints return 404 #12717

Closed

Shashankft9 mentioned this issue Apr 12, 2022

timer set to nil after draining period knative/pkg#2491

Closed

psschwei mentioned this issue May 4, 2022

queue-proxy stops serving traffic after application pod restarts in Knative 1.x #12912

Closed

knative-prow bot assigned dprotaso May 10, 2022

knative-prow bot added the triage/accepted Issues which should be fixed (post-triage) label May 10, 2022

dprotaso added this to the v1.5.0 milestone May 10, 2022

dprotaso modified the milestones: v1.5.0, v1.7.0 Jul 11, 2022

dprotaso mentioned this issue Aug 15, 2022

allow drainer to be reset knative/pkg#2572

Merged

dprotaso mentioned this issue Aug 18, 2022

stop the queue-proxy drain if we don't receive a TERM signal #13229

Merged

knative-prow bot closed this as completed in #13229 Aug 20, 2022

skonto mentioned this issue Sep 8, 2022

[release-v1.4][SRVKS-961] Reset drainer to avoid deadlock when liveness probe fails openshift/knative-serving#1242

Merged

This was referenced Jan 19, 2023

Update knative version to 1.7.4 for e2e testing caraml-dev/merlin#309

Merged

feat: Update knative-serving-core to 1.7.4 caraml-dev/helm-charts#162

Merged

feat: Update knative-serving-istio chart to 1.7.1 caraml-dev/helm-charts#161

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Readiness probe fails if the container restarts after a liveness probe fail #12571

Readiness probe fails if the container restarts after a liveness probe fail #12571

Shashankft9 commented Jan 31, 2022 •

edited

skonto commented Feb 2, 2022 •

edited

Shashankft9 commented Feb 8, 2022

skonto commented Feb 9, 2022 •

edited

Shashankft9 commented Feb 9, 2022

skonto commented Feb 10, 2022 •

edited

skonto commented Feb 21, 2022

Shashankft9 commented Mar 11, 2022 •

edited

skonto commented Mar 11, 2022

pramenn commented Mar 18, 2022 •

edited

skonto commented Mar 21, 2022

Shashankft9 commented Mar 21, 2022

marc-laroche commented Apr 12, 2022

dprotaso commented May 10, 2022

langelini-photosi commented Jul 11, 2022

faxioman commented Jul 12, 2022 •

edited

pbenini commented Jul 12, 2022

dprotaso commented Jul 12, 2022

faxioman commented Jul 12, 2022

Readiness probe fails if the container restarts after a liveness probe fail #12571

Readiness probe fails if the container restarts after a liveness probe fail #12571

Comments

Shashankft9 commented Jan 31, 2022 • edited

What version of Knative?

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

skonto commented Feb 2, 2022 • edited

Shashankft9 commented Feb 8, 2022

skonto commented Feb 9, 2022 • edited

Shashankft9 commented Feb 9, 2022

skonto commented Feb 10, 2022 • edited

skonto commented Feb 21, 2022

Shashankft9 commented Mar 11, 2022 • edited

skonto commented Mar 11, 2022

pramenn commented Mar 18, 2022 • edited

skonto commented Mar 21, 2022

Shashankft9 commented Mar 21, 2022

marc-laroche commented Apr 12, 2022

dprotaso commented May 10, 2022

langelini-photosi commented Jul 11, 2022

faxioman commented Jul 12, 2022 • edited

pbenini commented Jul 12, 2022

dprotaso commented Jul 12, 2022

faxioman commented Jul 12, 2022

Shashankft9 commented Jan 31, 2022 •

edited

skonto commented Feb 2, 2022 •

edited

skonto commented Feb 9, 2022 •

edited

skonto commented Feb 10, 2022 •

edited

Shashankft9 commented Mar 11, 2022 •

edited

pramenn commented Mar 18, 2022 •

edited

faxioman commented Jul 12, 2022 •

edited