
liveness/readiness probe is executed and failed while pod is terminated #52817

Closed
sunao-uehara opened this issue Sep 20, 2017 · 41 comments · Fixed by #98571
Assignees
Labels
area/kubelet, kind/bug, priority/important-soon, sig/node

Comments

@sunao-uehara

sunao-uehara commented Sep 20, 2017

What happened:
The liveness/readiness probe fails while the pod is being terminated, and it happens only once during the termination. The issue started after upgrading from v1.6.x to v1.7.

How to reproduce it (as minimally and precisely as possible):
Execute kubectl delete pod nginx-A1 to delete the pod. The status of nginx-A1 changes to Terminating, and right after that the liveness and readiness probes are executed and fail, but only once.
An Nginx reverse proxy is running in the pod, so I just use the httpGet method for the liveness and readiness probes.

Here is my Deployment config.

   ...
   spec:
      terminationGracePeriodSeconds: 60
        ...
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
          timeoutSeconds: 3
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
          timeoutSeconds: 3

Here is the Events log from kubectl describe pod nginx-A1:

Events:
  FirstSeen	LastSeen	Count	From					SubObjectPath			Type		Reason		Message
  ---------	--------	-----	----					-------------			--------	------		-------
  14s		14s		1	kubelet, *****	spec.containers{dnsmasq}	Normal		Killing		Killing container with id docker://dnsmasq:Need to kill Pod
  9s		9s		1	kubelet, *****	spec.containers{nginx}		Warning		Unhealthy	Liveness probe failed: Get http://100.*.*.*:8080/healthz: dial tcp 100.*.*.*:8080: getsockopt: connection refused
  9s		9s		1	kubelet, *****	spec.containers{nginx}		Warning		Unhealthy	Readiness probe failed: Get http://100.*.*.*:8080/healthz: dial tcp 100.*.*.*:8080: getsockopt: connection refused

Environment:

  • Kubernetes version: 1.7.2
@k8s-github-robot k8s-github-robot added the needs-sig label Sep 20, 2017
@sunao-uehara
Author

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug label Sep 21, 2017
@sunao-uehara
Author

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node label Sep 21, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig label Sep 21, 2017
@sunao-uehara
Author

PING

@hosh

hosh commented Oct 12, 2017

Since the upgrade to 1.7, our deployment rollouts seem to have a higher failure rate. Occasionally a pod comes up but its readiness probe never starts, and it stays in that state, blocking the entire deployment. I usually have to delete the pod so it is rescheduled and a fresh readiness probe is run.

I wonder if these issues are related.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Jan 11, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 11, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@cpnielsen

/reopen
/remove-lifecycle rotten

We see this consistently with all pods that define a liveness or readiness probe. Whenever we roll out a new deployment, the pods being terminated emit a failed liveness/readiness probe AFTER they have been terminated. We have considered adding a preStop hook that just sleeps for 2-3 seconds, but that seems like a band-aid for something that should not happen in the first place.

Is this an impossible-to-solve race condition between kubernetes moving parts?
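For reference, the preStop band-aid mentioned above would look roughly like the sketch below; the container name, image, and sleep duration are illustrative placeholders, not a recommended fix.

   spec:
     terminationGracePeriodSeconds: 60
     containers:
       - name: nginx                      # placeholder container name
         image: nginx:1.21                # placeholder image
         lifecycle:
           preStop:
             exec:
               command: ["sleep", "3"]    # keep the process alive a few seconds before SIGTERM
         readinessProbe:
           httpGet:
             path: /healthz
             port: 8080
           initialDelaySeconds: 15
           timeoutSeconds: 3

The idea is only to keep the health endpoint answering for a moment after termination starts, so that a late probe still gets a response instead of a connection refused; it does not remove the underlying race.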

@k8s-ci-robot
Contributor

@cpnielsen: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen
/remove-lifecycle rotten

We see this consistently with all pods that define a liveness or readiness probe. Whenever we roll out a new deployment, the pods being terminated emit a failed liveness/readiness probe AFTER they have been terminated. We have considered adding a preStop hook that just sleeps for 2-3 seconds, but that seems like a band-aid for something that should not happen in the first place.

Is this an impossible-to-solve race condition between kubernetes moving parts?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten label Nov 20, 2018
@cpnielsen

@sunao-uehara Do you still experience this? It happens to us running Kubernetes 1.11.4.

@sasacocic

I'm on k8s 1.12.7 and this is happening to me as well.

@rj03hou

rj03hou commented May 24, 2019

Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.6", GitCommit:"a21fdbd78dde8f5447f5f6c331f7eb6f80bd684e", GitTreeState:"clean", BuildDate:"2018-07-26T10:04:08Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

It still happens.

I think we could just add a "Terminating" phase to the pod; when a pod is Terminating, just stop probing it.

@axot

axot commented Sep 5, 2019

I think it is still happening in 1.13; why was this issue closed?

Logs:
(image attached)

@pigletfly
Member

/reopen

@k8s-ci-robot
Contributor

@pigletfly: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Sep 26, 2019
@pigletfly
Member

pigletfly commented Sep 26, 2019

I still experience this on Kubernetes 1.12.4.
The delete event:

[ 
   { 
      "firstTimestamp":"2019-09-26 03:56:54 +0800 CST",
      "reason":"SuccessfulDelete",
      "uid":"18904922-d9ee-11e9-96a1-525400c79bb4",
      "component":"replicaset-controller",
      "lastTimestamp":"2019-09-26 03:56:54 +0800 CST",
      "kind":"ReplicaSet",
      "host":"",
      "name":"xxx-6c4584fbc8",
      "namespace":"prod",
      "type":"Normal"
   }
]

and the probe-failed event:

[ 
   { 
      "firstTimestamp":"2019-09-26 03:57:27 +0800 CST",
      "reason":"Unhealthy",
      "uid":"189550f1-d9ee-11e9-96a1-525400c79bb4",
      "component":"kubelet",
      "lastTimestamp":"2019-09-26 03:57:27 +0800 CST",
      "kind":"Pod",
      "host":"10.97.12.156",
      "name":"xxx6c4584fbc8-r767w",
      "namespace":"prod",
      "type":"Warning"
   }
]

So it seems that after the pod is deleted, the probe keeps running. I am wondering if the probe was started before the pod deletion.

@pigletfly
Member

/area kubelet

@pigletfly
Member

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon label Oct 31, 2019
@k8s-ci-robot
Contributor

@matthyx: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Jun 23, 2020
@matthyx
Contributor

matthyx commented Jun 23, 2020

/remove-lifecycle rotten
/assign @ashleyschuett

@k8s-ci-robot
Contributor

@matthyx: GitHub didn't allow me to assign the following users: ashleyschuett.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/remove-lifecycle rotten
/assign @ashleyschuett

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten label Jun 23, 2020
@semoac

semoac commented Jun 24, 2020

1.16.9 here. Same issue.

@ashleyschuett

/assign

@nathanleyton

I am also facing this issue. We have pods that have a lot of cleanup to do during shutdown; it can take up to 5 minutes to terminate gracefully. During this time the livenessProbe detects failure and restarts the pod, which is not really what we want. I am unable to prevent the service that handles the liveness check from stopping while the cleanup is happening. It would be better if the pod were immediately removed from the Service and the probes stopped while the shutdown is performed. Basically, this means Kubernetes is never actually able to terminate the pod.
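One partial mitigation while probes still run during termination, assuming the probe parameters can be loosened for such workloads, is to give the liveness probe enough failureThreshold headroom (and the pod a long enough terminationGracePeriodSeconds) to cover the cleanup window. The sketch below uses placeholder names and values chosen only to illustrate the arithmetic; they are not taken from this thread.

   spec:
     terminationGracePeriodSeconds: 360    # longer than the worst-case cleanup
     containers:
       - name: worker                      # placeholder container name
         image: example.org/worker:latest  # placeholder image
         livenessProbe:
           httpGet:
             path: /healthz
             port: 8080
           periodSeconds: 30
           failureThreshold: 12            # 12 x 30s = 6 minutes before a restart is triggered

This only widens the window; it does not change the behaviour reported here, where probes are still executed and reported as failed after the kubelet has started killing the pod.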

@matthyx
Contributor

matthyx commented Aug 1, 2020

I will take that point and propose a shutdown probe to the API SIG.

@kfirfer

kfirfer commented Aug 27, 2020

Same here at 1.18.8

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Nov 25, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Dec 25, 2020
@Samridhigupta786

kubernetes_version 1.19.3. Same issue.

@OneideLuizSchneider

kubernetes_version 1.19.3. Same issue.

Same here, kubernetes 1.19.4.

@matthyx
Contributor

matthyx commented Jan 18, 2021

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten label Jan 18, 2021
@matthyx
Contributor

matthyx commented Jan 29, 2021

/assign

@matthyx
Contributor

matthyx commented Feb 25, 2021

Should I consider a cherry-pick for 1.20 and 1.19? (maybe 1.18 too?)

@opsidao

opsidao commented Mar 24, 2021

@matthyx we're running 1.16 and being hit by this continuously when some of our Elixir apps shut down, so cherry-picking it into 1.18 would at least put it closer in our update path 👼

@matthyx
Contributor

matthyx commented Mar 24, 2021

@matthyx we're running 1.16 and being hit by this continuously when some of our Elixir apps shut down, so cherry-picking it into 1.18 would at least put it closer in our update path

#100525
#100526
#100527
