
liveness/readiness probe is executed and failed while pod is terminated #52817

Closed
sunao-uehara opened this issue Sep 20, 2017 · 41 comments · Fixed by #98571
Assignees
Labels
area/kubelet, kind/bug, priority/important-soon, sig/node

Comments

@sunao-uehara

sunao-uehara commented Sep 20, 2017

What happened:
The liveness/readiness probe fails while the pod is being terminated, and it happens only once during the termination. The issue started after upgrading from v1.6.x to v1.7.

How to reproduce it (as minimally and precisely as possible):
Execute kubectl delete pod nginx-A1 to delete the pod. The status of nginx-A1 changes to Terminating, and right after that the liveness and readiness probes are executed and fail, but only once.
An Nginx reverse proxy is running in the pod, so I just use the httpGet method for the liveness and readiness probes.

Here is my Deployment config.

   ...
   spec:
      terminationGracePeriodSeconds: 60
        ...
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
          timeoutSeconds: 3
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
          timeoutSeconds: 3

Here is the Events log from kubectl describe pod nginx-A1:

Events:
  FirstSeen	LastSeen	Count	From					SubObjectPath			Type		Reason		Message
  ---------	--------	-----	----					-------------			--------	------		-------
  14s		14s		1	kubelet, *****	spec.containers{dnsmasq}	Normal		Killing		Killing container with id docker://dnsmasq:Need to kill Pod
  9s		9s		1	kubelet, *****	spec.containers{nginx}		Warning		Unhealthy	Liveness probe failed: Get http://100.*.*.*:8080/healthz: dial tcp 100.*.*.*:8080: getsockopt: connection refused
  9s		9s		1	kubelet, *****	spec.containers{nginx}		Warning		Unhealthy	Readiness probe failed: Get http://100.*.*.*:8080/healthz: dial tcp 100.*.*.*:8080: getsockopt: connection refused

Environment:

  • Kubernetes version: 1.7.2
@k8s-github-robot k8s-github-robot added the needs-sig label Sep 20, 2017
@sunao-uehara
Author

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug label Sep 21, 2017
@sunao-uehara
Author

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node label Sep 21, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig label Sep 21, 2017
@sunao-uehara
Author

PING

@hosh

hosh commented Oct 12, 2017

Since the upgrade to 1.7, our deployment rollouts seem to have a higher failure rate. Occasionally a pod comes up but its readiness probe never starts, and it stays in that state, blocking the entire deployment. I usually have to delete the pod so it is rescheduled and a fresh readiness probe is run.

I wonder if these issues are related.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Jan 11, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 11, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@cpnielsen

/reopen
/remove-lifecycle rotten

We see this consistently with all pods that define a liveness or readiness probe. Whenever we roll out a new deployment, the pods being terminated emit a failed liveness/readiness probe AFTER they have been terminated. We have considered adding a preStop hook that just sleeps for 2-3 seconds, but that seems like a band-aid for something that should not happen in the first place.

Is this an impossible-to-solve race condition between kubernetes moving parts?
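For reference, the preStop band-aid mentioned above would look roughly like the sketch below; the container name, image, and sleep duration are illustrative placeholders, not a recommended fix.

   spec:
     terminationGracePeriodSeconds: 60
     containers:
       - name: nginx                      # placeholder container name
         image: nginx:1.21                # placeholder image
         lifecycle:
           preStop:
             exec:
               command: ["sleep", "3"]    # keep the process alive a few seconds before SIGTERM
         readinessProbe:
           httpGet:
             path: /healthz
             port: 8080
           initialDelaySeconds: 15
           timeoutSeconds: 3

The idea is only to keep the health endpoint answering for a moment after termination starts, so that a late probe still gets a response instead of a connection refused; it does not remove the underlying race.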

@k8s-ci-robot
Contributor

@cpnielsen: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen
/remove-lifecycle rotten

We see this consistently with all pods that define a liveness or readiness probe. Whenever we roll out a new deployment, the pods being terminated emit a failed liveness/readiness probe AFTER they have been terminated. We have considered adding a preStop hook that just sleeps for 2-3 seconds, but that seems like a band-aid for something that should not happen in the first place.

Is this an impossible-to-solve race condition between kubernetes moving parts?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten label Nov 20, 2018
@cpnielsen

@sunao-uehara Do you still experience this? It happens to us running Kubernetes 1.11.4.

@sasacocic

I'm on k8s 1.12.7 and this is happening to me as well.

@rj03hou

rj03hou commented May 24, 2019

Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.6", GitCommit:"a21fdbd78dde8f5447f5f6c331f7eb6f80bd684e", GitTreeState:"clean", BuildDate:"2018-07-26T10:04:08Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

It still happens.

I think we could just add a "Terminating" phase to the pod; when a pod is Terminating, just stop probing it.

@axot

axot commented Sep 5, 2019

I think it is still happening in 1.13; why was this issue closed?

Logs:
(image attached)

@pigletfly
Member

/reopen

@k8s-ci-robot
Contributor

@pigletfly: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Sep 26, 2019
@pigletfly
Member

pigletfly commented Sep 26, 2019

I still experience this on Kubernetes 1.12.4.
The delete event:

[ 
   { 
      "firstTimestamp":"2019-09-26 03:56:54 +0800 CST",
      "reason":"SuccessfulDelete",
      "uid":"18904922-d9ee-11e9-96a1-525400c79bb4",
      "component":"replicaset-controller",
      "lastTimestamp":"2019-09-26 03:56:54 +0800 CST",
      "kind":"ReplicaSet",
      "host":"",
      "name":"xxx-6c4584fbc8",
      "namespace":"prod",
      "type":"Normal"
   }
]

and the probe-failed event:

[ 
   { 
      "firstTimestamp":"2019-09-26 03:57:27 +0800 CST",
      "reason":"Unhealthy",
      "uid":"189550f1-d9ee-11e9-96a1-525400c79bb4",
      "component":"kubelet",
      "lastTimestamp":"2019-09-26 03:57:27 +0800 CST",
      "kind":"Pod",
      "host":"10.97.12.156",
      "name":"xxx6c4584fbc8-r767w",
      "namespace":"prod",
      "type":"Warning"
   }
]

So it seems that after the pod is deleted, the probe keeps running. I am wondering if the probe was started before the pod deletion.

@pigletfly
Member

/area kubelet

@pigletfly
Member

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon label Oct 31, 2019
@k8s-ci-robot
Contributor

@matthyx: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Jun 23, 2020
@matthyx
Contributor

matthyx commented Jun 23, 2020

/remove-lifecycle rotten
/assign @ashleyschuett

@k8s-ci-robot
Contributor

@matthyx: GitHub didn't allow me to assign the following users: ashleyschuett.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/remove-lifecycle rotten
/assign @ashleyschuett

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten label Jun 23, 2020
@semoac

semoac commented Jun 24, 2020

1.16.9 here. Same issue.

@ashleyschuett

/assign

@nathanleyton

I am also facing this issue. We have pods that have a lot of cleanup to do during shutdown; it can take up to 5 minutes to terminate gracefully. During this time the livenessProbe detects failure and restarts the pod, which is not really what we want. I am unable to prevent the service that handles the liveness check from stopping while the cleanup is happening. It would be better if the pod were immediately removed from the Service and the probes stopped while the shutdown is performed. Basically, this means Kubernetes is never actually able to terminate the pod.
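One partial mitigation while probes still run during termination, assuming the probe parameters can be loosened for such workloads, is to give the liveness probe enough failureThreshold headroom (and the pod a long enough terminationGracePeriodSeconds) to cover the cleanup window. The sketch below uses placeholder names and values chosen only to illustrate the arithmetic; they are not taken from this thread.

   spec:
     terminationGracePeriodSeconds: 360    # longer than the worst-case cleanup
     containers:
       - name: worker                      # placeholder container name
         image: example.org/worker:latest  # placeholder image
         livenessProbe:
           httpGet:
             path: /healthz
             port: 8080
           periodSeconds: 30
           failureThreshold: 12            # 12 x 30s = 6 minutes before a restart is triggered

This only widens the window; it does not change the behaviour reported here, where probes are still executed and reported as failed after the kubelet has started killing the pod.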

@matthyx
Contributor

matthyx commented Aug 1, 2020

I will take that point and propose a shutdown probe to the API SIG.

@kfirfer

kfirfer commented Aug 27, 2020

Same here at 1.18.8

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Nov 25, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Dec 25, 2020
@Samridhigupta786

kubernetes_version 1.19.3. Same issue.

@OneideLuizSchneider

kubernetes_version 1.19.3. Same issue.

Same here, kubernetes 1.19.4.

@matthyx
Contributor

matthyx commented Jan 18, 2021

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten label Jan 18, 2021
@matthyx
Contributor

matthyx commented Jan 29, 2021

/assign

@matthyx
Contributor

matthyx commented Feb 25, 2021

Should I consider a cherry-pick for 1.20 and 1.19? (maybe 1.18 too?)

@opsidao

opsidao commented Mar 24, 2021

@matthyx we're running 1.16 and being hit by this continuously when some of our Elixir apps shut down, so cherry-picking it into 1.18 would at least put it closer in our update path 👼

@matthyx
Contributor

matthyx commented Mar 24, 2021

@matthyx we're running 1.16 and being hit by this continuously when some of our Elixir apps shut down, so cherry-picking it into 1.18 would at least put it closer in our update path

#100525
#100526
#100527
