
StatefulSet controller does not appear to back off when pods are evicted #89067

Open
smarterclayton opened this issue Mar 11, 2020 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@smarterclayton
Contributor

The StatefulSet controller appears to go into a very tight loop when the "Should recreate evicted statefulset" e2e test runs. The controller should go into backoff after the pod is evicted, instead of hammering the node.

 I0311 16:27:00.916503    1968 kubelet.go:1913] SyncLoop (ADD, "api"): "ss-0_e2e-statefulset-1676(56d12171-0f09-48a1-bf6e-9bddbaaaccdf)"
 I0311 16:27:00.919123    1968 predicate.go:132] Predicate failed on Pod: ss-0_e2e-statefulset-1676(56d12171-0f09-48a1-bf6e-9bddbaaaccdf), for reason: Predicate PodFitsHostPorts failed
 I0311 16:27:00.928857    1968 status_manager.go:696] Patch status for pod "ss-0_e2e-statefulset-1676(56d12171-0f09-48a1-bf6e-9bddbaaaccdf)" with "{\"metadata\":{\"uid\":\"56d12171-0f09-48a1-bf6e-9bddbaaaccdf\"},\"status\":{\"message\":\"Pod Predicate PodFitsHostPorts failed\",\"phase\":\"Failed\",\"qosClass\":null
 I0311 16:27:00.928950    1968 status_manager.go:723] Status for pod "ss-0_e2e-statefulset-1676(56d12171-0f09-48a1-bf6e-9bddbaaaccdf)" updated successfully after 9ms: (1, {Phase:Failed Conditions:[] Message:Pod Predicate PodFitsHostPorts failed Reason:PodFitsHostPorts NominatedNodeName: HostIP: PodIP: PodIPs:[] Sta
 I0311 16:27:00.947085    1968 kubelet.go:1929] SyncLoop (DELETE, "api"): "ss-0_e2e-statefulset-1676(56d12171-0f09-48a1-bf6e-9bddbaaaccdf)"
 I0311 16:27:00.947306    1968 status_manager.go:696] Patch status for pod "ss-0_e2e-statefulset-1676(56d12171-0f09-48a1-bf6e-9bddbaaaccdf)" with "{\"metadata\":{\"uid\":\"56d12171-0f09-48a1-bf6e-9bddbaaaccdf\"}}"
 I0311 16:27:00.947326    1968 status_manager.go:721] Status for pod "ss-0_e2e-statefulset-1676(56d12171-0f09-48a1-bf6e-9bddbaaaccdf)" is up-to-date after 0s: (2)
 W0311 16:27:00.953181    1968 status_manager.go:746] Failed to delete status for pod "ss-0_e2e-statefulset-1676(56d12171-0f09-48a1-bf6e-9bddbaaaccdf)": pods "ss-0" not found
 I0311 16:27:00.955849    1968 kubelet.go:1923] SyncLoop (REMOVE, "api"): "ss-0_e2e-statefulset-1676(56d12171-0f09-48a1-bf6e-9bddbaaaccdf)"
 I0311 16:27:00.955893    1968 kubelet.go:2122] Failed to delete pod "ss-0_e2e-statefulset-1676(56d12171-0f09-48a1-bf6e-9bddbaaaccdf)", err: pod not found
 I0311 16:27:00.971667    1968 kubelet.go:1913] SyncLoop (ADD, "api"): "ss-0_e2e-statefulset-1676(db1fdaf1-f0ee-4ce3-90a4-d81212ff05b7)"
 I0311 16:27:00.974231    1968 predicate.go:132] Predicate failed on Pod: ss-0_e2e-statefulset-1676(db1fdaf1-f0ee-4ce3-90a4-d81212ff05b7), for reason: Predicate PodFitsHostPorts failed
 I0311 16:27:00.981680    1968 status_manager.go:696] Patch status for pod "ss-0_e2e-statefulset-1676(db1fdaf1-f0ee-4ce3-90a4-d81212ff05b7)" with "{\"metadata\":{\"uid\":\"db1fdaf1-f0ee-4ce3-90a4-d81212ff05b7\"},\"status\":{\"message\":\"Pod Predicate PodFitsHostPorts failed\",\"phase\":\"Failed\",\"qosClass\":null
 I0311 16:27:00.981731    1968 status_manager.go:723] Status for pod "ss-0_e2e-statefulset-1676(db1fdaf1-f0ee-4ce3-90a4-d81212ff05b7)" updated successfully after 7ms: (1, {Phase:Failed Conditions:[] Message:Pod Predicate PodFitsHostPorts failed Reason:PodFitsHostPorts NominatedNodeName: HostIP: PodIP: PodIPs:[] Sta
 I0311 16:27:01.004014    1968 kubelet.go:1929] SyncLoop (DELETE, "api"): "ss-0_e2e-statefulset-1676(db1fdaf1-f0ee-4ce3-90a4-d81212ff05b7)"
 I0311 16:27:01.004231    1968 status_manager.go:696] Patch status for pod "ss-0_e2e-statefulset-1676(db1fdaf1-f0ee-4ce3-90a4-d81212ff05b7)" with "{\"metadata\":{\"uid\":\"db1fdaf1-f0ee-4ce3-90a4-d81212ff05b7\"}}"
 I0311 16:27:01.004252    1968 status_manager.go:721] Status for pod "ss-0_e2e-statefulset-1676(db1fdaf1-f0ee-4ce3-90a4-d81212ff05b7)" is up-to-date after 0s: (2)
 W0311 16:27:01.019111    1968 status_manager.go:746] Failed to delete status for pod "ss-0_e2e-statefulset-1676(db1fdaf1-f0ee-4ce3-90a4-d81212ff05b7)": pods "ss-0" not found
 I0311 16:27:01.022066    1968 kubelet.go:1923] SyncLoop (REMOVE, "api"): "ss-0_e2e-statefulset-1676(db1fdaf1-f0ee-4ce3-90a4-d81212ff05b7)"
 I0311 16:27:01.024450    1968 kubelet.go:2122] Failed to delete pod "ss-0_e2e-statefulset-1676(db1fdaf1-f0ee-4ce3-90a4-d81212ff05b7)", err: pod not found
 I0311 16:27:01.041625    1968 kubelet.go:1913] SyncLoop (ADD, "api"): "ss-0_e2e-statefulset-1676(e4b733fb-dce6-4c25-8439-227ffea4f4c1)"
 I0311 16:27:01.044126    1968 predicate.go:132] Predicate failed on Pod: ss-0_e2e-statefulset-1676(e4b733fb-dce6-4c25-8439-227ffea4f4c1), for reason: Predicate PodFitsHostPorts failed
 I0311 16:27:01.054016    1968 status_manager.go:696] Patch status for pod "ss-0_e2e-statefulset-1676(e4b733fb-dce6-4c25-8439-227ffea4f4c1)" with "{\"metadata\":{\"uid\":\"e4b733fb-dce6-4c25-8439-227ffea4f4c1\"},\"status\":{\"message\":\"Pod Predicate PodFitsHostPorts failed\",\"phase\":\"Failed\",\"qosClass\":null
 I0311 16:27:01.054064    1968 status_manager.go:723] Status for pod "ss-0_e2e-statefulset-1676(e4b733fb-dce6-4c25-8439-227ffea4f4c1)" updated successfully after 9ms: (1, {Phase:Failed Conditions:[] Message:Pod Predicate PodFitsHostPorts failed Reason:PodFitsHostPorts NominatedNodeName: HostIP: PodIP: PodIPs:[] Sta
 I0311 16:27:01.073734    1968 kubelet.go:1929] SyncLoop (DELETE, "api"): "ss-0_e2e-statefulset-1676(e4b733fb-dce6-4c25-8439-227ffea4f4c1)"
 I0311 16:27:01.074068    1968 status_manager.go:696] Patch status for pod "ss-0_e2e-statefulset-1676(e4b733fb-dce6-4c25-8439-227ffea4f4c1)" with "{\"metadata\":{\"uid\":\"e4b733fb-dce6-4c25-8439-227ffea4f4c1\"}}"
 I0311 16:27:01.074108    1968 status_manager.go:721] Status for pod "ss-0_e2e-statefulset-1676(e4b733fb-dce6-4c25-8439-227ffea4f4c1)" is up-to-date after 0s: (2)
 I0311 16:27:01.076842    1968 kubelet.go:1923] SyncLoop (REMOVE, "api"): "ss-0_e2e-statefulset-1676(e4b733fb-dce6-4c25-8439-227ffea4f4c1)"
 I0311 16:27:01.076883    1968 kubelet.go:2122] Failed to delete pod "ss-0_e2e-statefulset-1676(e4b733fb-dce6-4c25-8439-227ffea4f4c1)", err: pod not found
 W0311 16:27:01.078630    1968 status_manager.go:746] Failed to delete status for pod "ss-0_e2e-statefulset-1676(e4b733fb-dce6-4c25-8439-227ffea4f4c1)": pods "ss-0" not found
 I0311 16:27:01.096107    1968 kubelet.go:1913] SyncLoop (ADD, "api"): "ss-0_e2e-statefulset-1676(44747d3d-745b-4495-93e0-4f72da04a371)"
 I0311 16:27:01.102192    1968 predicate.go:132] Predicate failed on Pod: ss-0_e2e-statefulset-1676(44747d3d-745b-4495-93e0-4f72da04a371), for reason: Predicate PodFitsHostPorts failed
 I0311 16:27:01.114357    1968 status_manager.go:696] Patch status for pod "ss-0_e2e-statefulset-1676(44747d3d-745b-4495-93e0-4f72da04a371)" with "{\"metadata\":{\"uid\":\"44747d3d-745b-4495-93e0-4f72da04a371\"},\"status\":{\"message\":\"Pod Predicate PodFitsHostPorts failed\",\"phase\":\"Failed\",\"qosClass\":null
 I0311 16:27:01.114408    1968 status_manager.go:723] Status for pod "ss-0_e2e-statefulset-1676(44747d3d-745b-4495-93e0-4f72da04a371)" updated successfully after 12ms: (1, {Phase:Failed Conditions:[] Message:Pod Predicate PodFitsHostPorts failed Reason:PodFitsHostPorts NominatedNodeName: HostIP: PodIP: PodIPs:[] St
 I0311 16:27:01.135433    1968 kubelet.go:1929] SyncLoop (DELETE, "api"): "ss-0_e2e-statefulset-1676(44747d3d-745b-4495-93e0-4f72da04a371)"
 I0311 16:27:01.135659    1968 status_manager.go:696] Patch status for pod "ss-0_e2e-statefulset-1676(44747d3d-745b-4495-93e0-4f72da04a371)" with "{\"metadata\":{\"uid\":\"44747d3d-745b-4495-93e0-4f72da04a371\"}}"
 I0311 16:27:01.135684    1968 status_manager.go:721] Status for pod "ss-0_e2e-statefulset-1676(44747d3d-745b-4495-93e0-4f72da04a371)" is up-to-date after 0s: (2)
 I0311 16:27:01.140683    1968 kubelet.go:1923] SyncLoop (REMOVE, "api"): "ss-0_e2e-statefulset-1676(44747d3d-745b-4495-93e0-4f72da04a371)"
 I0311 16:27:01.140738    1968 kubelet.go:2122] Failed to delete pod "ss-0_e2e-statefulset-1676(44747d3d-745b-4495-93e0-4f72da04a371)", err: pod not found
 W0311 16:27:01.147333    1968 status_manager.go:746] Failed to delete status for pod "ss-0_e2e-statefulset-1676(44747d3d-745b-4495-93e0-4f72da04a371)": pods "ss-0" not found
 I0311 16:27:01.167503    1968 kubelet.go:1913] SyncLoop (ADD, "api"): "ss-0_e2e-statefulset-1676(ff780fca-b041-4a49-8ee9-2e090a4ba4ec)"
 I0311 16:27:01.171201    1968 predicate.go:132] Predicate failed on Pod: ss-0_e2e-statefulset-1676(ff780fca-b041-4a49-8ee9-2e090a4ba4ec), for reason: Predicate PodFitsHostPorts failed
 I0311 16:27:01.182253    1968 status_manager.go:696] Patch status for pod "ss-0_e2e-statefulset-1676(ff780fca-b041-4a49-8ee9-2e090a4ba4ec)" with "{\"metadata\":{\"uid\":\"ff780fca-b041-4a49-8ee9-2e090a4ba4ec\"},\"status\":{\"message\":\"Pod Predicate PodFitsHostPorts failed\",\"phase\":\"Failed\",\"qosClass\":null
 I0311 16:27:01.182306    1968 status_manager.go:723] Status for pod "ss-0_e2e-statefulset-1676(ff780fca-b041-4a49-8ee9-2e090a4ba4ec)" updated successfully after 10ms: (1, {Phase:Failed Conditions:[] Message:Pod Predicate PodFitsHostPorts failed Reason:PodFitsHostPorts NominatedNodeName: HostIP: PodIP: PodIPs:[] St
 I0311 16:27:01.203646    1968 kubelet.go:1929] SyncLoop (DELETE, "api"): "ss-0_e2e-statefulset-1676(ff780fca-b041-4a49-8ee9-2e090a4ba4ec)"
 I0311 16:27:01.203906    1968 status_manager.go:696] Patch status for pod "ss-0_e2e-statefulset-1676(ff780fca-b041-4a49-8ee9-2e090a4ba4ec)" with "{\"metadata\":{\"uid\":\"ff780fca-b041-4a49-8ee9-2e090a4ba4ec\"}}"
 I0311 16:27:01.203955    1968 status_manager.go:721] Status for pod "ss-0_e2e-statefulset-1676(ff780fca-b041-4a49-8ee9-2e090a4ba4ec)" is up-to-date after 0s: (2)
 W0311 16:27:01.212792    1968 status_manager.go:746] Failed to delete status for pod "ss-0_e2e-statefulset-1676(ff780fca-b041-4a49-8ee9-2e090a4ba4ec)": pods "ss-0" not found
 I0311 16:27:01.215840    1968 kubelet.go:1923] SyncLoop (REMOVE, "api"): "ss-0_e2e-statefulset-1676(ff780fca-b041-4a49-8ee9-2e090a4ba4ec)"
 I0311 16:27:01.215899    1968 kubelet.go:2122] Failed to delete pod "ss-0_e2e-statefulset-1676(ff780fca-b041-4a49-8ee9-2e090a4ba4ec)", err: pod not found
 I0311 16:27:01.230930    1968 kubelet.go:1913] SyncLoop (ADD, "api"): "ss-0_e2e-statefulset-1676(84703c07-ef17-41ad-a74e-f02616301cb3)"
 I0311 16:27:01.241160    1968 predicate.go:132] Predicate failed on Pod: ss-0_e2e-statefulset-1676(84703c07-ef17-41ad-a74e-f02616301cb3), for reason: Predicate PodFitsHostPorts failed
 I0311 16:27:01.251997    1968 status_manager.go:696] Patch status for pod "ss-0_e2e-statefulset-1676(84703c07-ef17-41ad-a74e-f02616301cb3)" with "{\"metadata\":{\"uid\":\"84703c07-ef17-41ad-a74e-f02616301cb3\"},\"status\":{\"message\":\"Pod Predicate PodFitsHostPorts failed\",\"phase\":\"Failed\",\"qosClass\":null
 I0311 16:27:01.252053    1968 status_manager.go:723] Status for pod "ss-0_e2e-statefulset-1676(84703c07-ef17-41ad-a74e-f02616301cb3)" updated successfully after 10ms: (1, {Phase:Failed Conditions:[] Message:Pod Predicate PodFitsHostPorts failed Reason:PodFitsHostPorts NominatedNodeName: HostIP: PodIP: PodIPs:[] St
 I0311 16:27:01.265596    1968 kubelet.go:1929] SyncLoop (DELETE, "api"): "ss-0_e2e-statefulset-1676(84703c07-ef17-41ad-a74e-f02616301cb3)"
 I0311 16:27:01.265830    1968 status_manager.go:696] Patch status for pod "ss-0_e2e-statefulset-1676(84703c07-ef17-41ad-a74e-f02616301cb3)" with "{\"metadata\":{\"uid\":\"84703c07-ef17-41ad-a74e-f02616301cb3\"}}"
 I0311 16:27:01.265857    1968 status_manager.go:721] Status for pod "ss-0_e2e-statefulset-1676(84703c07-ef17-41ad-a74e-f02616301cb3)" is up-to-date after 0s: (2)
 I0311 16:27:01.274502    1968 kubelet.go:1923] SyncLoop (REMOVE, "api"): "ss-0_e2e-statefulset-1676(84703c07-ef17-41ad-a74e-f02616301cb3)"
 I0311 16:27:01.274561    1968 kubelet.go:2122] Failed to delete pod "ss-0_e2e-statefulset-1676(84703c07-ef17-41ad-a74e-f02616301cb3)", err: pod not found
 W0311 16:27:01.277282    1968 status_manager.go:746] Failed to delete status for pod "ss-0_e2e-statefulset-1676(84703c07-ef17-41ad-a74e-f02616301cb3)": pods "ss-0" not found
 I0311 16:27:01.296172    1968 kubelet.go:1913] SyncLoop (ADD, "api"): "ss-0_e2e-statefulset-1676(4a76c75f-f835-4982-974e-097480c2d292)"

The surge in this one test was enough to cause the Kubelet to significantly slow down its processing of status (another bug I intend to fix).
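
To sketch what the expected behavior looks like (illustrative only, not the StatefulSet controller's actual code), the standard client-go workqueue pattern already provides it: a failed sync is re-queued with AddRateLimited, so retries for the same key are delayed with per-item exponential backoff, and Forget resets the delay once a sync succeeds. A minimal sketch, assuming the default controller rate limiter (5ms base delay, 1000s cap):

 package main

 import (
 	"fmt"
 	"time"

 	"k8s.io/client-go/util/workqueue"
 )

 // syncStatefulSet stands in for the controller's sync handler; here it
 // always fails, as if the recreated pod keeps being rejected by the node.
 func syncStatefulSet(key string) error {
 	return fmt.Errorf("recreated pod for %s was rejected: PodFitsHostPorts", key)
 }

 func worker(queue workqueue.RateLimitingInterface) {
 	for {
 		item, shutdown := queue.Get()
 		if shutdown {
 			return
 		}
 		key := item.(string)
 		if err := syncStatefulSet(key); err != nil {
 			// Re-queue with per-item exponential backoff instead of
 			// immediately, so repeated failures do not hammer the node or
 			// the apiserver in a tight loop.
 			fmt.Printf("sync failed (%v); requeue #%d with backoff\n", err, queue.NumRequeues(key)+1)
 			queue.AddRateLimited(key)
 		} else {
 			// Success: drop the accumulated backoff for this key.
 			queue.Forget(key)
 		}
 		queue.Done(item)
 	}
 }

 func main() {
 	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
 	defer queue.ShutDown()

 	queue.Add("e2e-statefulset-1676/ss") // hypothetical queue key
 	go worker(queue)

 	// Let a few rate-limited retries happen, then stop.
 	time.Sleep(500 * time.Millisecond)
 }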

/sig apps

@smarterclayton smarterclayton added the kind/bug Categorizes issue or PR as related to a bug. label Mar 11, 2020
@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Mar 11, 2020
@smarterclayton
Contributor Author

Eviction is an unusual action; no controller should try to force the pod back more than once or twice without slowing down.
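
For a concrete sense of the timing (a sketch only, not a proposed patch, and the delays are hypothetical), even a simple exponential backoff policy turns the retry interval from the tens of milliseconds seen in the log above into seconds:

 package main

 import (
 	"fmt"
 	"time"

 	"k8s.io/apimachinery/pkg/util/wait"
 )

 func main() {
 	// Hypothetical policy: start at 1s, double each time, and give up (or
 	// fall back to the slow resync) after 5 attempts.
 	backoff := wait.Backoff{
 		Duration: 1 * time.Second,
 		Factor:   2.0,
 		Steps:    5,
 	}

 	// recreateEvictedPod stands in for "create the replacement pod"; here it
 	// keeps failing, like ss-0 in the log above.
 	recreateEvictedPod := func() error {
 		return fmt.Errorf("pod ss-0 rejected: PodFitsHostPorts")
 	}

 	for backoff.Steps > 0 {
 		if err := recreateEvictedPod(); err == nil {
 			break
 		}
 		// Step() returns the current delay and advances the state, so the
 		// waits grow 1s, 2s, 4s, 8s, 16s rather than ~70ms apart.
 		delay := backoff.Step()
 		fmt.Printf("recreate failed; waiting %v before the next attempt\n", delay)
 		time.Sleep(delay)
 	}
 }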

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 9, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 9, 2020
@gongguan
Contributor

/remove-lifecycle rotten
I ran into the same issue and would like to fix it.
/assign

@kow3ns
Member

kow3ns commented Aug 10, 2020

/assign @kow3ns

@gongguan
Contributor

@kow3ns I have raised PR #92966, which should fix this. PTAL.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 9, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 9, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@smarterclayton smarterclayton reopened this Apr 6, 2023
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 6, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@smarterclayton
Contributor Author

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 6, 2023
@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Apr 6, 2023
@smarterclayton
Contributor Author

Controllers are responsible for backing off when repeated operations fail.

The kubelet should also be defensive against loops like this (#89068).
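
As a generic illustration of that principle (a sketch assuming client-go's retry helper, not code from either component), any repeated operation can be wrapped so that each failure waits longer than the last before the next attempt and the caller eventually gives up:

 package main

 import (
 	"fmt"

 	"k8s.io/client-go/util/retry"
 )

 func main() {
 	// Hypothetical operation that keeps failing, standing in for "recreate
 	// the evicted pod" or any other repeated controller action.
 	attempts := 0
 	op := func() error {
 		attempts++
 		return fmt.Errorf("attempt %d failed", attempts)
 	}

 	// retry.DefaultBackoff sleeps roughly 10ms, 50ms, 250ms between attempts
 	// and stops after four tries instead of looping as fast as possible.
 	err := retry.OnError(retry.DefaultBackoff,
 		func(err error) bool { return true }, // treat every error as retriable for the demo
 		op)
 	fmt.Printf("gave up after %d attempts: %v\n", attempts, err)
 }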
