
node is not ready because kubelet reports a meaningful conflict error #58002

foxyriver opened this issue Jan 9, 2018 · 18 comments
Labels: kind/bug, lifecycle/rotten, sig/node

foxyriver commented Jan 9, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

The kubelet reported an error, and kubectl get node showed the node as not ready:
there is a meaningful conflict (firstResourceVersion: "104201", currentResourceVersion: "4293"): diff1={"metadata":{"resourceVersion":"4293"},"status":{"conditions":[{"lastHeartbeatTime":"2018-01-03T07:38:24Z","lastTransitionTime":"2018-01-03T07:42:59Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"DiskPressure"},{"lastHeartbeatTime":"2018-01-03T07:38:24Z","lastTransitionTime":"2018-01-03T07:42:59Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"MemoryPressure"},{"lastHeartbeatTime":"2018-01-03T07:38:24Z","lastTransitionTime":"2018-01-03T07:42:59Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"OutOfDisk"},{"lastHeartbeatTime":"2018-01-03T07:38:24Z","lastTransitionTime":"2018-01-03T07:42:59Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"Ready"}]}} , diff2={"status":{"conditions":[{"lastHeartbeatTime":"2018-01-04T09:31:09Z","lastTransitionTime":"2018-01-04T09:31:09Z","message":"kubelet has no disk pressure","reason":"KubeletHasNoDiskPressure","status":"False","type":"DiskPressure"},{"lastHeartbeatTime":"2018-01-04T09:31:09Z","lastTransitionTime":"2018-01-04T09:31:09Z","message":"kubelet has sufficient memory available","reason":"KubeletHasSufficientMemory","status":"False","type":"MemoryPressure"},{"lastHeartbeatTime":"2018-01-04T09:31:09Z","lastTransitionTime":"2018-01-04T09:31:09Z","message":"kubelet has sufficient disk space available","reason":"KubeletHasSufficientDisk","status":"False","type":"OutOfDisk"},{"lastHeartbeatTime":"2018-01-04T09:31:09Z","lastTransitionTime":"2018-01-04T09:31:09Z","message":"kubelet is posting ready status","reason":"KubeletReady","status":"True","type":"Ready"}],"nodeInfo":{"gpus":[]}}} E0104 17:31:09.779522 7223 kubelet_node_status.go:318] Unable to update node status: update node status exceeds retry count
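
For anyone hitting the same thing, a minimal sketch of how to observe the symptom (assumes a systemd-managed kubelet; <node-name> is a placeholder):

```bash
# The node shows NotReady once the kubelet stops posting status.
kubectl get nodes

# Inspect the reported node conditions (Ready/MemoryPressure/DiskPressure/OutOfDisk).
kubectl describe node <node-name>

# On the node itself, look for the "meaningful conflict" error in the kubelet logs
# (assumes the kubelet runs as a systemd unit named kubelet).
journalctl -u kubelet | grep "meaningful conflict"
```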

What you expected to happen:

I found a PR about this issue, #44788, which was cherry-picked to 1.6, so I want to know why this issue still happens.

How to reproduce it (as minimally and precisely as possible):

This issue is intermittent, but I found another issue, #52498, that hits the same problem.
That reporter worked around it by using etcd 3.1.10 instead of 3.2.7.
When I change the leader of the etcd cluster, this issue goes away.
Is this a known incompatibility between Kubernetes and etcd, or an etcd bug?

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.6.9
  • Etcd version: 3.0.17
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Jan 9, 2018
@foxyriver

@kubernetes/sig-node-bugs
/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 9, 2018
@k8s-ci-robot

@foxyriver: Reiterating the mentions to trigger a notification:
@kubernetes/sig-node-bugs

In response to this:

@kubernetes/sig-node-bugs
/sig node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@SleepyBrett

I've seen several of these bug reports, and I'm hitting this myself, but they all get ignored.

@foxyriver

@SleepyBrett Have you solved this issue? Do you have any solution for it?

@brendandburns

It seems to me that this is caused by:

  • Kubelet stops posting status for some reason
  • NodeController updates node state
  • Kubelet starts posting status, but detects a conflict and can't resolve it.

I think we shouldn't treat the conflict as a conflict in this case...

Details from when I observed it:

Mar 06 04:52:54 aks-nodepool1-39841472-0 docker[7343]: eb475b5068e7defa4ca9e76516\",\"k8s-gcrio.azureedge.net/pause-amd64:3.0\"],\"sizeBytes\":746888}]}}" for node "aks-nodepool1-39841472-0": Operation cannot be fulfilled on nodes "aks-nodepool1-39841472-0": there is a meaningful conflict (firstResourceVersion: "8998382", currentResourceVersion: "8998272"):
Mar 06 04:52:54 aks-nodepool1-39841472-0 docker[7343]:  diff1={"metadata":{"resourceVersion":"8998272"},"status":{"$setElementOrder/conditions":[{"type":"NetworkUnavailable"},{"type":"OutOfDisk"},{"type":"MemoryPressure"},{"type":"DiskPressure"},{"type":"Ready"}],"conditions":[{"lastHeartbeatTime":"2018-02-28T20:05:21Z","lastTransitionTime":"2018-02-28T20:00:45Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"OutOfDisk"},{"lastHeartbeatTime":"2018-02-28T20:05:21Z","lastTransitionTime":"2018-02-28T20:00:45Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"MemoryPressure"},{"lastHeartbeatTime":"2018-02-28T20:05:21Z","lastTransitionTime":"2018-02-28T20:00:45Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"DiskPressure"},{"lastHeartbeatTime":"2018-02-28T20:05:21Z","lastTransitionTime":"2018-02-28T20:00:45Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"Ready"}]}}
Mar 06 04:52:54 aks-nodepool1-39841472-0 docker[7343]: , diff2={"status":{"$setElementOrder/conditions":[{"type":"NetworkUnavailable"},{"type":"OutOfDisk"},{"type":"MemoryPressure"},{"type":"DiskPressure"},{"type":"Ready"}],"conditions":[{"lastHeartbeatTime":"2018-03-06T04:52:54Z","lastTransitionTime":"2018-03-06T04:52:54Z","message":"kubelet has sufficient disk space available","reason":"KubeletHasSufficientDisk","status":"False","type":"OutOfDisk"},{"lastHeartbeatTime":"2018-03-06T04:52:54Z","lastTransitionTime":"2018-03-06T04:52:54Z","message":"kubelet has sufficient memory available","reason":"KubeletHasSufficientMemory","status":"False","type":"MemoryPressure"},{"lastHeartbeatTime":"2018-03-06T04:52:54Z","lastTransitionTime":"2018-03-06T04:52:54Z","message":"kubelet has no disk pressure","reason":"KubeletHasNoDiskPressure","status":"False","type":"DiskPressure"},{"lastHeartbeatTime":"2018-03-06T04:52:54Z","lastTransitionTime":"2018-03-06T04:52:54Z","message":"kubelet is posting ready status","reason":"KubeletReady","status":"True","type":"Ready"}],"images":[{"names":["gcrio.azureedge.net/google_containers/hyperkube-amd64@sha256:345b60af69c148833a1f351455fdface31df199183fca6228c1af844b5d8d7b7","gcrio.azureedge.net/google_containers/hyperkube-amd64:v1.7.12"],"sizeBytes":615554537},{"names":["gcr.io/google_containers/nginx-ingress-controller@sha256:820c338dc22eda7ab6331001da3cccd43b1b7dcd179049d33a62ad6deaef8daf","gcr.io/google_containers/nginx-ingress-controller:0.8.3"],"sizeBytes":146844917},{"names":["debian@sha256:4fcd8c0b6f5e3bd44a3e63be259fd0c038476d432953d449ef34aedf16def331","debian:latest"],"sizeBytes":100124083},{"names":["gcrio.azureedge.net/google_containers/heapster-amd64@sha256:f58ded16b56884eeb73b1ba256bcc489714570bacdeca43d4ba3b91ef9897b20","gcrio.azureedge.net/google_containers/heapster-amd64:v1.4.2"],"sizeBytes":73404034},{"names":["gcr.io/kubernetes-helm/tiller@sha256:df7f227fa722afc4931c912c1cad2c47856ec94f4d052ccceebcb16dd483dad8","gcr.io/kubernetes-helm/tiller:v2.7.2"],"sizeBytes":56595240},{"names":["gcrio.azureedge.net/google_containers/k8s-
Mar 06 04:52:54 aks-nodepool1-39841472-0 docker[7343]: dns-kube-dns-amd64@sha256:1a3fc069de481ae690188f6f1ba4664b5cc7760af37120f70c86505c79eea61d","gcrio.azureedge.net/google_containers/k8s-dns-kube-dns-amd64:1.14.5"],"sizeBytes":49387411},{"names":["jetstack/kube-lego@sha256:571277952746b4bc241824811984a5b9f70e9d3fa1cc12a55b2cb20efe086bdf","jetstack/kube-lego:0.1.4"],"sizeBytes":44082083},{"names":["gcrio.azureedge.net/google_containers/k8s-dns-dnsmasq-nanny-amd64@sha256:46b933bb70270c8a02fa6b6f87d440f6f1fce1a5a2a719e164f83f7b109f7544","gcrio.azureedge.net/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.5"],"sizeBytes":41423617},{"names":["gcrio.azureedge.net/google_containers/addon-resizer@sha256:dcec9a5c2e20b8df19f3e9eeb87d9054a9e94e71479b935d5cfdbede9ce15895","gcrio.azureedge.net/google_containers/addon-resizer:1.7"],"sizeBytes":38983736},{"names":["brendanburns/metaparticle-site@sha256:62fd09905f732cb4eea9b0b2359ce9a2cb3de9b44e51bfdf85a3d6bd27b5dee5","brendanburns/metaparticle-site:2018-01-22.0"],"sizeBytes":17131590},{"names":["dockerio.azureedge.net/deis/kube-svc-redirect@sha256:ccc6b31039754db718dac8c5d723b9db6a4070a252deaf4ea2c14b018343627e","dockerio.azureedge.net/deis/kube-svc-redirect:v0.0.3"],"sizeBytes":15058004},{"names":["gcrio.azureedge.net/google_containers/exechealthz-amd64@sha256:503e158c3f65ed7399f54010571c7c977ade7fe59010695f48d9650d83488c0a","gcrio.azureedge.net/google_containers/exechealthz-amd64:1.2"],"sizeBytes":8374840},{"names":["gcr.io/google_containers/defaultbackend@sha256:a64c8ed5df00c9f238ecdeb28eb4ed226faace573695e290a99d92d503593e87","gcr.io/google_containers/defaultbackend:1.2"],"sizeBytes":6939323},{"names":["k8s-gcrio.azureedge.net/pause-amd64@sha256:163ac025575b775d1c0f9bf0bdd0f086883171eb475b5068e7defa4ca9e76516","k8s-gcrio.azureedge.net/pause-amd64:3.0"],"sizeBytes":746888}]}}

@brendandburns

Here is the work-around to restore the node:

  1. SSH onto the affected node (somehow)
  2. Stop the kubelet (systemctl stop kubelet)
  3. Delete the node from Kubernetes (kubectl delete nodes <node-name>).
  4. Restart the kubelet; it will re-register itself and clear the conflict.

I still think this is a bug in the kubelet, though; I'm going to investigate that code.
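
A minimal shell sketch of the workaround above (node name, SSH user, and a systemd-managed kubelet are assumptions):

```bash
# 1. SSH onto the affected node (however you normally reach it).
ssh <user>@<node-name>

# 2. Stop the kubelet so it stops retrying the conflicting status update.
sudo systemctl stop kubelet

# 3. From a machine with cluster credentials, delete the stale Node object.
kubectl delete node <node-name>

# 4. Back on the node, restart the kubelet; it re-registers and writes a fresh Node object.
sudo systemctl start kubelet
```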

liggitt commented Mar 7, 2018

Two things stand out to me:

  1. firstResourceVersion: "104201", currentResourceVersion: "4293" indicates the version of the object in etcd regressed, which should not be possible. Can you verify you are using quorum reads in your apiservers (--etcd-quorum-read=true; the default became true in 1.9)?
  2. "when I change the leader of etcd cluster, this issue will be gone"... Can you describe your etcd cluster setup? Are you using the etcd v2 or v3 schema? Was data migrated from etcd v2 to v3? If so, do all etcd servers agree on raft index and store revision (what does ETCDCTL_API=3 etcdctl endpoint status -w json show)?
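
A hedged sketch of checking both points above (the etcd endpoints and how the apiserver process is inspected are assumptions; the flag and etcdctl command are as quoted):

```bash
# 1. Check whether the apiserver is running with quorum reads enabled
#    (the default was false before Kubernetes 1.9).
ps aux | grep kube-apiserver | grep -o -- '--etcd-quorum-read=[^ ]*'

# 2. Ask every etcd member for its status; the raft index and revision
#    the members report should agree.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379 \
  endpoint status -w json
```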


feiskyer commented Mar 8, 2018

can you verify you are using quorum reads in your apiservers (--etcd-quorum-read=true... the default became true in 1.9)

Seems @foxyriver is running v1.6.9 with etcd v3.0.17. For clusters older than 1.9, is there any possibility of hitting this problem?


liggitt commented Mar 8, 2018

Yes, running against an etcd cluster without that set to true is known to lead to correctness issues.


foxyriver commented Mar 8, 2018

@liggitt

can you verify you are using quorum reads in your apiservers (--etcd-quorum-read=true... the default became true in 1.9)

Using the default value for the --etcd-quorum-read configuration; in v1.6.9 the default is false.

can you describe your etcd cluster setup? are you using the etcd v2 or v3 schema?

Using the etcd v3 schema; no migration.


foxyriver commented Mar 8, 2018

@liggitt

Yes, running against an etcd cluster without that set to true is known to lead to correctness issues.

thx :), I will try to set --etcd-quorum-read=true in kube-apiserver
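
For reference, a hedged sketch of where that flag goes (other apiserver flags omitted; etcd endpoints are placeholders):

```bash
# Enable quorum (linearizable) reads against etcd; on 1.6.x the default is false.
kube-apiserver \
  --etcd-servers=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379 \
  --etcd-quorum-read=true
```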

@brendandburns

@liggitt In my case it was a single etcd server (rpi cluster); does that flag still apply?

@brendandburns brendandburns reopened this Mar 12, 2018
@foxyriver

@brendandburns
With a single etcd server, quorum and non-quorum reads are equivalent.


liggitt commented May 3, 2018

the server-side patch application logic that could lead to this error was removed by #63146

k8s-github-robot pushed a commit that referenced this issue Jul 21, 2018
Automatic merge from submit-queue.

Remove patch retry conflict detection

Minimal backport of #63146
Fixes #58002

Fixes spurious patch errors for CRDs
Fixes patch errors for nodes when the watch cache has a persistently stale version of an object

```release-note
fixes spurious "meaningful conflict" error encountered by nodes attempting to update status, which could cause them to be considered unready
```
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 1, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 31, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
