
node is not ready because kubelet reports a meaningful conflict error #58002

foxyriver opened this issue Jan 9, 2018 · 18 comments
Labels: kind/bug, lifecycle/rotten, sig/node

foxyriver commented Jan 9, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

The kubelet reported an error, and kubectl get node showed the node as not ready:
there is a meaningful conflict (firstResourceVersion: "104201", currentResourceVersion: "4293"): diff1={"metadata":{"resourceVersion":"4293"},"status":{"conditions":[{"lastHeartbeatTime":"2018-01-03T07:38:24Z","lastTransitionTime":"2018-01-03T07:42:59Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"DiskPressure"},{"lastHeartbeatTime":"2018-01-03T07:38:24Z","lastTransitionTime":"2018-01-03T07:42:59Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"MemoryPressure"},{"lastHeartbeatTime":"2018-01-03T07:38:24Z","lastTransitionTime":"2018-01-03T07:42:59Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"OutOfDisk"},{"lastHeartbeatTime":"2018-01-03T07:38:24Z","lastTransitionTime":"2018-01-03T07:42:59Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"Ready"}]}} , diff2={"status":{"conditions":[{"lastHeartbeatTime":"2018-01-04T09:31:09Z","lastTransitionTime":"2018-01-04T09:31:09Z","message":"kubelet has no disk pressure","reason":"KubeletHasNoDiskPressure","status":"False","type":"DiskPressure"},{"lastHeartbeatTime":"2018-01-04T09:31:09Z","lastTransitionTime":"2018-01-04T09:31:09Z","message":"kubelet has sufficient memory available","reason":"KubeletHasSufficientMemory","status":"False","type":"MemoryPressure"},{"lastHeartbeatTime":"2018-01-04T09:31:09Z","lastTransitionTime":"2018-01-04T09:31:09Z","message":"kubelet has sufficient disk space available","reason":"KubeletHasSufficientDisk","status":"False","type":"OutOfDisk"},{"lastHeartbeatTime":"2018-01-04T09:31:09Z","lastTransitionTime":"2018-01-04T09:31:09Z","message":"kubelet is posting ready status","reason":"KubeletReady","status":"True","type":"Ready"}],"nodeInfo":{"gpus":[]}}} E0104 17:31:09.779522 7223 kubelet_node_status.go:318] Unable to update node status: update node status exceeds retry count
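
For anyone hitting the same thing, a minimal sketch of how to observe the symptom (assumes a systemd-managed kubelet; <node-name> is a placeholder):

```bash
# The node shows NotReady once the kubelet stops posting status.
kubectl get nodes

# Inspect the reported node conditions (Ready/MemoryPressure/DiskPressure/OutOfDisk).
kubectl describe node <node-name>

# On the node itself, look for the "meaningful conflict" error in the kubelet logs
# (assumes the kubelet runs as a systemd unit named kubelet).
journalctl -u kubelet | grep "meaningful conflict"
```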

What you expected to happen:

I found a PR about this issue, #44788, which was cherry-picked to 1.6, so I want to know why this issue still happens.

How to reproduce it (as minimally and precisely as possible):

This issue is intermittent, but I found another issue, #52498, that hits the same problem.
That reporter worked around it by using etcd 3.1.10 instead of 3.2.7.
When I change the leader of the etcd cluster, this issue goes away.
Is this a known incompatibility between Kubernetes and etcd, or an etcd bug?

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.6.9
  • Etcd version: 3.0.17
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Jan 9, 2018
@foxyriver

@kubernetes/sig-node-bugs
/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 9, 2018
@k8s-ci-robot

@foxyriver: Reiterating the mentions to trigger a notification:
@kubernetes/sig-node-bugs

In response to this:

@kubernetes/sig-node-bugs
/sig node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@SleepyBrett

I've seen several of these bug reports, and I'm hitting this myself, but they all get ignored.

@foxyriver

@SleepyBrett Have you solved this issue? Do you have any solution for it?

@brendandburns

It seems to me that this is caused by:

  • Kubelet stops posting status for some reason
  • NodeController updates node state
  • Kubelet starts posting status, but detects a conflict and can't resolve it.

I think we shouldn't treat the conflict as a conflict in this case...

Details from when I observed it:

Mar 06 04:52:54 aks-nodepool1-39841472-0 docker[7343]: eb475b5068e7defa4ca9e76516\",\"k8s-gcrio.azureedge.net/pause-amd64:3.0\"],\"sizeBytes\":746888}]}}" for node "aks-nodepool1-39841472-0": Operation cannot be fulfilled on nodes "aks-nodepool1-39841472-0": there is a meaningful conflict (firstResourceVersion: "8998382", currentResourceVersion: "8998272"):
Mar 06 04:52:54 aks-nodepool1-39841472-0 docker[7343]:  diff1={"metadata":{"resourceVersion":"8998272"},"status":{"$setElementOrder/conditions":[{"type":"NetworkUnavailable"},{"type":"OutOfDisk"},{"type":"MemoryPressure"},{"type":"DiskPressure"},{"type":"Ready"}],"conditions":[{"lastHeartbeatTime":"2018-02-28T20:05:21Z","lastTransitionTime":"2018-02-28T20:00:45Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"OutOfDisk"},{"lastHeartbeatTime":"2018-02-28T20:05:21Z","lastTransitionTime":"2018-02-28T20:00:45Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"MemoryPressure"},{"lastHeartbeatTime":"2018-02-28T20:05:21Z","lastTransitionTime":"2018-02-28T20:00:45Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"DiskPressure"},{"lastHeartbeatTime":"2018-02-28T20:05:21Z","lastTransitionTime":"2018-02-28T20:00:45Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"Ready"}]}}
Mar 06 04:52:54 aks-nodepool1-39841472-0 docker[7343]: , diff2={"status":{"$setElementOrder/conditions":[{"type":"NetworkUnavailable"},{"type":"OutOfDisk"},{"type":"MemoryPressure"},{"type":"DiskPressure"},{"type":"Ready"}],"conditions":[{"lastHeartbeatTime":"2018-03-06T04:52:54Z","lastTransitionTime":"2018-03-06T04:52:54Z","message":"kubelet has sufficient disk space available","reason":"KubeletHasSufficientDisk","status":"False","type":"OutOfDisk"},{"lastHeartbeatTime":"2018-03-06T04:52:54Z","lastTransitionTime":"2018-03-06T04:52:54Z","message":"kubelet has sufficient memory available","reason":"KubeletHasSufficientMemory","status":"False","type":"MemoryPressure"},{"lastHeartbeatTime":"2018-03-06T04:52:54Z","lastTransitionTime":"2018-03-06T04:52:54Z","message":"kubelet has no disk pressure","reason":"KubeletHasNoDiskPressure","status":"False","type":"DiskPressure"},{"lastHeartbeatTime":"2018-03-06T04:52:54Z","lastTransitionTime":"2018-03-06T04:52:54Z","message":"kubelet is posting ready status","reason":"KubeletReady","status":"True","type":"Ready"}],"images":[{"names":["gcrio.azureedge.net/google_containers/hyperkube-amd64@sha256:345b60af69c148833a1f351455fdface31df199183fca6228c1af844b5d8d7b7","gcrio.azureedge.net/google_containers/hyperkube-amd64:v1.7.12"],"sizeBytes":615554537},{"names":["gcr.io/google_containers/nginx-ingress-controller@sha256:820c338dc22eda7ab6331001da3cccd43b1b7dcd179049d33a62ad6deaef8daf","gcr.io/google_containers/nginx-ingress-controller:0.8.3"],"sizeBytes":146844917},{"names":["debian@sha256:4fcd8c0b6f5e3bd44a3e63be259fd0c038476d432953d449ef34aedf16def331","debian:latest"],"sizeBytes":100124083},{"names":["gcrio.azureedge.net/google_containers/heapster-amd64@sha256:f58ded16b56884eeb73b1ba256bcc489714570bacdeca43d4ba3b91ef9897b20","gcrio.azureedge.net/google_containers/heapster-amd64:v1.4.2"],"sizeBytes":73404034},{"names":["gcr.io/kubernetes-helm/tiller@sha256:df7f227fa722afc4931c912c1cad2c47856ec94f4d052ccceebcb16dd483dad8","gcr.io/kubernetes-helm/tiller:v2.7.2"],"sizeBytes":56595240},{"names":["gcrio.azureedge.net/google_containers/k8s-
Mar 06 04:52:54 aks-nodepool1-39841472-0 docker[7343]: dns-kube-dns-amd64@sha256:1a3fc069de481ae690188f6f1ba4664b5cc7760af37120f70c86505c79eea61d","gcrio.azureedge.net/google_containers/k8s-dns-kube-dns-amd64:1.14.5"],"sizeBytes":49387411},{"names":["jetstack/kube-lego@sha256:571277952746b4bc241824811984a5b9f70e9d3fa1cc12a55b2cb20efe086bdf","jetstack/kube-lego:0.1.4"],"sizeBytes":44082083},{"names":["gcrio.azureedge.net/google_containers/k8s-dns-dnsmasq-nanny-amd64@sha256:46b933bb70270c8a02fa6b6f87d440f6f1fce1a5a2a719e164f83f7b109f7544","gcrio.azureedge.net/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.5"],"sizeBytes":41423617},{"names":["gcrio.azureedge.net/google_containers/addon-resizer@sha256:dcec9a5c2e20b8df19f3e9eeb87d9054a9e94e71479b935d5cfdbede9ce15895","gcrio.azureedge.net/google_containers/addon-resizer:1.7"],"sizeBytes":38983736},{"names":["brendanburns/metaparticle-site@sha256:62fd09905f732cb4eea9b0b2359ce9a2cb3de9b44e51bfdf85a3d6bd27b5dee5","brendanburns/metaparticle-site:2018-01-22.0"],"sizeBytes":17131590},{"names":["dockerio.azureedge.net/deis/kube-svc-redirect@sha256:ccc6b31039754db718dac8c5d723b9db6a4070a252deaf4ea2c14b018343627e","dockerio.azureedge.net/deis/kube-svc-redirect:v0.0.3"],"sizeBytes":15058004},{"names":["gcrio.azureedge.net/google_containers/exechealthz-amd64@sha256:503e158c3f65ed7399f54010571c7c977ade7fe59010695f48d9650d83488c0a","gcrio.azureedge.net/google_containers/exechealthz-amd64:1.2"],"sizeBytes":8374840},{"names":["gcr.io/google_containers/defaultbackend@sha256:a64c8ed5df00c9f238ecdeb28eb4ed226faace573695e290a99d92d503593e87","gcr.io/google_containers/defaultbackend:1.2"],"sizeBytes":6939323},{"names":["k8s-gcrio.azureedge.net/pause-amd64@sha256:163ac025575b775d1c0f9bf0bdd0f086883171eb475b5068e7defa4ca9e76516","k8s-gcrio.azureedge.net/pause-amd64:3.0"],"sizeBytes":746888}]}}

@brendandburns

Here is the work-around to restore the node:

  1. SSH onto the affected node (somehow)
  2. Stop the kubelet (systemctl stop kubelet)
  3. Delete the node from Kubernetes (kubectl delete nodes <node-name>).
  4. Restart the kubelet; it will re-register itself and clear the conflict.

I still think this is a bug in the kubelet, though; I'm going to investigate that code.
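
A minimal shell sketch of the workaround above (node name, SSH user, and a systemd-managed kubelet are assumptions):

```bash
# 1. SSH onto the affected node (however you normally reach it).
ssh <user>@<node-name>

# 2. Stop the kubelet so it stops retrying the conflicting status update.
sudo systemctl stop kubelet

# 3. From a machine with cluster credentials, delete the stale Node object.
kubectl delete node <node-name>

# 4. Back on the node, restart the kubelet; it re-registers and writes a fresh Node object.
sudo systemctl start kubelet
```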

liggitt commented Mar 7, 2018

Two things stand out to me:

  1. firstResourceVersion: "104201", currentResourceVersion: "4293" indicates the version of the object in etcd regressed, which should not be possible. Can you verify you are using quorum reads in your apiservers (--etcd-quorum-read=true; the default became true in 1.9)?
  2. "when I change the leader of etcd cluster, this issue will be gone"... Can you describe your etcd cluster setup? Are you using the etcd v2 or v3 schema? Was data migrated from etcd v2 to v3? If so, do all etcd servers agree on raft index and store revision (what does ETCDCTL_API=3 etcdctl endpoint status -w json show)?
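
A hedged sketch of checking both points above (the etcd endpoints and how the apiserver process is inspected are assumptions; the flag and etcdctl command are as quoted):

```bash
# 1. Check whether the apiserver is running with quorum reads enabled
#    (the default was false before Kubernetes 1.9).
ps aux | grep kube-apiserver | grep -o -- '--etcd-quorum-read=[^ ]*'

# 2. Ask every etcd member for its status; the raft index and revision
#    the members report should agree.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379 \
  endpoint status -w json
```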


feiskyer commented Mar 8, 2018

can you verify you are using quorum reads in your apiservers (--etcd-quorum-read=true... the default became true in 1.9)

Seems @foxyriver is running v1.6.9 with etcd v3.0.17. For clusters older than 1.9, is there any possibility of hitting this problem?


liggitt commented Mar 8, 2018

Yes, running against an etcd cluster without that set to true is known to lead to correctness issues.


foxyriver commented Mar 8, 2018

@liggitt

can you verify you are using quorum reads in your apiservers (--etcd-quorum-read=true... the default became true in 1.9)

Using the default value for the --etcd-quorum-read configuration; in v1.6.9 the default is false.

can you describe your etcd cluster setup? are you using the etcd v2 or v3 schema?

Using the etcd v3 schema; no migration.


foxyriver commented Mar 8, 2018

@liggitt

Yes, running against an etcd cluster without that set to true is known to lead to correctness issues.

thx :), I will try to set --etcd-quorum-read=true in kube-apiserver
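
For reference, a hedged sketch of where that flag goes (other apiserver flags omitted; etcd endpoints are placeholders):

```bash
# Enable quorum (linearizable) reads against etcd; on 1.6.x the default is false.
kube-apiserver \
  --etcd-servers=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379 \
  --etcd-quorum-read=true
```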

@brendandburns

@liggitt In my case it was a single etcd server (rpi cluster); does that flag still apply?

@brendandburns brendandburns reopened this Mar 12, 2018
@foxyriver

@brendandburns
With a single etcd server, quorum and non-quorum reads are equivalent.


liggitt commented May 3, 2018

the server-side patch application logic that could lead to this error was removed by #63146

k8s-github-robot pushed a commit that referenced this issue Jul 21, 2018
Automatic merge from submit-queue.

Remove patch retry conflict detection

Minimal backport of #63146
Fixes #58002

Fixes spurious patch errors for CRDs
Fixes patch errors for nodes when the watch cache has a persistently stale version of an object

```release-note
fixes spurious "meaningful conflict" error encountered by nodes attempting to update status, which could cause them to be considered unready
```
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 1, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 31, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
