kube-controller-manager spamming errors in log #42438

Closed
felipejfc opened this issue Mar 2, 2017 · 24 comments · Fixed by #45923

@felipejfc

kubernetes version: 1.5.3
platform: aws
deploy tool: kops

My master's /var/log/kube-controller-manager.log is filling with entries like this:

I0302 21:51:11.715415       6 node_status_updater.go:74] Could not update node status. Failed to find node "ip-172-20-114-85.ec2.internal" in NodeInformer cache. <nil>
I0302 21:51:11.815662       6 node_status_updater.go:74] Could not update node status. Failed to find node "ip-172-20-114-85.ec2.internal" in NodeInformer cache. <nil>
I0302 21:51:11.915905       6 node_status_updater.go:74] Could not update node status. Failed to find node "ip-172-20-114-85.ec2.internal" in NodeInformer cache. <nil>
I0302 21:51:12.016217       6 node_status_updater.go:74] Could not update node status. Failed to find node "ip-172-20-114-85.ec2.internal" in NodeInformer cache. <nil>
I0302 21:51:12.116508       6 node_status_updater.go:74] Could not update node status. Failed to find node "ip-172-20-114-85.ec2.internal" in NodeInformer cache. <nil>
I0302 21:51:12.216725       6 node_status_updater.go:74] Could not update node status. Failed to find node "ip-172-20-114-85.ec2.internal" in NodeInformer cache. <nil>
I0302 21:51:12.316962       6 node_status_updater.go:74] Could not update node status ...

This is logged many times per second. The node "ip-172-20-114-85.ec2.internal" no longer exists in the cluster and is not shown by kubectl get nodes.

Shouldn't it be removed from the status updater?

@0xmichalis
Contributor

@kubernetes/sig-node-bugs

@damienjoldersma

Any work around? Any way to purge those old nodes from the NodeInformer cache?

@yujuhong
Contributor

yujuhong commented Mar 7, 2017

AFAIK, sig-node doesn't maintain the node controller.

Based on the OWNERS files, the only approver is @gmarek

@gmarek
Contributor

gmarek commented Mar 8, 2017

It has nothing to do with the NodeController:

find . -name node_status_updater.go
./pkg/controller/volume/attachdetach/statusupdater/node_status_updater.go

cc @saad-ali

@gmarek gmarek added sig/storage and removed area/nodecontroller labels Mar 8, 2017
@blakebarnett

blakebarnett commented Mar 23, 2017

This was happening in our clusters as well. Only the leader was doing it, and killing the container so that it got restarted made it stop.

@steinnes

steinnes commented Apr 5, 2017

@blakebarnett did you kill the gcr.io/google_containers/kube-controller-manager container, which then gets restarted by the scheduler, and that fixed it?

@blakebarnett

correct

@steinnes

steinnes commented Apr 7, 2017

Works for me™ @blakebarnett

@nabeelamjad

nabeelamjad commented Apr 24, 2017

Same controller-manager log here on k8s 1.5. I deleted one of the nodes, which had run out of resources and become stuck; I'm still trying to pinpoint the cause, but it appears to be related to volumes.

@xuwang

xuwang commented Apr 26, 2017

This happens on k8s 1.6.1 as well: replaced old nodes don't get cleaned up, generating large error logs. On a CoreOS installation, I ran systemctl restart kube-controller-manager, and that fixed it.

@gmarek
Contributor

gmarek commented Apr 27, 2017

Ping @saad-ali

@r4j4h

r4j4h commented May 9, 2017

This is happening on k8s 1.4.12 as well, and restarting kube-controller-manager does not seem to fix it for me. It is rotating through two nodes that no longer exist, so I am hoping the cache will purge itself at some point and the messages will stop?

@r4j4h

r4j4h commented May 10, 2017

Just following up: the old nodes have still not been cleaned up.

@saad-ali saad-ali added this to the v1.7 milestone May 10, 2017
@saad-ali
Member

@verult will work on a fix for this.

@saad-ali
Member

@verult Let's make sure the fix also gets ported back to all affected branches (1.6, 1.5, and 1.4).

CC @kubernetes/sig-storage-bugs

@verult
Contributor

verult commented May 11, 2017

I was able to reproduce the problem by force deleting a node containing a pod with some volume attached to it. Did anyone run into a different failure mode?

@steinnes

@verult I think that's how the bug surfaced for me. When I force-killed my node, we had just started running a StatefulSet with an attached persistent volume, and I think there was one running on the very node I killed.

@verult
Contributor

verult commented May 17, 2017

The NodeInformer cache is actually reporting the correct state. The problem is that inside the node status updater, there is another data structure keeping track of nodes that require a status update (see ActualStateOfWorld.GetVolumesToReportAttached(), called inside UpdateNodeStatuses() in node_status_updater.go). The entry for the deleted node is never removed from the data structure; rather, the updater is set to try to update the dead node again next time, which leads to the message being logged every 100ms.

The solution is to remove the corresponding node entry once the node is deleted. A fix is on its way.
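
For illustration only, here is a minimal sketch of the idea behind the fix, using simplified stand-in types (these are not the real attach/detach controller structures; only the method name mirrors the RemoveNodeFromAttachUpdates operation mentioned in the fix): when a node is deleted, its entry is dropped from the cache of nodes awaiting a status update, so the updater stops retrying it.

```go
// Illustrative sketch only; not the real attach/detach controller code.
package main

import (
	"fmt"
	"sync"
)

// actualStateOfWorld is a stand-in for the cache that tracks, per node,
// whether a volumesAttached status update still needs to be pushed.
type actualStateOfWorld struct {
	sync.Mutex
	nodesToUpdate map[string]bool // nodeName -> needs status update
}

// RemoveNodeFromAttachUpdates drops a node's entry entirely, so the status
// updater no longer retries a node that has been deleted from the cluster.
func (asw *actualStateOfWorld) RemoveNodeFromAttachUpdates(nodeName string) {
	asw.Lock()
	defer asw.Unlock()
	delete(asw.nodesToUpdate, nodeName)
}

func main() {
	asw := &actualStateOfWorld{
		nodesToUpdate: map[string]bool{"ip-172-20-114-85.ec2.internal": true},
	}
	asw.RemoveNodeFromAttachUpdates("ip-172-20-114-85.ec2.internal")
	fmt.Println(len(asw.nodesToUpdate)) // 0 -> no more retries, no more log spam
}
```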

verult added a commit to verult/kubernetes that referenced this issue May 24, 2017
… node is missing in NodeInformer cache. Fixes kubernetes#42438.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.
verult added a commit to verult/kubernetes that referenced this issue May 25, 2017
… node is missing in NodeInformer cache. Fixes kubernetes#42438.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.
smarterclayton pushed a commit to smarterclayton/kubernetes that referenced this issue May 26, 2017
…erFix

Automatic merge from submit-queue (batch tested with PRs 46383, 45645, 45923, 44884, 46294)

Node status updater now deletes the node entry in attach updates...

… when node is missing in NodeInformer cache.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.



**What this PR does / why we need it**: Fixes issue of unnecessary node status updates when node is deleted.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes kubernetes#42438

**Special notes for your reviewer**: Unit test added, but a more comprehensive test involving the attach/detach controller requires certain testing functionality that is currently absent and will require larger effort. It will be added at a later time.

There is an edge case caused by the following steps:
1) A node is deleted and restarted. The node exists, but is not yet recognized by Kubernetes.
2) A pod requiring a volume attach is created with nodeName explicitly set to this node.

This would leave the pod stuck in the ContainerCreating state. This is low priority since it's a specific edge case that can be avoided.

**Release note**:

```release-note
NONE
```
@verult
Contributor

verult commented May 26, 2017

A fix for 1.5 is on the way (PR #46301)

dhilipkumars pushed a commit to huawei-cloudnative/kubernetes that referenced this issue May 28, 2017
… node is missing in NodeInformer cache. Fixes kubernetes#42438.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.
@r4j4h

r4j4h commented Jun 5, 2017

Will 1.4 be included as well?

verult added a commit to verult/kubernetes that referenced this issue Jun 6, 2017
… node is missing in NodeInformer cache. Fixes kubernetes#42438.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.
@verult
Contributor

verult commented Jun 6, 2017

Here you go, my apologies.

@r4j4h

r4j4h commented Jun 6, 2017

Thank you very much! :)

verult added a commit to verult/kubernetes that referenced this issue Jun 10, 2017
… node is missing in NodeInformer cache. Fixes kubernetes#42438.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.
k8s-github-robot pushed a commit that referenced this issue Jun 16, 2017
Automatic merge from submit-queue

Node status updater now deletes the node entry in attach updates when…

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.



**What this PR does / why we need it**: Fixes issue of unnecessary node status updates when node is deleted.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: cherry-pick of the fix for #42438 


```release-note-none
```
k8s-github-robot pushed a commit that referenced this issue Jun 21, 2017
Automatic merge from submit-queue

Node status updater now deletes the node entry in attach updates when node is missing in NodeInformer cache.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.



**What this PR does / why we need it**: Fixes issue of unnecessary node status updates when node is deleted.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #42438

**Special notes for your reviewer**: This is the v1.5 version of the fix addressed by PR #45923. It is necessary because NodeLister did not exist prior to 1.6, so the node status updater requires a slightly different node existence check.

**Release note**:

```release-note
NONE
```
@saad-ali
Member

To summarize, the fix for this is in

  • 1.7 branch
    • Releases 1.7.0 and greater
  • 1.6 branch
    • Releases 1.6.5 and greater
  • 1.5 branch
    • Releases 1.5.8 and greater (1.5.8 has not yet been cut, CC @mwielgus for ETA of that release).
  • 1.4 branch
    • Releases 1.4.13 and greater (1.4.13 has not yet been cut, CC @jessfraz for ETA of that release).

@saad-ali
Member

There is an edge case that the above fix did not handle: see #50721.
PR #50806 fixes that.

k8s-github-robot pushed a commit that referenced this issue Aug 23, 2017
Automatic merge from submit-queue (batch tested with PRs 50806, 48789, 49922, 49935, 50438)

On AttachDetachController node status update, do not retry when node …

…doesn't exist but keep the node entry in cache.



**What this PR does / why we need it**: An alternative fix for #42438 which also fixes #50721.

Instead of removing the node entry entirely from the node status update cache (which prevents the node from ever being updated even when it recovers), the node status updater now does nothing for a missing node, so there won't be an update retry until the node is re-added, at which point the cache entry is set to true.
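
For illustration only, a minimal sketch of this revised behavior with simplified stand-in types (not the real controller code): the updater skips nodes missing from the informer cache but keeps their entries, so updates resume once the node is re-added and marked for update again.

```go
// Illustrative sketch only; not the real attach/detach controller code.
package main

import "fmt"

type statusUpdater struct {
	knownNodes    map[string]bool // stand-in for the NodeInformer cache
	nodesToUpdate map[string]bool // nodeName -> needs status update
}

// UpdateNodeStatuses skips nodes missing from the informer cache instead of
// deleting their entries, so a node that comes back can be updated again.
func (u *statusUpdater) UpdateNodeStatuses() {
	for nodeName, needsUpdate := range u.nodesToUpdate {
		if !needsUpdate {
			continue
		}
		if !u.knownNodes[nodeName] {
			// Node not found: do nothing and do not schedule a retry.
			// The entry stays in the cache for when the node returns.
			continue
		}
		fmt.Printf("updating volumesAttached status for %s\n", nodeName)
		u.nodesToUpdate[nodeName] = false
	}
}

func main() {
	u := &statusUpdater{
		knownNodes:    map[string]bool{},
		nodesToUpdate: map[string]bool{"ip-172-20-114-85.ec2.internal": true},
	}
	u.UpdateNodeStatuses() // node missing: no update attempt, no log spam
	u.knownNodes["ip-172-20-114-85.ec2.internal"] = true
	u.UpdateNodeStatuses() // node re-added: status update proceeds
}
```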

Will cherry pick to prior versions after this is merged.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #50721 

**Release Note**:
``` release-note
On AttachDetachController node status update, do not retry when node doesn't exist but keep the node entry in cache.
```

/assign @jingxu97 
/cc @saad-ali 
/sig storage
/release-note