kube-controller-manager spamming errors in log #42438

Closed
felipejfc opened this issue Mar 2, 2017 · 24 comments · Fixed by #45923

@felipejfc

kubernetes version: 1.5.3
platform: aws
deploy tool: kops

My master's /var/log/kube-controller-manager.log is filling with entries like this:

I0302 21:51:11.715415       6 node_status_updater.go:74] Could not update node status. Failed to find node "ip-172-20-114-85.ec2.internal" in NodeInformer cache. <nil>
I0302 21:51:11.815662       6 node_status_updater.go:74] Could not update node status. Failed to find node "ip-172-20-114-85.ec2.internal" in NodeInformer cache. <nil>
I0302 21:51:11.915905       6 node_status_updater.go:74] Could not update node status. Failed to find node "ip-172-20-114-85.ec2.internal" in NodeInformer cache. <nil>
I0302 21:51:12.016217       6 node_status_updater.go:74] Could not update node status. Failed to find node "ip-172-20-114-85.ec2.internal" in NodeInformer cache. <nil>
I0302 21:51:12.116508       6 node_status_updater.go:74] Could not update node status. Failed to find node "ip-172-20-114-85.ec2.internal" in NodeInformer cache. <nil>
I0302 21:51:12.216725       6 node_status_updater.go:74] Could not update node status. Failed to find node "ip-172-20-114-85.ec2.internal" in NodeInformer cache. <nil>
I0302 21:51:12.316962       6 node_status_updater.go:74] Could not update node status ...

This is logged many times per second. The node "ip-172-20-114-85.ec2.internal" no longer exists in the cluster and is not shown by kubectl get nodes.

Shouldn't it be removed from the status updater?

@0xmichalis
Contributor

@kubernetes/sig-node-bugs

@damienjoldersma

Any work around? Any way to purge those old nodes from the NodeInformer cache?

@yujuhong
Contributor

yujuhong commented Mar 7, 2017

AFAIK, sig-node doesn't maintain the node controller.

Based on the OWNERS files, the only approver is @gmarek

@gmarek
Contributor

gmarek commented Mar 8, 2017

It has nothing to do with the NodeController:

find . -name node_status_updater.go
./pkg/controller/volume/attachdetach/statusupdater/node_status_updater.go

cc @saad-ali

@gmarek gmarek added sig/storage and removed area/nodecontroller labels Mar 8, 2017
@blakebarnett

blakebarnett commented Mar 23, 2017

This was happening in our clusters as well. Only the leader was doing it, and killing the container so that it got restarted made it stop.

@steinnes

steinnes commented Apr 5, 2017

@blakebarnett did you kill the gcr.io/google_containers/kube-controller-manager container, which then gets restarted by the scheduler, and that fixed it?

@blakebarnett

correct

@steinnes

steinnes commented Apr 7, 2017

Works for me™ @blakebarnett

@nabeelamjad

nabeelamjad commented Apr 24, 2017

Same controller-manager log here on k8s 1.5. I deleted one of the nodes, which had run out of resources and become stuck; I'm still trying to pinpoint the cause, but it appears to be related to volumes.

@xuwang

xuwang commented Apr 26, 2017

This happens on k8s 1.6.1 as well: replaced old nodes don't get cleaned up, generating large error logs. On a CoreOS installation, I ran systemctl restart kube-controller-manager, and that fixed it.

@gmarek
Contributor

gmarek commented Apr 27, 2017

Ping @saad-ali

@r4j4h

r4j4h commented May 9, 2017

This is happening on k8s 1.4.12 as well, and restarting kube-controller-manager does not seem to fix it for me. It is rotating through two nodes that no longer exist, so I am hoping the cache will purge itself at some point and the messages will stop?

@r4j4h

r4j4h commented May 10, 2017

Just following up: the old nodes have still not been cleaned up.

@saad-ali saad-ali added this to the v1.7 milestone May 10, 2017
@saad-ali
Member

@verult will work on a fix for this.

@saad-ali
Member

@verult Let's make sure the fix also gets ported back to all affected branches (1.6, 1.5, and 1.4).

CC @kubernetes/sig-storage-bugs

@verult
Contributor

verult commented May 11, 2017

I was able to reproduce the problem by force deleting a node containing a pod with some volume attached to it. Did anyone run into a different failure mode?

@steinnes

@verult I think that's how the bug surfaced for me. When I force-killed my node, we had just started running a StatefulSet with an attached persistent volume, and I think there was one running on the very node I killed.

@verult
Contributor

verult commented May 17, 2017

The NodeInformer cache is actually reporting the correct state. The problem is that inside the node status updater, there is another data structure keeping track of nodes that require a status update (see ActualStateOfWorld.GetVolumesToReportAttached(), called inside UpdateNodeStatuses() in node_status_updater.go). The entry for the deleted node is never removed from the data structure; rather, the updater is set to try to update the dead node again next time, which leads to the message being logged every 100ms.

The solution is to remove the corresponding node entry once the node is deleted. A fix is on its way.
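
For illustration only, here is a minimal sketch of the idea behind the fix, using simplified stand-in types (these are not the real attach/detach controller structures; only the method name mirrors the RemoveNodeFromAttachUpdates operation mentioned in the fix): when a node is deleted, its entry is dropped from the cache of nodes awaiting a status update, so the updater stops retrying it.

```go
// Illustrative sketch only; not the real attach/detach controller code.
package main

import (
	"fmt"
	"sync"
)

// actualStateOfWorld is a stand-in for the cache that tracks, per node,
// whether a volumesAttached status update still needs to be pushed.
type actualStateOfWorld struct {
	sync.Mutex
	nodesToUpdate map[string]bool // nodeName -> needs status update
}

// RemoveNodeFromAttachUpdates drops a node's entry entirely, so the status
// updater no longer retries a node that has been deleted from the cluster.
func (asw *actualStateOfWorld) RemoveNodeFromAttachUpdates(nodeName string) {
	asw.Lock()
	defer asw.Unlock()
	delete(asw.nodesToUpdate, nodeName)
}

func main() {
	asw := &actualStateOfWorld{
		nodesToUpdate: map[string]bool{"ip-172-20-114-85.ec2.internal": true},
	}
	asw.RemoveNodeFromAttachUpdates("ip-172-20-114-85.ec2.internal")
	fmt.Println(len(asw.nodesToUpdate)) // 0 -> no more retries, no more log spam
}
```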

verult added a commit to verult/kubernetes that referenced this issue May 24, 2017
… node is missing in NodeInformer cache. Fixes kubernetes#42438.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.
verult added a commit to verult/kubernetes that referenced this issue May 25, 2017
… node is missing in NodeInformer cache. Fixes kubernetes#42438.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.
smarterclayton pushed a commit to smarterclayton/kubernetes that referenced this issue May 26, 2017
…erFix

Automatic merge from submit-queue (batch tested with PRs 46383, 45645, 45923, 44884, 46294)

Node status updater now deletes the node entry in attach updates...

… when node is missing in NodeInformer cache.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.



**What this PR does / why we need it**: Fixes issue of unnecessary node status updates when node is deleted.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes kubernetes#42438

**Special notes for your reviewer**: Unit test added, but a more comprehensive test involving the attach/detach controller requires certain testing functionality that is currently absent and will require larger effort. It will be added at a later time.

There is an edge case caused by the following steps:
1) A node is deleted and restarted. The node exists, but is not yet recognized by Kubernetes.
2) A pod requiring a volume attach is created with nodeName explicitly set to this node.

This would leave the pod stuck in the ContainerCreating state. This is low priority since it's a specific edge case that can be avoided.

**Release note**:

```release-note
NONE
```
@verult
Contributor

verult commented May 26, 2017

A fix for 1.5 is on the way (PR #46301)

dhilipkumars pushed a commit to huawei-cloudnative/kubernetes that referenced this issue May 28, 2017
… node is missing in NodeInformer cache. Fixes kubernetes#42438.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.
@r4j4h

r4j4h commented Jun 5, 2017

Will 1.4 be included as well?

verult added a commit to verult/kubernetes that referenced this issue Jun 6, 2017
… node is missing in NodeInformer cache. Fixes kubernetes#42438.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.
@verult
Contributor

verult commented Jun 6, 2017

Here you go, my apologies.

@r4j4h

r4j4h commented Jun 6, 2017

Thank you very much! :)

verult added a commit to verult/kubernetes that referenced this issue Jun 10, 2017
… node is missing in NodeInformer cache. Fixes kubernetes#42438.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.
k8s-github-robot pushed a commit that referenced this issue Jun 16, 2017
Automatic merge from submit-queue

Node status updater now deletes the node entry in attach updates when…

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.



**What this PR does / why we need it**: Fixes issue of unnecessary node status updates when node is deleted.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: cherry-pick of the fix for #42438 


```release-note-none
```
k8s-github-robot pushed a commit that referenced this issue Jun 21, 2017
Automatic merge from submit-queue

Node status updater now deletes the node entry in attach updates when node is missing in NodeInformer cache.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.



**What this PR does / why we need it**: Fixes issue of unnecessary node status updates when node is deleted.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #42438

**Special notes for your reviewer**: This is the v1.5 version of the fix addressed by PR #45923. It is necessary because NodeLister did not exist prior to 1.6, so the node status updater requires a slightly different node existence check.

**Release note**:

```release-note
NONE
```
@saad-ali
Member

To summarize, the fix for this is in

  • 1.7 branch
    • Releases 1.7.0 and greater
  • 1.6 branch
    • Releases 1.6.5 and greater
  • 1.5 branch
    • Releases 1.5.8 and greater (1.5.8 has not yet been cut, CC @mwielgus for ETA of that release).
  • 1.4 branch
    • Releases 1.4.13 and greater (1.4.13 has not yet been cut, CC @jessfraz for ETA of that release).

@saad-ali
Member

There is an edge case that the above fix did not handle: see #50721.
PR #50806 fixes that.

k8s-github-robot pushed a commit that referenced this issue Aug 23, 2017
Automatic merge from submit-queue (batch tested with PRs 50806, 48789, 49922, 49935, 50438)

On AttachDetachController node status update, do not retry when node …

…doesn't exist but keep the node entry in cache.



**What this PR does / why we need it**: An alternative fix for #42438 which also fixes #50721.

Instead of removing the node entry entirely from the node status update cache (which prevents the node from ever being updated even when it recovers), the node status updater now does nothing for a missing node, so there won't be an update retry until the node is re-added, at which point the cache entry is set to true.
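
For illustration only, a minimal sketch of this revised behavior with simplified stand-in types (not the real controller code): the updater skips nodes missing from the informer cache but keeps their entries, so updates resume once the node is re-added and marked for update again.

```go
// Illustrative sketch only; not the real attach/detach controller code.
package main

import "fmt"

type statusUpdater struct {
	knownNodes    map[string]bool // stand-in for the NodeInformer cache
	nodesToUpdate map[string]bool // nodeName -> needs status update
}

// UpdateNodeStatuses skips nodes missing from the informer cache instead of
// deleting their entries, so a node that comes back can be updated again.
func (u *statusUpdater) UpdateNodeStatuses() {
	for nodeName, needsUpdate := range u.nodesToUpdate {
		if !needsUpdate {
			continue
		}
		if !u.knownNodes[nodeName] {
			// Node not found: do nothing and do not schedule a retry.
			// The entry stays in the cache for when the node returns.
			continue
		}
		fmt.Printf("updating volumesAttached status for %s\n", nodeName)
		u.nodesToUpdate[nodeName] = false
	}
}

func main() {
	u := &statusUpdater{
		knownNodes:    map[string]bool{},
		nodesToUpdate: map[string]bool{"ip-172-20-114-85.ec2.internal": true},
	}
	u.UpdateNodeStatuses() // node missing: no update attempt, no log spam
	u.knownNodes["ip-172-20-114-85.ec2.internal"] = true
	u.UpdateNodeStatuses() // node re-added: status update proceeds
}
```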

Will cherry pick to prior versions after this is merged.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #50721 

**Release Note**:
``` release-note
On AttachDetachController node status update, do not retry when node doesn't exist but keep the node entry in cache.
```

/assign @jingxu97 
/cc @saad-ali 
/sig storage
/release-note