kube-controller-manager spamming errors in log #42438
kubernetes version: 1.5.3
platform: aws
deploy tool: kops

My master's `/var/log/kube-controller-manager.log` is receiving the same type of log line many times per second. The node "ip-172-20-114-85.ec2.internal" does not exist in the cluster anymore and is not shown by `kubectl get nodes`. Shouldn't it be removed from the `status_updater`?

Comments
@kubernetes/sig-node-bugs
Any workaround? Any way to purge those old nodes from the NodeInformer cache?
It has nothing to do with the NodeController.
cc @saad-ali
This was happening in our clusters as well; only the leader was doing it, and killing the container so it restarted made it stop happening.
@blakebarnett did you kill the kube-controller-manager container?
Correct.
Works for me™ @blakebarnett
Same controller-manager log here on k8s 1.5. I deleted one of the nodes, which had run out of resources and become stuck; I'm still trying to pinpoint the cause, but it appears to be related to volumes.
This happens on k8s 1.6.1 as well: replaced old nodes don't get cleaned up, generating large error logs. On a CoreOS installation, I ran `systemctl restart kube-controller-manager`, and that fixed it.
Ping @saad-ali
This is happening on k8s 1.4.12 as well, and restarting kube-controller-manager does not seem to fix it for me. It is rotating through two nodes that no longer exist, so I am hoping the cache will purge itself at some point and they will stop.
Just following up: the old nodes are still not cleaned up.
@verult will work on a fix for this.
@verult Let's make sure the fix also gets ported back to all affected branches (1.6, 1.5, and 1.4). CC @kubernetes/sig-storage-bugs
I was able to reproduce the problem by force-deleting a node containing a pod with a volume attached to it. Did anyone run into a different failure mode?
@verult I think that's how the bug surfaced for me. When I force-killed my node we had just started running a StatefulSet with an attached persistent volume, and I think one of its pods was running on the very node I killed.
The NodeInformer cache is actually reporting the correct state. The problem is that inside the node status updater there is another data structure keeping track of nodes that require a status update (see `ActualStateOfWorld.GetVolumesToReportAttached()`, called inside `UpdateNodeStatuses()` in `node_status_updater.go`). The entry for the deleted node is never removed from this data structure; instead, the updater is set to try the dead node again on the next pass, which causes the message to be logged every 100ms. The solution is to remove the corresponding node entry once the node is deleted. A fix is on its way.
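For illustration, here is a minimal Go sketch of that idea. `RemoveNodeFromAttachUpdates` is the method name the fix introduces; everything else (the struct, the field, the surrounding logic) is a simplified stand-in, not the actual attach/detach controller code:

```go
package main

import (
	"fmt"
	"sync"
)

// actualStateOfWorld is a simplified stand-in for the attach/detach
// controller's cache of nodes whose volume-attachment status still
// needs to be written back to the Node API object.
type actualStateOfWorld struct {
	sync.Mutex
	nodesToUpdateStatusFor map[string]bool
}

// SetNodeStatusUpdateNeeded marks a node for a status update. In the
// buggy behavior this was re-set after every failed update, so a
// deleted node was retried forever (logging every ~100ms).
func (asw *actualStateOfWorld) SetNodeStatusUpdateNeeded(nodeName string) {
	asw.Lock()
	defer asw.Unlock()
	asw.nodesToUpdateStatusFor[nodeName] = true
}

// RemoveNodeFromAttachUpdates is the essence of the fix: once the node
// is gone, drop its entry so the updater stops retrying it.
func (asw *actualStateOfWorld) RemoveNodeFromAttachUpdates(nodeName string) {
	asw.Lock()
	defer asw.Unlock()
	delete(asw.nodesToUpdateStatusFor, nodeName)
}

func main() {
	asw := &actualStateOfWorld{nodesToUpdateStatusFor: map[string]bool{}}
	asw.SetNodeStatusUpdateNeeded("ip-172-20-114-85.ec2.internal")
	asw.RemoveNodeFromAttachUpdates("ip-172-20-114-85.ec2.internal")
	fmt.Println(len(asw.nodesToUpdateStatusFor)) // 0: nothing left to retry
}
```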
Automatic merge from submit-queue (batch tested with PRs 46383, 45645, 45923, 44884, 46294)

Node status updater now deletes the node entry in attach updates when node is missing in NodeInformer cache.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.

**What this PR does / why we need it**: Fixes the issue of unnecessary node status updates when a node is deleted.

**Which issue this PR fixes**: fixes kubernetes#42438

**Special notes for your reviewer**: A unit test was added, but a more comprehensive test involving the attach/detach controller requires testing functionality that is currently absent and will require a larger effort; it will be added at a later time. There is an edge case caused by the following steps:

1. A node is deleted and restarted. The node exists, but is not yet recognized by Kubernetes.
2. A pod requiring a volume attach is created with nodeName explicitly set to this node. This leaves the pod stuck in the ContainerCreating state.

This is low priority since it is a specific edge case that can be avoided.

**Release note**:

```release-note
NONE
```
A fix for 1.5 is on the way (PR #46301).
Will 1.4 be included as well?
Here you go, my apologies.
Thank you very much! :)
Automatic merge from submit-queue

Node status updater now deletes the node entry in attach updates when node is missing in NodeInformer cache.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.

**What this PR does / why we need it**: Fixes the issue of unnecessary node status updates when a node is deleted.

**Which issue this PR fixes**: cherry-pick of the fix for #42438

```release-note-none
```
Automatic merge from submit-queue

Node status updater now deletes the node entry in attach updates when node is missing in NodeInformer cache.

- Added RemoveNodeFromAttachUpdates as part of node status updater operations.

**What this PR does / why we need it**: Fixes the issue of unnecessary node status updates when a node is deleted.

**Which issue this PR fixes**: fixes #42438

**Special notes for your reviewer**: This is the v1.5 version of the fix addressed by PR #45923. It is necessary because NodeLister did not exist prior to 1.6, so the node status updater requires a slightly different node existence check.

**Release note**:

```release-note
NONE
```
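As a rough illustration of what such a pre-1.6 check can look like: the `GetByKey` method below mirrors the informer store's real `(item, exists, err)` signature, but the `store` type and the `nodeExists` helper are simplified stand-ins, not the actual 1.5 code:

```go
package main

import "fmt"

// store is a map-backed stand-in for the node informer's cache store.
type store map[string]interface{}

// GetByKey mirrors the informer store's (item, exists, err) signature.
func (s store) GetByKey(key string) (interface{}, bool, error) {
	obj, ok := s[key]
	return obj, ok, nil
}

// nodeExists sketches the pre-1.6 existence check: without a NodeLister,
// ask the informer's underlying store directly whether the node is cached.
func nodeExists(s store, nodeName string) (bool, error) {
	_, exists, err := s.GetByKey(nodeName)
	if err != nil {
		return false, err
	}
	return exists, nil
}

func main() {
	s := store{"node-a": struct{}{}}
	ok, _ := nodeExists(s, "ip-172-20-114-85.ec2.internal")
	fmt.Println(ok) // false: the deleted node is no longer in the cache
}
```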
To summarize, the fix for this has been merged to master and cherry-picked to the 1.5 and 1.4 branches.
Automatic merge from submit-queue (batch tested with PRs 50806, 48789, 49922, 49935, 50438)

On AttachDetachController node status update, do not retry when node doesn't exist but keep the node entry in cache.

**What this PR does / why we need it**: An alternative fix for #42438 which also fixes #50721. Instead of removing the node entry entirely from the node status update cache (which prevents the node from ever being updated even when it recovers), the node status updater now does nothing, so there won't be an update retry until the node is re-added, at which point the cache entry is set to true. Will cherry-pick to prior versions after this is merged.

**Which issue this PR fixes**: fixes #50721

**Release note**:

```release-note
On AttachDetachController node status update, do not retry when node doesn't exist but keep the node entry in cache.
```

/assign @jingxu97
/cc @saad-ali
/sig storage
/release-note
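To make the difference between the two fixes concrete, here is a minimal Go sketch under assumed names (`nodeUpdateCache`, `nodeExists`, and `doUpdate` are all illustrative stand-ins for the real controller types, not the actual code):

```go
package main

import "fmt"

// nodeUpdateCache maps node name -> "status update still needed".
type nodeUpdateCache map[string]bool

// updateNodeStatuses sketches the revised retry logic: a node missing
// from the informer is skipped without scheduling a retry, but its
// entry is kept so updates resume if the node is re-added.
func updateNodeStatuses(cache nodeUpdateCache, nodeExists func(string) bool, doUpdate func(string) error) {
	for node, needed := range cache {
		if !needed {
			continue
		}
		if !nodeExists(node) {
			// Earlier fix: delete(cache, node) -- but then a node that
			// comes back is never updated again (#50721).
			// This fix: do nothing; keep the entry, skip the retry.
			continue
		}
		if err := doUpdate(node); err != nil {
			continue // entry stays marked and is retried next pass
		}
		cache[node] = false // update succeeded; nothing more to report
	}
}

func main() {
	cache := nodeUpdateCache{"node-a": true, "node-b": true}
	exists := func(name string) bool { return name == "node-a" }
	update := func(name string) error { fmt.Println("updated", name); return nil }
	updateNodeStatuses(cache, exists, update)
	fmt.Println(cache) // map[node-a:false node-b:true]
}
```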