"Failure updating labels" sometimes happens, exits daemonset-mode pod #122
Comments
I discovered the issue in a cluster with 4 nodes, running k8s version 1.9.3 and current master of NFD.
Termination happens implicitly in stderrLogger.Fatalf().
I commented about this in the original PR: #105 (comment). IMO we should not allow low sleep intervals such as 1s.
These errors correspond to a race between the get and the update of the node object.
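For context, here is a minimal sketch of the kind of get-modify-update sequence that hits this conflict, assuming a recent client-go (the context-taking call signatures); the function and variable names are illustrative, not NFD's actual code:

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// updateNodeLabels shows the racy pattern: read the node, mutate it locally,
// write it back. Illustrative only, not NFD's actual implementation.
func updateNodeLabels(ctx context.Context, clientset kubernetes.Interface,
	nodeName string, labels map[string]string) error {
	node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if node.Labels == nil {
		node.Labels = make(map[string]string)
	}
	for k, v := range labels {
		node.Labels[k] = v
	}
	// If anything else (e.g. the kubelet posting node status) modified the node
	// between the Get and this Update, the apiserver rejects the write with
	// 409 Conflict: "the object has been modified; please apply your changes
	// to the latest version and try again".
	_, err = clientset.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```

The apiserver enforces optimistic concurrency through the object's resourceVersion, so any writer racing with another updater of the same node can see this 409.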
1s is too short, sure; it was just my attempt to trigger the condition faster. I have now set it to a somewhat longer 4s and I see the error triggered about once per 4h. But we can't forget that pods initially terminated even with the official 60s. I can leave such a system running with logs captured, to verify whether it really was the same cause. In my current trial with the 4s interval I changed the fatal to a print, so the pod continues instead of terminating, and it seems the next try succeeds. So in practice we could perhaps even accept such single failures (and avoid the exit).
Yes, there are occasional restarts (i.e. fatal exits) even with the 60s interval. It would be good to understand the root cause of these failures. I don't think we should lower the log level from fatal, because then we would hide the fact that labeling actually fails. And this, in turn, could hide some other problem that was causing labeling to fail every time. So I think NFD exiting with fatal is actually a good thing, as it reveals that a problem exists. In a normal scenario the restarts are pretty rare, so it doesn't end up in CrashLoopBackOff or anything.
I agree, keeping the exit is the right thing; my first idea of changing the exit to a print was not a very good one.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue.
/reopen
This is still happening. The label update fails due to the object changing before the update.
2020/08/11 09:07:47 Sending labeling request to nfd-master
Like the log suggests, a simple limited-round retry loop with an object refresh would probably sort this out.
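For illustration, a sketch of that limited retry-with-refresh: the same get-modify-update sequence as the earlier snippet, now wrapped in client-go's retry.RetryOnConflict so each attempt re-reads the node. The names are again illustrative; this is not the actual patch that eventually fixed the issue:

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateNodeLabelsWithRetry is a hypothetical helper, not NFD's actual code.
func updateNodeLabelsWithRetry(ctx context.Context, clientset kubernetes.Interface,
	nodeName string, labels map[string]string) error {
	// RetryOnConflict re-runs the closure with backoff whenever the update
	// fails with a 409 Conflict; any other error is returned immediately.
	return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
		// Re-fetch on every attempt so the update is applied against the
		// latest resourceVersion of the node.
		node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if node.Labels == nil {
			node.Labels = make(map[string]string)
		}
		for k, v := range labels {
			node.Labels[k] = v
		}
		_, err = clientset.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
		return err
	})
}
```

Because RetryOnConflict only retries conflict errors, any other failure still propagates and can keep the existing fatal handling.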
@uniemimu: You can't reopen an issue/PR unless you authored it or you are a collaborator.
Thanks for the heads-up @uniemimu! Need to fix this.
/reopen
@marquiz: Reopened this issue.
@marquiz No issues seen after over 48h of running with something like 8 nodes. With the previous version it would have tipped over several times already. LGTM!
Thanks @uniemimu for reporting back! Same here: I haven't seen any failures on my test clusters either.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Fixed by #336
When a one-shot job is turned into a daemon, a few things should be double-checked: resource leaks and stability (i.e. exit points), as these are minor issues for a single shot but more serious for a daemon.
K8s is good at restarting pods, but that's no excuse to allow low quality.
So I keep an NFD daemonset running to check for both.
There is a weak sign of a slow memory-growth pattern, but it may level off over a longer run, so I planned to keep it up for long-term monitoring. No luck there, though: the nfd pod exits and gets restarted about once per 24h. Next, I shortened the cycle time from 60 seconds hoping to see the memory pattern more quickly, but that does not work because the exit rate follows: with a 1-second cycle I see one exit in about 1h.
I ran the pod log in -f mode to capture the exit messages, which are:
2018/04/18 11:00:27 can't update node: Operation cannot be fulfilled on nodes "k8-vm-2": the object has been modified; please apply your changes to the latest version and try again
2018/04/18 11:00:27 failed to advertise labels: Operation cannot be fulfilled on nodes "k8-vm-2": the object has been modified; please apply your changes to the latest version and try again
2018/04/18 11:00:27 error occurred while updating node with feature labels: Operation cannot be fulfilled on nodes "k8-vm-2": the object has been modified; please apply your changes to the latest version and try again
The issue has likely always been there, but the probability of hitting it was low, so it had not surfaced before.
It seems there is a window during which the node label update is bound to fail?
Can that state be detected somehow so that we don't try to update?
Or should we retry after some delay once such a condition is hit?
Or is this a sign of some other problem elsewhere?
In any case, we should try to do better than exit, now that we can run as a daemonset.
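One possible middle ground between retrying and exiting is to classify the failure first: a conflict from the get/update race is transient and worth a short, bounded retry, while anything else keeps the fatal exit so real labeling problems stay visible. A rough sketch assuming apimachinery's error helpers; advertiseLabels, maxRetries and the helper itself are hypothetical names, not NFD's actual code:

```go
import (
	"log"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// advertiseWithRetry retries only when the failure is an update conflict;
// any other error keeps the existing fatal behaviour so it is not hidden.
func advertiseWithRetry(advertiseLabels func() error, maxRetries int) {
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = advertiseLabels(); err == nil {
			return
		}
		if !apierrors.IsConflict(err) {
			break // not the transient get/update race; do not mask it
		}
		time.Sleep(time.Second) // small delay before the next attempt
	}
	log.Fatalf("failed to advertise labels: %v", err)
}
```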