
"Failure updating labels" sometimes happens, exits daemonset-mode pod #122

Closed
okartau opened this issue Apr 18, 2018 · 20 comments
Labels
help wanted, lifecycle/stale

Comments

@okartau
Contributor

okartau commented Apr 18, 2018

When a one-shot job is turned into a daemon, a few things should be double-checked: resource leaking and stability (i.e. exit points), as these are minor issues for a single shot but more serious for a daemon.
K8s is good at restarting pods, but that's no excuse to allow low quality.
So I keep an NFD daemonset running to check for both.
There is a weak sign of a slow memory-growth pattern, but it may level off over a longer run, so I planned to keep it up for long-term monitoring. No luck there: the nfd pod exits and gets restarted about once per 24h. Next, I shortened the cycle time from 60 seconds in the hope of seeing the memory pattern more quickly, but that doesn't work either, as the exit rate follows: with a 1-second cycle I see one exit in about 1h.

I followed the pod log with -f to capture the exit messages, and these are:

2018/04/18 11:00:27 can't update node: Operation cannot be fulfilled on nodes "k8-vm-2": the object has been modified; please apply your changes to the latest version and try again
2018/04/18 11:00:27 failed to advertise labels: Operation cannot be fulfilled on nodes "k8-vm-2": the object has been modified; please apply your changes to the latest version and try again
2018/04/18 11:00:27 error occurred while updating node with feature labels: Operation cannot be fulfilled on nodes "k8-vm-2": the object has been modified; please apply your changes to the latest version and try again

That issue has likely always been there, but the probability of hitting it was low, so it has not surfaced before.
It seems there is a window during which a node-labels update is bound to fail?
Can that state be detected somehow so that we don't try to update?
Or should we retry after some delay once such a condition is hit?
Or is this a sign of some other problem somewhere else?
In any case, we should try to do better than exit, now that we can run as a daemonset.
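
For reference, the error above is the apiserver's optimistic-concurrency conflict: every object carries a resourceVersion, and an Update that sends a stale resourceVersion is rejected with "the object has been modified". The "window" is simply any other writer (e.g. kubelet status updates) touching the Node between NFD's Get and Update. The condition is detectable with client-go's apierrors.IsConflict, which makes it a natural candidate for a bounded retry rather than an exit. A minimal detection sketch, not NFD's actual code, assuming a recent client-go:

```go
// Minimal detection sketch, not NFD's actual code: client-go surfaces the
// "object has been modified" rejection as a conflict error (HTTP 409), so the
// caller can tell this transient window apart from a permanent failure.
package sketch

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// isRetryableLabelUpdateError reports whether a node-update error is the
// transient resourceVersion conflict seen in the log above, i.e. a case
// where re-reading the node and retrying is expected to succeed.
func isRetryableLabelUpdateError(err error) bool {
	return apierrors.IsConflict(err)
}
```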

@okartau
Contributor Author

okartau commented Apr 18, 2018

I discovered the issue in a cluster with 4 nodes, running k8s version 1.9.3 and the current master of NFD.
I plan to verify with a recent k8s version before exploring further.

@okartau
Contributor Author

okartau commented Apr 19, 2018

Termination happens implicitly in stderrLogger.Fatalf(), which gets called from the main loop in case of an error.
Note that before the "run as daemonset" change, this was the last line of main(), so the forced exit did not change the outcome of a one-shot run at all.
Now it makes a big difference in daemonset mode.
One simple improvement would be to use a less aggressive log level and avoid the forced exit.
Then another area would remain to be studied: why updating labels sometimes fails.
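
For context, this is standard library behaviour rather than anything NFD-specific: Go's (*log.Logger).Fatalf is equivalent to Printf followed by os.Exit(1), so any error routed through stderrLogger.Fatalf() terminates the whole process and, in daemonset mode, triggers a pod restart. A small illustration of the two options discussed here:

```go
// Illustration only: why Fatalf exits the pod, and the less aggressive
// alternative of logging and continuing. log.Fatalf is documented as Printf
// followed by os.Exit(1), so deferred cleanup never runs and the container
// terminates with a non-zero status.
package main

import (
	"errors"
	"log"
	"os"
)

var stderrLogger = log.New(os.Stderr, "", log.LstdFlags)

func main() {
	err := errors.New("failed to advertise labels")

	// Daemon-unfriendly: kills the process on a (possibly transient) error.
	// stderrLogger.Fatalf("error occurred while updating node with feature labels: %v", err)

	// Less aggressive log level: report the failure and keep the loop alive.
	stderrLogger.Printf("error occurred while updating node with feature labels: %v", err)
}
```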

@balajismaniam
Contributor

balajismaniam commented Apr 19, 2018

I commented about this in the original PR: #105 (comment). IMO we should not allow sleep intervals as low as 1s. We should emit a warning if it is set that low and fall back to a higher default such as 30s. Node features are not expected to change at such short intervals.
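
A hypothetical sketch of that check (the --sleep-interval flag name, the 60s default and the 30s floor are illustrative here, not a statement of NFD's actual option handling):

```go
// Hypothetical sketch of warning on and clamping a too-low sleep interval;
// the flag name and the 30s floor are illustrative, not NFD's real defaults.
package main

import (
	"flag"
	"log"
	"time"
)

const minSleepInterval = 30 * time.Second

func main() {
	sleepInterval := flag.Duration("sleep-interval", 60*time.Second,
		"time to sleep between re-labeling rounds")
	flag.Parse()

	if *sleepInterval < minSleepInterval {
		log.Printf("WARNING: sleep-interval %v is too low, using %v instead",
			*sleepInterval, minSleepInterval)
		*sleepInterval = minSleepInterval
	}

	for {
		// ... detect features and advertise labels here ...
		time.Sleep(*sleepInterval)
	}
}
```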

@balajismaniam
Contributor

2018/04/18 11:00:27 can't update node: Operation cannot be fulfilled on nodes "k8-vm-2": the object has been modified; please apply your changes to the latest version and try again
2018/04/18 11:00:27 failed to advertise labels: Operation cannot be fulfilled on nodes "k8-vm-2": the object has been modified; please apply your changes to the latest version and try again
2018/04/18 11:00:27 error occurred while updating node with feature labels: Operation cannot be fulfilled on nodes "k8-vm-2": the object has been modified; please apply your changes to the latest version and try again

These errors correspond to a race between the get and the update of the node at the 1s interval, I think.

@okartau
Contributor Author

okartau commented Apr 19, 2018

1s is too short, sure; it was just my attempt to accelerate triggering of the condition. I have now set it to a somewhat longer 4s and I see the error triggered about once per 4h. But we can't forget that pods initially terminated even with the official 60s. I can leave such a system running with captured logs to verify whether it was really the same cause. In my current trial with the 4s interval I have changed the fatal to a print so that the pod continues instead of terminating, and it seems the next try succeeds, so in practice we could perhaps even accept such single failures (and avoid the exit).

@marquiz
Contributor

marquiz commented Apr 26, 2018

Yes, there are occasional restarts (i.e. fatal exits) even with 60s interval.

It would be good to understand the root cause of these failures.

I don't think we should lower the log level from fatal, because then we would hide the fact that labeling actually fails. That, in turn, could mask some other problem that was causing labeling to fail every time. So I think NFD exiting with a fatal error is actually a good thing, as it reveals that a problem exists. In the normal scenario the restarts are rare enough that it doesn't end up in CrashLoopBackOff or anything.

@marquiz marquiz mentioned this issue May 4, 2018
@okartau
Contributor Author

okartau commented Jul 3, 2018

I agree, keeping the exit is the right thing; my first idea of changing the exit to a print is not very good.

@marquiz marquiz mentioned this issue Aug 17, 2018
@marquiz marquiz added the help wanted label Nov 20, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Apr 27, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 27, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@uniemimu
Contributor

/reopen
/remove-lifecycle rotten

This is still happening. The label update fails due to the object changing before the update.

2020/08/11 09:07:47 Sending labeling request to nfd-master
2020/08/11 09:07:47 failed to set node labels: rpc error: code = Unknown desc = Operation cannot be fulfilled on nodes "cfl-nuci5-2": the object has been modified; please apply your changes to the latest version and try again
2020/08/11 09:07:47 ERROR: failed to advertise labels: rpc error: code = Unknown desc = Operation cannot be fulfilled on nodes "cfl-nuci5-2": the object has been modified; please apply your changes to the latest version and try again

As the log suggests, a simple limited-round retry loop with an object refresh would probably sort this out.
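
A sketch of that suggestion using client-go's stock helper (retry.RetryOnConflict re-runs the closure a bounded number of times, and re-reading the node inside the closure provides the object refresh). The client, nodeName and labels variables are placeholders, and the Get/Update signatures assume a client-go version that takes a context:

```go
// Sketch of a bounded retry with object refresh around the node update.
// client, nodeName and labels are placeholders, not NFD's real identifiers.
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

func advertiseLabels(ctx context.Context, client kubernetes.Interface, nodeName string, labels map[string]string) error {
	// retry.DefaultRetry retries a handful of times with a short backoff,
	// but only when the closure returns a conflict error.
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Refresh the object on every attempt so the Update carries the
		// latest resourceVersion instead of the stale one that conflicted.
		node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if node.Labels == nil {
			node.Labels = map[string]string{}
		}
		for k, v := range labels {
			node.Labels[k] = v
		}
		_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
		return err
	})
}
```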

@k8s-ci-robot
Contributor

@uniemimu: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen
/remove-lifecycle rotten

This is still happening. The label update fails due to the object changing before the update.

2020/08/11 09:07:47 Sending labeling request to nfd-master
2020/08/11 09:07:47 failed to set node labels: rpc error: code = Unknown desc = Operation cannot be fulfilled on nodes "cfl-nuci5-2": the object has been modified; please apply your changes to the latest version and try again
2020/08/11 09:07:47 ERROR: failed to advertise labels: rpc error: code = Unknown desc = Operation cannot be fulfilled on nodes "cfl-nuci5-2": the object has been modified; please apply your changes to the latest version and try again

As the log suggests, a simple limited-round retry loop with an object refresh would probably sort this out.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten label Aug 11, 2020
@marquiz
Contributor

marquiz commented Aug 11, 2020

Thanks for the heads-up @uniemimu! Need to fix this

/reopen

@k8s-ci-robot
Contributor

@marquiz: Reopened this issue.

In response to this:

Thanks for the heads-up @uniemimu! Need to fix this

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@marquiz
Contributor

marquiz commented Aug 20, 2020

@uniemimu could you try out #336 and see if it fixes the problem for you?

@uniemimu
Contributor

@marquiz No issues seen after over 48h of running with something like 8 nodes. With the previous version it would have tipped over several times already. LGTM!

@marquiz
Contributor

marquiz commented Aug 22, 2020

Thanks @uniemimu for reporting back! Same here: I haven't seen any failures on my test clusters either.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Nov 20, 2020
@marquiz
Contributor

marquiz commented Nov 24, 2020

Fixed by #336
