
"Failure updating labels" sometimes happens, exits daemonset-mode pod #122

Closed
okartau opened this issue Apr 18, 2018 · 20 comments
Labels
help wanted, lifecycle/stale

Comments

@okartau
Contributor

okartau commented Apr 18, 2018

When a one-shot job is turned into a daemon, a few things should be double-checked: resource leaking and stability (i.e. exit points), as these are minor issues for a single shot but more serious for a daemon.
K8s is good at restarting pods, but that's no excuse to allow low quality.
So I keep an NFD daemonset running to check for both.
There is a weak sign of a slow memory-growth pattern, but it may level off over a longer run, so I planned to keep it up for long-term monitoring. No luck there: the nfd pod exits and gets restarted about once per 24h. Next, I shortened the cycle time from 60 seconds in the hope of seeing the memory pattern more quickly, but that doesn't work either, as the exit rate follows: with a 1-second cycle I see one exit in about 1h.

I followed the pod log with -f to capture the exit messages, and these are:

2018/04/18 11:00:27 can't update node: Operation cannot be fulfilled on nodes "k8-vm-2": the object has been modified; please apply your changes to the latest version and try again
2018/04/18 11:00:27 failed to advertise labels: Operation cannot be fulfilled on nodes "k8-vm-2": the object has been modified; please apply your changes to the latest version and try again
2018/04/18 11:00:27 error occurred while updating node with feature labels: Operation cannot be fulfilled on nodes "k8-vm-2": the object has been modified; please apply your changes to the latest version and try again

That issue has likely always been there, but the probability of hitting it was low, so it has not surfaced before.
It seems there is a window during which a node-labels update is bound to fail?
Can that state be detected somehow so that we don't try to update?
Or should we retry after some delay once such a condition is hit?
Or is this a sign of some other problem somewhere else?
In any case, we should try to do better than exit, now that we can run as a daemonset.
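
For reference, the error above is the apiserver's optimistic-concurrency conflict: every object carries a resourceVersion, and an Update that sends a stale resourceVersion is rejected with "the object has been modified". The "window" is simply any other writer (e.g. kubelet status updates) touching the Node between NFD's Get and Update. The condition is detectable with client-go's apierrors.IsConflict, which makes it a natural candidate for a bounded retry rather than an exit. A minimal detection sketch, not NFD's actual code, assuming a recent client-go:

```go
// Minimal detection sketch, not NFD's actual code: client-go surfaces the
// "object has been modified" rejection as a conflict error (HTTP 409), so the
// caller can tell this transient window apart from a permanent failure.
package sketch

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// isRetryableLabelUpdateError reports whether a node-update error is the
// transient resourceVersion conflict seen in the log above, i.e. a case
// where re-reading the node and retrying is expected to succeed.
func isRetryableLabelUpdateError(err error) bool {
	return apierrors.IsConflict(err)
}
```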

@okartau
Contributor Author

okartau commented Apr 18, 2018

I discovered the issue in a cluster with 4 nodes, running k8s version 1.9.3 and the current master of NFD.
I plan to verify with a recent k8s version before exploring further.

@okartau
Contributor Author

okartau commented Apr 19, 2018

Termination happens implicitly in stderrLogger.Fatalf(), which gets called from the main loop in case of an error.
Note that before the "run as daemonset" change, this was the last line of main(), so the forced exit did not change the outcome of a one-shot run at all.
Now it makes a big difference in daemonset mode.
One simple improvement would be to use a less aggressive log level and avoid the forced exit.
Then another area would remain to be studied: why updating labels sometimes fails.
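
For context, this is standard library behaviour rather than anything NFD-specific: Go's (*log.Logger).Fatalf is equivalent to Printf followed by os.Exit(1), so any error routed through stderrLogger.Fatalf() terminates the whole process and, in daemonset mode, triggers a pod restart. A small illustration of the two options discussed here:

```go
// Illustration only: why Fatalf exits the pod, and the less aggressive
// alternative of logging and continuing. log.Fatalf is documented as Printf
// followed by os.Exit(1), so deferred cleanup never runs and the container
// terminates with a non-zero status.
package main

import (
	"errors"
	"log"
	"os"
)

var stderrLogger = log.New(os.Stderr, "", log.LstdFlags)

func main() {
	err := errors.New("failed to advertise labels")

	// Daemon-unfriendly: kills the process on a (possibly transient) error.
	// stderrLogger.Fatalf("error occurred while updating node with feature labels: %v", err)

	// Less aggressive log level: report the failure and keep the loop alive.
	stderrLogger.Printf("error occurred while updating node with feature labels: %v", err)
}
```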

@balajismaniam
Contributor

balajismaniam commented Apr 19, 2018

I commented about this in the original PR: #105 (comment). IMO we should not allow sleep intervals as low as 1s. We should emit a warning if it is set that low and fall back to a higher default such as 30s. Node features are not expected to change at such short intervals.
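
A hypothetical sketch of that check (the --sleep-interval flag name, the 60s default and the 30s floor are illustrative here, not a statement of NFD's actual option handling):

```go
// Hypothetical sketch of warning on and clamping a too-low sleep interval;
// the flag name and the 30s floor are illustrative, not NFD's real defaults.
package main

import (
	"flag"
	"log"
	"time"
)

const minSleepInterval = 30 * time.Second

func main() {
	sleepInterval := flag.Duration("sleep-interval", 60*time.Second,
		"time to sleep between re-labeling rounds")
	flag.Parse()

	if *sleepInterval < minSleepInterval {
		log.Printf("WARNING: sleep-interval %v is too low, using %v instead",
			*sleepInterval, minSleepInterval)
		*sleepInterval = minSleepInterval
	}

	for {
		// ... detect features and advertise labels here ...
		time.Sleep(*sleepInterval)
	}
}
```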

@balajismaniam
Contributor

2018/04/18 11:00:27 can't update node: Operation cannot be fulfilled on nodes "k8-vm-2": the object has been modified; please apply your changes to the latest version and try again
2018/04/18 11:00:27 failed to advertise labels: Operation cannot be fulfilled on nodes "k8-vm-2": the object has been modified; please apply your changes to the latest version and try again
2018/04/18 11:00:27 error occurred while updating node with feature labels: Operation cannot be fulfilled on nodes "k8-vm-2": the object has been modified; please apply your changes to the latest version and try again

These errors correspond to a race between the get and the update of the node at the 1s interval, I think.

@okartau
Contributor Author

okartau commented Apr 19, 2018

1s is too short, sure; it was just my attempt to accelerate triggering of the condition. I have now set it to a somewhat longer 4s and I see the error triggered about once per 4h. But we can't forget that pods initially terminated even with the official 60s. I can leave such a system running with captured logs to verify whether it was really the same cause. In my current trial with the 4s interval I have changed the fatal to a print so that the pod continues instead of terminating, and it seems the next try succeeds, so in practice we could perhaps even accept such single failures (and avoid the exit).

@marquiz
Contributor

marquiz commented Apr 26, 2018

Yes, there are occasional restarts (i.e. fatal exits) even with 60s interval.

It would be good to understand the root cause of these failures.

I don't think we should lower the log level from fatal, because then we would hide the fact that labeling actually fails. That, in turn, could mask some other problem that was causing labeling to fail every time. So I think NFD exiting with a fatal error is actually a good thing, as it reveals that a problem exists. In the normal scenario the restarts are rare enough that it doesn't end up in CrashLoopBackOff or anything.

@marquiz marquiz mentioned this issue May 4, 2018
@okartau
Contributor Author

okartau commented Jul 3, 2018

I agree, keeping the exit is the right thing; my first idea of changing the exit to a print is not very good.

@marquiz marquiz mentioned this issue Aug 17, 2018
@marquiz marquiz added the help wanted label Nov 20, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Apr 27, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 27, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@uniemimu
Contributor

/reopen
/remove-lifecycle rotten

This is still happening. The label update fails due to the object changing before the update.

2020/08/11 09:07:47 Sending labeling request to nfd-master
2020/08/11 09:07:47 failed to set node labels: rpc error: code = Unknown desc = Operation cannot be fulfilled on nodes "cfl-nuci5-2": the object has been modified; please apply your changes to the latest version and try again
2020/08/11 09:07:47 ERROR: failed to advertise labels: rpc error: code = Unknown desc = Operation cannot be fulfilled on nodes "cfl-nuci5-2": the object has been modified; please apply your changes to the latest version and try again

As the log suggests, a simple limited-round retry loop with an object refresh would probably sort this out.
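
A sketch of that suggestion using client-go's stock helper (retry.RetryOnConflict re-runs the closure a bounded number of times, and re-reading the node inside the closure provides the object refresh). The client, nodeName and labels variables are placeholders, and the Get/Update signatures assume a client-go version that takes a context:

```go
// Sketch of a bounded retry with object refresh around the node update.
// client, nodeName and labels are placeholders, not NFD's real identifiers.
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

func advertiseLabels(ctx context.Context, client kubernetes.Interface, nodeName string, labels map[string]string) error {
	// retry.DefaultRetry retries a handful of times with a short backoff,
	// but only when the closure returns a conflict error.
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Refresh the object on every attempt so the Update carries the
		// latest resourceVersion instead of the stale one that conflicted.
		node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if node.Labels == nil {
			node.Labels = map[string]string{}
		}
		for k, v := range labels {
			node.Labels[k] = v
		}
		_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
		return err
	})
}
```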

@k8s-ci-robot
Contributor

@uniemimu: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen
/remove-lifecycle rotten

This is still happening. The label update fails due to the object changing before the update.

2020/08/11 09:07:47 Sending labeling request to nfd-master
2020/08/11 09:07:47 failed to set node labels: rpc error: code = Unknown desc = Operation cannot be fulfilled on nodes "cfl-nuci5-2": the object has been modified; please apply your changes to the latest version and try again
2020/08/11 09:07:47 ERROR: failed to advertise labels: rpc error: code = Unknown desc = Operation cannot be fulfilled on nodes "cfl-nuci5-2": the object has been modified; please apply your changes to the latest version and try again

As the log suggests, a simple limited-round retry loop with an object refresh would probably sort this out.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten label Aug 11, 2020
@marquiz
Contributor

marquiz commented Aug 11, 2020

Thanks for the heads-up @uniemimu! Need to fix this

/reopen

@k8s-ci-robot
Contributor

@marquiz: Reopened this issue.

In response to this:

Thanks for the heads-up @uniemimu! Need to fix this

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@marquiz
Contributor

marquiz commented Aug 20, 2020

@uniemimu could you try out #336 and see if it fixes the problem for you?

@uniemimu
Contributor

@marquiz No issues seen after over 48h of running with something like 8 nodes. With the previous version it would have tipped over several times already. LGTM!

@marquiz
Contributor

marquiz commented Aug 22, 2020

Thanks @uniemimu for reporting back! Same here: I haven't seen any failures on my test clusters either.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Nov 20, 2020
@marquiz
Contributor

marquiz commented Nov 24, 2020

Fixed by #336
