Performance problems #282

zuzzas · 2019-05-29T20:15:40Z

NPD Version

v0.6.3

Kernel version

4.15.0

What happened?

The kernel log got hammered with OOM kills, about 4500 in a minute. This led to extremely high CPU usage by node-problem-detector.

At a cursory glance, most of the time is spent in the kmsg watchLoop. I used Linux's perf utility because I had no time to debug the issue properly with go tool pprof.

I'll try to find some time for the benchmarks, just leaving this here for now.


xueweiz · 2019-06-04T21:22:11Z

/assign @xueweiz
Thanks for filing the issue! I wanted to do some performance testing & improvement on NPD as well.

I think we should make sure NPD's resource consumption (CPU, memory, IO) is always bounded.
I.e., if a million problems are occurring at once, NPD should refrain from using too many resources to detect and report them. It's better to detect and report the issues slowly than to further punish the system by taking more resources away.

We could start by putting a cap on NPD's CPU usage. I think we could use either cgroups or cpulimit for that. We could also add a systemd unit that starts NPD with a recommended configuration under node-problem-detector/deployment/systemd/

@wangzhen127 WDYT?

@zuzzas Hi, are you running NPD as a standalone daemon or as a Kubernetes DaemonSet? And if you are running it as a standalone daemon, does your OS have systemd? Thanks!

k8s-ci-robot · 2019-06-04T21:22:12Z

@xueweiz: GitHub didn't allow me to assign the following users: xueweiz.

Note that only kubernetes members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


wangzhen127 · 2019-06-05T19:19:02Z

> I think we should make sure NPD's resource consumption (CPU, memory, IO) is always bounded.
> I.e., if a million problems are occurring at once, NPD should refrain from using too many resources to detect and report them. It's better to detect and report the issues slowly than to further punish the system by taking more resources away.

Totally agree.

> We could start by putting a cap on NPD's CPU usage. I think we could use either cgroups or cpulimit for that. We could also add a systemd unit that starts NPD with a recommended configuration under node-problem-detector/deployment/systemd/

This can be the starting point. I think the problem is that NPD still needs to do the same amount of work; it just becomes slower.

I think a better (and long-term) way is to define priorities for the problems. When NPD finds itself approaching the resource usage limit, it should try to complete the high-priority problem detections first and drop low-priority ones. Or it could have a "degrade" mode, where NPD drops log entries in order to catch up with log production within its CPU limit.
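
A rough Go sketch of the priority idea, under stated assumptions: a fixed-size queue that, once full, keeps the higher-priority entries and drops the lowest-priority one. The `entry` and `boundedQueue` types, the integer priority scale, and the sample log lines are all hypothetical and do not exist in NPD.

```go
package main

import "fmt"

// entry is a log line tagged with a priority (higher means more important).
// The priority scale is an assumption made for this sketch.
type entry struct {
	priority int
	line     string
}

// boundedQueue holds at most limit entries. Once it is full it behaves
// like a "degrade" mode: the lowest-priority entry is dropped.
type boundedQueue struct {
	limit   int
	entries []entry
	dropped int // number of entries shed so far
}

// push adds e. If the queue is full, either e or the lowest-priority
// queued entry is dropped, whichever is less important.
func (q *boundedQueue) push(e entry) {
	if q.limit <= 0 {
		q.dropped++
		return
	}
	if len(q.entries) < q.limit {
		q.entries = append(q.entries, e)
		return
	}
	q.dropped++
	// Find the lowest-priority entry already queued.
	lowest := 0
	for i, x := range q.entries {
		if x.priority < q.entries[lowest].priority {
			lowest = i
		}
	}
	if q.entries[lowest].priority >= e.priority {
		return // e is no more important than anything queued: drop e
	}
	// Evict the lowest-priority entry and keep arrival order for the rest.
	q.entries = append(q.entries[:lowest], q.entries[lowest+1:]...)
	q.entries = append(q.entries, e)
}

func main() {
	q := &boundedQueue{limit: 2}
	q.push(entry{priority: 1, line: "low: informational message"})
	q.push(entry{priority: 3, line: "high: OOM kill"})
	q.push(entry{priority: 2, line: "medium: task hung"})
	for _, e := range q.entries {
		fmt.Println(e.priority, e.line)
	}
	fmt.Println("dropped:", q.dropped)
}
```

The linear scan is fine for a small limit; a heap would be the usual choice if the queue were large.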

cc @Random-Liu @andyxning @dchen1107 for thoughts.

zuzzas · 2019-06-06T09:34:04Z

@xueweiz
As a DaemonSet. CPU Limits, obviously, worked, but the container got throttled into oblivion.

fejta-bot · 2019-09-04T09:56:15Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

wangzhen127 · 2019-09-04T16:54:56Z

/remove-lifecycle stale

fejta-bot · 2019-12-03T17:11:50Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot · 2020-01-02T17:57:03Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

zuzzas · 2020-01-13T11:06:39Z

/remove-lifecycle rotten

fejta-bot · 2020-04-12T11:15:41Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot · 2020-05-12T11:59:19Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot · 2020-06-11T12:42:45Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot · 2020-06-11T12:42:59Z

@fejta-bot: Closing this issue.


wangzhen127 · 2020-06-11T16:00:56Z

/reopen

k8s-ci-robot · 2020-06-11T16:01:11Z

@wangzhen127: Reopened this issue.


wangzhen127 · 2020-06-11T16:01:13Z

/remove-lifecycle rotten

fejta-bot · 2020-09-09T16:50:32Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot · 2020-10-09T17:32:47Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot · 2020-11-08T18:15:11Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot · 2020-11-08T18:15:25Z

@fejta-bot: Closing this issue.


