Performance problems #282
/assign @xueweiz

I think we should make sure NPD's resource consumption (CPU, memory, IO) is always bounded. We could start by putting a cap on NPD's CPU usage, using either cgroups or cpulimit. We could also add a systemd unit with a recommended configuration for starting NPD under node-problem-detector/deployment/systemd/. @wangzhen127 WDYT?

@zuzzas Hi, I wonder if you are running NPD as a standalone daemon or as a Kubernetes DaemonSet? And if you are running it as a standalone daemon, does your OS have systemd? Thanks!
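For reference, a minimal sketch of what such a unit under deployment/systemd/ might look like, using systemd's own resource-control directives to bound CPU and memory via cgroups. The binary path and the specific limit values here are illustrative assumptions, not NPD's shipped defaults:

```ini
# node-problem-detector.service (sketch; path and limits are placeholders)
[Unit]
Description=Kubernetes node problem detector
After=network.target

[Service]
ExecStart=/usr/local/bin/node-problem-detector
Restart=always
RestartSec=10
# systemd.resource-control(5) knobs that bound resource usage via cgroups:
CPUQuota=20%        # never use more than 20% of one CPU
MemoryMax=100M      # hard memory cap (MemoryLimit= on pre-231 systemd)

[Install]
WantedBy=multi-user.target
```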
@xueweiz: GitHub didn't allow me to assign the following users: xueweiz.

Note that only kubernetes members and repo collaborators can be assigned, and that issues/PRs can only have 10 assignees at the same time.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Totally agree.
This can be the starting point. But I think the problem is that NPD still needs to do the same amount of work; it just becomes slower. A better (and longer-term) approach is to define priorities for the problems. When NPD finds itself approaching its resource usage limit, it should try to complete the high-priority problem detections first and drop the low-priority ones. Or it could have a "degrade" mode, in which NPD drops log entries in order to keep up with log production within the CPU limit.

cc @Random-Liu @andyxning @dchen1107 for thoughts.
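As a rough illustration of the degrade-mode idea (not NPD's actual code; all names in this sketch are made up), a producer can shed log entries with a non-blocking send into a bounded buffer, so a flood of kernel messages costs a counter increment instead of an unbounded backlog:

```go
package main

import "fmt"

// entry stands in for a parsed kmsg line; the types and names here are
// hypothetical, not node-problem-detector's real ones.
type entry struct{ msg string }

// boundedQueue drops new entries once its buffer is full, trading
// completeness for a hard bound on backlog (and thus CPU/memory).
type boundedQueue struct {
	ch      chan entry
	dropped int // count of shed entries; not goroutine-safe as written
}

func newBoundedQueue(size int) *boundedQueue {
	return &boundedQueue{ch: make(chan entry, size)}
}

// push never blocks: if the consumer has fallen behind, the entry is
// dropped and counted instead of stalling the kmsg reader.
func (q *boundedQueue) push(e entry) {
	select {
	case q.ch <- e:
	default:
		q.dropped++ // degrade mode: shed load to stay within budget
	}
}

func main() {
	q := newBoundedQueue(4)
	// Simulate an OOM-kill storm far larger than the buffer.
	for i := 0; i < 4500; i++ {
		q.push(entry{msg: fmt.Sprintf("Out of memory: Kill process %d", i)})
	}
	close(q.ch)
	kept := 0
	for range q.ch {
		kept++
	}
	fmt.Printf("kept=%d dropped=%d\n", kept, q.dropped)
}
```

A real implementation would also need the prioritization described above, so that high-priority detections are never the ones shed.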
@xueweiz
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle rotten
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue.
/reopen
@wangzhen127: Reopened this issue.
/remove-lifecycle rotten
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue.
NPD Version: v0.6.3

Kernel version: 4.15.0

What happened?
The kernel log got hammered with OOM kills, roughly 4500 in a single minute. This led to extremely high CPU usage by node-problem-detector.

At a cursory glance, most of the time is spent in kmsg's watchLoop. I used Linux's `perf` utility because I had no time to debug the issue properly with `go tool pprof`. I'll try to find some time for proper benchmarks; just leaving this here for now.
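For anyone picking this up later: if the daemon links in net/http/pprof (I'm not sure whether NPD already exposes this, so treat the sketch below as an assumption), a CPU profile can be grabbed from the running process without `perf`:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on the default mux
)

func main() {
	// Illustrative only. With this listener in place, a 30s CPU profile can
	// be captured via: go tool pprof http://localhost:6060/debug/pprof/profile
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // stand-in for the daemon's real event loop
}
```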