Can't fix persistent 'netdev budget ran outs' with 45k alerts, need help #4624

Closed
rootella opened this Issue Nov 12, 2018 · 5 comments

rootella commented Nov 12, 2018

Community 👋
I'm a happy new netdata user and have deployed it on several servers. On some of them I keep getting the alert in the title. I tried following some hints but cannot make the alert go away. (These servers handle latency-sensitive TCP streaming traffic.)

Software specs: Ubuntu Server 16.04 with the BBR congestion control algorithm
Hardware: Ryzen 1700X, 64 GB RAM, Intel NIC
Background: Managing latency-sensitive TCP streams, 15-20k sockets with low throughput (30-50 Mb/s total) through Docker

Any help will be appreciated!

ethtool -c eth0

Coalesce parameters for eth0:
Adaptive RX: off TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
rx-usecs: 3
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 0
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

ethtool -g eth0 (I'm aware of bufferbloat, so I set the rings to half of the maximum. Is it worth it?)

Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 2048
RX Mini: 0
RX Jumbo: 0
TX: 2048

Testing sysctl (I raised some values, with no success)

net.core.somaxconn = 8192
net.core.netdev_max_backlog = 10000
net.core.netdev_budget = 65535
net.core.netdev_budget_usecs = 2000


Member

ktsaou commented Nov 12, 2018

Try increasing net.core.netdev_budget_usecs to 20000.
On my systems this works.
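
For reference, a value like this can be applied at runtime by writing to the matching file under /proc/sys, which is what `sysctl -w` does. A minimal Python sketch, assuming root privileges and a kernel that exposes the key; the helper names `sysctl_path` and `write_sysctl` are illustrative, and the change does not survive a reboot unless it is also added to a sysctl configuration file.

```python
from pathlib import Path

def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file, e.g. net.core.x -> /proc/sys/net/core/x."""
    return Path(root).joinpath(*key.split("."))

def write_sysctl(key, value, root="/proc/sys"):
    """Write a value to a sysctl file (needs root on a real system)."""
    path = sysctl_path(key, root)
    path.write_text(f"{value}\n")
    return path

# Example (as root): write_sysctl("net.core.netdev_budget_usecs", 20000)
```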

ktsaou added the question label Nov 12, 2018

Member

ktsaou commented Nov 12, 2018

Explanation: net.core.netdev_budget_usecs sets the maximum time, in microseconds, that the kernel may spend de-queuing the network device buffer in one polling cycle. Your current value of 2000 is 1/500th of a second, which is just too short for this kind of work if the CPU is slow. So increase it to 20000, i.e. 1/50th of a second.
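
To see whether a change to this setting has any effect, it can help to watch the raw kernel counter directly. A minimal Python sketch, assuming the alarm is driven by the third hexadecimal column (time_squeeze) of /proc/net/softnet_stat, which has one row per CPU; the function name is hypothetical.

```python
from pathlib import Path

def parse_time_squeeze(text):
    """Return the time_squeeze counter (3rd hex column) for each CPU row."""
    counts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        counts.append(int(fields[2], 16))
    return counts

if __name__ == "__main__":
    path = Path("/proc/net/softnet_stat")  # Linux only
    if path.exists():
        per_cpu = parse_time_squeeze(path.read_text())
        print("time_squeeze per CPU:", per_cpu, "total:", sum(per_cpu))
```

Running it before and after a sysctl change and comparing how fast the totals grow gives a rough picture without waiting for the alarm to clear.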

Collaborator

Ferroin commented Nov 13, 2018

Note that, depending on your exact hardware, software versions, amount of network traffic, and hardware settings, it may not be possible to tune things so that these alerts stop without severely impacting networking performance.

Alternatively, you can make the alarm less sensitive by running /etc/netdata/edit-config health.d/softnet.conf as root and modifying the line that says warn: $this > (($status >= $WARNING) ? (0) : (10)) in the section that starts with alarm: 10min_netdev_budget_ran_outs. The first number in parentheses should be just above the normal spikes you see in the value of the alarm, and the second number in parentheses should be 10-20 above the first one.
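
The warn expression is a hysteresis check: while the alarm is already in WARNING the first number applies, otherwise the second, higher one does. A small Python sketch of that logic and of the threshold choice described above; the function names, the margin of 15, the "+1" offset and the sample spike value are illustrative assumptions, not Netdata code.

```python
def warn_condition(value, in_warning, clear_above, raise_above):
    """Mimic: $this > (($status >= $WARNING) ? (clear_above) : (raise_above))."""
    threshold = clear_above if in_warning else raise_above
    return value > threshold

def pick_thresholds(normal_spike, margin=15):
    """First number just above the usual spikes, second one 10-20 higher."""
    clear_above = normal_spike + 1  # "just above" taken as +1 here (assumption)
    return clear_above, clear_above + margin

clear_above, raise_above = pick_thresholds(normal_spike=40)  # hypothetical spike level
print(f"warn: $this > (($status >= $WARNING) ? ({clear_above}) : ({raise_above}))")
```

The printed line has the same shape as the one in health.d/softnet.conf, so it can be compared against the existing entry before editing.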

Saruspete commented Nov 23, 2018

The net.core.netdev_budget_usecs sysctl is not available on all distributions.
For a long time this time limit was hard-coded to 2 jiffies, and RHEL/CentOS (kernel 3.10) has no sysctl to tune it. Even if you increase netdev_budget, the time limit will still end the polling loop, so the alert cannot be fixed there.

Is there any way to check for the presence of the sysctl /proc/sys/net/core/netdev_budget_usecs and adjust the value accordingly?
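
Checking for the sysctl itself is straightforward, since every sysctl is a file under /proc/sys. A minimal Python sketch with a hypothetical helper name; on kernels without the key, such as the 3.10 kernel mentioned above, it returns None.

```python
from pathlib import Path

BUDGET_USECS = Path("/proc/sys/net/core/netdev_budget_usecs")

def read_budget_usecs(path=BUDGET_USECS):
    """Return the current value as an int, or None if the kernel lacks the sysctl."""
    try:
        return int(Path(path).read_text().strip())
    except FileNotFoundError:
        return None

value = read_budget_usecs()
if value is None:
    print("netdev_budget_usecs is not available on this kernel")
else:
    print(f"netdev_budget_usecs = {value} ({value / 1_000_000:.4f} s)")
```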

cakrit added the area/health label Nov 25, 2018

Contributor

cakrit commented Nov 29, 2018

> Is there any way to check for the presence of the sysctl /proc/sys/net/core/netdev_budget_usecs and adjust the value accordingly?

If you mean adjusting the alert threshold automatically, then the answer is no. What you probably want to do is change the trigger for the alarms you're receiving by editing the relevant conf file. You can see the name of that file in the "source" field of your alarm in the netdata UI (last row of the table to the right of your active alarm). In this case, it is health.d/softnet.conf.

cakrit closed this Nov 29, 2018
