Netdata causing load average increase every ~20 minutes #5234

Open
Daniel15 opened this Issue Jan 21, 2019 · 12 comments


Daniel15 commented Jan 21, 2019

Bug report summary

I noticed on two of my servers that the load average was increasing approximately every 20 minutes.

I suspected a cronjob, but I don't have any cronjobs that run every 20 mins. However, CPU usage doesn't actually increase during that period.

I did some digging and it took a long time to work out what was happening.

Linux updates the load average every 5 seconds. In fact, it actually updates every 5 seconds plus one "tick":

sched/loadavg.h:

#define LOAD_FREQ	(5*HZ+1) /* 5 sec intervals */

sched/loadavg.c:

 * The global load average is an exponentially decaying average of nr_running +
 * nr_uninterruptible.
 *
 * Once every LOAD_FREQ:
 *
 *   nr_active = 0;
 *   for_each_possible_cpu(cpu)
 *	nr_active += cpu_of(cpu)->nr_running + cpu_of(cpu)->nr_uninterruptible;
 *
 *   avenrun[n] = avenrun[0] * exp_n + nr_active * (1 - exp_n)

HZ is the kernel timer frequency, which is defined when compiling the kernel. On my system, it's 250:

% grep "CONFIG_HZ=" /boot/config-$(uname -r)
CONFIG_HZ=250

This means that every 5.004 seconds (5 + 1/250), Linux calculates the load average. It checks how many processes are actively running plus how many processes are in uninterruptible wait (e.g. waiting for disk IO) states, and uses that to compute the load average, smoothing it exponentially over time.

Say you have a process that starts a bunch of subprocesses every second, for example Netdata collecting data from some apps. Normally, those processes will be very fast and won't overlap with the load average check, so everything is fine. However, every 1251 seconds (5.004 * 250), the load average update falls exactly on a one-second boundary (1251 is the least common multiple of 5.004 and 1). 1251 seconds is 20.85 minutes, which is exactly the interval at which I was seeing the load average increase. My educated guess here is that every 20.85 minutes, Linux is checking the load average at the exact time that several processes are being started and are in the queue to run.
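To make the arithmetic concrete, here is a rough standalone C sketch (not Netdata code) that walks the load-average samples until one lands exactly on a poll boundary; hz and poll are just the example inputs from above:

#include <stdio.h>

int main(void) {
    int hz = 250;                        /* CONFIG_HZ, 250 on my kernel          */
    double poll = 1.0;                   /* collector polling period in seconds  */
    double load_freq = 5.0 + 1.0 / hz;   /* LOAD_FREQ = 5*HZ+1 ticks = 5.004 s   */

    /* walk load-average samples until one coincides with a poll boundary */
    for (int i = 1; i <= 86400; i++) {
        double t = load_freq * i;                          /* time of the i-th sample */
        double r = t - poll * (long long)(t / poll + 0.5); /* distance to nearest poll */
        if (r > -1e-6 && r < 1e-6) {
            printf("coincidence after %d samples = %.3f s (%.2f min)\n", i, t, t / 60.0);
            break;
        }
    }
    return 0;
}

For HZ=250 and a 1-second poll it prints 1251 seconds (20.85 minutes), matching the interval above.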

I confirmed this by disabling netdata and manually watching the load average:

while true; do uptime; sleep 5; done

After 1.5 hours, I did not see any similar spikes. The spikes only occur when Netdata is running.

It turns out other people have hit similar issues in the past, albeit with different intervals. The following posts were extremely helpful:

In the end, I'm not sure if I'd call this a bug, but perhaps netdata could implement some jitter so that it doesn't perform checks every one second exactly.

OS / Environment

Debian buster

Netdata version (output of netdata -V)
netdata v1.12.0-rc2-51-nightly
Component Name

N/A

cakrit (Contributor) commented Jan 21, 2019

I wrote a small script to calculate exactly when we would expect the same condition to occur based on the polling period and the kernel frequency. The results are here. It's impossible to replicate on my Manjaro, which has CONFIG_HZ=300; I will try to see it on a VM that has CONFIG_HZ=1000. The expected period to see the spike will be 5001 sec (about 83 minutes) for a poll period of 1 sec.

If anyone wants to redo the calculations to find the minimum period for a different frequency, the script is this:

#!/bin/bash

# For each kernel timer frequency, find the first load-average sample that
# lands exactly on a multiple of the polling period.  The printed value is the
# number of LOAD_FREQ intervals; multiply by $timer to get the period in seconds.
for hz in 100 200 250 1000; do
    echo "Freq=$hz"
    timer=$(echo "scale=5;5+1/$hz" | bc)   # LOAD_FREQ in seconds = 5 + 1/HZ
    echo "timer=$timer"
    for poll in 1 2 3 4 5 6 7 8 9 10; do
        i=1
        while [ $i -lt 86401 ]; do
            if [ $(echo "($timer*$i)%$poll>0" | bc) -eq 0 ]; then
                echo "$i"
                break
            fi
            i=$((i+1))
        done
    done
done

mfundul (Contributor) commented Jan 21, 2019

Well, if netdata really is doing CPU work, the load average is kind of accurate. One idea is to scatter thread scheduling in netdata so that not everything wakes up at once.

Ferroin (Member) commented Jan 21, 2019

Confirmed that it's correlated to collection by netdata: I'm not seeing any such spikes on systems that have only a few collectors and almost nothing that triggers new process creation.

I think adding some jitter to the data collection probably is the best option here, though I really don't know what to suggest for the range of jitter.

cakrit (Contributor) commented Jan 21, 2019

A random jitter wouldn't help; it would make things worse. At least now the occurrence is predictable and very well documented in the mackerel link @Daniel15 provided. Ideally, we would take the polling period and CONFIG_HZ and be able to predictably add a 1ms delay once every X sec, where X is the time window from one artificial spike to the next. The only other (simpler) alternative I can think of is to sample twice before sending the information (sample, wait 1msec, sample again). Presumably, the two values would be significantly different only at these points, so we could throw away the artificially high value.

mfundul (Contributor) commented Jan 21, 2019

CONFIG_HZ hacks are only valid for Linux. There is also the so-called race-to-idle approach to doing things, meaning that you may even want the data collectors to run concurrently and quickly so as to improve the power efficiency of the system. I like the idea of sampling the load average twice though.

Daniel15 (Author) commented Jan 21, 2019

I wrote a small script to calculate exactly when we would expect the same condition to occur based on the polling period and the kernel frequency. The results are here.

@cakrit You don't even need a script if you're using Google Sheets 😃. This is a mathematical concept called least common multiple, and Google Sheets has a function called LCM to calculate it.

In arithmetic and number theory, the least common multiple, usually denoted by LCM(a, b), is the smallest positive integer that is divisible by both a and b

Google Sheets' implementation only supports integers so you'd need to work in milliseconds:

=LCM(1000, (5 + 1/250) * 1000) / 1000

will give you the 1251 number I mentioned in my original post. The first 1000 is the Netdata polling period in milliseconds, and the 250 is the CONFIG_HZ value.

I used this site when writing my original post: https://www.hackmath.net/en/calculator/least-common-multiple?input=5.004+1%0D%0A&submit=Calculate+LCM

The only other (simpler) alternative I can think of, is to sample twice before sending the information (sample, wait 1msec, sample again). Presumably, the two values would be significantly different only at these points

Do you mean netdata's sampling of the load average? Linux only updates the load average once every five seconds, and spikes take a while to decay since it uses the previous load average as part of the computation for the current load average, so that wouldn't work. Once this "spike" happens, it can take 1-2 minutes for it to fully decay.
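
(For reference: the kernel's smoothing factor for the 1-minute average is roughly exp(-5/60) ≈ 0.92 per 5-second sample, so a single inflated sample decays to about 37% of its height after one minute and about 13% after two, which is why the spike takes a couple of minutes to disappear from the chart.)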

I guess one option is to modify the Linux kernel itself to sample the load average twice and avoid outliers, but that's likely outside the scope of what anyone in this issue wants to work on 😛

cakrit (Contributor) commented Jan 22, 2019

Of course @Daniel15, you're too kind, what I wrote was a bit idiotic. :-)
I updated the sheet to use LCM and you can just play with the inputs. Unfortunately LCM works only for integers, so you'll see that for freq 300Hz, the cycles are not exact, but the time still shows up as an integer.

It looks like anything we could do on the netdata side would be very involved. Unless someone can suggest an algorithm that predictably avoids this issue, my only suggestion would be to use CONFIG_HZ=300, which at least in theory wouldn't have a least common multiple with integer polling periods. In practice there's always rounding, so you can't avoid it unless you're explicitly introducing a delay at the right moment.

Regarding possibilities for such an algorithm, I don't know if /proc/timer_list can help at all; I don't know what I'm reading there.

mfundul (Contributor) commented Jan 22, 2019

The issue is inflating the load average, but even if it can be avoided, maybe the trade-off is not worth it. Even if netdata threads are uniformly distributed over a large enough time slice, that leaves the system less time to go into deep sleep states.

Daniel15 (Author) commented Jan 22, 2019

Unless someone can suggest an algorithm that predictably avoids this issue

Could it check the value of CONFIG_HZ in /boot/config-$(uname -r)? That might not work for all Linux distros, but it's there for Debian at least. I did read something about getconf CLK_TCK, but that returns 100 for me, so I guess it's something else.
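
As a very rough illustration of what that could look like (a hypothetical standalone sketch, not existing netdata code; the Debian-style /boot/config-$(uname -r) path is an assumption that won't hold on every distro):

#include <stdio.h>
#include <sys/utsname.h>

/* hypothetical sketch: try to read CONFIG_HZ from the distro's kernel config;
   returns 0 when the file or the option is missing */
static int detect_config_hz(void) {
    struct utsname u;
    char path[512], line[256];
    int hz = 0;

    if (uname(&u) != 0)
        return 0;
    snprintf(path, sizeof(path), "/boot/config-%s", u.release);

    FILE *fp = fopen(path, "r");
    if (!fp)
        return 0;   /* not all distros ship the config here */

    while (fgets(line, sizeof(line), fp)) {
        if (sscanf(line, "CONFIG_HZ=%d", &hz) == 1)
            break;
    }
    fclose(fp);
    return hz;
}

int main(void) {
    printf("CONFIG_HZ=%d\n", detect_config_hz());
    return 0;
}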

Even if netdata threads are uniformly distributed in a large enough time slice that leaves the system less time to go to deep sleep states.

It would still be good to have the option for it. Right now, Netdata shows that my server has a large load spike every 20 minutes, but it's really just an artificial thing. I'd like to be able to configure alarms without this artificial load influencing them.

I'm not sure if virtualized servers even go into deep sleep states?

It looks like anything we could do on the netdata side would be very involved

For what it's worth, Telegraf supports jitter, which could be used to make the issue less likely to occur:

collection_jitter: Collection jitter is used to jitter the collection by a random interval. Each plugin will sleep for a random time within jitter before collecting. This can be used to avoid many plugins querying things like sysfs at the same time, which can have a measurable effect on the system.
https://github.com/influxdata/telegraf/blob/master/docs/CONFIGURATION.md#agent
influxdata/telegraf#3465

cakrit (Contributor) commented Jan 23, 2019

Ok, so it's a random jitter per plugin, up to the configured maximum, which means that they will be collecting at slightly different offsets each time. This could work, with the caveat that we're already collecting per second and going towards milliseconds, so it's a little different from what telegraf does.
@ilyam8 and @vlvkobal, is this something we could add easily? @ktsaou, what do you think?

ktsaou (Member) commented Jan 23, 2019

We could add a jitter. Is that a big deal?
But have you actually verified this is the case?

What I would like to know is whether this is triggered by the netdata daemon or the external plugins, or all of them.

If this is triggered by the netdata daemon (internal data collection plugins), they all use the heartbeat function:

usec_t heartbeat_next(heartbeat_t *hb, usec_t tick) {
    heartbeat_t now;
    now.monotonic = now_monotonic_usec();
    now.realtime  = now_realtime_usec();

    /* sleep until the next multiple of 'tick' on the monotonic clock */
    usec_t next_monotonic = now.monotonic - (now.monotonic % tick) + tick;
    while(now.monotonic < next_monotonic) {
        sleep_usec(next_monotonic - now.monotonic);
        now.monotonic = now_monotonic_usec();
        now.realtime  = now_realtime_usec();
    }

    if(likely(hb->realtime != 0ULL)) {
        usec_t dt_monotonic = now.monotonic - hb->monotonic;
        usec_t dt_realtime  = now.realtime  - hb->realtime;

        hb->monotonic = now.monotonic;
        hb->realtime  = now.realtime;

        /* warn if we overslept by more than half a tick */
        if(unlikely(dt_monotonic >= tick + tick / 2)) {
            errno = 0;
            error("heartbeat missed %llu monotonic microseconds", dt_monotonic - tick);
        }

        return dt_realtime;
    }
    else {
        hb->monotonic = now.monotonic;
        hb->realtime  = now.realtime;
        return 0ULL;
    }
}

A simple offset there could solve the problem.
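
As a self-contained illustration of that idea (a hypothetical sketch, not netdata code; the next_wakeup() helper and the offset values are made up for the example):

#include <stdio.h>

typedef unsigned long long usec_t;

/* align the next wakeup to the tick boundary, as heartbeat_next() does above,
   plus a small fixed per-thread offset so collectors don't all wake at the
   exact same instant */
static usec_t next_wakeup(usec_t now, usec_t tick, usec_t offset) {
    return now - (now % tick) + tick + offset;
}

int main(void) {
    usec_t tick = 1000000;         /* 1 s poll period, in microseconds        */
    usec_t now  = 5000000123ULL;   /* some arbitrary "current" monotonic time */

    /* two collectors with different offsets wake 300 us and 700 us past the
       second, instead of both waking exactly on the second boundary */
    printf("collector A wakes at %llu us\n", next_wakeup(now, tick, 300));
    printf("collector B wakes at %llu us\n", next_wakeup(now, tick, 700));
    return 0;
}

Unlike a random jitter, a fixed per-collector offset keeps each collection period exact; it only shifts the phase so that the collectors no longer pile up on the same instant.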

It may also be triggered by apps.plugin, which is also more resource hungry than anything else. This will also be influenced by the heartbeat functions.

So, can we test this?

Ferroin (Member) commented Jan 23, 2019

@ktsaou I think it's probably a cumulative thing. Stuff that actually has to spawn a command to get data (the postfix module, for example) is almost certainly the biggest offender, but everything that wakes up to fetch data is contributing at least a little.
