Please get rid of hard-coded timeouts #617

Open

madduck opened this issue Jan 11, 2016 · 10 comments

madduck commented Jan 11, 2016

tl;dr: if a node takes a long time to respond, hard-coded timeouts take precedence over the --timeout command-line option.

Details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=786997

shallot commented Feb 6, 2016

It looks as if IO::Socket::INET6's Timeout setting is worthless: when Munin can't connect to a dead node, /usr/share/perl5/Munin/Master/Node.pm never actually emits the message from this block:

    if (! $self->{reader} ) {
            ERROR "Failed to connect to node $self->{address}:$self->{port}/tcp : $!";
            return 0;
    }

It just doesn't seem to happen.
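
If the socket's Timeout option really isn't honored on these code paths, one defensive workaround is to impose a deadline from the outside with alarm(). This is only a sketch of the idea, not Munin's actual code; $address, $port, and the 10-second budget are placeholders:

    use IO::Socket::INET6;

    # Sketch only: impose our own connect deadline with alarm() instead of
    # relying solely on the socket's Timeout option. $address, $port and
    # the 10-second budget are placeholders.
    my $sock = eval {
        local $SIG{ALRM} = sub { die "connect timed out\n" };
        alarm(10);
        my $s = IO::Socket::INET6->new(
            PeerAddr => $address,
            PeerPort => $port,
            Proto    => 'tcp',
            Timeout  => 10,
        );
        alarm(0);
        $s;
    };
    alarm(0);    # make sure no alarm stays pending
    warn "Failed to connect to $address:$port/tcp: $@" unless $sock;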

shallot commented Feb 6, 2016

In any event, a defensive approach would definitely be to clamp the individual worker timeout to the global timeout. What would be the use case for allowing any worker to continue past the user-defined timeout?
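
Sketched in Perl, that clamp amounts to a one-liner; $worker_timeout and $global_timeout are illustrative names, not Munin's actual internals:

    use List::Util qw(min);

    # Never let a single worker's timeout exceed the user-configured
    # global timeout.
    my $effective_timeout = min($worker_timeout, $global_timeout);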

shallot commented Sep 8, 2017

It should also be noted that plugins/node.d/munin_stats and plugins/node.d/munin_update don't autodetect changes to update_rate or allow it to be configured, so after it is changed their warning and critical thresholds are wrong, as is their graph_info text.

shallot commented Jul 13, 2021

With munin-async this gets better: the master munin-update job seems to spend some time on the live nodes, and then, roughly 45 seconds in, mass-discards the various dead nodes, which then doesn't interfere with a 60s update_rate.

However, I'm still worried about the timings and will keep testing with more load.

shallot commented Nov 17, 2021

OK, so after some time I can say it's better, but still not actually working. Clients whose ssh connections stall for more than 60s are not actually timed out at the timeout set in the munin-update argument.

I tried adding this:

    ssh_command "timeout --foreground --preserve-status --verbose --signal=TERM --kill-after=59s 55s ssh"

Unfortunately that just breaks everything immediately.

Can something be done about this? It fills the log with FATAL messages because of overruns whenever even a minuscule percentage of one's nodes is unreliable.

shallot commented Nov 17, 2021

I suppose it needs to be said explicitly: Munin/Master/ProcessManager.pm containing hardcoded numbers for the three *timeout variables is clearly wrong given the configurability of update_rate. I don't see any reason not to fix these to at least be derived from update_rate, if not made properly configurable themselves.
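
For illustration, the derivation being asked for could look roughly like this; the variable names, the $config hash, and the ratios are assumptions, not what ProcessManager.pm actually contains:

    # Derive the process-manager timeouts from update_rate instead of
    # hardcoding them; names and ratios here are illustrative only.
    my $update_rate    = $config->{update_rate} // 300;  # Munin's default is 300s
    my $worker_timeout = int($update_rate * 0.8);        # headroom for reaping workers
    my $accept_timeout = int($update_rate * 0.9);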

This code seems to have been replaced completely in master in commit 5584ba3 in 2016, but that's still not released.

It was made configurable in commit 4624c9e in 2020, so I'll probably need the Debian 11 version to test this :)

niclan (Contributor) commented Nov 17, 2021

ssh -oConnectTimeout=55 ?

shallot commented Nov 17, 2021

> ssh -oConnectTimeout=55 ?

That's the initial connect timeout, not the session duration timeout.
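
For stalls after the session is established, ssh's keepalive options may be closer to what's needed. This is a suggestion rather than something verified in this thread, and it only catches an unresponsive peer, not one that is alive but merely slow:

    ssh -oServerAliveInterval=15 -oServerAliveCountMax=3 user@node

With these values ssh drops the connection after roughly 45 seconds without any response from the server (user@node is a placeholder).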

shallot commented Nov 29, 2021

So I've been testing timeout_fetch_all_nodes 50 and timeout_fetch_one_node 40, but it's just not working well. Granted, it's no longer hardcoded, but there are other circumstances in these code paths that obstruct it. Some observations:

The cron job is set to start on the dot, but the first connection to a client only begins about 10 seconds later. I'm guessing this is bootstrap time, affected by this:

    % wc -c /var/lib/munin/*.storable | grep total
    129838443 total

Still, 10 seconds to parse 130 MB seems slow.
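
As a rough way to check whether deserialization itself accounts for those 10 seconds, here is a small sketch (the glob pattern comes from the command above; the timing harness is mine):

    use Time::HiRes qw(time);
    use Storable qw(retrieve);

    # Time how long each Munin state file takes to load.
    for my $file (glob '/var/lib/munin/*.storable') {
        my $t0 = time;
        retrieve($file);
        printf "%s: %.2fs\n", $file, time - $t0;
    }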

Then, the max_processes setting naturally stalls out the processing. I doubled it from 16 to 32 to try to improve this, but it's still using only something like 2 vCPUs total and 5000 IOPS, without substantially improving throughput.

Then, the final time between the end of talking to the last worker and the master update process unlocking is also several seconds, so situations like these happen regularly:

    2021/11/29 09:43:44 [INFO]: Munin-update finished for node adblockplus.org;filter-in-151.adblockplus.org (27.95 sec)
    2021/11/29 09:43:54 [INFO] Remaining workers: adblockplus.org;filter-in-8.adblockplus.org, uplink.eyeo.it;filterlist-crumbs-org-1.uplink.eyeo.it
    2021/11/29 09:43:54 [INFO] Reaping Munin::Master::UpdateWorker<adblockplus.org;filter-in-151.adblockplus.org>. Exit value/signal: 0/0
    2021/11/29 09:44:01 [FATAL ERROR] Lock already exists: /var/run/munin/munin-update.lock. Dying.
    2021/11/29 09:44:01 at /usr/share/perl5/Munin/Master/Update.pm line 127.
    2021/11/29 09:44:02 [INFO]: Munin-update finished (60.68 sec)
    2021/11/29 09:45:01 [INFO]: Starting munin-update

steveschnepp (Member) commented

Munin 2.0 does indeed hit a time wall beyond about 100 nodes.

You might be able to push it to 150, but painfully.

The 2.999 branch should scale much better, provided a pgsql backend is used.

But that means the nodes are async, to keep the queries as quick as possible instead of waiting for every plugin to execute.

shallot pushed a commit to shallot/ansible-playbooks that referenced this issue Mar 31, 2023
This sets up a Munin master server with a faster update rate
(from the default 5 minutes to 1 minute)

This in turn requires a number of changes, the most important being on
the munin-async clients; if the clients aren't async, we'd have to patch
/usr/share/perl5/Munin/Master/ProcessManager.pm to reduce worker_timeout
and accept_timeout, per munin-monitoring/munin#617,
and even then it would still remain pretty unreliable

The graphs are set up to render through a CGI daemon, which required setting
up systemd units for the CGI graphing service and a socket (a sketch follows
after this commit message)

The HTML files are left to the static generation cron job

Some basic limits email notification setup is done as well, along with basic
Apache configuration and an example:changeme htpasswd file for completeness

A basic CI job is added as well, though it doesn't really test whether
munin-update or munin-cgi-graph actually work

A custom command is added to avoid the bootstrap race between the
munin-cgi-graph service needing the datafile and the cron job running
munin-update to write it

The Munin server test VM gets the munin-node configuration as well, because
the server defaults to wanting to connect to itself for monitoring
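
For reference, the socket activation mentioned in the commit message could look roughly like the following under systemd; the unit names, paths, and users are assumptions for illustration, not the playbook's actual contents:

    # munin-cgi-graph.socket (sketch; paths and users assumed)
    [Unit]
    Description=Munin CGI graph FastCGI socket

    [Socket]
    ListenStream=/run/munin/fcgi-graph.sock
    SocketUser=www-data

    [Install]
    WantedBy=sockets.target

    # munin-cgi-graph.service (sketch)
    [Unit]
    Description=Munin CGI graph FastCGI service
    Requires=munin-cgi-graph.socket

    [Service]
    # Debian typically ships the CGI wrapper here; adjust if different
    ExecStart=/usr/lib/munin/cgi/munin-cgi-graph
    User=munin
    StandardInput=socket

FastCGI expects its listening socket on file descriptor 0, which StandardInput=socket provides for a socket-activated service.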
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment