Please get rid of hard-coded timeouts #617

Open

madduck opened this issue Jan 11, 2016 · 10 comments

madduck commented Jan 11, 2016

tl;dr: if a node takes a long time to respond, hard-coded timeouts take precedence over the --timeout command-line option.

Details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=786997

shallot commented Feb 6, 2016

It looks as if IO::Socket::INET6's Timeout setting is worthless: when Munin can't connect to a dead node, /usr/share/perl5/Munin/Master/Node.pm never actually emits the message from this block:

    if (! $self->{reader} ) {
            ERROR "Failed to connect to node $self->{address}:$self->{port}/tcp : $!";
            return 0;
    }

It just doesn't seem to happen.
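
If the socket's Timeout option really isn't honored on these code paths, one defensive workaround is to impose a deadline from the outside with alarm(). This is only a sketch of the idea, not Munin's actual code; $address, $port, and the 10-second budget are placeholders:

    use IO::Socket::INET6;

    # Sketch only: impose our own connect deadline with alarm() instead of
    # relying solely on the socket's Timeout option. $address, $port and
    # the 10-second budget are placeholders.
    my $sock = eval {
        local $SIG{ALRM} = sub { die "connect timed out\n" };
        alarm(10);
        my $s = IO::Socket::INET6->new(
            PeerAddr => $address,
            PeerPort => $port,
            Proto    => 'tcp',
            Timeout  => 10,
        );
        alarm(0);
        $s;
    };
    alarm(0);    # make sure no alarm stays pending
    warn "Failed to connect to $address:$port/tcp: $@" unless $sock;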

shallot commented Feb 6, 2016

In any event, a defensive approach would definitely be to clamp the individual worker timeout to the global timeout. What would be the use case for allowing any worker to continue past the user-defined timeout?
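
Sketched in Perl, that clamp amounts to a one-liner; $worker_timeout and $global_timeout are illustrative names, not Munin's actual internals:

    use List::Util qw(min);

    # Never let a single worker's timeout exceed the user-configured
    # global timeout.
    my $effective_timeout = min($worker_timeout, $global_timeout);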

shallot commented Sep 8, 2017

It should also be noted that plugins/node.d/munin_stats and plugins/node.d/munin_update don't autodetect changes to update_rate or allow it to be configured, so after it is changed their warning and critical thresholds are wrong, as is their graph_info text.

shallot commented Jul 13, 2021

With munin-async this gets better: the master munin-update job seems to spend some time on the live nodes, and then, roughly 45 seconds in, mass-discards the various dead nodes, which then doesn't interfere with a 60s update_rate.

However, I'm still worried about the timings and will keep testing with more load.

shallot commented Nov 17, 2021

OK, so after some time I can say it's better, but still not actually working. Clients whose ssh connections stall for more than 60s are not actually timed out at the timeout set in the munin-update argument.

I tried adding this:

    ssh_command "timeout --foreground --preserve-status --verbose --signal=TERM --kill-after=59s 55s ssh"

Unfortunately that just breaks everything immediately.

Can something be done about this? It fills the log with FATAL messages because of overruns whenever even a minuscule percentage of one's nodes is unreliable.

shallot commented Nov 17, 2021

I suppose it needs to be said explicitly: Munin/Master/ProcessManager.pm containing hardcoded numbers for the three *timeout variables is clearly wrong given the configurability of update_rate. I don't see any reason not to fix these to at least be derived from update_rate, if not made properly configurable themselves.
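
For illustration, the derivation being asked for could look roughly like this; the variable names, the $config hash, and the ratios are assumptions, not what ProcessManager.pm actually contains:

    # Derive the process-manager timeouts from update_rate instead of
    # hardcoding them; names and ratios here are illustrative only.
    my $update_rate    = $config->{update_rate} // 300;  # Munin's default is 300s
    my $worker_timeout = int($update_rate * 0.8);        # headroom for reaping workers
    my $accept_timeout = int($update_rate * 0.9);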

This code seems to have been replaced completely in master in commit 5584ba3 in 2016, but that's still not released.

It was made configurable in commit 4624c9e in 2020, so I'll probably need the Debian 11 version to test this :)

niclan (Contributor) commented Nov 17, 2021

ssh -oConnectTimeout=55 ?

shallot commented Nov 17, 2021

> ssh -oConnectTimeout=55 ?

That's the initial connect timeout, not the session duration timeout.
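
For stalls after the session is established, ssh's keepalive options may be closer to what's needed. This is a suggestion rather than something verified in this thread, and it only catches an unresponsive peer, not one that is alive but merely slow:

    ssh -oServerAliveInterval=15 -oServerAliveCountMax=3 user@node

With these values ssh drops the connection after roughly 45 seconds without any response from the server (user@node is a placeholder).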

shallot commented Nov 29, 2021

So I've been testing timeout_fetch_all_nodes 50 and timeout_fetch_one_node 40, but it's just not working well. Granted, it's no longer hardcoded, but there are other circumstances in these code paths that obstruct it. Some observations:

The cron job is set to start on the dot, but the first connection to a client only begins about 10 seconds later. I'm guessing this is bootstrap time, affected by this:

    % wc -c /var/lib/munin/*.storable | grep total
    129838443 total

Still, 10 seconds to parse 130 MB seems slow.
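
As a rough way to check whether deserialization itself accounts for those 10 seconds, here is a small sketch (the glob pattern comes from the command above; the timing harness is mine):

    use Time::HiRes qw(time);
    use Storable qw(retrieve);

    # Time how long each Munin state file takes to load.
    for my $file (glob '/var/lib/munin/*.storable') {
        my $t0 = time;
        retrieve($file);
        printf "%s: %.2fs\n", $file, time - $t0;
    }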

Then, the max_processes setting naturally stalls out the processing. I doubled it from 16 to 32 to try to improve this, but it's still using only something like 2 vCPUs total and 5000 IOPS, without substantially improving throughput.

Then, the final time between the end of talking to the last worker and the master update process unlocking is also several seconds, so situations like these happen regularly:

    2021/11/29 09:43:44 [INFO]: Munin-update finished for node adblockplus.org;filter-in-151.adblockplus.org (27.95 sec)
    2021/11/29 09:43:54 [INFO] Remaining workers: adblockplus.org;filter-in-8.adblockplus.org, uplink.eyeo.it;filterlist-crumbs-org-1.uplink.eyeo.it
    2021/11/29 09:43:54 [INFO] Reaping Munin::Master::UpdateWorker<adblockplus.org;filter-in-151.adblockplus.org>. Exit value/signal: 0/0
    2021/11/29 09:44:01 [FATAL ERROR] Lock already exists: /var/run/munin/munin-update.lock. Dying.
    2021/11/29 09:44:01 at /usr/share/perl5/Munin/Master/Update.pm line 127.
    2021/11/29 09:44:02 [INFO]: Munin-update finished (60.68 sec)
    2021/11/29 09:45:01 [INFO]: Starting munin-update

steveschnepp (Member) commented

Munin 2.0 does indeed hit a time wall beyond about 100 nodes.

You might be able to push it to 150, but painfully.

The 2.999 branch should scale much better, provided a pgsql backend is used.

But that means the nodes are async, to keep the queries as quick as possible instead of waiting for every plugin to execute.

shallot pushed a commit to shallot/ansible-playbooks that referenced this issue Mar 31, 2023
This sets up a Munin master server with a faster update rate
(from the default 5 minutes to 1 minute)

This in turn requires a number of changes, the most important being on
the munin-async clients; if the clients aren't async, we'd have to patch
/usr/share/perl5/Munin/Master/ProcessManager.pm to reduce worker_timeout
and accept_timeout, per munin-monitoring/munin#617,
and even then it would still remain pretty unreliable

The graphs are set up to render through a CGI daemon, which required setting
up systemd units for the CGI graphing service and a socket (a sketch follows
after this commit message)

The HTML files are left to the static generation cron job

Some basic limits email notification setup is done as well, along with basic
Apache configuration and an example:changeme htpasswd file for completeness

A basic CI job is added as well, though it doesn't really test whether
munin-update or munin-cgi-graph actually work

A custom command is added to avoid the bootstrap race between the
munin-cgi-graph service needing the datafile and the cron job running
munin-update to write it

The Munin server test VM gets the munin-node configuration as well, because
the server defaults to wanting to connect to itself for monitoring
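
For reference, the socket activation mentioned in the commit message could look roughly like the following under systemd; the unit names, paths, and users are assumptions for illustration, not the playbook's actual contents:

    # munin-cgi-graph.socket (sketch; paths and users assumed)
    [Unit]
    Description=Munin CGI graph FastCGI socket

    [Socket]
    ListenStream=/run/munin/fcgi-graph.sock
    SocketUser=www-data

    [Install]
    WantedBy=sockets.target

    # munin-cgi-graph.service (sketch)
    [Unit]
    Description=Munin CGI graph FastCGI service
    Requires=munin-cgi-graph.socket

    [Service]
    # Debian typically ships the CGI wrapper here; adjust if different
    ExecStart=/usr/lib/munin/cgi/munin-cgi-graph
    User=munin
    StandardInput=socket

FastCGI expects its listening socket on file descriptor 0, which StandardInput=socket provides for a socket-activated service.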
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment