Please get rid of hard-coded timeouts #617
It looks as if IO::Socket::INET6's Timeout setting is worthless: if Munin can't connect to a dead node, /usr/share/perl5/Munin/Master/Node.pm never actually emits the expected timeout message. It just doesn't seem to happen.
In any event, a defensive approach would definitely be to clamp each individual worker timeout to the global timeout. What would be the use case for allowing any worker to continue past the user-defined timeout?
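The clamping suggested above can be sketched in a few lines. This is a minimal illustration, not Munin's actual code; the variable names are assumptions:

```shell
# Hypothetical clamp: a per-worker timeout must never exceed the
# global (user-defined) timeout, so take the minimum of the two.
global_timeout=60
worker_timeout=90
effective=$(( worker_timeout < global_timeout ? worker_timeout : global_timeout ))
echo "effective timeout: ${effective}s"
```

With these example values the worker's 90 s request is clamped down to the global 60 s.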
With munin-async this gets better: the master munin-update job spends some time on the live nodes, then roughly 45 seconds in it mass-discards the various dead nodes, which then no longer interferes with a 60 s update_rate. However, I'm still worried about the timings and will keep testing with more load.
OK, so after some more time I can say it's better, but still not actually working. Clients whose ssh connections stall for more than 60 s are not timed out after the timeout set in the munin-update argument. I tried adding this:

ssh_command "timeout --foreground --preserve-status --verbose --signal=TERM --kill-after=59s 55s ssh"

Unfortunately that just breaks everything immediately. Can something be done about this? It fills the log with FATAL messages from overruns when even a minuscule percentage of one's nodes are unreliable.
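For reference, the basic semantics of the GNU coreutils timeout wrapper used above can be demonstrated in isolation (without the ssh_command integration, which is where the breakage occurs):

```shell
# `timeout DURATION COMMAND` kills COMMAND after DURATION.
# Here a 5-second sleep is cut off after 1 second; the exit status 124
# indicates the timeout fired.
timeout 1 sleep 5
echo "exit status: $?"
```

Note that with --preserve-status, as in the ssh_command line above, the wrapper instead reports the wrapped command's own termination status (128 + signal number for a SIGTERM kill), which may matter for how munin-update interprets the worker's exit.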
I suppose it needs to be said explicitly: Munin/Master/ProcessManager.pm containing hard-coded numbers for the three *timeout variables is clearly wrong given the configurability of update_rate. I see no reason not to make these at least derived from update_rate, if not properly configurable themselves. That code seems to have been replaced completely on master in commit 5584ba3 in 2016, but that is still unreleased. It was made configurable in commit 4624c9e in 2020; I probably need the Debian 11 version to test this :)
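Deriving the timeouts from update_rate, as suggested, could look roughly like this. The scaling factors here are purely illustrative assumptions (chosen to match the 50 s / 40 s values tested later in this thread), not what Munin actually does:

```shell
# Hypothetical derivation of per-run timeouts from update_rate (seconds).
# The 5/6 and 2/3 factors are assumptions for illustration only.
update_rate=60
worker_timeout=$(( update_rate * 5 / 6 ))        # leave headroom before the next run
timeout_fetch_one_node=$(( update_rate * 2 / 3 )) # per-node budget within that
echo "worker_timeout=$worker_timeout timeout_fetch_one_node=$timeout_fetch_one_node"
```

The point is only that every timeout stays strictly below update_rate, so one slow node cannot push a run past the start of the next one.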
ssh -oConnectTimeout=55 ? |
That's the initial connect timeout, not the session duration timeout |
So I've been testing timeout_fetch_all_nodes 50 and timeout_fetch_one_node 40, but it's just not working well. Granted, nothing is hard-coded any more, but other circumstances in these code paths obstruct it. Some observations:

The cron job is set to start on the dot, but the first connection to a client only begins 10 seconds later. I'm guessing this is bootstrap time, affected by the size of the state files: % wc -c /var/lib/munin/*.storable | grep total. Still, 10 seconds to parse 130 MB seems slow.

Then, the max_processes setting naturally stalls the processing. I doubled it from 16 to 32 to try to improve this, but it's still using only about 2 vCPUs total and 5000 IOPS, without substantially improving throughput.

Finally, several seconds pass between finishing with the last worker and the master update process unlocking, so situations like this happen regularly: 2021/11/29 09:43:44 [INFO]: Munin-update finished for node adblockplus.org;filter-in-151.adblockplus.org (27.95 sec)
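For anyone reproducing this, the munin.conf knobs under discussion would be set roughly as follows (a sketch of the configuration described in this thread, assuming a Munin build that already carries the 4624c9e configurability change):

```
# /etc/munin/munin.conf (fragment) -- values from the test above
update_rate             60
timeout_fetch_all_nodes 50
timeout_fetch_one_node  40
max_processes           32
```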
Munin 2.0 does indeed hit a time wall at around 100+ nodes. You might be able to push it to 150, but painfully. The 2.999 branch should scale much better, provided a pgsql backend. But that assumes the nodes are async, to make the queries as quick as possible rather than waiting for every plugin to execute.
This sets up a Munin master server with a faster update rate (from the default 5 minutes to 1 minute). This in turn requires a number of changes, the most important being on the munin-async clients; if the clients aren't async, we'd have to patch /usr/share/perl5/Munin/Master/ProcessManager.pm to reduce worker_timeout and accept_timeout, per munin-monitoring/munin#617, and even then it would remain pretty unreliable.

The graphs are set up to render through a CGI daemon, which required setting up systemd units for the CGI graphing service and a socket. The HTML files are left to the static-generation cron job. Some basic limits email notification setup is done as well, plus basic Apache configuration and an example:changeme htpasswd file for completeness. A basic CI job is added too, though it doesn't really test whether munin-update or munin-cgi-graph actually work. A custom command is added to avoid the bootstrap race between the munin-cgi-graph service needing the datafile and the cron job running munin-update to write it. The Munin server test VM gets the munin-node configuration as well, because the server defaults to wanting to connect to itself to monitor.
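The socket-activated CGI graphing service mentioned above could be wired up along these lines. The unit names, paths, and user are assumptions based on a typical Debian layout, not the exact units from this setup:

```
# /etc/systemd/system/munin-cgi-graph.socket (hypothetical)
[Socket]
ListenStream=/run/munin/cgi-graph.sock
SocketUser=www-data

[Install]
WantedBy=sockets.target

# /etc/systemd/system/munin-cgi-graph.service (hypothetical)
[Service]
User=munin
ExecStart=/usr/lib/munin/cgi/munin-cgi-graph
StandardInput=socket
```

Apache (or another frontend) would then proxy graph requests to the listed socket, and systemd starts the service on first access.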
tl;dr: if a node takes a long time, then hard-coded timeouts take precedence over the --timeout command line option. Details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=786997