Icinga2.service state is reloading in systemd after safe-reload until systemd time-out #6753
Comments
What do you mean by that? How do you reload/restart the Icinga daemon?
We run the safe-reload script, which is mapped in systemd to the ExecReload command of icinga2.service.
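For illustration, a minimal sketch of how that mapping typically looks in the unit file; the path below is an assumption, not copied from this system's packaged unit:

```ini
# Sketch of the relevant part of icinga2.service; safe-reload validates
# the config and then signals the running daemon (see the logs below).
[Service]
ExecReload=/usr/lib/icinga2/safe-reload $ICINGA2_CONFIG_FILE
```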
I was able to reproduce this in the Vagrant standalone box with the v2.10.1-1 stable packages and high load on the system. Under normal load I was not able to reproduce it, though.
I guess this has something to do with the IDO load, since I noticed that it happens when the WorkQueue for the IDO connection begins to grow.
As a next step I'll try to reproduce this with the snapshot packages.
I would first analyse why the IDO queries are not being processed fast enough, since the pending queries block the final shutdown request.
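To illustrate why a backlog blocks the shutdown, here is a minimal sketch in the spirit of the problem (a hypothetical WorkQueue, not Icinga's actual class): Stop() joins the worker only after every queued item has run, so a slow consumer delays the reload by exactly the length of the backlog.

```cpp
// Minimal sketch (hypothetical, not Icinga's actual WorkQueue) of a
// shutdown path that blocks until a backlogged queue has drained.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class WorkQueue {
public:
    WorkQueue() : m_Worker([this] { Run(); }) {}

    void Enqueue(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(m_Mutex);
        m_Tasks.push(std::move(task));
        m_CV.notify_all();
    }

    // Stop() returns only after every pending task has run: with a
    // large IDO backlog this is exactly where the reload hangs.
    void Stop() {
        {
            std::lock_guard<std::mutex> lock(m_Mutex);
            m_Stopped = true;
        }
        m_CV.notify_all();
        m_Worker.join();
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(m_Mutex);
        for (;;) {
            m_CV.wait(lock, [this] { return m_Stopped || !m_Tasks.empty(); });
            if (m_Tasks.empty())
                return;  // only reachable once m_Stopped is set and the queue is drained
            auto task = std::move(m_Tasks.front());
            m_Tasks.pop();
            lock.unlock();
            task();  // a slow consumer (e.g. the IDO connection) stalls here
            lock.lock();
        }
    }

    std::mutex m_Mutex;
    std::condition_variable m_CV;
    std::queue<std::function<void()>> m_Tasks;
    bool m_Stopped = false;
    std::thread m_Worker;  // declared last so the other members exist before the thread starts
};

int main() {
    WorkQueue q;
    for (int i = 0; i < 1000; ++i)
        q.Enqueue([] { /* simulate a slow IDO query */ });
    q.Stop();  // blocks until all 1000 tasks have been processed
}
```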
I tried to analyse this a bit further.

Analysis

In the log we can see that the shutdown halts while stopping the components.
While running the daemon in the foreground I observed log entries that never make it into the log file. This is possibly caused by the FileLogger being stopped before all other components are fully stopped. Let's have a look with gdb to find out what is going on during the reload process and why it never finishes.
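For reference, the kind of gdb session used for this analysis; the PID is a placeholder, and the commands are standard gdb:

```
gdb -p <icinga2-pid>
(gdb) set pagination off
(gdb) thread apply all bt    # dump a backtrace of every thread
(gdb) info threads           # list all threads with their current frames
```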
You can see that thread 1 is handling the SIGUSR2 signal, which is used to hand over to the new process. It is currently trying to stop the GraphiteWriter and waits for it to finish. Let's look at thread 55, which is the thread handling the GraphiteWriter: it is currently waiting for a lock.
Next, let's find out which thread holds the lock that thread 55 is waiting on.
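One way to do that, assuming the lock in the backtrace is ultimately a glibc pthread mutex: its __data.__owner field stores the LWP id of the holding thread, which can be matched against the output of info threads. The address below is a placeholder taken from the blocking frame:

```
(gdb) thread 55
(gdb) bt                     # note the pthread_mutex_t* in the blocking frame
(gdb) print ((pthread_mutex_t *) 0xADDR)->__data.__owner
# prints the LWP id of the owning thread; compare with "info threads"
```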
Thread 55 waits for a lock that is held by thread 1, so we have a deadlock.

Possible solution

Do not lock the whole GraphiteWriter object in SendMetric(); lock only the stream instead:

diff --git a/lib/perfdata/graphitewriter.cpp b/lib/perfdata/graphitewriter.cpp
index a01f6c48b..1b6461d51 100644
--- a/lib/perfdata/graphitewriter.cpp
+++ b/lib/perfdata/graphitewriter.cpp
@@ -299,7 +299,7 @@ void GraphiteWriter::SendMetric(const String& prefix, const String& name, double
msgbuf << "\n";
String metric = msgbuf.str();
- ObjectLock olock(this);
+ boost::mutex::scoped_lock lock(m_StreamMutex);
if (!GetConnected())
return;
diff --git a/lib/perfdata/graphitewriter.hpp b/lib/perfdata/graphitewriter.hpp
index 6934d13b3..ec4c5bbfe 100644
--- a/lib/perfdata/graphitewriter.hpp
+++ b/lib/perfdata/graphitewriter.hpp
@@ -27,6 +27,7 @@
#include "base/timer.hpp"
#include "base/workqueue.hpp"
#include <fstream>
+#include <boost/thread/mutex.hpp>
namespace icinga
{
@@ -54,6 +55,7 @@ protected:
private:
Stream::Ptr m_Stream;
+ boost::mutex m_StreamMutex;
WorkQueue m_WorkQueue{10000000, 1};
Timer::Ptr m_ReconnectTimer;

This change removes the lock on the whole GraphiteWriter object and instead locks only the stream that actually sends the metrics. Before putting this into a PR I'll leave it here for further feedback and to get a clear picture of the problem.
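To make the deadlock pattern concrete, here is a reduced, self-contained sketch (hypothetical names, not Icinga code): the stopping thread holds the object-wide lock while joining the worker, and the worker needs that same lock to finish its last write.

```cpp
// Reduced reproduction of the reload deadlock (hypothetical names):
// Stop() takes the object-wide mutex and then joins the worker, while
// the worker's SendMetric() blocks on that same mutex.
#include <chrono>
#include <mutex>
#include <thread>

struct Writer {
    std::mutex objectMutex;  // stands in for the object-wide lock
    std::thread worker;

    void SendMetric() {
        std::lock_guard<std::mutex> lock(objectMutex);  // blocks: the stopping thread holds it
        // ... write the metric to the stream ...
    }

    void Start() {
        worker = std::thread([this] {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            SendMetric();
        });
    }

    void Stop() {
        std::lock_guard<std::mutex> lock(objectMutex);  // "thread 1" takes the lock
        worker.join();  // waits for a worker that is waiting for us: deadlock
    }
};

int main() {
    Writer w;
    w.Start();
    w.Stop();  // never returns; the process hangs just like the reload does
}
```

The proposed diff breaks exactly this cycle: the worker no longer needs the object-wide lock, only the narrower stream mutex.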
Further analysis: the bug was introduced, or at least made visible, by af6c8a1. Reverting that change resolves the bug immediately.
I am pretty sure this happens not only in setups with many hosts/services, but also in setups with only a few. It is just more noticeable with many hosts/services, since the GraphiteWriter writes more performance data and its thread is therefore busy sending metrics more often. Reverting af6c8a1 is not really an option here, so I'll prepare a PR for the proposed fix.
Thank you all for the quick debugging and fix, it works like a charm. :)
After an Icinga2 safe-reload, icinga2.service is killed by systemd after 30 minutes; a new process then starts immediately without any issue. During that time the new icinga2 process runs without writing anything to icinga2.log, and alerts are sent for services/hosts that are in downtime.
Systemd reports the icinga2 process as reloading.
We started observing this issue after upgrading from 2.9.2 to 2.10.1.
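To confirm what systemd thinks the unit is doing, the state can be queried directly with standard systemctl commands:

```
systemctl is-active icinga2.service                      # prints "reloading" while stuck
systemctl show icinga2.service -p ActiveState,SubState   # the raw state properties
```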
Logs - Systemd
Nov 05 13:22:27 icinga2-master01 safe-reload[22235]: Validating config files: Done
Nov 05 13:22:27 icinga2-master01 safe-reload[22235]: Reloading Icinga 2: Done
Nov 05 13:23:58 icinga2-master01 systemd[1]: icinga2.service: Supervising process 22938 which is not our child. We'll most likely not notice when it exits.
Nov 05 13:53:58 icinga2-master01 systemd[1]: icinga2.service stop-sigterm timed out. Killing.
Nov 05 13:53:58 icinga2-master01 systemd[1]: Unit icinga2.service entered failed state.
Nov 05 13:53:58 icinga2-master01 systemd[1]: icinga2.service failed.
Nov 05 13:53:59 icinga2-master01 systemd[1]: icinga2.service holdoff time over, scheduling restart.
Nov 05 13:53:59 icinga2-master01 systemd[1]: Starting Icinga host/service/network monitoring system...
Logs - Icinga2
Service status in systemd
Expected Behavior
The Icinga2 safe-reload completes in a short time, icinga2 writes to its log file, and no alerts are sent during the reload.
Current Behavior
After an icinga2 safe-reload a new icinga2 process is started, but nothing is written to icinga2.log and systemd kills icinga2.service after 30 minutes. During that period alerts are sent for services/hosts in downtime. The next icinga2 process started by systemd works as expected.
Possible Solution
Please help us resolve this strange race condition between icinga2 safe-reload and systemd.
Steps to Reproduce (for bugs)
It fails once or twice per day; we are not able to reproduce it manually.
Context
While Icinga2 is in this weird state, alerts are sent for services/hosts whose downtimes were set long before.
Systemd reports icinga2.service as reloading.
Your Environment
Version used (icinga2 --version):
Enabled features (icinga2 feature list):
Config validation (icinga2 daemon -C):