IcingaDB HA crashes every night after upgrade to 1.2.0 #742
Comments
Thanks for coming forward with this issue. Based on the information you provided and the stack trace, it seems like your Redis instance is unavailable or at least unhealthy. The error "runtime-updates: Can't execute Redis pipeline" (source) indicates this. One change between v1.1.1 and v1.2.0 was 81085c0, which effectively no longer keeps retrying a failed HA but gives up after five minutes. So this error may already have been there before upgrading Icinga DB to v1.2.0 and was simply invisible. Speaking of visibility, what is the other Icinga DB node doing? Could you please provide more extensive logs extending further into the past, from both Icinga DB nodes, including their Redis? Please raise the logging level - all components included - to As there is additional information being logged as fields when using the What exactly does Bareos do at this moment? Does it interact with the Redis instance or reconfigure the network? Is something else altering the system state at this moment? |
Hello, I edited the config.yml of Icinga DB to set logging to "debug":
Is that correct? |
That's unfortunate. As it happens on both nodes, I would suspect a database-related issue. Since v1.2.0 (779afd1), additional information is attached to the "Handing over" message that appears in your posted log. There should also be fields present for the "runtime-updates: Can't execute Redis pipeline" line.
Okay. How does your database setup look? Is it a MySQL/MariaDB or PostgreSQL? Is it a single node, a federation or a cluster? Are there suspicious logging entries around the same time?
This looks good!
Looks also good. However, you can remove the whole |
I'll attach the JSON output of both nodes of the last 72 hours to the ticket |
Thanks for providing those logs. However, it seems like the logging level is too silent. A small inspection with
only reveals |
Please excuse the second post, but could you include all available log levels/priorities in your output? I misread the
Please set |
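To check whether an exported log really contains all priorities, a quick count per level can help. The following is only a minimal sketch and not the command used in this thread: it assumes the logs were exported with `journalctl -o json` (one JSON object per line, with the syslog level in the `PRIORITY` field), and the helper name `count_priorities` is hypothetical.

```python
import json
from collections import Counter

# Syslog priority numbers as used by journald's PRIORITY field.
PRIORITY_NAMES = {
    0: "emerg", 1: "alert", 2: "crit", 3: "err",
    4: "warning", 5: "notice", 6: "info", 7: "debug",
}


def count_priorities(lines):
    """Count journal entries per priority name (hypothetical helper).

    `lines` is any iterable of strings, each holding one JSON object as
    produced by `journalctl -o json` (an assumption about the export).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        entry = json.loads(line)
        try:
            name = PRIORITY_NAMES[int(entry.get("PRIORITY"))]
        except (TypeError, ValueError, KeyError):
            name = "unknown"  # field missing or not a known level
        counts[name] += 1
    return counts
```

If the result contains no "debug" entries, the export was probably filtered or the logging level was not raised on that node.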
No problem... |
Thanks a lot. Regardless of the size, I would like to inspect tomorrow's logs after the crash. Maybe you can reduce them to roughly 30 minutes before the crash. |
Good morning. I have great news.. There was no outage tonight. Here are the logs and the script: Edit: |
Thanks for your detailed report, your logs and your script. Based on your experiment, I would guess that your backup script is taking longer than the magic five minutes for which Icinga DB now retries every database error since the latest v1.2.0 release. When it reaches this limit and there is still a LOCK from the Earlier you wrote that you have configured your database to "not lock". How did you do this? Could you try configuring your mysqldump command to not LOCK or use transactions as described in this StackOverflow thread? If everything else fails, you might want to consider stopping Icinga DB for the duration of your backup and restarting it afterwards. |
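The behaviour described above, retrying a failing operation but giving up once a fixed window has passed, can be illustrated with a small sketch. This is not Icinga DB's actual implementation (Icinga DB is written in Go); the names `retry_until` and `RetryTimeout`, the one-second delay and the injectable clock are assumptions made for illustration. Only the five-minute window comes from the discussion.

```python
import time


class RetryTimeout(Exception):
    """Raised when an operation keeps failing past the retry window."""


def retry_until(operation, timeout_s=300.0, delay_s=1.0,
                clock=time.monotonic, sleep=time.sleep):
    """Call operation() until it succeeds or timeout_s has elapsed.

    Illustrative only: a table lock held longer than timeout_s makes
    every attempt fail, so the caller finally gives up with an error.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        try:
            return operation()
        except Exception as err:
            if clock() >= deadline:
                # Window exhausted: stop retrying and surface the error.
                raise RetryTimeout(
                    f"giving up after {attempts} attempts") from err
            sleep(delay_s)
```

With such a scheme, a backup that blocks the database for less than the window goes unnoticed, while one that blocks it for longer ends in a fatal error.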
It seems that mysqldump sets a lock by default; I didn't request one in my script. I have now created a dump with "--single-transaction=TRUE" and there was no outage. |
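For reference, a backup script along these lines could assemble the dump command as sketched below. This is an assumption-laden example, not the script from this thread: the database name `icingadb`, the output path and the helper names are hypothetical, and credentials are assumed to come from a MySQL option file. `--single-transaction` and `--result-file` are regular mysqldump options; the consistent snapshot only applies to transactional tables such as InnoDB.

```python
import subprocess


def build_dump_command(database, result_file):
    """Build a mysqldump argument list that dumps inside one transaction.

    --single-transaction takes a consistent snapshot instead of locking
    the tables for the duration of the dump.
    """
    return [
        "mysqldump",
        "--single-transaction",
        f"--result-file={result_file}",
        "--databases",
        database,
    ]


def run_dump(database, result_file):
    """Run the dump and raise CalledProcessError if mysqldump fails."""
    subprocess.run(build_dump_command(database, result_file), check=True)


if __name__ == "__main__":
    # Hypothetical database name and target path.
    run_dump("icingadb", "/var/backups/icingadb.sql")
```

Passing the arguments as a list avoids shell quoting problems, and `check=True` makes a failed dump visible to the surrounding backup job.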
I am glad to hear that. Unless you have an idea what to change, please feel free to close this issue. |
No - ICINGA is going great and I'm more than happy with it. Thanks again for your support, even though this should have gone to the forum. The one thing I didn't think of at first was the database dump |
Describe the bug
Every night, the IcingaDB HA crashes.
Expected behavior
No downtime like before the upgrade
Your Environment
Include as many relevant details about the environment you experienced the problem in
Additional context
We have an HA cluster with an external ICINGA Web UI server.
On Monday, I upgraded IcingaDB to 1.2.0. Everything is installed via the package manager.
After upgrading the packages, I imported the SQL file "1.2.0.sql" from one of the two ICINGA nodes. This was completed without any errors. Subsequently, the Icinga DB daemon was restarted on both nodes.
Icinga was running again and everything looked good.
On Tuesday morning, the Web UI said that "IcingaDB" no longer writes data.
I then stopped the "IcingaDB" and "IcingaDB-Redis" daemons on both nodes and restarted them. Unfortunately, the problem remains. Every night at about 03:15 a.m. the "IcingaDB daemon" crashes. In the journal I always find the same message:
The Bareos backups are running at this time. However, they had no effect on it before the upgrade.
Regards
Sascha