
IcingaDB HA crashes every night after upgrade to 1.2.0 #742

Closed
TheCry opened this issue Apr 18, 2024 · 13 comments

Comments

@TheCry

TheCry commented Apr 18, 2024

Describe the bug

Every night the IcingaDB HA crashes.

Expected behavior

No downtime, as it was before the upgrade.

Your Environment

Include as many relevant details about the environment you experienced the problem in

  • Icinga DB version: 1.2.0
  • Icinga DB Web: 1.1.2
  • Icinga 2 version: 2.14.2
  • Operating System and version: Debian 11

Additional context

We have an HA cluster with an external Icinga Web UI server.
On Monday I upgraded IcingaDB to 1.2.0. Everything is installed via the package manager.
After upgrading the packages, I imported the SQL file "1.2.0.sql" from one of the two Icinga nodes. This completed without any errors. Afterwards, the Icinga DB daemon was restarted on both nodes.
Icinga was running again and everything looked good.
On Tuesday morning, the Web UI reported that "IcingaDB" was no longer writing data.
I have since stopped and restarted the "IcingaDB" and "IcingaDB-Redis" daemons on both nodes. Unfortunately, the problem remains: every night at about 03:15 a.m. the "IcingaDB daemon" crashes. In the journal I always find the same message:

  Apr 17 03:06:09 Node1 icingadb[1952232]: Handing over
Apr 17 03:06:10 Node1 icingadb[1952232]: runtime-updates: Can't execute Redis pipeline
Apr 17 03:13:00 Node1 icingadb[1952232]: retry deadline exceeded
                                            github.com/icinga/icingadb/pkg/icingadb.(*HA).controller
                                                    github.com/icinga/icingadb/pkg/icingadb/ha.go:163
                                            runtime.goexit
                                                    runtime/asm_amd64.s:1695
                                            HA aborted
                                            github.com/icinga/icingadb/pkg/icingadb.(*HA).abort.func1
                                                    github.com/icinga/icingadb/pkg/icingadb/ha.go:131
                                            sync.(*Once).doSlow
                                                    sync/once.go:74
                                            sync.(*Once).Do
                                                    sync/once.go:65
                                            github.com/icinga/icingadb/pkg/icingadb.(*HA).abort
                                                    github.com/icinga/icingadb/pkg/icingadb/ha.go:129
                                            github.com/icinga/icingadb/pkg/icingadb.(*HA).controller
                                                    github.com/icinga/icingadb/pkg/icingadb/ha.go:163
                                            runtime.goexit
                                                    runtime/asm_amd64.s:1695
                                            HA exited with an error
                                            main.run
                                                    github.com/icinga/icingadb/cmd/icingadb/main.go:335
                                            main.main
                                                    github.com/icinga/icingadb/cmd/icingadb/main.go:37
                                            runtime.main
                                                    runtime/proc.go:271
                                            runtime.goexit
                                                    runtime/asm_amd64.s:1695

At this time, the Bareos backups are running. However, they never caused any problems before the upgrade.
Regards
Sascha

@oxzi oxzi self-assigned this Apr 18, 2024
@oxzi oxzi added the bug label Apr 18, 2024
@oxzi
Member

oxzi commented Apr 18, 2024

Thanks for coming forward with this issue. Based on the information you provided and the stack trace, it seems like your Redis instance is unavailable or at least unhealthy. The error "runtime-updates: Can't execute Redis pipeline" (source) indicates this.

One change between v1.1.1 and v1.2.0 was 81085c0, which effectively stops retrying a failed HA indefinitely and gives up after five minutes. Thus, this error might already have been present before upgrading Icinga DB to v1.2.0, just invisible.

Speaking of visibility, what is the other Icinga DB node doing? Could you please provide more extensive logs reaching further into the past, from both Icinga DB nodes including their Redis instances? Please raise the logging level - for all components - to debug and restart the Icinga DB daemons before the next expected crash.

As additional information is logged as structured fields when using the systemd-journald output, you might want to include those in your journalctl output format, e.g., by using journalctl -o json.
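For example, something along these lines should surface them (just a sketch; the selected jq fields are illustrative, ICINGADB_ERROR being one of the structured fields Icinga DB attaches):

  # print timestamp, message and the attached error field for each journal entry
  journalctl --unit icingadb.service --output json \
      | jq '{time: .__REALTIME_TIMESTAMP, message: .MESSAGE, error: .ICINGADB_ERROR}'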

What exactly does Bareos do at this moment? Does it interact with the Redis instance or reconfigure the network? Is something else altering the system state at that moment?

@oxzi oxzi removed their assignment Apr 18, 2024
@TheCry
Author

TheCry commented Apr 18, 2024

Hello,
the other node had exactly the same error message this morning. Bareos only performs a database backup; the database is not set to "lock". For tonight I have disabled the backups of the 3 servers to make sure it has nothing to do with them.
No - Bareos does not interact with Redis. There is also no error message from icingadb-redis in the journal.
Is a JSON log produced with this call sufficient?
journalctl --boot --reverse --priority=emerg..err --since=-72h --unit="icingadb.service" -o json

I've edited the config.yml of IcingaDB to set logging to "debug":

logging:
  level: debug

  options:
    config-sync:
    database:
    dump-signals:
    heartbeat:
    high-availability:
    history-sync:
    overdue-sync:
    redis:
    retention:
    runtime-updates:
    telemetry:

Is that correct?

@oxzi
Member

oxzi commented Apr 18, 2024

Hello, the other node had exactly the same error message this morning. Bareos only performs a database backup; the database is not set to "lock".

That's unfortunate.

As it happens on both nodes, I would suspect some database-related issue.

Since v1.2.0 (779afd1), additional information is attached to the "Handing over" message appearing in your posted log. There should also be fields present for the "runtime-updates: Can't execute Redis pipeline" line.
If your log has not been rotated yet, could you post your logs with --output json, especially around those lines?
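Something like the following might already narrow it down (just a sketch; adjust unit name and time range as needed):

  # last three days of Icinga DB logs as JSON, reduced to the two interesting messages
  journalctl --unit icingadb.service --since=-72h --output json \
      | grep -E "Handing over|Can't execute Redis pipeline"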

There is also no error message from icingadb-redis in the journal.

Okay. What does your database setup look like? Is it MySQL/MariaDB or PostgreSQL? Is it a single node, a federation, or a cluster? Are there suspicious log entries around the same time?

Is a JSON log produced with this call sufficient? journalctl --boot --reverse --priority=emerg..err --since=-72h --unit="icingadb.service" -o json

This looks good!

I'd edit the config.yml of IcingaDB to logging "debug":

That looks good as well. However, you can remove the whole options block: since every component entry is empty, they all inherit the default logging level from above.

@TheCry
Author

TheCry commented Apr 18, 2024

I don't think so, unless it is something related to IcingaDB's database itself. As written, everything was stable until the upgrade.
IcingaDB's database runs on the Icinga Web server.
So it is a remote database for the 2 nodes in the HA cluster. It's a MariaDB (10.11.7).
There are no abnormalities there.
I've now adjusted the config.yml and restarted the IcingaDB daemon.

I'll attach the JSON output of both nodes for the last 72 hours to the ticket:
[error_icingadb_Node1.log](https://github.com/Icinga/icingadb/files/15023144/error_icingadb_Node1.log)

error_icingadb_Node2.log

@oxzi
Member

oxzi commented Apr 18, 2024

Thanks for providing those logs. However, it seems like the logging level was not verbose enough.

A small inspection with

jq -s 'sort_by(.__REALTIME_TIMESTAMP | tonumber) | .[] | [._HOSTNAME, (.__REALTIME_TIMESTAMP | tonumber | . / 1000000 | strftime("%x %X %Z")), .MESSAGE, .ICINGADB_ERROR]' < error_icingadb_Node?.log

only reveals "context canceled" errors, but it misses why the context was canceled. Thus, we have to wait for the next error and have a look at the debug logs then.

@oxzi
Member

oxzi commented Apr 18, 2024

Please excuse the second post, but could you include all available log levels/priorities in your output? I misread the --priority flag in your prior post.

Is a JSON log produced with this call sufficient? journalctl --boot --reverse --priority=emerg..err --since=-72h --unit="icingadb.service" -o json

Please set --priority 0..7 or --priority emerg..debug.
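For example, your earlier command adjusted accordingly:

  journalctl --boot --reverse --priority=emerg..debug --since=-72h --unit="icingadb.service" -o json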

@TheCry
Author

TheCry commented Apr 18, 2024

No problem...
The logs will be huge if I keep the debug level enabled until tomorrow. I have attached the new logs ("--priority emerg..debug"):
error_icingadb_Node1_and_2.zip

@oxzi
Member

oxzi commented Apr 18, 2024

Thanks a lot. Regardless of the size, I would like to inspect tomorrow's logs after the crash. Maybe you can reduce them to roughly 30 minutes before the crash.

@TheCry
Author

TheCry commented Apr 19, 2024

Good morning. I have great news: there was no outage tonight.
I also had the Bareos backup disabled for the 3 servers yesterday.
So I tried to reproduce the bug this morning, and it is my database backup script that triggers the error when dumping the databases. I have attached the logs of both nodes (1 hour backwards) with debug information.
I have also attached my backup script for the databases, because I just don't understand why it didn't cause any problems before the upgrade.
The icingadb database is 47 GB in size. The retention is set as follows:
history-days: 90
sla-days: 180

Here are the logs and the script:
error_icingadb_node1_and_2-1h.zip
DatabaseBackupScript.txt

Edit:
I also tried to back up the icingadb database with the script from https://www.netways.de/blog/2018/06/01/icinga-backup-encrypted/
The same problem... :(

@oxzi
Member

oxzi commented Apr 19, 2024

Thanks for your detailed report, your logs and your script. Based on your experiment, I would guess that your backup script takes longer than the magic five minutes for which Icinga DB, since the v1.2.0 release, retries on database errors before giving up. When it reaches this limit and there is still a LOCK from the mysqldump, Icinga DB exits.

Earlier you wrote that you have configured your database to "not lock". How did you do this? Could you try configuring your mysqldump command to not LOCK, or to use transactions as described in this StackOverflow thread? If everything else fails, you might want to consider stopping and restarting Icinga DB during your backup window.
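For MariaDB that could look roughly like the following (only a sketch, not a drop-in replacement for your script; host, user and target file are placeholders):

  # --single-transaction dumps from a single consistent snapshot instead of locking the tables
  mysqldump --single-transaction --skip-lock-tables \
      -h <webui-db-host> -u <backup-user> -p icingadb > icingadb.sql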

@TheCry
Author

TheCry commented Apr 19, 2024

It seems like mysqldump sets a lock by default; I didn't do that explicitly in my script. Now I created a dump with "--single-transaction=TRUE" and there was no outage.
Thank you for your support.

@oxzi oxzi removed the bug label Apr 19, 2024
@oxzi
Member

oxzi commented Apr 19, 2024

I am glad to hear that. Unless you have an idea what to change, please feel free to close this issue.

@TheCry
Author

TheCry commented Apr 19, 2024

No - Icinga is running great and I'm more than happy with it. Thanks again for your support, here just as in the forum. The only thing I didn't think of at first was the database dump.
