IDO Work queue on the inactive node growing when switching connection between redundant master servers #5876
Comments
Additional information: I just reproduced the issue and found that Icinga 2 actually has a connection open with the database server (via HAProxy):
And on the database server there is a transaction started by the Icinga 2 process (via HAproxy) from the secondary node:
The timestamps from the log and the database indicate that the connection has actually been opened after the IDO connection has been paused:
(all systems on NTP, .101 is master1, .102 is master2, database currently running on master2)
I would believe that both instances think that they're not connected to each other, and are therefore both attempting to write to the database themselves. Do you have such a health check in place? Cheers,
I don't have any special checks in place, and according to the logs both master servers see each other. Actually the secondary node does notice that the primary comes back, which is why it logs the "'ido-pgsql' paused" message (please correct me if my assumption is wrong). It just doesn't really stop trying to write to the database, even though it said so.
Do you still need any feedback? BTW I'll be in Berlin tomorrow. Cheers, Pete.
No, I would believe this is a bug and someone needs to look into a reproducer and a fix. @N-o-X any time slot where you could look into it?
Hi,
While implementing #2941 I've found out that the IDO feature incorrectly enqueues queries into the work queue. |
When the IDO connection switches between two Icinga 2 servers in a redundant cluster (e.g. after the master currently running the IDO database connection reboots), in some cases the other server does not properly yield the connection. This results in error messages in the log, and potentially in data loss.
Steps to reproduce
The setup is pretty classic: Two master servers running Icinga 2 (2.8.0, but the problem also occurs with older versions), IDO on PostgreSQL 9.6 (in a HA setup, but that should not be the cause of the problem):
Nothing fancy here.
Now stop the icinga2 service on the node currently holding the IDO connection and the connection is taken over by the other node after about one minute. No problems.
After bringing up the primary node, it takes back control of the IDO service:
No problems here as well. But the picture is different on the secondary node that formerly had the connection to the IDO database:
After correctly detecting that the other node has come back and pausing the connection to the database, the WorkQueue still grows longer and does not get emptied for more than 20 minutes, until finally there is a timeout and the connection drops. Afterwards, all is well again and the node continues to run normally.
Expected Behavior
After the IDO connection is paused on the secondary node, it should no longer put requests into the IDO work queue as the IDO connection is paused.
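To make the expected behavior concrete, here is a minimal sketch in Python of a work queue that refuses new items once its connection is paused. This is not Icinga 2's actual code; the class PausableWorkQueue, its methods, and the placeholder query strings are all hypothetical and only illustrate the guard I would expect.

```python
from collections import deque


class PausableWorkQueue:
    """Toy stand-in for an IDO-style work queue (hypothetical, not Icinga 2 code)."""

    def __init__(self):
        self._items = deque()
        self._paused = False
        self.rejected = 0  # number of queries refused while paused

    def pause(self):
        # Called when the other master takes over the database connection.
        self._paused = True

    def resume(self):
        # Called when this node becomes responsible for the database again.
        self._paused = False

    def enqueue(self, query):
        # The guard described above: a paused connection accepts no new work.
        if self._paused:
            self.rejected += 1
            return False
        self._items.append(query)
        return True

    def __len__(self):
        return len(self._items)


queue = PausableWorkQueue()
queue.enqueue("query-1")             # accepted while the connection is active
queue.pause()                        # the primary node has come back
accepted = queue.enqueue("query-2")  # refused, so the queue stops growing
print(len(queue), accepted, queue.rejected)
```

In this sketch, items that were already queued before the pause simply stay in the queue; whether a real implementation should flush or discard them on pause is a separate design decision that I am not making any claim about here.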
Current Behavior
The IDO work queue on the node with the paused IDO connection keeps growing until a timeout on the database occurs. I can't say at this point whether data loss occurs or whether the data the secondary node tries to write has already been written by the primary node, but the state and history look fine at first glance.
Your Environment