IDO Work queue on the inactive node growing when switching connection between redundant master servers #5876

Closed
peteeckel opened this issue Dec 17, 2017 · 7 comments · Fixed by #6740

@peteeckel
Contributor

When the IDO connection switches between two Icinga 2 servers in a redundant cluster (e.g. after the master currently holding the IDO database connection reboots), in some cases the other server does not properly yield the connection. This results in error messages in the log and, potentially, data loss.

Steps to reproduce

The setup is pretty classic: two master servers running Icinga 2 (2.8.0, but the problem also occurs with older versions) and IDO on PostgreSQL 9.6 (in an HA setup, but that should not be the cause of the problem):

library "db_ido_pgsql"

object IdoPgsqlConnection "ido-pgsql" {
    user = "icingaido"
    password = "xxxxxxx"
    host = "192.168.106.50"
    database = "icingaido"
}

Nothing fancy here.
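
For reference, the takeover described in the next step is handled by the IDO feature's own HA logic; the relevant attributes are simply left at their defaults above. Spelled out explicitly, the object would look roughly like this (a sketch only; enable_ha and failover_timeout are the HA-related attributes documented for IdoPgsqlConnection, and the values shown are what I believe the defaults to be):

object IdoPgsqlConnection "ido-pgsql" {
    user = "icingaido"
    password = "xxxxxxx"
    host = "192.168.106.50"
    database = "icingaido"

    /* HA behaviour, assumed defaults: the feature fails over on its own,
       and the standby waits failover_timeout before taking over, which
       matches the roughly one-minute delay described below. */
    enable_ha = true
    failover_timeout = 60s
}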

Now stop the icinga2 service on the node currently holding the IDO connection; the connection is taken over by the other node after about one minute. No problems.

Once the primary node is brought back up, it takes back control of the IDO connection:

[2017-12-16 21:23:12 +0000] information/DbConnection: 'ido-pgsql' started.
...
[2017-12-16 21:23:14 +0000] information/DbConnection: Resuming IDO connection: ido-pgsql
[2017-12-16 21:23:14 +0000] information/IdoPgsqlConnection: 'ido-pgsql' resumed.
...
[2017-12-16 21:23:18 +0000] information/WorkQueue: #7 (IdoPgsqlConnection, ido-pgsql) items: 0, rate: 0.2/s (12/min 12/5min 12/15min);

No problems here either. But the picture is different on the secondary node that formerly held the connection to the IDO database:

[2017-12-16 21:23:14 +0000] information/IdoPgsqlConnection: 'ido-pgsql' paused.
[2017-12-16 21:23:14 +0000] information/DbConnection: Pausing IDO connection: ido-pgsql
[2017-12-16 21:23:16 +0000] information/WorkQueue: #7 (IdoPgsqlConnection, ido-pgsql) items: 122, rate: 12.0833/s (725/min 3421/5min 9032/15min); empty in 20 seconds
[2017-12-16 21:23:22 +0000] information/IdoPgsqlConnection: pgSQL IDO instance id: 1 (schema version: '1.14.3')
[2017-12-16 21:23:22 +0000] information/IdoPgsqlConnection: Finished reconnecting to PostgreSQL IDO database in 0.445331 second(s).
[2017-12-16 21:24:06 +0000] information/WorkQueue: #7 (IdoPgsqlConnection, ido-pgsql) items: 18, rate: 4.51667/s (271/min 2816/5min 9303/15min); empty in 9 seconds
[2017-12-16 21:24:16 +0000] information/WorkQueue: #7 (IdoPgsqlConnection, ido-pgsql) items: 401, rate: 4.51667/s (271/min 2816/5min 9303/15min); empty in 10 seconds
[2017-12-16 21:24:26 +0000] information/WorkQueue: #7 (IdoPgsqlConnection, ido-pgsql) items: 552, rate: 4.51667/s (271/min 2816/5min 9303/15min); empty in 36 seconds
[2017-12-16 21:24:36 +0000] information/WorkQueue: #7 (IdoPgsqlConnection, ido-pgsql) items: 594, rate: 4.51667/s (271/min 2816/5min 9303/15min); empty in 2 minutes and 21 seconds
[2017-12-16 21:24:46 +0000] information/WorkQueue: #7 (IdoPgsqlConnection, ido-pgsql) items: 699, rate: 4.51667/s (271/min 2816/5min 9303/15min); empty in 1 minute and 6 seconds
...
[2017-12-16 21:41:26 +0000] information/WorkQueue: #7 (IdoPgsqlConnection, ido-pgsql) items: 10757, rate: 4.51667/s (271/min 2816/5min 9303/15min); empty in 21 minutes and 35 seconds
[2017-12-16 21:41:33 +0000] critical/IdoPgsqlConnection: Error "SSL SYSCALL error: Connection timed out
" when executing query "COMMIT"
[2017-12-16 21:41:33 +0000] warning/IdoPgsqlConnection: Exception during database operation: Verify that your database is operational!

After correctly detecting that the other node has come back and pausing its connection to the database, the secondary node's work queue nevertheless keeps growing and does not get emptied for more than 20 minutes, until finally there is a timeout and the connection drops. Afterwards, all is well again and the node continues to run normally.

Expected Behavior

After the IDO connection is paused on the secondary node, that node should no longer put queries into the IDO work queue.

Current Behavior

The IDO work queue on the node with the paused IDO connection keeps growing until a timeout occurs on the database connection. I can't say at this point whether data loss occurs or whether the data the secondary node tries to write has already been written by the primary node, but the state and history look fine at first glance.

Your Environment

  • Icinga 2 version 2.8.0
  • CentOS 7.4
  • Icinga Web 2 version 2.4.2
object Zone "global" {
  global = true
}

object Endpoint "master1.example.com" {
    host = "master1.example.com"
}
object Endpoint "master2.example.com" {
    host = "master2.example.com"
}

object Zone "master" {
    endpoints = [ "master1.example.com", "master2.example.com" ]
}
@peteeckel
Contributor Author

peteeckel commented Dec 17, 2017

Additional information:

I just reproduced the issue, and I found that Icinga 2 actually has a connection open to the database server (via HAProxy):

[root@master2 ~]# netstat -antup | grep icinga | grep 5432
tcp        0    746 192.168.106.50:42776    192.168.106.50:5432     ESTABLISHED 22463/icinga2
[root@master2 ~]# netstat -antup | grep haproxy | grep 42776
tcp        0      0 192.168.106.50:5432     192.168.106.50:42776    ESTABLISHED 22795/haproxy       
[root@master2 ~]# netstat -antup | grep haproxy | grep 57608
tcp        0      0 192.168.106.102:57608   192.168.106.102:5432    ESTABLISHED 22795/haproxy       

And on the database server there is a transaction started by the Icinga 2 process (via HAProxy) from the secondary node:

postgres=# select * from pg_stat_activity where datname='icingaido';
 datid |  datname  |  pid  | usesysid |  usename  | application_name |   client_addr   | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type | wait_event |        state        | backend_xid | backend_xmin |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              query                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
-------+-----------+-------+----------+-----------+------------------+-----------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------+---------------------+-------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 17119 | icingaido | 24951 |    17118 | icingaido |                  | 192.168.106.102 |                 |       57608 | 2017-12-17 13:15:02.900508+00 | 2017-12-17 13:16:25.487813+00 | 2017-12-17 13:16:25.487813+00 | 2017-12-17 13:16:25.487834+00 |                 |            | idle in transaction |             |              | BEGIN
 17119 | icingaido | 26220 |    17118 | icingaido |                  | 192.168.106.101 |                 |       43632 | 2017-12-17 13:17:21.777732+00 | 2017-12-17 13:24:54.623683+00 | 2017-12-17 13:24:54.919536+00 | 2017-12-17 13:24:54.920958+00 |                 |            | idle in transaction |      329813 |              | UPDATE icinga_servicestatus SET acknowledgement_type = E'0',  active_checks_enabled = E'1',  check_command = E'disk_workaround_rhel1484963',  check_source = E'master2.demo.hindenburgring.com',  check_type = E'0',  current_check_attempt = E'1',  current_notification_number = E'0',  current_state = E'0',  endpoint_object_id = 225,  event_handler = E'',  event_handler_enabled = E'1',  execution_time = E'0.033505',  flap_detection_enabled = E'0',  has_been_checked = E'1',  instance_id = 1,  is_flapping = E'0',  is_reachable = E'1',  last_check = TO_TIMESTAMP(1513517094) AT TIME ZONE 'UTC',  last_hard_state = E'0',  last_hard_state_change = TO_TIMESTAMP(1513516498) AT TIME ZONE 'UTC',  last_state_change = TO_TIMESTAMP(1513516498) AT TIME ZONE 'UTC',  last_time_ok = TO_TIMESTAMP(1513517094) AT TIME ZONE 'UTC',  latency = E'0',  long_output = E'',  max_check_attempts = E'5',  next_check = TO_TIMESTAMP(1513517153) AT TIME ZONE 'UTC',  normal_check_interval = E'1',  notifications_enabled = E'1',  original_attributes

The timestamps from the log and the database indicate that this connection was actually opened after the IDO connection had been paused:

[root@master2 ~]# grep paused /var/log/icinga2/icinga2.log 
[2017-12-17 13:14:56 +0000] information/IdoPgsqlConnection: 'ido-pgsql' paused.

(all systems on NTP, .101 is master1, .102 is master2, database currently running on master2)

@dnsmichi
Contributor

dnsmichi commented Jan 8, 2018

I suspect that both instances think they are not connected to each other and are therefore both attempting to write to the database themselves. Do you have a cluster health check in place?

Cheers,
Michael
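
For context, the kind of health check Michael is asking about would typically use the ITL's built-in cluster CheckCommand, which reports whether the local endpoint can see the other endpoints of its zone. A minimal sketch, assuming Host objects named after the two master endpoints (the service name and assign rule are purely illustrative):

apply Service "cluster-health" {
    check_command = "cluster"

    /* pin the check to the endpoint of the same name, so each master
       reports its own view of the cluster connection */
    command_endpoint = host.name

    assign where host.name in [ "master1.example.com", "master2.example.com" ]
}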

@dnsmichi dnsmichi added area/db-ido Database output needs feedback We'll only proceed once we hear from you again labels Jan 8, 2018
@peteeckel
Contributor Author

I don't have any special checks in place, and according to the logs both master servers see each other.

Actually the secondary node does notice that the primary comes back, which is why it logs the "'ido-pgsql' paused" message (please correct me if my assumption is wrong). It just doesn't really stop trying to write to the database, even though it said so.

@peteeckel
Contributor Author

Do you still need any feedback? BTW I'll be in Berlin tomorrow.

Cheers,

Pete.

@dnsmichi dnsmichi added bug Something isn't working and removed needs feedback We'll only proceed once we hear from you again labels Mar 7, 2018
@dnsmichi
Contributor

dnsmichi commented Mar 7, 2018

No, I believe this is a bug and someone needs to look into a reproducer and a fix. @N-o-X do you have a time slot where you could look into it?

@matthiasritter

Hi,
Same here for us, except we use MySQL.
If I can provide additional logs/data, please let me know.
Cheers,
Matthias

@dnsmichi dnsmichi self-assigned this Oct 30, 2018
@dnsmichi dnsmichi added this to the 2.11.0 milestone Oct 30, 2018
@dnsmichi
Contributor

While implementing #2941 I've found out that the IDO feature incorrectly enqueues queries into the work queue.
