IDO Work queue on the inactive node growing when switching connection between redundant master servers #5876

Closed
peteeckel opened this issue Dec 17, 2017 · 7 comments · Fixed by #6740

@peteeckel
Contributor

When the IDO connection switches between two Icinga 2 servers in a redundant cluster (e.g. after the master currently holding the IDO database connection reboots), in some cases the other server does not properly yield the connection. This results in error messages in the log and, potentially, data loss.

Steps to reproduce

The setup is pretty classic: two master servers running Icinga 2 (2.8.0, but the problem also occurs with older versions) and IDO on PostgreSQL 9.6 (in an HA setup, but that should not be the cause of the problem):

library "db_ido_pgsql"

object IdoPgsqlConnection "ido-pgsql" {
    user = "icingaido"
    password = "xxxxxxx"
    host = "192.168.106.50"
    database = "icingaido"
}

Nothing fancy here.
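
For reference, the takeover described in the next step is handled by the IDO feature's own HA logic; the relevant attributes are simply left at their defaults above. Spelled out explicitly, the object would look roughly like this (a sketch only; enable_ha and failover_timeout are the HA-related attributes documented for IdoPgsqlConnection, and the values shown are what I believe the defaults to be):

object IdoPgsqlConnection "ido-pgsql" {
    user = "icingaido"
    password = "xxxxxxx"
    host = "192.168.106.50"
    database = "icingaido"

    /* HA behaviour, assumed defaults: the feature fails over on its own,
       and the standby waits failover_timeout before taking over, which
       matches the roughly one-minute delay described below. */
    enable_ha = true
    failover_timeout = 60s
}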

Now stop the icinga2 service on the node currently holding the IDO connection; the connection is taken over by the other node after about one minute. No problems.

Once the primary node is brought back up, it takes back control of the IDO connection:

[2017-12-16 21:23:12 +0000] information/DbConnection: 'ido-pgsql' started.
...
[2017-12-16 21:23:14 +0000] information/DbConnection: Resuming IDO connection: ido-pgsql
[2017-12-16 21:23:14 +0000] information/IdoPgsqlConnection: 'ido-pgsql' resumed.
...
[2017-12-16 21:23:18 +0000] information/WorkQueue: #7 (IdoPgsqlConnection, ido-pgsql) items: 0, rate: 0.2/s (12/min 12/5min 12/15min);

No problems here either. But the picture is different on the secondary node that formerly held the connection to the IDO database:

[2017-12-16 21:23:14 +0000] information/IdoPgsqlConnection: 'ido-pgsql' paused.
[2017-12-16 21:23:14 +0000] information/DbConnection: Pausing IDO connection: ido-pgsql
[2017-12-16 21:23:16 +0000] information/WorkQueue: #7 (IdoPgsqlConnection, ido-pgsql) items: 122, rate: 12.0833/s (725/min 3421/5min 9032/15min); empty in 20 seconds
[2017-12-16 21:23:22 +0000] information/IdoPgsqlConnection: pgSQL IDO instance id: 1 (schema version: '1.14.3')
[2017-12-16 21:23:22 +0000] information/IdoPgsqlConnection: Finished reconnecting to PostgreSQL IDO database in 0.445331 second(s).
[2017-12-16 21:24:06 +0000] information/WorkQueue: #7 (IdoPgsqlConnection, ido-pgsql) items: 18, rate: 4.51667/s (271/min 2816/5min 9303/15min); empty in 9 seconds
[2017-12-16 21:24:16 +0000] information/WorkQueue: #7 (IdoPgsqlConnection, ido-pgsql) items: 401, rate: 4.51667/s (271/min 2816/5min 9303/15min); empty in 10 seconds
[2017-12-16 21:24:26 +0000] information/WorkQueue: #7 (IdoPgsqlConnection, ido-pgsql) items: 552, rate: 4.51667/s (271/min 2816/5min 9303/15min); empty in 36 seconds
[2017-12-16 21:24:36 +0000] information/WorkQueue: #7 (IdoPgsqlConnection, ido-pgsql) items: 594, rate: 4.51667/s (271/min 2816/5min 9303/15min); empty in 2 minutes and 21 seconds
[2017-12-16 21:24:46 +0000] information/WorkQueue: #7 (IdoPgsqlConnection, ido-pgsql) items: 699, rate: 4.51667/s (271/min 2816/5min 9303/15min); empty in 1 minute and 6 seconds
...
[2017-12-16 21:41:26 +0000] information/WorkQueue: #7 (IdoPgsqlConnection, ido-pgsql) items: 10757, rate: 4.51667/s (271/min 2816/5min 9303/15min); empty in 21 minutes and 35 seconds
[2017-12-16 21:41:33 +0000] critical/IdoPgsqlConnection: Error "SSL SYSCALL error: Connection timed out
" when executing query "COMMIT"
[2017-12-16 21:41:33 +0000] warning/IdoPgsqlConnection: Exception during database operation: Verify that your database is operational!

After correctly detecting that the other node has come back and pausing its connection to the database, the secondary node's work queue nevertheless keeps growing and does not get emptied for more than 20 minutes, until finally there is a timeout and the connection drops. Afterwards, all is well again and the node continues to run normally.

Expected Behavior

After the IDO connection is paused on the secondary node, that node should no longer put queries into the IDO work queue.

Current Behavior

The IDO work queue on the node with the paused IDO connection keeps growing until a timeout occurs on the database connection. I can't say at this point whether data loss occurs or whether the data the secondary node tries to write has already been written by the primary node, but the state and history look fine at first glance.

Your Environment

  • Icinga 2 version 2.8.0
  • CentOS 7.4
  • Icinga Web 2 version 2.4.2
object Zone "global" {
  global = true
}

object Endpoint "master1.example.com" {
    host = "master1.example.com"
}
object Endpoint "master2.example.com" {
    host = "master2.example.com"
}

object Zone "master" {
    endpoints = [ "master1.example.com", "master2.example.com" ]
}
@peteeckel
Contributor Author

peteeckel commented Dec 17, 2017

Additional information:

I just reproduced the issue, and I found that Icinga 2 actually has a connection open to the database server (via HAProxy):

[root@master2 ~]# netstat -antup | grep icinga | grep 5432
tcp        0    746 192.168.106.50:42776    192.168.106.50:5432     ESTABLISHED 22463/icinga2
[root@master2 ~]# netstat -antup | grep haproxy | grep 42776
tcp        0      0 192.168.106.50:5432     192.168.106.50:42776    ESTABLISHED 22795/haproxy       
[root@master2 ~]# netstat -antup | grep haproxy | grep 57608
tcp        0      0 192.168.106.102:57608   192.168.106.102:5432    ESTABLISHED 22795/haproxy       

And on the database server there is a transaction started by the Icinga 2 process (via HAProxy) from the secondary node:

postgres=# select * from pg_stat_activity where datname='icingaido';
 datid |  datname  |  pid  | usesysid |  usename  | application_name |   client_addr   | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type | wait_event |        state        | backend_xid | backend_xmin |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              query                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
-------+-----------+-------+----------+-----------+------------------+-----------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------+---------------------+-------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 17119 | icingaido | 24951 |    17118 | icingaido |                  | 192.168.106.102 |                 |       57608 | 2017-12-17 13:15:02.900508+00 | 2017-12-17 13:16:25.487813+00 | 2017-12-17 13:16:25.487813+00 | 2017-12-17 13:16:25.487834+00 |                 |            | idle in transaction |             |              | BEGIN
 17119 | icingaido | 26220 |    17118 | icingaido |                  | 192.168.106.101 |                 |       43632 | 2017-12-17 13:17:21.777732+00 | 2017-12-17 13:24:54.623683+00 | 2017-12-17 13:24:54.919536+00 | 2017-12-17 13:24:54.920958+00 |                 |            | idle in transaction |      329813 |              | UPDATE icinga_servicestatus SET acknowledgement_type = E'0',  active_checks_enabled = E'1',  check_command = E'disk_workaround_rhel1484963',  check_source = E'master2.demo.hindenburgring.com',  check_type = E'0',  current_check_attempt = E'1',  current_notification_number = E'0',  current_state = E'0',  endpoint_object_id = 225,  event_handler = E'',  event_handler_enabled = E'1',  execution_time = E'0.033505',  flap_detection_enabled = E'0',  has_been_checked = E'1',  instance_id = 1,  is_flapping = E'0',  is_reachable = E'1',  last_check = TO_TIMESTAMP(1513517094) AT TIME ZONE 'UTC',  last_hard_state = E'0',  last_hard_state_change = TO_TIMESTAMP(1513516498) AT TIME ZONE 'UTC',  last_state_change = TO_TIMESTAMP(1513516498) AT TIME ZONE 'UTC',  last_time_ok = TO_TIMESTAMP(1513517094) AT TIME ZONE 'UTC',  latency = E'0',  long_output = E'',  max_check_attempts = E'5',  next_check = TO_TIMESTAMP(1513517153) AT TIME ZONE 'UTC',  normal_check_interval = E'1',  notifications_enabled = E'1',  original_attributes

The timestamps from the log and the database indicate that this connection was actually opened after the IDO connection had been paused:

[root@master2 ~]# grep paused /var/log/icinga2/icinga2.log 
[2017-12-17 13:14:56 +0000] information/IdoPgsqlConnection: 'ido-pgsql' paused.

(all systems on NTP, .101 is master1, .102 is master2, database currently running on master2)

@dnsmichi
Contributor

dnsmichi commented Jan 8, 2018

I suspect that both instances think they are not connected to each other and are therefore both attempting to write to the database themselves. Do you have a cluster health check in place?

Cheers,
Michael
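
For context, the kind of health check Michael is asking about would typically use the ITL's built-in cluster CheckCommand, which reports whether the local endpoint can see the other endpoints of its zone. A minimal sketch, assuming Host objects named after the two master endpoints (the service name and assign rule are purely illustrative):

apply Service "cluster-health" {
    check_command = "cluster"

    /* pin the check to the endpoint of the same name, so each master
       reports its own view of the cluster connection */
    command_endpoint = host.name

    assign where host.name in [ "master1.example.com", "master2.example.com" ]
}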

@dnsmichi dnsmichi added area/db-ido Database output needs feedback We'll only proceed once we hear from you again labels Jan 8, 2018
@peteeckel
Contributor Author

I don't have any special checks in place, and according to the logs both master servers see each other.

Actually the secondary node does notice that the primary comes back, which is why it logs the "'ido-pgsql' paused" message (please correct me if my assumption is wrong). It just doesn't really stop trying to write to the database, even though it said so.

@peteeckel
Contributor Author

Do you still need any feedback? BTW I'll be in Berlin tomorrow.

Cheers,

Pete.

@dnsmichi dnsmichi added bug Something isn't working and removed needs feedback We'll only proceed once we hear from you again labels Mar 7, 2018
@dnsmichi
Contributor

dnsmichi commented Mar 7, 2018

No, I believe this is a bug and someone needs to look into a reproducer and a fix. @N-o-X do you have a time slot where you could look into it?

@matthiasritter

Hi,
Same here for us, except we use MySQL.
If I can provide additional logs/data, please let me know.
Cheers,
Matthias

@dnsmichi dnsmichi self-assigned this Oct 30, 2018
@dnsmichi dnsmichi added this to the 2.11.0 milestone Oct 30, 2018
@dnsmichi
Contributor

While implementing #2941 I've found out that the IDO feature incorrectly enqueues queries into the work queue.
