
Duplicate Key on MySQL after upgrading to v2.7.1 #5623

Closed
robbsie opened this issue Sep 26, 2017 · 11 comments · Fixed by #5636

robbsie commented Sep 26, 2017

After upgrading to the latest v2.7.1, IdoMysqlConnection throws a duplicate key error every 10 seconds. Because of this, no updates are written to the MySQL DB.

Expected Behavior

Icinga should run without duplicate key errors.

Current Behavior

Icinga2 throws the following error every 10s and does not populate the DB after restart:

critical/IdoMysqlConnection: Error "Duplicate entry '1-31017-2017-09-25 22:00:50-205' for key 'instance_id'" when executing query "UPDATE icinga_comments SET author_name = ..

The very first startup with a fresh database worked, but after a restart the problem occurred again!

Possible Solution

None; we needed to downgrade to 2.7.0, which afterwards started without problems.

Before that, we tried to start Icinga 2 without the icinga2.state file and with a clean database, but the problem persisted after a restart of Icinga 2.

Steps to Reproduce (for bugs)

We could not reproduce the issue in our test instance, so unfortunately we cannot provide steps to reproduce. We can provide further information on request if this helps to investigate the issue.

Context

Your Environment

  • Version used (icinga2 --version): 2.7.1
  • Operating System and version: CentOS Linux release 7.4.1708 (Core)
  • Enabled features (icinga2 feature list): api checker command compatlog ido-mysql influxdb livestatus mainlog notification
  • Config validation (icinga2 daemon -C): works
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes: 2 cluster nodes + 1 satellite
@dnsmichi (Contributor)

This seems to be the same issue as seen in #5603.

dnsmichi added the bug and area/db-ido labels on Sep 27, 2017
@dnsmichi (Contributor)

Workaround: Drop the unique constraint for the icinga_comments and icinga_scheduleddowntime tables. Can you try whether this temporarily solves your issue, @robbsie?
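
A minimal sketch of that workaround on MySQL/MariaDB, assuming the unique key is still named instance_id as in the error message (verify the actual index name with SHOW INDEX first):

-- show the index names currently defined on both tables
SHOW INDEX FROM icinga_comments;
SHOW INDEX FROM icinga_scheduleddowntime;

-- drop the unique constraint; the key name 'instance_id' is taken from the error message
ALTER TABLE icinga_comments DROP INDEX instance_id;
ALTER TABLE icinga_scheduleddowntime DROP INDEX instance_id;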

dnsmichi self-assigned this on Sep 27, 2017
dnsmichi added this to the 2.7.2 milestone on Sep 27, 2017

robbsie commented Sep 27, 2017

Workaround: Drop the unique constraint for the icinga_comments and icinga_scheduleddowntime table.

Will this cause problems in our live cluster? :)


dnsmichi commented Sep 29, 2017

No, the workaround won't cause problems. This constraint is only ever triggered by a programming error inside the application; it does not add any extra safety against duplicate inserts here. It comes from the old schema, which used MySQL's INSERT ... ON DUPLICATE KEY UPDATE mechanism. That mechanism has been deprecated and is therefore not used in Icinga 2; instead we rely on UPDATE and, if nothing matched, INSERT.
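
To illustrate the difference, a simplified sketch using values from the simulation further below (not the literal queries Icinga 2 generates; the column list is shortened for readability):

-- old schema style: one statement, relies on the unique key to detect an existing row
INSERT INTO icinga_scheduleddowntime (instance_id, object_id, entry_time, internal_downtime_id, author_name)
VALUES (1, 50148, '2017-09-29 14:53:45', 9, 'icingaadmin')
ON DUPLICATE KEY UPDATE author_name = VALUES(author_name);

-- current style: UPDATE first, INSERT only if no row was affected,
-- so the unique key no longer does anything useful for the application
UPDATE icinga_scheduleddowntime SET author_name = 'icingaadmin'
 WHERE instance_id = 1 AND object_id = 50148 AND internal_downtime_id = 9;
-- if the UPDATE affected 0 rows:
INSERT INTO icinga_scheduleddowntime (instance_id, object_id, entry_time, internal_downtime_id, author_name)
VALUES (1, 50148, '2017-09-29 14:53:45', 9, 'icingaadmin');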

If you fear performance issues, drop the unique key and re-apply it as an additional non-unique index, with a new name (instance_id is a horrible name for an index).
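
A sketch of that on the downtime table, assuming the key covers the columns visible in the duplicate-entry value (instance_id, object_id, entry_time, internal_downtime_id) and using an illustrative new index name; adjust the columns to what SHOW CREATE TABLE reports for your schema:

ALTER TABLE icinga_scheduleddowntime
  DROP INDEX instance_id,
  ADD INDEX idx_scheduleddowntime_lookup (instance_id, object_id, entry_time, internal_downtime_id);
-- analogously for icinga_comments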

The problem you're describing is hard to reproduce. It comes from the original fix in #5585, where the internal_downtime_id changes on config reload. Sometimes this happens in clustered HA setups, and then the DB IDO feature tries the following in one transaction:

Say you have two downtimes:

  • mbmif.int.netways.de!disk!mbmif.int.netways.de-1506689625-1
  • mbmif.int.netways.de!disk /!mbmif.int.netways.de-1506689625-2

Now the internal_downtime_id has changed inside Icinga 2 for various reasons (e.g. the state file is corrupt, or one HA cluster node sent in an update and its IDs differ from the other node's).

The DB IDO feature fires an update against the scheduleddowntime/comments tables; or, so to speak, two updates.

These downtimes share some details: the object_id, instance_id and entry_time are the same, which in turn means that two downtimes for the same host/service were sent in the same second previously.

The WHERE condition includes the name column, while the SET values include the internal_downtime_id.

A short "simulation":

MariaDB [icinga]> select * from icinga_scheduleddowntime order by entry_time desc limit 2\G
*************************** 1. row ***************************
  scheduleddowntime_id: 97
           instance_id: 1
         downtime_type: 1
             object_id: 50148
            entry_time: 2017-09-29 14:53:45
           author_name: icingaadmin
          comment_data: gdsgdehtergeegrehteht
  internal_downtime_id: 9
       triggered_by_id: 0
              is_fixed: 1
              duration: 0
  scheduled_start_time: 2017-09-29 14:53:38
    scheduled_end_time: 2017-09-29 20:53:38
           was_started: 1
     actual_start_time: 2017-09-29 14:53:38
actual_start_time_usec: 0
          is_in_effect: 1
          trigger_time: 2017-09-29 14:53:45
                  name: mbmif.int.netways.de!disk!mbmif.int.netways.de-1506689625-1
         session_token: 1506690441
    endpoint_object_id: 214
*************************** 2. row ***************************
  scheduleddowntime_id: 98
           instance_id: 1
         downtime_type: 1
             object_id: 50155
            entry_time: 2017-09-29 14:53:45
           author_name: icingaadmin
          comment_data: gdsgdehtergeegrehteht
  internal_downtime_id: 8
       triggered_by_id: 0
              is_fixed: 1
              duration: 0
  scheduled_start_time: 2017-09-29 14:53:38
    scheduled_end_time: 2017-09-29 20:53:38
           was_started: 1
     actual_start_time: 2017-09-29 14:53:38
actual_start_time_usec: 0
          is_in_effect: 1
          trigger_time: 2017-09-29 14:53:45
                  name: mbmif.int.netways.de!disk /!mbmif.int.netways.de-1506689625-2
         session_token: 1506690441
    endpoint_object_id: 214
2 rows in set (0.00 sec)
MariaDB [icinga]> UPDATE icinga_scheduleddowntime SET internal_downtime_id=9, object_id=50148, entry_time='2017-09-29 14:53:45', instance_id=1 where name='mbmif.int.netways.de!disk /!mbmif.int.netways.de-1506689625-2';
ERROR 1062 (23000): Duplicate entry '1-50148-2017-09-29 14:53:45-9' for key 'instance_id'

The internal_downtime_id is tied to the unique constraint on that table. The update attempts to temporarily create a row with the same values; the next query would then update the other row to the new values.

This does not work with a unique constraint in place. The constraint is not needed by Icinga 2; only the index is important, for better performance. We will be changing the constraint for 2.8, but cannot do so in minor releases (the impact of updates and re-indexing huge tables, especially the history tables, is enormous).

The quick fix is to restore the old behaviour of matching the constraint columns in the WHERE condition for updates. This updates the tables the other way around: those rows where the changed internal_downtime_id column matches will receive their updates. Since you have a range here of, say, 1..100, a config reload will always fire 100 update queries.
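
Expressed against the simulation above, that roughly means keying the UPDATE on the legacy ID columns again instead of the name (illustrative only, not the literal statement Icinga 2 builds):

-- v2.7.1 behaviour: match on the cluster-wide unique name; the SET values can
-- collide with another row under the unique key
UPDATE icinga_scheduleddowntime
   SET internal_downtime_id = 9, object_id = 50148, entry_time = '2017-09-29 14:53:45', instance_id = 1
 WHERE name = 'mbmif.int.netways.de!disk /!mbmif.int.netways.de-1506689625-2';

-- restored behaviour: match on the constraint columns, so a row only ever
-- updates itself and the unique key cannot be violated
UPDATE icinga_scheduleddowntime
   SET name = 'mbmif.int.netways.de!disk /!mbmif.int.netways.de-1506689625-2'
 WHERE instance_id = 1 AND object_id = 50155
   AND entry_time = '2017-09-29 14:53:45' AND internal_downtime_id = 8;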

The previous patch tried to ensure more data integrity by using the name identifier, which is unique across all cluster instances. The internal_downtime_id is considered legacy and only exists for compatibility reasons (you need it for the external command pipe). If you are using the REST API in Icinga Web 2 as command transport, you are already using the correct name identifier to e.g. delete a downtime.
With the external command pipe this sometimes doesn't work: the core changes the legacy_id, while the other cluster node holds a different one and interprets an incoming external command accordingly.

Anyway, the legacy_id/internal_downtime_id originates from the standalone Nagios design and is not unique across clusters and distributed environments. Names are unique, and that is where we're heading with Icinga 2.


robbsie commented Oct 2, 2017

Thanks for that explanation!

We removed the instance_id key from icinga_comments and icinga_scheduleddowntime, and the update worked without errors.

@edpstiffel

Just to mention: we have the same problem with PostgreSQL as the IDO backend.

@dnsmichi (Contributor)

Upgrade to v2.7.2, which fixes this regression.

@edpstiffel

already planning the change ;) Thanks.

@dnsmichi (Contributor)

2.8 is out this week too, and it includes better indexes for the original patch. Reasoning: minor updates should not result in DB maintenance cycles.


robbsie commented Dec 3, 2017

Are you sure this is fixed in 2.8?

We had the issue again after upgrading to 2.8:

[2017-12-03 18:02:16 +0100] information/IdoMysqlConnection: MySQL IDO instance id: 1 (schema version: '1.14.3')
[2017-12-03 18:02:18 +0100] critical/IdoMysqlConnection: Error "Duplicate entry '1-3456-2017-12-02 23:00:11-16' for key 'instance_id'" when executing query ...

@yaron-idan

I'm hitting this error while trying to install Icinga 2 r2.10.1-1 using the Puppet module; any idea how to solve it?
