Navigation Menu

Skip to content
This repository has been archived by the owner on Dec 15, 2020. It is now read-only.

Clarify and explicitly order enroll host transaction #2318

Closed
wants to merge 2 commits into from
Closed

Clarify and explicitly order enroll host transaction #2318

wants to merge 2 commits into from

Conversation

zwass
Copy link
Contributor

@zwass zwass commented Oct 6, 2020

Previously this used an implicit transaction with INSERT ... ON DUPLICATE KEY UPDATE. Now we make the transaction explicit with the
hopes of resolving some deadlock issues users encountered at very high
scale.

These issues are possibly root-caused by misconfiguration of osquery
resulting in copied host identifiers, making this change less relevant. I
still believe it is worth being more explicit about this transaction.

Previously this used an implicit transaction with `INSERT ... ON
DUPLICATE KEY UPDATE`. Now we make the transaction explicit with the
hopes of resolving some deadlock issues users encountered at very high
scale.
Copy link
Contributor

@directionless directionless left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to try this, I think it's fine. But I think it will make deadlocks (and performance) worse. It's replacing a single upsert, with 2 or 3 database calls.

I'm less sure about mysql, but certainly in postgres this is strictly worse for locking.

Looking at the mysql docs, I see these descriptions of the locks taken:

UPDATE ... WHERE ... sets an exclusive next-key lock on every record the search encounters. However, only an index record lock is required for statements that lock rows using a unique index to search for a unique row.

INSERT sets an exclusive lock on the inserted row. This lock is an index-record lock, not a next-key lock (that is, there is no gap lock) and does not prevent other sessions from inserting into the gap before the inserted row.

REPLACE is done like an INSERT if there is no collision on a unique key. Otherwise, an exclusive next-key lock is placed on the row to be replaced.

INSERT ... ON DUPLICATE KEY UPDATE differs from a simple INSERT in that an exclusive lock rather than a shared lock is placed on the row to be updated when a duplicat
e-key error occurs. An exclusive index-record lock is taken for a duplicate primary key value. An exclusive next-key lock is taken for a duplicate unique key value.

From that, the locking around ON DUPLICATE sounds heavier. I also wonder about REPLACE, though some discussion is that this is a delete; insert under the hood, which would be a lose.

Also happy to chat on slack. Depending on what's happening, there may be other patterns.

assert.Equal(t, enrollSecretName, h.EnrollSecretName)

h, err = ds.EnrollHost(tt.uuid, tt.nodeKey+"new", enrollSecretName+"new")
require.Nil(t, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
require.Nil(t, err)
require.NoError(t, err)

(nit)

This may help MySQL perform the locking in the expected order to resolve
deadlocks.
@zwass
Copy link
Contributor Author

zwass commented Oct 8, 2020

Let me explain the reasoning some here.

This is an attempt to solve the following deadlock, discussed in https://osquery.slack.com/archives/C1XCLA5DZ/p1601480664063800:

------------------------
LATEST DETECTED DEADLOCK
------------------------
2020-09-30 09:46:45 0x7de11cce9700
*** (1) TRANSACTION:
TRANSACTION 35762332832, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 1157846, OS thread handle 139400941860608, query id 14939859382 prod-10-32-7-83.pw1.bcc.somehost.com 10.32.7.83 tlsproddb_user updating
UPDATE hosts SET
			seen_time = '2020-09-30 13:46:45.647691'
		WHERE node_key='6u8qphrXq6FhlWXYZfYWHT2xwBHX+SsN'
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 771 page no 7 n bits 88 index PRIMARY of table `tlsproddb`.`hosts` trx id 35762332832 lock_mode X locks rec but not gap waiting
*** (2) TRANSACTION:
TRANSACTION 35762332830, ACTIVE 0 sec updating or deleting
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1136, 3 row lock(s), undo log entries 1
MySQL thread id 1157730, OS thread handle 138405804414720, query id 14939859380 prod-10-32-1-12.pw1.bcc.somehost.com 10.32.1.12 tlsproddb_user update
INSERT INTO hosts (
			detail_update_time,
			label_update_time,
			osquery_host_id,
			seen_time,
			node_key,
			enroll_secret_name
		) VALUES ('1970-01-02 00:00:00', '1970-01-02 00:00:00', 'DB938000-FD74-11E7-8000-000000000000', '2020-09-30 13:46:45.647665', 'WqAC/7dPeDzkh90b5yb4TIIvvJrgd4ut', 'default')
		ON DUPLICATE KEY UPDATE
			node_key = VALUES(node_key),
			deleted = FALSE
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 771 page no 7 n bits 88 index PRIMARY of table `tlsproddb`.`hosts` trx id 35762332830 lock_mode X locks rec but not gap
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 771 page no 3489 n bits 248 index idx_host_unique_nodekey of table `tlsproddb`.`hosts` trx id 35762332830 lock_mode X locks rec but not gap waiting
*** WE ROLL BACK TRANSACTION (1)

My hope is that by explicitly ordering the transaction and running the update or insert, we can get MySQL to acquire the locks in an order that avoids potential deadlocks. I have just updated the code to use the FOR UPDATE (https://dev.mysql.com/doc/refman/5.7/en/innodb-locking-reads.html) statement that I think may better achieve my goal here.

I agree with your analysis that this is likely less performant, but the motivation is to reduce the deadlocks due to the suspicion that those may be resulting in other errors.

All of that said, I think I will drop this PR into drafts and keep debugging the root cause through the additional logging added in #2313 and see if #2321 possibly helps.

@zwass zwass marked this pull request as draft October 8, 2020 01:43
@zwass
Copy link
Contributor Author

zwass commented Nov 3, 2020

I am not planning to merge this due to the enrollment issues we identified with the logging in #2313.

@zwass zwass closed this Nov 3, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants