Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix mysql deadlock #11320

Closed
wants to merge 3 commits into from
Closed

Conversation

thingstad
Copy link

Fix for #11319

@hashicorp-cla
Copy link

hashicorp-cla commented Apr 8, 2021

CLA assistant check
All committers have signed the CLA.

@hsimon-hashicorp
Copy link
Contributor

Hi @thingstad! Can you add a small changelog entry for this? We'll get it reviewed (ping @imthaghost). Thanks!

If node is current leader, the HA leader lock is monitored. If the connection is lost during the sleeping period,
the monitor loop will never exit.
This happend due to the fact that hasLock() does not provide an binary answer to
the question "do we still have the lock", and due to the fact that QueryRow().Scan() does not return an error
if the connection is lost. Nor does QueryRow().Scan() try to re-establish the lost connection.
@ncabatoff
Copy link
Collaborator

Hi @thingstad,

Do you suppose you could write a test to demonstrate the problem and the fix? Typically our physical backend implementations have integration tests. Sadly the mysql physical tests are relying on the user to provide a MYSQL_ADDR pointing to an existing DB they manage, and the tests are skipped otherwise. What they should be doing instead is use https://github.com/hashicorp/vault/blob/main/helper/testhelpers/mysql/mysqlhelper.go#L21-L21 to create a docker container for mysql. This will allow the tests to run in CI, which would already be a great improvement. It looks like this was already done for TestMySQLHABackend_LockFailPanic, but we failed to add use of PrepareTestContainer to the other tests at the time.

You could make that change as part of this PR, which would be nice but not required. What I would really like to see though is a test that shows that your change solves the problem, i.e. something that reproduces the steps from #11319 using docker containers. I realize that the bug may not manifest every time the db is restarted, but even if it takes many invocations of the tests to see the failure, that's still something. If it takes 20 attempts on average to make the failure happen before your change, and then we run the test 100 times with your change without seeing a failure, that's good enough evidence that it fixes the problem.

@hsimon-hashicorp
Copy link
Contributor

Due to the age of this PR, I will go ahead and close it at this point. Please feel free to re-base and continue if you're willing to do so. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants