New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix mysql deadlock #11320
Fix mysql deadlock #11320
Conversation
199af44
to
144e50d
Compare
Hi @thingstad! Can you add a small changelog entry for this? We'll get it reviewed (ping @imthaghost). Thanks! |
If node is current leader, the HA leader lock is monitored. If the connection is lost during the sleeping period, the monitor loop will never exit. This happend due to the fact that hasLock() does not provide an binary answer to the question "do we still have the lock", and due to the fact that QueryRow().Scan() does not return an error if the connection is lost. Nor does QueryRow().Scan() try to re-establish the lost connection.
144e50d
to
7be2a47
Compare
Hi @thingstad, Do you suppose you could write a test to demonstrate the problem and the fix? Typically our physical backend implementations have integration tests. Sadly the mysql physical tests are relying on the user to provide a MYSQL_ADDR pointing to an existing DB they manage, and the tests are skipped otherwise. What they should be doing instead is use https://github.com/hashicorp/vault/blob/main/helper/testhelpers/mysql/mysqlhelper.go#L21-L21 to create a docker container for mysql. This will allow the tests to run in CI, which would already be a great improvement. It looks like this was already done for TestMySQLHABackend_LockFailPanic, but we failed to add use of PrepareTestContainer to the other tests at the time. You could make that change as part of this PR, which would be nice but not required. What I would really like to see though is a test that shows that your change solves the problem, i.e. something that reproduces the steps from #11319 using docker containers. I realize that the bug may not manifest every time the db is restarted, but even if it takes many invocations of the tests to see the failure, that's still something. If it takes 20 attempts on average to make the failure happen before your change, and then we run the test 100 times with your change without seeing a failure, that's good enough evidence that it fixes the problem. |
Due to the age of this PR, I will go ahead and close it at this point. Please feel free to re-base and continue if you're willing to do so. Thanks! |
Fix for #11319