
HA MySQL/MariaDB: Deadlock on leader, if HA DB connection is lost #11319

Open
thingstad opened this issue Apr 8, 2021 · 4 comments

@thingstad

Deadlock on leader if HA DB connection is lost
In a setup with a MySQL/MariaDB HA back-end, if the leader loses its DB connection it has a fair chance of ending up in a deadlock state.
Losing the connection releases the lock, and the standby assumes leadership.
You end up with multiple masters.
The previous leader will still report itself as master, as will the former standby node, which has now grabbed leadership.

To Reproduce
Steps to reproduce the behavior:

  1. Prepare a MySQL/MariaDB instance.
  2. Configure two Vault instances to each use MySQL as their HA back-end.
  3. Start and unseal both Vault instances.
  4. On the MySQL/MariaDB instance, run sudo systemctl restart mariadb to restart the DB and thereby drop the current connections.
  5. Wait a few seconds until the standby assumes leadership.
  6. Issue curl http://node-addr:8200/v1/sys/leader on each Vault node.
  7. Observe multiple masters reported (a small Go sketch that automates steps 6 and 7 follows this list).
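
For steps 6 and 7, the following is a minimal Go sketch (not part of Vault, purely illustrative) that queries /v1/sys/leader on every node and counts how many report is_self = true. The node addresses are placeholders for your own setup.

// splitbraincheck: query /v1/sys/leader on each Vault node and report
// how many of them claim to be the active leader.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type leaderStatus struct {
	IsSelf        bool   `json:"is_self"`
	LeaderAddress string `json:"leader_address"`
}

func main() {
	// Placeholder addresses; replace with your own Vault nodes.
	nodes := []string{"http://10.10.10.2:8200", "http://10.10.10.3:8200"}

	leaders := 0
	for _, addr := range nodes {
		resp, err := http.Get(addr + "/v1/sys/leader")
		if err != nil {
			fmt.Printf("%s: request failed: %v\n", addr, err)
			continue
		}
		var status leaderStatus
		decodeErr := json.NewDecoder(resp.Body).Decode(&status)
		resp.Body.Close()
		if decodeErr != nil {
			fmt.Printf("%s: decode failed: %v\n", addr, decodeErr)
			continue
		}
		fmt.Printf("%s: is_self=%v leader_address=%s\n", addr, status.IsSelf, status.LeaderAddress)
		if status.IsSelf {
			leaders++
		}
	}
	if leaders > 1 {
		fmt.Printf("split brain: %d nodes claim to be the active leader\n", leaders)
	}
}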

Expected behavior
The expectation is that the former leader recovers from having lost leadership, either by trying to grab the lock again or by becoming a standby.

Environment:

  • Vault Server Version: 1.7
  • HA back-end DB: MariaDB 10.5.9

Vault server configuration file(s):

storage "mysql" {
  address = "10.10.10.4:3306"
  database = "vault"
  table = "vault_data"
  username = "vault-username"
  password = "vault-password"
  ha_enabled = "true"
  lock_table = "vault_lock"    
}

listener "tcp" {
  address       = "10.10.10.2:8200"
  tls_disable = 1
}

api_addr = "http://10.10.10.2:8200"
ui = true
log_level = "info"

Additional ways to observe the issue

Adding one line of logging to physical/mysql/mysql.go

Change

// hasLock will check if a lock is held by checking the current lock id against our known ID.
func (i *MySQLHALock) hasLock(key string) error {
	var result sql.NullInt64
	err := i.in.statements["used_lock"].QueryRow(key).Scan(&result)
	if err == sql.ErrNoRows || !result.Valid {
		// This is not an error to us since it just means the lock isn't held
		return nil
	}

Into

// hasLock will check if a lock is held by checking the current lock id against our known ID.
func (i *MySQLHALock) hasLock(key string) error {
	var result sql.NullInt64
	err := i.in.statements["used_lock"].QueryRow(key).Scan(&result)
	if err == sql.ErrNoRows || !result.Valid {
		i.logger.Warn("I am now in a deadlock state", "err", err, "result.Valid", result.Valid)
		// This is not an error to us since it just means the lock isn't held
		return nil
	}

Compile and deploy on both Vault nodes, then start and unseal them.
Repeat the steps described in To Reproduce.

The reason for the deadlock occurring

This deadlock occurs because the function hasLock() does not give a clear answer to the question: "do I still own the lock?"

The calling function, monitorLock(), is stuck in a never-ending loop until hasLock() returns an error.

Quite odd, but just a fact:

	err := i.in.statements["used_lock"].QueryRow(key).Scan(&result)

does not seem to return an error if the connection is lost.
Nor does this statement try to re-establish a broken connection.

Hence, once the connection is lost, hasLock() is called every 5 seconds, with the same outcome each time.
monitorLock() enters an endless loop, wrongly claiming to still be the master.
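
To illustrate one way out of this state, here is a minimal sketch of a possible fix. It is not the actual change proposed in PR #11320, and the i.in.client field is an assumption about how the backend exposes its *sql.DB; the idea is simply to surface a broken connection as an error so that monitorLock() stops advertising leadership and the node can retry lock acquisition.

// hasLock will check if a lock is held by checking the current lock id against our known ID.
func (i *MySQLHALock) hasLock(key string) error {
	var result sql.NullInt64
	err := i.in.statements["used_lock"].QueryRow(key).Scan(&result)
	if err == sql.ErrNoRows || !result.Valid {
		// Before concluding "the lock simply isn't held", verify that the
		// connection itself is still alive. If it is not, return an error so
		// the caller can step down and retry lock acquisition.
		// (i.in.client is assumed to be the backend's *sql.DB; "fmt" would
		// need to be in the file's imports.)
		if pingErr := i.in.client.Ping(); pingErr != nil {
			return fmt.Errorf("lost connection to HA backend: %w", pingErr)
		}
		// The connection is fine and the lock really isn't held.
		return nil
	}
	// (the remainder of the original function is unchanged)

Ping() forces the driver to actually exercise the connection, which the cached prepared statement apparently does not do here; any equivalent liveness check would serve the same purpose.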

@thingstad
Author

What does it take to have someone assess/merge the supplied pull request?
If any information is missing from my side, please be so kind as to let me know.

Cheers :-)

@chris-ng-scmp

We have the same issue; every day at the same time we see the same slow query:

SELECT GET_LOCK('core/lock', 2147483647), IS_USED_LOCK('core/lock')

The duration is around 172791s.

@aureliar8

We got the same issue: after a connection loss to MariaDB, several nodes claimed they were the master.

@GrahamDahlsveen

@sgmiller Any chance of getting this prioritized? This effectively breaks key redundancy functionality of Vault for users of this storage back-end.
The author of this issue has created a short fix in PR #11320.
