
HA MySQL/MariaDB: Deadlock on leader, if HA DB connection is lost #11319

Open
thingstad opened this issue Apr 8, 2021 · 4 comments

@thingstad

Deadlock on leader if HA DB connection is lost
In a setup with a MySQL/MariaDB HA back-end, if the leader loses its DB connection it has a fair chance of ending up in a deadlock state.
Losing the connection releases the lock, and the standby assumes leadership.
You end up with multiple masters.
The previous leader will still report itself as master, as will the former standby node, which has now grabbed leadership.

To Reproduce
Steps to reproduce the behavior:

  1. Prepare a MySQL/MariaDB instance.
  2. Configure two Vault instances to each use MySQL as their HA back-end.
  3. Start and unseal both Vault instances.
  4. On the MySQL/MariaDB instance, run sudo systemctl restart mariadb to restart the DB and thereby drop the current connections.
  5. Wait a few seconds until the standby assumes leadership.
  6. Issue curl http://node-addr:8200/v1/sys/leader on each Vault node.
  7. Observe multiple masters reported (a small Go sketch that automates steps 6 and 7 follows this list).
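
For steps 6 and 7, the following is a minimal Go sketch (not part of Vault, purely illustrative) that queries /v1/sys/leader on every node and counts how many report is_self = true. The node addresses are placeholders for your own setup.

// splitbraincheck: query /v1/sys/leader on each Vault node and report
// how many of them claim to be the active leader.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type leaderStatus struct {
	IsSelf        bool   `json:"is_self"`
	LeaderAddress string `json:"leader_address"`
}

func main() {
	// Placeholder addresses; replace with your own Vault nodes.
	nodes := []string{"http://10.10.10.2:8200", "http://10.10.10.3:8200"}

	leaders := 0
	for _, addr := range nodes {
		resp, err := http.Get(addr + "/v1/sys/leader")
		if err != nil {
			fmt.Printf("%s: request failed: %v\n", addr, err)
			continue
		}
		var status leaderStatus
		decodeErr := json.NewDecoder(resp.Body).Decode(&status)
		resp.Body.Close()
		if decodeErr != nil {
			fmt.Printf("%s: decode failed: %v\n", addr, decodeErr)
			continue
		}
		fmt.Printf("%s: is_self=%v leader_address=%s\n", addr, status.IsSelf, status.LeaderAddress)
		if status.IsSelf {
			leaders++
		}
	}
	if leaders > 1 {
		fmt.Printf("split brain: %d nodes claim to be the active leader\n", leaders)
	}
}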

Expected behavior
The expectation is that the former leader recovers from having lost leadership, either by trying to grab the lock again or by becoming a standby.

Environment:

  • Vault Server Version: 1.7
  • HA back-end DB: MariaDB 10.5.9

Vault server configuration file(s):

storage "mysql" {
  address = "10.10.10.4:3306"
  database = "vault"
  table = "vault_data"
  username = "vault-username"
  password = "vault-password"
  ha_enabled = "true"
  lock_table = "vault_lock"    
}

listener "tcp" {
  address       = "10.10.10.2:8200"
  tls_disable = 1
}

api_addr = "http://10.10.10.2:8200"
ui = true
log_level = "info"

Additional ways to observe the issue

Adding one line of logging to physical/mysql/mysql.go

Change

// hasLock will check if a lock is held by checking the current lock id against our known ID.
func (i *MySQLHALock) hasLock(key string) error {
	var result sql.NullInt64
	err := i.in.statements["used_lock"].QueryRow(key).Scan(&result)
	if err == sql.ErrNoRows || !result.Valid {
		// This is not an error to us since it just means the lock isn't held
		return nil
	}

Into

// hasLock will check if a lock is held by checking the current lock id against our known ID.
func (i *MySQLHALock) hasLock(key string) error {
	var result sql.NullInt64
	err := i.in.statements["used_lock"].QueryRow(key).Scan(&result)
	if err == sql.ErrNoRows || !result.Valid {
		i.logger.Warn("I am now in a deadlock state", "err", err, "result.Valid", result.Valid)
		// This is not an error to us since it just means the lock isn't held
		return nil
	}

Compile and deploy on both Vault nodes, then start and unseal them.
Repeat the steps described in To Reproduce.

The reason for the deadlock occurring

This deadlock occurs because the function hasLock() does not give a clear answer to the question: "do I still own the lock?"

The calling function, monitorLock(), is stuck in a never-ending loop until hasLock() returns an error.

Quite odd, but just a fact:

	err := i.in.statements["used_lock"].QueryRow(key).Scan(&result)

does not seem to return an error if the connection is lost.
Nor does this statement try to re-establish a broken connection.

Hence, once the connection is lost, hasLock() is called every 5 seconds, with the same outcome each time.
monitorLock() enters an endless loop, wrongly claiming to still be the master.
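
To illustrate one way out of this state, here is a minimal sketch of a possible fix. It is not the actual change proposed in PR #11320, and the i.in.client field is an assumption about how the backend exposes its *sql.DB; the idea is simply to surface a broken connection as an error so that monitorLock() stops advertising leadership and the node can retry lock acquisition.

// hasLock will check if a lock is held by checking the current lock id against our known ID.
func (i *MySQLHALock) hasLock(key string) error {
	var result sql.NullInt64
	err := i.in.statements["used_lock"].QueryRow(key).Scan(&result)
	if err == sql.ErrNoRows || !result.Valid {
		// Before concluding "the lock simply isn't held", verify that the
		// connection itself is still alive. If it is not, return an error so
		// the caller can step down and retry lock acquisition.
		// (i.in.client is assumed to be the backend's *sql.DB; "fmt" would
		// need to be in the file's imports.)
		if pingErr := i.in.client.Ping(); pingErr != nil {
			return fmt.Errorf("lost connection to HA backend: %w", pingErr)
		}
		// The connection is fine and the lock really isn't held.
		return nil
	}
	// (the remainder of the original function is unchanged)

Ping() forces the driver to actually exercise the connection, which the cached prepared statement apparently does not do here; any equivalent liveness check would serve the same purpose.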

@thingstad
Author

What does it take to have someone assess/merge the supplied pull request?
If any information is missing from my side, please be so kind as to let me know.

Cheers :-)

@chris-ng-scmp

We have the same issue; every day at the same time we see the same slow query:

SELECT GET_LOCK('core/lock', 2147483647), IS_USED_LOCK('core/lock')

The duration is around 172791s.

@aureliar8

We got the same issue: after a connection loss to MariaDB, several nodes claimed they were the master.

@GrahamDahlsveen

@sgmiller Any chance of getting this prioritized? This effectively breaks key redundancy functionality of Vault for users of this storage back-end.
The author of this issue has created a short fix in PR #11320.
