Deadlock on leader if HA DB connection is lost
In a setup with a MySQL/MariaDB HA back-end, if the leader loses its DB connection, it has a fair chance of ending up in a deadlock state.
Losing the connection releases the lock, and the standby assumes leadership.
You end up with multiple masters:
the previous leader will still report itself as master, as will the former standby node, which has now grabbed leadership.
To Reproduce
Steps to reproduce the behavior:
1. Prepare a MySQL/MariaDB instance.
2. Configure two Vault instances to each use MySQL as their HA back-end.
3. Start and unseal both Vault instances.
4. On the MySQL/MariaDB instance, run sudo systemctl restart mariadb to restart the DB and thus drop the current connections.
5. Wait a few seconds until the standby assumes leadership.
6. Issue curl http://node-addr:8200/v1/sys/leader on each Vault node.
7. Observe multiple masters reported (see the sample output below).
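For illustration, both nodes then claim leadership in their answer; the response looks roughly like this (fields abbreviated, the exact set varies by Vault version):

$ curl -s http://node-addr:8200/v1/sys/leader
{
  "ha_enabled": true,
  "is_self": true,
  "leader_address": "http://node-addr:8200",
  ...
}

A healthy cluster would report "is_self": true on exactly one node.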
Expected behavior
The expectation is that the former leader recovers from having lost leadership,
either by trying to grab leadership once again or by becoming a standby.
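For context: every HA storage backend in Vault implements a lock contract from the physical package (abridged below, roughly as it stood at the time). The channel returned by Lock() is how Vault core learns that leadership was lost; a node whose DB-level lock has silently evaporated, but whose channel never closes, keeps acting as master.

// Vault's physical package HA lock contract (abridged).
type Lock interface {
	// Lock acquires the lock; the returned channel must be closed
	// when leadership is lost. stopCh can interrupt the acquisition attempt.
	Lock(stopCh <-chan struct{}) (<-chan struct{}, error)

	// Unlock releases the lock.
	Unlock() error

	// Value returns whether the lock is held, and its value.
	Value() (bool, string, error)
}

The deadlock described here is precisely this channel never being closed by the MySQL backend.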
Additional ways to observe the issue
Adding one line of logging to physical/mysql/mysql.go:
Change
// hasLock will check if a lock is held by checking the current lock id against our known ID.
func (i *MySQLHALock) hasLock(key string) error {
	var result sql.NullInt64
	err := i.in.statements["used_lock"].QueryRow(key).Scan(&result)
	if err == sql.ErrNoRows || !result.Valid {
		// This is not an error to us since it just means the lock isn't held
		return nil
	}
Into
// hasLock will check if a lock is held by checking the current lock id against our known ID.
func (i *MySQLHALock) hasLock(key string) error {
	var result sql.NullInt64
	err := i.in.statements["used_lock"].QueryRow(key).Scan(&result)
	if err == sql.ErrNoRows || !result.Valid {
		i.logger.Warn("I am now in a deadlock state", "err", err, "result.Valid", result.Valid)
		// This is not an error to us since it just means the lock isn't held
		return nil
	}
Compile and deploy this on both Vault nodes, then start and unseal them.
Repeat the steps described in To Reproduce.
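With that line compiled in, the stuck former leader should log something along these lines every five seconds once the connection is gone (the component name and the exact error text depend on the Vault version and the MySQL driver; go-sql-driver/mysql typically reports a dropped connection as "invalid connection"):

[WARN]  storage.mysql: I am now in a deadlock state: err="invalid connection" result.Valid=false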
The reason for the deadlock occurring
The deadlock occurs because the function hasLock() does not give a clear answer to the question: "do I still own the lock?"
The calling function, monitorLock(), is stuck in a never-ending loop until an error is returned from hasLock().
Quite odd, but just a fact: the used_lock query (the QueryRow(key).Scan(&result) statement) does not seem to return an error if the connection is lost.
Nor does this statement try to re-establish a broken connection.
Hence, once the connection is lost, hasLock() is called every 5 seconds, with the same outcome each time: monitorLock() enters an endless loop, wrongly claiming to be the master.
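To make the failure mode concrete, here is the shape of the polling loop (paraphrased from physical/mysql/mysql.go; the 5-second interval is the one observed above), followed by a minimal sketch of a fix that reorders the checks so a connection error is propagated instead of being masked by !result.Valid. This only illustrates the idea and is not necessarily the exact change made in PR #11320.

// Both functions live in physical/mysql/mysql.go; imports ("database/sql",
// "time") and the MySQLHALock type come from the surrounding file.

// Paraphrased: monitorLock only gives up leadership (by closing
// leaderCh) when hasLock returns a non-nil error. As long as hasLock
// keeps returning nil, the node keeps claiming to be master.
func (i *MySQLHALock) monitorLock(leaderCh chan struct{}) {
	for {
		time.Sleep(5 * time.Second)
		if err := i.hasLock(i.key); err != nil {
			close(leaderCh) // leadership lost; Vault core steps down
			return
		}
	}
}

// Sketch of a fix: check err *before* consulting result.Valid, so that
// a dead connection surfaces as an error instead of looking like
// "the lock isn't held".
func (i *MySQLHALock) hasLock(key string) error {
	var result sql.NullInt64
	err := i.in.statements["used_lock"].QueryRow(key).Scan(&result)
	if err == sql.ErrNoRows {
		// No row at all: the lock isn't held.
		return nil
	}
	if err != nil {
		// E.g. the connection died after the DB restart; propagate it
		// so monitorLock can close leaderCh and trigger a step-down.
		return err
	}
	if !result.Valid {
		// The query (presumably IS_USED_LOCK) returned NULL: the lock
		// isn't held by anyone.
		return nil
	}
	// ...the remaining comparison against our own lock ID is unchanged...
	return nil
}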
What does it take to have someone assess and merge the supplied pull request?
If there is any information missing from my side, please be so kind as to let me know.
@sgmiller Any chance of getting this prioritized? It effectively breaks a key piece of Vault's redundancy for users of this storage type.
The author of this issue has submitted a short PR with a fix: #11320.