
Standby nodes lose track of active node with MySQL HA #8522

Open · riuvshyn opened this issue Mar 10, 2020 · 12 comments

Labels: bug, core/ha, storage/mysql, waiting-for-response

Comments

@riuvshyn (Contributor) commented Mar 10, 2020

Versions: 1.3.1, 1.3.3
HA backend: MySQL
Storage backend: S3
Runtime environment: Kubernetes
Replicas: 3
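For context, a split configuration of that shape looks roughly like the following (a minimal sketch with placeholder bucket, credentials, and addresses; the real deployment runs as three replicas in Kubernetes):

cat > /vault/config/vault.hcl <<'EOF'
# Data at rest lives in S3; HA coordination (leader election / locks) goes through MySQL.
storage "s3" {
  bucket = "example-vault-data"
  region = "us-east-1"
}

ha_storage "mysql" {
  address    = "mysql.example.svc:3306"
  username   = "vault"
  password   = "example-password"
  ha_enabled = "true"
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_disable = "true"
}

api_addr     = "http://vault-0.example:8200"
cluster_addr = "http://vault-0.example:8201"
EOF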

When leadership is lost and a re-election happens, clients get errors:

local node not active but active cluster node not found

Vault logs:

Standby replica (1) received a client request and failed with:

2020-03-10T17:01:55.901Z [INFO]  core: entering standby mode
2020-03-10T17:01:55.907Z [INFO]  core: vault is unsealed
2020-03-10T17:01:55.907Z [INFO]  core: unsealed with stored keys: stored_keys_used=1
2020-03-10T17:06:09.206Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = transport is closing"
2020-03-10T17:06:09.206Z [ERROR] core: forward request error: error="error during forwarding RPC request"

Steps to reproduce:

  1. This can easily be reproduced with the vault operator step-down command.
    Every time I run step-down, some clients get the local node not active but active cluster node not found error (see the sketch after this list).

  2. Another way to reproduce it is a Vault version upgrade, a configuration change, or anything else that restarts Vault and changes the leader. That makes using Vault for critical production workloads very risky, especially in dynamic environments like Kubernetes where pods can be rescheduled at any time.
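To force a re-election and watch which replica becomes active, something like the following can be used (a sketch; the addresses and token are placeholders for the three replicas):

# Force the current leader to step down.
VAULT_ADDR=http://vault-0.example:8200 VAULT_TOKEN=${VAULT_TOKEN} vault operator step-down

# Check each replica; "HA Mode" should flip to active on exactly one of them.
for n in 0 1 2; do
  VAULT_ADDR=http://vault-${n}.example:8200 vault status | grep "HA Mode"
done

Reads routed through a standby during that window are what fail with the error above.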

That error seems to happen only on standby replicas that are forwarding requests to the leader.

The ex-leader, replica (2), does not show any errors:

2020-03-10T17:03:06.226Z [INFO]  identity: entities restored
2020-03-10T17:03:06.236Z [INFO]  identity: groups restored
2020-03-10T17:03:06.354Z [INFO]  core: post-unseal setup complete
2020-03-10T17:03:06.513Z [INFO]  expiration: lease restore complete
2020-03-10T17:06:08.705Z [WARN]  core: stepping down from active operation to standby
2020-03-10T17:06:08.705Z [INFO]  core: pre-seal teardown starting
2020-03-10T17:06:09.205Z [INFO]  rollback: stopping rollback manager
2020-03-10T17:06:09.206Z [INFO]  core: pre-seal teardown complete

The new leader, replica (3), also has clean logs...

possibly related to #8467

@catsby added the bug and core/ha labels on Mar 10, 2020
@riuvshyn (Contributor, Author) commented

reproduced on 1.3.3

@riuvshyn (Contributor, Author) commented Apr 15, 2020

Reproduced on 1.4.0. Any update on this?
This bug causes downtime for every Vault restart / upgrade.
I can share configuration to reproduce this if that could help...

Here is the configuration I use to reproduce this: https://github.com/riuvshyn/vault-ha-local
It runs docker-compose with three Vault replicas, a MySQL database, and an Nginx frontend as the load balancer.
There is no automation for init & unseal, so those are manual steps.
Once Vault is unsealed, run the test.sh script, which just runs VAULT_ADDR=http://localhost:8080 VAULT_TOKEN=${VAULT_TOKEN} vault policy read default in a loop. While it is running, execute a step-down like this: VAULT_ADDR=http://localhost:8080 VAULT_TOKEN=${VAULT_TOKEN} vault operator step-down
After executing the step-down command, the test.sh script stops with an error like:

Error reading policy named default: Error making API request.

URL: GET http://localhost:8080/v1/sys/policies/acl/default
Code: 500. Errors:

* local node not active but active cluster node not found

Sometimes the error is different, but in most cases I get this one.
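For reference, the read loop in test.sh boils down to something like this (a sketch; the actual script is in the linked repo):

#!/usr/bin/env bash
# Hit Vault through the Nginx load balancer in a loop and surface any failure.
while true; do
  VAULT_ADDR=http://localhost:8080 VAULT_TOKEN=${VAULT_TOKEN} \
    vault policy read default > /dev/null || echo "request failed at $(date)"
  sleep 0.5
done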

@pratiklotia commented Jul 15, 2020

+1
I'm facing the same issue on 1.4.0, except I'm getting it on Raft integrated storage.

[root@vault1 ~]# vault operator raft list-peers
Error reading the raft cluster configuration: Error making API request.

URL: GET http://10.0.0.1:8200/v1/sys/storage/raft/configuration
Code: 500. Errors:

* local node not active but active cluster node not found

Any update? Please let me know if I can provide any other details.

@sig-abyreddy commented

+1

Same issue on 1.2.4

@arharikris commented

Same issue on Vault v1.3.4. I am using etcd.

@todd-dsm commented

+1 Same issue on Vault v1.5.2

$ kubectl exec -ti vault-0 -- vault operator raft list-peers
Error reading the raft cluster configuration: Error making API request.

URL: GET http://127.0.0.1:8200/v1/sys/storage/raft/configuration
Code: 400. Errors:

* missing client token
command terminated with exit code 2
tbook:infras-eks thomas$ vault status 
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    1
Threshold                1
Version                  1.5.2
Cluster Name             vault-cluster-f4fe27ad
Cluster ID               foo
HA Enabled               true
HA Cluster               https://vault-0.vault-internal:8201
HA Mode                  active
Raft Committed Index     35
Raft Applied Index       35

The logs indicate that vault-1 and vault-2 have happily joined vault-0. Just can't get it to report with the command.

@Throckmortra commented

Seen on 1.4.7 during an upgrade to 1.6.3.

8 minutes of errors.

@derekwilliamsliquidx commented

Experiencing this randomly during normal operation on version 1.7.3:
2021-08-11T15:58:12.354Z [WARN] core: leadership lost, stopping active operation

@adityagu0910 commented

Experiencing the same after periodic etcd defragmentation.
I am using Vault v1.2.3.

@othman-essabir commented Dec 5, 2021

@todd-dsm

This is normal behavior: list-peers is an authenticated call, so you need to export a Vault token before running it (see below).
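For example (the token value is a placeholder; any token permitted to read the raft configuration works):

export VAULT_TOKEN="s.xxxxxxxxxxxx"
vault operator raft list-peers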

PS: the Vault root token should always be revoked after the initial setup in production environments.

Check out Vault production hardening for more info:

Avoid Root Tokens. Vault provides a root token when it is first initialized. This token should be used to setup the system initially, particularly setting up auth methods so that users may authenticate. We recommend treating Vault configuration as code, and using version control to manage policies. Once setup, the root token should be revoked to eliminate the risk of exposure. Root tokens can be generated when needed, and should be revoked as soon as possible.
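In practice that boils down to something like this (a sketch; the token is a placeholder, and generate-root needs a quorum of unseal/recovery key holders to complete):

# Revoke the initial root token once setup is done.
vault token revoke <initial-root-token>

# If a root token is ever needed again, generate a new one on demand.
vault operator generate-root -init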

Hope this helps

@ncabatoff (Collaborator) commented

There are many people piling on to this bug who aren't using MySQL as their storage backend; I recommend they create new issues. The local node not active but active cluster node not found error is generic and doesn't indicate a bug. It just means what it says: the local node is unable to determine which node is active. The mechanism for determining the active node is very storage-backend-specific, so there's no point in tracking the problems people are having with different storage backends in the same GitHub issue.

@ncabatoff changed the title from "Vault HA bug" to "Standby nodes lose track of active node with MySQL HA" on Jan 6, 2022
@hsimon-hashicorp (Contributor) commented

Hello! Is this error still occurring in newer versions of Vault? Please let me know so I can bubble it up appropriately. Thanks!
