
Standby nodes lose track of active node with MySQL HA #8522

Open · riuvshyn opened this issue Mar 10, 2020 · 12 comments

Labels: bug, core/ha, storage/mysql, waiting-for-response

Comments

@riuvshyn (Contributor) commented Mar 10, 2020

Versions: 1.3.1, 1.3.3
HA backend: MySQL
Storage backend: S3
Runtime environment: Kubernetes
Replicas: 3
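For context, a split configuration of that shape looks roughly like the following (a minimal sketch with placeholder bucket, credentials, and addresses; the real deployment runs as three replicas in Kubernetes):

cat > /vault/config/vault.hcl <<'EOF'
# Data at rest lives in S3; HA coordination (leader election / locks) goes through MySQL.
storage "s3" {
  bucket = "example-vault-data"
  region = "us-east-1"
}

ha_storage "mysql" {
  address    = "mysql.example.svc:3306"
  username   = "vault"
  password   = "example-password"
  ha_enabled = "true"
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_disable = "true"
}

api_addr     = "http://vault-0.example:8200"
cluster_addr = "http://vault-0.example:8201"
EOF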

When leadership is lost and a re-election happens, clients get errors:

local node not active but active cluster node not found

Vault logs:

Standby replica (1) received a client request and failed with:

2020-03-10T17:01:55.901Z [INFO]  core: entering standby mode
2020-03-10T17:01:55.907Z [INFO]  core: vault is unsealed
2020-03-10T17:01:55.907Z [INFO]  core: unsealed with stored keys: stored_keys_used=1
2020-03-10T17:06:09.206Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = transport is closing"
2020-03-10T17:06:09.206Z [ERROR] core: forward request error: error="error during forwarding RPC request"

Steps to reproduce:

  1. This can easily be reproduced with the vault operator step-down command.
    Every time I run step-down, some clients get the local node not active but active cluster node not found error (see the sketch after this list).

  2. Another way to reproduce it is a Vault version upgrade, a configuration change, or anything else that restarts Vault and changes the leader. That makes using Vault for critical production workloads very risky, especially in dynamic environments like Kubernetes where pods can be rescheduled at any time.
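To force a re-election and watch which replica becomes active, something like the following can be used (a sketch; the addresses and token are placeholders for the three replicas):

# Force the current leader to step down.
VAULT_ADDR=http://vault-0.example:8200 VAULT_TOKEN=${VAULT_TOKEN} vault operator step-down

# Check each replica; "HA Mode" should flip to active on exactly one of them.
for n in 0 1 2; do
  VAULT_ADDR=http://vault-${n}.example:8200 vault status | grep "HA Mode"
done

Reads routed through a standby during that window are what fail with the error above.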

That error seems to happen only on standby replicas that are forwarding requests to the leader.

The ex-leader, replica (2), does not show any errors:

2020-03-10T17:03:06.226Z [INFO]  identity: entities restored
2020-03-10T17:03:06.236Z [INFO]  identity: groups restored
2020-03-10T17:03:06.354Z [INFO]  core: post-unseal setup complete
2020-03-10T17:03:06.513Z [INFO]  expiration: lease restore complete
2020-03-10T17:06:08.705Z [WARN]  core: stepping down from active operation to standby
2020-03-10T17:06:08.705Z [INFO]  core: pre-seal teardown starting
2020-03-10T17:06:09.205Z [INFO]  rollback: stopping rollback manager
2020-03-10T17:06:09.206Z [INFO]  core: pre-seal teardown complete

The new leader, replica (3), also has clean logs...

possibly related to #8467

@catsby added the bug and core/ha labels on Mar 10, 2020
@riuvshyn (Contributor, Author) commented

reproduced on 1.3.3

@riuvshyn (Contributor, Author) commented Apr 15, 2020

Reproduced on 1.4.0. Any update on this?
This bug causes downtime for every Vault restart / upgrade.
I can share configuration to reproduce this if that could help...

Here is the configuration I use to reproduce this: https://github.com/riuvshyn/vault-ha-local
It runs docker-compose with three Vault replicas, a MySQL database, and an Nginx frontend as the load balancer.
There is no automation for init & unseal, so those are manual steps.
Once Vault is unsealed, run the test.sh script, which just runs VAULT_ADDR=http://localhost:8080 VAULT_TOKEN=${VAULT_TOKEN} vault policy read default in a loop. While it is running, execute a step-down like this: VAULT_ADDR=http://localhost:8080 VAULT_TOKEN=${VAULT_TOKEN} vault operator step-down
After executing the step-down command, the test.sh script stops with an error like:

Error reading policy named default: Error making API request.

URL: GET http://localhost:8080/v1/sys/policies/acl/default
Code: 500. Errors:

* local node not active but active cluster node not found

Sometimes the error is different, but in most cases I get this one.
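For reference, the read loop in test.sh boils down to something like this (a sketch; the actual script is in the linked repo):

#!/usr/bin/env bash
# Hit Vault through the Nginx load balancer in a loop and surface any failure.
while true; do
  VAULT_ADDR=http://localhost:8080 VAULT_TOKEN=${VAULT_TOKEN} \
    vault policy read default > /dev/null || echo "request failed at $(date)"
  sleep 0.5
done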

@pratiklotia commented Jul 15, 2020

+1
I'm facing the same issue on 1.4.0, except I'm getting it on Raft integrated storage.

[root@vault1 ~]# vault operator raft list-peers
Error reading the raft cluster configuration: Error making API request.

URL: GET http://10.0.0.1:8200/v1/sys/storage/raft/configuration
Code: 500. Errors:

* local node not active but active cluster node not found

Any update? Please let me know if I can provide any other details.

@sig-abyreddy commented

+1

Same issue on 1.2.4

@arharikris commented

Same issue on Vault v1.3.4. I am using etcd.

@todd-dsm commented

+1 Same issue on Vault v1.5.2

$ kubectl exec -ti vault-0 -- vault operator raft list-peers
Error reading the raft cluster configuration: Error making API request.

URL: GET http://127.0.0.1:8200/v1/sys/storage/raft/configuration
Code: 400. Errors:

* missing client token
command terminated with exit code 2
tbook:infras-eks thomas$ vault status 
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    1
Threshold                1
Version                  1.5.2
Cluster Name             vault-cluster-f4fe27ad
Cluster ID               foo
HA Enabled               true
HA Cluster               https://vault-0.vault-internal:8201
HA Mode                  active
Raft Committed Index     35
Raft Applied Index       35

The logs indicate that vault-1 and vault-2 have happily joined vault-0. Just can't get it to report with the command.

@Throckmortra commented

Seen on 1.4.7 during an upgrade to 1.6.3.

8 minutes of errors.

@derekwilliamsliquidx commented

Experiencing this randomly during normal operation on version 1.7.3:
2021-08-11T15:58:12.354Z [WARN] core: leadership lost, stopping active operation

@adityagu0910 commented

Experiencing the same after periodic etcd defragmentation.
I am using Vault v1.2.3.

@othman-essabir commented Dec 5, 2021

@todd-dsm

This is normal behavior: list-peers is an authenticated call, so you need to export a Vault token before running it (see below).
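For example (the token value is a placeholder; any token permitted to read the raft configuration works):

export VAULT_TOKEN="s.xxxxxxxxxxxx"
vault operator raft list-peers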

PS: the Vault root token should always be revoked after the initial setup in production environments.

Check out Vault production hardening for more info:

Avoid Root Tokens. Vault provides a root token when it is first initialized. This token should be used to setup the system initially, particularly setting up auth methods so that users may authenticate. We recommend treating Vault configuration as code, and using version control to manage policies. Once setup, the root token should be revoked to eliminate the risk of exposure. Root tokens can be generated when needed, and should be revoked as soon as possible.
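In practice that boils down to something like this (a sketch; the token is a placeholder, and generate-root needs a quorum of unseal/recovery key holders to complete):

# Revoke the initial root token once setup is done.
vault token revoke <initial-root-token>

# If a root token is ever needed again, generate a new one on demand.
vault operator generate-root -init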

Hope this helps

@ncabatoff (Collaborator) commented

There are many people piling on to this bug who aren't using MySQL as their storage backend; I recommend they create new issues. The local node not active but active cluster node not found error is generic and doesn't indicate a bug. It just means what it says: the local node is unable to determine which node is active. The mechanism for determining the active node is very storage-backend-specific, so there's no point in tracking the problems people are having with different storage backends in the same GitHub issue.

@ncabatoff changed the title from "Vault HA bug" to "Standby nodes lose track of active node with MySQL HA" on Jan 6, 2022
@hsimon-hashicorp (Contributor) commented

Hello! Is this error still occurring in newer versions of Vault? Please let me know so I can bubble it up appropriately. Thanks!
