Standby nodes lose track of active node with MySQL HA #8522
Comments
Reproduced. So here is the configuration I use to reproduce this: https://github.com/riuvshyn/vault-ha-local
Sometimes the error is different, but in most cases I am getting this one.
+1
Any update? Please let me know if I can provide any other details.
+1, same issue.
Same issue on Vault v1.3.4. I am using etcd.
+1 Same issue on Vault v1.5.2:

$ kubectl exec -ti vault-0 -- vault operator raft list-peers
Error reading the raft cluster configuration: Error making API request.
URL: GET http://127.0.0.1:8200/v1/sys/storage/raft/configuration
Code: 400. Errors:
* missing client token
command terminated with exit code 2
tbook:infras-eks thomas$ vault status
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    1
Threshold                1
Version                  1.5.2
Cluster Name             vault-cluster-f4fe27ad
Cluster ID               foo
HA Enabled               true
HA Cluster               https://vault-0.vault-internal:8201
HA Mode                  active
Raft Committed Index     35
Raft Applied Index       35

The logs indicate that vault-1 and vault-2 have happily joined vault-0. Just can't get it to report with the command.
Seen on 1.4.7 during upgrade to 1.6.3. 8 minutes of errors.
Experiencing this randomly during normal operation, on version 1.7.3.
Experiencing the same after periodic etcd defragmentation.
This is normal behavior: vault operator raft list-peers is an authenticated call, so it fails with "missing client token" when no token is set. PS: the Vault root token should always be revoked after the initial setup in production environments. Check out Vault production hardening for more info.
Hope this helps.
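For reference, a minimal sketch of supplying a token for that call (pod name and token are placeholders, assuming a token with sufficient permissions is available):

# Pass a token explicitly for the single call inside the pod.
kubectl exec -ti vault-0 -- sh -c \
  'VAULT_TOKEN=<your-token> vault operator raft list-peers'

# Or log in interactively first so the CLI caches the token in the container,
# then run the same command without setting VAULT_TOKEN.
kubectl exec -ti vault-0 -- vault login
kubectl exec -ti vault-0 -- vault operator raft list-peers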
There are many people piling on to this bug who aren't using mysql as their storage backend. I recommend they create new issues.
Hello! Is this error still occurring in newer versions of Vault? Please let me know so I can bubble it up appropriately. Thanks!
Vault version: 1.3.1, 1.3.3
HA backend: MySQL
Storage backend: s3
Runtime env: k8s
Replicas: 3
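To make that setup concrete, here is a minimal sketch of a server config with S3 as the storage backend and MySQL as the HA backend, written out via a shell heredoc; all addresses, credentials, and bucket names are placeholders, not taken from this report (the linked vault-ha-local repo has the actual reproduction config):

# Sketch only: placeholder values throughout; check the storage backend docs
# for the exact parameters supported by your Vault version.
cat > vault.hcl <<'EOF'
storage "s3" {
  bucket = "my-vault-bucket"
  region = "us-east-1"
}

ha_storage "mysql" {
  address    = "mysql.example.svc:3306"
  username   = "vault"
  password   = "change-me"
  database   = "vault"
  ha_enabled = "true"
}

listener "tcp" {
  address         = "0.0.0.0:8200"
  cluster_address = "0.0.0.0:8201"
  tls_disable     = 1   # local reproduction only
}

api_addr     = "http://vault-0.vault-internal:8200"    # per-pod value in practice
cluster_addr = "https://vault-0.vault-internal:8201"
EOF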
When leadership is lost and re-election happens, clients get errors:
Vault logs:
Standby replica(1) received a client request and failed with:

Steps to reproduce:
This can be easily reproduced using the vault operator step-down command. Every time I run the step-down command, some clients get the "local node not active but active cluster node not found" error. Another way to reproduce it is a Vault version upgrade, a configuration change, or anything else that causes a Vault restart and a leader change. That makes using Vault in production environments with critical workloads very dangerous, especially in dynamic environments like k8s where pods can be rescheduled at any time.
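To see this from a client's perspective, a rough sketch of the reproduction (pod names, token, and secret path are illustrative; assumes a KV engine mounted at secret/):

# Terminal 1: keep sending requests to a standby pod, which must forward them
# to the active node; forwarding failures during the election show up here.
while true; do
  kubectl exec vault-1 -- sh -c \
    'VAULT_TOKEN=<your-token> vault kv get secret/test' 2>&1 \
    | grep -i 'not active'
  sleep 1
done

# Terminal 2: force a leadership change on the current active node.
kubectl exec vault-0 -- sh -c 'VAULT_TOKEN=<your-token> vault operator step-down'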
The error seems to happen only on standby replicas that are forwarding requests to the leader.
The ex-leader replica(2) does not have any errors. The new leader replica(3) logs are also clean.

Possibly related to #8467.