HA active node information is never updated when both nodes are under VIP #17357

Closed
shaj13 opened this issue Sep 29, 2022 · 6 comments
Labels
core/ha specific to high-availability

Comments

@shaj13
Contributor

shaj13 commented Sep 29, 2022

Describe the bug
HA active node information is never updated when both nodes' cluster and API addresses point to a virtual IP.
The core periodic leader refresh keeps refreshing the active node information, but it never updates or caches clusterLeaderParams because of the check at https://github.com/hashicorp/vault/blob/main/vault/ha.go#L193,
so when a client sends an HTTP request to a standby node it returns local node not active but active cluster node not found.
In v1.9.4 the cluster could update and cache clusterLeaderParams even if the active and standby nodes shared the same cluster and API address; with v1.11.X this is broken.
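
For context, a paraphrased sketch of the check in question (vault/ha.go around L193; the exact source may differ between versions). Because both nodes advertise the VIP, every standby sees the active node's advertisement as pointing to itself and returns early, so clusterLeaderParams is never populated:

	// Paraphrased from vault/ha.go, not the verbatim source: the standby
	// bails out whenever the advertised addresses match its own, before
	// the leader params are ever cached.
	if adv.ClusterAddr == c.ClusterAddr() && adv.RedirectAddr == c.redirectAddr {
		return false, "", "", nil
	}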

To Reproduce

  1. Run two Vault servers configured with the same cluster and API address
  2. Send HTTP requests to the Vault API
  3. Observe the error -> local node not active but active cluster node not found

Expected behavior
The HA Leader func (https://github.com/hashicorp/vault/blob/main/vault/ha.go#L92) should update clusterLeaderParams even when the active and standby nodes share the same address, at least the first time.

Vault server configuration file(s):

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/app/vault.pem"
  tls_key_file  = "/app/vault-key.pem"
  tls_disable   = 0
}

api_addr     = "https://<vip>:8200"
cluster_addr = "https://<vip>:8201"

storage "postgresql" {
  path           = ""
  connection_url = "postgres://vault@data-store:5432/vault?sslmode=verify-ca&sslrootcert=/app/ca.pem&sslcert=/app/vault.pem&sslkey=/app/vault-key.pem"
  ha_enabled     = true
}

@mpalmi mpalmi added the core/ha specific to high-availability label Sep 29, 2022
@shaj13
Contributor Author

shaj13 commented Sep 29, 2022

https://github.com/hashicorp/vault/blob/main/vault/ha.go#L193 should look like

	// At the top of this function we return early when we're the active node.
	// If we're not the active node, and there's a stale advertisement pointing
	// to ourself, there's no point in paying any attention to it.  And by
	// disregarding it, we can avoid a panic in raft tests using the Inmem network
	// layer when we try to connect back to ourself.
	if adv.ClusterAddr == c.ClusterAddr() && adv.RedirectAddr == c.redirectAddr && clusterLeaderParams != nil {
		return false, "", "", nil
	}

@mpalmi if the above solution is applicable, I would be happy to raise a PR.
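
For completeness, a hypothetical sketch of why the missing cache surfaces as the reported error (illustrative only; ClusterLeaderParams and forwardToActive below are stand-ins, not Vault's actual forwarding code). With the unpatched check, clusterLeaderParams stays nil on a standby whose addresses match the advertisement, so the standby cannot locate the active node even though a leader exists:

	package main

	import (
		"errors"
		"fmt"
	)

	// ClusterLeaderParams stands in for Vault's cached leader info
	// (hypothetical type for illustration).
	type ClusterLeaderParams struct{ ClusterAddr string }

	func forwardToActive(p *ClusterLeaderParams) error {
		if p == nil {
			// Surfaces to the client as:
			// "local node not active but active cluster node not found"
			return errors.New("active cluster node not found")
		}
		return nil // otherwise dial p.ClusterAddr and proxy the request
	}

	func main() {
		fmt.Println(forwardToActive(nil))
	}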

@shaj13 shaj13 closed this as completed Sep 29, 2022
@shaj13 shaj13 reopened this Sep 29, 2022
@ncabatoff
Collaborator

Hi @shaj13,

Can you explain more about your setup? How is it that both nodes in your cluster can share the same api and cluster addrs?

@shaj13
Contributor Author

shaj13 commented Sep 29, 2022

> Hi @shaj13,
>
> Can you explain more about your setup? How is it that both nodes in your cluster can share the same api and cluster addrs?

Two nodes (active and standby) run k3s, and one of the pods is Vault, alongside RDS PostgreSQL.

In front of the two nodes there is a virtual IP,
so when there is a failure the traffic is redirected to the active node; the actual IP is not shared with the client nor with the system.

However, the Vault cluster and API addresses are configured with the VIP.

@ncabatoff
Collaborator

I can see why you might want to set RedirectAddr the same on both nodes in that case. I don't think they should share the same ClusterAddr though. How do you avoid self-dialing in that case?

@shaj13
Contributor Author

shaj13 commented Sep 29, 2022

> I can see why you might want to set RedirectAddr the same on both nodes in that case. I don't think they should share the same ClusterAddr though. How do you avoid self-dialing in that case?

@ncabatoff good question. The state machine of Vault's HA is either standby or active.
When a node is standby, request forwarding sends requests to the active node, and the VIP points to the active node's IP.
When a failover occurs, the newly promoted node clears request forwarding, and the VIP switches to point at the node that was standby. If a node started dialing itself, that would clearly be a state-machine bug, since a request should never be forwarded to the node itself.
As mentioned, the VIP always points to the active node, so when the leader changes the VIP traffic changes too.

Anyway, the VIP is wrapped with DNS, so there will be a DNS lookup to get the actual VIP.
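
To illustrate that last point, a minimal hypothetical sketch (shouldForward and its parameters are illustrative names, not Vault's actual forwarding code): a standby resolves the DNS name in front of the VIP and only forwards when it resolves to an address other than its own, so it never dials itself.

	package main

	import (
		"fmt"
		"net"
	)

	// shouldForward is a hypothetical guard: resolve the DNS name in
	// front of the VIP and forward only when it points at another node.
	func shouldForward(vipHost, localIP string) (bool, error) {
		addrs, err := net.LookupHost(vipHost) // DNS lookup behind the VIP
		if err != nil {
			return false, err
		}
		for _, a := range addrs {
			if a == localIP {
				return false, nil // VIP points at us: we are the active node
			}
		}
		return true, nil // VIP points elsewhere: forward to the active node
	}

	func main() {
		ok, err := shouldForward("vault.example.internal", "10.0.0.2")
		fmt.Println(ok, err)
	}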

@shaj13 shaj13 closed this as completed Dec 9, 2022