HA active node information is never updated when both nodes are under VIP #17357

Closed
shaj13 opened this issue Sep 29, 2022 · 6 comments
Labels
core/ha specific to high-availability

Comments

@shaj13
Contributor

shaj13 commented Sep 29, 2022

Describe the bug
HA active node information is never updated when both nodes' cluster and API addresses point to a virtual IP.
The core periodic leader refresh keeps refreshing the active node information, but it never updates or caches clusterLeaderParams because of the check at https://github.com/hashicorp/vault/blob/main/vault/ha.go#L193,
so when a client sends an HTTP request to a standby node it returns local node not active but active cluster node not found.
In v1.9.4 the cluster could update and cache clusterLeaderParams even if the active and standby nodes shared the same cluster and API address; with v1.11.X this is broken.
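
For context, a paraphrased sketch of the check in question (vault/ha.go around L193; the exact source may differ between versions). Because both nodes advertise the VIP, every standby sees the active node's advertisement as pointing to itself and returns early, so clusterLeaderParams is never populated:

	// Paraphrased from vault/ha.go, not the verbatim source: the standby
	// bails out whenever the advertised addresses match its own, before
	// the leader params are ever cached.
	if adv.ClusterAddr == c.ClusterAddr() && adv.RedirectAddr == c.redirectAddr {
		return false, "", "", nil
	}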

To Reproduce

  1. Run two Vault servers configured with the same cluster and API address
  2. Send HTTP requests to the Vault API
  3. Observe the error -> local node not active but active cluster node not found

Expected behavior
The HA Leader func (https://github.com/hashicorp/vault/blob/main/vault/ha.go#L92) should update clusterLeaderParams even when the active and standby nodes share the same address, at least the first time.

Vault server configuration file(s):

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/app/vault.pem"
  tls_key_file  = "/app/vault-key.pem"
  tls_disable   = 0
}

api_addr     = "https://<vip>:8200"
cluster_addr = "https://<vip>:8201"

storage "postgresql" {
  path           = ""
  connection_url = "postgres://vault@data-store:5432/vault?sslmode=verify-ca&sslrootcert=/app/ca.pem&sslcert=/app/vault.pem&sslkey=/app/vault-key.pem"
  ha_enabled     = true
}

@mpalmi mpalmi added the core/ha specific to high-availability label Sep 29, 2022
@shaj13
Contributor Author

shaj13 commented Sep 29, 2022

https://github.com/hashicorp/vault/blob/main/vault/ha.go#L193 should look like

	// At the top of this function we return early when we're the active node.
	// If we're not the active node, and there's a stale advertisement pointing
	// to ourself, there's no point in paying any attention to it.  And by
	// disregarding it, we can avoid a panic in raft tests using the Inmem network
	// layer when we try to connect back to ourself.
	if adv.ClusterAddr == c.ClusterAddr() && adv.RedirectAddr == c.redirectAddr && clusterLeaderParams != nil {
		return false, "", "", nil
	}

@mpalmi if the above solution is applicable, I would be happy to raise a PR.
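
For completeness, a hypothetical sketch of why the missing cache surfaces as the reported error (illustrative only; ClusterLeaderParams and forwardToActive below are stand-ins, not Vault's actual forwarding code). With the unpatched check, clusterLeaderParams stays nil on a standby whose addresses match the advertisement, so the standby cannot locate the active node even though a leader exists:

	package main

	import (
		"errors"
		"fmt"
	)

	// ClusterLeaderParams stands in for Vault's cached leader info
	// (hypothetical type for illustration).
	type ClusterLeaderParams struct{ ClusterAddr string }

	func forwardToActive(p *ClusterLeaderParams) error {
		if p == nil {
			// Surfaces to the client as:
			// "local node not active but active cluster node not found"
			return errors.New("active cluster node not found")
		}
		return nil // otherwise dial p.ClusterAddr and proxy the request
	}

	func main() {
		fmt.Println(forwardToActive(nil))
	}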

@shaj13 shaj13 closed this as completed Sep 29, 2022
@shaj13 shaj13 reopened this Sep 29, 2022
@ncabatoff
Collaborator

Hi @shaj13,

Can you explain more about your setup? How is it that both nodes in your cluster can share the same api and cluster addrs?

@shaj13
Contributor Author

shaj13 commented Sep 29, 2022

> Hi @shaj13,
>
> Can you explain more about your setup? How is it that both nodes in your cluster can share the same api and cluster addrs?

Two nodes (active and standby) run k3s, and one of the pods is Vault, alongside RDS PostgreSQL.

In front of the two nodes there is a virtual IP,
so when there is a failure the traffic is redirected to the active node; the actual IP is not shared with the client nor with the system.

However, the Vault cluster and API addresses are configured with the VIP.

@ncabatoff
Collaborator

I can see why you might want to set RedirectAddr the same on both nodes in that case. I don't think they should share the same ClusterAddr though. How do you avoid self-dialing in that case?

@shaj13
Contributor Author

shaj13 commented Sep 29, 2022

> I can see why you might want to set RedirectAddr the same on both nodes in that case. I don't think they should share the same ClusterAddr though. How do you avoid self-dialing in that case?

@ncabatoff good question. The state machine of Vault's HA is either standby or active.
When a node is standby, request forwarding sends requests to the active node, and the VIP points to the active node's IP.
When a failover occurs, the newly promoted node clears request forwarding, and the VIP switches to point at the node that was standby. If a node started dialing itself, that would clearly be a state-machine bug, since a request should never be forwarded to the node itself.
As mentioned, the VIP always points to the active node, so when the leader changes the VIP traffic changes too.

Anyway, the VIP is wrapped with DNS, so there will be a DNS lookup to get the actual VIP.
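
To illustrate that last point, a minimal hypothetical sketch (shouldForward and its parameters are illustrative names, not Vault's actual forwarding code): a standby resolves the DNS name in front of the VIP and only forwards when it resolves to an address other than its own, so it never dials itself.

	package main

	import (
		"fmt"
		"net"
	)

	// shouldForward is a hypothetical guard: resolve the DNS name in
	// front of the VIP and forward only when it points at another node.
	func shouldForward(vipHost, localIP string) (bool, error) {
		addrs, err := net.LookupHost(vipHost) // DNS lookup behind the VIP
		if err != nil {
			return false, err
		}
		for _, a := range addrs {
			if a == localIP {
				return false, nil // VIP points at us: we are the active node
			}
		}
		return true, nil // VIP points elsewhere: forward to the active node
	}

	func main() {
		ok, err := shouldForward("vault.example.internal", "10.0.0.2")
		fmt.Println(ok, err)
	}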

@shaj13 shaj13 closed this as completed Dec 9, 2022