Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consul: add preflight checks for Envoy bootstrap #23381

Merged
merged 5 commits into from
Jun 27, 2024

Conversation

tgross
Copy link
Member

@tgross tgross commented Jun 18, 2024

Nomad creates Consul ACL tokens and service registrations to support Consul service mesh workloads, before bootstrapping the Envoy proxy. Nomad always talks to the local Consul agent and never directly to the Consul servers. But the local Consul agent talks to the Consul servers in stale consistency mode to reduce load on the servers. This can result in the Nomad client making the Envoy bootstrap request with a tokens or services that have not yet replicated to the follower that the local client is connected to. This request gets a 404 on the ACL token and that negative entry gets cached, preventing any retries from succeeding.

To workaround this, we'll use a method described by our friends over on consul-k8s where after creating the objects in Consul we try to read them from the local agent in stale consistency mode (which prevents a failed read from being cached). This cannot completely eliminate this source of error because it's possible that Consul cluster replication is unhealthy at the time we need it, but this should make Envoy bootstrap significantly more robust.

This changset adds preflight checks for the objects we create in Consul:

  • We add a preflight check for ACL tokens after we login via via Workload Identity and in the function we use to derive tokens in the legacy workflow. We do this check early because we also want to use this token for registering group services in the allocrunner hooks.
  • We add a preflight check for services right before we bootstrap Envoy in the taskrunner hook, so that we have time for our service client to batch updates to the local Consul agent in addition to the local agent sync.

We've added the timeouts to be configurable via node metadata rather than the usual static configuration because for most cases, users should not need to touch or even know these values are configurable; the configuration is mostly available for testing.

Fixes: #9307
Fixes: #10451
Fixes: #20516

Ref: hashicorp/consul-k8s#887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
Ref: https://hashicorp.atlassian.net/browse/NET-9273
Follow-up: https://hashicorp.atlassian.net/browse/NET-10138


Notes for reviewers:

  • This is an unfortunately large PR as I had to touch a lot of test code and there are separate code paths for supporting the legacy workflow. To help make this a little more reviewable, I've split this into 2 commits with the code changes and their specific tests, a third commit with a test helper fix, and a final commit with docs and changelog.
  • I've got a follow-up described in NET-10138 to reduce the risk of this kind of bug in the future by expanding our Consul cluster in E2E to be a 3-node cluster.
  • The backport to 1.6.x+ent is not going to be clean, as all this code was heavily reworked for WI and multi-cluster support in 1.7.0. So I'm going to manually backport the relevant sections and open a new PR for that in the ENT repo.

@tgross tgross force-pushed the b-consul-token-read-self-on-login branch from 9ab9983 to 754c08f Compare June 18, 2024 18:44
@tgross tgross added this to the 1.8.x milestone Jun 18, 2024
@tgross tgross force-pushed the b-consul-token-read-self-on-login branch from 754c08f to f40bb0d Compare June 18, 2024 19:59
@tgross tgross force-pushed the b-consul-token-read-self-on-login branch from f40bb0d to 2cc483c Compare June 20, 2024 16:15
@tgross tgross added backport/ent/1.6.x+ent Changes are backported to 1.6.x+ent backport/ent/1.7.x+ent Changes are backported to 1.7.x+ent backport/1.8.x backport to 1.8.x release line labels Jun 20, 2024
@tgross tgross force-pushed the b-consul-token-read-self-on-login branch from 5dd70b9 to 7fa895b Compare June 20, 2024 18:50
@tgross tgross force-pushed the b-consul-token-read-self-on-login branch from 7fa895b to 2245a4e Compare June 20, 2024 20:58
@tgross tgross force-pushed the b-consul-token-read-self-on-login branch from 2245a4e to 7db6a39 Compare June 20, 2024 21:00
@tgross tgross force-pushed the b-consul-token-read-self-on-login branch from 7db6a39 to c643203 Compare June 21, 2024 14:03
@tgross tgross force-pushed the b-consul-token-read-self-on-login branch from c643203 to 617a53d Compare June 21, 2024 14:15
@tgross tgross force-pushed the b-consul-token-read-self-on-login branch from 617a53d to de1d286 Compare June 21, 2024 14:25
@tgross tgross changed the title Consul: add preflight check for created ACL tokens Consul: add preflight checks for Envoy bootstrap Jun 21, 2024
@tgross tgross removed the backport/ent/1.6.x+ent Changes are backported to 1.6.x+ent label Jun 21, 2024
@tgross tgross marked this pull request as ready for review June 21, 2024 14:43
@tgross tgross modified the milestones: 1.8.x, 1.8.2 Jun 21, 2024
Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very excited to see envoy bootstrapping get more reliable! Thanks for following this through.

Comment on lines 96 to 102
// these node values will be evaluated at the time we create the hooks, so
// we don't need to worry about them changing out from under us
partition := node.Attributes["consul.partition"]
preflightCheckTimeout := durationFromMeta(
node, "consul.token_preflight_check.timeout", time.Second*10)
preflightCheckBaseInterval := durationFromMeta(
node, "consul.token_preflight_check.base", time.Millisecond*500)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I'm not super familiar with this code, but it appears NewConsulClientFactory will be called from NewClient and therefore the timeout and interval node meta will be captured only during agent startup. Don't the durationFromMeta calls need to happen inside the closure and use client.Node() instead of a captured Node?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good catch. For partitions, we don't expect that to change because it's part of the fingerprint we get from the agent. But metadata can be changed via the node meta command so yeah we'll need to move this into the closure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! (And rebased on main) Also:

I'm not super familiar with this code
The various dependency injection tricks we're doing here to avoid circular references doesn't help. This code all got really janky during the big WI + Consul/Vault refresh. Once we remove the legacy workflow we should be able to refactor this in the process to be a little more comprehensible.

Nomad creates a Consul ACL token for each service for registering it in Consul
or bootstrapping the Envoy proxy (for service mesh workloads). Nomad always
talks to the local Consul agent and never directly to the Consul servers. But
the local Consul agent talks to the Consul servers in stale consistency mode to
reduce load on the servers. This can result in the Nomad client making the Envoy
bootstrap request with a token that has not yet replicated to the follower that
the local client is connected to. This request gets a 404 on the ACL token and
that negative entry gets cached, preventing any retries from succeeding.

To workaround this, we'll use a method described by our friends over on
`consul-k8s` where after creating the service token we try to read the token
from the local agent in stale consistency mode (which prevents a failed read
from being cached). This cannot completely eliminate this source of error
because it's possible that Consul cluster replication is unhealthy at the time
we need it, but this should make Envoy bootstrap significantly more robust.

In this changeset, we add the preflight check after we login via Workload
Identity and in the function we use to derive tokens in the legacy
workflow. We've added the timeouts to be configurable via node metadata rather
than the usual static configuration because for most cases, users should not
need to touch or even know these values are configurable; the configuration is
mostly available for testing.

Fixes: #9307
Fixes: #20516
Fixes: #10451

Ref: hashicorp/consul-k8s#887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
Nomad creates a Consul service for service mesh workloads, and this service
needs to be present in Consul before we can bootstrap the Envoy proxy. Nomad
always talks to the local Consul agent and never directly to the Consul
servers. But the local Consul agent talks to the Consul servers in stale
consistency mode to reduce load on the servers. This can result in the Nomad
client making the Envoy bootstrap request for a service that has not yet
replicated to the follower that the local client is connected to. This request
gets a 404 on the service and that negative entry gets cached, preventing any
retries from succeeding.

To workaround this, we'll query the service from the local Consul agent before
attempting to bootstrap Envoy. This cannot completely eliminate this source of
error because it's possible that Consul cluster replication is unhealthy at the
time we need it, but this should make Envoy bootstrap significantly more robust.

We've added the timeouts to be configurable via node metadata rather than the
usual static configuration because for most cases, users should not need to
touch or even know these values are configurable; the configuration is mostly
available for testing.
When we removed the helper that wrapped `libtime`, we missed wiring up the
sleeper function so that we could track iterations in testing. But the
parameters for `testify`'s `Greater` assertion are backwards compared to its
other assertions so we accidentally reverse the expectation and value as
well. This caused a test for the iteration count to pass that should not have,
but the ultimate effect was harmless because the iterations aren't tracked
without the correct libtime setup anyways. Update the backoff to use the test
helper field and update it to the current `libtime` API.
Copy link
Member

@shoenig shoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/ent/1.7.x+ent Changes are backported to 1.7.x+ent backport/1.8.x backport to 1.8.x release line theme/consul/connect Consul Connect integration theme/consul type/bug
Projects
None yet
3 participants