Consul: add preflight checks for Envoy bootstrap #23381

Merged on Jun 27, 2024 (5 commits)

Commits on Jun 26, 2024

  1. consul: add preflight check for created ACL tokens

    Nomad creates a Consul ACL token for each service for registering it in Consul
    or bootstrapping the Envoy proxy (for service mesh workloads). Nomad always
    talks to the local Consul agent and never directly to the Consul servers. But
    the local Consul agent talks to the Consul servers in stale consistency mode to
    reduce load on the servers. This can result in the Nomad client making the Envoy
    bootstrap request with a token that has not yet replicated to the follower that
    the local client is connected to. This request gets a 404 on the ACL token and
    that negative entry gets cached, preventing any retries from succeeding.
    
    To work around this, we'll use a method described by our friends over on
    `consul-k8s` where after creating the service token we try to read the token
    from the local agent in stale consistency mode (which prevents a failed read
    from being cached). This cannot completely eliminate this source of error
    because it's possible that Consul cluster replication is unhealthy at the time
    we need it, but this should make Envoy bootstrap significantly more robust.
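
    The retry loop described above can be sketched roughly as follows. This is
    a minimal illustration, not the code from this changeset: `readTokenFunc`,
    `errTokenNotFound`, and `preflightToken` are hypothetical names, and the
    stale read against the local agent is abstracted behind a function value.
    Retrying only on "not found" is an assumption of the sketch.

    ```go
    package main

    import (
    	"context"
    	"errors"
    	"fmt"
    	"time"
    )

    // errTokenNotFound stands in for the 404 returned while the token has not
    // yet replicated. Hypothetical; not an error value from the Consul API.
    var errTokenNotFound = errors.New("ACL token not found")

    // readTokenFunc is a hypothetical hook meaning "read this token from the
    // local agent in stale consistency mode".
    type readTokenFunc func(ctx context.Context, secretID string) error

    // preflightToken polls read until the token is visible, a non-retryable
    // error occurs, or the timeout expires.
    func preflightToken(ctx context.Context, read readTokenFunc, secretID string, timeout, base time.Duration) error {
    	ctx, cancel := context.WithTimeout(ctx, timeout)
    	defer cancel()

    	delay := base
    	for {
    		err := read(ctx, secretID)
    		if err == nil {
    			return nil // token is visible; safe to continue to Envoy bootstrap
    		}
    		if !errors.Is(err, errTokenNotFound) {
    			return err // assumption: only "not found" is worth retrying
    		}
    		select {
    		case <-ctx.Done():
    			return fmt.Errorf("token not visible before timeout: %w", ctx.Err())
    		case <-time.After(delay):
    		}
    		delay *= 2 // simple exponential backoff between reads
    	}
    }
    ```

    A caller would pass a closure that performs the stale read and treat a
    non-nil result as "do not attempt Envoy bootstrap yet".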
    
    In this changeset, we add the preflight check after we log in via Workload
    Identity and in the function we use to derive tokens in the legacy
    workflow. We've made the timeouts configurable via node metadata rather
    than the usual static configuration because, in most cases, users should not
    need to touch or even know these values are configurable; the configuration is
    mostly available for testing.
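
    Reading a timeout from node metadata with a built-in default might look
    like the sketch below. The helper name `durationFromMeta` is hypothetical,
    and the metadata keys and default values a caller passes in are
    placeholders; this commit message does not state the real ones.

    ```go
    package main

    import "time"

    // durationFromMeta returns the duration stored under key in the node
    // metadata map, or fallback when the key is absent, unparsable, or not
    // positive. Hypothetical helper for illustration only.
    func durationFromMeta(meta map[string]string, key string, fallback time.Duration) time.Duration {
    	raw, ok := meta[key]
    	if !ok {
    		return fallback
    	}
    	d, err := time.ParseDuration(raw)
    	if err != nil || d <= 0 {
    		return fallback
    	}
    	return d
    }
    ```

    Falling back silently keeps the common case zero-configuration, which
    matches the intent that most users never need to set these values.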
    
    Fixes: #9307
    Fixes: #20516
    Fixes: #10451
    
    Ref: hashicorp/consul-k8s#887
    Ref: https://hashicorp.atlassian.net/browse/NET-10051
    tgross committed Jun 26, 2024
    c2cf0e3
  2. Consul: add preflight check for services before Envoy bootstrap

    Nomad creates a Consul service for service mesh workloads, and this service
    needs to be present in Consul before we can bootstrap the Envoy proxy. Nomad
    always talks to the local Consul agent and never directly to the Consul
    servers. But the local Consul agent talks to the Consul servers in stale
    consistency mode to reduce load on the servers. This can result in the Nomad
    client making the Envoy bootstrap request for a service that has not yet
    replicated to the follower that the local client is connected to. This request
    gets a 404 on the service and that negative entry gets cached, preventing any
    retries from succeeding.
    
    To work around this, we'll query the service from the local Consul agent before
    attempting to bootstrap Envoy. This cannot completely eliminate this source of
    error because it's possible that Consul cluster replication is unhealthy at the
    time we need it, but this should make Envoy bootstrap significantly more robust.
    
    We've made the timeouts configurable via node metadata rather than the
    usual static configuration because, in most cases, users should not need to
    touch or even know these values are configurable; the configuration is mostly
    available for testing.
    tgross committed Jun 26, 2024
    b8579c9
  3. testing: fix backoff sleeper

    When we removed the helper that wrapped `libtime`, we missed wiring up the
    sleeper function that lets us track iterations in testing. In addition, the
    parameters for `testify`'s `Greater` assertion are backwards compared to its
    other assertions, so we accidentally reversed the expectation and value as
    well. This caused a test for the iteration count to pass that should not have,
    but the ultimate effect was harmless because the iterations aren't tracked
    without the correct `libtime` setup anyway. Update the backoff to use the test
    helper field and update it to the current `libtime` API.
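
    The general pattern of an injectable sleeper can be sketched as below. This
    is an illustration using only the standard library, not the `libtime` API
    or the code in this commit; `retrier`, its `sleep` field, and `errNotReady`
    are hypothetical names. If the loop ignores the injected function, a test
    that counts iterations has nothing to observe.

    ```go
    package main

    import (
    	"errors"
    	"time"
    )

    // errNotReady is a placeholder error for operations that should be retried.
    var errNotReady = errors.New("not ready")

    // retrier is a hypothetical backoff helper. The sleep field lets tests
    // replace the real sleep and record each backoff interval.
    type retrier struct {
    	base  time.Duration
    	max   time.Duration
    	sleep func(time.Duration) // nil means time.Sleep
    }

    // run calls op up to attempts times, sleeping with capped exponential
    // backoff between failed attempts, and returns the last error.
    func (r *retrier) run(attempts int, op func() error) error {
    	sleep := r.sleep
    	if sleep == nil {
    		sleep = time.Sleep
    	}
    	var err error
    	delay := r.base
    	for i := 0; i < attempts; i++ {
    		if err = op(); err == nil {
    			return nil
    		}
    		if i == attempts-1 {
    			break // no point sleeping after the final attempt
    		}
    		sleep(delay) // the injected sleeper is what a test observes
    		delay *= 2
    		if delay > r.max {
    			delay = r.max
    		}
    	}
    	return err
    }
    ```

    A test can set `sleep` to a closure that appends to a slice and then assert
    on the slice length, which avoids real waiting and makes the iteration
    count explicit.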
    tgross committed Jun 26, 2024
    fd1d3a5
  4. f038c96
  5. 9c85377