Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of proxycfg: ensure that an irrecoverable error in proxycfg closes the xds session and triggers a replacement proxycfg watcher into release/1.15.x #16529

Conversation

hc-github-team-consul-core
Copy link
Collaborator

Backport

This PR is auto-generated from #16497 to be assessed for backporting due to the inclusion of the label backport/1.15.

The below text is copied from the body of the original PR.


Description

Receiving an "acl not found" error from an RPC in the agent cache and the streaming/event components will cause any request loops to cease under the assumption that they will never work again if the token was destroyed. This prevents log spam (#14144, #9738).

Unfortunately due to things like:

  • authz requests going to stale servers that may not have witnessed the token creation yet
  • authz requests in a secondary datacenter happening before the tokens get replicated to that datacenter
  • authz requests from a primary TO a secondary datacenter happening before the tokens get replicated to that datacenter

The caller will get an "acl not found" before the token exists, rather than just after. The machinery added above in the linked PRs will kick in and prevent the request loop from looping around again once the tokens actually exist.

For consul-dataplane usages, where xDS is served by the Consul servers rather than the clients ultimately this is not a problem because in that scenario the agent/proxycfg machinery is on-demand and launched by a new xDS stream needing data for a specific service in the catalog. If the watching goroutines are terminated it ripples down and terminates the xDS stream, which CDP will eventually re-establish and restart everything.

For Consul client usages, the agent/proxycfg machinery is ahead-of-time launched at service registration time (called "local" in some of the proxycfg machinery) so when the xDS stream comes in the data is already ready to go. If the watching goroutines terminate it should terminate the xDS stream, but there's no mechanism to re-spawn the watching goroutines. If the xDS stream reconnects it will see no ConfigSnapshot and will not get one again until the client agent is restarted, or the service is re-registered with something changed in it.

This PR fixes a few things in the machinery:

  • there was an inadvertent deadlock in fetching snapshot from the proxycfg machinery by xDS, such that when the watching goroutine terminated the snapshots would never be fetched. This caused some of the xDS machinery to get indefinitely paused and not finish the teardown properly.
  • Every 30s we now attempt to re-insert all locally registered services into the proxycfg machinery.
  • When services are re-inserted into the proxycfg machinery we special case "dead" ones such that we unilaterally replace them rather that doing that conditionally.

Overview of commits

@hc-github-team-consul-core hc-github-team-consul-core force-pushed the backport/proxycfg-dead-on-acl-not-found/locally-right-baboon branch from d0b0ad1 to d2e03c3 Compare March 3, 2023 20:28
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Auto approved Consul Bot automated PR

@github-actions github-actions bot added the theme/config Relating to Consul Agent configuration, including reloading label Mar 3, 2023
@hc-github-team-consul-core hc-github-team-consul-core merged commit 388fef8 into release/1.15.x Mar 3, 2023
@hc-github-team-consul-core hc-github-team-consul-core deleted the backport/proxycfg-dead-on-acl-not-found/locally-right-baboon branch March 3, 2023 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/config Relating to Consul Agent configuration, including reloading
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants