
internal/grpc: properly handle Envoy ACK/NACK protocol #1176

Open
davecheney opened this issue Jun 19, 2019 · 18 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@davecheney
Contributor

Envoy's stream protocol states that after each DiscoveryResponse message, the subsequent DiscoveryRequest will contain an ACK or NACK for the previous response. Contour currently doesn't pay attention to this ACK/NACK protocol, but for better compliance with the xDS spec we should.
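For reference, the handshake can be sketched in Go. The `DiscoveryRequest` struct below is a simplified stand-in for the Envoy API type, keeping only the fields relevant to ACK/NACK, and `classify` is an invented helper, not Contour code:

```go
package main

import "fmt"

// DiscoveryRequest is a simplified stand-in for the Envoy xDS type,
// keeping only the fields relevant to the ACK/NACK handshake.
type DiscoveryRequest struct {
	VersionInfo   string // last version the client accepted
	ResponseNonce string // nonce of the response this request answers
	ErrorDetail   string // non-empty when the client is NACKing
}

// classify reports whether a request is the initial request on a stream,
// an ACK of the previous response, or a NACK.
func classify(req *DiscoveryRequest) string {
	switch {
	case req.ResponseNonce == "":
		return "initial" // first request on the stream, nothing to ACK
	case req.ErrorDetail != "":
		return "NACK" // previous response was rejected
	default:
		return "ACK" // previous response was accepted
	}
}

func main() {
	fmt.Println(classify(&DiscoveryRequest{}))
	fmt.Println(classify(&DiscoveryRequest{ResponseNonce: "n1"}))
	fmt.Println(classify(&DiscoveryRequest{ResponseNonce: "n1", ErrorDetail: "bad config"}))
}
```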

@davecheney davecheney added kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Jun 19, 2019
@davecheney davecheney added this to the 0.14.0 milestone Jun 19, 2019
davecheney added a commit to davecheney/contour that referenced this issue Jun 19, 2019
Updates projectcontour#499
Updates projectcontour#273
Updates projectcontour#1176

The XDS spec says that Envoy will always initiate a stream with a
discovery request, and expects the management server to respond with
only one discovery response. After that, Envoy will initiate another
discovery request containing an ACK or a NACK from the previous
response.

Currently Contour ignores the ACK/NACK (this is projectcontour#1176); however, after
inspecting the current code it is evident that we're also not waiting
for Envoy to send the next discovery request.

This PR removes the inner `for {}` loop that would continue to reuse the
initial discovery request until the client disconnected. The previous
code was written at a time when we'd just implemented filtering and it
was possible for the filter to return no results, hence the inner loop
was (incorrectly) trying to loop until there was a result to return.

Huge thanks to @lrouquette who pointed this out.

Signed-off-by: Dave Cheney <dave@cheney.net>
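The loop shape this commit moves to can be sketched roughly as follows. The `stream` interface and `fakeStream` here are hypothetical stand-ins for the generated gRPC stream, not Contour's actual types; the point is that each received request gets exactly one response, and the server then blocks waiting for the next request (which carries the ACK/NACK):

```go
package main

import "fmt"

// Simplified stand-ins for the xDS messages and the gRPC stream.
type DiscoveryRequest struct{ VersionInfo, ResponseNonce string }
type DiscoveryResponse struct{ VersionInfo, Nonce string }

type stream interface {
	Recv() (*DiscoveryRequest, error)
	Send(*DiscoveryResponse) error
}

// serve answers each DiscoveryRequest with exactly one DiscoveryResponse,
// then waits for the next request, instead of looping and resending
// responses for the same request.
func serve(st stream, respond func(*DiscoveryRequest) *DiscoveryResponse) error {
	for {
		req, err := st.Recv() // block for the next request: initial, ACK, or NACK
		if err != nil {
			return err // client disconnected
		}
		if err := st.Send(respond(req)); err != nil {
			return err
		}
	}
}

// fakeStream feeds a fixed sequence of requests and records responses.
type fakeStream struct {
	reqs []*DiscoveryRequest
	sent []*DiscoveryResponse
}

func (f *fakeStream) Recv() (*DiscoveryRequest, error) {
	if len(f.reqs) == 0 {
		return nil, fmt.Errorf("EOF")
	}
	r := f.reqs[0]
	f.reqs = f.reqs[1:]
	return r, nil
}

func (f *fakeStream) Send(r *DiscoveryResponse) error {
	f.sent = append(f.sent, r)
	return nil
}

func main() {
	st := &fakeStream{reqs: []*DiscoveryRequest{
		{},                                      // initial request
		{VersionInfo: "1", ResponseNonce: "n1"}, // ACK of the first response
	}}
	_ = serve(st, func(*DiscoveryRequest) *DiscoveryResponse {
		return &DiscoveryResponse{VersionInfo: "1", Nonce: "n1"}
	})
	fmt.Println(len(st.sent)) // exactly one response per request
}
```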
@davecheney davecheney modified the milestones: 0.14.0, 0.15.0 Jul 19, 2019
@phylake

phylake commented Aug 2, 2019

NACKs happen (at least) when config validation fails. Envoy erroneously throws exceptions in its config constructors (example), which will leak memory for anything allocated in the constructor body (destructors aren't run if an exception is thrown in the ctor body; see Item 10 in More Effective C++ for more detail). In the worst case, Contour continues to send the same resources and gets NACKs back while Envoy leaks memory until it's oom-killed. Whether memory leaks, and how much, depends on which object in the deeply nested config classes throws an exception and where.

Contour needs to mitigate this for xDS types with resource hints by not sending a resource anymore if it's NACKed. I don't know what mitigation it could do for LDS/CDS other than keeping the previous set of resources around so they can be sent if the most recent version is NACKed.
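The first mitigation described above can be sketched as remembering which (resource, version) pairs Envoy rejected and filtering them out of subsequent responses. All names here are invented for illustration, not Contour code:

```go
package main

import "fmt"

// nackFilter remembers NACKed (resource name, version) pairs so the
// server can avoid resending config that Envoy has already rejected.
type nackFilter struct {
	rejected map[string]string // resource name -> last NACKed version
}

func newNackFilter() *nackFilter {
	return &nackFilter{rejected: make(map[string]string)}
}

// recordNACK marks a resource version as rejected by Envoy.
func (f *nackFilter) recordNACK(name, version string) {
	f.rejected[name] = version
}

// shouldSend reports whether this resource version may be sent:
// anything except a version Envoy already NACKed.
func (f *nackFilter) shouldSend(name, version string) bool {
	return f.rejected[name] != version
}

func main() {
	f := newNackFilter()
	f.recordNACK("cluster-a", "v2")
	fmt.Println(f.shouldSend("cluster-a", "v2")) // already rejected: do not resend
	fmt.Println(f.shouldSend("cluster-a", "v3")) // a newer version may fix it
}
```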

davecheney added a commit to davecheney/contour that referenced this issue Aug 19, 2019
Updates projectcontour#1176

The xDS spec says that if Envoy NACKs an update we shouldn't try to send
it again. Given that the update came from k8s and was translated by Contour,
either there's a bug in our translation or a mistake in the data that a user
entered into k8s.

For 0.15, log the NACK at ERROR level in the hope that this will give us
some visibility into what kind of invalid configuration we're
transmitting to Envoy and the frequency of those mistakes.

Ideally we'd be able to trace a NACK all the way back to the originating
record and cordon it. However, given that records like RDS and LDS are
gestalts, cordoning the invalid part of the configuration is rather
complicated. Perhaps the better solution would be to strengthen our
validation to prevent these errors reaching Envoy.

Signed-off-by: Dave Cheney <dave@cheney.net>
davecheney added a commit that referenced this issue Aug 19, 2019
@davecheney
Contributor Author

For 0.15 I've landed #1350 and I plan to leave this issue here for the moment.

Sending invalid updates to Envoy is a problem -- we shouldn't do that, but I don't know what to do with the signal once we did send an invalid update. Trying to cordon the offending update is complicated; it depends on which table it came through. Crashing Contour is a non-starter, so logging is about the least worst thing we can do at the moment. Hopefully logging at error level will give us a signal from the field about which updates are causing the problem, and we'll know whether it's something we can systematically filter out on the Contour side or whether we need to build a framework to trace NACK'd responses from Envoy back to their original k8s object.

@davecheney davecheney removed this from the 0.15.0 milestone Aug 19, 2019
@davecheney davecheney added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Aug 19, 2019
@davecheney davecheney added this to the Backlog milestone Aug 30, 2019
@jpeach jpeach added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Jun 9, 2020
@jpeach
Contributor

jpeach commented Dec 8, 2020

Crashing contour is a non starter, so logging is about the best I can think to do at the moment. Hopefully logging at error level will give us a signal from the field

Have we gotten any signal on this from users?

Generally, we have opted not to allow passing config straight through to Envoy due to the risk of generating errors. This has complicated development and prevented advanced users from taking advantage of some Envoy features directly.

Another option, in addition to logging, would be to increment a metric to make this issue more visible to operators.

Crashing doesn't help clear the underlying condition that causes the NACK. Contour would just reschedule and then send the bad config again. The desirable behavior is to fail safe. In the context of a configuration system, what we would want to do is keep the last good configuration, then bubble up the error associated with the failure to apply the new configuration. The difficulty here is knowing what the last configuration was (or should be), and how to propagate the error (since there's not always a straightforward relationship between Envoy and Kubernetes resources).
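The keep-the-last-good-configuration idea above can be sketched with a couple of hypothetical types (not a proposal for Contour's actual API). The nil result on the no-prior-snapshot path is exactly the hard case: nothing good to fall back to:

```go
package main

import "fmt"

// Snapshot is a stand-in for a full set of xDS resources at one version.
type Snapshot struct{ Version string }

// lastGood keeps the most recent snapshot Envoy accepted, so that a
// NACKed update can be rolled back instead of resent.
type lastGood struct {
	good    *Snapshot // last ACKed snapshot, if any
	pending *Snapshot // snapshot awaiting ACK/NACK
}

// Propose records a newly sent snapshot awaiting a verdict.
func (l *lastGood) Propose(s *Snapshot) { l.pending = s }

// OnACK promotes the pending snapshot to known-good.
func (l *lastGood) OnACK() { l.good, l.pending = l.pending, nil }

// OnNACK discards the pending snapshot and returns the snapshot to
// serve instead: the last good one, or nil if there has never been
// one (e.g. invalid resources were already present at startup).
func (l *lastGood) OnNACK() *Snapshot {
	l.pending = nil
	return l.good
}

func main() {
	l := &lastGood{}
	l.Propose(&Snapshot{Version: "1"})
	l.OnACK()
	l.Propose(&Snapshot{Version: "2"})
	fallback := l.OnNACK()
	fmt.Println(fallback.Version) // roll back to the last accepted version
}
```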

@youngnick
Member

In order to implement what James is talking about, we will need to change the way we build the DAG and send it to Envoy: instead of building a DAG and sending the whole thing, we build a DAG, diff it against the last DAG, and send either the changes (which would allow us to do incremental xDS) or the whole config again (what we do now). In the case that the change is NACKed, we would then have to spray errors everywhere (in every object managed by Contour, in the logs, and in a metric as well), to make it clear that nothing is getting updated until the issue is resolved.

In the meantime, we absolutely should log and increment a metric, to at least allow it to be possible to detect that something has happened. I don't think we can risk doing anything that may cause a NACK though until we have the above solution or something that does the same thing.
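Incrementing a metric on NACK is cheap to sketch. Contour's real metrics are Prometheus-based, so the stdlib `expvar` counter and the `onNACK` helper here are only illustrative:

```go
package main

import (
	"expvar"
	"fmt"
)

// nackTotal counts NACKs received; it is exported at /debug/vars when an
// HTTP server is running. A Prometheus counter would play the same role.
var nackTotal = expvar.NewInt("xds_nack_total")

// onNACK bumps the counter; a real implementation would also log the
// type URL and the error detail from the NACK.
func onNACK(typeURL string) {
	nackTotal.Add(1)
	_ = typeURL
}

func main() {
	onNACK("type.googleapis.com/envoy.config.listener.v3.Listener")
	onNACK("type.googleapis.com/envoy.config.route.v3.RouteConfiguration")
	fmt.Println(nackTotal.Value())
}
```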

@youngnick
Member

youngnick commented Mar 8, 2021

Okay, I'm having a look into this, and I'm going to start by checking that it's still a problem. To do this, I'm going to do something like this:

  • Remove the case-insensitive comparison for hostnames for HTTPProxy (revert Do a case insensitive match when checking for duplicate FQDNs #3231)
  • Make two HTTPProxies, one with UPPERCASE, one with lowercase.
  • Send to Envoy
  • Check we get a NACK (should be logged)
  • Get a config dump
  • Make a change
  • Get a new config dump
  • Fix the problem
  • Get a new config dump

@youngnick
Member

Okay, I ran the above process with both the "contour" and "envoy" xDS engines, and got the following results:

  • For "contour" xDS server, this causes a single error line for each Kubernetes change. Config still takes effect.
  • For "envoy" xDS server, this causes a continuous cycle of rejecting updates. Config still takes effect. This is definitely an operational problem, as gRPC updates are very frequent. Other updates will go through, but the logs for both Envoy and Contour are unusable due to spam.

In both cases, I did not need to restart Envoy to have it pick up new changes.

So, this means that this process is a lot more forgiving than I thought. Which is great! But I think there's still more to investigate here.

@xaleeks xaleeks moved this from Unprioritized to 1.15 release in Contour Project Board Apr 6, 2021
@sunjayBhatia sunjayBhatia moved this from 1.15 release (WIP) to 1.16 release in Contour Project Board Apr 28, 2021
@youngnick youngnick removed their assignment May 10, 2021
@youngnick youngnick moved this from 1.16 release to Parking Lot 1 in Contour Project Board May 25, 2021
@youngnick
Member

This one's waiting on @wrowe to have some bandwidth. Moving to parking lot 1.

@izturn
Member

izturn commented Jan 3, 2023

#4276 says that if we want to support it, we must solve this one first, but after reading all the issues related to this one, I have the following questions:

  1. With envoyproxy/envoy#4787 (implement discovery request rate limiting), Envoy added a new config option that slows down the "while true"-like loop, so it no longer leads to OOM.
  2. @youngnick's comments in the issues, such as:

the behavior is that no updates will be accepted until the incorrect config is removed, at which time Envoy will recover

so when the config becomes valid again, restarting Contour is no longer needed.

After reading all the links above and going through Contour's source code, I don't know what the real problem with NACK is now. It seems that all we need to do is let the user know there is a misconfiguration?

@sunjayBhatia
Member

We have stalled a bit on working on the ACK/NACK problem because, as you say, we've found that we pretty much validate all user input anyway, so we don't send invalid config to Envoy; it is very difficult to generate a NACK at all at the moment.

An ACK is maybe more interesting to figure out: it could allow us to set some status on HTTPProxy/Route etc. that tells users when their config is actually in effect. However, figuring out when all the nodes have ACKed configuration, and making sure updates do not cross each other and are processed sequentially, is not necessarily straightforward.

@izturn
Member

izturn commented Jan 4, 2023

@sunjayBhatia sorry, I don't quite understand what you mean. Please explain it in more detail, thanks.

@sunjayBhatia
Member

when we send a configuration update to a fleet of Envoys, they may return an ACK or NACK to signify whether the update was accepted

Because we heavily validate user input, the configuration we generate in practice is never NACKed at the moment, so we have held off making much change here; it would be adding complication for not much benefit. Figuring out whether a change was ACKed would be useful so we can signify to the user when exactly their configuration has taken effect (useful for something like knative etc.).

However, in the future, if we support WASM modules, Lua scripting, or something else that we cannot fully validate up-front and that Envoy could return a NACK for, we will need a way to signify to users that their resources are invalid. We've again held off on adding these sorts of features and "solving the NACK problem" because of the below:

Figuring out whether a specific change was ACKed or NACKed is not actually straightforward, because of how we generate snapshots of xDS config to send to Envoy: changes in k8s resources basically get stacked on top of each other and sent asynchronously. For example, if we have a change to HTTPProxy A that generates a new snapshot-1 that is NACKed, and then a change to HTTPProxy B that generates snapshot-2 that is also NACKed, we don't currently have a good way to decipher whether the changes to HTTPProxy A/B were each valid or not, since snapshot-2 includes the changes to both resources. Getting this to work is a larger piece of work to modify how we process k8s resource updates and send them along to Envoy, likely introducing some sort of blocking or a complex accounting system, which is a change we have held off on making so far.

@izturn
Member

izturn commented Jan 5, 2023

e.g. if we have say a change to HTTPProxy A that generates a new snapshot-1 that is NACKed, and then a change to HTTPProxy B that generates snapshot-2 that is also NACKed, we don't currently have a good way to decipher if both changes to HTTPProxy A/B were valid or not since snapshot-2 includes the changes to both resources.

I don't quite understand what you mean; according to the ACK and NACK semantics summary:

The response_nonce field tells the server which of its responses the ACK or NACK is associated with.

The response_nonce can be used to distinguish which snapshot (1 or 2) is wrong.
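A minimal sketch of mapping response_nonce back to a snapshot version (invented names; this identifies which response a NACK answers, separately from the question of which k8s change produced that snapshot):

```go
package main

import "fmt"

// nonceIndex remembers which snapshot version each response nonce carried,
// so an incoming ACK/NACK can be tied back to the response it answers.
type nonceIndex struct {
	byNonce map[string]string // nonce -> snapshot version
	next    int
}

// Sent records an outgoing response and returns the nonce to attach to it.
func (n *nonceIndex) Sent(version string) string {
	n.next++
	nonce := fmt.Sprintf("nonce-%d", n.next)
	n.byNonce[nonce] = version
	return nonce
}

// Resolve maps a request's response_nonce back to the snapshot version.
func (n *nonceIndex) Resolve(responseNonce string) (string, bool) {
	v, ok := n.byNonce[responseNonce]
	return v, ok
}

func main() {
	idx := &nonceIndex{byNonce: make(map[string]string)}
	_ = idx.Sent("snapshot-1")
	n2 := idx.Sent("snapshot-2")
	v, _ := idx.Resolve(n2) // a NACK carrying n2 points at snapshot-2
	fmt.Println(v)
}
```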

Maybe we can cache the latest valid snapshot, and restore it when the new one is misconfigured.

@sunjayBhatia
Member

I don't quite understand what you mean, accord to ACK and NACK semantics summary

This doesn't have anything in particular to do with the xDS protocol, but rather with how Contour generates the snapshots: we take into account the current state of all the k8s resources in the cluster.

In the example above:

  • snapshot-1 will be NACKed if it contains invalid configuration from HTTPProxy A
  • snapshot-2 has the invalid configuration from A, which will cause it to be NACKed, but it could also have invalid configuration from B that causes a NACK; since the snapshot we generate configuration from covers all the resources in their current state, it isn't easy to tell whether the NACK was because of A or B
  • if A is in fact fixed, generating snapshot-3, but B is still invalid, we will get a NACK but may incorrectly think it was because the latest change to A was invalid, when in fact B is the culprit
  • the above gets especially complicated when changes to these k8s resources end up modifying the same xDS resource (e.g. the same Listener)

The response_nonce can be used to distinguish which snapshot (1 OR 2) is wrong.

The way we generate snapshots today, the changes from HTTPProxy A and B will both be applied to the DAG to generate snapshot-2, so NACKs will be for the aggregate of the changes, not an individual one.

Maybe we can cache the latest valid snapshot, when the new one is misconfigured then restore it

This would involve caching the last good state of the DAG, so we can apply the k8s resource changes on top of it, rather than caching the xDS config snapshot.

In general, isolating which k8s resource change caused a NACK will involve a significant addition of accounting logic, caching of the DAG, etc., and most likely blocking on rebuilding the DAG and sending resource updates until we get ACKs/NACKs, which we do not do today.

A separate issue is what we should do on startup: if there are invalid resources in the cluster when Contour restarts, we have no prior valid configuration to fall back to, so we may end up incorrectly dropping traffic when we shouldn't.

@izturn
Member

izturn commented Feb 1, 2023

I/We'd like to design and implement it, but I think we first need to hold a community meeting to discuss the whole matter.
@skriss @sunjayBhatia WDYT?

@sunjayBhatia
Member

We're planning on starting up community meetings again soon, with an ad-hoc meeting coming next week or the week after. We will announce it on slack and the community mailing list: https://groups.google.com/g/project-contour

@izturn
Member

izturn commented Feb 2, 2023

Glad to hear that
