
internal/grpc: properly handle Envoy ACK/NACK protocol #1176

Open
davecheney opened this issue Jun 19, 2019 · 18 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@davecheney
Contributor

Envoy's stream protocol states that after each DiscoveryResponse message, the subsequent DiscoveryRequest will contain an ACK or NACK for the previous response. Contour currently doesn't pay attention to this ACK/NACK protocol, but for better compliance with the xDS spec we should.
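For reference, the handshake can be sketched in Go. The `DiscoveryRequest` struct below is a simplified stand-in for the Envoy API type, keeping only the fields relevant to ACK/NACK, and `classify` is an invented helper, not Contour code:

```go
package main

import "fmt"

// DiscoveryRequest is a simplified stand-in for the Envoy xDS type,
// keeping only the fields relevant to the ACK/NACK handshake.
type DiscoveryRequest struct {
	VersionInfo   string // last version the client accepted
	ResponseNonce string // nonce of the response this request answers
	ErrorDetail   string // non-empty when the client is NACKing
}

// classify reports whether a request is the initial request on a stream,
// an ACK of the previous response, or a NACK.
func classify(req *DiscoveryRequest) string {
	switch {
	case req.ResponseNonce == "":
		return "initial" // first request on the stream, nothing to ACK
	case req.ErrorDetail != "":
		return "NACK" // previous response was rejected
	default:
		return "ACK" // previous response was accepted
	}
}

func main() {
	fmt.Println(classify(&DiscoveryRequest{}))
	fmt.Println(classify(&DiscoveryRequest{ResponseNonce: "n1"}))
	fmt.Println(classify(&DiscoveryRequest{ResponseNonce: "n1", ErrorDetail: "bad config"}))
}
```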

@davecheney davecheney added kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Jun 19, 2019
@davecheney davecheney added this to the 0.14.0 milestone Jun 19, 2019
davecheney added a commit to davecheney/contour that referenced this issue Jun 19, 2019
Updates projectcontour#499
Updates projectcontour#273
Updates projectcontour#1176

The XDS spec says that Envoy will always initiate a stream with a
discovery request, and expects the management server to respond with
only one discovery response. After that, Envoy will initiate another
discovery request containing an ACK or a NACK from the previous
response.

Currently Contour ignores the ACK/NACK (this is projectcontour#1176); however, after
inspecting the current code it is evident that we're also not waiting
for Envoy to send the next discovery request.

This PR removes the inner `for {}` loop that would continue to reuse the
initial discovery request until the client disconnected. The previous
code was written at a time when we'd just implemented filtering and it
was possible for the filter to return no results, hence the inner loop
was (incorrectly) trying to loop until there was a result to return.

Huge thanks to @lrouquette who pointed this out.

Signed-off-by: Dave Cheney <dave@cheney.net>
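The loop shape this commit moves to can be sketched roughly as follows. The `stream` interface and `fakeStream` here are hypothetical stand-ins for the generated gRPC stream, not Contour's actual types; the point is that each received request gets exactly one response, and the server then blocks waiting for the next request (which carries the ACK/NACK):

```go
package main

import "fmt"

// Simplified stand-ins for the xDS messages and the gRPC stream.
type DiscoveryRequest struct{ VersionInfo, ResponseNonce string }
type DiscoveryResponse struct{ VersionInfo, Nonce string }

type stream interface {
	Recv() (*DiscoveryRequest, error)
	Send(*DiscoveryResponse) error
}

// serve answers each DiscoveryRequest with exactly one DiscoveryResponse,
// then waits for the next request, instead of looping and resending
// responses for the same request.
func serve(st stream, respond func(*DiscoveryRequest) *DiscoveryResponse) error {
	for {
		req, err := st.Recv() // block for the next request: initial, ACK, or NACK
		if err != nil {
			return err // client disconnected
		}
		if err := st.Send(respond(req)); err != nil {
			return err
		}
	}
}

// fakeStream feeds a fixed sequence of requests and records responses.
type fakeStream struct {
	reqs []*DiscoveryRequest
	sent []*DiscoveryResponse
}

func (f *fakeStream) Recv() (*DiscoveryRequest, error) {
	if len(f.reqs) == 0 {
		return nil, fmt.Errorf("EOF")
	}
	r := f.reqs[0]
	f.reqs = f.reqs[1:]
	return r, nil
}

func (f *fakeStream) Send(r *DiscoveryResponse) error {
	f.sent = append(f.sent, r)
	return nil
}

func main() {
	st := &fakeStream{reqs: []*DiscoveryRequest{
		{},                                      // initial request
		{VersionInfo: "1", ResponseNonce: "n1"}, // ACK of the first response
	}}
	_ = serve(st, func(*DiscoveryRequest) *DiscoveryResponse {
		return &DiscoveryResponse{VersionInfo: "1", Nonce: "n1"}
	})
	fmt.Println(len(st.sent)) // exactly one response per request
}
```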
@davecheney davecheney modified the milestones: 0.14.0, 0.15.0 Jul 19, 2019
@phylake

phylake commented Aug 2, 2019

NACKs happen (at least) when config validation fails. Envoy erroneously throws exceptions in its config constructors (example), which will leak memory for anything allocated in the constructor body (destructors aren't run if an exception is thrown in the ctor body; see Item 10 in More Effective C++ for more detail). In the worst case, Contour continues to send the same resources and gets NACKs back while Envoy leaks memory until it's oom-killed. Whether memory leaks, and how much, depends on which object in the deeply nested config classes throws an exception and where.

Contour needs to mitigate this for xDS types with resource hints by not sending a resource anymore if it's NACKed. I don't know what mitigation it could do for LDS/CDS other than keeping the previous set of resources around so they can be sent if the most recent version is NACKed.
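The first mitigation described above can be sketched as remembering which (resource, version) pairs Envoy rejected and filtering them out of subsequent responses. All names here are invented for illustration, not Contour code:

```go
package main

import "fmt"

// nackFilter remembers NACKed (resource name, version) pairs so the
// server can avoid resending config that Envoy has already rejected.
type nackFilter struct {
	rejected map[string]string // resource name -> last NACKed version
}

func newNackFilter() *nackFilter {
	return &nackFilter{rejected: make(map[string]string)}
}

// recordNACK marks a resource version as rejected by Envoy.
func (f *nackFilter) recordNACK(name, version string) {
	f.rejected[name] = version
}

// shouldSend reports whether this resource version may be sent:
// anything except a version Envoy already NACKed.
func (f *nackFilter) shouldSend(name, version string) bool {
	return f.rejected[name] != version
}

func main() {
	f := newNackFilter()
	f.recordNACK("cluster-a", "v2")
	fmt.Println(f.shouldSend("cluster-a", "v2")) // already rejected: do not resend
	fmt.Println(f.shouldSend("cluster-a", "v3")) // a newer version may fix it
}
```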

davecheney added a commit to davecheney/contour that referenced this issue Aug 19, 2019
Updates projectcontour#1176

The xDS spec says that if Envoy NACKs an update we shouldn't try to send
it again. Given that the update came from k8s and was translated by Contour,
either there's a bug in our translation or a mistake in the data that a user
entered into k8s.

For 0.15, log the NACK at ERROR level in the hope that this will give us
some visibility into what kind of invalid configuration we're
transmitting to Envoy and the frequency of those mistakes.

Ideally we'd be able to trace a NACK all the way back to the originating
record and cordon it. However, given that records like RDS and LDS are
gestalts, cordoning the invalid part of the configuration is rather
complicated. Perhaps the better solution would be to strengthen our
validation to prevent these errors reaching Envoy.

Signed-off-by: Dave Cheney <dave@cheney.net>
davecheney added a commit that referenced this issue Aug 19, 2019
@davecheney
Contributor Author

For 0.15 I've landed #1350 and I plan to leave this issue here for the moment.

Sending invalid updates to Envoy is a problem -- we shouldn't do that, but I don't know what to do with the signal once we did send an invalid update. Trying to cordon the offending update is complicated; it depends on which table it came through. Crashing Contour is a non-starter, so logging is about the least worst thing we can do at the moment. Hopefully logging at error level will give us a signal from the field about which updates are causing the problem, and we'll know whether it's something we can systematically filter out on the Contour side or whether we need to build a framework to trace NACK'd responses from Envoy back to their original k8s object.

@davecheney davecheney removed this from the 0.15.0 milestone Aug 19, 2019
@davecheney davecheney added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Aug 19, 2019
@davecheney davecheney added this to the Backlog milestone Aug 30, 2019
@jpeach jpeach added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Jun 9, 2020
@jpeach
Contributor

jpeach commented Dec 8, 2020

Crashing contour is a non starter, so logging is about the best I can think to do at the moment. Hopefully logging at error level will give us a signal from the field

Have we gotten any signal on this from users?

Generally, we have opted not to allow passing config straight through to Envoy due to the risk of generating errors. This has complicated development and prevented advanced users from taking advantage of some Envoy features directly.

Another option, in addition to logging, would be to increment a metric to make this issue more visible to operators.

Crashing doesn't help clear the underlying condition that causes the NACK. Contour would just reschedule and then send the bad config again. The desirable behavior is to fail safe. In the context of a configuration system, what we would want to do is keep the last good configuration, then bubble up the error associated with the failure to apply the new configuration. The difficulty here is knowing what the last configuration was (or should be), and how to propagate the error (since there's not always a straightforward relationship between Envoy and Kubernetes resources).
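The keep-the-last-good-configuration idea above can be sketched with a couple of hypothetical types (not a proposal for Contour's actual API). The nil result on the no-prior-snapshot path is exactly the hard case: nothing good to fall back to:

```go
package main

import "fmt"

// Snapshot is a stand-in for a full set of xDS resources at one version.
type Snapshot struct{ Version string }

// lastGood keeps the most recent snapshot Envoy accepted, so that a
// NACKed update can be rolled back instead of resent.
type lastGood struct {
	good    *Snapshot // last ACKed snapshot, if any
	pending *Snapshot // snapshot awaiting ACK/NACK
}

// Propose records a newly sent snapshot awaiting a verdict.
func (l *lastGood) Propose(s *Snapshot) { l.pending = s }

// OnACK promotes the pending snapshot to known-good.
func (l *lastGood) OnACK() { l.good, l.pending = l.pending, nil }

// OnNACK discards the pending snapshot and returns the snapshot to
// serve instead: the last good one, or nil if there has never been
// one (e.g. invalid resources were already present at startup).
func (l *lastGood) OnNACK() *Snapshot {
	l.pending = nil
	return l.good
}

func main() {
	l := &lastGood{}
	l.Propose(&Snapshot{Version: "1"})
	l.OnACK()
	l.Propose(&Snapshot{Version: "2"})
	fallback := l.OnNACK()
	fmt.Println(fallback.Version) // roll back to the last accepted version
}
```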

@youngnick
Member

In order to implement what James is talking about, we will need to change the way we build the DAG and send it to Envoy: instead of building a DAG and sending the whole thing, we build a DAG, diff it against the last DAG, and send either the changes (which would allow us to do incremental xDS) or the whole config again (what we do now). In the case that the change is NACKed, we would then have to spray errors everywhere (in every object managed by Contour, in the logs, and in a metric as well), to make it clear that nothing is getting updated until the issue is resolved.

In the meantime, we absolutely should log and increment a metric, to at least allow it to be possible to detect that something has happened. I don't think we can risk doing anything that may cause a NACK though until we have the above solution or something that does the same thing.
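Incrementing a metric on NACK is cheap to sketch. Contour's real metrics are Prometheus-based, so the stdlib `expvar` counter and the `onNACK` helper here are only illustrative:

```go
package main

import (
	"expvar"
	"fmt"
)

// nackTotal counts NACKs received; it is exported at /debug/vars when an
// HTTP server is running. A Prometheus counter would play the same role.
var nackTotal = expvar.NewInt("xds_nack_total")

// onNACK bumps the counter; a real implementation would also log the
// type URL and the error detail from the NACK.
func onNACK(typeURL string) {
	nackTotal.Add(1)
	_ = typeURL
}

func main() {
	onNACK("type.googleapis.com/envoy.config.listener.v3.Listener")
	onNACK("type.googleapis.com/envoy.config.route.v3.RouteConfiguration")
	fmt.Println(nackTotal.Value())
}
```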

@youngnick
Member

youngnick commented Mar 8, 2021

Okay, I'm having a look into this, and I'm going to start by checking that it's still a problem. To do this, I'm going to do something like this:

  • Remove the case-insensitive comparison for hostnames for HTTPProxy (revert Do a case insensitive match when checking for duplicate FQDNs #3231)
  • Make two HTTPProxies, one with UPPERCASE, one with lowercase.
  • Send to Envoy
  • Check we get a NACK (should be logged)
  • Get a config dump
  • Make a change
  • Get a new config dump
  • Fix the problem
  • Get a new config dump

@youngnick
Member

Okay, I ran the above process with both the "contour" and "envoy" xDS engines, and got the following results:

  • For "contour" xDS server, this causes a single error line for each Kubernetes change. Config still takes effect.
  • For "envoy" xDS server, this causes a continuous cycle of rejecting updates. Config still takes effect. This is definitely an operational problem, as gRPC updates are very frequent. Other updates will go through, but the logs for both Envoy and Contour are unusable due to spam.

In both cases, I did not need to restart Envoy to have it pick up new changes.

So, this means that this process is a lot more forgiving than I thought. Which is great! But I think there's still more to investigate here.

@xaleeks xaleeks moved this from Unprioritized to 1.15 release in Contour Project Board Apr 6, 2021
@sunjayBhatia sunjayBhatia moved this from 1.15 release (WIP) to 1.16 release in Contour Project Board Apr 28, 2021
@youngnick youngnick removed their assignment May 10, 2021
@youngnick youngnick moved this from 1.16 release to Parking Lot 1 in Contour Project Board May 25, 2021
@youngnick
Member

This one's waiting on @wrowe to have some bandwidth. Moving to parking lot 1.

@izturn
Member

izturn commented Jan 3, 2023

#4276 says that if we want to support it, we must solve this one first, but after reading all the issues related to this one, I have the following questions:

  1. With envoyproxy/envoy#4787 (implement discovery request rate limiting), Envoy added a new config option that slows down the "while true"-like loop, so it no longer leads to OOM.
  2. @youngnick's comments in the issues, such as:

the behavior is that no updates will be accepted until the incorrect config is removed, at which time Envoy will recover

so when the config becomes valid again, restarting Contour is no longer needed.

After reading all the links above and going through Contour's source code, I don't know what the real problem with NACK is now. It seems that all we need to do is let the user know there is a misconfiguration?

@sunjayBhatia
Member

We have stalled a bit on working on the ACK/NACK problem because, as you say, we've found that we pretty much validate all user input anyway, so we don't send invalid config to Envoy; it is very difficult to generate a NACK at all at the moment.

An ACK is maybe more interesting to figure out: it could allow us to set some status on HTTPProxy/Route etc. that tells users when their config is actually in effect. However, figuring out when all the nodes have ACKed configuration, and making sure updates do not cross each other and are processed sequentially, is not necessarily straightforward.

@izturn
Member

izturn commented Jan 4, 2023

@sunjayBhatia sorry, I don't quite understand what you mean. Please explain it in more detail, thanks.

@sunjayBhatia
Member

when we send a configuration update to a fleet of Envoys, they may return an ACK or NACK to signify whether the update was accepted

Because we heavily validate user input, the configuration we generate in practice is never NACKed at the moment, so we have held off making much change here; it would be adding complication for not much benefit. Figuring out whether a change was ACKed would be useful so we can signify to the user when exactly their configuration has taken effect (useful for something like knative etc.).

However, in the future, if we support WASM modules, Lua scripting, or something else that we cannot fully validate up-front and that Envoy could return a NACK for, we will need a way to signify to users that their resources are invalid. We've again held off on adding these sorts of features and "solving the NACK problem" because of the below:

Figuring out whether a specific change was ACKed or NACKed is not actually straightforward, because of how we generate snapshots of xDS config to send to Envoy: changes in k8s resources basically get stacked on top of each other and sent asynchronously. For example, if we have a change to HTTPProxy A that generates a new snapshot-1 that is NACKed, and then a change to HTTPProxy B that generates snapshot-2 that is also NACKed, we don't currently have a good way to decipher whether the changes to HTTPProxy A/B were each valid or not, since snapshot-2 includes the changes to both resources. Getting this to work is a larger piece of work to modify how we process k8s resource updates and send them along to Envoy, likely introducing some sort of blocking or a complex accounting system, which is a change we have held off on making so far.

@izturn
Member

izturn commented Jan 5, 2023

e.g. if we have say a change to HTTPProxy A that generates a new snapshot-1 that is NACKed, and then a change to HTTPProxy B that generates snapshot-2 that is also NACKed, we don't currently have a good way to decipher if both changes to HTTPProxy A/B were valid or not since snapshot-2 includes the changes to both resources.

I don't quite understand what you mean; according to the ACK and NACK semantics summary:

The response_nonce field tells the server which of its responses the ACK or NACK is associated with.

The response_nonce can be used to distinguish which snapshot (1 or 2) is wrong.
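A minimal sketch of mapping response_nonce back to a snapshot version (invented names; this identifies which response a NACK answers, separately from the question of which k8s change produced that snapshot):

```go
package main

import "fmt"

// nonceIndex remembers which snapshot version each response nonce carried,
// so an incoming ACK/NACK can be tied back to the response it answers.
type nonceIndex struct {
	byNonce map[string]string // nonce -> snapshot version
	next    int
}

// Sent records an outgoing response and returns the nonce to attach to it.
func (n *nonceIndex) Sent(version string) string {
	n.next++
	nonce := fmt.Sprintf("nonce-%d", n.next)
	n.byNonce[nonce] = version
	return nonce
}

// Resolve maps a request's response_nonce back to the snapshot version.
func (n *nonceIndex) Resolve(responseNonce string) (string, bool) {
	v, ok := n.byNonce[responseNonce]
	return v, ok
}

func main() {
	idx := &nonceIndex{byNonce: make(map[string]string)}
	_ = idx.Sent("snapshot-1")
	n2 := idx.Sent("snapshot-2")
	v, _ := idx.Resolve(n2) // a NACK carrying n2 points at snapshot-2
	fmt.Println(v)
}
```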

Maybe we can cache the latest valid snapshot, and restore it when the new one is misconfigured.

@sunjayBhatia
Member

I don't quite understand what you mean, accord to ACK and NACK semantics summary

This doesn't have anything in particular to do with the xDS protocol, but rather with how Contour generates the snapshots: we take into account the current state of all the k8s resources in the cluster.

In the example above:

  • snapshot-1 will be NACKed if it contains invalid configuration from HTTPProxy A
  • snapshot-2 has the invalid configuration from A, which will cause it to be NACKed, but it could also have invalid configuration from B that causes a NACK; since the snapshot we generate configuration from covers all the resources in their current state, it isn't easy to tell whether the NACK was because of A or B
  • if A is in fact fixed, generating snapshot-3, but B is still invalid, we will get a NACK but may incorrectly think it was because the latest change to A was invalid, when in fact B is the culprit
  • the above gets especially complicated when changes to these k8s resources end up modifying the same xDS resource (e.g. the same Listener)

The response_nonce can be used to distinguish which snapshot (1 OR 2) is wrong.

The way we generate snapshots today, the changes from HTTPProxy A and B will both be applied to the DAG to generate snapshot-2, so NACKs will be for the aggregate of the changes, not an individual one.

Maybe we can cache the latest valid snapshot, when the new one is misconfigured then restore it

This would involve caching the last good state of the DAG, so we can apply the k8s resource changes on top of it, rather than caching the xDS config snapshot.

In general, isolating which k8s resource change caused a NACK will involve a significant addition of accounting logic, caching of the DAG, etc., and most likely blocking on rebuilding the DAG and sending resource updates until we get ACKs/NACKs, which we do not do today.

A separate issue is what we should do on startup: if there are invalid resources in the cluster when Contour restarts, we have no prior valid configuration to fall back to, so we may end up incorrectly dropping traffic when we shouldn't.

@izturn
Member

izturn commented Feb 1, 2023

I/We'd like to design and implement it, but I think we first need to hold a community meeting to discuss the whole matter.
@skriss @sunjayBhatia WDYT?

@sunjayBhatia
Member

We're planning on starting up community meetings again soon, with an ad-hoc meeting coming next week or the week after. We will announce it on slack and the community mailing list: https://groups.google.com/g/project-contour

@izturn
Member

izturn commented Feb 2, 2023

Glad to hear that
