Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds: ADS stream failure triggers wildcard subscriptions on new stream #7013

Closed
atollena opened this issue Feb 29, 2024 · 11 comments · Fixed by #7026
Closed

xds: ADS stream failure triggers wildcard subscriptions on new stream #7013

atollena opened this issue Feb 29, 2024 · 11 comments · Fixed by #7026
Assignees

Comments

@atollena
Copy link
Collaborator

What version of gRPC are you using?

master branch/1.61.1

What version of Go are you using (go version)?

1.22.0

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

  1. Create a channel with an xds:///dest target.
  2. Make sure it is properly initialized (send some successful requests onto it). As a result, the gRPC client has an open ADS stream with a set of subscriptions for the API listener corresponding to the target.
  3. Shut this channel down via Close(), or make it enter IDLE state.
  4. At this point the gRPC client has an open ADS stream with a set of empty subscriptions for all resource types (e.g. LDS, CDS, EDS) to the xDS server. resource_names is empty in the last DiscoveryRequests. This ADS stream is still open. Not that the server will send no response, so it does not have a chance to send a new version to client. This empty subscription is not treated as wildcard by the management server because there was at least one explicit subscriptions on the stream.
  5. The management server closes the stream, e.g. for rebalance. According to A57 it should simply reconnect. Note that grpc-go issues two warning in this case (here and here), which is unnecessarily worrying to users -- this should probably just be at info level. Happy to submit a PR for this.
  6. gRPC reconnects to the management server. It creates an ADS stream, and sends discovery requests for all resource types (LDS, CDS, EDS) with empty resource names. The management server correctly interprets this as a wildcard subscription (see this section of the xDS protocol spec). What exactly LDS wildcard entails is up to the management server, but to illustrate the issue, let's assume the management server sends the API listener for xds:///dest as part of it. Note that gRPC discards all resources because it doesn't subscribe to any of them yet.
  7. The channel with xds:/// is recreated, or exits idle mode. This causes a new explicit subscription to the corresponding listener, but since the management server already sent this LDS resource as part of the wildcard subscription, it considers the client up to date and does not send a response.
  8. The client times out and considers that the resource does not exist. RPCs fail.

What did you expect to see?

I expected one of the two following behaviours. Either:

  1. When the last xDS channel is closed or enters idle mode, then the xDS transport is closed and current version information discarded. That would sidestep the issue.
  2. Another reasonable behavior would be to never create empty subscriptions on a new ADS stream, to avoid triggering the protocol's wildcard subscription logic. I think upon reconnection, for CDS and LDS, if there is no subscription then there should be no request sent. Looking at the code, there seem to be some code trying to handle this case but I think it incorrectly iterates on resource type instead of the resource themselves.

What did you see instead?

gRPC sends discovery requests with empty resource_names field upon reconnecting, triggering wildcard subscription.

@atollena
Copy link
Collaborator Author

atollena commented Mar 1, 2024

I want to add another problem that I see in the current behavior of grpc-go (and probably other languages?) for channel transition:

READY -> IDLE -> CONNECTING

If the data plane is disconnected from the management server when this transition happens, when the management server comes back online, it has no way to know that the client has discarded the corresponding API listener. I'm not sure if this qualifies as a bug or just an oddity of the protocol.

Consider the following events:

  1. channel is in READY state, client has acked API listener resource at version A
  2. management server goes offline. channel continues to work fine with its current version of the listener
  3. channel goes IDLE, and discards the listener resource. Current version for this type URL is not changed, it is still A (the code for unsubscription is here)
  4. channel goes CONNECTING. xDS resolver tries to obtain the API listener resource with a discovery request with version A.

If the management server comes back online after step 3 or step 4, it has no way to know that the client does not already have a valid API listeners for the subscribed resource, as the version matches the current resources version. The data plane has discarded the resource, the resource has not changed so the version has not changed, and the management server was not exposed to the unsubscription+re-subscription events because it was offline when it occurred. The channel will only recovers from that state when at least one listener resource changes, triggering a full update response.

That seems like an inherent problem of the current implementation, and the only fix I could think of would be for the data plane to discard the version anytime it unsubscribed from a resource and has not yet gotten a response with the corresponding nonce for this URL type. This would force the control plane to send all subscribed resources.

And reading the protocol spec carefully, this problem is acknowledged:

Note that the version for a resource type is not a property of an individual xDS stream but rather a property of the resources themselves. If the stream becomes broken and the client creates a new stream, the client’s initial request on the new stream should indicate the most recent version seen by the client on the previous stream. Servers may decide to optimize by not resending resources that the client had already seen on the previous stream, but only if they know that the client is not subscribing to a new resource that it was not previously subscribed to. For example, it is generally safe for servers to do this optimization for LDS and CDS when the only subscription is a wildcard subscription, and it is safe to do in environments where the clients will always subscribe to exactly the same set of resources.

For context, we use a hash of a sorted list of <resource_name, resource_content> as the resource version for each resource type, for all subscribed resources, with a special value for the content if the resource does not exist. And we originally thought that by being clever and including the list of subscriptions inside the version info, we could safely resume subscription without triggering a full update. @valerian-roche is in the process of making https://github.com/envoyproxy/go-control-plane play better with gRPC by making state of the world stateful, and implemented this trick.

So it seems like when the client is a gRPC application, presumably with dynamic subscription per channel, the server should never do this optimization. This bears the question of why gRPC sends a version string on new ADS streams, even though we know that in most cases, it is unsafe for the management server to take this version info into consideration. Wouldn't it have been less error prone to just not include the version?

Note that this problem wouldn't happen with incremental, because un-subscription + re-subscriptions are explicit and resource version are per resource (via the https://github.com/envoyproxy/data-plane-api/blob/main/envoy/api/v2/discovery.proto#L171). It'd be nice to see progress on incremental transport for gRPC at some point, mostly because it is simpler.

@dfawley
Copy link
Member

dfawley commented Mar 5, 2024

I think upon reconnection, for CDS and LDS, if there is no subscription then there should be no request sent.

Yes, this seems like the proper behavior to me. I'll look into this further.

I want to add another problem that I see in the current behavior of grpc-go (and probably other languages?) for channel transition:

This is very interesting. We'll talk about this a bit more internally, but one possible workaround would be to not ever send uncached resources in the initial request, and then add subscriptions for them after that.

Another option: don't set the version with the initial request if we have any uncached resources being requested.

@dfawley
Copy link
Member

dfawley commented Mar 6, 2024

@atollena I sent #7026 to fix the initial bug report. Is it possible for you to confirm that this fixes the issue for you in practice?

For the second issue, we have an open discussion item for deciding how to handle it, but it likely affects all languages.

@dfawley
Copy link
Member

dfawley commented Mar 7, 2024

@atollena if you could confirm the PR fixes these issues for you, that would be great. Also, will you need a patch release with this?

@atollena
Copy link
Collaborator Author

atollena commented Mar 7, 2024

@atollena if you could confirm the PR fixes these issues for you, that would be great. Also, will you need a patch release with this?

I was able to try your fix today, and it does fix the issue of wildcard. Thank you!

@atollena
Copy link
Collaborator Author

atollena commented Mar 7, 2024

Should there be another issue for the protocol problem I pointed out in #7013 (comment)?

@atollena
Copy link
Collaborator Author

atollena commented Mar 7, 2024

@atollena if you could confirm the PR fixes these issues for you, that would be great. Also, will you need a patch release with this?

I don't need a patch release, we're not using this in production yet.

@dfawley
Copy link
Member

dfawley commented Mar 7, 2024

Should there be another issue for the protocol problem I pointed out in #7013 (comment)?

I am tracking this internally. We need to discuss this cross-language to determine a solution.

@dfawley
Copy link
Member

dfawley commented Mar 15, 2024

@atollena,

We discussed the other issue today, and this is apparently a known problem:

https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#ack-nack-and-resource-type-instance-version

Servers may decide to optimize by not resending resources that the client had already seen on the previous stream, but only if they know that the client is not subscribing to a new resource that it was not previously subscribed to. For example, it is generally safe for servers to do this optimization for LDS and CDS when the only subscription is a wildcard subscription, and it is safe to do in environments where the clients will always subscribe to exactly the same set of resources.

@atollena
Copy link
Collaborator Author

We discussed the other issue today, and this is apparently a known problem:
...

I was aware of this, I actually quoted that same spec section in the issue :)

It is imo confusing and error prone that gRPC sends a version number on reconnect when it has discarded resources, because taking it into consideration on the control plane side is definitely going to lead to subtle problems like this one. How do your control planes handle the version on new ADS stream for non wildcard subscriptions? I suppose it always discards it? In that case, what is the point of sending it?

@dfawley
Copy link
Member

dfawley commented Mar 18, 2024

Good question. tl;dr is that it might be theoretically possible to make use of this, but in practice it's very unlikely. The only one we've been able to come up with is if the server knows its clients never unsubscribe + resubscribe. We probably wouldn't set that in these scenarios if we were starting from scratch, but because that is the current behavior, we've so far decided to leave it in place in case someone is making use of that.

valerian-roche added a commit to DataDog/go-control-plane that referenced this issue Apr 15, 2024
…when using wildcard watches

During testing with grpc-xds, it was noticed that a specific behavior on the client side is not compatible with sotw subscription resumptions.
When the last channel is closed, the client disconnects from the control-plane. If the same channel gets reopened later on, the connection is re-established with the same resource subscription and the last version from before is provided. In this case the control-plane currently does not return the response as the version does match, whereas grpc expects the control-plane to reply as it considers it as a "desubscription then resubscription event", which should send the resource again.
In the context of wildcard watches this is not an issue, so the behavior is kept.

More context on the grpc-xds discussions in this [thread](grpc/grpc-go#7013 (comment))

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
valerian-roche added a commit to DataDog/go-control-plane that referenced this issue Apr 15, 2024
…when using wildcard watches (#16)

During testing with grpc-xds, it was noticed that a specific behavior on the client side is not compatible with sotw subscription resumptions.
When the last channel is closed, the client disconnects from the control-plane. If the same channel gets reopened later on, the connection is re-established with the same resource subscription and the last version from before is provided. In this case the control-plane currently does not return the response as the version does match, whereas grpc expects the control-plane to reply as it considers it as a "desubscription then resubscription event", which should send the resource again.
In the context of wildcard watches this is not an issue, so the behavior is kept.

More context on the grpc-xds discussions in this [thread](grpc/grpc-go#7013 (comment))

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
valerian-roche added a commit to valerian-roche/go-control-plane that referenced this issue May 9, 2024
…when using wildcard watches (#16)

During testing with grpc-xds, it was noticed that a specific behavior on the client side is not compatible with sotw subscription resumptions.
When the last channel is closed, the client disconnects from the control-plane. If the same channel gets reopened later on, the connection is re-established with the same resource subscription and the last version from before is provided. In this case the control-plane currently does not return the response as the version does match, whereas grpc expects the control-plane to reply as it considers it as a "desubscription then resubscription event", which should send the resource again.
In the context of wildcard watches this is not an issue, so the behavior is kept.

More context on the grpc-xds discussions in this [thread](grpc/grpc-go#7013 (comment))

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants