-
Couldn't load subscription status.
- Fork 3.9k
Description
Background
In our project we are using the https://github.com/envoyproxy/go-control-plane with the xDS provider from https://github.com/wongnai/xds and grpc-java@v1.50.0 for client.
The client is communicating with multiple gRPC services on different hosts.
Current behavior
After the idle timeout of a channel (default: 30min) the xds client forgets resolved names.
The go-control-plane (and also the java-control-plane) are stateful and do not forget what the client requested, e.g.: https://github.com/envoyproxy/go-control-plane/blob/main/pkg/cache/v3/simple.go#L350
This leads to the following problem:
- client requests
service-a,service-b,service-candservice-donxds-resolverand gets a proper answer - after the idle timeout things get shutdown for
service-b:
"2022-10-18 @ 07:44:21.255",DEBUG,"grpc-default-worker-ELG-1-1","[xds-resolver<6>: (service-b)] Stop watching LDS resource service-b"
"2022-10-18 @ 07:44:21.255",DEBUG,"grpc-default-worker-ELG-1-1","[cluster_manager-lb<69>: (service-b)] Shutdown"
"2022-10-18 @ 07:44:21.255",DEBUG,"grpc-default-worker-ELG-1-1","[xds-resolver<6>: (service-b)] Shutdown"
"2022-10-18 @ 07:44:21.255",DEBUG,"grpc-default-worker-ELG-1-1","[xds-client<35>] Unsubscribing LDS resource service-b from server xds-resolver:1337"
"2022-10-18 @ 07:44:21.255",DEBUG,"grpc-default-worker-ELG-1-1","[cds-lb<70>: (service-b)] Shutdown"
"2022-10-18 @ 07:44:21.255",DEBUG,"grpc-default-worker-ELG-1-1","[xds-resolver<92>: (service-b)] Created resolver for service-b"
"2022-10-18 @ 07:44:21.256",DEBUG,"grpc-default-worker-ELG-1-1","[cluster-resolver-lb<71>: (service-b)] Stop watching EDS resource service-b.dev:grpc"
...
"2022-10-18 @ 07:44:21.256",DEBUG,"grpc-default-worker-ELG-1-1","[weighted-target-lb<75>: (service-b)] Shutdown"
"2022-10-18 @ 07:44:21.256",DEBUG,"grpc-default-worker-ELG-1-1","[cluster-resolver-lb<71>: (service-b)] Shutdown"
...
"2022-10-18 @ 07:44:21.256",DEBUG,"grpc-default-worker-ELG-1-1","[xds-client<35>] Unsubscribing EDS resource service-b.dev:grpc from server xds-resolver:1337"
"2022-10-18 @ 07:44:21.256",DEBUG,"grpc-default-worker-ELG-1-1","[priority-lb<72>: (service-b)] Shutdown"
"2022-10-18 @ 07:44:21.256",DEBUG,"grpc-default-worker-ELG-1-1","[xds-client<35>] Unsubscribing CDS resource service-b.dev:grpc from server xds-resolver:1337"
"2022-10-18 @ 07:44:21.256",TRACE,"grpc-default-worker-ELG-1-1","[priority-lb<72>: (service-b)] Priority deleted: service-b.dev:grpc[child1]"
"2022-10-18 @ 07:44:21.257",TRACE,"grpc-default-worker-ELG-1-1","[cluster_manager-lb<69>: (service-b)] Child balancer cluster:service-b.dev:grpc deleted"
- some time later the client wants to communicate with
service-bagain and tries to resolve it:
"2022-10-18 @ 08:05:33.361",DEBUG,"http-nio-8080-exec-5","[xds-resolver<92>: (service-b)] Start watching LDS resource service-b"
"2022-10-18 @ 08:05:33.361",TRACE,"http-nio-8080-exec-5","[xds-client<38>: (xds-resolver:1337)] Sent DiscoveryRequest
{
""versionInfo"": ""422bd1d65455a992"",
""node"": {
""id"": ""client-node-id"",
""userAgentName"": ""gRPC Java"",
""userAgentVersion"": ""1.50.0"",
""clientFeatures"": [""envoy.lb.does_not_support_overprovisioning"", ""xds.config.resource-in-sotw""]
},
""resourceNames"": [""service-a"", ""service-b""],
""typeUrl"": ""type.googleapis.com/envoy.config.listener.v3.Listener"",
""responseNonce"": ""124""
}"
"2022-10-18 @ 08:05:33.361",DEBUG,"http-nio-8080-exec-5","[xds-client<35>] Subscribe io.grpc.xds.XdsListenerResource@74a2f92e resource service-b"
"2022-10-18 @ 08:05:33.361",DEBUG,"http-nio-8080-exec-5","[xds-client<38>: (xds-resolver:1337)] Sending LDS request for resources: [service-a, service-b]"
"2022-10-18 @ 08:05:46.369",DEBUG,"grpc-timer-0","[xds-client<35>] LDS resource service-b initial fetch timeout"
"2022-10-18 @ 08:05:46.369",DEBUG,"grpc-timer-0","[xds-client<35>] Conclude LDS resource service-b not exist"
"2022-10-18 @ 08:05:46.369",DEBUG,"grpc-timer-0","[xds-resolver<92>: (service-b)] LDS resource does not exist: service-b"
- this fails because the xDS provider remembered that this client already knows
service-bin this version and just adds a watcher for possible upcoming changes:
"2022-10-18 @ 08:05:29.397",debug,"-","open watch 1528 for type.googleapis.com/envoy.config.listener.v3.Listener[service-a service-b] from nodeID """", version ""422bd1d65455a992"""
"2022-10-18 @ 08:05:29.397",debug,"-","nodeID """" requested type.googleapis.com/envoy.config.listener.v3.Listener[service-a service-b] and known map[service-c:{} service-d:{} service-a:{} service-b:{}]. Diff []"
At this stage the client can not communicate with the service anymore.
Workaround
Increase the idle timeout on the channel to some high value. In the hope the client wants to talk to it in this time frame.
Attachements
- full client log: client.csv