Skip to content

Idle timeout breaks xDS with go-control-plane #9632

@Necropaw

Description

@Necropaw

Background

In our project we are using the https://github.com/envoyproxy/go-control-plane with the xDS provider from https://github.com/wongnai/xds and grpc-java@v1.50.0 for client.

The client is communicating with multiple gRPC services on different hosts.

Current behavior

After the idle timeout of a channel (default: 30min) the xds client forgets resolved names.

The go-control-plane (and also the java-control-plane) are stateful and do not forget what the client requested, e.g.: https://github.com/envoyproxy/go-control-plane/blob/main/pkg/cache/v3/simple.go#L350

This leads to the following problem:

  • client requests service-a, service-b, service-c and service-d on xds-resolver and gets a proper answer
  • after the idle timeout things get shutdown for service-b:
"2022-10-18 @ 07:44:21.255",DEBUG,"grpc-default-worker-ELG-1-1","[xds-resolver<6>: (service-b)] Stop watching LDS resource service-b"
"2022-10-18 @ 07:44:21.255",DEBUG,"grpc-default-worker-ELG-1-1","[cluster_manager-lb<69>: (service-b)] Shutdown"
"2022-10-18 @ 07:44:21.255",DEBUG,"grpc-default-worker-ELG-1-1","[xds-resolver<6>: (service-b)] Shutdown"
"2022-10-18 @ 07:44:21.255",DEBUG,"grpc-default-worker-ELG-1-1","[xds-client<35>] Unsubscribing LDS resource service-b from server xds-resolver:1337"
"2022-10-18 @ 07:44:21.255",DEBUG,"grpc-default-worker-ELG-1-1","[cds-lb<70>: (service-b)] Shutdown"
"2022-10-18 @ 07:44:21.255",DEBUG,"grpc-default-worker-ELG-1-1","[xds-resolver<92>: (service-b)] Created resolver for service-b"
"2022-10-18 @ 07:44:21.256",DEBUG,"grpc-default-worker-ELG-1-1","[cluster-resolver-lb<71>: (service-b)] Stop watching EDS resource service-b.dev:grpc"
...
"2022-10-18 @ 07:44:21.256",DEBUG,"grpc-default-worker-ELG-1-1","[weighted-target-lb<75>: (service-b)] Shutdown"
"2022-10-18 @ 07:44:21.256",DEBUG,"grpc-default-worker-ELG-1-1","[cluster-resolver-lb<71>: (service-b)] Shutdown"
...
"2022-10-18 @ 07:44:21.256",DEBUG,"grpc-default-worker-ELG-1-1","[xds-client<35>] Unsubscribing EDS resource service-b.dev:grpc from server xds-resolver:1337"
"2022-10-18 @ 07:44:21.256",DEBUG,"grpc-default-worker-ELG-1-1","[priority-lb<72>: (service-b)] Shutdown"
"2022-10-18 @ 07:44:21.256",DEBUG,"grpc-default-worker-ELG-1-1","[xds-client<35>] Unsubscribing CDS resource service-b.dev:grpc from server xds-resolver:1337"
"2022-10-18 @ 07:44:21.256",TRACE,"grpc-default-worker-ELG-1-1","[priority-lb<72>: (service-b)] Priority deleted: service-b.dev:grpc[child1]"
"2022-10-18 @ 07:44:21.257",TRACE,"grpc-default-worker-ELG-1-1","[cluster_manager-lb<69>: (service-b)] Child balancer cluster:service-b.dev:grpc deleted"
  • some time later the client wants to communicate with service-b again and tries to resolve it:
"2022-10-18 @ 08:05:33.361",DEBUG,"http-nio-8080-exec-5","[xds-resolver<92>: (service-b)] Start watching LDS resource service-b"
"2022-10-18 @ 08:05:33.361",TRACE,"http-nio-8080-exec-5","[xds-client<38>: (xds-resolver:1337)] Sent DiscoveryRequest
{
  ""versionInfo"": ""422bd1d65455a992"",
  ""node"": {
    ""id"": ""client-node-id"",
    ""userAgentName"": ""gRPC Java"",
    ""userAgentVersion"": ""1.50.0"",
    ""clientFeatures"": [""envoy.lb.does_not_support_overprovisioning"", ""xds.config.resource-in-sotw""]
  },
  ""resourceNames"": [""service-a"", ""service-b""],
  ""typeUrl"": ""type.googleapis.com/envoy.config.listener.v3.Listener"",
  ""responseNonce"": ""124""
}"
"2022-10-18 @ 08:05:33.361",DEBUG,"http-nio-8080-exec-5","[xds-client<35>] Subscribe io.grpc.xds.XdsListenerResource@74a2f92e resource service-b"
"2022-10-18 @ 08:05:33.361",DEBUG,"http-nio-8080-exec-5","[xds-client<38>: (xds-resolver:1337)] Sending LDS request for resources: [service-a, service-b]"
"2022-10-18 @ 08:05:46.369",DEBUG,"grpc-timer-0","[xds-client<35>] LDS resource service-b initial fetch timeout"
"2022-10-18 @ 08:05:46.369",DEBUG,"grpc-timer-0","[xds-client<35>] Conclude LDS resource service-b not exist"
"2022-10-18 @ 08:05:46.369",DEBUG,"grpc-timer-0","[xds-resolver<92>: (service-b)] LDS resource does not exist: service-b"
  • this fails because the xDS provider remembered that this client already knows service-b in this version and just adds a watcher for possible upcoming changes:
"2022-10-18 @ 08:05:29.397",debug,"-","open watch 1528 for type.googleapis.com/envoy.config.listener.v3.Listener[service-a service-b] from nodeID """", version ""422bd1d65455a992"""
"2022-10-18 @ 08:05:29.397",debug,"-","nodeID """" requested type.googleapis.com/envoy.config.listener.v3.Listener[service-a service-b] and known map[service-c:{} service-d:{} service-a:{} service-b:{}]. Diff []"

At this stage the client can not communicate with the service anymore.

Workaround

Increase the idle timeout on the channel to some high value. In the hope the client wants to talk to it in this time frame.

Attachements

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions