A74: xDS Config Tears (#404)

* A74: xDS Config Tears

* add note about ClusterSpecifierPlugin being non-public

* review comments

* stay subscribed to cluster resource while holding cluster ref

* store each cluster separately in XdsConfig

* add optimization to avoid extra JSON encoding/decoding in xds_cluster_impl and xds_override_host

* attempt to clarify wording

* fix section header structure

* update wording

* add mailing list thread

* fix typo

* clean up XdsConfig struct

* update diagrams

* dfawley comments

* cds LB policy should get subscription handle independently of whether cluster is already present in the XdsConfig
markdroth committed Feb 16, 2024
1 parent a1c1bae commit 9240c4b
A74-xds-config-tears.md (new file):

A74: xDS Config Tears
----
* Author(s): @markdroth
* Approver: @ejona86, @dfawley
* Status: {Draft, In Review, Ready for Implementation, Implemented}
* Implemented in: <language, ...>
* Last updated: 2023-12-28
* Discussion at: https://groups.google.com/g/grpc-io/c/ifcC3DbopWM

## Abstract

xDS configuration is split across four main resource types: listener,
route configuration, cluster, and endpoint. Today, these different
resource types are watched in different parts of the gRPC channel:
the listener and route configuration are watched in the xds resolver,
whereas the cluster and endpoint resources are watched in the LB policy
tree (in the cds and xds_cluster_resolver LB policies, respectively).
This split prevents us from applying config changes to the channel in
an atomic way, meaning that we can see config tears, which can lead
to unnecessary latency and even spurious RPC failures. This document
proposes a design to fix that problem.

## Background

The way that we currently split the xDS resource watches between the
resolver and the LB policy tree means that we cannot apply config updates
to the gRPC channel as an atomic whole. However, there are cases where a
control plane needs to send a config change that spans multiple resources.

At a minimum, this can impose unnecessary latency on RPCs.
For example, the xds resolver may apply a new route configuration that
references a new cluster, even though the client has not yet obtained
the resource for the new cluster. As soon as the xds resolver applies
that change, the ConfigSelector will start using the new route
configuration, which means that it will start sending RPCs to the new
cluster. However, those RPCs may be delayed until the cds LB policy
obtains the cluster resource.

In some edge cases, this can even cause us to incorrectly fail RPCs.
For example, consider a case using xDS fallback (see [A71]), where the
primary server and the fallback server use different resource names
for the cluster resources. If the client is using the data from the
fallback server and then the primary server comes back up, the client
may see the does-not-exist notification for the cluster name from the
fallback server before it sees the new route configuration that points
to the correct cluster name, thus causing us to temporarily fail RPCs
routed to that cluster.

This document proposes a design to fix this by moving all of the
resource watches into the xds resolver.

### Related Proposals:
* [A27: xDS Global Load Balancing][A27]
* [A28: xDS Traffic Splitting and Routing][A28]
* [A29: xDS mTLS Security][A29]
* [A31: xDS Timeout Support and Config Selector Design][A31]
* [A37: xDS Aggregate and Logical DNS Clusters][A37]
* [A55: xDS-Based Stateful Session Affinity for Proxyless gRPC][A55]
* [A56: Priority LB Policy][A56]
* [A71: xDS Fallback][A71]

[A27]: A27-xds-global-load-balancing.md
[A28]: A28-xds-traffic-splitting-and-routing.md
[A29]: A29-xds-tls-security.md
[A31]: A31-xds-timeout-support-and-config-selector.md
[A37]: A37-xds-aggregate-and-logical-dns-clusters.md
[A55]: A55-xds-stateful-session-affinity.md
[A56]: A56-priority-lb-policy.md
[A71]: A71-xds-fallback.md

## Proposal

This proposal has several parts:
- We will move all resource watches into the xds resolver.
- The complete xDS config will be passed to the LB policies as an
attribute.
- The cds LB policy will be changed to get the configuration for the
cluster from the passed-in attribute rather than starting watches on
the XdsClient.
- The xds_cluster_resolver LB policy will be removed completely, as
obtaining endpoint addresses will now be done in the xds resolver.
The logic for generating the child policy configs for the priority
policy will move into the cds LB policy.
- As an optimization, we will change the xds_cluster_impl and
xds_override_host LB policies to get most of their configuration
parameters from the passed-in attribute rather than from their JSON
configs.
- In some implementations, this may also allow simplifying the XdsCredentials
plumbing by leveraging the config passed to the LB policies instead of
generating a map of mTLS configs in the cds LB policy.

Note that although all of the xDS resource watches will now be done in
the xds resolver, it will still be necessary to pass the XdsClient to
the LB policies, because the xds_cluster_impl policy needs to access the
XdsClient for load reporting.

The updated gRPC client xDS architecture is shown below.

![gRPC Client xDS Architecture Diagram](A74_graphics/grpc_client_architecture.png)

[Link to SVG file](A74_graphics/grpc_client_architecture.svg)

### Changes in xds Resolver

We will introduce a new subcomponent in the xds resolver called the
XdsDependencyManager, which sits in between the xds resolver and the
XdsClient. The XdsDependencyManager will watch all xDS resources,
resolving dependencies between them, and returning a complete
configuration to the xds resolver only once all necessary dependencies
have been obtained. For example, if the XdsDependencyManager receives an
updated route configuration that refers to a new cluster, it will start
a watch for the new cluster. If that cluster requires a new endpoint
resource, it will start a watch for that endpoint resource. Only once
it has all necessary resources will it return a new configuration to
the xds resolver.
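To make the "only once it has all necessary resources" behavior
concrete, here is a minimal sketch of the kind of completeness check the
XdsDependencyManager might perform before reporting an update. The type
and function names are illustrative assumptions, not part of this design.

```c++
#include <map>
#include <string>

// Illustrative only: simplified per-cluster watch state tracked by the
// XdsDependencyManager.  All names here are assumptions, not from this gRFC.
struct ClusterWatchState {
  bool have_cluster_result = false;   // CDS update or error received.
  bool needs_endpoints = false;       // EDS or LOGICAL_DNS cluster.
  bool have_endpoint_result = false;  // Endpoint data or error received.
};

// Returns true only when every watched resource has reached a terminal
// state, i.e., only when a complete XdsConfig can be reported to the
// xds resolver.
bool HaveCompleteConfig(
    bool have_listener, bool have_route_config,
    const std::map<std::string, ClusterWatchState>& clusters) {
  if (!have_listener || !have_route_config) return false;
  for (const auto& entry : clusters) {
    const ClusterWatchState& state = entry.second;
    if (!state.have_cluster_result) return false;
    if (state.needs_endpoints && !state.have_endpoint_result) return false;
  }
  return true;
}
```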

The XdsDependencyManager will be passed the listener resource name when
it is instantiated, and it will immediately start a watch for that
listener resource on the XdsClient. If the route configuration is not
inlined into the listener, the XdsDependencyManager will start a watch
for the route configuration resource. It will also be passed the data
plane authority, which it will use to select the right virtual host from
the route configuration.
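As an illustration of how the data plane authority might be used, the
following is a highly simplified sketch of virtual host selection. It
handles only exact matches and the "*" catch-all; the real xDS
domain-matching rules (wildcard prefixes and suffixes, most-specific
match ordering) are more involved, and all names here are assumptions.

```c++
#include <string>
#include <vector>

// Illustrative only: simplified virtual host selection.
struct VirtualHost {
  std::vector<std::string> domains;
  // Routes elided.
};

const VirtualHost* SelectVirtualHost(
    const std::vector<VirtualHost>& virtual_hosts,
    const std::string& data_plane_authority) {
  const VirtualHost* catch_all = nullptr;
  for (const VirtualHost& vhost : virtual_hosts) {
    for (const std::string& domain : vhost.domains) {
      if (domain == data_plane_authority) return &vhost;  // Exact match.
      if (domain == "*" && catch_all == nullptr) catch_all = &vhost;
    }
  }
  return catch_all;  // Null if no virtual host matched.
}
```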

The XdsDependencyManager will get the list of clusters from the routes
in the selected virtual host and will start a watch for each of them.
It will also start watches for clusters that are determined dynamically
(see below). For EDS clusters, the XdsDependencyManager will start a
watch for the required endpoint resource. For logical DNS clusters
(see [A37]), the XdsDependencyManager will create a DNS resolver to
obtain endpoint addresses. For aggregate clusters (see [A37]), the
XdsDependencyManager will resolve the dependency graph, starting watches
for dependent clusters as needed.
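The per-cluster-type handling described above might look roughly like
the following sketch, which maps a received cluster resource to the
additional dependencies that must be resolved. The types and field names
are illustrative, not the actual resource representation.

```c++
#include <string>
#include <vector>

// Illustrative only: not the actual cluster resource representation.
enum class ClusterType { kEds, kLogicalDns, kAggregate };

struct ClusterResource {
  ClusterType type;
  std::string eds_service_name;                  // EDS clusters.
  std::string dns_hostname;                      // LOGICAL_DNS clusters.
  std::vector<std::string> child_cluster_names;  // Aggregate clusters.
};

// The additional dependencies that must be resolved before this cluster's
// part of the XdsConfig is complete.
struct ClusterDependencies {
  std::vector<std::string> eds_resources_to_watch;
  std::vector<std::string> dns_targets_to_resolve;
  std::vector<std::string> clusters_to_watch;
};

ClusterDependencies DependenciesForCluster(const ClusterResource& cluster) {
  ClusterDependencies deps;
  switch (cluster.type) {
    case ClusterType::kEds:
      deps.eds_resources_to_watch.push_back(cluster.eds_service_name);
      break;
    case ClusterType::kLogicalDns:
      deps.dns_targets_to_resolve.push_back(cluster.dns_hostname);
      break;
    case ClusterType::kAggregate:
      // Each child is watched in turn, recursively resolving the
      // aggregate cluster's dependency graph.
      deps.clusters_to_watch = cluster.child_cluster_names;
      break;
  }
  return deps;
}
```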

The XdsDependencyManager will provide a watcher API similar to the one
provided by the XdsClient, with a few key differences. The watcher
API will look like this (C++ syntax, various implementation details elided):

```c++
class Watcher {
 public:
  virtual ~Watcher() = default;

  // Called when a new xDS config is available.
  virtual void OnUpdate(XdsConfig config) = 0;

  // These methods are invoked when there is an error or
  // does-not-exist on LDS or RDS only. The context will be a
  // human-readable string indicating the scope in which the error
  // occurred (e.g., the resource type and name).
  virtual void OnError(absl::string_view context, absl::Status status) = 0;
  virtual void OnResourceDoesNotExist(std::string context) = 0;
};
```
The xds resolver will implement this watcher interface and pass the
watcher object to the XdsDependencyManager at instantiation time.
The updates delivered to this watcher will contain a complete xDS
configuration, including all necessary resources. The xDS configuration
will look something like this (C++ syntax, various implementation
details elided):
```c++
struct XdsConfig {
  // Listener resource.
  XdsListenerResource listener;
  // RouteConfig resource. Will be populated even if RouteConfig is
  // inlined into the Listener resource.
  XdsRouteConfigResource route_config;
  // Virtual host. Points into route_config. Will always be non-null.
  XdsRouteConfigResource::VirtualHost* virtual_host;
  struct ClusterConfig {
    // Cluster resource.
    XdsClusterResource cluster;
    // Endpoint info for EDS and LOGICAL_DNS clusters. If there was an
    // error, endpoints will be null and resolution_note will be set.
    struct EndpointConfig {
      XdsEndpointResource endpoints;
      std::string resolution_note;
    };
    // The list of leaf clusters for an aggregate cluster.
    struct AggregateConfig {
      std::vector<absl::string_view> leaf_clusters;
    };
    std::variant<EndpointConfig, AggregateConfig> children;
  };
  // Cluster map. A cluster will have a non-OK status if either
  // (a) there was an error and we did not already have a valid
  // resource or (b) the resource does not exist.
  std::map<std::string, absl::StatusOr<ClusterConfig>> clusters;
};
```
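As a usage example of the struct above (not part of the gRFC itself), a
consumer such as the cds LB policy might look up its cluster roughly as
follows; FindCluster() is a hypothetical helper, and the sketch assumes
the XdsConfig definition shown above.

```c++
#include <string>

#include "absl/status/status.h"
#include "absl/status/statusor.h"
#include "absl/strings/str_cat.h"

// Illustrative usage only; FindCluster() is not an API defined by this gRFC.
absl::StatusOr<const XdsConfig::ClusterConfig*> FindCluster(
    const XdsConfig& config, const std::string& cluster_name) {
  auto it = config.clusters.find(cluster_name);
  if (it == config.clusters.end()) {
    // Possible transiently for dynamically subscribed clusters; see the
    // cds LB policy changes below.
    return absl::NotFoundError(
        absl::StrCat("cluster ", cluster_name, " not in XdsConfig"));
  }
  if (!it->second.ok()) return it->second.status();  // Per-cluster error.
  return &*it->second;
}
```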

Note that the watcher's OnError() and OnResourceDoesNotExist()
methods will be invoked only for problems with the listener and route
configuration resources. For cluster-specific errors, we need to
retain the existing behavior: the error affects only the individual
cluster and does not affect traffic sent to any other cluster.
Therefore, when the XdsDependencyManager sees an error on a cluster for
which it did not previously have data, or if it sees a does-not-exist
notification on a cluster, it will report that by returning a non-OK
status for the cluster in the XdsConfig. Similarly, if there is a problem
obtaining endpoint data from EDS or DNS, the XdsDependencyManager will
report that problem via the cluster's resolution_note field.
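These per-cluster error semantics could be realized roughly as in the
sketch below. In practice the XdsDependencyManager keeps internal
per-cluster state and rebuilds the XdsConfig on each update; this sketch
collapses that into a single hypothetical helper for brevity, and the
helper name is an assumption.

```c++
#include <string>

#include "absl/status/status.h"

// Illustrative only.  A plain error keeps previously valid cluster data;
// a does-not-exist notification (or an error with no prior data) results
// in a non-OK entry in the cluster map.  `status` is assumed to be non-OK.
void RecordClusterProblem(const std::string& cluster_name,
                          const absl::Status& status, bool does_not_exist,
                          XdsConfig* config) {
  auto it = config->clusters.find(cluster_name);
  if (!does_not_exist && it != config->clusters.end() && it->second.ok()) {
    return;  // Keep the previously valid resource.
  }
  config->clusters[cluster_name] = status;
}
```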

The xds resolver will pass the returned XdsConfig to the LB policies via
an attribute, so that the LB policies can use the cluster data. Note that
we cannot simply pass the data via the cds LB policy's JSON config,
because we need this to work in cases where the cluster is determined
dynamically, such as where a route uses a ClusterSpecifierPlugin, in which
case the cds LB policy's config comes from the ClusterSpecifierPlugin,
not from the xds resolver. (We have some internal functionality that
uses this, not something that has been documented in a public gRFC.)

In order to handle such cases, the XdsDependencyManager will also
provide an API to subscribe to a particular cluster resource whose
name was not present in the route configuration. The xds resolver
will pass the XdsDependencyManager (or an interface wrapping it, if
the implementation does not want to expose XdsDependencyManager outside
of the xds resolver) to the LB policies via an attribute, so that the
cds LB policy can use this API. The API will return some subscription
handle that the caller can release when they are no longer interested
in the cluster subscription.
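One possible shape for this subscription API is sketched below; the
class and method names are assumptions, and implementations may expose
it differently (e.g., behind an interface wrapping the
XdsDependencyManager, as noted above).

```c++
#include <memory>
#include <string>

// Illustrative only: one possible shape for the dynamic cluster
// subscription API.
class XdsDependencyManagerInterface {
 public:
  // Opaque RAII handle.  Releasing it tells the XdsDependencyManager that
  // the caller is no longer interested in the cluster, at which point the
  // underlying watch may be stopped.
  class ClusterSubscription {
   public:
    virtual ~ClusterSubscription() = default;
  };

  virtual ~XdsDependencyManagerInterface() = default;

  // Subscribes to a cluster that is not reachable from the route
  // configuration (e.g., one selected via a ClusterSpecifierPlugin).
  // Subsequent XdsConfig updates will include an entry for this cluster.
  virtual std::unique_ptr<ClusterSubscription> SubscribeToCluster(
      const std::string& cluster_name) = 0;
};
```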

As per [gRFC A31][A31], the ConfigSelector gives each RPC a ref to the
cluster that was selected for it to ensure that the cluster is not
removed from the xds_cluster_manager LB policy config before the RPC is
done with its LB picks. These cluster refs will also hold a
subscription for the cluster from the XdsDependencyManager, so that the
XdsDependencyManager will not stop watching the cluster resource until
the cluster is removed from the xds_cluster_manager LB policy config.

Note that in the XdsConfig, the cluster map will contain an entry for
every cluster that is either referenced in the route configuration,
referenced in an aggregate cluster, or for which a dynamic subscription
was registered.

### Changes in cds LB policy

Instead of starting a watch on the XdsClient to get the cluster data,
the cds LB policy will get its configuration by looking up the cluster
name in the XdsConfig passed to it via an attribute.

To address the case where the cluster is determined dynamically (e.g., a
route that uses a ClusterSpecifierPlugin), we will add a new field to
the cds LB policy's configuration:

```
// Set this to true for a cluster that is determined dynamically
// rather than being present in the route configuration.
bool is_dynamic = 2;
```

If that field is set to true, then the cds policy will obtain a
subscription handle from the XdsDependencyManager (or the interface
wrapping it) if it has not already done so. This handle will not be
released until the cds policy is destroyed.

Note that when a cds LB policy obtains a dynamic subscription to a
cluster, there may be subsequent updates from the xds resolver that do not
yet include the newly subscribed-to cluster. Therefore, the cds policy
will need to ignore updates in which the XdsConfig does not contain the
cluster. However, this case should not occur for non-dynamic clusters,
because the xds resolver keeps those clusters subscribed until they are
removed from the LB policy config.
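Putting these pieces together, the cds policy's update path might look
roughly like the following sketch, which reuses the hypothetical
FindCluster() helper and subscription interface from the earlier
sketches; none of these names are prescribed by this design.

```c++
#include <memory>
#include <string>
#include <utility>

#include "absl/status/status.h"

// Illustrative only: not the actual cds LB policy implementation.
class CdsLb {
 public:
  struct Config {
    std::string cluster_name;
    bool is_dynamic = false;  // New config field proposed above.
  };

  CdsLb(Config config, XdsDependencyManagerInterface* dependency_manager)
      : config_(std::move(config)), dependency_manager_(dependency_manager) {}

  void OnXdsConfigUpdate(const XdsConfig& xds_config) {
    // A dynamic cluster lazily obtains a subscription handle and holds it
    // until the policy is destroyed.
    if (config_.is_dynamic && subscription_ == nullptr) {
      subscription_ =
          dependency_manager_->SubscribeToCluster(config_.cluster_name);
    }
    auto cluster = FindCluster(xds_config, config_.cluster_name);
    if (!cluster.ok()) {
      // A newly subscribed dynamic cluster may not yet appear in the
      // XdsConfig; ignore such updates.
      if (config_.is_dynamic && absl::IsNotFound(cluster.status())) return;
      // Otherwise report the per-cluster error to the channel
      // (details elided).
      return;
    }
    // Generate the priority policy config from (*cluster)->cluster and the
    // endpoint or aggregate data in (*cluster)->children (details elided).
  }

 private:
  Config config_;
  XdsDependencyManagerInterface* dependency_manager_;
  std::unique_ptr<XdsDependencyManagerInterface::ClusterSubscription>
      subscription_;
};
```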

The code for generating the configuration for the priority policy (see
[A56]) will be moved from the xds_cluster_resolver policy into the
cds policy. Note that this includes the heuristic we use to attempt to
minimize churn in priority child policy names as localities move between
priorities. The cds policy will also take the endpoint information from
the XdsConfig and pass it down to the priority policy as addresses.

### Removing the xds_cluster_resolver LB Policy

The xds_cluster_resolver LB policy (introduced in [A37]) will be removed
completely, since all of its functionality is being moved elsewhere in
this design. The XdsDependencyManager will be responsible for subscribing
to EDS resources and for creating DNS resolvers for Logical DNS clusters.
And the cds LB policy will be responsible for generating the configuration
for the priority policy.

### Changes in xds_cluster_impl and xds_override_host LB policies

Now that we are passing the XdsConfig to the LB policies via an
attribute, the xds_cluster_impl and xds_override_host LB policies can
get the cluster's configuration from the XdsConfig rather than from
their JSON config. This avoids the need for the cds LB policy to
populate most of the fields in the JSON config, and for the
xds_cluster_impl and xds_override_host policies to parse those fields.

In the xds_cluster_impl policy config, we will retain only the
clusterName and childPolicy fields. All other fields are deprecated
and will no longer be used.

In the xds_override_host policy config, we will retain the childPolicy
field and add a clusterName field. The overrideHostStatus field is
deprecated and will no longer be used.

### XdsCredentials Simplification

Because this design calls for passing the XdsConfig down to the LB
policies, some gRPC implementations may be able to leverage that to
simplify the XdsCredentials plumbing (see [A29]).

Because the mTLS config needs to come from the leaf clusters in the
aggregate cluster graph, the cds LB policy currently needs to construct
a map from leaf cluster name to the corresponding mTLS config, and then
when the subchannel connects, it picks the config for the appropriate
underlying cluster.

However, now that we are passing down the entire XdsConfig via an
attribute, we can instead move the XdsCredentials plumbing from the cds
LB policy down to the xds_cluster_impl policy (introduced in [A37]).
The xds_cluster_impl policy already knows which leaf cluster it is for,
so it can look up the mTLS config directly and pass it down to the
subchannels, eliminating the need for the map.

### Temporary environment variable protection

This design does not provide any new functionality that will be enabled
by external I/O, so no environment variable guard is necessary.

## Rationale

In addition to improving reliability in the cases described above, this
design paves the way for possible future changes that would allow the
client to be resilient to bad configs from the control plane, once we
have enough observability infrastructure to make that tractable.

## Implementation

LB policy config fields are updated in
https://github.com/grpc/grpc-proto/pull/140.

Implemented in C-core in https://github.com/grpc/grpc/pull/35011.

Java, Go, and Node will implement this in the future.

## Open issues (if applicable)

N/A