A71: xDS Fallback #386

Merged
merged 24 commits into from
Jan 4, 2024
119 changes: 119 additions & 0 deletions A71-xds-fallback.md
@@ -0,0 +1,119 @@
A71: xDS Fallback
----
* Author(s): @eostroukhov
* Approver: @markdroth
* Status: {Draft, In Review, Ready for Implementation, Implemented}
* Implemented in: <language, ...>
* Last updated: 2023-07-27
* Discussion at: https://groups.google.com/g/grpc-io/c/07M6Ua7q4Hc

## Abstract

This proposal describes a fallback mechanism for situations when the primary
xDS server is unavailable. xDS support for client-side gRPC is described
in gRFC [A27][A27] and server-side support is described in [A36][A36].

Several specific scenarios need to be considered:
1. The xDS server is not available at the time of the initial connection.
1. The xDS server is no longer available after some resources have been
obtained.

## Background

xDS (also known as xDS Discovery Service) is a suite of APIs for discovering
and subscribing to the configuration of a service mesh. Even a brief downtime
of the xDS control plane may significantly disrupt connectivity between
service mesh components and result in wider outages.

The current xDS implementation in gRPC uses a single global instance of
the XdsClient class to communicate with the xDS control plane. This instance
is shared across all channels to reduce overhead by providing a shared cache
Contributor

Nit: This instance is shared across all channels and all xDS-enabled gRPC server listeners.

Contributor Author

Done.

for xDS resources as well as reducing the number of connections to xDS servers.

s/reducing a number/reducing the number/

Contributor Author

Done


### Related Proposals:
* [A27: xDS-Based Global Load Balancing][A27]
* [A36: xDS-Enabled Servers][A36]
* [A47: xDS Federation][A47]
* [A57: XdsClient Failure Mode Behavior][A57]

[A27]: A27-xds-global-load-balancing.md
[A36]: A36-xds-for-servers.md
[A47]: A47-xds-federation.md
[A57]: A57-xds-client-failure-mode-behavior.md

## Proposal

gRPC code should use fallback iff both:

* the primary server is not reachable
* at least one watcher exists for a resource that is not cached (where

There are 2 cases where a watcher could exist for a non-cached resource

  1. Didn't get a response before primary server became unreachable
  2. Primary server was already unreachable and a new request for a resource is made

The code paths for these situations seem like they would be quite different

Member

I think there probably will be two code-paths that need to be changed here, but I think each one will basically just call the same code to provide the right behavior, so it's really just two ways of triggering the same common code. I don't think there will be much duplication here.

"does not exist" counts as cached).
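
The two conditions above can be read as a single predicate. The following is a minimal sketch, not the actual gRPC implementation: `CacheState`, `ResourceEntry`, and `should_fall_back` are hypothetical names introduced only for illustration, and the sketch assumes that a resource with a registered watcher but no response yet is the only "not cached" state.

```python
from dataclasses import dataclass
from enum import Enum, auto


class CacheState(Enum):
    """Hypothetical cache states for a watched xDS resource."""
    REQUESTED = auto()       # watcher registered, no response yet: not cached
    ACKED = auto()           # resource received: cached
    DOES_NOT_EXIST = auto()  # server reported absence: counts as cached


@dataclass
class ResourceEntry:
    state: CacheState = CacheState.REQUESTED
    watcher_count: int = 0


def should_fall_back(primary_reachable: bool,
                     resources: dict[str, ResourceEntry]) -> bool:
    """Return True iff both conditions from the proposal hold."""
    if primary_reachable:
        return False
    # At least one watcher must exist for a resource that is not cached.
    return any(
        entry.watcher_count > 0 and entry.state is CacheState.REQUESTED
        for entry in resources.values()
    )
```

Both triggers discussed in the review thread (a request left unanswered when the primary became unreachable, and a new watch started while it was already unreachable) would end up evaluating the same predicate in this sketch.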

Instead of using a global XdsClient instance, gRPC will use a shared XdsClient
instance for each data plane target. In other words, if two channels are
created for the target "xds:foo", they will share one XdsClient instance, but
if another channel is created for "xds:bar", it will use a different XdsClient
instance.
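
The per-target sharing described above can be sketched as a registry keyed by the data plane target. This is an illustration under assumptions, not the gRPC code: `XdsClientRegistry` and `get_or_create` are hypothetical names, the `XdsClient` class here is a stand-in that only records its target, and reference counting and teardown are omitted.

```python
import threading


class XdsClient:
    """Stand-in for a real XdsClient; it only records its target."""

    def __init__(self, target: str) -> None:
        self.target = target


class XdsClientRegistry:
    """Hypothetical registry that hands out one XdsClient per target."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._clients: dict[str, XdsClient] = {}

    def get_or_create(self, target: str) -> XdsClient:
        # Channels for the same target share one instance.
        with self._lock:
            client = self._clients.get(target)
            if client is None:
                client = XdsClient(target)
                self._clients[target] = client
            return client


registry = XdsClientRegistry()
foo_1 = registry.get_or_create("xds:foo")
foo_2 = registry.get_or_create("xds:foo")  # same instance as foo_1
bar = registry.get_or_create("xds:bar")    # separate instance
```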

An attempt to fetch a resource for a cached target will result in a failure.

The following changes will be made to the codebase:

1. Add support for multiple xDS servers in the `bootstrap.json`.
2. Update the implementation to support configurations with multiple xDS
   servers.
3. Refactor XdsClient to support multiple instances.
4. Update code relying on XdsClient to work with multiple instances:
   * xDS resource fetching and tracking.
   * xDS stats *[???] (do we want to collect them cross-XdsClients?)*

#### bootstrap.json

Currently `bootstrap.json` supports multiple xDS servers, but the semantics
are not explicitly specified.

1. xDS servers will be attempted in the order they are specified in
`bootstrap.json`. A server will only be attempted if the previous entry in
the list is not available.
1. The xDS client will report a failure if the last entry in the list is not
available.
1. `channel_creds` and any other server attributes are not shared and need
to be defined independently for every server.
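
The ordering rules above can be sketched as a simple loop. This is a minimal illustration, assuming availability can be checked through a caller-supplied probe; `select_server`, `AllServersUnavailable`, and the example host names are hypothetical and not part of gRPC.

```python
from typing import Callable, Sequence


class AllServersUnavailable(Exception):
    """Hypothetical error raised when no listed xDS server is available."""


def select_server(server_uris: Sequence[str],
                  is_available: Callable[[str], bool]) -> str:
    """Try servers in bootstrap order and return the first available one."""
    for uri in server_uris:
        # A later entry is probed only if every earlier entry was unavailable.
        if is_available(uri):
            return uri
    # The last entry was also unavailable: report a failure.
    raise AllServersUnavailable(f"none of {list(server_uris)} is available")
```

For example, with the list `["primary.example.com:443", "fallback.example.com:443"]`, the second entry is probed only after the first one is found to be unavailable.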

*[???] Are there any cases when we revert to a primary server?*

#### Internal xDS bootstrap representation

Currently the internal data structures do not allow for more than a single
xDS server. The implementation needs to be updated to handle multiple servers,
maintaining their fallback order.
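
One possible shape for such a representation is an ordered list of per-server records. The sketch below is an assumption for illustration only: `XdsServer` and `parse_xds_servers` are hypothetical names, the host names are made up, and the field names `xds_servers`, `server_uri`, and `channel_creds` follow the bootstrap format described in [A27][A27].

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class XdsServer:
    """Hypothetical per-server record; attributes are never shared."""
    server_uri: str
    channel_creds: tuple[str, ...]  # credential type names for this server


def parse_xds_servers(bootstrap_text: str) -> list[XdsServer]:
    """Parse all servers, keeping the bootstrap order as the fallback order."""
    doc = json.loads(bootstrap_text)
    return [
        XdsServer(
            server_uri=entry["server_uri"],
            channel_creds=tuple(
                cred["type"] for cred in entry.get("channel_creds", [])
            ),
        )
        for entry in doc["xds_servers"]
    ]


BOOTSTRAP = """
{
  "xds_servers": [
    {"server_uri": "primary.example.com:443",
     "channel_creds": [{"type": "google_default"}]},
    {"server_uri": "fallback.example.com:443",
     "channel_creds": [{"type": "insecure"}]}
  ]
}
"""

servers = parse_xds_servers(BOOTSTRAP)
```

A list keeps the fallback order explicit, so index 0 is always the primary server.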

#### XdsClient class changes

Each language implementation will need to ensure that multiple XdsClient
instances may be created and torn down as needed during the application
lifetime.

### Temporary environment variable protection

This feature will be guarded by the `GRPC_EXPERIMENTAL_XDS_FALLBACK`
environment variable. If this variable is unset or is falsy, only one xDS
server will be read from the bootstrap file.
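
The guard can be sketched as follows. The set of values treated as truthy is an assumption made for this example (the proposal does not list them), and `fallback_enabled` and `effective_servers` are hypothetical helper names.

```python
import os
from typing import Mapping, Sequence

ENV_VAR = "GRPC_EXPERIMENTAL_XDS_FALLBACK"
# Assumption: these spellings count as "true"; anything else is falsy.
_TRUTHY = {"1", "true", "t", "yes", "y"}


def fallback_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True if the experimental fallback feature is switched on."""
    return environ.get(ENV_VAR, "").strip().lower() in _TRUTHY


def effective_servers(servers: Sequence[str],
                      environ: Mapping[str, str] = os.environ) -> list[str]:
    """Keep only the first server unless the feature is enabled."""
    servers = list(servers)
    return servers if fallback_enabled(environ) else servers[:1]
```

Passing the environment as a parameter keeps the helper easy to test without mutating the process environment.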

## Rationale

Other approaches considered:

#### Always switch to a fallback server if the primary is not available

A single XdsClient is used for all targets. All cached resources are purged and
refetched as soon as the primary xDS server becomes unavailable. The major
shortcoming of this approach is that it would put fallback server availability
at risk due to the spike in demand once the primary server goes down and all
the clients try to refetch the resources.

#### Always switch to a fallback server if the primary is not available but only if there's a need to fetch the resources that are missing from the cache.

How is this different from the chosen approach?

Contributor Author

In this case there's only a single XdsClient so all cached resources for all origins would be purged if there's a request for uncached resource.


This is similar to the option above but should decrease the load on the
fallback servers.

This may still result in an unnecessary refetch in some cases.