
gRFC A6: Retries #12

Merged

merged 21 commits into grpc:master from ncteisen:retry-design on Mar 3, 2017

Conversation

@ncteisen (Contributor) commented Feb 11, 2017

gRFC for retries

Authors: ncteisen and ericgribkoff

grpc.io discussion

@ncteisen changed the title from gRPC A6: Retries to gRFC A6: Retries on Feb 11, 2017
@markdroth (Member) left a comment:

Just a few cosmetic issues to deal with here.

A6.md Outdated
* Author(s): [Noah Eisen](https://github.com/ncteisen) and [Eric Gribkoff](https://github.com/ericgribkoff)
* Approver: a11r
* Status: Draft
* Implemented in: All Languages
Member:

I think this line is intended to indicate where this proposal is already implemented. Given that we haven't yet implemented this in any language, I think this field should be blank.

Contributor:

Done

A6.md Outdated
The maximum number of retry attempts per original RPC. This can be specified in the retry policy as follows:

```
'max_retry_attempts': 3
```

Member:

The example for the hedging policy below shows the parameters wrapped in a 'hedging_policy' block. For consistency, the retry examples should be wrapped in a 'retry_policy' block.

Contributor:

Done
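For illustration, the wrapped form requested above might look like this (a sketch of the draft's snake_case naming at this point in the review, not the finalized spec):

```
'retry_policy': {
  'max_retry_attempts': 3
}
```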

A6.md Outdated
1. Retry policy: Retry in *n* ms. If this attempt also fails, retry delay will reset to initial backoff for the following retry (if applicable)
2. Hedging policy: Send next hedged request in *n* ms. Subsequent hedged requests will resume at `n + hedging_delay`

![State Diagram](A6_graphics/StateDiagram.png)
Member:

It would probably be a good idea to include the SVG version of these diagrams as well as the PNG, so that we can edit them in the future if necessary.

Contributor:

@ncteisen - this request got buried on GitHub after I pushed the latest commit. Do you have SVG versions of the diagrams we can add?
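As an aside on the retry timing described in the excerpt above, here is a minimal Go sketch of the backoff computation the gRFC specifies, random(0, min(initial_backoff * multiplier^(n-1), max_backoff)) for the nth attempt; the function and parameter names are illustrative, not from the proposal:

```
package retry

import (
	"math"
	"math/rand"
	"time"
)

// retryDelay returns the delay before the nth retry attempt (n >= 1):
// a random duration in [0, min(initialBackoff*multiplier^(n-1), maxBackoff)).
// Assumes initialBackoff > 0.
func retryDelay(n int, initialBackoff, maxBackoff time.Duration, multiplier float64) time.Duration {
	ceiling := float64(initialBackoff) * math.Pow(multiplier, float64(n-1))
	ceiling = math.Min(ceiling, float64(maxBackoff))
	return time.Duration(rand.Int63n(int64(ceiling)))
}
```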

A6.md Outdated
#### Exposed Retry Metadata
Both client and server application logic will have access to data about retries via gRPC metadata. Upon seeing an RPC from the client, the server will know if it was a retry, and moreover, it will know the number of previously made attempts. Likewise, the client will receive the number of retry attempts made when receiving the results of an RPC.

The new header name for exposing the metadata will be `"x-grpc-retry-attempts"`, giving clients and servers access to the attempt count.
Member:

I assume the value of this field will be a human-readable integer? If so, please explicitly state that.

Contributor:

Done.
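So, for example, the third attempt of a call would carry a header like the following (illustrative value; two attempts were previously made):

```
x-grpc-retry-attempts: 2
```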

A6.md Outdated
When gRPC receives a non-OK response status from a server, this status is checked against the set of retryable status codes to determine if a retry attempt should be made.

```
'retryable_status_codes': {UNAVAILABLE}
```

Member:

For consistency, let's make all examples in this doc be the exact JSON format that will be used in the service config.

Contributor:

Done
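In the exact-JSON form being requested here, that set might be rendered as the following sketch (the precise field naming was settled later in the review):

```
"retryable_status_codes": ["UNAVAILABLE"]
```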

A6.md Outdated
This is an example of a retry policy and its associated configuration. It implements exponential backoff with a maximum of three retry attempts, only retrying RPCs when an `UNAVAILABLE` status code is received.

```
'retry_policy': {
```

Member:

The default conversion from protobuf will change a protobuf field of the form "foo_bar" to a JSON string of the form "fooBar". So let's use that form for all fields in this doc (e.g., "retry_policy" should be "retryPolicy").

In fact, before finalizing this, I would encourage you to write the protobuf definition that will be used internally and then create a proto and see how it converts to JSON, to make sure it's compatible with what we're specifying here.

@lukaszx0 (Contributor) commented Feb 19, 2017:

Unrelated to this PR, but I think it'd be great to have a link to a reference protobuf definition reflecting this JSON in the main service config doc, in the form of an appendix at the bottom. I think a lot of people will use it due to DNS limitations, myself included.

Contributor:

The (mostly finalized) JSON format of the service config with retry, hedging, and throttling policies is now included. I think it's a good idea to include the protobuf definition as well once it's finalized.

Member:

@lukaszx0, can you please chime in on the comment thread for #5 (https://groups.google.com/d/topic/grpc-io/DkweyrWEXxU/discussion) and describe the limitations that would prevent you from using DNS for service configs? I'd like more information on that, but I'd like to keep this discussion in that other thread, since it's not really directly related to the retry design. Thanks!

Contributor:

@markdroth sure, I had it open in a tab but didn't get to it.
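To illustrate the conversion markdroth describes, a snake_case protobuf field surfaces in camelCase in the JSON service config, so the retry example above would become something like this sketch (not the finalized schema):

```
"retryPolicy": {
  "maxRetryAttempts": 3,
  "retryableStatusCodes": ["UNAVAILABLE"]
}
```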

A6.md Outdated

```
{
'loadBalancingPolicy': string,

# Throttling parameters for retry attempts and hedged RPCs.
```
Member:

Please use // instead of # for comments in JSON.

A6.md Outdated

```
# Integer
'multiplier': number
},
# Array of strings. Supported values are the gRPC status codes, e.g.,
```
Member:

I assume that the status codes should be encoded as JSON strings? If so, let's state that explicitly.

Contributor:

Done.

A6.md Outdated

```
# Integer
'hedging_delay_ms': number,
# Array of strings.
# Supported values are the gRPC status codes, e.g.,
```
Member:

Same here.

Contributor:

Done.


## Proposal

### Overview
Member:

Oh, one more thing: This section should probably include a link to #5.

Contributor:

Done

@lukaszx0 (Contributor) left a comment:

I took a first pass and studied it a bit. In general it looks great, very well written! Kudos!

A6.md Outdated

```
'retryable_status_codes': {UNAVAILABLE}
```

In general, only status codes that indicate the service did not process the request should be retried. However, in the case of idempotent methods, a more aggressive set of parameters can be specified, as there is no harm in the server processing the request more than once.
Contributor:

Idempotence is mentioned here, but it's not clear where it comes from. I assume it is expected to be set in the method descriptor - I know there is an API for that in Java, not sure about other languages - if so, maybe clarify it a bit? How about "... in the case of idempotent methods (which can be set in the method descriptor)..."

Contributor:

Updated the doc to clarify. We are actually intentionally not defining a mechanism for service owners to specify that a method is idempotent. An idempotent flag would really just mean "retry on these additional status codes", and it's cleaner to just let the service owner explicitly set the status codes rather than have an idempotent flag.

The use of idempotent here in the doc was just to give an example of when it's safe to expand the set of retryable status codes, and the example has been expanded to make that clearer. The idempotent bit in the Java API is not being used by this retry policy design.
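For example, the owner of a method known to be idempotent might simply opt into a broader set of codes, along these lines (the values are illustrative only, not a recommended set):

```
"retryableStatusCodes": ["UNAVAILABLE", "DEADLINE_EXCEEDED", "INTERNAL"]
```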

A6.md Outdated

```
}
```

When a method has `hedging_params` set, the first RPC is sent immediately, as with a standard non-hedged call. After `hedging_delay` has elapsed without a successful response, the second RPC will be issued. If neither RPC has received a response after `hedging_delay` has elapsed again, a third RPC is sent, and so on, up to `max_requests`. In the above configuration, after 1ms there would be one outstanding hedged RPC, after 501ms there would be two outstanding hedged RPCs, and after 1001ms there would be three outstanding hedged RPCs.
@lukaszx0 (Contributor) commented Feb 19, 2017:

The quoted key names are inconsistent with the example above the paragraph (hedging_params => hedging_policy, hedging_delay => hedging_delay_ms).

Besides that, I wonder if this paragraph should mention the presence of a deadline, which can already be set for a call? It is mentioned above in "Maximum Number of Retries" and below in the "Configuration Language" examples for hedging, but not here, and since we're within the "Detailed Design" section, I think the semantics of hedging in the presence of a call deadline should also be described.

Contributor:

Done and done, thanks for pointing that out.
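For reference, a hedging configuration matching the timeline in the excerpt above (three hedged requests sent 500ms apart) might look like this sketch, using the draft's field names converted to camelCase:

```
"hedgingPolicy": {
  "maxRequests": 3,
  "hedgingDelayMs": 500
}
```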

A6.md Outdated

Throttling may only be specified per server name, rather than per method or per service.

For each server name, gRPC maintains a `token_count` which is initially set to `max_tokens`. Every outgoing RPC (regardless of service or method invoked) will affect `token_count` as follows:
Contributor:

nit: to avoid confusion for future readers, maybe worth being more verbose by saying "... gRPC client maintains..."

Contributor:

Done
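A minimal Go sketch of the client-side throttling state per the scheme the gRFC defines (failures decrement `token_count` by 1, successes increment it by `token_ratio` up to `max_tokens`, and retries or hedges are only sent while `token_count` exceeds the threshold of half of `max_tokens`); not synchronized, illustrative only:

```
package retry

// retryThrottler sketches the per-server-name throttling state.
type retryThrottler struct {
	maxTokens  float64
	tokenRatio float64
	tokenCount float64 // initialized to maxTokens
}

// onFailure records an RPC that failed with a retryable/non-fatal status
// code or received a "do not retry" pushback.
func (t *retryThrottler) onFailure() {
	if t.tokenCount > 0 {
		t.tokenCount--
	}
}

// onSuccess records any other completed RPC.
func (t *retryThrottler) onSuccess() {
	t.tokenCount += t.tokenRatio
	if t.tokenCount > t.maxTokens {
		t.tokenCount = t.maxTokens
	}
}

// allowRetry reports whether a retry attempt or hedged RPC may be sent.
func (t *retryThrottler) allowRetry() bool {
	return t.tokenCount > t.maxTokens/2
}
```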

A6.md Outdated

Neither retry attempts nor hedged RPCs block when `token_count` is less than or equal to the threshold. Retry attempts are canceled and the failure returned to the client application. The hedged request is canceled, and if there are no other already-sent hedged RPCs the failure is returned to the client application.

The only RPCs that are counted as failures for the throttling policy are RPCs that fail with a status code that qualifies as a retryable or non-fatal status code (see [here](#retryable-status-codes) and [here](#hedging-policy)), or that receive a pushback response indicating not to retry. This avoids conflating server failure with responses to malformed requests (such as the `INVALID_ARGUMENT` status code).
Contributor:

nit: I think we could improve UX for future readers by removing the "here" links. How about:

... are RPCs that fail with a status code that qualifies as retryable ([see retryable status codes](#retryable-status-codes)) or as a non-fatal status code (see [hedging policy](#hedging-policy)) ...

or maybe even just:

... are RPCs that fail with a status code that qualifies as [retryable](#retryable-status-codes) or as a [non-fatal status code](#hedging-policy) ...

Contributor:

Done


#### Where Retries Occur

The retry policy will be implemented in-between the channel and the load balancing policy. That way every retry gets a chance to be sent out on a different subchannel than it originally failed on.
Contributor:

Immediately wanted to ask about metadata which the LB could use, but found the answer later. Maybe worth linking to it? Something like "... every retry gets a chance to be sent out on a different subchannel based on the retry metadata, which can be used by the load balancer ..."

Contributor:

This is a good point, but currently the metadata only says if the RPC is a retry. We are considering adding information about the previously used subchannel(s), based on comments from the gRFC email thread. If/once that is added, then adding the link you suggest (based on the retry metadata...) makes sense.

A6.md Outdated

#### Buffered RPCs

The gRPC library will have a configurable amount of available memory, `retry_buffer_size`, to buffer outgoing retryable or hedged RPCs. There is also a per-RPC size limit, `per_rpc_buffer_limit`. These limits are configured by the client, rather than coming from the service config.
Contributor:

I assume both keys' values are in bytes. That should probably be explicitly stated, at least in the docs, or maybe in the key names? Or maybe it's the number of messages to buffer? (If it's regardless of size, that would effectively be unbounded.)

Contributor:

Added in_bytes to the names.


Clients cannot override retry policy set by the service config. However, retry support can be disabled entirely within the gRPC client library. This is designed to enable existing libraries that wrap gRPC with their own retry implementation (such as Veneer Toolkit) to avoid having retries taking place at the gRPC layer and within their own libraries.

Eventually, retry logic should be taken out of the wrapping libraries, and only exist in gRPC. But allowing the retries to be disabled entirely will make that rollout process easier.
Contributor:

Does that mean that support for disabling retries in the implementations will be temporary, or will it always be supported?

Contributor:

I hesitate to say always...but it will always be supported, at least for now :-) We want clients in resource-constrained environments to also be able to disable retries, so this is intended as a feature of the API, not a temporary stopgap.

A6.md Outdated
#### Pushback
Servers may explicitly pushback by setting metadata in their response to the client. The pushback can either tell the client to retry after a given delay or to not retry at all. If the client has already exhausted its `max_retry_attempts`, the call will not be retried even if the server says to retry after a given delay.

A new metadata key, `"x-grpc-retry-pushback-ms"`, will be added to support server pushback. If the value for pushback is set to -1, then it will be seen as the server asking the client not to retry at all.
Contributor:

Before I got to "Throttling Configuration" below, it wasn't really clear to me what the scope of pushback on the server side is. I think it's worth mentioning here, since this section is supposed to describe it in detail, not the examples one below.

Contributor:

I'm not sure what you mean by scope on the server side. Pushback is intended to give service owners a means to programmatically influence retry attempts, whereas everything else (including throttling) is implemented automatically on the client side. Pushback is only related to throttling configuration in that pushback saying not to retry will be counted as a failure.
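As a concrete illustration of pushback from a server's perspective, here is a gRPC-Go-flavored sketch. `grpc.SetTrailer` and `metadata.Pairs` are existing gRPC-Go APIs; the helper itself is hypothetical, and the client-side handling of `x-grpc-retry-pushback-ms` is what this proposal adds:

```
package server

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/metadata"
	"google.golang.org/grpc/status"
)

// failWithPushback rejects the current call as UNAVAILABLE and attaches the
// draft's pushback trailer; a delayMs of "-1" asks the client not to retry.
func failWithPushback(ctx context.Context, delayMs string) error {
	_ = grpc.SetTrailer(ctx, metadata.Pairs("x-grpc-retry-pushback-ms", delayMs))
	return status.Error(codes.Unavailable, "server overloaded; pushing back")
}
```

A unary handler could then `return nil, failWithPushback(ctx, "1000")` when overloaded.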

A6.md Outdated
The only RPCs that are counted as failures for the throttling policy are RPCs that fail with a status code that qualifies as a retryable or non-fatal status code (see [here](#retryable-status-codes) and [here](#hedging-policy)), or that receive a pushback response indicating not to retry. This avoids conflating server failure with responses to malformed requests (such as the `INVALID_ARGUMENT` status code).

#### Pushback
Servers may explicitly pushback by setting metadata in their response to the client. The pushback can either tell the client to retry after a given delay or to not retry at all. If the client has already exhausted its `max_retry_attempts`, the call will not be retried even if the server says to retry after a given delay.
Contributor:

I assume by the mention of `max_retry_attempts` that pushback applies only to the retry policy. Probably should be stated explicitly?

Contributor:

Pushback actually applies to hedging too. This was hidden in the Summary of Retry and Hedging Logic section; now it's explicitly described here as well.

Addressing lukaszx0's comments
Addressing markdroth's comments
A6.md Outdated

After the RPC response has been returned to the client application layer, the RPC is removed from the buffer.

New RPCs which do not fit in the available buffer space (either due to the total available buffer space, or due to the per-RPC limit) will not be retryable. For client-side streaming RPCs, if additional outgoing messages would cause the buffered size to exceed the per-RPC limit, the entire message will be removed from the buffer.
Contributor:

> New RPCs which do not fit in the available buffer space (either due to the total available buffer space, or due to the per-RPC limit) will not be retryable.

Does that mean they'll fail immediately? Or does it mean they will make the original request, not qualify for a retry in case of failure, and fail after that first attempt? I assume it's the former; maybe worth clarifying?

Contributor:

Actually, it is the latter. We don't intend the buffer to limit the overall number or size of outgoing RPCs, so we still send the original request and essentially behave exactly as if there were no retry policy specified for RPCs sent when the buffer is full.

Updated the doc to state this explicitly.

Contributor:

Ah, I see. Makes sense, thanks for the clarification.

@markdroth (Member) left a comment:

Just a few more minor comments. On the whole, this looks great!

A6.md Outdated
@@ -52,7 +52,7 @@ Many teams have implemented their own retry logic wrapped around gRPC like [Vene

### Overview

gRPC will support two configurable retry policies. The service configuration may choose from a retry policy (retry failed RPCs) or a hedging policy (aggressively send the same RPC multiple times in parallel). An individual RPC may be governed by a retry policy or a hedge policy, but not both.
gRPC will support two configurable retry policies. The [service configuration](https://github.com/grpc/proposal/pull/5) may choose from a retry policy (retry failed RPCs) or a hedging policy (aggressively send the same RPC multiple times in parallel). An individual RPC may be governed by a retry policy or a hedge policy, but not both.
Member:

I suggest changing the link for the text "service configuration" to point to https://github.com/grpc/grpc/blob/master/doc/service_config.md, and adding the following after it:

(which may soon be published via DNS)

A6.md Outdated
@@ -214,15 +217,15 @@ To clarify the second scenario, we define an *outgoing message* as everything th

#### Buffered RPCs

The gRPC library will have a configurable amount of available memory, `retry_buffer_size`, to buffer outgoing retryable or hedged RPCs. There is also a per-RPC size limit, `per_rpc_buffer_limit`. These limits are configured by the client, rather than coming from the service config.
The gRPC library will have a configurable amount of available memory, `retry_buffer_size_in_bytes`, to buffer outgoing retryable or hedged RPCs. There is also a per-RPC size limit, `per_rpc_buffer_limit_in_bytes`. These limits are configured by the client, rather than coming from the service config.
Member:

I suggest changing the wording to make it clear that this part of the spec is optional and may differ based on how the client is implemented. Suggested text (feel free to wordsmith):

The gRPC client library may support application-configured limits for the amount of memory used for retries. It is suggested that the client limit both the total amount of memory used to buffer retryable or hedged RPCs as well as the amount of memory used by any one RPC (to prevent a single large RPC from using the whole buffer and preventing retries of subsequent smaller RPCs).

A6.md Outdated
@@ -214,15 +217,15 @@ To clarify the second scenario, we define an *outgoing message* as everything th

#### Buffered RPCs
Member:

Suggest changing this heading to something like "Memory Management".

Contributor:

"Memory Management (Buffering)" maybe?

A6.md Outdated

![State Diagram](A6_graphics/StateDiagram.png)

[Link to diagram on draw.io](https://drive.google.com/a/google.com/file/d/0Bwcp6fL6ikUvOGJWcExrNFhMZ0k/view?usp=sharing)
Contributor:

This won't fly. I'm guessing you linked to your work drive at Google. I think it would be nice if you could export SVGs and add them to the repo alongside the PNGs (looks like draw.io lets you do that).

Contributor Author:

Oops!! Thanks for the catch. I switched all of the diagrams to SVG format :)

Contributor:

Cool, didn't know GH would render it. Nice, thank you!

Member:

I do think it would be better to include the SVG files in this PR instead of linking to an external source. Otherwise, there's no guarantee that someone won't delete the external source without realizing that it's referenced from this document.

Contributor Author:

Haha oops again, reverted back to PNGs, and now there are links to the SVG files too.

Contributor:

Looks good! Thank you :)

@mwitkow left a comment:

Right, having implemented most of what is proposed here as generic gRPC-Go Interceptors (https://github.com/mwitkow/go-grpc-middleware/tree/master/retry) that we use in production, a couple of things:

a) all of this logic can be implemented as interceptor APIs in Go, Java and C++.

b) retry logic would benefit a lot from knowing whether the method is idempotent or not. I understand that this is supposed to be handled by "service configs", but realistically they're really hard to use. Few people would put their retry logic in DNS TXT entries, and even fewer people operate the gRPC LB protocol. Can we consider adding a .proto option (annotation) to Method definitions?

c) One thing I found out useful "in the wild" is the ability to limit the Deadline of the retriable call. For example, the "parent" RPC call (user invoked) has a deadline of 5s, but each retriable call only 1s. This allows you to skip a "deadlining" server and retry against one that works.


In general, only status codes that indicate the service did not process the request should be retried. However, a more aggressive set of parameters can be specified when the service owner knows that a method is idempotent, or safe to be processed more than once by the server. For example, if an RPC to a delete user method fails with an `INTERNAL` error code, it's possible that the user was already deleted before the failure occurred. If the method is idempotent then retrying the call should have no adverse effects. However, it is entirely possible to implement a non-idempotent delete method that has adverse side effects if called multiple times, and such a method should not be retried unless the error codes guarantee the original RPC was not processed. It is up to the service owner to pick the correct set of retryable status codes given the semantics of their service. gRPC retries do not provide a mechanism to specifically mark a method as idempotent.

#### Hedging Policy
@mwitkow:

Can we rename hedging to something else? When I was first reading this, I thought it meant: send X RPCs immediately and return the call once the first one completes, cancelling the others.

Contributor:

Hedging seems like the appropriate name. It can mean what you say (with a hedging delay of 0ms) but also supports sending the RPCs on a fixed delay.

A6.md Outdated

#### Memory Management (Buffering)

The gRPC client library will support application-configured limits for the amount of memory used for retries. It is suggested that the client sets `retry_buffer_size_in_bytes` to limit the total amount of memory used to buffer retryable or hedged RPCs. The client should also set `per_rpc_buffer_limit_in_bytes` to limit the amount of memory used by any one RPC (to prevent a single large RPC from using the whole buffer and preventing retries of subsequent smaller RPCs). These limits are configured by the client, rather than coming from the service config.
@mwitkow:

I believe that this could be made significantly simpler by saying: number of buffered messages per RPC. This makes it easier for users of the library to reason about it: they know how many messages a given number of RPCs will buffer for retries.

Contributor:

This would work if we assumed a fixed size for each RPC. Since this also applies to streaming RPCs, which are effectively unbounded, we need to specify the buffer size in bytes of memory rather than in number of RPCs.


#### Disabling Retries

Clients cannot override retry policy set by the service config. However, retry support can be disabled entirely within the gRPC client library. This is designed to enable existing libraries that wrap gRPC with their own retry implementation (such as Veneer Toolkit) to avoid having retries taking place at the gRPC layer and within their own libraries.
@mwitkow:

This should be tunable per-call.

Contributor:

Our intent is to limit the client-side API surface as much as possible. We have to allow disabling retries altogether, due to existing wrapper libraries and clients in resource-constrained environments, but otherwise we want to keep retry policy the domain of the service owner rather than individual clients.

@ericgribkoff (Contributor) commented:

Hi Michal,

Thank you for the comments. We are trying to keep high-level discussion on the email thread (see here) but my responses to your points (b) and (c) are below.

> b) retry logic would benefit a lot from knowing whether the method is idempotent or not. I understand that this is supposed to be handled by "service configs", but realistically they're really hard to use. Few people would put their retry logic in DNS TXT entries, and even fewer people operate the gRPC LB protocol. Can we consider adding a .proto option (annotation) to Method definitions?

There are two concerns here. One is that saying a method is idempotent really just means "retry on status codes x y and z". If we pick a canonical set of idempotent status codes, we are forcing every service owner to obey these semantics if they want to use retries. However, the gRPC status code mechanism is very flexible (even allowing services to return UNAVAILABLE arbitrarily) so we'd prefer to force service owners to consider the semantics of their application and pick a concrete set of status codes rather than just flipping an "idempotent" switch. The second concern is around the ease of use of the service config. The intent is for the service config to be a universally useful mechanism, and we want to avoid just baking everything into the proto. Concerns about the delivery mechanism for service config shouldn't invalidate its use for encoding retry policy, and may be something we have to tackle separately.

> c) One thing I found out useful "in the wild" is the ability to limit the Deadline of the retriable call. For example, the "parent" RPC call (user invoked) has a deadline of 5s, but each retriable call only 1s. This allows you to skip a "deadlining" server and retry against one that works.

This is covered by our hedging policy. There doesn't seem to be any reason to cancel the first RPC in your scenario, as it may be just about to complete on the server and cancellation implies the work done so far is wasted. Instead, hedging allows you to send the first request, wait one second, send a second request, and accept the results of whichever completes first.
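Concretely, Michal's 5s/1s scenario could be expressed with the call's usual 5s deadline plus a hedging policy along these lines (an illustrative sketch using the draft's field names):

```
"hedgingPolicy": {
  "maxRequests": 5,
  "hedgingDelayMs": 1000
}
```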

A6.md Outdated

The gRPC library will have a configurable amount of available memory, `retry_buffer_size_in_bytes`, to buffer outgoing retryable or hedged RPCs. There is also a per-RPC size limit, `per_rpc_buffer_limit_in_bytes`. These limits are configured by the client, rather than coming from the service config.
The gRPC client library will support application-configured limits for the amount of memory used for retries. It is suggested that the client sets `retry_buffer_size_in_bytes` to limit the total amount of memory used to buffer retryable or hedged RPCs. The client should also set `per_rpc_buffer_limit_in_bytes` to limit the amount of memory used by any one RPC (to prevent a single large RPC from using the whole buffer and preventing retries of subsequent smaller RPCs). These limits are configured by the client, rather than coming from the service config.
Member:

I think this section is still a bit too specific (i.e., "will" instead of "may", and specifying specific names for the two pools). Having discussed the C-core implementation with @ctiller, I think that we will not actually wind up supporting a separate total pool for retries; instead, that will come out of the existing pool used for all of gRPC. However, we will still provide a per-RPC limit, since we want to avoid one large RPC from starving all other RPCs of space for retries.

I think this section should be re-worded to describe the types of constraints that an implementation may want to consider, but it should not mandate any particular mechanism or configuration parameters.

* Clarify that initial metadata commits an RPC
* Remove gRPC wire format change section
* gRPC servers should delay sending headers until first message
* Hedged RPCs will pass 'do not call' list to local LB
A6.md Outdated
@@ -140,7 +140,7 @@ If all instances of a hedged RPC fail, there are no additional retry attempts. E

If server pushback that specifies not to retry is received in response to a hedged request, no further hedged requests should be issued for the call.

Hedged requests should be sent to distinct backends, if possible.
Hedged requests should be sent to distinct backends, if possible. To facilitate this, the gRPC client will maintain a list of previously used backend addresses for each hedged RPC. This list will be passed to the gRPC client's local load-balancing policy. Previously used addresses should be skipped when the load-balancing policy picks a server for the hedged RPC. If all available backend addresses have already been used, the load-balancing policy may pick any backend. This means that service owners who specify more hedged requests than they have available backends may still see hedged requests sent to the same backend.
Member:

I think the language here should be less prescriptive and more about what functionality can be obtained if desired. For example, instead of saying "Previously used addresses should be skipped", we can say something like "The load balancing policy may use this information to send retries to an address that was not previously used if possible". In addition, the last couple of sentences can be changed to indicate that the load-balancing policy will need to decide what to do in the case where there are no addresses that have not already been tried.
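In that less prescriptive spirit, a load-balancing policy might consult the previously-used-address list as in this Go sketch (the types and helper are illustrative, not an actual gRPC LB API):

```
package lb

// pickForHedge prefers an address that this call's earlier hedged attempts
// have not used; if every address has been tried, it falls back to the
// policy's normal pick, which may reuse a backend.
func pickForHedge(addrs []string, used map[string]bool, pick func([]string) string) string {
	var fresh []string
	for _, a := range addrs {
		if !used[a] {
			fresh = append(fresh, a)
		}
	}
	if len(fresh) == 0 {
		return pick(addrs)
	}
	return pick(fresh)
}
```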

@a11r (Member) commented Mar 3, 2017:

Thanks for the detailed design and everyone for the thoughtful feedback. I am approving and merging this PR now. Any follow up discussion can continue on the thread.

@a11r a11r merged commit 47f8bef into grpc:master Mar 3, 2017
@ncteisen ncteisen deleted the retry-design branch March 9, 2017 22:15