
gRFC A6: Retries #12

Merged

merged 21 commits into grpc:master from ncteisen:retry-design on Mar 3, 2017

Conversation

@ncteisen (Contributor) commented Feb 11, 2017

gRFC for retries

Authors: ncteisen and ericgribkoff

grpc.io discussion

@ncteisen changed the title from gRPC A6: Retries to gRFC A6: Retries on Feb 11, 2017
@markdroth (Member) left a comment:

Just a few cosmetic issues to deal with here.

A6.md Outdated
* Author(s): [Noah Eisen](https://github.com/ncteisen) and [Eric Gribkoff](https://github.com/ericgribkoff)
* Approver: a11r
* Status: Draft
* Implemented in: All Languages
Member:

I think this line is intended to indicate where this proposal is already implemented. Given that we haven't yet implemented this in any language, I think this field should be blank.

Contributor:

Done

A6.md Outdated
The maximum number of retry attempts per original RPC. This can be specified in the retry policy as follows:

```
'max_retry_attempts': 3
```

Member:

The example for the hedging policy below shows the parameters wrapped in a 'hedging_policy' block. For consistency, the retry examples should be wrapped in a 'retry_policy' block.

Contributor:

Done
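For illustration, the wrapped form requested above might look like this (a sketch of the draft's snake_case naming at this point in the review, not the finalized spec):

```
'retry_policy': {
  'max_retry_attempts': 3
}
```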

A6.md Outdated
1. Retry policy: Retry in *n* ms. If this attempt also fails, retry delay will reset to initial backoff for the following retry (if applicable)
2. Hedging policy: Send next hedged request in *n* ms. Subsequent hedged requests will resume at `n + hedging_delay`

![State Diagram](A6_graphics/StateDiagram.png)
Member:

It would probably be a good idea to include the SVG version of these diagrams as well as the PNG, so that we can edit them in the future if necessary.

Contributor:

@ncteisen - this request got buried on GitHub after I pushed the latest commit. Do you have SVG versions of the diagrams we can add?
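As an aside on the retry timing described in the excerpt above, here is a minimal Go sketch of the backoff computation the gRFC specifies, random(0, min(initial_backoff * multiplier^(n-1), max_backoff)) for the nth attempt; the function and parameter names are illustrative, not from the proposal:

```
package retry

import (
	"math"
	"math/rand"
	"time"
)

// retryDelay returns the delay before the nth retry attempt (n >= 1):
// a random duration in [0, min(initialBackoff*multiplier^(n-1), maxBackoff)).
// Assumes initialBackoff > 0.
func retryDelay(n int, initialBackoff, maxBackoff time.Duration, multiplier float64) time.Duration {
	ceiling := float64(initialBackoff) * math.Pow(multiplier, float64(n-1))
	ceiling = math.Min(ceiling, float64(maxBackoff))
	return time.Duration(rand.Int63n(int64(ceiling)))
}
```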

A6.md Outdated
#### Exposed Retry Metadata
Both client and server application logic will have access to data about retries via gRPC metadata. Upon seeing an RPC from the client, the server will know if it was a retry, and moreover, it will know the number of previously made attempts. Likewise, the client will receive the number of retry attempts made when receiving the results of an RPC.

The new header name for exposing the metadata will be `"x-grpc-retry-attempts"`, giving clients and servers access to the attempt count.
Member:

I assume the value of this field will be a human-readable integer? If so, please explicitly state that.

Contributor:

Done.
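So, for example, the third attempt of a call would carry a header like the following (illustrative value; two attempts were previously made):

```
x-grpc-retry-attempts: 2
```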

A6.md Outdated
When gRPC receives a non-OK response status from a server, this status is checked against the set of retryable status codes to determine if a retry attempt should be made.

```
'retryable_status_codes': {UNAVAILABLE}
```

Member:

For consistency, let's make all examples in this doc be the exact JSON format that will be used in the service config.

Contributor:

Done
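In the exact-JSON form being requested here, that set might be rendered as the following sketch (the precise field naming was settled later in the review):

```
"retryable_status_codes": ["UNAVAILABLE"]
```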

A6.md Outdated
This is an example of a retry policy and its associated configuration. It implements exponential backoff with a maximum of three retry attempts, only retrying RPCs when an `UNAVAILABLE` status code is received.

```
'retry_policy': {
```

Member:

The default conversion from protobuf will change a protobuf field of the form "foo_bar" to a JSON string of the form "fooBar". So let's use that form for all fields in this doc (e.g., "retry_policy" should be "retryPolicy").

In fact, before finalizing this, I would encourage you to write the protobuf definition that will be used internally and then create a proto and see how it converts to JSON, to make sure it's compatible with what we're specifying here.

@lukaszx0 (Contributor) commented Feb 19, 2017:

Unrelated to this PR, but I think it'd be great to have a link to a reference protobuf definition reflecting this JSON in the main service config doc, in the form of an appendix at the bottom. I think a lot of people will use it due to DNS limitations, myself included.

Contributor:

The (mostly finalized) JSON format of the service config with retry, hedging, and throttling policies is now included. I think it's a good idea to include the protobuf definition as well once it's finalized.

Member:

@lukaszx0, can you please chime in on the comment thread for #5 (https://groups.google.com/d/topic/grpc-io/DkweyrWEXxU/discussion) and describe the limitations that would prevent you from using DNS for service configs? I'd like more information on that, but I'd like to keep this discussion in that other thread, since it's not really directly related to the retry design. Thanks!

Contributor:

@markdroth sure, I had it open in a tab but didn't get to it.
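To illustrate the conversion markdroth describes, a snake_case protobuf field surfaces in camelCase in the JSON service config, so the retry example above would become something like this sketch (not the finalized schema):

```
"retryPolicy": {
  "maxRetryAttempts": 3,
  "retryableStatusCodes": ["UNAVAILABLE"]
}
```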

A6.md Outdated

```
{
'loadBalancingPolicy': string,

# Throttling parameters for retry attempts and hedged RPCs.
```
Member:

Please use // instead of # for comments in JSON.

A6.md Outdated

```
# Integer
'multiplier': number
},
# Array of strings. Supported values are the gRPC status codes, e.g.,
```
Member:

I assume that the status codes should be encoded as JSON strings? If so, let's state that explicitly.

Contributor:

Done.

A6.md Outdated

```
# Integer
'hedging_delay_ms': number,
# Array of strings.
# Supported values are the gRPC status codes, e.g.,
```
Member:

Same here.

Contributor:

Done.


## Proposal

### Overview
Member:

Oh, one more thing: This section should probably include a link to #5.

Contributor:

Done

@lukaszx0 (Contributor) left a comment:

I took a first pass and studied it a bit. In general it looks great, very well written! Kudos!

A6.md Outdated

```
'retryable_status_codes': {UNAVAILABLE}
```

In general, only status codes that indicate the service did not process the request should be retried. However, in the case of idempotent methods, a more aggressive set of parameters can be specified, as there is no harm in the server processing the request more than once.
Contributor:

Idempotence is mentioned here, but it's not clear where it comes from. I assume it is expected to be set in the method descriptor - I know there is an API for that in Java, not sure about other languages - if so, maybe clarify it a bit? How about "... in the case of idempotent methods (which can be set in the method descriptor)..."

Contributor:

Updated the doc to clarify. We are actually intentionally not defining a mechanism for service owners to specify that a method is idempotent. An idempotent flag would really just mean "retry on these additional status codes", and it's cleaner to just let the service owner explicitly set the status codes rather than have an idempotent flag.

The use of idempotent here in the doc was just to give an example of when it's safe to expand the set of retryable status codes, and the example has been expanded to make that clearer. The idempotent bit in the Java API is not being used by this retry policy design.
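For example, the owner of a method known to be idempotent might simply opt into a broader set of codes, along these lines (the values are illustrative only, not a recommended set):

```
"retryableStatusCodes": ["UNAVAILABLE", "DEADLINE_EXCEEDED", "INTERNAL"]
```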

A6.md Outdated

```
}
```

When a method has `hedging_params` set, the first RPC is sent immediately, as with a standard non-hedged call. After `hedging_delay` has elapsed without a successful response, the second RPC will be issued. If neither RPC has received a response after `hedging_delay` has elapsed again, a third RPC is sent, and so on, up to `max_requests`. In the above configuration, after 1ms there would be one outstanding hedged RPC, after 501ms there would be two outstanding hedged RPCs, and after 1001ms there would be three outstanding hedged RPCs.
@lukaszx0 (Contributor) commented Feb 19, 2017:

The quoted key names are inconsistent with the example above the paragraph (hedging_params => hedging_policy, hedging_delay => hedging_delay_ms).

Besides that, I wonder if this paragraph should mention the presence of a deadline, which can already be set for a call? It is mentioned above in "Maximum Number of Retries" and below in the "Configuration Language" examples for hedging, but not here, and since we're within the "Detailed Design" section, I think the semantics of hedging in the presence of a call deadline should also be described.

Contributor:

Done and done, thanks for pointing that out.
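For reference, a hedging configuration matching the timeline in the excerpt above (three hedged requests sent 500ms apart) might look like this sketch, using the draft's field names converted to camelCase:

```
"hedgingPolicy": {
  "maxRequests": 3,
  "hedgingDelayMs": 500
}
```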

A6.md Outdated

Throttling may only be specified per server name, rather than per method or per service.

For each server name, gRPC maintains a `token_count` which is initially set to `max_tokens`. Every outgoing RPC (regardless of service or method invoked) will affect `token_count` as follows:
Contributor:

nit: to avoid confusion for future readers, maybe worth being more verbose by saying "... gRPC client maintains..."

Contributor:

Done
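A minimal Go sketch of the client-side throttling state per the scheme the gRFC defines (failures decrement `token_count` by 1, successes increment it by `token_ratio` up to `max_tokens`, and retries or hedges are only sent while `token_count` exceeds the threshold of half of `max_tokens`); not synchronized, illustrative only:

```
package retry

// retryThrottler sketches the per-server-name throttling state.
type retryThrottler struct {
	maxTokens  float64
	tokenRatio float64
	tokenCount float64 // initialized to maxTokens
}

// onFailure records an RPC that failed with a retryable/non-fatal status
// code or received a "do not retry" pushback.
func (t *retryThrottler) onFailure() {
	if t.tokenCount > 0 {
		t.tokenCount--
	}
}

// onSuccess records any other completed RPC.
func (t *retryThrottler) onSuccess() {
	t.tokenCount += t.tokenRatio
	if t.tokenCount > t.maxTokens {
		t.tokenCount = t.maxTokens
	}
}

// allowRetry reports whether a retry attempt or hedged RPC may be sent.
func (t *retryThrottler) allowRetry() bool {
	return t.tokenCount > t.maxTokens/2
}
```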

A6.md Outdated

Neither retry attempts nor hedged RPCs block when `token_count` is less than or equal to the threshold. Retry attempts are canceled and the failure returned to the client application. The hedged request is canceled, and if there are no other already-sent hedged RPCs the failure is returned to the client application.

The only RPCs that are counted as failures for the throttling policy are RPCs that fail with a status code that qualifies as a retryable or non-fatal status code (see [here](#retryable-status-codes) and [here](#hedging-policy)), or that receive a pushback response indicating not to retry. This avoids conflating server failure with responses to malformed requests (such as the `INVALID_ARGUMENT` status code).
Contributor:

nit: I think we could improve UX for future readers by removing the "here" links. How about:

... are RPCs that fail with a status code that qualifies as retryable ([see retryable status codes](#retryable-status-codes)) or as a non-fatal status code (see [hedging policy](#hedging-policy)) ...

or maybe even just:

... are RPCs that fail with a status code that qualifies as [retryable](#retryable-status-codes) or as a [non-fatal status code](#hedging-policy) ...

Contributor:

Done


#### Where Retries Occur

The retry policy will be implemented in-between the channel and the load balancing policy. That way every retry gets a chance to be sent out on a different subchannel than it originally failed on.
Contributor:

Immediately wanted to ask about metadata which the LB could use, but found the answer later. Maybe worth linking to it? Something like "... every retry gets a chance to be sent out on a different subchannel based on the retry metadata, which can be used by the load balancer ..."

Contributor:

This is a good point, but currently the metadata only says if the RPC is a retry. We are considering adding information about the previously used subchannel(s), based on comments from the gRFC email thread. If/once that is added, then adding the link you suggest (based on the retry metadata...) makes sense.

A6.md Outdated

#### Buffered RPCs

The gRPC library will have a configurable amount of available memory, `retry_buffer_size`, to buffer outgoing retryable or hedged RPCs. There is also a per-RPC size limit, `per_rpc_buffer_limit`. These limits are configured by the client, rather than coming from the service config.
Contributor:

I assume both keys' values are in bytes. That should probably be explicitly stated, at least in the docs, or maybe in the key names? Or maybe it's the number of messages to buffer? (If it's regardless of size, that would effectively be unbounded.)

Contributor:

Added in_bytes to the names.


Clients cannot override retry policy set by the service config. However, retry support can be disabled entirely within the gRPC client library. This is designed to enable existing libraries that wrap gRPC with their own retry implementation (such as Veneer Toolkit) to avoid having retries taking place at the gRPC layer and within their own libraries.

Eventually, retry logic should be taken out of the wrapping libraries, and only exist in gRPC. But allowing the retries to be disabled entirely will make that rollout process easier.
Contributor:

Does that mean that support for disabling retries in the implementations will be temporary, or will it always be supported?

Contributor:

I hesitate to say always...but it will always be supported, at least for now :-) We want clients in resource-constrained environments to also be able to disable retries, so this is intended as a feature of the API, not a temporary stopgap.

A6.md Outdated
#### Pushback
Servers may explicitly pushback by setting metadata in their response to the client. The pushback can either tell the client to retry after a given delay or to not retry at all. If the client has already exhausted its `max_retry_attempts`, the call will not be retried even if the server says to retry after a given delay.

A new metadata key, `"x-grpc-retry-pushback-ms"`, will be added to support server pushback. If the value for pushback is set to -1, then it will be seen as the server asking the client not to retry at all.
Contributor:

Before I got to "Throttling Configuration" below, it wasn't really clear to me what the scope of pushback on the server side is. I think it's worth mentioning here, since this section is supposed to describe it in detail, not the examples one below.

Contributor:

I'm not sure what you mean by scope on the server side. Pushback is intended to give service owners a means to programmatically influence retry attempts, whereas everything else (including throttling) is implemented automatically on the client side. Pushback is only related to throttling configuration in that pushback saying not to retry will be counted as a failure.
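As a concrete illustration of pushback from a server's perspective, here is a gRPC-Go-flavored sketch. `grpc.SetTrailer` and `metadata.Pairs` are existing gRPC-Go APIs; the helper itself is hypothetical, and the client-side handling of `x-grpc-retry-pushback-ms` is what this proposal adds:

```
package server

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/metadata"
	"google.golang.org/grpc/status"
)

// failWithPushback rejects the current call as UNAVAILABLE and attaches the
// draft's pushback trailer; a delayMs of "-1" asks the client not to retry.
func failWithPushback(ctx context.Context, delayMs string) error {
	_ = grpc.SetTrailer(ctx, metadata.Pairs("x-grpc-retry-pushback-ms", delayMs))
	return status.Error(codes.Unavailable, "server overloaded; pushing back")
}
```

A unary handler could then `return nil, failWithPushback(ctx, "1000")` when overloaded.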

A6.md Outdated
The only RPCs that are counted as failures for the throttling policy are RPCs that fail with a status code that qualifies as a retryable or non-fatal status code (see [here](#retryable-status-codes) and [here](#hedging-policy)), or that receive a pushback response indicating not to retry. This avoids conflating server failure with responses to malformed requests (such as the `INVALID_ARGUMENT` status code).

#### Pushback
Servers may explicitly pushback by setting metadata in their response to the client. The pushback can either tell the client to retry after a given delay or to not retry at all. If the client has already exhausted its `max_retry_attempts`, the call will not be retried even if the server says to retry after a given delay.
Contributor:

I assume by the mention of `max_retry_attempts` that pushback applies only to the retry policy. Probably should be stated explicitly?

Contributor:

Pushback actually applies to hedging too. This was hidden in the Summary of Retry and Hedging Logic section; now it's explicitly described here as well.

Addressing lukaszx0's comments
Addressing markdroth's comments
A6.md Outdated

After the RPC response has been returned to the client application layer, the RPC is removed from the buffer.

New RPCs which do not fit in the available buffer space (either due to the total available buffer space, or due to the per-RPC limit) will not be retryable. For client-side streaming RPCs, if additional outgoing messages would cause the buffered size to exceed the per-RPC limit, the entire message will be removed from the buffer.
Contributor:

> New RPCs which do not fit in the available buffer space (either due to the total available buffer space, or due to the per-RPC limit) will not be retryable.

Does that mean they'll fail immediately? Or does it mean they will make the original request, not qualify for a retry in case of failure, and fail after that first attempt? I assume it's the former; maybe worth clarifying?

Contributor:

Actually, it is the latter. We don't intend the buffer to limit the overall number or size of outgoing RPCs, so we still send the original request and essentially behave exactly as if there were no retry policy specified for RPCs sent when the buffer is full.

Updated the doc to state this explicitly.

Contributor:

Ah, I see. Makes sense, thanks for the clarification.

@markdroth (Member) left a comment:

Just a few more minor comments. On the whole, this looks great!

A6.md Outdated
@@ -52,7 +52,7 @@ Many teams have implemented their own retry logic wrapped around gRPC like [Vene

### Overview

gRPC will support two configurable retry policies. The service configuration may choose from a retry policy (retry failed RPCs) or a hedging policy (aggressively send the same RPC multiple times in parallel). An individual RPC may be governed by a retry policy or a hedge policy, but not both.
gRPC will support two configurable retry policies. The [service configuration](https://github.com/grpc/proposal/pull/5) may choose from a retry policy (retry failed RPCs) or a hedging policy (aggressively send the same RPC multiple times in parallel). An individual RPC may be governed by a retry policy or a hedge policy, but not both.
Member:

I suggest changing the link for the text "service configuration" to point to https://github.com/grpc/grpc/blob/master/doc/service_config.md, and adding the following after it:

(which may soon be published via DNS)

A6.md Outdated
@@ -214,15 +217,15 @@ To clarify the second scenario, we define an *outgoing message* as everything th

#### Buffered RPCs

The gRPC library will have a configurable amount of available memory, `retry_buffer_size`, to buffer outgoing retryable or hedged RPCs. There is also a per-RPC size limit, `per_rpc_buffer_limit`. These limits are configured by the client, rather than coming from the service config.
The gRPC library will have a configurable amount of available memory, `retry_buffer_size_in_bytes`, to buffer outgoing retryable or hedged RPCs. There is also a per-RPC size limit, `per_rpc_buffer_limit_in_bytes`. These limits are configured by the client, rather than coming from the service config.
Member:

I suggest changing the wording to make it clear that this part of the spec is optional and may differ based on how the client is implemented. Suggested text (feel free to wordsmith):

The gRPC client library may support application-configured limits for the amount of memory used for retries. It is suggested that the client limit both the total amount of memory used to buffer retryable or hedged RPCs as well as the amount of memory used by any one RPC (to prevent a single large RPC from using the whole buffer and preventing retries of subsequent smaller RPCs).

A6.md Outdated
@@ -214,15 +217,15 @@ To clarify the second scenario, we define an *outgoing message* as everything th

#### Buffered RPCs
Member:

Suggest changing this heading to something like "Memory Management".

Contributor:

"Memory Management (Buffering)" maybe?

A6.md Outdated

![State Diagram](A6_graphics/StateDiagram.png)

[Link to diagram on draw.io](https://drive.google.com/a/google.com/file/d/0Bwcp6fL6ikUvOGJWcExrNFhMZ0k/view?usp=sharing)
Contributor:

This won't fly. I'm guessing you linked to your work drive at Google. I think it would be nice if you could export SVGs and add them to the repo alongside the PNGs (looks like draw.io lets you do that).

Contributor Author:

Oops!! Thanks for the catch. I switched all of the diagrams to SVG format :)

Contributor:

Cool, didn't know GH would render it. Nice, thank you!

Member:

I do think it would be better to include the SVG files in this PR instead of linking to an external source. Otherwise, there's no guarantee that someone won't delete the external source without realizing that it's referenced from this document.

Contributor Author:

Haha oops again, reverted back to PNGs, and now there are links to the SVG files too.

Contributor:

Looks good! Thank you :)

@mwitkow left a comment:

Right, having implemented most of what is proposed here as generic gRPC-Go Interceptors (https://github.com/mwitkow/go-grpc-middleware/tree/master/retry) that we use in production, a couple of things:

a) all of this logic can be implemented as interceptor APIs in Go, Java and C++.

b) retry logic would benefit a lot from knowing whether the method is idempotent or not. I understand that this is supposed to be handled by "service configs", but realistically they're really hard to use. Few people would put their retry logic in DNS TXT entries, and even fewer people operate the gRPC LB protocol. Can we consider adding a .proto option (annotation) to Method definitions?

c) One thing I found out useful "in the wild" is the ability to limit the Deadline of the retriable call. For example, the "parent" RPC call (user invoked) has a deadline of 5s, but each retriable call only 1s. This allows you to skip a "deadlining" server and retry against one that works.


In general, only status codes that indicate the service did not process the request should be retried. However, a more aggressive set of parameters can be specified when the service owner knows that a method is idempotent, or safe to be processed more than once by the server. For example, if an RPC to a delete user method fails with an `INTERNAL` error code, it's possible that the user was already deleted before the failure occurred. If the method is idempotent then retrying the call should have no adverse effects. However, it is entirely possible to implement a non-idempotent delete method that has adverse side effects if called multiple times, and such a method should not be retried unless the error codes guarantee the original RPC was not processed. It is up to the service owner to pick the correct set of retryable status codes given the semantics of their service. gRPC retries do not provide a mechanism to specifically mark a method as idempotent.

#### Hedging Policy
@mwitkow:

Can we rename hedging to something else? When I was first reading this, I thought it meant: send X RPCs immediately and return the call once the first one completes, cancelling the others.

Contributor:

Hedging seems like the appropriate name. It can mean what you say (with a hedging delay of 0ms) but also supports sending the RPCs on a fixed delay.

A6.md Outdated

#### Memory Management (Buffering)

The gRPC client library will support application-configured limits for the amount of memory used for retries. It is suggested that the client sets `retry_buffer_size_in_bytes` to limit the total amount of memory used to buffer retryable or hedged RPCs. The client should also set `per_rpc_buffer_limit_in_bytes` to limit the amount of memory used by any one RPC (to prevent a single large RPC from using the whole buffer and preventing retries of subsequent smaller RPCs). These limits are configured by the client, rather than coming from the service config.
@mwitkow:

I believe that this could be made significantly simpler by saying: number of buffered messages per RPC. This makes it easier for users of the library to reason about it: they know how many messages a given number of RPCs will buffer for retries.

Contributor:

This would work if we assumed a fixed size for each RPC. Since this also applies to streaming RPCs, which are effectively unbounded, we need to specify the buffer size in bytes of memory rather than in number of RPCs.


#### Disabling Retries

Clients cannot override retry policy set by the service config. However, retry support can be disabled entirely within the gRPC client library. This is designed to enable existing libraries that wrap gRPC with their own retry implementation (such as Veneer Toolkit) to avoid having retries taking place at the gRPC layer and within their own libraries.
@mwitkow:

This should be tunable per-call.

Contributor:

Our intent is to limit the client-side API surface as much as possible. We have to allow disabling retries altogether, due to existing wrapper libraries and clients in resource-constrained environments, but otherwise we want to keep retry policy the domain of the service owner rather than individual clients.

@ericgribkoff (Contributor) commented:

Hi Michal,

Thank you for the comments. We are trying to keep high-level discussion on the email thread (see here) but my responses to your points (b) and (c) are below.

> b) retry logic would benefit a lot from knowing whether the method is idempotent or not. I understand that this is supposed to be handled by "service configs", but realistically they're really hard to use. Few people would put their retry logic in DNS TXT entries, and even fewer people operate the gRPC LB protocol. Can we consider adding a .proto option (annotation) to Method definitions?

There are two concerns here. One is that saying a method is idempotent really just means "retry on status codes x y and z". If we pick a canonical set of idempotent status codes, we are forcing every service owner to obey these semantics if they want to use retries. However, the gRPC status code mechanism is very flexible (even allowing services to return UNAVAILABLE arbitrarily) so we'd prefer to force service owners to consider the semantics of their application and pick a concrete set of status codes rather than just flipping an "idempotent" switch. The second concern is around the ease of use of the service config. The intent is for the service config to be a universally useful mechanism, and we want to avoid just baking everything into the proto. Concerns about the delivery mechanism for service config shouldn't invalidate its use for encoding retry policy, and may be something we have to tackle separately.

> c) One thing I found out useful "in the wild" is the ability to limit the Deadline of the retriable call. For example, the "parent" RPC call (user invoked) has a deadline of 5s, but each retriable call only 1s. This allows you to skip a "deadlining" server and retry against one that works.

This is covered by our hedging policy. There doesn't seem to be any reason to cancel the first RPC in your scenario, as it may be just about to complete on the server and cancellation implies the work done so far is wasted. Instead, hedging allows you to send the first request, wait one second, send a second request, and accept the results of whichever completes first.
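Concretely, Michal's 5s/1s scenario could be expressed with the call's usual 5s deadline plus a hedging policy along these lines (an illustrative sketch using the draft's field names):

```
"hedgingPolicy": {
  "maxRequests": 5,
  "hedgingDelayMs": 1000
}
```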

A6.md Outdated

The gRPC library will have a configurable amount of available memory, `retry_buffer_size_in_bytes`, to buffer outgoing retryable or hedged RPCs. There is also a per-RPC size limit, `per_rpc_buffer_limit_in_bytes`. These limits are configured by the client, rather than coming from the service config.
The gRPC client library will support application-configured limits for the amount of memory used for retries. It is suggested that the client sets `retry_buffer_size_in_bytes` to limit the total amount of memory used to buffer retryable or hedged RPCs. The client should also set `per_rpc_buffer_limit_in_bytes` to limit the amount of memory used by any one RPC (to prevent a single large RPC from using the whole buffer and preventing retries of subsequent smaller RPCs). These limits are configured by the client, rather than coming from the service config.
Member:

I think this section is still a bit too specific (i.e., "will" instead of "may", and specifying specific names for the two pools). Having discussed the C-core implementation with @ctiller, I think that we will not actually wind up supporting a separate total pool for retries; instead, that will come out of the existing pool used for all of gRPC. However, we will still provide a per-RPC limit, since we want to avoid one large RPC from starving all other RPCs of space for retries.

I think this section should be re-worded to describe the types of constraints that an implementation may want to consider, but it should not mandate any particular mechanism or configuration parameters.

* Clarify that initial metadata commits an RPC
* Remove gRPC wire format change section
* gRPC servers should delay sending headers until first message
* Hedged RPCs will pass 'do not call' list to local LB
A6.md Outdated
@@ -140,7 +140,7 @@ If all instances of a hedged RPC fail, there are no additional retry attempts. E

If server pushback that specifies not to retry is received in response to a hedged request, no further hedged requests should be issued for the call.

Hedged requests should be sent to distinct backends, if possible.
Hedged requests should be sent to distinct backends, if possible. To facilitate this, the gRPC client will maintain a list of previously used backend addresses for each hedged RPC. This list will be passed to the gRPC client's local load-balancing policy. Previously used addresses should be skipped when the load-balancing policy picks a server for the hedged RPC. If all available backend addresses have already been used, the load-balancing policy may pick any backend. This means that service owners who specify more hedged requests than they have available backends may still see hedged requests sent to the same backend.
Member:

I think the language here should be less prescriptive and more about what functionality can be obtained if desired. For example, instead of saying "Previously used addresses should be skipped", we can say something like "The load balancing policy may use this information to send retries to an address that was not previously used if possible". In addition, the last couple of sentences can be changed to indicate that the load-balancing policy will need to decide what to do in the case where there are no addresses that have not already been tried.
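In that less prescriptive spirit, a load-balancing policy might consult the previously-used-address list as in this Go sketch (the types and helper are illustrative, not an actual gRPC LB API):

```
package lb

// pickForHedge prefers an address that this call's earlier hedged attempts
// have not used; if every address has been tried, it falls back to the
// policy's normal pick, which may reuse a backend.
func pickForHedge(addrs []string, used map[string]bool, pick func([]string) string) string {
	var fresh []string
	for _, a := range addrs {
		if !used[a] {
			fresh = append(fresh, a)
		}
	}
	if len(fresh) == 0 {
		return pick(addrs)
	}
	return pick(fresh)
}
```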

@a11r (Member) commented Mar 3, 2017:

Thanks for the detailed design and everyone for the thoughtful feedback. I am approving and merging this PR now. Any follow up discussion can continue on the thread.

@a11r a11r merged commit 47f8bef into grpc:master Mar 3, 2017
@ncteisen ncteisen deleted the retry-design branch March 9, 2017 22:15