
Implement HTTPRetry RetryOn config #9840

Merged: 5 commits into istio:release-1.1 on Nov 13, 2018

Conversation

@hzxuzhonghu (Member) commented Nov 9, 2018

  1. Update the istio api dependency.

  2. Add support for the HTTPRetry RetryOn config.

Fixes: #8081
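
For context, a minimal sketch of how the new field might be populated from Go, assuming the gogo-generated istio.io/api types expose Attempts, PerTryTimeout, and the new RetryOn string field (illustrative only, not code from this PR):

package main

import (
	"fmt"
	"time"

	"github.com/gogo/protobuf/types"
	networking "istio.io/api/networking/v1alpha3"
)

func main() {
	// A retry policy as a user might express it on a VirtualService HTTP route:
	// 3 attempts, 2s per try, retrying only on the listed Envoy conditions.
	retry := &networking.HTTPRetry{
		Attempts:      3,
		PerTryTimeout: types.DurationProto(2 * time.Second),
		RetryOn:       "gateway-error,connect-failure,refused-stream",
	}
	fmt.Println(retry.RetryOn)
}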

@istio-testing (Collaborator)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hzxuzhonghu
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: hklai

If they are not already assigned, you can assign the PR to them by writing /assign @hklai in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov bot commented Nov 9, 2018

Codecov Report

Merging #9840 into release-1.1 will increase coverage by 1%.
The diff coverage is 36%.


@@             Coverage Diff              @@
##           release-1.1   #9840    +/-   ##
============================================
+ Coverage           70%     70%    +1%     
============================================
  Files              434     434            
  Lines            40977   41196   +219     
============================================
+ Hits             28577   28762   +185     
- Misses           11027   11061    +34     
  Partials          1373    1373
Impacted Files Coverage Δ
pilot/pkg/networking/core/v1alpha3/route/route.go 80% <0%> (-1%) ⬇️
pilot/pkg/model/validation.go 83% <100%> (+1%) ⬆️
mixer/pkg/protobuf/yaml/resolver.go 95% <0%> (-5%) ⬇️
pilot/pkg/proxy/envoy/v2/lds.go 53% <0%> (-4%) ⬇️
...ter/kubernetesenv/template/template_handler.gen.go 96% <0%> (-1%) ⬇️
mixer/adapter/solarwinds/metrics_handler.go 84% <0%> (ø) ⬇️
mixer/adapter/dogstatsd/dogstatsd.go 100% <0%> (ø) ⬆️
mixer/adapter/rbac/rbac.go 0% <0%> (ø) ⬆️
mixer/adapter/denier/denier.go 100% <0%> (ø) ⬆️
pilot/pkg/proxy/envoy/v2/ads.go 85% <0%> (+1%) ⬆️
... and 8 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f7b9bd2...5597da3. Read the comment docs.

@hzxuzhonghu (Member, Author)

/test istio-unit-tests

1 similar comment
@hzxuzhonghu (Member, Author)

/test istio-unit-tests

@@ -534,10 +534,23 @@ func translateHeaderMatch(name string, in *networking.StringMatch) route.HeaderM
func translateRetryPolicy(in *networking.HTTPRetry) *route.RouteAction_RetryPolicy {
	if in != nil && in.Attempts > 0 {
		d := util.GogoDurationToDuration(in.PerTryTimeout)
		// default retry on condition
		retryOn := "gateway-error"
Contributor

Do you want to force default retries for gateway-error?

Member Author

That's what @rshriram suggested in #8081 (comment).

Contributor

Sorry, what I actually meant to ask is: are you defaulting only to gateway-error? I think some of the gRPC codes, for example unavailable, could be included in the default so that every user/deployment doesn't have to specify them. NBD though.

Member

That's a good point. What do you think are the common conditions for which we should have a default retry policy?
Also, by "default" I mean the case where the user specifies a retry setting in the virtual service but no retry-on policy.

Member

I think this should be connect-failure, refused-stream, gateway-error, unavailable.

@pnambiarsf commented Nov 9, 2018

How about including cancelled and resource-exhausted by default as well?

Member

Sure. Keep in mind that the retries will kick in if and only if you specify the retry setting in the virtual service. At that point you have the option of going with the default retry policy or specifying your own.
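
To make the behaviour discussed in this thread concrete, here is a minimal sketch of how the default could be applied; translateRetryOn is a hypothetical helper, and the only assumption is that HTTPRetry carries the new RetryOn string field added by this PR:

// Sketch only. Assumes: networking "istio.io/api/networking/v1alpha3".
func translateRetryOn(in *networking.HTTPRetry) string {
	// default retry-on condition, as in the diff above
	retryOn := "gateway-error"
	if in != nil && in.RetryOn != "" {
		// honour the user-supplied comma-separated Envoy retry conditions
		retryOn = in.RetryOn
	}
	return retryOn
}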

@googlebot (Collaborator)

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

@googlebot added the cla: no label (set by the Google CLA bot to indicate the author of a PR has not signed the Google CLA) and removed the cla: yes label on Nov 9, 2018
@rshriram (Member) commented Nov 9, 2018

/test e2e-mixer-no_auth

@rshriram (Member) commented Nov 9, 2018

/test istio-unit-tests

1 similar comment
@rshriram (Member) commented Nov 9, 2018

/test istio-unit-tests

@istio-testing (Collaborator) commented Nov 9, 2018

@hzxuzhonghu: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
prow/istio-integ-local-tests.sh 5597da3 link /test istio-integ-local-tests
prow/istio-integ-k8s-tests.sh 5597da3 link /test istio-integ-k8s-tests
prow/e2e-mixer-no_auth-mcp.sh 5597da3 link /test e2e-mixer-no_auth-mcp

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@hzxuzhonghu (Member, Author)

/test istio-unit-tests

@hzxuzhonghu (Member, Author)

@rshriram Is this ready to merge?

@rshriram merged commit 7e13732 into istio:release-1.1 on Nov 13, 2018
@hzxuzhonghu deleted the HTTPRetry branch on November 14, 2018 at 01:23
@nmittler (Contributor)

@hzxuzhonghu I just ran across this ... Thanks for doing this! I think lack of retry is contributing to 503s in #7665.

Do you know if this has been included in an Istio release?

Also, is there a reasonable default, or is retry turned off by default? If the latter, I'm concerned about our "getting started" experience ... any time you bring a pod down you risk experiencing user-facing 503s.

@rshriram @costinm any thoughts here?

@hzxuzhonghu (Member, Author)

@nmittler This feature is not in release-1.0. It is only in release-1.1 and master now.

From the code below, I am sure there is currently no retry by default. And I agree that enabling retryOn by default would be an improvement.

func translateRetryPolicy(in *networking.HTTPRetry) *route.RouteAction_RetryPolicy {
	if in != nil && in.Attempts > 0 {
		...
		return &route.RouteAction_RetryPolicy{
			NumRetries:    &types.UInt32Value{Value: uint32(in.GetAttempts())},
			RetryOn:       retryOn,
			PerTryTimeout: &d,
			RetryHostPredicate: []*route.RouteAction_RetryPolicy_RetryHostPredicate{
				{
					// to configure retries to prefer hosts that haven’t been attempted already,
					// the builtin `envoy.retry_host_predicates.previous_hosts` predicate can be used.
					Name: "envoy.retry_host_predicates.previous_hosts",
				},
			},
			HostSelectionRetryMaxAttempts: 3,
		}
	}
	return nil
}

@frankbu (Contributor) commented Dec 11, 2018

@hzxuzhonghu @rshriram Isn't no retry by default a breaking change since it previously defaulted to "5xx,connect-failure,refused-stream"?

@hzxuzhonghu (Member, Author)

No, we did not break anything. Previously that was the default retryOn, but you still needed to explicitly set the retry attempts.

@nmittler (Contributor) commented Dec 11, 2018

@hzxuzhonghu @frankbu @rshriram @costinm @louiscryan @PiotrSikora @Stono

I think we're all in agreement that the sidecars should (by default) do some sort of retry before propagating 503s back to the client. The question is: what should they do exactly?

  1. RetryOn: what should we retry on by default? The question here revolves around idempotence. I suspect that connect-failure and refused-stream should be safe, regardless. But what about gateway-error (i.e. 502, 503, 504) for non-idempotent requests?

  2. MaxRetries: How many attempts by default? 5? 10? Dynamically determined by the current number of endpoints in the cluster?

@Stono (Contributor) commented Dec 11, 2018 via email

@Stono (Contributor) commented Dec 11, 2018 via email

@nmittler (Contributor)

> Also this solution only helps http. How about grpc?

gRPC is still based on http/2, so I suspect that this solution should work for gRPC as well. This won't work for TCP, however.

> It does still feel like a plaster on a problem even if it is handled transparently. Isn't the root cause of the problem the fact that upstream proxies are trying to send requests to downstream proxies that no longer exist due to the delay in eds propagation?

Yes, that's the issue here. We can get better and propagate EDS faster, but there will always be a delay. I suspect that some number of 503s will be unavoidable during periods of pod churn, so we do need a bit of plaster here, unfortunately.

@hzxuzhonghu (Member, Author)

Making retry the default sounds good to me, but we should consider the Timeout: https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/route/route.proto#envoy-api-field-route-routeaction-timeout

How many retries actually happen is also bounded by the total timeout, RouteAction.Timeout.

> RetryOn: what should we retry on by default? The question here revolves around idempotence. I suspect that connect-failure and refused-stream should be safe, regardless. But what about gateway-error (i.e. 502, 503, 504) for non-idempotent requests?

For non-idempotent requests, gateway-error (i.e. 502, 503, 504) may make things even worse. The 503s mostly occur during periods of pod churn, so we should not simply enable gateway-error by default.

@Stono (Contributor) commented Dec 12, 2018 via email

@Stono (Contributor) commented Dec 12, 2018

BTW, our current retry setup retries on gateway-error, connect-failure, and refused-stream, and we do it with Envoy headers because Istio currently hard-codes retries on any 5xx error (which this PR addresses), which in our opinion is bad.
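
For readers unfamiliar with that approach, a minimal sketch of driving retries per request through Envoy's well-known request headers (x-envoy-retry-on, x-envoy-max-retries), assuming the sidecar honours them; the target URL is a placeholder:

package main

import (
	"log"
	"net/http"
)

func main() {
	// Placeholder in-mesh URL; any service reachable through the sidecar works.
	req, err := http.NewRequest(http.MethodGet, "http://reviews:9080/reviews/1", nil)
	if err != nil {
		log.Fatal(err)
	}
	// Ask Envoy to retry only these conditions, at most 3 times, for this request.
	req.Header.Set("x-envoy-retry-on", "gateway-error,connect-failure,refused-stream")
	req.Header.Set("x-envoy-max-retries", "3")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println(resp.Status)
}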

@nmittler (Contributor)

@hzxuzhonghu

> For non-idempotent requests, gateway-error (i.e. 502, 503, 504) may make things even worse. The 503s mostly occur during periods of pod churn, so we should not simply enable gateway-error by default.

Agreed, we should only retry on 503, connect-failure, and refused-stream.

@hzxuzhonghu @Stono WRT timeouts, do you have a suggestion of how they might be used? Making it a function of EDS updates makes sense for our pod churn/retry use case, but the timeout would also affect successful requests. If we make it too small, we run the risk of cancelling actual long(ish)-running requests. We currently are using the 15s default for all routes ... not sure if that's our magic number or if we need a new one :). Thoughts?

@Stono (Contributor) commented Dec 12, 2018 via email

@nmittler (Contributor) commented Dec 12, 2018

> You don't want a per try timeout because really we are only retrying connection issues, therefore it needs to be very low (say 10ms). 10x tries at 10ms = 100ms, which is less than the time it could take to push out EDS. So we don't achieve much.

Agreed, I think the timeout should be a timeout for the request (defaults to 15s), not each attempt.

> What we actually need is for Envoy to only try an endpoint once if it results in a gateway error, then move to the next endpoint. When it runs out of endpoints, it should fail altogether.

So we could use the number of endpoints in the outbound cluster as the number of retry attempts (I think I had suggested this somewhere above; a rough sketch of this idea follows at the end of this comment). Of course, this would still be limited by the overall request timeout, so depending on how responsive the system is (503s due to connect-failure should return pretty quickly) as well as how many endpoints there are, we may or may not get through them all. But this would get us pretty close to what you're asking, I suspect.

> Is it then possible for it to fetch the latest eds config before outright failing perhaps?

It's an interesting idea. Initially, I'd be worried regarding the extra strain it would be putting on Pilot during a period of pod churn ... especially since Pilot would be getting per-cluster requests from each Envoy. That could pretty quickly blow up. Better to wait for the next push that contains all the updates.
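
Following up on the endpoint-count idea above, a rough, purely illustrative sketch; retryAttemptsForCluster and the cap are hypothetical, not anything Pilot does today:

// retryAttemptsForCluster illustrates the idea of letting each endpoint in the
// outbound cluster be tried roughly once before the request fails outright.
func retryAttemptsForCluster(numEndpoints int) uint32 {
	const maxAttempts = 10 // arbitrary cap to keep retry storms bounded

	if numEndpoints <= 0 {
		return 1 // nothing known about the cluster yet; keep a single retry
	}
	if numEndpoints > maxAttempts {
		return maxAttempts
	}
	return uint32(numEndpoints)
}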

@nmittler (Contributor)

@hzxuzhonghu @rshriram @costinm One thing that seems to be missing here is individual status codes. Envoy separates them out into the RetriableStatusCodes field. One option is to extend our RetryOn field to also accept integers, which would then be interpreted as status codes. WDYT?
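
A minimal sketch of that option, assuming RetryOn stays a comma-separated string and any purely numeric token is treated as an HTTP status code (illustrative only, not the eventual implementation):

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitRetryOn is a hypothetical helper: numeric tokens in the comma-separated
// RetryOn string become retriable status codes, everything else passes through
// as Envoy retry_on conditions.
func splitRetryOn(retryOn string) (conditions string, statusCodes []uint32) {
	var conds []string
	for _, part := range strings.Split(retryOn, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		if code, err := strconv.Atoi(part); err == nil {
			statusCodes = append(statusCodes, uint32(code))
			continue
		}
		conds = append(conds, part)
	}
	if len(statusCodes) > 0 {
		// Envoy only consults the retriable status code list when this condition is set.
		conds = append(conds, "retriable-status-codes")
	}
	return strings.Join(conds, ","), statusCodes
}

func main() {
	conds, codes := splitRetryOn("gateway-error,connect-failure,refused-stream,503")
	fmt.Println(conds, codes)
	// gateway-error,connect-failure,refused-stream,retriable-status-codes [503]
}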

@louiscryan (Contributor)

Pair programming; the CLA is signed by both contributors.

@louiscryan added the cla: yes label and removed the cla: no label on Feb 13, 2019