
Implement HTTPRetry RetryOn config #9840

Merged: 5 commits into istio:release-1.1 on Nov 13, 2018

Conversation

@hzxuzhonghu (Member) commented Nov 9, 2018

  1. Update the istio api dependency.

  2. Add support for the HTTPRetry RetryOn config.

Fixes: #8081
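
For context, a minimal sketch of how the new field might be populated from Go, assuming the gogo-generated istio.io/api types expose Attempts, PerTryTimeout, and the new RetryOn string field (illustrative only, not code from this PR):

package main

import (
	"fmt"
	"time"

	"github.com/gogo/protobuf/types"
	networking "istio.io/api/networking/v1alpha3"
)

func main() {
	// A retry policy as a user might express it on a VirtualService HTTP route:
	// 3 attempts, 2s per try, retrying only on the listed Envoy conditions.
	retry := &networking.HTTPRetry{
		Attempts:      3,
		PerTryTimeout: types.DurationProto(2 * time.Second),
		RetryOn:       "gateway-error,connect-failure,refused-stream",
	}
	fmt.Println(retry.RetryOn)
}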

@istio-testing (Collaborator)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hzxuzhonghu
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: hklai

If they are not already assigned, you can assign the PR to them by writing /assign @hklai in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov bot commented Nov 9, 2018

Codecov Report

Merging #9840 into release-1.1 will increase coverage by 1%.
The diff coverage is 36%.


@@             Coverage Diff              @@
##           release-1.1   #9840    +/-   ##
============================================
+ Coverage           70%     70%    +1%     
============================================
  Files              434     434            
  Lines            40977   41196   +219     
============================================
+ Hits             28577   28762   +185     
- Misses           11027   11061    +34     
  Partials          1373    1373
Impacted Files Coverage Δ
pilot/pkg/networking/core/v1alpha3/route/route.go 80% <0%> (-1%) ⬇️
pilot/pkg/model/validation.go 83% <100%> (+1%) ⬆️
mixer/pkg/protobuf/yaml/resolver.go 95% <0%> (-5%) ⬇️
pilot/pkg/proxy/envoy/v2/lds.go 53% <0%> (-4%) ⬇️
...ter/kubernetesenv/template/template_handler.gen.go 96% <0%> (-1%) ⬇️
mixer/adapter/solarwinds/metrics_handler.go 84% <0%> (ø) ⬇️
mixer/adapter/dogstatsd/dogstatsd.go 100% <0%> (ø) ⬆️
mixer/adapter/rbac/rbac.go 0% <0%> (ø) ⬆️
mixer/adapter/denier/denier.go 100% <0%> (ø) ⬆️
pilot/pkg/proxy/envoy/v2/ads.go 85% <0%> (+1%) ⬆️
... and 8 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f7b9bd2...5597da3. Read the comment docs.

@hzxuzhonghu (Member, Author)

/test istio-unit-tests

1 similar comment
@hzxuzhonghu (Member, Author)

/test istio-unit-tests

@@ -534,10 +534,23 @@ func translateHeaderMatch(name string, in *networking.StringMatch) route.HeaderM
func translateRetryPolicy(in *networking.HTTPRetry) *route.RouteAction_RetryPolicy {
	if in != nil && in.Attempts > 0 {
		d := util.GogoDurationToDuration(in.PerTryTimeout)
		// default retry on condition
		retryOn := "gateway-error"
Contributor

Do you want to force default retries for gateway-error?

Member Author

That's what @rshriram suggested in #8081 (comment).

Contributor

Sorry, what I actually meant to ask is: are you defaulting only to gateway-error? I think some of the gRPC codes, for example unavailable, could be included in the default so that every user/deployment doesn't have to specify them. NBD though.

Member

That's a good point. What do you think are the common conditions for which we should have a default retry policy?
Also, by "default" I mean the case where the user specifies a retry setting in the virtual service but no retry-on policy.

Member

I think this should be connect-failure, refused-stream, gateway-error, unavailable.

@pnambiarsf commented Nov 9, 2018

How about including cancelled and resource-exhausted by default as well?

Member

Sure. Keep in mind that the retries will kick in if and only if you specify the retry setting in the virtual service. At that point you have the option of going with the default retry policy or specifying your own.
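
To make the behaviour discussed in this thread concrete, here is a minimal sketch of how the default could be applied; translateRetryOn is a hypothetical helper, and the only assumption is that HTTPRetry carries the new RetryOn string field added by this PR:

// Sketch only. Assumes: networking "istio.io/api/networking/v1alpha3".
func translateRetryOn(in *networking.HTTPRetry) string {
	// default retry-on condition, as in the diff above
	retryOn := "gateway-error"
	if in != nil && in.RetryOn != "" {
		// honour the user-supplied comma-separated Envoy retry conditions
		retryOn = in.RetryOn
	}
	return retryOn
}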

@googlebot (Collaborator)

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

@googlebot added the cla: no label (set by the Google CLA bot to indicate the author of a PR has not signed the Google CLA) and removed the cla: yes label on Nov 9, 2018
@rshriram (Member) commented Nov 9, 2018

/test e2e-mixer-no_auth

@rshriram (Member) commented Nov 9, 2018

/test istio-unit-tests

1 similar comment
@rshriram (Member) commented Nov 9, 2018

/test istio-unit-tests

@istio-testing (Collaborator) commented Nov 9, 2018

@hzxuzhonghu: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
prow/istio-integ-local-tests.sh 5597da3 link /test istio-integ-local-tests
prow/istio-integ-k8s-tests.sh 5597da3 link /test istio-integ-k8s-tests
prow/e2e-mixer-no_auth-mcp.sh 5597da3 link /test e2e-mixer-no_auth-mcp

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@hzxuzhonghu (Member, Author)

/test istio-unit-tests

@hzxuzhonghu (Member, Author)

@rshriram Is this ready to merge?

@rshriram merged commit 7e13732 into istio:release-1.1 on Nov 13, 2018
@hzxuzhonghu deleted the HTTPRetry branch on November 14, 2018 at 01:23
@nmittler (Contributor)

@hzxuzhonghu I just ran across this ... Thanks for doing this! I think lack of retry is contributing to 503s in #7665.

Do you know if this has been included in an Istio release?

Also, is there a reasonable default, or is retry turned off by default? If the latter, I'm concerned about our "getting started" experience ... any time you bring a pod down you risk experiencing user-facing 503s.

@rshriram @costinm any thoughts here?

@hzxuzhonghu (Member, Author)

@nmittler This feature is not in release-1.0. It is only in release-1.1 and master now.

From the code below, I am sure there is currently no retry by default. And I agree that enabling retryOn by default would be an improvement.

func translateRetryPolicy(in *networking.HTTPRetry) *route.RouteAction_RetryPolicy {
	if in != nil && in.Attempts > 0 {
		...
		return &route.RouteAction_RetryPolicy{
			NumRetries:    &types.UInt32Value{Value: uint32(in.GetAttempts())},
			RetryOn:       retryOn,
			PerTryTimeout: &d,
			RetryHostPredicate: []*route.RouteAction_RetryPolicy_RetryHostPredicate{
				{
					// to configure retries to prefer hosts that haven’t been attempted already,
					// the builtin `envoy.retry_host_predicates.previous_hosts` predicate can be used.
					Name: "envoy.retry_host_predicates.previous_hosts",
				},
			},
			HostSelectionRetryMaxAttempts: 3,
		}
	}
	return nil
}

@frankbu (Contributor) commented Dec 11, 2018

@hzxuzhonghu @rshriram Isn't no retry by default a breaking change since it previously defaulted to "5xx,connect-failure,refused-stream"?

@hzxuzhonghu (Member, Author)

No, we did not break anything. Previously that was the default retryOn, but you still needed to explicitly set the retry attempts.

@nmittler (Contributor) commented Dec 11, 2018

@hzxuzhonghu @frankbu @rshriram @costinm @louiscryan @PiotrSikora @Stono

I think we're all in agreement that the sidecars should (by default) do some sort of retry before propagating 503s back to the client. The question is: what should they do exactly?

  1. RetryOn: what should we retry on by default? The question here revolves around idempotence. I suspect that connect-failure and refused-stream should be safe, regardless. But what about gateway-error (i.e. 502, 503, 504) for non-idempotent requests?

  2. MaxRetries: How many attempts by default? 5? 10? Dynamically determined by the current number of endpoints in the cluster?

@Stono (Contributor) commented Dec 11, 2018 via email

@Stono (Contributor) commented Dec 11, 2018 via email

@nmittler (Contributor)

> Also this solution only helps http. How about grpc?

gRPC is still based on http/2, so I suspect that this solution should work for gRPC as well. This won't work for TCP, however.

> It does still feel like a plaster on a problem even if it is handled transparently. Isn't the root cause of the problem the fact that upstream proxies are trying to send requests to downstream proxies that no longer exist due to the delay in eds propagation?

Yes, that's the issue here. We can get better and propagate EDS faster, but there will always be a delay. I suspect that some number of 503s will be unavoidable during periods of pod churn, so we do need a bit of plaster here, unfortunately.

@hzxuzhonghu (Member, Author)

Making retry the default sounds good to me, but we should consider the Timeout: https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/route/route.proto#envoy-api-field-route-routeaction-timeout

How many retries actually happen is also bounded by the total timeout, RouteAction.Timeout.

> RetryOn: what should we retry on by default? The question here revolves around idempotence. I suspect that connect-failure and refused-stream should be safe, regardless. But what about gateway-error (i.e. 502, 503, 504) for non-idempotent requests?

For non-idempotent requests, gateway-error (i.e. 502, 503, 504) may make things even worse. The 503s mostly occur during periods of pod churn, so we should not simply enable gateway-error by default.

@Stono (Contributor) commented Dec 12, 2018 via email

@Stono (Contributor) commented Dec 12, 2018

BTW, our current retry setup retries on gateway-error, connect-failure, and refused-stream, and we do it with Envoy headers because Istio currently hard-codes retries on any 5xx error (which this PR addresses), which in our opinion is bad.
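
For readers unfamiliar with that approach, a minimal sketch of driving retries per request through Envoy's well-known request headers (x-envoy-retry-on, x-envoy-max-retries), assuming the sidecar honours them; the target URL is a placeholder:

package main

import (
	"log"
	"net/http"
)

func main() {
	// Placeholder in-mesh URL; any service reachable through the sidecar works.
	req, err := http.NewRequest(http.MethodGet, "http://reviews:9080/reviews/1", nil)
	if err != nil {
		log.Fatal(err)
	}
	// Ask Envoy to retry only these conditions, at most 3 times, for this request.
	req.Header.Set("x-envoy-retry-on", "gateway-error,connect-failure,refused-stream")
	req.Header.Set("x-envoy-max-retries", "3")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println(resp.Status)
}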

@nmittler (Contributor)

@hzxuzhonghu

> For non-idempotent requests, gateway-error (i.e. 502, 503, 504) may make things even worse. The 503s mostly occur during periods of pod churn, so we should not simply enable gateway-error by default.

Agreed, we should only retry on 503, connect-failure, and refused-stream.

@hzxuzhonghu @Stono WRT timeouts, do you have a suggestion of how they might be used? Making it a function of EDS updates makes sense for our pod churn/retry use case, but the timeout would also affect successful requests. If we make it too small, we run the risk of cancelling actual long(ish)-running requests. We currently are using the 15s default for all routes ... not sure if that's our magic number or if we need a new one :). Thoughts?

@Stono (Contributor) commented Dec 12, 2018 via email

@nmittler (Contributor) commented Dec 12, 2018

> You don't want a per try timeout because really we are only retrying connection issues, therefore it needs to be very low (say 10ms). 10x tries at 10ms = 100ms, which is less than the time it could take to push out EDS. So we don't achieve much.

Agreed, I think the timeout should be a timeout for the request (defaults to 15s), not each attempt.

> What we actually need is for Envoy to only try an endpoint once if it results in a gateway error, then move to the next endpoint. When it runs out of endpoints, it should fail altogether.

So we could use the number of endpoints in the outbound cluster as the number of retry attempts (I think I had suggested this somewhere above; a rough sketch of this idea follows at the end of this comment). Of course, this would still be limited by the overall request timeout, so depending on how responsive the system is (503s due to connect-failure should return pretty quickly) as well as how many endpoints there are, we may or may not get through them all. But this would get us pretty close to what you're asking, I suspect.

> Is it then possible for it to fetch the latest eds config before outright failing perhaps?

It's an interesting idea. Initially, I'd be worried regarding the extra strain it would be putting on Pilot during a period of pod churn ... especially since Pilot would be getting per-cluster requests from each Envoy. That could pretty quickly blow up. Better to wait for the next push that contains all the updates.
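
Following up on the endpoint-count idea above, a rough, purely illustrative sketch; retryAttemptsForCluster and the cap are hypothetical, not anything Pilot does today:

// retryAttemptsForCluster illustrates the idea of letting each endpoint in the
// outbound cluster be tried roughly once before the request fails outright.
func retryAttemptsForCluster(numEndpoints int) uint32 {
	const maxAttempts = 10 // arbitrary cap to keep retry storms bounded

	if numEndpoints <= 0 {
		return 1 // nothing known about the cluster yet; keep a single retry
	}
	if numEndpoints > maxAttempts {
		return maxAttempts
	}
	return uint32(numEndpoints)
}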

@nmittler (Contributor)

@hzxuzhonghu @rshriram @costinm One thing that seems to be missing here is individual status codes. Envoy separates them out into the RetriableStatusCodes field. One option is to extend our RetryOn field to also accept integers, which would then be interpreted as status codes. WDYT?
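
A minimal sketch of that option, assuming RetryOn stays a comma-separated string and any purely numeric token is treated as an HTTP status code (illustrative only, not the eventual implementation):

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitRetryOn is a hypothetical helper: numeric tokens in the comma-separated
// RetryOn string become retriable status codes, everything else passes through
// as Envoy retry_on conditions.
func splitRetryOn(retryOn string) (conditions string, statusCodes []uint32) {
	var conds []string
	for _, part := range strings.Split(retryOn, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		if code, err := strconv.Atoi(part); err == nil {
			statusCodes = append(statusCodes, uint32(code))
			continue
		}
		conds = append(conds, part)
	}
	if len(statusCodes) > 0 {
		// Envoy only consults the retriable status code list when this condition is set.
		conds = append(conds, "retriable-status-codes")
	}
	return strings.Join(conds, ","), statusCodes
}

func main() {
	conds, codes := splitRetryOn("gateway-error,connect-failure,refused-stream,503")
	fmt.Println(conds, codes)
	// gateway-error,connect-failure,refused-stream,retriable-status-codes [503]
}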

@louiscryan (Contributor)

Pair programming; the CLA is signed by both contributors.

@louiscryan added the cla: yes label and removed the cla: no label on Feb 13, 2019