Implement HTTPRetry RetryOn config #9840
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: hzxuzhonghu. If they are not already assigned, you can assign the PR to them by writing /assign in a comment. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
Codecov Report
@@              Coverage Diff               @@
##           release-1.1    #9840    +/-   ##
=============================================
+ Coverage           70%      70%      +1%
=============================================
  Files              434      434
  Lines            40977    41196     +219
=============================================
+ Hits             28577    28762     +185
- Misses           11027    11061      +34
  Partials          1373     1373
Continue to review the full report at Codecov.
Force-pushed from a044a10 to 06000fd.
/test istio-unit-tests
/test istio-unit-tests
@@ -534,10 +534,23 @@ func translateHeaderMatch(name string, in *networking.StringMatch) route.HeaderMatcher
func translateRetryPolicy(in *networking.HTTPRetry) *route.RouteAction_RetryPolicy {
	if in != nil && in.Attempts > 0 {
		d := util.GogoDurationToDuration(in.PerTryTimeout)
		// default retry on condition
		retryOn := "gateway-error"
Do you want to force default retries for gateway-error?
That's what @rshriram suggested in #8081 (comment).
Sorry, I actually meant to ask: are you defaulting only to gateway-error? I think some of the gRPC codes, for example unavailable, could be considered as defaults so that every user/deployment doesn't have to specify them. NBD though.
That’s a good point. What do you think are the common conditions for which we should have a default retry policy?
Also, by "default" I mean the case where the user specifies a retry setting in the virtual service without a retry-on policy.
I think this should be connect failure, refused stream, gateway error, unavailable.
How about including cancelled and resource-exhausted by default as well?
Sure. Keep in mind that the retries will kick in if and only if you specify the retry setting in the virtual service. At that point you have the option of going with the default retry policy or specifying your own.
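To make the behavior being discussed concrete, here is a minimal, self-contained Go sketch of the defaulting logic: no retry policy is produced unless the virtual service specifies retry attempts, a default retry-on string is applied when the user omits one, and a user-supplied value overrides it. The type names and the default condition list are simplified stand-ins based on this thread, not the actual Istio/Envoy proto types or the value the PR finally adopts.

```go
package main

import "fmt"

// Candidate default retry-on conditions mentioned in this thread (illustrative only).
const defaultRetryOn = "connect-failure,refused-stream,gateway-error,unavailable"

// httpRetry is a simplified stand-in for the VirtualService HTTPRetry setting.
type httpRetry struct {
	Attempts int32
	RetryOn  string // comma-separated Envoy retry-on conditions; may be empty
}

// retryPolicy is a simplified stand-in for Envoy's route retry policy.
type retryPolicy struct {
	RetryOn    string
	NumRetries int32
}

// translateRetryPolicy returns nil when no retry setting was given, applies the
// default conditions when RetryOn is empty, and otherwise passes the user's
// conditions through unchanged.
func translateRetryPolicy(in *httpRetry) *retryPolicy {
	if in == nil || in.Attempts <= 0 {
		return nil
	}
	retryOn := defaultRetryOn
	if in.RetryOn != "" {
		retryOn = in.RetryOn
	}
	return &retryPolicy{RetryOn: retryOn, NumRetries: in.Attempts}
}

func main() {
	fmt.Println(translateRetryPolicy(&httpRetry{Attempts: 3}))                                 // defaults apply
	fmt.Println(translateRetryPolicy(&httpRetry{Attempts: 3, RetryOn: "connect-failure,5xx"})) // user override wins
	fmt.Println(translateRetryPolicy(nil))                                                     // no retries at all
}
```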
So there's good news and bad news.
👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) has done so. Everything is all good there.
😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.
Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state.
/test e2e-mixer-no_auth
/test istio-unit-tests
/test istio-unit-tests
@hzxuzhonghu: The following tests failed, say /retest to rerun them all.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/test istio-unit-tests
@rshriram Is this ready to merge?
@hzxuzhonghu I just ran across this ... Thanks for doing this! I think lack of retry is contributing to 503s in #7665. Do you know if this has been included in an Istio release? Also, is there a reasonable default, or is retry turned off by default? If the latter, I'm concerned about our "getting started" experience ... any time you bring a pod down you risk experiencing user-facing 503s.
@nmittler This feature is not in release-1.0; it is only in release-1.1 and master now. From the code, I am sure there is currently no retry by default. And I agree that enabling retries by default is worth considering.
@hzxuzhonghu @rshriram Isn't no retry by default a breaking change since it previously defaulted to "5xx,connect-failure,refused-stream"?
No, we did not break anything. Previously that was the default retryOn, but retry attempts still had to be set explicitly.
@hzxuzhonghu @frankbu @rshriram @costinm @louiscryan @PiotrSikora @Stono I think we're all in agreement that the sidecars should (by default) do some sort of retry before propagating 503s back to the client. The question is: what should they do exactly?
1. First, what should we retry on by default? The question here revolves around idempotence. I suspect that connect-failure and refused-stream should be safe, regardless. But what about gateway-error (i.e. 502, 503, 504) for non-idempotent requests?
2. How many attempts by default? 5? 10? Dynamically determined by the current number of endpoints in the cluster?
Gateway unavailable (503) yes; gateway timeout (504) absolutely not, as you'll indirectly facilitate a DoS.
Also, this solution only helps HTTP. How about gRPC?
It does still feel like a plaster on a problem, even if it is handled transparently. Isn't the root cause of the problem the fact that upstream proxies are trying to send requests to downstream proxies that no longer exist, due to the delay in EDS propagation?
Also, the retries need to go to a different instance, so we're thinking about circuit breaking too.
gRPC is still based on HTTP/2, so I suspect that this solution should work for gRPC as well. This won't work for TCP, however.
Yes, that's the issue here. We can get better and propagate EDS faster, but there will always be a delay. I suspect that some number of 503s will be unavoidable during periods of pod churn, so we do need a bit of plaster here, unfortunately.
Making retry the default sounds good to me, but we should consider the Timeout (https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/route/route.proto#envoy-api-field-route-routeaction-timeout): how many retries can actually run is also bounded by the total RouteAction.Timeout.
As for what to retry on by default: for non-idempotent requests, gateway-error (i.e. 502, 503, 504) may make things even worse. But 503s mostly occur during periods of pod churn, so we should not simply set gateway-error by default.
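As a rough, back-of-the-envelope illustration of that constraint (my own sketch, not code from this PR): with a total route timeout and a per-try timeout, the number of attempts that can actually complete is bounded by their ratio.

```go
package main

import (
	"fmt"
	"time"
)

// maxUsefulAttempts estimates how many retry attempts can fit inside the overall
// route timeout given a per-try timeout. Purely illustrative; Envoy also accounts
// for retry back-off and time already spent on the request.
func maxUsefulAttempts(routeTimeout, perTryTimeout time.Duration) int {
	if routeTimeout <= 0 || perTryTimeout <= 0 {
		return 0
	}
	return int(routeTimeout / perTryTimeout)
}

func main() {
	// With Istio's 15s default route timeout and a 2s per-try timeout,
	// at most ~7 attempts can complete before the overall request times out.
	fmt.Println(maxUsefulAttempts(15*time.Second, 2*time.Second))
}
```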
Interesting point about the timeouts. Really the timeout should be some sort of estimation of the maximum permissible duration for EDS updates to reach a proxy. It's not a connection or transfer timeout; it's purely "iterate over what I believe the destination endpoints to be for N duration until I find one which doesn't result in a 503, blacklisting those which do as I go".
BTW, our current retry setup retries on gateway-error, connect-failure, and refused-stream, and we do it with Envoy headers because Istio currently hardcodes retrying on any 5xx error (which this PR addresses), which in our opinion is bad.
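For reference, a small sketch of the header-based approach mentioned above, assuming a client calling a service through its sidecar (the host and path here are made up): Envoy honours per-request retry hints in the x-envoy-retry-on and x-envoy-max-retries headers, which lets a caller narrow the retry conditions instead of relying on a blanket 5xx retry.

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical request to a service behind an Envoy sidecar; host and path are made up.
	req, err := http.NewRequest(http.MethodGet, "http://reviews.default.svc.cluster.local/ratings/1", nil)
	if err != nil {
		panic(err)
	}
	// Per-request retry hints via Envoy's well-known headers, as an alternative
	// to configuring RetryOn in the VirtualService.
	req.Header.Set("x-envoy-retry-on", "gateway-error,connect-failure,refused-stream")
	req.Header.Set("x-envoy-max-retries", "3")
	fmt.Println(req.Method, req.URL, req.Header)
}
```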
Agreed, we should only retry on 503, connect-failure, and refused-stream. @hzxuzhonghu @Stono WRT timeouts, do you have a suggestion of how they might be used? Making it a function of EDS updates makes sense for our pod churn/retry use case, but the timeout would also affect successful requests. If we make it too small, we run the risk of cancelling actual long(ish)-running requests. We currently are using the 15s default for all routes ... not sure if that's our magic number or if we need a new one :). Thoughts?
I think timeouts here are interesting and confusing.
You don't want a per-try timeout, because really we are only retrying connection issues, so it needs to be very low (say 10ms). 10 tries at 10ms = 100ms, which is less than the time it could take to push out EDS, so we don't achieve much.
What we actually need is for Envoy to only try an endpoint once: if it results in a gateway error, move to the next endpoint. When it runs out of endpoints, it should fail altogether.
Is it then possible for it to fetch the latest EDS config before outright failing, perhaps?
Karl
Agreed, I think the timeout should be a timeout for the request (defaulting to 15s), not for each attempt.
So we could use the number of endpoints in the outbound cluster as the number of retry attempts (I think I suggested this somewhere above). Of course, this would still be limited by the overall request timeout, so depending on how responsive the system is (503s due to connect-failure should return pretty quickly) as well as how many endpoints there are, we may or may not get through them all. But this would get us pretty close to what you're asking, I suspect.
It's an interesting idea. Initially, I'd be worried about the extra strain it would put on Pilot during a period of pod churn ... especially since Pilot would be getting per-cluster requests from each Envoy. That could pretty quickly blow up. Better to wait for the next push that contains all the updates.
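A hypothetical sketch of that idea (the helper name and cap are my own, not an Istio or Envoy API): derive the retry budget from the endpoint count of the outbound cluster, with a cap so large clusters don't produce huge budgets; the overall request timeout would still bound how many attempts actually run.

```go
package main

import "fmt"

// retryAttemptsForCluster derives a retry budget from the number of endpoints in
// a cluster, capped at maxAttempts. Hypothetical helper for illustration only.
func retryAttemptsForCluster(numEndpoints, maxAttempts int) int {
	if numEndpoints <= 1 {
		return 1 // only one endpoint to try
	}
	if numEndpoints > maxAttempts {
		return maxAttempts
	}
	return numEndpoints
}

func main() {
	fmt.Println(retryAttemptsForCluster(3, 10))  // small cluster: roughly one try per endpoint
	fmt.Println(retryAttemptsForCluster(50, 10)) // large cluster: capped at 10
}
```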
@hzxuzhonghu @rshriram @costinm one thing that seems to be missing here is individual status codes. Envoy separates them out into the retriable-status-codes retry-on condition, with the specific codes listed separately.
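A simplified sketch of what per-status-code retries could look like once translated (stand-in types, not the actual Envoy protos): the retry-on string names the retriable-status-codes policy, and the concrete codes are listed separately, so a mesh could retry on 503 without retrying every 5xx.

```go
package main

import "fmt"

// retryPolicy is a simplified stand-in for an Envoy route retry policy that
// supports individual status codes alongside named conditions.
type retryPolicy struct {
	RetryOn              string
	RetriableStatusCodes []uint32
}

func main() {
	// Retry on connection-level failures plus 503 specifically, not on every 5xx.
	p := retryPolicy{
		RetryOn:              "retriable-status-codes,connect-failure,refused-stream",
		RetriableStatusCodes: []uint32{503},
	}
	fmt.Printf("%+v\n", p)
}
```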
Pair programming; the CLA is signed by both contributors.
update istio api
Add support for RetryOn config
Fixes: #8081