
Support retries of failed proxy requests #2414

Merged
merged 19 commits into from May 12, 2023

Conversation

mikemherron
Contributor

@mikemherron mikemherron commented Mar 7, 2023

Implements #2372

Support for retrying proxy requests that fail due to an unavailable backend instance.
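
For reference, usage of the new options ends up looking roughly like this. This is a sketch: RetryCount and RetryFilter are the option names discussed in this PR, while the backend URLs and the filter logic are made up for the example.

```go
package main

import (
	"net/http"
	"net/url"

	"github.com/labstack/echo/v4"
	"github.com/labstack/echo/v4/middleware"
)

func main() {
	e := echo.New()

	// Two hypothetical backends; if a proxied request fails, the middleware can
	// retry it against the next target chosen by the balancer.
	url1, _ := url.Parse("http://backend-1:8080")
	url2, _ := url.Parse("http://backend-2:8080")
	targets := []*middleware.ProxyTarget{{URL: url1}, {URL: url2}}

	e.Use(middleware.ProxyWithConfig(middleware.ProxyConfig{
		Balancer:   middleware.NewRoundRobinBalancer(targets),
		RetryCount: 1, // retry a failed request once
		// RetryFilter decides whether a particular failure should be retried;
		// this example only retries when the target was unreachable (502).
		RetryFilter: func(c echo.Context, err error) bool {
			if httpErr, ok := err.(*echo.HTTPError); ok {
				return httpErr.Code == http.StatusBadGateway
			}
			return false
		},
	}))

	e.Logger.Fatal(e.Start(":8080"))
}
```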

@mikemherron mikemherron changed the title from "Draft proposal for supporting retries of failed proxy requests" to "Support retries of failed proxy requests" on Mar 10, 2023
@mikemherron mikemherron marked this pull request as ready for review March 10, 2023 15:35
Contributor

@aldas aldas left a comment


  • I think we need to edit proxyRaw so all errors are actually set as errors. c.Set("_error", fmt.Sprintf("proxy raw, hijack error=%v, url=%s", t.URL, err)) is not an error
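
A minimal sketch of the change being suggested, where t and err come from the surrounding proxyRaw code; the use of *echo.HTTPError and the 502 status here are assumptions, the point is simply to store an error value rather than a formatted string:

```go
// Sketch of proxyRaw's hijack-error path: record the failure as a real error
// under the "_error" context key so the retry logic can inspect it later.
if err != nil {
	c.Set("_error", echo.NewHTTPError(http.StatusBadGateway,
		fmt.Sprintf("proxy raw, hijack error=%v, url=%s", err, t.URL)))
	return
}
```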

@mikemherron
Contributor Author

I think we need to edit proxyRaw so all errors are actually set as errors. c.Set("_error", fmt.Sprintf("proxy raw, hijack error=%v, url=%s", t.URL, err)) is not an error

Good catch, I'll do this as part of this PR.

@mikemherron mikemherron requested a review from aldas March 17, 2023 17:45
@codecov

codecov bot commented Mar 22, 2023

Codecov Report

❗ No coverage uploaded for pull request base (master@ec642f7).
Patch coverage: 80.82% of modified lines in pull request are covered.

❗ Current head f3472cd differs from pull request most recent head 5560254. Consider uploading reports for the commit 5560254 to get more accurate results

Additional details and impacted files
@@            Coverage Diff            @@
##             master    #2414   +/-   ##
=========================================
  Coverage          ?   92.84%           
=========================================
  Files             ?       39           
  Lines             ?     4555           
  Branches          ?        0           
=========================================
  Hits              ?     4229           
  Misses            ?      237           
  Partials          ?       89           
Impacted Files Coverage Δ
middleware/proxy.go 74.24% <80.82%> (ø)


Contributor

@aldas aldas left a comment


LGTM, better than my ideas.

Just fix these linting errors and we are good.

P.S. The makefile default target is helpful here.

@aldas
Contributor

aldas commented Mar 22, 2023

@lammel, do you want to take a look?

@aldas
Contributor

aldas commented Mar 22, 2023

@mikemherron could you please add a PR for the docs as well: https://github.com/labstack/echox/blob/master/website/content/middleware/proxy.md

@mikemherron
Contributor Author

@mikemherron could you please add a PR for the docs as well: https://github.com/labstack/echox/blob/master/website/content/middleware/proxy.md

Yes - I will try to do this later today or, failing that, tomorrow at the latest.

@mikemherron
Contributor Author

Docs PR: labstack/echox#281

@lammel
Contributor

lammel commented Mar 24, 2023

Nice work! Although I don't like the name RetryFilter, I failed to come up with something better, so let's stick with it.

What I'd like to see is an additional test for a timeout of a proxy target (quite common due to firewall or load issues).
Thanks for the docs PR, that is great. Will respond there too.

@mikemherron
Contributor Author

Nice work! Although I don't like the name RetryFilter, I failed to come up with something better, so let's stick with it.

Yes, I know what you mean, there is nothing else I could see in the code base with *Filter... I did consider RetryHandler, which is more in line with Echo conventions, but that sounds like the function is responsible for doing the retry, which didn't feel right either. Let me know if you come up with an alternative, happy to change it.

What I'd like to see is an additional test for a timeout of a proxy target (quite common due to firewall or load issues). Thanks for the docs PR, that is great. Will respond there too.

That's a good point. I won't have time to look at it this week, but I will have time next week if you'd prefer to hold off merging until then.

@lammel
Contributor

lammel commented Mar 27, 2023

Yes, I know what you mean, there is nothing else I could see in the code base with *Filter... I did consider RetryHandler, which is more in line with Echo conventions, but that sounds like the function is responsible for doing the retry, which didn't feel right either. Let me know if you come up with an alternative, happy to change it.

I failed too. So let's stick with it for now.

What I'd like to see is an additional test for a timeout of a proxy target (quite common due to firewall or load issues). Thanks for the docs PR, that is great. Will respond there too.

That's a good point. I won't have time to look at it this week, but I will have time next week if you'd prefer to hold off merging until then.

I'd prefer to wait as we are not in a hurry. Take your time, looking forward to it.
Multiple timeouts might cause some unexpected delay and side effects, so let's see how it behaves.

@mikemherron
Contributor Author

I'd prefer to wait as we are not in a hurry. Take your time, looking forward to it. Multiple timeouts might cause some unexpected delay and side effects, so let's see how it behaves.

Added a new test with a backend that times out, sending 20 concurrent requests. The behaviour of the timeouts seems fine, but it does raise another interesting issue: when using the round-robin load balancer, the index of the current target is shared amongst all requests. This means it is possible for a failing request to end up retrying against the same backend, since other concurrent requests will have incremented the current target index.

This makes sense, but probably won't be what users expect. In the test, we have 2 backends configured, one that will always time out and another that will always succeed, with RetryCount set to 1. I would expect every request to succeed, since if we get a failure we should retry with "the next backend" - but the "next" backend can be the same one, depending on how many other concurrent requests we have.

I'm not sure there is a simple fix for this. The expected behaviour, IMO, should be for an individual request to somehow keep track of which backend it tried, and then ask the balancer for the next one relative to that, rather than the "next" one as determined by some global state. This could be done by adding another argument to ProxyBalancer with the last backend, or by having the NextTarget method also return some state that acts like a cursor and can be passed back in on retry calls. However, both of these would change the ProxyBalancer interface and cause a breaking change.

This limitation means the retry feature is useful for skipping over intermittent failures, but less useful in cases where an entire instance becomes unavailable.

For now, I just made the test pass on 502 errors (backend unavailable) - interested to hear any thoughts on potential solutions.
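
To make the shared-index behaviour concrete, here is a stripped-down sketch; this is not the Echo balancer code, and the target names are invented:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// roundRobin mimics a balancer that keeps one shared counter for all requests.
type roundRobin struct {
	i       uint32
	targets []string
}

func (r *roundRobin) next() string {
	i := atomic.AddUint32(&r.i, 1)
	return r.targets[int(i)%len(r.targets)]
}

func main() {
	rr := &roundRobin{targets: []string{"healthy-backend", "timing-out-backend"}}

	attempt := rr.next() // request A gets "timing-out-backend" and fails
	_ = rr.next()        // a concurrent request B advances the shared counter
	retry := rr.next()   // request A's retry gets "timing-out-backend" again

	fmt.Println(attempt, retry) // both attempts hit the same failing target
}
```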

@aldas
Contributor

aldas commented Mar 28, 2023

@mikemherron as we/you moved c.Set(config.ContextKey, tgt) around and removed the clearing of it - in the case of a retry, the provider can actually check the context for the previous target with c.Get(config.ContextKey) and try to avoid returning the same target for the next value. All this without changing the existing API.

@mikemherron
Contributor Author

@mikemherron as we/you moved c.Set(config.ContextKey, tgt) around and removed the clearing of it - in the case of a retry, the provider can actually check the context for the previous target with c.Get(config.ContextKey) and try to avoid returning the same target for the next value.

Yes, that's a good point. Should we change the provided round-robin and random balancer implementations to do this? Or leave it to users who want that behaviour?

@aldas
Contributor

aldas commented Mar 28, 2023

It makes sense for these default implementations to avoid serving the same target on the next try. Do you feel adventurous and have time for this enhancement? If this feature solves more problems than it creates - it is probably worth implementing.

@mikemherron
Contributor Author

It makes sense for these default implementations to avoid serving the same target on the next try. Do you feel adventurous and have time for this enhancement? If this feature solves more problems than it creates - it is probably worth implementing.

I went to do that but realised the balancers don't have access to the ContextKey config attribute - so instead I updated the round-robin balancer to store the last index used inside the context, and then check on each call if there was a previous index and, if so, start from there. I think this solves the issue well, and I was able to update the previous test so all requests now pass.

It didn't seem to make as much sense to do this for the random balancer. We could keep picking random targets until we get one that is not the same as the last, but that seems wasteful. We could start iterating from the previous index on retries (similar to what's done in the round-robin balancer), but I think if you have chosen to use the random balancer that behaviour would be unexpected.

Let me know what you think...
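
Roughly what this looks like inside middleware/proxy.go, as a sketch rather than the merged implementation; the roundRobinBalancer fields and the context key name are assumptions for illustration:

```go
// lastIndexKey is a hypothetical private context key used to remember which
// target index this particular request tried last.
const lastIndexKey = "_round_robin_last_index"

// Next resumes from the index stored in the request's context on a retry, so the
// retry moves to a different target even if concurrent requests have advanced the
// shared counter in the meantime.
func (b *roundRobinBalancer) Next(c echo.Context) *ProxyTarget {
	b.mutex.Lock()
	defer b.mutex.Unlock()

	var next int
	if prev, ok := c.Get(lastIndexKey).(int); ok {
		// Retry: this request already tried targets[prev]; move on from there.
		next = (prev + 1) % len(b.targets)
	} else {
		// First attempt: use and advance the shared counter as before.
		next = b.i
		b.i = (b.i + 1) % len(b.targets)
	}
	c.Set(lastIndexKey, next)
	return b.targets[next]
}
```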

@aldas aldas requested a review from lammel April 15, 2023 22:04
@aldas
Contributor

aldas commented May 12, 2023

Alright, I am merging this PR. The part I do not agree with is private/internal, so we can revise it if need be.

@mikemherron Thank you for the work and being patient with me.

@aldas aldas merged commit 0ae7464 into labstack:master May 12, 2023
@mikemherron
Contributor Author

Alright, I am merging this PR. The part I do not agree with is private/internal, so we can revise it if need be.

@mikemherron Thank you for the work and being patient with me.

No problem at all @aldas, I totally understand that you need to do what you think is best for the project. Thanks for all your input!
