Consider Supporting 429 Throttling #78

Closed
andybradshaw opened this issue Feb 6, 2020 · 5 comments · Fixed by #79
Comments

@andybradshaw

What happened?

Currently, when a client receives a 429, it simply continues to the next URI, which can result in more requests ultimately being made while the server is trying to indicate that it would like you to slow down. The code around handling this response code indicates that a service mesh should handle this throttling... but it would be nice to have some basic backoff in the client here.

```go
// 429: throttle
// Ideally we should avoid hitting this URI until it's next available. In the interest of avoiding
// complex state in the client that will be replaced with a service-mesh, we will simply move on to the next
// available URI
```

What did you want to happen?

Upon receiving a 429, the client backs off before retrying the same node internally. The same max retries behavior can be preserved, but this would give the server a little bit of time to recover/catch up before continuing to hit it with requests.
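For illustration, a minimal sketch of the requested behavior against a plain net/http client; `doWithThrottle` and `backoff` are illustrative names and not part of the conjure-go-runtime API, and the sketch assumes a body-less request that is safe to re-send:

```go
// Illustrative sketch only: assumes a body-less (safe to re-send) request and
// a fixed retry budget; it is not the conjure-go-runtime client API.
package throttleexample

import (
	"math/rand"
	"net/http"
	"time"
)

// backoff returns an exponentially growing, jittered delay for the given attempt.
func backoff(attempt int) time.Duration {
	max := time.Duration(1<<attempt) * 250 * time.Millisecond
	return time.Duration(rand.Int63n(int64(max)))
}

// doWithThrottle retries the same node when it answers 429, sleeping between
// attempts instead of immediately moving on to the next URI.
func doWithThrottle(client *http.Client, req *http.Request, maxRetries int) (*http.Response, error) {
	for attempt := 0; ; attempt++ {
		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests || attempt >= maxRetries {
			return resp, nil
		}
		resp.Body.Close()            // discard the throttled response
		time.Sleep(backoff(attempt)) // give the server time to recover before retrying
	}
}
```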

@whickman (Contributor) commented Feb 11, 2020

This is challenging to support given the current implementation of the error decode middleware, which will return an error and no response if it applies [0].

We need the response in order to do any advanced 429 handling, as we'll need the Retry-After header. It's also required for handling other statuses above 400 in the current implementation. We could potentially try to extract the status out of the error, although that seems pretty undesirable given that we should just have it in the response.

Returning an error causes the response to be filtered out in many other places as well (including core libraries, it appears), so it is not simply a matter of also returning the response.

In the current implementation, this means that this line [1] and this line [2] are never reached, because the status code is >400 and the decode middleware applies. The former is not unit tested, and the latter has the same behavior as a nil response body, which is why its test passes.

I would like to understand the need for the error decoding middleware -- @bmoylan @jamesross3 can you please describe what exactly we need it for? I would like to remove it and force consumers to do this sort of operation, rather than embedding it in the client.

[0] https://github.com/palantir/conjure-go-runtime/blob/develop/conjure-go-client/httpclient/response_error_decoder_middleware.go#L46

[1] https://github.com/palantir/conjure-go-runtime/blob/develop/conjure-go-client/httpclient/client.go#L103

[2] https://github.com/palantir/conjure-go-runtime/blob/develop/conjure-go-client/httpclient/client.go#L117
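To make the point about needing the response concrete: the Retry-After header can carry either a delay in seconds or an HTTP date, and neither is recoverable from the error alone. A hedged sketch of the helper such handling would need (`retryAfter` is illustrative, not existing API):

```go
// Illustrative helper showing what richer 429 handling needs from the response;
// not part of the existing middleware API.
package throttleexample

import (
	"net/http"
	"strconv"
	"time"
)

// retryAfter extracts the server-requested delay from a throttled response.
// It returns false when the Retry-After header is absent or unparseable.
func retryAfter(resp *http.Response) (time.Duration, bool) {
	v := resp.Header.Get("Retry-After")
	if v == "" {
		return 0, false
	}
	if secs, err := strconv.Atoi(v); err == nil {
		return time.Duration(secs) * time.Second, true
	}
	if t, err := http.ParseTime(v); err == nil {
		return time.Until(t), true
	}
	return 0, false
}
```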

@nmiyake (Contributor) commented Feb 11, 2020

I'm not fully familiar with this, but given that specifying the decoding middleware is a pretty central part of the current design, I think that removing that would be a pretty big change and could impact a lot of current clients.

Also, although we don't have to be beholden to this, what is the behavior of the Java Conjure clients with regards to 429 handling? Would at least be helpful to know for reference.

Adding logic that "handles" these kinds of errors with its own logic can often lead to behavior that can be hard to reason about down the line, so also want to be mindful of that.

@whickman (Contributor) commented Feb 12, 2020

> I'm not fully familiar with this, but given that specifying the decoding middleware is a pretty central part of the current design, I think that removing that would be a pretty big change and could impact a lot of current clients.

Right, but the current behavior allows clients near-arbitrary control over request handling. This API is unreasonably large, and while it is painful to shrink the surface area of an API, I think it is something we should certainly do in this case; without it we are signing ourselves up for more issues like this in the future, only ones that are harder to debug.

> Also, although we don't have to be beholden to this, what is the behavior of the Java Conjure clients with regards to 429 handling? Would at least be helpful to know for reference.

The behavior used by c-j-r (conjure-java-runtime) is the behavior andy described, "back off and try the next node", instead of "try the next node, back off if the current node has previously failed". If the current code behaved as intended, this would be a 5-line change.

> Adding logic that "handles" these kinds of errors with its own logic can often lead to behavior that can be hard to reason about down the line, so also want to be mindful of that.

These error codes are pretty standard across environments, and while I agree we don't want to be handling a large number of cases here, being able to define different failure modes is useful. I actually think the ability of clients to define their own middleware is much more likely to lead to the scenario you described, as it reduces consistency across clients, increases entropy, and can lead to staleness down the line.

@whickman (Contributor) commented Feb 12, 2020

A less invasive option is to move the retry logic into middleware and put that a layer below the error decode middleware. I am going to investigate the feasibility of this option, but even if it is tractable I still strongly believe we need to remove the ability to specify custom error handling.

The issue with doing this is that it will (more subtly) break the existing middleware API, as we'll effectively be hiding some error codes from the middleware; see the sketch below.
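Roughly, the layering would look like the following, expressed here with a plain http.RoundTripper chain rather than the actual conjure-go-runtime Middleware interface; all names and the error type are illustrative. Because the retry layer wraps the transport directly, it sees the raw 429 before the outer layer converts it into an error, which is exactly the status-hiding caveat above:

```go
// Sketch of layering only; names and the error type are illustrative, not the
// conjure-go-runtime middleware API.
package throttleexample

import "net/http"

// roundTripperFunc adapts a function to http.RoundTripper.
type roundTripperFunc func(*http.Request) (*http.Response, error)

func (f roundTripperFunc) RoundTrip(req *http.Request) (*http.Response, error) { return f(req) }

// withRetryOn429 sits below the error decoder, so it observes status codes on
// the raw response and can retry throttled requests itself.
func withRetryOn429(next http.RoundTripper, maxRetries int) http.RoundTripper {
	return roundTripperFunc(func(req *http.Request) (*http.Response, error) {
		var resp *http.Response
		var err error
		for attempt := 0; attempt <= maxRetries; attempt++ {
			resp, err = next.RoundTrip(req)
			if err != nil || resp.StatusCode != http.StatusTooManyRequests {
				return resp, err
			}
			resp.Body.Close()
		}
		return resp, err // still throttled after the retry budget; surface it
	})
}

// withErrorDecoding sits above retry; by the time it runs, 429s have already
// been retried away, which is the sense in which they are "hidden" from it.
func withErrorDecoding(next http.RoundTripper) http.RoundTripper {
	return roundTripperFunc(func(req *http.Request) (*http.Response, error) {
		resp, err := next.RoundTrip(req)
		if err != nil || resp.StatusCode < 400 {
			return resp, err
		}
		resp.Body.Close()
		return nil, &statusError{code: resp.StatusCode}
	})
}

type statusError struct{ code int }

func (e *statusError) Error() string { return http.StatusText(e.code) }
```

A client would then be assembled as `withErrorDecoding(withRetryOn429(http.DefaultTransport, 3))`, so the decoder only ever sees responses the retry layer chose not to handle.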

@whickman (Contributor) commented

I have put up a fix for the immediate issue; however, it is limited to that immediate problem. The PR is #79, and the issues for getting to a better state are #80 and #81.

bulldozer-bot pushed a commit that referenced this issue Feb 14, 2020
Fixes #78 
Extracts the status code from the returned error and acts on that. This precludes more complex behavior, such as respecting the Retry-After header.
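The shape of the merged approach is roughly as follows; the errorWithStatus interface and shouldBackoff are illustrative stand-ins, since the concrete extraction in #79 may differ:

```go
// Hedged sketch of "extract the status code from the returned error"; the
// error interface here is an assumption, not the concrete type used by the PR.
package throttleexample

import (
	"errors"
	"net/http"
	"time"
)

// errorWithStatus is an assumed shape for errors produced by the decode middleware.
type errorWithStatus interface {
	error
	StatusCode() int
}

// shouldBackoff reports whether the failure was a throttle (429) and how long
// to wait before retrying. Without the response, Retry-After is unavailable,
// so a fixed delay stands in for it.
func shouldBackoff(err error) (time.Duration, bool) {
	var se errorWithStatus
	if errors.As(err, &se) && se.StatusCode() == http.StatusTooManyRequests {
		return 500 * time.Millisecond, true
	}
	return 0, false
}
```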