Consider Supporting 429 Throttling #78

Closed
andybradshaw opened this issue Feb 6, 2020 · 5 comments · Fixed by #79
Comments

@andybradshaw

What happened?

Currently, when a client receives a 429, it simply continues to the next URI, which can result in more requests ultimately being made while the server is trying to indicate that it would like you to slow down. The code around handling this response code indicates that a service mesh should handle this throttling... but it would be nice to have some basic backoff in the client here.

```go
// 429: throttle
// Ideally we should avoid hitting this URI until it's next available. In the interest of avoiding
// complex state in the client that will be replaced with a service-mesh, we will simply move on to the next
// available URI
```

What did you want to happen?

Upon receiving a 429, the client backs off before retrying the same node internally. The same max retries behavior can be preserved, but this would give the server a little bit of time to recover/catch up before continuing to hit it with requests.
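For illustration, a minimal sketch of the requested behavior against a plain net/http client; `doWithThrottle` and `backoff` are illustrative names and not part of the conjure-go-runtime API, and the sketch assumes a body-less request that is safe to re-send:

```go
// Illustrative sketch only: assumes a body-less (safe to re-send) request and
// a fixed retry budget; it is not the conjure-go-runtime client API.
package throttleexample

import (
	"math/rand"
	"net/http"
	"time"
)

// backoff returns an exponentially growing, jittered delay for the given attempt.
func backoff(attempt int) time.Duration {
	max := time.Duration(1<<attempt) * 250 * time.Millisecond
	return time.Duration(rand.Int63n(int64(max)))
}

// doWithThrottle retries the same node when it answers 429, sleeping between
// attempts instead of immediately moving on to the next URI.
func doWithThrottle(client *http.Client, req *http.Request, maxRetries int) (*http.Response, error) {
	for attempt := 0; ; attempt++ {
		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests || attempt >= maxRetries {
			return resp, nil
		}
		resp.Body.Close()            // discard the throttled response
		time.Sleep(backoff(attempt)) // give the server time to recover before retrying
	}
}
```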

@whickman (Contributor) commented Feb 11, 2020

This is challenging to support given the current implementation of the error decode middleware, which will return an error and no response if it applies [0].

We need the response in order to do any advanced 429 handling, as we'll need the Retry-After header. It's also required for handling other statuses above 400 in the current implementation. We could potentially try to extract the status out of the error, although that seems pretty undesirable given that we should just have it in the response.

Returning an error causes the response to be filtered out in many other places as well (including core libraries, it appears), so it is not simply a matter of also returning the response.

In the current implementation, this means that this line [1] and this line [2] are never reached, because the status code is >400 and the decode middleware applies. The former is not unit tested, and the latter has the same behavior as a nil response body, which is why its test passes.

I would like to understand the need for the error decoding middleware -- @bmoylan @jamesross3 can you please describe what exactly we need it for? I would like to remove it and force consumers to do this sort of operation, rather than embedding it in the client.

[0] https://github.com/palantir/conjure-go-runtime/blob/develop/conjure-go-client/httpclient/response_error_decoder_middleware.go#L46

[1] https://github.com/palantir/conjure-go-runtime/blob/develop/conjure-go-client/httpclient/client.go#L103

[2] https://github.com/palantir/conjure-go-runtime/blob/develop/conjure-go-client/httpclient/client.go#L117
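To make the point about needing the response concrete: the Retry-After header can carry either a delay in seconds or an HTTP date, and neither is recoverable from the error alone. A hedged sketch of the helper such handling would need (`retryAfter` is illustrative, not existing API):

```go
// Illustrative helper showing what richer 429 handling needs from the response;
// not part of the existing middleware API.
package throttleexample

import (
	"net/http"
	"strconv"
	"time"
)

// retryAfter extracts the server-requested delay from a throttled response.
// It returns false when the Retry-After header is absent or unparseable.
func retryAfter(resp *http.Response) (time.Duration, bool) {
	v := resp.Header.Get("Retry-After")
	if v == "" {
		return 0, false
	}
	if secs, err := strconv.Atoi(v); err == nil {
		return time.Duration(secs) * time.Second, true
	}
	if t, err := http.ParseTime(v); err == nil {
		return time.Until(t), true
	}
	return 0, false
}
```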

@nmiyake (Contributor) commented Feb 11, 2020

I'm not fully familiar with this, but given that specifying the decoding middleware is a pretty central part of the current design, I think that removing that would be a pretty big change and could impact a lot of current clients.

Also, although we don't have to be beholden to this, what is the behavior of the Java Conjure clients with regards to 429 handling? Would at least be helpful to know for reference.

Adding logic that "handles" these kinds of errors with its own logic can often lead to behavior that can be hard to reason about down the line, so also want to be mindful of that.

@whickman (Contributor) commented Feb 12, 2020

> I'm not fully familiar with this, but given that specifying the decoding middleware is a pretty central part of the current design, I think that removing that would be a pretty big change and could impact a lot of current clients.

Right, but the current behavior allows clients near-arbitrary control over request handling. This API is unreasonably large, and while it is painful to shrink the surface area of an API, I think it is something we should certainly do in this case; without it we are signing ourselves up for more issues like this in the future, only ones that are harder to debug.

> Also, although we don't have to be beholden to this, what is the behavior of the Java Conjure clients with regards to 429 handling? Would at least be helpful to know for reference.

The behavior used by c-j-r (conjure-java-runtime) is the behavior andy described, "back off and try the next node", instead of "try the next node, back off if the current node has previously failed". If the current code behaved as intended, this would be a 5-line change.

> Adding logic that "handles" these kinds of errors with its own logic can often lead to behavior that can be hard to reason about down the line, so also want to be mindful of that.

These error codes are pretty standard across environments, and while I agree we don't want to be handling a large number of cases here, being able to define different failure modes is useful. I actually think the ability of clients to define their own middleware is much more likely to lead to the scenario you described, as it reduces consistency across clients, increases entropy, and can lead to staleness down the line.

@whickman (Contributor) commented Feb 12, 2020

A less invasive option is to move the retry logic into middleware and put that a layer below the error decode middleware. I am going to investigate the feasibility of this option, but even if it is tractable I still strongly believe we need to remove the ability to specify custom error handling.

The issue with doing this is that it will (more subtly) break the existing middleware API, as we'll effectively be hiding some error codes from the middleware; see the sketch below.
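Roughly, the layering would look like the following, expressed here with a plain http.RoundTripper chain rather than the actual conjure-go-runtime Middleware interface; all names and the error type are illustrative. Because the retry layer wraps the transport directly, it sees the raw 429 before the outer layer converts it into an error, which is exactly the status-hiding caveat above:

```go
// Sketch of layering only; names and the error type are illustrative, not the
// conjure-go-runtime middleware API.
package throttleexample

import "net/http"

// roundTripperFunc adapts a function to http.RoundTripper.
type roundTripperFunc func(*http.Request) (*http.Response, error)

func (f roundTripperFunc) RoundTrip(req *http.Request) (*http.Response, error) { return f(req) }

// withRetryOn429 sits below the error decoder, so it observes status codes on
// the raw response and can retry throttled requests itself.
func withRetryOn429(next http.RoundTripper, maxRetries int) http.RoundTripper {
	return roundTripperFunc(func(req *http.Request) (*http.Response, error) {
		var resp *http.Response
		var err error
		for attempt := 0; attempt <= maxRetries; attempt++ {
			resp, err = next.RoundTrip(req)
			if err != nil || resp.StatusCode != http.StatusTooManyRequests {
				return resp, err
			}
			resp.Body.Close()
		}
		return resp, err // still throttled after the retry budget; surface it
	})
}

// withErrorDecoding sits above retry; by the time it runs, 429s have already
// been retried away, which is the sense in which they are "hidden" from it.
func withErrorDecoding(next http.RoundTripper) http.RoundTripper {
	return roundTripperFunc(func(req *http.Request) (*http.Response, error) {
		resp, err := next.RoundTrip(req)
		if err != nil || resp.StatusCode < 400 {
			return resp, err
		}
		resp.Body.Close()
		return nil, &statusError{code: resp.StatusCode}
	})
}

type statusError struct{ code int }

func (e *statusError) Error() string { return http.StatusText(e.code) }
```

A client would then be assembled as `withErrorDecoding(withRetryOn429(http.DefaultTransport, 3))`, so the decoder only ever sees responses the retry layer chose not to handle.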

@whickman (Contributor) commented

I have put up a fix for the immediate issue; however, it is limited to that immediate problem. The PR is #79, and the issues for getting to a better state are #80 and #81.

bulldozer-bot pushed a commit that referenced this issue Feb 14, 2020
Fixes #78 
Extracts the status code from the returned error and acts on that. This precludes more complex behavior, such as respecting the Retry-After header.
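The shape of the merged approach is roughly as follows; the errorWithStatus interface and shouldBackoff are illustrative stand-ins, since the concrete extraction in #79 may differ:

```go
// Hedged sketch of "extract the status code from the returned error"; the
// error interface here is an assumption, not the concrete type used by the PR.
package throttleexample

import (
	"errors"
	"net/http"
	"time"
)

// errorWithStatus is an assumed shape for errors produced by the decode middleware.
type errorWithStatus interface {
	error
	StatusCode() int
}

// shouldBackoff reports whether the failure was a throttle (429) and how long
// to wait before retrying. Without the response, Retry-After is unavailable,
// so a fixed delay stands in for it.
func shouldBackoff(err error) (time.Duration, bool) {
	var se errorWithStatus
	if errors.As(err, &se) && se.StatusCode() == http.StatusTooManyRequests {
		return 500 * time.Millisecond, true
	}
	return 0, false
}
```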