(PUP-7451) Add handling for Retry-After headers#6086
(PUP-7451) Add handling for Retry-After headers#6086branan merged 5 commits intopuppetlabs:5.3.xfrom
Conversation
|
This is a bit of a work in progress. The code is done, but the mocks in existing tests need to be adjusted and new tests added to exercise the new behavior. Also, it looks like the code path for handling HTTP responses may be duplicated over in: https://github.com/puppetlabs/puppet/blob/5.0.1/lib/puppet/util/http_proxy.rb#L169 So, handling for Retry-After headers would have to be duplicated over there if appropriate. |
| end | ||
|
|
||
| begin | ||
| retry_sleep = Integer(retry_after) |
There was a problem hiding this comment.
Technically, we should also accept a HTTP Date as a valid time to retry the request:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After
However, the Puppet Server implementation of sending a 503 when overloaded uses an integer number of seconds so the marginal utility gained by also accepting dates would be very small at this time.
There was a problem hiding this comment.
What if a load-balancer fronts the compile masters, and it sends a 503 with a Date?
There was a problem hiding this comment.
In that case, the Puppet agent would fail and move onto the next request and eventually onto the next Puppet Run. I would actually recommend against this sort of load balancer configuration if the Retry-After times weren't randomized -- as that would cause a thundering herd to form.
Puppet Server includes a randomized Retry-After with its 503 responses which should have the effect of breaking agent herds up.
There was a problem hiding this comment.
Another way of looking at it is that there would be no change from the current behavior that occurs when the agent receives a 503 with a Retry-After date.
0dbf2f6 to
c750efa
Compare
|
CLA signed by all contributors. |
|
Can i nominate this for the next PR triage? @joshcooper @Magisus |
| begin | ||
| retry_sleep = Integer(retry_after) | ||
| rescue TypeError, ArgumentError | ||
| Puppet.err(_("Recieved a %{status_code} response from the server, but the Retry-After header value of '%{retry_after}' could not be converted to an integer.") % |
There was a problem hiding this comment.
One of those words I can never spell right. Fixed.
| next | ||
| end | ||
|
|
||
| Puppet.warning(_('Recieved a %{status_code} response from the server. Sleeping for %{retry_after} seconds before retrying the request.') % |
| {status_code: current_response.code, | ||
| retry_after: retry_after}) | ||
|
|
||
| sleep(retry_sleep) |
There was a problem hiding this comment.
Should we set a max number of times we will try to sleep/delay so we don't get into an infinite loop?
There was a problem hiding this comment.
The outer loop will terminate after @redirect_limit iterations. I suppose we could add an additional setting that controls the number of sleeps that might occur.
c750efa to
982b61f
Compare
joshcooper
left a comment
There was a problem hiding this comment.
Looks good so far. Can you add tests for the various responses (503, 429) with nil, date, and integer Retry-After headers
d7e8128 to
654c0fd
Compare
|
@joshcooper Added some spec tests along with support for HTTP dates. Not quite sure where the codepath in |
| next | ||
| end | ||
|
|
||
| retry_after = begin |
There was a problem hiding this comment.
Minor nit - pull this out into a method rather than an inline anonymous for readability?
There was a problem hiding this comment.
Won't be straight-forward as the logic for terminating the outer retry loop is part of the error handling for this block.
654c0fd to
ef702ef
Compare
| {status_code: current_response.code, | ||
| retry_after: retry_after}) | ||
|
|
||
| ::Kernel.sleep(retry_sleep) if (retry_sleep > 0) |
There was a problem hiding this comment.
Do you want to upper bounds the maximum sleep time so they agent doesn't hang if a server provides bad data / the headers are modified by an intermediary?
There was a problem hiding this comment.
I've waffled a bit about this and ended up leaving it out for the sake of simplicity. As it stands, there's a single upper limit that can be set on the Puppet Server side that caps the size of the Retry-After headers it sends and that's that.
But maybe using Puppet[:runinterval] as an upper bound on the agent side would provide a measure of safety without being too confusing to users?
There was a problem hiding this comment.
Yeah, that might be a good compromise. I'm just concerned that this sets a bit of a booby-trap that can't be escaped, whether that's due to:
- server-side misconfiguration where the retry sleep is set too long, agents grabs it as part of a failed request, then is stuck with the incorrect value until it expires
- the code malfunctions in some unexpected way
- something is done in a nefarious way to disable puppet from running / remediating for an extended period of time
|
|
||
| retry_sleep = case retry_after | ||
| when DateTime | ||
| (retry_after.to_time - DateTime.now.to_time).to_i |
There was a problem hiding this comment.
Is DateTime.rfc822 in the same time zone / format as DateTime.now? Do we have to worry about ensuring both are in UTC? Or will Ruby handle ensuring we have the correctly specified time span here?
There was a problem hiding this comment.
DateTime objects retain time zone info, which allows for a meaningful comparison when both are cast to a local UNIX timestamp via to_time. RFC 2822 and its successors mandate that zone info be provided with a date string, so there should be an issue with ambiguous dates --- the rfc2822 parser will also reject date strings without zone info.
The spec tests for a Date value in the Retry-After header also set up a mixed time zone between the client and server to exercise this case.
ef702ef to
7d7f4b1
Compare
|
@Sharpie one other question here is ... does the server issue the same absolute future available time in the header to all clients? Or it skewed / distributed based on when the request was received? See the bit about auto-DDoS at: |
|
The server-side bit functions similar to splay in that there is a configurable max delay which defaults to 30 minutes. The server multiplies this max by a uniform random float between 0 and 1 and uses the result as the value of the So, each agent will get a different sleep time which should stagger their runs apart and break up a herd. We effectively suffer from the auto-DOS issue right now as a restart or failure of an overloaded Puppet Server will unblock all the waiting agents at once creating a herd of agents with synced runtimes. |
|
Added the |
|
@Iristyle this look OK to you now? what needs to happen before merging? |
|
The method for handling a request has grown in length and complexity to the point that I'm not 100% comfortable saying I can understand it. Ideally it would be more like 10 lines than 75. Would it be possible to refactor a bit and pull bits out to make the core logic clear? |
bba16e6 to
93380e2
Compare
|
@pcarlisle I've factored the header parsing out into a separate method. The remaining logic can't be moved due to the use of |
| current_request[header] = value | ||
| end | ||
| when 429, 503 | ||
| retry_after = current_response['Retry-After'] |
There was a problem hiding this comment.
This might be more straightforward if these retry-after cases were extracted into a separate method. I'm picturing a method that returns a response, where nil would mean a sleep just happened.
The nexts would be replaced with return current_response.
There was a problem hiding this comment.
Hmm. Using a nil vs. returning the response is a good idea. I'll take a look at re-factoring the case statement for that.
| when 429, 503 | ||
| retry_after = current_response['Retry-After'] | ||
|
|
||
| if retry_after.nil? |
There was a problem hiding this comment.
Can we add a comment here about why a code of 429 or 503 would be returned without a retry_after? What does it mean for us to return the current_response in that case?
There was a problem hiding this comment.
The Retry-After header is optional and its absence indicates that a request should not be retried. In this case we return the 503 or 429 response as the "final answer" from the server and do not retry --- which is the current behavior regardless of whether the header is present.
There was a problem hiding this comment.
Perfect, that puts my mind at ease a bit too about backwards compatibility.
| retry_sleep = parse_retry_after_header(retry_after) | ||
|
|
||
| if retry_sleep.nil? | ||
| Puppet.err(_('Received a %{status_code} response from the server, but the Retry-After header value of "%{retry_after}" could not be converted to an integer or RFC 2822 Date.') % |
There was a problem hiding this comment.
To check my understanding, this flow means there is a bug in our server-side code, correct?
There was a problem hiding this comment.
A bug in Puppet Server, which is unlikely as the handler sets either an integer or nothing at all. Mostly this is here for visibility if we receive a Retry-After header from some server other than Puppet Server that isn't following the HTTP RFC.
| {status_code: current_response.code, | ||
| retry_sleep: retry_sleep}) | ||
|
|
||
| ::Kernel.sleep(retry_sleep) if (retry_sleep > 0) |
There was a problem hiding this comment.
Do we expect retry_sleep to ever be <= 0 here? If I'm reading the code right, that'd mean Puppet[:runinterval] is <= 0. Do we allow that?
There was a problem hiding this comment.
This could happen if the server returned a RFC 2822 date that had already passed. Although, it might make more sense to have the parse_retry_after_header function return 0 in this case.
| # Cap maximum sleep at the run interval of the Puppet agent. | ||
| retry_sleep = [retry_sleep, Puppet[:runinterval]].min | ||
|
|
||
| Puppet.warning(_('Received a %{status_code} response from the server. Sleeping for %{retry_sleep} seconds before retrying the request.') % |
There was a problem hiding this comment.
Do we want to log when the first retry_sleep value is above the Puppet[:runinterval] value? I'd expect that to be an edge case.
There was a problem hiding this comment.
Seems like an edge case. Would probably be detectable without an additional message as the sleep would exactly equal the runinterval --- which would be a fluke for a random die roll.
| # @return [nil] Returns `nil` when the `header_value` can't be | ||
| # parsed as an Integer or RFC 2822 date. | ||
| def parse_retry_after_header(header_value) | ||
| retry_after = begin |
There was a problem hiding this comment.
A bit of code golf here to simplify, but maybe this?
begin
Integer(header_value)
rescue TypeError, ArgumentError
begin
(DateTime.rfc2822(header_value).to_time - DateTime.now.to_time).to_i
rescue ArgumentError
nil
end
end
I don't see a need for the retry_after variable, nor for doing this in two parts.
There was a problem hiding this comment.
The main reason for splitting this into two parts is to ensure we're only rescuing ArgumentError when thrown by the DateTime.rfc2822 parsing method as that is the only operation where failure would be expected.
01f23e3 to
6ddba43
Compare
|
@smcclellan I've re-factored the handling for Retry-After headers out into a separate method and added some docs. Let me know what you think! |
6ddba43 to
d7c2d90
Compare
smcclellan
left a comment
There was a problem hiding this comment.
This looks very close now, I like the changes. Can we get an additional test (mostly to prevent regressions) for if the RFC2822 date has already passed?
Side note, I'm a bit confused why all this logic is in connection.rb. Guessing this is functionality that just grew organically, but it may make sense to shuffle things around at some point.
7147f6b to
4a881d5
Compare
|
@smcclellan Added a test around the |
| # Net::HTTPResponse was generated by a method that | ||
| # did not fill in the uri attribute, so use a | ||
| # placeholder for the server name. | ||
| _('the remote server') |
There was a problem hiding this comment.
We're trying to clean up dangling standalone phrases like this that need to be translated, because they're easy to translate wrong (e.g. here you can't tell if this is the subject of the sentence or not, which will matter for some languages like German). But in this case I don't see a good way to refactor this to make a whole sentence without duplicating a bunch of code. Can you instead just add a TRANSLATORS comment here indicating how this phrase is used? Something like "used in the phrase 'Received a response from the remote server'".
There was a problem hiding this comment.
Thanks for the advice! Added a TRANSLATORS comment.
Prior to this commit, the HTTP client used by the Puppet agent had special handling for 3xx error codes in order to handle HTTP redirects. This patch re-factors the response handler so that 503 error codes with Retry-After headers set will result in a sleep followed by a retry of the request. In particular, this allows a Puppet Server to re-splay its agent population when overloaded by sending a 503 response with a randomized Retry-After header, as implemented in SERVER-1767.
This patch extends the handling of Retry-After headers in 503 responses to include datestamps formatted according to the rules laid down in RFC 2822.
This patch limits the amount of time an agent will sleep in response to a failed HTTP request containing a `Retry-After` header to no more than the value of the runinterval setting.
This patch moves the logic for parsing Retry-After headers to integer values into a separate method and moves the logic for handling Retry-After headers on 503 and 429 responses into a separate method.
f4fb60d to
2aaa24e
Compare
If a server returns a 503 or 429 response that has a Retry-After header, something exceptional is happening. This patch adds the hostname of the server, if present, to the warning messages that are logged when 503 responses are handled.
2aaa24e to
aa0967a
Compare
|
This looks clean, as much as it can be. I don't really like the idea of just sitting and blocking on requests - it would be nice to allow higher levels of Puppet to continue to do other work. That requires much bigger refactors (which I think might be worthwhile for other reasons), though. I'm fine seeing this land as-is, but I'm going to make sure I write up my thoughts on refactoring this. |
|
🎉 Thanks a bunch for the review and feedback! |
Prior to this commit, the HTTP client used by the Puppet agent had special
handling for 3xx error codes in order to handle HTTP redirects. This patch
re-factors the response handler so that 503 and 429 error codes with
Retry-After headers set will result in a sleep followed by a retry of the
request. In particular, this allows a Puppet Server to re-splay its agent
population when overloaded by sending a 503 response with a randomized
Retry-After header, as implemented in SERVER-1767.