Skip to content

(PUP-7451) Add handling for Retry-After headers#6086

Merged
branan merged 5 commits intopuppetlabs:5.3.xfrom
Sharpie:PUP-7451-add-retry-after-to-http-requests
Sep 14, 2017
Merged

(PUP-7451) Add handling for Retry-After headers#6086
branan merged 5 commits intopuppetlabs:5.3.xfrom
Sharpie:PUP-7451-add-retry-after-to-http-requests

Conversation

@Sharpie
Copy link
Contributor

@Sharpie Sharpie commented Jul 25, 2017

Prior to this commit, the HTTP client used by the Puppet agent had special
handling for 3xx error codes in order to handle HTTP redirects. This patch
re-factors the response handler so that 503 and 429 error codes with
Retry-After headers set will result in a sleep followed by a retry of the
request. In particular, this allows a Puppet Server to re-splay its agent
population when overloaded by sending a 503 response with a randomized
Retry-After header, as implemented in SERVER-1767.

@Sharpie
Copy link
Contributor Author

Sharpie commented Jul 25, 2017

This is a bit of a work in progress. The code is done, but the mocks in existing tests need to be adjusted and new tests added to exercise the new behavior. Also, it looks like the code path for handling HTTP responses may be duplicated over in:

https://github.com/puppetlabs/puppet/blob/5.0.1/lib/puppet/util/http_proxy.rb#L169

So, handling for Retry-After headers would have to be duplicated over there if appropriate.

end

begin
retry_sleep = Integer(retry_after)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, we should also accept a HTTP Date as a valid time to retry the request:

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After

However, the Puppet Server implementation of sending a 503 when overloaded uses an integer number of seconds so the marginal utility gained by also accepting dates would be very small at this time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if a load-balancer fronts the compile masters, and it sends a 503 with a Date?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, the Puppet agent would fail and move onto the next request and eventually onto the next Puppet Run. I would actually recommend against this sort of load balancer configuration if the Retry-After times weren't randomized -- as that would cause a thundering herd to form.

Puppet Server includes a randomized Retry-After with its 503 responses which should have the effect of breaking agent herds up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way of looking at it is that there would be no change from the current behavior that occurs when the agent receives a 503 with a Retry-After date.

@Sharpie Sharpie added the Agent label Jul 25, 2017
@Sharpie Sharpie force-pushed the PUP-7451-add-retry-after-to-http-requests branch 2 times, most recently from 0dbf2f6 to c750efa Compare July 25, 2017 20:12
@puppetcla
Copy link

CLA signed by all contributors.

@ahpook
Copy link
Contributor

ahpook commented Aug 2, 2017

Can i nominate this for the next PR triage? @joshcooper @Magisus

begin
retry_sleep = Integer(retry_after)
rescue TypeError, ArgumentError
Puppet.err(_("Recieved a %{status_code} response from the server, but the Retry-After header value of '%{retry_after}' could not be converted to an integer.") %
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo, "Recieved"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of those words I can never spell right. Fixed.

next
end

Puppet.warning(_('Recieved a %{status_code} response from the server. Sleeping for %{retry_after} seconds before retrying the request.') %
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

{status_code: current_response.code,
retry_after: retry_after})

sleep(retry_sleep)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we set a max number of times we will try to sleep/delay so we don't get into an infinite loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The outer loop will terminate after @redirect_limit iterations. I suppose we could add an additional setting that controls the number of sleeps that might occur.

@Sharpie Sharpie force-pushed the PUP-7451-add-retry-after-to-http-requests branch from c750efa to 982b61f Compare August 2, 2017 17:54
@Magisus Magisus requested a review from joshcooper August 8, 2017 22:47
Copy link
Contributor

@joshcooper joshcooper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far. Can you add tests for the various responses (503, 429) with nil, date, and integer Retry-After headers

@Sharpie Sharpie force-pushed the PUP-7451-add-retry-after-to-http-requests branch 4 times, most recently from d7e8128 to 654c0fd Compare August 14, 2017 23:53
@Sharpie
Copy link
Contributor Author

Sharpie commented Aug 14, 2017

@joshcooper Added some spec tests along with support for HTTP dates. Not quite sure where the codepath in lib/puppet/util/http_proxy.rb falls, looks like it gets used by the Indirector --- should this handling be duplicated over there?

next
end

retry_after = begin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit - pull this out into a method rather than an inline anonymous for readability?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't be straight-forward as the logic for terminating the outer retry loop is part of the error handling for this block.

@Sharpie Sharpie force-pushed the PUP-7451-add-retry-after-to-http-requests branch from 654c0fd to ef702ef Compare August 29, 2017 22:10
{status_code: current_response.code,
retry_after: retry_after})

::Kernel.sleep(retry_sleep) if (retry_sleep > 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to upper bounds the maximum sleep time so they agent doesn't hang if a server provides bad data / the headers are modified by an intermediary?

Copy link
Contributor Author

@Sharpie Sharpie Aug 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've waffled a bit about this and ended up leaving it out for the sake of simplicity. As it stands, there's a single upper limit that can be set on the Puppet Server side that caps the size of the Retry-After headers it sends and that's that.

But maybe using Puppet[:runinterval] as an upper bound on the agent side would provide a measure of safety without being too confusing to users?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that might be a good compromise. I'm just concerned that this sets a bit of a booby-trap that can't be escaped, whether that's due to:

  • server-side misconfiguration where the retry sleep is set too long, agents grabs it as part of a failed request, then is stuck with the incorrect value until it expires
  • the code malfunctions in some unexpected way
  • something is done in a nefarious way to disable puppet from running / remediating for an extended period of time


retry_sleep = case retry_after
when DateTime
(retry_after.to_time - DateTime.now.to_time).to_i
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is DateTime.rfc822 in the same time zone / format as DateTime.now? Do we have to worry about ensuring both are in UTC? Or will Ruby handle ensuring we have the correctly specified time span here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DateTime objects retain time zone info, which allows for a meaningful comparison when both are cast to a local UNIX timestamp via to_time. RFC 2822 and its successors mandate that zone info be provided with a date string, so there should be an issue with ambiguous dates --- the rfc2822 parser will also reject date strings without zone info.

The spec tests for a Date value in the Retry-After header also set up a mixed time zone between the client and server to exercise this case.

@Sharpie Sharpie force-pushed the PUP-7451-add-retry-after-to-http-requests branch from ef702ef to 7d7f4b1 Compare August 29, 2017 22:48
@Iristyle
Copy link
Contributor

@Sharpie one other question here is ... does the server issue the same absolute future available time in the header to all clients? Or it skewed / distributed based on when the request was received?

See the bit about auto-DDoS at:
http://www.nurkiewicz.com/2015/02/retry-after-http-header-in-practice.html

@Sharpie
Copy link
Contributor Author

Sharpie commented Aug 30, 2017

The server-side bit functions similar to splay in that there is a configurable max delay which defaults to 30 minutes. The server multiplies this max by a uniform random float between 0 and 1 and uses the result as the value of the Retry-After header included with 503 responses.

So, each agent will get a different sleep time which should stagger their runs apart and break up a herd.

We effectively suffer from the auto-DOS issue right now as a restart or failure of an overloaded Puppet Server will unblock all the waiting agents at once creating a herd of agents with synced runtimes.

@Sharpie
Copy link
Contributor Author

Sharpie commented Aug 30, 2017

Added the $runinterval setting as a limit on Retry-After time.

@ahpook
Copy link
Contributor

ahpook commented Sep 1, 2017

@Iristyle this look OK to you now? what needs to happen before merging?

@pcarlisle
Copy link
Contributor

The method for handling a request has grown in length and complexity to the point that I'm not 100% comfortable saying I can understand it. Ideally it would be more like 10 lines than 75. Would it be possible to refactor a bit and pull bits out to make the core logic clear?

@Sharpie Sharpie force-pushed the PUP-7451-add-retry-after-to-http-requests branch from bba16e6 to 93380e2 Compare September 6, 2017 16:39
@Sharpie
Copy link
Contributor Author

Sharpie commented Sep 6, 2017

@pcarlisle I've factored the header parsing out into a separate method. The remaining logic can't be moved due to the use of next to advance the outer loop that drives retries and redirects.

current_request[header] = value
end
when 429, 503
retry_after = current_response['Retry-After']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be more straightforward if these retry-after cases were extracted into a separate method. I'm picturing a method that returns a response, where nil would mean a sleep just happened.

The nexts would be replaced with return current_response.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Using a nil vs. returning the response is a good idea. I'll take a look at re-factoring the case statement for that.

when 429, 503
retry_after = current_response['Retry-After']

if retry_after.nil?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment here about why a code of 429 or 503 would be returned without a retry_after? What does it mean for us to return the current_response in that case?

Copy link
Contributor Author

@Sharpie Sharpie Sep 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Retry-After header is optional and its absence indicates that a request should not be retried. In this case we return the 503 or 429 response as the "final answer" from the server and do not retry --- which is the current behavior regardless of whether the header is present.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, that puts my mind at ease a bit too about backwards compatibility.

retry_sleep = parse_retry_after_header(retry_after)

if retry_sleep.nil?
Puppet.err(_('Received a %{status_code} response from the server, but the Retry-After header value of "%{retry_after}" could not be converted to an integer or RFC 2822 Date.') %
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check my understanding, this flow means there is a bug in our server-side code, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bug in Puppet Server, which is unlikely as the handler sets either an integer or nothing at all. Mostly this is here for visibility if we receive a Retry-After header from some server other than Puppet Server that isn't following the HTTP RFC.

{status_code: current_response.code,
retry_sleep: retry_sleep})

::Kernel.sleep(retry_sleep) if (retry_sleep > 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect retry_sleep to ever be <= 0 here? If I'm reading the code right, that'd mean Puppet[:runinterval] is <= 0. Do we allow that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could happen if the server returned a RFC 2822 date that had already passed. Although, it might make more sense to have the parse_retry_after_header function return 0 in this case.

# Cap maximum sleep at the run interval of the Puppet agent.
retry_sleep = [retry_sleep, Puppet[:runinterval]].min

Puppet.warning(_('Received a %{status_code} response from the server. Sleeping for %{retry_sleep} seconds before retrying the request.') %
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to log when the first retry_sleep value is above the Puppet[:runinterval] value? I'd expect that to be an edge case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like an edge case. Would probably be detectable without an additional message as the sleep would exactly equal the runinterval --- which would be a fluke for a random die roll.

# @return [nil] Returns `nil` when the `header_value` can't be
# parsed as an Integer or RFC 2822 date.
def parse_retry_after_header(header_value)
retry_after = begin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of code golf here to simplify, but maybe this?

begin
  Integer(header_value)
rescue TypeError, ArgumentError
  begin
    (DateTime.rfc2822(header_value).to_time - DateTime.now.to_time).to_i
  rescue ArgumentError
    nil
  end
end

I don't see a need for the retry_after variable, nor for doing this in two parts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason for splitting this into two parts is to ensure we're only rescuing ArgumentError when thrown by the DateTime.rfc2822 parsing method as that is the only operation where failure would be expected.

@Sharpie Sharpie force-pushed the PUP-7451-add-retry-after-to-http-requests branch 2 times, most recently from 01f23e3 to 6ddba43 Compare September 11, 2017 15:54
@Sharpie
Copy link
Contributor Author

Sharpie commented Sep 11, 2017

@smcclellan I've re-factored the handling for Retry-After headers out into a separate method and added some docs. Let me know what you think!

@Sharpie Sharpie force-pushed the PUP-7451-add-retry-after-to-http-requests branch from 6ddba43 to d7c2d90 Compare September 11, 2017 16:11
Copy link
Contributor

@smcclellan smcclellan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very close now, I like the changes. Can we get an additional test (mostly to prevent regressions) for if the RFC2822 date has already passed?

Side note, I'm a bit confused why all this logic is in connection.rb. Guessing this is functionality that just grew organically, but it may make sense to shuffle things around at some point.

@Sharpie Sharpie force-pushed the PUP-7451-add-retry-after-to-http-requests branch 3 times, most recently from 7147f6b to 4a881d5 Compare September 11, 2017 20:38
@Sharpie
Copy link
Contributor Author

Sharpie commented Sep 11, 2017

@smcclellan Added a test around the parse_retry_after_header method returning zero when given a date that has already passed. I also made a small addition that adds the hostname of the remote server to the log messages if it is available.

# Net::HTTPResponse was generated by a method that
# did not fill in the uri attribute, so use a
# placeholder for the server name.
_('the remote server')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're trying to clean up dangling standalone phrases like this that need to be translated, because they're easy to translate wrong (e.g. here you can't tell if this is the subject of the sentence or not, which will matter for some languages like German). But in this case I don't see a good way to refactor this to make a whole sentence without duplicating a bunch of code. Can you instead just add a TRANSLATORS comment here indicating how this phrase is used? Something like "used in the phrase 'Received a response from the remote server'".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the advice! Added a TRANSLATORS comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks 👍

Prior to this commit, the HTTP client used by the Puppet agent had special
handling for 3xx error codes in order to handle HTTP redirects. This patch
re-factors the response handler so that 503 error codes with Retry-After
headers set will result in a sleep followed by a retry of the request. In
particular, this allows a Puppet Server to re-splay its agent population when
overloaded by sending a 503 response with a randomized Retry-After header, as
implemented in SERVER-1767.
This patch extends the handling of Retry-After headers in 503 responses to
include datestamps formatted according to the rules laid down in RFC 2822.
This patch limits the amount of time an agent will sleep in response to a
failed HTTP request containing a `Retry-After` header to no more than the
value of the runinterval setting.
This patch moves the logic for parsing Retry-After headers to integer values
into a separate method and moves the logic for handling Retry-After headers
on 503 and 429 responses into a separate method.
@Sharpie Sharpie force-pushed the PUP-7451-add-retry-after-to-http-requests branch 2 times, most recently from f4fb60d to 2aaa24e Compare September 12, 2017 17:30
If a server returns a 503 or 429 response that has a Retry-After header,
something exceptional is happening. This patch adds the hostname of the
server, if present, to the warning messages that are logged when 503
responses are handled.
@Sharpie Sharpie force-pushed the PUP-7451-add-retry-after-to-http-requests branch from 2aaa24e to aa0967a Compare September 12, 2017 18:10
@branan branan changed the base branch from master to 5.3.x September 14, 2017 17:41
@branan branan changed the base branch from 5.3.x to master September 14, 2017 17:52
@branan branan changed the base branch from master to 5.3.x September 14, 2017 17:52
@branan
Copy link
Contributor

branan commented Sep 14, 2017

This looks clean, as much as it can be.

I don't really like the idea of just sitting and blocking on requests - it would be nice to allow higher levels of Puppet to continue to do other work. That requires much bigger refactors (which I think might be worthwhile for other reasons), though.

I'm fine seeing this land as-is, but I'm going to make sure I write up my thoughts on refactoring this.

@branan branan merged commit 3ec7589 into puppetlabs:5.3.x Sep 14, 2017
@Sharpie
Copy link
Contributor Author

Sharpie commented Sep 15, 2017

🎉 Thanks a bunch for the review and feedback!

@Sharpie Sharpie deleted the PUP-7451-add-retry-after-to-http-requests branch September 15, 2017 21:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants