New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Curl: Add methods to parse response #11252
Curl: Add methods to parse response #11252
Conversation
Review period will end on 2021-04-27 at 16:48:56 UTC. |
Review period ended. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see some internal usage of these methods before we add them. That will help inform the APIs and review here.
I'll need to see the planned usage of these methods before I can give decent review. I don't think it's worth designing them divorced from their planned usage. If that means #10834 is the demonstration of this: ok but please let's try and do what we can to get that merged ASAP, thanks. |
f5bfb59
to
ae57da3
Compare
I have a separate branch [based on this one] for the commits that make changes to adopt the methods in this PR but I've merged it into this branch so you can see what code it's intended to replace and how it works in practice. At the moment, this branch is based on #10834, so I could properly include the If we still want to handle those changes separately (per #10834 (review)), I can always pare this back to the original commit before we consider merging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hope the comments are helpful, nice work so far.
Library/Homebrew/utils/curl.rb
Outdated
@@ -303,6 +308,85 @@ def curl_http_content_headers_and_checksum(url, specs: {}, hash_needed: false, u | |||
def http_status_ok?(status) | |||
(100..299).cover?(status.to_i) | |||
end | |||
|
|||
# Separates the output text from `curl` into an array of response heads and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a response "head"? Headers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@samford said:
A "head" here is the status line and header lines for one HTTP response (i.e., not the body). The heads array can contain head hashes from multiple responses. [Examples of the head hash format can be found in curl_spec.rb.]
For a response with no redirections, the curl output can consist of a head (the status line and headers) and/or a body (the content). When there are redirections, the curl output would contain heads for multiple responses. The head(s) and the body are separated by \r\n\r\n and that's what we use to #partition the output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think heads
could get a better name here, it's not obvious what it means from this alone. Perhaps incorporate some of the above into the comment?
Library/Homebrew/utils/curl.rb
Outdated
# body output, if found. | ||
def parse_curl_output(output) | ||
heads = [] | ||
return { heads: heads, body: "" } unless output.is_a?(String) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is the output not a string and what else can it be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@samford said:
I'm not sure it would ever be anything other than a string but my original thinking was that this would allow us to gracefully handle bad input and avoid an error (since output needs to be a string for output.lstrip, output.match(...), and output.include?(...) to work).
Looking at a later comment in the previous review, it does feel more appropriate to use a Sorbet type signature instead of this explicit guard, so I'll update this accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, let's lean into Sorbet here instead of sniffing types. I also think this could probably raise
rather than silently returning no headers/body.
Library/Homebrew/utils/curl.rb
Outdated
# @param head_text [String] The head text of a `curl` response. | ||
# @return [Hash] A hash containing the status information and headers | ||
# (as a hash with header names as keys). | ||
def parse_curl_head(head_text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to only be used once so I'd suggest it may be better moved from utils
.
ae57da3
to
6970fe0
Compare
This comment has been minimized.
This comment has been minimized.
6970fe0
to
d9c56da
Compare
@samford What's the latest here? Thanks ❤️ |
I remembered this PR last week and had been meaning to come back to it, so thanks for the reminder. I had previously put this on hold until I could look into the Typhoeus (or Ethon) gem but that approach may not end up being feasible. When I originally brought it up, you suggested it could be a possibility if it doesn't build native extensions. It initially looked promising because it doesn't build native extensions on macOS but I later discovered it does on Linux. Namely, typhoeus depends on ethon which depends on ffi, which builds native extensions on Linux. There may be some way around it but I think it makes sense to move this PR forward in the interim time. Regardless of how we approach curl in the long-term, this PR will clean up some duplicated code for the time being. I've incorporated some of the suggested changes locally and I'll update this branch after I've responded to the unresolved comments. |
I receive a - return if !appcast_contents || appcast_contents.include?(adjusted_version_stanza)
+ return if appcast_contents&.include?(adjusted_version_stanza) Unfortunately, using the safe navigation operator wouldn't achieve the same goal. If As one example, this would occur if the curl request in It may be better to handle a falsy If we want to go that route, we could avoid duplicating the
A "head" here is the status line and header lines for one HTTP response (i.e., not the body). The For a response with no redirections, the curl output can consist of a head (the status line and headers) and/or a body (the content). When there are redirections, the curl output would contain heads for multiple responses. The head(s) and the body are separated by
I'm not sure it would ever be anything other than a string but my original thinking was that this would allow us to gracefully handle bad input and avoid an error (since Looking at a later comment in the previous review, it does feel more appropriate to use a Sorbet type signature instead of this explicit guard, so I'll update this accordingly.
Same as above, I don't imagine this would ever be anything other than a string and I'll replace the
It's only used in |
d9c56da
to
0bd6e11
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, nice work @samford! General themes:
- prefer
.blank?
and.present?
when we're dealing with strings to handle the empty string case more nicely - avoid sniffing types. Instead, rely on Sorbet and exceptions being raise on unexpected data.
Library/Homebrew/utils/curl.rb
Outdated
@@ -303,6 +308,85 @@ def curl_http_content_headers_and_checksum(url, specs: {}, hash_needed: false, u | |||
def http_status_ok?(status) | |||
(100..299).cover?(status.to_i) | |||
end | |||
|
|||
# Separates the output text from `curl` into an array of response heads and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@samford said:
A "head" here is the status line and header lines for one HTTP response (i.e., not the body). The heads array can contain head hashes from multiple responses. [Examples of the head hash format can be found in curl_spec.rb.]
For a response with no redirections, the curl output can consist of a head (the status line and headers) and/or a body (the content). When there are redirections, the curl output would contain heads for multiple responses. The head(s) and the body are separated by \r\n\r\n and that's what we use to #partition the output.
Library/Homebrew/utils/curl.rb
Outdated
@@ -303,6 +308,85 @@ def curl_http_content_headers_and_checksum(url, specs: {}, hash_needed: false, u | |||
def http_status_ok?(status) | |||
(100..299).cover?(status.to_i) | |||
end | |||
|
|||
# Separates the output text from `curl` into an array of response heads and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think heads
could get a better name here, it's not obvious what it means from this alone. Perhaps incorporate some of the above into the comment?
Library/Homebrew/utils/curl.rb
Outdated
# body output, if found. | ||
def parse_curl_output(output) | ||
heads = [] | ||
return { heads: heads, body: "" } unless output.is_a?(String) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@samford said:
I'm not sure it would ever be anything other than a string but my original thinking was that this would allow us to gracefully handle bad input and avoid an error (since output needs to be a string for output.lstrip, output.match(...), and output.include?(...) to work).
Looking at a later comment in the previous review, it does feel more appropriate to use a Sorbet type signature instead of this explicit guard, so I'll update this accordingly.
Library/Homebrew/utils/curl.rb
Outdated
# body output, if found. | ||
def parse_curl_output(output) | ||
heads = [] | ||
return { heads: heads, body: "" } unless output.is_a?(String) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, let's lean into Sorbet here instead of sniffing types. I also think this could probably raise
rather than silently returning no headers/body.
8f9d6f9
to
670cb4e
Compare
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me now! Thanks for all the work here and sorry for back and forth. Good to 🚢 when you're happy.
88ce5f5
to
9e37a03
Compare
Library/Homebrew/utils/curl.rb
Outdated
parsed_output = parse_curl_output(range_stdout) | ||
|
||
headers = if parsed_output[:responses].present? | ||
parsed_output[:responses].first[:headers] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one area that may benefit from some discussion. range_stdout.split("\r\n\r\n").first
seems to be intended to only parse the headers of the first response and ignore any others (i.e., #parse_headers
was only able to handle one response), so I simply replicated the existing behavior in the new code.
However, this may not be appropriate for a URL that redirects, where responses before the final response would be redirections. Since the behavior in this method is determined by the presence/value of accept-ranges
and content-length
headers, any header differences between the first and last responses could lead to unintended behavior.
When we're downloading something, I imagine we would be primarily interested in the accept-ranges
and content-length
headers from the last response (after any redirections). If that makes sense, it's as simple as using parsed_output[:responses].last[:headers]
instead of #first
.
Alternatively, we could merge the headers from all responses into one, where headers from later responses would overwrite headers from earlier responses. livecheck's HeaderMatch
strategy does this internally (to simplify working with headers) and similar code here could look like: parsed_output[:responses].collect { |res| res[:headers] }.reduce(&:merge)
. I would only be useful if the last response is missing headers that earlier responses include, which may not be applicable in this specific context.
I think #last
is probably appropriate/sufficient here but I'll have to test this idea. I figured I would mention it in the interim time, to get your thoughts on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one area that may benefit from some discussion.
range_stdout.split("\r\n\r\n").first
seems to be intended to only parse the headers of the first response and ignore any others (i.e.,#parse_headers
was only able to handle one response), so I simply replicated the existing behavior in the new code.
Given that: can we merge this and then discuss in a future PR? I'd really like to just get this one merged out and then be able to discuss/review smaller changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me. I'll create a follow-up PR after I merge this in the morning (EDT).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shoot, I forgot that I had committed the #last
change when I was testing it yesterday, so it was merged with the other changes earlier. I'll open a follow-up PR that would revert this one change and explain the reasoning behind why #last
is the correct behavior in this context. If we agree on #last
, then we can simply close the PR without merging but it will be there if we need to revert this change (highly unlikely, based on how #curl_download
is used).
Sorry about this. I intended to incorporate this change in a follow-up PR but it was so small that I missed it in my final review.
9e37a03
to
1c4faaa
Compare
I did some manual testing over the past couple days and didn't encounter any related failures. I looked over these changes another time and the only thing that stuck out was that I forgot to commit the aforementioned I'm going to bed over here and I'll merge this in the morning (EDT), so I'll be around if anything starts smoking 😆 |
Merging now. Thanks for your patience and all the review along the way, @MikeMcQuaid! |
Great work thanks @samford! |
Before Homebrew#11252 was merged, #curl_download used the headers from the first response to check for `accept-ranges` and `content-length` headers. This behavior was incorrect and it should have been checking the last response headers, as earlier responses would be redirections that may omit the `accept-ranges` header and/or use a different `content-length` value than the final response. I intended to maintain the "first" behavior in Homebrew#11252 and switch to `#last` in a follow-up PR but I accidentally incorporated this change into the aforementioned PR. This commit should not be merged and it's simply for the sake of explaining this change after-the-fact.
Hey @samford, there's something in this PR that is causing some audits to fail on the cask side of things. From what I can see, I've drilled it down to this PR. Here's a couple of examples; The files are fetched correctly, however the URL is still failing the audit
I may be able to do some more digging tonight - but not for a few hours. |
@bevanjkay I wasn't able to replicate this locally with the This issue was introduced in the I tested this and it should be as simple as moving the parsing logic back to its previous location, so it isn't guarded by |
How do we best handle third party taps which may be hit by the new redirection limit? Just tell them to change their formula? https://github.com/orgs/Homebrew/discussions/3215 Not sure if proxies are having a further effect on this or not. I suppose another question would be: any reason for 5 specifically over something more conservative? |
This shouldn't be something that anyone needs to think about, so I think we should just increase the default Long term, we may want to figure out a programmatic way of setting the limit for the
5 was just an arbitrary number and there's nothing really special about it. If I remember correctly, our intention was to start low and increase it if we start seeing the related error under normal circumstances. Originally the code in
I wasn't able to replicate it locally when installing the |
👍🏻 fine with me to jump bump this limit to whatever is necessary. |
Bumped to 25 in #13202 and tagged in 3.4.9. |
brew style
with your changes locally?brew typecheck
with your changes locally?brew tests
with your changes locally?Overview
This work was originally part of #9535 and then #10834 (a recreation of the former PR) but I've refactored the methods to make them a bit more general purpose here. While working on migrating livecheck to curl, it became apparent that there are a few areas in brew where similar code exists to parse curl output.
The intention of this PR is to provide methods for working with curl output in a standardized way, so we can avoid unnecessary code duplication. Notable changes are as follows:
parse_curl_output
andparse_curl_head
methods to parse curl output (a string containing response head(s) and/or the final response body). The code in these methods is partly taken from existing code in brew, as the aim is to replace existing code with these methods. The main difference is that these methods parse response heads into hashes and organize the data in a specific fashion (so we're not reinventing the wheel in multiple locations).curl_response_status_code
andcurl_response_last_location
methods that take the parsed output from the aforementioned methods and extract specific information (the status code of the last response and the lastlocation
header, respectively).Utils::Curl
tests to cover the new methods.Example
If we have curl output like:
parse_curl_output
will produce a hash like:The response heads are parsed using
parse_curl_head
, which produces a hash containing the status code and header strings.If we provide the heads array (the first member of the output array from
parse_curl_output
), it returns"200"
. If we do the same withcurl_response_last_location
, it returns"https://example.com/"
.Conclusion
This PR only focuses on adding these methods and I would appreciate feedback on the shape of this. I'll add comments in places that I think may benefit from discussion.
If/when this is merged, I'll create a follow-up PR that replaces existing code with these methods (as it was suggested that it may be better to keep this separate). I'll be rebasing #10834 onto this shortly and I'll also update that PR to account for any changes that we make here in review.