New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
utils/curl: get encoding from headers & scrub non-utf8 chars from content #13223
Conversation
Review period ended. |
Library/Homebrew/utils/curl.rb
Outdated
@@ -359,6 +359,7 @@ def curl_http_content_headers_and_checksum( | |||
|
|||
if status.success? | |||
file_contents = File.read(file.path) | |||
file_contents.encode!(Encoding::UTF_8, invalid: :replace) if headers["content-type"]&.start_with?("text/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to or not to do this unconditionally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only concern here is handling binary data. But I don't really know if forced encoding could break something for our use cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah invalid: :replace
is dangerous for binary files.
The correct way to handle this IMO would be:
- If
charset=
exists in the Content-Type HTTP header, pass that toEncoding.find
and check if it's valid (i.e. noArgumentError
). If so the charset should be passed toFile.read
via theencoding:
kwarg.- To be comprehensive for when the HTTP header doesn't specify a charset, we could also parse for
<meta>
tags and correct the charset withString#force_encoding
, but it's a bit overkill. - There's also a per-media-type default charset if nothing exists, and that's probably going too far down the rabbit hole.
- To be comprehensive for when the HTTP header doesn't specify a charset, we could also parse for
- Instead of deleting invalid characters here, we maybe should instead scope this to the
gsub
calls to be safer (since we must already be dealing with text data in those cases). There is also aString#scrub
method you can use rather thanString#encode!
. This is already used in livecheck.rb.
Or for a quick fix: just the last point really.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah
invalid: :replace
is dangerous for binary files.
Good point, 👍🏻 to not trying to change them 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback! I'll revise the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If charset= exists in the Content-Type HTTP header, pass that to Encoding.find and check if it's valid (i.e. no ArgumentError). If so the charset should be passed to File.read via the encoding: kwarg.
Added
Instead of deleting invalid characters here, we maybe should instead scope this to the gsub calls to be safer (since we must already be dealing with text data in those cases). There is also a String#scrub method you can use rather than String#encode!. This is already used in livecheck.rb.
Changed
brew style
with your changes locally?brew typecheck
with your changes locally?brew tests
with your changes locally?This PR fixes an audit error for non-utf-8 URL content.
An example (spotted in Homebrew/homebrew-core#100559):
Here https://opencsg.org/ has
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">