utils/curl: get encoding from headers & scrub non-utf8 chars from content #13223

Merged
merged 2 commits into from May 4, 2022

bayandin (Member) commented May 2, 2022

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same change?
  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your changes? Here's an example.
  • Have you successfully run brew style with your changes locally?
  • Have you successfully run brew typecheck with your changes locally?
  • Have you successfully run brew tests with your changes locally?

This PR fixes a brew audit error that is raised when the content fetched from a URL is not valid UTF-8.

An example (spotted in Homebrew/homebrew-core#100559):

$ brew audit opencsg --online --skip-style --verbose
Error: invalid byte sequence in UTF-8
Please report this issue:
  https://docs.brew.sh/Troubleshooting
/usr/local/Homebrew/Library/Homebrew/utils/curl.rb:299:in `gsub'
/usr/local/Homebrew/Library/Homebrew/utils/curl.rb:299:in `curl_check_http_content'
/usr/local/Homebrew/Library/Homebrew/formula_auditor.rb:469:in `audit_homepage'
/usr/local/Homebrew/Library/Homebrew/formula_auditor.rb:845:in `block in audit'
/usr/local/Homebrew/Library/Homebrew/formula_auditor.rb:840:in `each'
/usr/local/Homebrew/Library/Homebrew/formula_auditor.rb:840:in `audit'
/usr/local/Homebrew/Library/Homebrew/dev-cmd/audit.rb:196:in `block in audit'
/usr/local/Homebrew/Library/Homebrew/dev-cmd/audit.rb:180:in `to_h'
/usr/local/Homebrew/Library/Homebrew/dev-cmd/audit.rb:180:in `audit'
/usr/local/Homebrew/Library/Homebrew/brew.rb:110:in `<main>'

Here, https://opencsg.org/ declares a non-UTF-8 charset: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
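The failure in the backtrace can be reproduced in isolation. The following is a minimal sketch, not the code from utils/curl.rb: the sample bytes, the regex, and the `gsub_error` helper are assumptions made for illustration. It shows ISO-8859-1 bytes being treated as UTF-8 and a regex `gsub` rejecting them.

```ruby
# Bytes for "café" in ISO-8859-1: the trailing \xE9 is not a valid UTF-8 sequence.
latin1_bytes = "caf\xE9".b

# Simulate reading those bytes with Ruby's default UTF-8 external encoding.
mislabeled = latin1_bytes.dup.force_encoding(Encoding::UTF_8)
puts mislabeled.valid_encoding? # false

# Hypothetical helper: returns the message of the error raised by a regex gsub, or nil.
def gsub_error(text)
  text.gsub(/\s+/, " ")
  nil
rescue ArgumentError => e
  e.message
end

puts gsub_error(mislabeled) # invalid byte sequence in UTF-8

# Tagging the same bytes with the declared charset makes them valid text.
relabeled = latin1_bytes.dup.force_encoding(Encoding::ISO_8859_1)
puts relabeled.encode(Encoding::UTF_8) # café
```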

bayandin added the critical label (Critical change which should be shipped as soon as possible.) May 2, 2022
BrewTestBot (Member) commented May 2, 2022

Review period ended.

@@ -359,6 +359,7 @@ def curl_http_content_headers_and_checksum(

if status.success?
file_contents = File.read(file.path)
file_contents.encode!(Encoding::UTF_8, invalid: :replace) if headers["content-type"]&.start_with?("text/")
Member

Any reason to or not to do this unconditionally?

Member Author

My only concern here is handling binary data. But I don't really know if forced encoding could break something for our use cases.

Bo98 (Member) commented May 2, 2022

Yeah invalid: :replace is dangerous for binary files.
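To illustrate the concern about binary data, here is a small sketch under stated assumptions: the PNG signature stands in for any binary download, and the string is tagged as UTF-8 to mimic what a default `File.read` would return. It is not taken from the PR.

```ruby
# The first eight bytes of a PNG file, tagged as UTF-8 the way a default read would tag them.
png_signature = "\x89PNG\r\n\x1A\n".b
as_text = png_signature.dup.force_encoding(Encoding::UTF_8)

# invalid: :replace swaps each invalid byte for a replacement character,
# so the result is no longer the same file.
replaced = as_text.encode(Encoding::UTF_8, invalid: :replace)

puts png_signature.bytes == replaced.bytes # false
puts replaced.valid_encoding?              # true
```

The result is valid UTF-8, but any checksum computed over it would no longer match the downloaded bytes.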

The correct way to handle this IMO would be:

  • If charset= exists in the Content-Type HTTP header, pass that to Encoding.find and check if it's valid (i.e. no ArgumentError). If so the charset should be passed to File.read via the encoding: kwarg.
    • To be comprehensive for when the HTTP header doesn't specify a charset, we could also parse for <meta> tags and correct the charset with String#force_encoding, but it's a bit overkill.
    • There's also a per-media-type default charset if nothing exists, and that's probably going too far down the rabbit hole.
  • Instead of deleting invalid characters here, we maybe should instead scope this to the gsub calls to be safer (since we must already be dealing with text data in those cases). There is also a String#scrub method you can use rather than String#encode!. This is already used in livecheck.rb.

Or for a quick fix: just the last point really.
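For reference, the <meta> fallback mentioned above could look roughly like the sketch below. The helper name `retag_from_meta`, the 1024-byte window, and the regex are all assumptions for illustration; nothing here is claimed to be in Homebrew.

```ruby
# Hypothetical helper: look for charset=... near the top of an HTML document and
# retag the string with that encoding when Ruby recognizes the name.
def retag_from_meta(html_bytes)
  head = html_bytes.b[0, 1024]
  name = head[/<meta[^>]+charset\s*=\s*["']?([\w.:-]+)/i, 1]
  return html_bytes if name.nil?

  html_bytes.dup.force_encoding(Encoding.find(name))
rescue ArgumentError
  # Unknown charset name: leave the string untouched.
  html_bytes
end

page = %(<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><p>caf\xE9</p>).b
retagged = retag_from_meta(page)
puts retagged.encoding        # ISO-8859-1
puts retagged.valid_encoding? # true
```

`String#force_encoding` only changes the encoding tag and leaves the bytes alone, which is why it suits the case where the bytes are right but the label is wrong.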

Member

> Yeah invalid: :replace is dangerous for binary files.

Good point, 👍🏻 to not trying to change them 😅

bayandin (Member Author) commented May 3, 2022

Thanks for the feedback! I'll revise the PR

Member Author

> If charset= exists in the Content-Type HTTP header, pass that to Encoding.find and check if it's valid (i.e. no ArgumentError). If so the charset should be passed to File.read via the encoding: kwarg.

Added
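A rough sketch of the header-based approach follows. It is not the merged code: the helper name `encoding_from_content_type`, its regex, and the temporary file are assumptions chosen to keep the example self-contained. Only `Encoding.find` and the `encoding:` keyword of `File.read` come from the suggestion itself.

```ruby
require "tempfile"

# Hypothetical helper: extract a charset from a Content-Type value and return the
# matching Encoding, or nil when the charset is absent or unknown to Ruby.
def encoding_from_content_type(content_type)
  name = content_type.to_s[/;\s*charset\s*=\s*"?([^\s";]+)/i, 1]
  return nil if name.nil?

  Encoding.find(name)
rescue ArgumentError
  nil
end

Tempfile.create("page") do |file|
  file.binmode
  file.write("caf\xE9".b) # ISO-8859-1 bytes, as a server might send them
  file.flush

  encoding = encoding_from_content_type("text/html; charset=iso-8859-1")
  contents = encoding ? File.read(file.path, encoding: encoding) : File.read(file.path)
  puts contents.encoding        # ISO-8859-1
  puts contents.valid_encoding? # true
end
```

When no usable charset is present the helper returns nil, so the read falls back to Ruby's default behavior instead of guessing.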

> Instead of deleting invalid characters here, we maybe should instead scope this to the gsub calls to be safer (since we must already be dealing with text data in those cases). There is also a String#scrub method you can use rather than String#encode!. This is already used in livecheck.rb.

Changed
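Scoping the cleanup to the text-processing step might look like the sketch below. The sample string and the whitespace regex are assumptions, not the expressions used in utils/curl.rb. The point is that `String#scrub` returns a cleaned copy just before `gsub`, while the data read from disk stays as it was.

```ruby
# Text containing one ISO-8859-1 byte (\xE9) while tagged as UTF-8.
mislabeled = "caf\xE9   au\n\nlait".dup.force_encoding(Encoding::UTF_8)

# scrub replaces invalid sequences in a copy, so the regex gsub no longer raises.
normalized = mislabeled.scrub.gsub(/\s+/, " ")

puts normalized.valid_encoding? # true
puts mislabeled.valid_encoding? # false, the original string is left as read
```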

bayandin merged commit 08e94d0 into Homebrew:master May 4, 2022
bayandin deleted the force-text-encoding branch May 4, 2022 15:55
bayandin changed the title from "utils/curl: force utf-8 encoding for text content" to "utils/curl: get encoding from headers & scrub non-utf8 chars from content" May 4, 2022
github-actions bot added the outdated label (PR was locked due to age) Jun 4, 2022
github-actions bot locked as resolved and limited conversation to collaborators Jun 4, 2022