utils/curl: get encoding from headers & scrub non-utf8 chars from content #13223

bayandin · 2022-05-02T13:08:11Z

Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't other open Pull Requests for the same change?
Have you added an explanation of what your changes do and why you'd like us to include them?
Have you written new tests for your changes? Here's an example.
Have you successfully run brew style with your changes locally?
Have you successfully run brew typecheck with your changes locally?
Have you successfully run brew tests with your changes locally?

This PR fixes an audit error raised when the content fetched from a URL is not valid UTF-8.

An example (spotted in Homebrew/homebrew-core#100559):

$ brew audit opencsg --online --skip-style --verbose
Error: invalid byte sequence in UTF-8
Please report this issue:
  https://docs.brew.sh/Troubleshooting
/usr/local/Homebrew/Library/Homebrew/utils/curl.rb:299:in `gsub'
/usr/local/Homebrew/Library/Homebrew/utils/curl.rb:299:in `curl_check_http_content'
/usr/local/Homebrew/Library/Homebrew/formula_auditor.rb:469:in `audit_homepage'
/usr/local/Homebrew/Library/Homebrew/formula_auditor.rb:845:in `block in audit'
/usr/local/Homebrew/Library/Homebrew/formula_auditor.rb:840:in `each'
/usr/local/Homebrew/Library/Homebrew/formula_auditor.rb:840:in `audit'
/usr/local/Homebrew/Library/Homebrew/dev-cmd/audit.rb:196:in `block in audit'
/usr/local/Homebrew/Library/Homebrew/dev-cmd/audit.rb:180:in `to_h'
/usr/local/Homebrew/Library/Homebrew/dev-cmd/audit.rb:180:in `audit'
/usr/local/Homebrew/Library/Homebrew/brew.rb:110:in `<main>'

Here, the https://opencsg.org/ homepage declares <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">, so its content is ISO-8859-1 rather than UTF-8.
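For readers who want to see the failure in isolation, here is a minimal sketch. The sample string and both helper names are hypothetical and are not taken from utils/curl.rb; the regexp is only a stand-in for whatever pattern the real gsub call uses. It assumes that a regexp-based gsub on a string tagged UTF-8 but holding ISO-8859-1 bytes is what produces the error in the backtrace above.

```ruby
# Hypothetical stand-in for page content served as ISO-8859-1: the byte 0xE9
# ("é" in ISO-8859-1) is not a valid UTF-8 sequence on its own.
def sample_latin1_content
  "caf\xE9  <title>OpenCSG</title>"
end

# Returns the error raised by a regexp-based gsub, or nil if none is raised.
def gsub_error(content)
  content.gsub(/\s+/, " ")
  nil
rescue ArgumentError => e
  e
end

content = sample_latin1_content
puts content.encoding        # the literal is tagged with the source encoding
puts content.valid_encoding? # whether the bytes really are valid UTF-8
error = gsub_error(content)
puts "#{error.class}: #{error.message}" if error
```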

BrewTestBot · 2022-05-02T13:08:30Z

Review period ended.

MikeMcQuaid · 2022-05-02T13:39:03Z

Library/Homebrew/utils/curl.rb

@@ -359,6 +359,7 @@ def curl_http_content_headers_and_checksum(

      if status.success?
        file_contents = File.read(file.path)
+        file_contents.encode!(Encoding::UTF_8, invalid: :replace) if headers["content-type"]&.start_with?("text/")


Any reason to or not to do this unconditionally?

My only concern here is handling binary data. But I don't really know if forced encoding could break something for our use cases.

Yeah invalid: :replace is dangerous for binary files.

The correct way to handle this IMO would be:

- If charset= exists in the Content-Type HTTP header, pass that to Encoding.find and check if it's valid (i.e. no ArgumentError). If so the charset should be passed to File.read via the encoding: kwarg.
- To be comprehensive for when the HTTP header doesn't specify a charset, we could also parse for <meta> tags and correct the charset with String#force_encoding, but it's a bit overkill.
- There's also a per-media-type default charset if nothing exists, and that's probably going too far down the rabbit hole.
- Instead of deleting invalid characters here, we maybe should instead scope this to the gsub calls to be safer (since we must already be dealing with text data in those cases). There is also a String#scrub method you can use rather than String#encode!. This is already used in livecheck.rb.

Or for a quick fix: just the last point really.
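The first suggestion (take the charset from the Content-Type header, validate it with Encoding.find, then pass it to File.read) could look roughly like the sketch below. The helper names charset_encoding and read_with_charset and the regexp used to pull out the charset parameter are assumptions made for illustration, not the code that was merged.

```ruby
require "tempfile"

# Hypothetical helper: return the Encoding named by a Content-Type value's
# charset parameter, or nil when it is missing or unknown to Ruby.
def charset_encoding(content_type)
  charset = content_type.to_s[/;\s*charset\s*=\s*"?([^\s;"]+)/i, 1]
  return if charset.nil?

  Encoding.find(charset)
rescue ArgumentError
  nil # Encoding.find raises ArgumentError for unknown names
end

# Hypothetical helper: read a downloaded file, tagging it with the
# header-declared encoding when one is available.
def read_with_charset(path, content_type)
  encoding = charset_encoding(content_type)
  return File.read(path) if encoding.nil?

  File.read(path, encoding: encoding)
end

p charset_encoding("text/html; charset=iso-8859-1")
p charset_encoding("text/html; charset=not-a-real-charset")
```

With the encoding tagged correctly at read time, the string is valid in its own encoding, so later text operations have no invalid bytes to trip over.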

> Yeah invalid: :replace is dangerous for binary files.

Good point, 👍🏻 to not trying to change them 😅

Thanks for the feedback! I'll revise the PR

> If charset= exists in the Content-Type HTTP header, pass that to Encoding.find and check if it's valid (i.e. no ArgumentError). If so the charset should be passed to File.read via the encoding: kwarg.

Added

> Instead of deleting invalid characters here, we maybe should instead scope this to the gsub calls to be safer (since we must already be dealing with text data in those cases). There is also a String#scrub method you can use rather than String#encode!. This is already used in livecheck.rb.

Changed
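As an illustration of scoping the cleanup to the text-matching step, the sketch below scrubs only the copy handed to gsub and leaves the raw bytes untouched, so anything computed from the original content (a checksum, for example) is unaffected. The helper name text_for_matching, the sample string and the whitespace regexp are hypothetical and are not the actual code in utils/curl.rb.

```ruby
require "digest"

# Hypothetical helper: String#scrub returns a new string with invalid byte
# sequences replaced, so the regexp-based gsub no longer raises.
def text_for_matching(content)
  content.scrub.gsub(/\s+/, " ")
end

raw = "caf\xE9\n\n<title>OpenCSG</title>"
checksum = Digest::SHA256.hexdigest(raw)
text = text_for_matching(raw)

puts text.valid_encoding?
puts Digest::SHA256.hexdigest(raw) == checksum # raw content is not modified
```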

utils/curl: force utf-8 encoding for text content

4575ddf

bayandin added the critical label (Critical change which should be shipped as soon as possible.) May 2, 2022

BrewTestBot approved these changes May 2, 2022

MikeMcQuaid approved these changes May 2, 2022

utils/curl: get encoding from header

6643f58

MikeMcQuaid approved these changes May 4, 2022

bayandin merged commit 08e94d0 into Homebrew:master May 4, 2022

bayandin deleted the force-text-encoding branch May 4, 2022 15:55

bayandin changed the title ~~utils/curl: force utf-8 encoding for text content~~ utils/curl: get encoding from headers & scrub non-utf8 chars from content May 4, 2022

github-actions bot added the outdated label (PR was locked due to age) Jun 4, 2022

github-actions bot locked as resolved and limited conversation to collaborators Jun 4, 2022
