
Fixed encoding guessing: only search for meta tags in HTML bodies #4566

Merged
merged 1 commit into mitmproxy:main on Mar 17, 2022

Conversation

Prinzhorn
Member

Description

Real world issue: https://www.youtube.com/s/desktop/245f415e/jsbin/desktop_polymer_inlined_html_polymer_flags.vflset/desktop_polymer_inlined_html_polymer_flags.js

This JavaScript file contains a matching meta tag, so the guessed encoding ended up being a literal backslash \, which I then passed along to codecs.lookup(enc).name. The added check restricts the meta-tag search to html and xhtml content types.
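
For reference, a minimal repro of the resulting failure (just the stdlib call, not the mitmproxy call site):

import codecs

codecs.lookup("\\").name  # raises LookupError: unknown encoding: \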

I think we want to rewrite/improve the guessing anyway with better tests, e.g. #4415

Checklist

  • I have updated tests where applicable.
  • I have added an entry to the CHANGELOG.

@mhils
Member

mhils commented Apr 21, 2021

I don't think we should rely on a content-type header being set, that's precisely the point here - the content type is specified in the HTML only. To me it looks like we should much rather improve our detection heuristic to not trigger on your file?

While we're at it, we should also reduce the search to the first 1024 bytes by using re.match(r".{0,1024}")... :)
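
A minimal sketch of that bound, with an illustrative meta-tag pattern (not mitmproxy's actual regex); slicing the body is equivalent to anchoring the regex to the first 1024 bytes:

import re

# Illustrative charset-sniffing pattern; mitmproxy's real regex differs.
META_CHARSET = re.compile(rb"""<meta[^>]+charset=["']?([\w-]+)""", re.IGNORECASE)

def sniff_meta_charset(body: bytes) -> str | None:
    # Only scan the first 1024 bytes instead of a multi-megabyte body.
    match = META_CHARSET.search(body[:1024])
    return match.group(1).decode("ascii") if match else None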

@Prinzhorn
Member Author

Prinzhorn commented Apr 21, 2021

I don't think we should rely on a content-type header being set, that's precisely the point here - the content type is specified in the HTML only

Could you point to some real world test-case here? I thought we're only looking at the meta tag because of the encoding, not because of the content-type. I don't think any browser beyond Internet Explorer does any content-type sniffing anymore, so if there is no content-type header it will never be upgraded to HTML and hence the meta tag does not have any effect. So why should mitmproxy do it? But maybe I'm missing a use-case here?

To me it looks like we should much rather improve our detection heuristic to not trigger on your file?

It could be html in content_type or content_type is None, that should catch all cases? In my case it's correctly served as JavaScript so there is absolutely no reason to sniff a different content-type (which this code doesn't, it just tries to guess the encoding).
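
As a one-liner sketch of that guard (content_type is assumed to be the lowercased header value or None; sniff_meta_charset is the illustrative helper from the sketch above):

if content_type is None or "html" in content_type:
    # Only bodies that could plausibly be HTML get the meta-tag sniff.
    enc = sniff_meta_charset(content)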

While we're at it, we should also reduce the search to the first 1024 bytes by using re.match(r".{0,1024}")... :)

Yes, especially since that YouTube JavaScript has 7.7 MB 😄

@mhils
Member

mhils commented Apr 21, 2021

Could you point to some real world test-case here? I thought we're only looking at the meta tag because of the encoding, not because of the content-type. I don't think any browser beyond Internet Explorer does any content-type sniffing anymore, so if there is no content-type header it will never be upgraded to HTML and hence the meta tag does not have any effect. So why should mitmproxy do it? But maybe I'm missing a use-case here?

If you have a proper content-type header, that header could/should already set the charset (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type#directives). So assuming a proper HTML content type is slightly optimistic. That being said, I have no idea what modern browsers are doing.

I know that requests has given up at some point and now exclusively relies on chardet. Not sure what the current state is, but that used to have terrible performance. Chromium apparently does something similar with https://opensource.google/projects/ced. Firefox has an even newer thing: https://hsivonen.fi/chardetng/.

It could be html in content_type or content_type is None, that should catch all cases?

That sounds good to me! We can then also do the same "is None" for the CSS rule right below. :)

@Prinzhorn
Member Author

Prinzhorn commented Apr 26, 2021

I'll look into this eventually, but I just came across another issue in-the-wild: https://csync.smartadserver.com/diff/rtb/csync/CookieSyncV.html?hasrtb=true&nwid=2416&dcid=6&iscname=false&cname=

It is served as text/html (without specified charset) but contains a UTF-8 BOM 🎉

Given the number of edge cases that I alone ran into (without even searching for them), I'd love to evaluate new options for v7. I'll look into chardet as well; for my use-case I don't depend on what mitmproxy detects (I exclusively work with content).

I know that a chardet-like approach will have new edge cases as well. Not sure how it would handle the content-type: image/gif; charset=utf-8 thingy and in what funny ways this will break.

@Prinzhorn
Member Author

Prinzhorn commented Apr 26, 2021

that's precisely the point here - the content type is specified in the HTML only

That's not necessarily true, e.g. <meta charset="utf-8"> without http-equiv.

Sorry I'm just brain dumping here 😄

@Prinzhorn
Member Author

I think one option that #4415 (comment) is missing is to try decoding inside text and, if it fails, transparently fall back to latin-1. Does that make sense? This would be useful regardless of what method we're going to use to detect the charset, to make sure that text always works. And if the charset was correct, or if you never even access text, then there is no downside.
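
A minimal sketch of that fallback, assuming a hypothetical decode helper (not mitmproxy's actual accessor):

def decode_text(content: bytes, guessed_encoding: str) -> str:
    try:
        return content.decode(guessed_encoding)
    except (UnicodeDecodeError, LookupError):
        # latin-1 maps every byte 0x00-0xff to a code point, so this never fails.
        return content.decode("latin-1")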

@Prinzhorn
Member Author

One issue with the chardet approach that neither Google, Firefox, nor requests has: in mitmproxy, bodies are not immutable. So there might be a content-type header with a funny charset but the body does not use funny characters. Now chardet boldly claims this is latin-1 or utf-8. And now someone uses the text setter and we update content with utf-8 bytes and mitmproxy forwards the HTML to the browser. The browser sees the content-type header and might break the utf-8 characters inside the body, depending on what they actually do 😅
I think other projects don't suffer from that, because the bodies are immutable. If you're scraping or analyzing dumps you only care about being able to read the bodies properly, not write them.
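
A hypothetical illustration of that hazard:

body = b"caf\xe9"                # latin-1 bytes for "café"
text = body.decode("latin-1")    # a chardet-style guess of latin-1 decodes fine
new_body = text.encode("utf-8")  # but the text setter writes back b"caf\xc3\xa9"
# If the content-type header still implies latin-1, the browser now renders
# "cafÃ©" instead of "café".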

@mhils
Member

mhils commented Apr 27, 2021

that's precisely the point here - the content type is specified in the HTML only

That's not necessarily true, e.g. <meta charset="utf-8"> without http-equiv.

My previous comment wasn't phrased well. What I meant is: Ideally everyone would also have proper Content-Type headers, but in practice we check for the meta tag because people can't get their headers right.

You may also "enjoy" reading https://www.w3.org/International/questions/qa-html-encoding-declarations.en.

I think one option that #4415 (comment) is missing is to try decoding inside text and, if it fails, transparently fall back to latin-1. Does that make sense?

That's actually a neat idea. 👍 Latin1 is quite terrible, but in contrast to UTF-8 it's wonderfully bijective. 😁

We do need to make sure that we also re-encode it properly in set_text then. For all instances where we fall back to decoding with latin1, we need to also then re-encode with latin1. That probably means refactoring _guess_encoding to not take arguments and always return an encoding that can be used to decode the current body. But I'd be happy to see that sort of change!
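
A simplified sketch of that round-trip guarantee (illustrative signatures, not mitmproxy's API):

def _guess_encoding(declared: str | None, content: bytes) -> str:
    """Return an encoding that is guaranteed to decode `content`."""
    for enc in (declared, "utf-8"):
        if enc is None:
            continue
        try:
            content.decode(enc)
            return enc
        except (UnicodeDecodeError, LookupError):
            continue
    return "latin-1"  # decodes any byte string, and re-encodes it bijectively

def get_text(declared: str | None, content: bytes) -> str:
    return content.decode(_guess_encoding(declared, content))

def set_text(declared: str | None, content: bytes, text: str) -> bytes:
    # Crucially, re-encode with the same encoding get_text would have used.
    return text.encode(_guess_encoding(declared, content))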

One issue with the chardet approach that neither Google, nor Firefox or request have: in mitmproxy bodies are not immutable. So there might be a content-type header with a funny charset but the body does not use funny characters. Now chardet boldly claims this is latin-1 or utf-8. And now someone uses the text setter and we update content with utf-8 bytes and mitmproxy forwards the HTML to the browser. The browser sees the content-type header and might break the utf-8 characters inside the body, depending on what they actually do 😅

Yes, that's also a good point. I'd generally be leaning heavily towards not using chardet anyway; it's just another dependency for relatively little benefit.

@Prinzhorn
Member Author

Prinzhorn commented Feb 22, 2022

I don't think we should rely on a content-type header being set, that's precisely the point here - the content type is specified in the HTML only

Since this just came up again, could you please double-check what you said there? I think the point here is that the content-type header does not specify a charset, not that the content-type is missing completely. Without a content-type header we should just straight up assume application/octet-stream and force latin-1. Anything else would be insane. Unless you have a convincing argument and a real world URL to prove it (one not hosted by you or by anyone you know or are related to).

Here's how I'm currently wrapping _guess_encoding in my addon. I need to normalize the encoding name and make sure it doesn't crash. I never use text at all; Node.js only passes content back and forth, which is why I need to make sure I can properly decode it on the other end.

# (Method of an addon class; imports shown for completeness.)
import codecs
import mitmproxy.http

def __guess_encoding(self, message: mitmproxy.http.Message, content: bytes) -> str:
    enc = message._guess_encoding(content)

    try:
        # Normalize the name, e.g. latin-1 becomes iso8859-1 or UTF8 becomes utf-8.
        enc = codecs.lookup(enc).name
    except LookupError:
        # Could not look up and normalize the name, fall back to iso8859-1.
        enc = "iso8859-1"

    # If we detected a more specific encoding than iso8859-1, make sure the bytes actually decode with it.
    if enc != "iso8859-1":
        try:
            codecs.decode(content, enc)
        except ValueError:
            enc = "iso8859-1"

    return enc

I am happy with this so far, as it is robust (e.g. the GIF with charset=UTF-8 will fall back to iso8859-1). But now we need to improve our heuristics to get some better results.

Ideas (a runnable sketch follows the list):

  1. Missing content-type -> early exit latin-1
  2. Content-type has charset -> done, we trust it (but make sure it actually decodes, evil UTF-8 GIFs beware)
  3. Otherwise check the content-type for known types and apply heuristics, e.g. for HTML look at meta tags, for CSS look at @charset, for JSON force UTF-8, for other known TEXT (xml, svg, plain, ...) types look at BOM.
  4. If the above fails, add some defaults, e.g. JavaScript as UTF-8 (but make sure it decodes and doesn't crash)
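
A runnable sketch of that priority order (the patterns and type checks are simplifications, not mitmproxy's code):

import codecs
import re

def _decodes(content: bytes, enc: str) -> bool:
    try:
        codecs.decode(content, enc)
        return True
    except (ValueError, LookupError):
        return False

def guess_encoding(content_type: str | None, content: bytes) -> str:
    # 1. Missing content-type: assume application/octet-stream, force latin-1.
    if content_type is None:
        return "latin-1"
    # 2. An explicit charset parameter wins, if the body actually decodes with it.
    declared = re.search(r"charset=([\w-]+)", content_type, re.IGNORECASE)
    if declared and _decodes(content, declared.group(1)):
        return declared.group(1)
    # 3. Per-type heuristics, limited to the first 1024 bytes.
    head = content[:1024]
    enc = None
    if "html" in content_type:
        meta = re.search(rb"""<meta[^>]+charset=["']?([\w-]+)""", head, re.IGNORECASE)
        enc = meta.group(1).decode("ascii") if meta else None
    elif "css" in content_type:
        rule = re.match(rb'@charset "([\w-]+)";', head)
        enc = rule.group(1).decode("ascii") if rule else None
    elif "json" in content_type:
        enc = "utf-8"
    elif head.startswith(codecs.BOM_UTF8):
        enc = "utf-8"  # BOM on other known text types
    if enc and _decodes(content, enc):
        return enc
    # 4. Per-type defaults, verified so they can't crash later decoding.
    if "javascript" in content_type and _decodes(content, "utf-8"):
        return "utf-8"
    return "latin-1"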

@Prinzhorn
Member Author

Without a content-type header we should just straight up assume application/octet-stream and force latin-1

I will soften this up a bit, as it makes sense to look for a BOM. If a body starts with a BOM, the chances that it is a text file are much higher than the chances that a random binary blob just happens to start with those bytes.
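
A sketch of such a BOM check (illustrative; the longer 32-bit BOMs must be tested before the 16-bit ones they share a prefix with):

import codecs

def sniff_bom(content: bytes) -> str | None:
    for bom, enc in (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ):
        if content.startswith(bom):
            return enc
    return None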

But it's complicated.

Also see https://github.com/codeprentice-org/sniffpy, which doesn't concern charset/encoding at all, just the MIME type. Did I say it's complicated? We need a third party.

@Prinzhorn
Member Author

Refs #5152

@mhils mhils enabled auto-merge (squash) March 17, 2022 14:18
@mhils mhils merged commit e8ae38c into mitmproxy:main Mar 17, 2022