Fixed encoding guessing: only search for meta tags in HTML bodies #4566
Conversation
I don't think we should rely on a content-type header being set; that's precisely the point here - the content type is specified in the HTML only. To me it looks like we should much rather improve our detection heuristic to not trigger on your file? While we're at it, we should also reduce the search to the first 1024 bytes by using
Could you point to some real-world test case here? I thought we're only looking at the meta tag because of the encoding, not because of the content-type. I don't think any browser beyond Internet Explorer does any content-type sniffing anymore, so if there is no content-type header it will never be upgraded to HTML and hence the meta tag does not have any effect. So why should mitmproxy do it? But maybe I'm missing a use-case here?
It could be
Yes, especially since that YouTube JavaScript is 7.7 MB 😄
If you have a proper content-type header, that header could/should already set the charset (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type#directives). So assuming a proper HTML content type is slightly optimistic. That being said, I have no idea what modern browsers are doing. I know that requests has given up at some point and now exclusively relies on chardet. Not sure what's the current state, but that used to have terrible performance. Chromium apparently does something similar with https://opensource.google/projects/ced. Firefox has an even newer thing: https://hsivonen.fi/chardetng/.
That sounds good to me! We can then also do the same "is None" for the CSS rule right below. :)
I'll look into this eventually, but I just came across another issue in-the-wild: https://csync.smartadserver.com/diff/rtb/csync/CookieSyncV.html?hasrtb=true&nwid=2416&dcid=6&iscname=false&cname= It is served as

Given the amount of edge cases that I alone ran into (without even searching for them), I'd love to evaluate new options for v7. I'll look into chardet as well; for my use-case I don't depend on what mitmproxy detects (I exclusively work with

I know that a chardet-like approach will have new edge cases as well. Not sure how it would handle the
That's not necessarily true, e.g.

Sorry, I'm just brain dumping here 😄
I think one option that #4415 (comment) is missing is to
One issue with the chardet approach that neither Google, nor Firefox, nor requests have: in mitmproxy, bodies are not immutable. So there might be a content-type header with a funny charset, but the body does not use funny characters. Now chardet boldly claims this is latin-1 or utf-8. And now someone uses the
My previous comment wasn't phrased well. What I meant is: Ideally everyone would also have proper Content-Type headers, but in practice we check for the meta tag because people can't get their headers right. You may also "enjoy" reading https://www.w3.org/International/questions/qa-html-encoding-declarations.en.
That's actually a neat idea. 👍 Latin1 is quite terrible, but in contrast to UTF-8 it's wonderfully bijective. 😁 We do need to make sure that we also re-encode it properly in
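The bijectivity claim is easy to check: latin-1 maps each of the 256 byte values to exactly one code point, so any byte string survives a decode/encode round trip, whereas arbitrary bytes need not be valid UTF-8 at all. A quick demonstration:

```python
def roundtrips_latin1(data: bytes) -> bool:
    # latin-1 decodes any byte string without error, and re-encoding
    # restores the original bytes exactly (one byte <-> one code point).
    return data.decode("latin-1").encode("latin-1") == data

# Every possible byte value round-trips through latin-1:
all_bytes = bytes(range(256))
assert roundtrips_latin1(all_bytes)

# The same bytes are not valid UTF-8 (e.g. a lone 0x80 continuation byte):
try:
    all_bytes.decode("utf-8")
except UnicodeDecodeError:
    pass
```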
Yes, that's also a good point. I'd generally be leaning heavily towards not using chardet anyway; it's just another dependency for relatively little benefit.
Since this just came up again, could you please double check what you said there? I think the point here is that the

Here's how I'm currently wrapping `_guess_encoding`:

```python
def __guess_encoding(self, message: mitmproxy.http.Message, content: bytes) -> str:
    # A method on a helper class; codecs and mitmproxy.http are
    # imported at module level.
    enc = message._guess_encoding(content)
    try:
        # Normalize the name, e.g. latin-1 becomes iso8859-1 or UTF8 becomes utf-8.
        enc = codecs.lookup(enc).name
    except LookupError:
        # Could not look up and normalize the name, fall back to iso8859-1.
        enc = "iso8859-1"
    # If we detected a more specific encoding than iso8859-1,
    # make sure the bytes actually decode with it.
    if enc != "iso8859-1":
        try:
            codecs.decode(content, enc)
        except ValueError:
            enc = "iso8859-1"
    return enc
```

I am happy with this so far, as it is robust (e.g. the GIF with charset=UTF-8 will fall back to iso8859-1). But now we need to improve our heuristics to get some better results. Ideas:
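As a standalone sanity check of what the normalization and fallback in the wrapper above do in practice:

```python
import codecs

# codecs.lookup normalizes aliases to a single canonical name:
assert codecs.lookup("latin-1").name == "iso8859-1"
assert codecs.lookup("UTF8").name == "utf-8"

# A binary blob served with charset=UTF-8 fails the decode check
# (0xff can never appear in valid UTF-8), which is exactly what
# triggers the iso8859-1 fallback in the wrapper:
blob = b"GIF89a\xff\x00\x01"
try:
    codecs.decode(blob, "utf-8")
    decoded_ok = True
except UnicodeDecodeError:  # a ValueError subclass
    decoded_ok = False
assert not decoded_ok
```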
I will soften this up a bit, as it makes sense to look for a BOM. The chance that it is a text file is much higher than that your random binary blob starts with a BOM. But it's complicated. Also see https://github.com/codeprentice-org/sniffpy, which doesn't concern charset/encoding at all, just the MIME type. Did I say it's complicated? We need a third party.
Refs #5152
Description
Real world issue: https://www.youtube.com/s/desktop/245f415e/jsbin/desktop_polymer_inlined_html_polymer_flags.vflset/desktop_polymer_inlined_html_polymer_flags.js
This JavaScript file contains a matching meta tag, and the detected encoding ended up being a literal backslash `\`, which I passed along to `codecs.lookup(enc).name`. The added check will catch html and xhtml. I think we want to rewrite/improve the guessing anyway with better tests, e.g. #4415