New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use <meta> tags to discover the per-page encoding of html previews #4183

Merged
merged 4 commits into from Nov 15, 2018

Conversation

Projects
None yet
2 participants
@hawkowl
Copy link
Contributor

hawkowl commented Nov 14, 2018

No description provided.

@hawkowl hawkowl requested a review from matrix-org/synapse-core Nov 14, 2018

@hawkowl

This comment has been minimized.

Copy link
Contributor

hawkowl commented Nov 14, 2018

Fixes #2891

hawkowl added some commits Nov 14, 2018

fix
fix
@@ -53,6 +53,9 @@

logger = logging.getLogger(__name__)

_charset_match = re.compile(br"<\s*meta[^>]*charset\s*=\s*([a-z0-9-]+)", flags=re.I)

This comment has been minimized.

@richvdh

richvdh Nov 15, 2018

Member

Obligatory reference to https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454 here.

(yeah, I don't know how you're supposed to parse the HTML before you know what encoding it is, either)

This comment has been minimized.

@hawkowl

hawkowl Nov 15, 2018

Contributor

Yeah, when I was coming up with this, I had two solutions:

  1. Decode as latin1 (which will decode any 8 bit stream) or ascii with errors set to ignore
  2. Regex looking for a <meta beginning block and then charset=<foo>, which is likely to be in a http-equiv="Content-Type".

Since I consider this best-effort (correctly configured servers ought to be serving the correct header in the Content-Type), I decided a regex was better than parsing it, then parsing it again. I chose not to check for the http equiv part since it can go before or after the content="charset=utf8" bit and make the regex far more complex.

This comment has been minimized.

@richvdh

richvdh Nov 15, 2018

Member

fully agreed

This comment has been minimized.

@hawkowl

hawkowl Nov 15, 2018

Contributor

It's only a 2/10 on the HE COMES scale :)

@richvdh
Copy link
Member

richvdh left a comment

lgtm

@hawkowl hawkowl merged commit df758e1 into develop Nov 15, 2018

5 checks passed

ci/circleci: sytestpy2merged Your tests passed on CircleCI!
Details
ci/circleci: sytestpy2postgresmerged Your tests passed on CircleCI!
Details
ci/circleci: sytestpy3merged Your tests passed on CircleCI!
Details
ci/circleci: sytestpy3postgresmerged Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@hawkowl hawkowl deleted the hawkowl/http-equiv-encodings branch Nov 15, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment