Wrong encoding used if charset only specified in content #1588

jdennis · 2013-09-10T19:40:58Z

HTML pages which declared their charset encoding only in their content
had the wrong encoding applied because the content was never checked.

Currently only the response headers are checked for the charset
encoding; if it is absent, the apparent_encoding heuristic is applied. But
the W3.org doc says one should check the header first for a charset
declaration, then, if that's absent, check the meta tags in the content
for a charset encoding declaration. It also says that if no charset
encoding declaration is found one should assume UTF-8, not ISO-8859-1
(a bad recommendation from the early days of the web).
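
For illustration only, here is a minimal sketch of the lookup order described above: header first, then meta tags in the content, then UTF-8. It is not the patch from the pull request and does not use the library's own helpers. The function names, the regular expression, and the 1024-byte scan limit are all assumptions made for this example.

```python
import re
from email.message import Message

# Hypothetical pattern: matches both <meta charset="..."> and the
# http-equiv form <meta ... content="text/html; charset=...">.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def charset_from_meta(body, scan_limit=1024):
    """Look for a charset declaration in a meta tag near the start of the body.

    scan_limit is an assumed cutoff so that large bodies are not fully scanned.
    """
    match = META_CHARSET_RE.search(body[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode('ascii').lower()


def detect_encoding(content_type, body):
    """Header first, then meta tags in the content, then fall back to UTF-8."""
    return (
        charset_from_header(content_type)
        or charset_from_meta(body)
        or 'utf-8'
    )
```

With this sketch, a page served as plain `text/html` whose body contains `<meta charset="Shift_JIS">` is decoded using the declared charset rather than a header default or a heuristic guess.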

I have a patch (pull request); more details are in the commit comment.

Lukasa · 2013-09-10T20:53:33Z

All relevant discussion is in #1589, so I'm closing this to centralise the discussion there. =)

Lukasa closed this as completed Sep 10, 2013

Lukasa mentioned this issue Dec 3, 2013

Response should not return 'ISO-8859-1' as default encoding #1774

Closed

Lukasa mentioned this issue Aug 7, 2014

add auto detect charset from http body when http headers not seted #2161

Closed

github-actions bot locked as resolved and limited conversation to collaborators Sep 8, 2021
