Improve encoding detection in WebsiteAgent #1751

knu · 2016-10-25T16:12:31Z

Previously, WebsiteAgent always assumed that content with no charset specified in the Content-Type header was encoded in UTF-8. This change uses the encoding detector implemented in Nokogiri for HTML/XML documents instead of blindly falling back to UTF-8.

When the document `type` is `html` or `xml`, WebsiteAgent tries to detect the encoding of a fetched document from the presence of a BOM, XML declaration, or HTML `meta` tag.

This fixes #1742.
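
To illustrate the detection order described above, here is a minimal stdlib-only sketch. It is not the code in this PR and not Nokogiri's detector; the helper name `sniff_encoding`, the `BOMS` table, the 1024-byte look-ahead, and the UTF-8 fallback for other types are all assumptions made for the example.

```ruby
# Hypothetical helper for illustration only; the PR itself delegates
# detection to Nokogiri rather than to hand-written regexps like these.

# Byte order marks checked by this sketch (UTF-32 is ignored here).
BOMS = {
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFF\xFE".b     => Encoding::UTF_16LE,
  "\xFE\xFF".b     => Encoding::UTF_16BE
}.freeze

def sniff_encoding(body, type, fallback: Encoding::UTF_8)
  # Assumption: only html/xml documents are inspected.
  return fallback unless %w[html xml].include?(type)

  # Work on a binary copy of the first bytes so that invalid byte
  # sequences cannot raise while matching.
  head = body.byteslice(0, 1024).b

  # 1. BOM
  BOMS.each { |bom, enc| return enc if head.start_with?(bom) }

  # 2. XML declaration, then 3. HTML meta tag
  m = head.match(/\A\s*<\?xml\b[^>]*?\bencoding\s*=\s*["']([A-Za-z0-9._:-]+)["']/i) ||
      head.match(/<meta\b[^>]*?\bcharset\s*=\s*["']?([A-Za-z0-9._:-]+)/i)
  return fallback unless m

  begin
    Encoding.find(m[1].force_encoding(Encoding::US_ASCII))
  rescue ArgumentError
    # Unknown or unsupported charset name
    fallback
  end
end
```

With a helper like this, a caller could tag the raw body with the detected encoding and then transcode it, for example `body.force_encoding(sniff_encoding(body, "html")).encode("UTF-8")`.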


knu force-pushed the encoding_detection branch from f157ad2 to cf74d0a Compare October 25, 2016 16:14

knu force-pushed the encoding_detection branch from cf74d0a to 50b5833 Compare October 27, 2016 04:00

knu merged commit 1e14358 into master Oct 27, 2016

knu deleted the encoding_detection branch October 27, 2016 04:01
