
Improve encoding detection in WebsiteAgent #1751

Merged
merged 1 commit into master on Oct 27, 2016
Conversation

@knu (Member) commented Oct 25, 2016

Previously, WebsiteAgent always assumed that content with no charset specified in the Content-Type header was encoded in UTF-8. This change makes use of the encoding detector implemented in Nokogiri for HTML/XML documents, instead of blindly falling back to UTF-8.

When the document `type` is `html` or `xml`, WebsiteAgent tries to detect the encoding of a fetched document from the presence of a BOM, XML declaration, or HTML `meta` tag.

This fixes #1742.

@knu knu merged commit 1e14358 into master Oct 27, 2016
@knu knu deleted the encoding_detection branch October 27, 2016 04:01
Development

Successfully merging this pull request may close these issues.

Error when fetching url: "source sequence is illegal/malformed utf-8" error