Russian Webpage parsing support. #263

Closed
mrgodhani opened this issue Feb 8, 2019 · 7 comments

@mrgodhani

mrgodhani commented Feb 8, 2019

  • Platform: Mac
  • Mercury Parser Version: Web-based API (at the moment)
  • Node Version (if a Node bug):
  • Browser Version (if a browser bug):

Expected Behavior

Correct encoding of Russian-language content.

Current Behavior

When parsing this link https://www.finam.ru/analysis/newsitem/putin-nagradil-grefa-ordenom-20190208-203615/?utm_source=rss&utm_medium=new_compaigns&utm_campaign=news_to_finamb the output is not correctly encoded, so the formatting is messed up when the content is rendered as HTML.

Steps to Reproduce

  1. Parse the link https://www.finam.ru/analysis/newsitem/putin-nagradil-grefa-ordenom-20190208-203615/?utm_source=rss&utm_medium=new_compaigns&utm_campaign=news_to_finamb
  2. Check the content output
  3. Try to render that content with a Cyrillic font
  4. Instead of properly formatted text, you will see a bunch of '�' characters
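
To illustrate how '�' characters like these can appear, here is a minimal sketch. It assumes the page is served in a legacy Cyrillic encoding such as windows-1251 and that the bytes are decoded as UTF-8 instead. The issue does not confirm which encoding the site uses, so treat the byte values below as an example only.

```javascript
// Illustration only: assumes a windows-1251 source, which the issue
// does not confirm. These are the windows-1251 bytes for "Путин".
const cp1251Bytes = Buffer.from([0xcf, 0xf3, 0xf2, 0xe8, 0xed]);

// Decoding the bytes as UTF-8 is the mistake being illustrated.
// The byte sequences are invalid UTF-8, so Node substitutes
// U+FFFD (the replacement character '�').
const misdecoded = cp1251Bytes.toString('utf8');

console.log(misdecoded);
console.log(misdecoded.includes('\uFFFD'));
```

Once the replacement character has been substituted, the original bytes are lost, so the correct encoding has to be chosen before decoding rather than repaired afterwards.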

Detailed Description

I use this API to parse articles in my reader app. Some Russian news feeds that people try to use do not produce properly formatted output.

Possible Solution

@mrgodhani
Author

For more context, see hello-efficiency-inc/raven-reader#269

@HenryQW

HenryQW commented Feb 11, 2019

The same applies to Chinese and other Asian languages: you get a bunch of Unicode characters rather than the actual content. See #264

@adampash
Contributor

adampash commented Feb 11, 2019

Thanks for reporting @mrgodhani — and @HenryQW. I'll be honest: I don't have a ton of experience with encoding in these scenarios. This is where the encoding currently takes place:

https://github.com/postlight/mercury-parser/blob/e033835c7287904371371f922c487e6d0d7d7db8/src/resource/index.js#L63-L79

Does anything stand out to you as doing it wrong? Or any other suggestions? We're more than happy to accept help.

@vjyanand

vjyanand commented Feb 12, 2019

My findings (see #267):

  1. Make the regex case-insensitive

var ENCODING_RE = /charset=([\w-]+)\b/i;

  2. Check that metaContentType is truthy before comparing

if (metaContentType && properEncoding !== encoding) {
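
For anyone following along, the two changes above can be sketched in isolation roughly as follows. This is not Mercury Parser's actual code: `extractCharset`, `pickEncoding`, and the `'utf-8'` fallback are hypothetical names and assumptions used only to show why the `i` flag and the truthiness check matter.

```javascript
// Hypothetical helpers for illustration; not the library's real API.
// Same pattern as in the fix, with the case-insensitive flag.
const ENCODING_RE = /charset=([\w-]+)\b/i;

// Returns the lowercased charset name found in a Content-Type-like
// string, or null when the value is missing or has no charset.
function extractCharset(value) {
  if (!value) return null;
  const match = ENCODING_RE.exec(value);
  return match ? match[1].toLowerCase() : null;
}

// Prefers the charset declared in the page's <meta> tag when it is
// present and differs from the HTTP header. Falling back to 'utf-8'
// is an assumption of this sketch.
function pickEncoding(contentTypeHeader, metaContentType) {
  const headerEncoding = extractCharset(contentTypeHeader) || 'utf-8';
  const metaEncoding = extractCharset(metaContentType);
  // Guard against a missing meta tag before comparing.
  if (metaContentType && metaEncoding && metaEncoding !== headerEncoding) {
    return metaEncoding;
  }
  return headerEncoding;
}

console.log(pickEncoding('text/html; Charset=Windows-1251', undefined));
console.log(pickEncoding('text/html', 'text/html; charset=windows-1251'));
console.log(pickEncoding('text/html', undefined));
```

Without the `i` flag, a header written as `Charset=Windows-1251` would not match at all, and without the guard a page with no meta tag could fail the comparison instead of keeping the header's encoding.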

Thanks

@mrgodhani
Author

@adampash I tested @vjyanand's fix and it works well.

@toufic-m
Contributor

toufic-m commented Mar 4, 2019

This fix has been merged, and will be included in the next release.

@toufic-m toufic-m closed this as completed Mar 4, 2019
@mindfulme

Did this get released? I get the following when parsing Russian-language articles:
[Screenshot 2020-08-03 at 02 19 06]
