Russian Webpage parsing support. #263

mrgodhani · 2019-02-08T18:15:00Z

Platform: Mac
Mercury Parser Version: Web-based API (at the moment)
Node Version (if a Node bug):
Browser Version (if a browser bug):

Expected Behavior

Proper encoding for Russian language.

Current Behavior

When parsing this link https://www.finam.ru/analysis/newsitem/putin-nagradil-grefa-ordenom-20190208-203615/?utm_source=rss&utm_medium=new_compaigns&utm_campaign=news_to_finamb the output is not decoded properly, so the text is garbled when rendered as HTML.

Steps to Reproduce

Parse the link https://www.finam.ru/analysis/newsitem/putin-nagradil-grefa-ordenom-20190208-203615/?utm_source=rss&utm_medium=new_compaigns&utm_campaign=news_to_finamb
Check the content output.
Try to render that content with a Cyrillic font.
Instead of properly formatted text, you will see a bunch of '�' characters.
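
A minimal sketch of how '�' characters like these can arise, assuming (the thread does not confirm this) that the page is served in a legacy single-byte Cyrillic encoding such as windows-1251 and is then decoded as UTF-8. The byte values and the `decodeCyrillicRange` helper are illustrative and hypothetical, not part of Mercury Parser:

```javascript
// Illustrative bytes: the word "Путин" as it would be stored in windows-1251.
const bytes = Buffer.from([0xcf, 0xf3, 0xf2, 0xe8, 0xed]);

// Decoding these bytes as UTF-8 produces invalid sequences,
// which Node replaces with U+FFFD (rendered as '�').
const wrong = bytes.toString('utf8');

// Hypothetical partial decoder: in windows-1251, bytes 0xC0-0xFF map
// to U+0410-U+044F. Bytes 0x80-0xBF are not handled in this sketch.
function decodeCyrillicRange(buf) {
  return Array.from(buf, (b) => {
    if (b < 0x80) return String.fromCharCode(b); // plain ASCII
    if (b >= 0xc0) return String.fromCharCode(0x0410 + (b - 0xc0));
    return '?'; // outside the range covered by this sketch
  }).join('');
}

const right = decodeCyrillicRange(bytes);
console.log(wrong.includes('\uFFFD'), right);
```

A real fix would rely on a complete decoder for the declared charset rather than a hand-written range mapping like this one.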

Detailed Description

I use this API to parse articles in my reader app. Some of the Russian news feeds I try to use don't produce properly formatted output.

Possible Solution

mrgodhani · 2019-02-08T18:35:55Z

For more reference, see hello-efficiency-inc/raven-reader#269

HenryQW · 2019-02-11T15:29:20Z

The same applies to Chinese and other Asian languages; you get a bunch of Unicode characters rather than the actual content. See #264

adampash · 2019-02-11T23:34:19Z

Thanks for reporting @mrgodhani — and @HenryQW. I'll be honest: I don't have a ton of experience with encoding in these scenarios. This is where the encoding currently takes place:

https://github.com/postlight/mercury-parser/blob/e033835c7287904371371f922c487e6d0d7d7db8/src/resource/index.js#L63-L79

Does anything stand out to you as doing it wrong? Or any other suggestions? We're more than happy to accept help.

vjyanand · 2019-02-12T02:27:32Z

My findings #267

Make the regex case-insensitive:

var ENCODING_RE = /charset=([\w-]+)\b/i;

Check the truthiness of metaContentType before comparing:

if (metaContentType && properEncoding !== encoding) {
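
As a rough illustration of why these two changes matter, here is a minimal sketch. Only `ENCODING_RE` comes from the snippet above; `detectCharset` and its `fallback` parameter are hypothetical names used for this example, not Mercury Parser's actual code:

```javascript
// Case-insensitive version of the regex suggested above.
const ENCODING_RE = /charset=([\w-]+)\b/i;

// The same pattern without the `i` flag, for comparison only.
const CASE_SENSITIVE_RE = /charset=([\w-]+)\b/;

// Hypothetical helper: extract the charset from a Content-Type value,
// guarding against a missing (undefined or empty) value first.
function detectCharset(contentType, fallback = 'utf-8') {
  if (!contentType) return fallback; // truthiness check before matching
  const match = ENCODING_RE.exec(contentType);
  return match ? match[1].toLowerCase() : fallback;
}

const header = 'text/html; Charset=Windows-1251';
console.log(CASE_SENSITIVE_RE.test(header)); // no match on "Charset="
console.log(detectCharset(header));
console.log(detectCharset(undefined));
```

Without the `i` flag, a declaration written as `Charset=` or `CHARSET=` would be missed, and without the truthiness check a missing meta content type could not be handled safely.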

Thanks

mrgodhani · 2019-02-12T03:29:46Z

@adampash I tested @vjyanand's fix and that's working well.

toufic-m · 2019-03-04T08:55:25Z

This fix has been merged, and will be included in the next release.

mindfulme · 2020-08-02T23:19:18Z

Was this released? I get the following when parsing Russian-language articles

adampash added the bug label Feb 8, 2019

mrgodhani mentioned this issue Feb 8, 2019

Please support cyrilic fonts hello-efficiency-inc/raven-reader#269

Closed

adampash mentioned this issue Feb 11, 2019

feat: add ftchinese parser #264

Closed

toufic-m closed this as completed Mar 4, 2019
