Fail to detect charset from certain ShiftJIS page #199

jokester · 2023-10-16T04:04:17Z

Describe the bug

open-graph-scraper v6.3.0 couldn't detect the charset of a webpage I came across.

The page had <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS"> and possibly no other hint about its encoding.

To Reproduce

const openGraphScraper = require('open-graph-scraper')

openGraphScraper({url: 'http://abehiroshi.la.coocan.jp/'}).then(result => console.debug(result))

Expected behavior

      result: {
        ogTitle: '阿部 寛のホームページ',  // Not very confident about this. Would openGraphScraper convert it if the correct encoding was detected?
        charset: 'ShiftJIS',
        requestUrl: 'http://abehiroshi.la.coocan.jp/',
        success: true
      },

Actual behavior

      result: {
        ogTitle: '�������̃z�[���y�[�W',
        charset: 'UTF-8',
        requestUrl: 'http://abehiroshi.la.coocan.jp/',
        success: true
      },

Additional context

OS: Mac if that matters
Node Version: 18.8
openGraphScraper Version: 6.3.0

jokester · 2023-10-16T04:11:49Z

The code uses chardet.detect(buffer) as a fallback, but given the result I think it didn't guess well.

// lib/fallback.ts
    ogObject.charset = chardet.detect(Buffer.from(body)) || '';

jshemas · 2023-10-16T04:51:53Z

I've never seen the http-equiv meta tag before. I can add a fallback for this later in the week.
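
A fallback along these lines could look like the sketch below. It is only an illustration, not the library's actual code: `sniffMetaCharset` is a hypothetical helper name, and limiting the scan to the first 1024 bytes is an assumption modelled on how browsers prescan HTML for an encoding declaration. The single regex is deliberately loose, so it matches both `<meta charset="...">` and the `http-equiv` form quoted in the report.

```javascript
// Hypothetical helper (not part of open-graph-scraper): look for a charset
// declared in a <meta> tag near the start of the raw response bytes.
function sniffMetaCharset(bytes) {
  // latin1 maps every byte to exactly one character, so the ASCII markup
  // stays readable even though the real encoding is still unknown.
  const head = Buffer.from(bytes).subarray(0, 1024).toString('latin1');
  const match = /<meta[^>]+charset\s*=\s*["']?([\w-]+)/i.exec(head);
  return match ? match[1].toLowerCase() : null;
}

const sample = Buffer.from(
  '<html><head><meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS"></head>',
  'latin1'
);
console.log(sniffMetaCharset(sample)); // shift_jis
```

A real implementation would probably also consult the HTTP Content-Type header and only fall back to statistical detection when neither source declares a charset.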

jokester · 2023-10-16T11:36:40Z

Another detail I found during debugging: there seem to be corner cases where we cannot just read the text with body = await res.text() and feed it to chardet.detect(Buffer.from(body)).

At least for this specific webpage, Buffer.from(await res.arrayBuffer()) and Buffer.from(await res.text()) gave different bytes. Maybe res.text() did not use the correct encoding, or was not meant for this purpose.

This is a gist to show the difference of bytes and chardet.analyze(): https://gist.github.com/jokester/937c43eb8918e141ef43dc320f38b8d8

jokester · 2023-10-16T11:41:55Z

In my use case I managed to detect encoding, convert the bytes, and use openGraphScraper({html}) to get what I needed.

Given how tricky encoding problems are, I guess a perfect fix is hard. The API was flexible enough to allow my workaround 👍🏽.
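
The workaround described above could be sketched roughly as follows. It is an illustration under assumptions, not the code actually used: `decodeHtml` and `fetchHtmlAs` are made-up names, the charset is supplied by the caller rather than detected, and the sketch assumes Node 18+ (global `fetch`) with a `TextDecoder` that supports Shift_JIS, which requires a build with full ICU.

```javascript
// Hypothetical helper: decode raw bytes with a given charset label,
// falling back to UTF-8 when the runtime does not know the label.
function decodeHtml(bytes, charset) {
  try {
    return new TextDecoder(charset).decode(bytes);
  } catch (err) {
    // TextDecoder throws a RangeError for unsupported encoding labels.
    if (err instanceof RangeError) {
      return new TextDecoder('utf-8').decode(bytes);
    }
    throw err;
  }
}

// Hypothetical helper: fetch the page as raw bytes (never via res.text(),
// which always decodes as UTF-8) and decode it with the given charset.
async function fetchHtmlAs(url, charset) {
  const res = await fetch(url);
  const bytes = new Uint8Array(await res.arrayBuffer());
  return decodeHtml(bytes, charset);
}
```

The string returned by `fetchHtmlAs(url, 'shift_jis')` can then be passed to the library through the `html` option instead of `url`.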

jshemas · 2023-10-22T21:00:02Z

I've updated the charset fallback in open-graph-scraper@6.3.2. I'm also getting weird/different results between Buffer.from(await res.arrayBuffer()) and Buffer.from(await res.text()) for this page, but other ShiftJIS pages seem to work just fine. Are you seeing this issue with other sites?

jokester · 2023-10-23T15:07:23Z

Sorry I don't have other similar cases at hand. Thanks for the fix, it should make this library more complete 👍🏽

jokester · 2023-10-30T17:22:50Z

I had another look at the "corrupted" ShiftJIS text in the gist. In the suspicious res.text() bytes, many Japanese characters are replaced by U+FFFD "Replacement Character".
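
This matches what happens when Shift_JIS bytes are decoded as UTF-8, which is what res.text() does regardless of the page's real encoding. The snippet below is a small reproduction under assumptions, not taken from the gist: the byte values are the Shift_JIS encoding of "ホームページ" (the tail of the page title), and decoding them correctly assumes a Node build whose `TextDecoder` supports Shift_JIS (full ICU).

```javascript
// Shift_JIS bytes for "ホームページ".
const sjisBytes = Buffer.from([
  0x83, 0x7a, 0x81, 0x5b, 0x83, 0x80,
  0x83, 0x79, 0x81, 0x5b, 0x83, 0x57,
]);

// Decoding as UTF-8 turns every invalid byte into U+FFFD and keeps only
// the bytes that happen to be valid ASCII ("z", "[", "y", "W").
const asUtf8 = new TextDecoder('utf-8').decode(sjisBytes);

// Re-encoding that string cannot restore the original bytes, which is why
// Buffer.from(text) and the raw array buffer differ.
const roundTripped = Buffer.from(asUtf8, 'utf8');

// Decoding with the right label recovers the text.
const asSjis = new TextDecoder('shift_jis').decode(sjisBytes);

console.log(asUtf8);                         // �z�[���y�[�W
console.log(roundTripped.equals(sjisBytes)); // false
console.log(asSjis);                         // ホームページ
```

The garbled string is the same as the tail of the ogTitle in the "Actual behavior" output, so charset detection has to run on the raw bytes, before any text decoding.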

cm-dyoshikawa · 2024-02-02T07:16:41Z

@jshemas @jokester

#206

I'm working on this issue.

I hope that users of openGraphScraper won't have to worry about character sets. Therefore, I'm proposing a feature that checks the character set of a fetched page and decodes its content accordingly, so the result is always proper UTF-8 text.

jshemas · 2024-02-15T03:01:30Z

@jokester @cm-dyoshikawa fix is live in open-graph-scraper@6.4.0 !

cm-dyoshikawa · 2024-02-15T05:07:50Z

With this change, users of openGraphScraper should no longer need to be aware of character encodings. This will be very useful since I am in a Japanese-speaking country and still have Shift_JIS sites. Thank you.

cm-dyoshikawa mentioned this issue Feb 2, 2024

Add character encoding detection and decoding logic #206

Merged

2 tasks

jshemas closed this as completed in #206 Feb 15, 2024
