Fail to detect charset from certain ShiftJIS page #199

Closed
jokester opened this issue Oct 16, 2023 · 10 comments · Fixed by #206

Comments

@jokester

jokester commented Oct 16, 2023

Describe the bug

open-graph-scraper v6.3.0 failed to detect the charset of a web page I came across.

The page had <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS"> and possibly no other hint about its encoding.

To Reproduce

const openGraphScrapter = require('open-graph-scraper')

openGraphScrapter({url: 'http://abehiroshi.la.coocan.jp/'}).then(result => console.debug(result) )

Expected behavior

      result: {
        ogTitle: '阿部 寛のホームページ',  // Not very confident on this. Would openGraphScrapter convert it if correct encoding was extracted?
        charset: 'ShiftJIS',
        requestUrl: 'http://abehiroshi.la.coocan.jp/',
        success: true
      },

Actual behavior

      result: {
        ogTitle: '�������̃z�[���y�[�W',
        charset: 'UTF-8',
        requestUrl: 'http://abehiroshi.la.coocan.jp/',
        success: true
      },

Additional context

  • OS: Mac if that matters
  • Node Version: 18.8
  • openGraphScraper Version: 6.3.0
@jokester
Author

The code uses chardet.detect(buffer) as a fallback, but given the result I think its guess was wrong.

// lib/fallback.ts
    ogObject.charset = chardet.detect(Buffer.from(body)) || '';

@jshemas
Owner

jshemas commented Oct 16, 2023

I've never seen the http-equiv meta tag before. I can add a fallback for this later in the week.
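
A minimal sketch of what such a fallback might look like, assuming the raw response bytes are available. The helper name sniffMetaCharset and the 1024-byte cutoff are illustrative assumptions, not the library's actual implementation, and the approach only works for ASCII-compatible encodings such as Shift_JIS or UTF-8.

```javascript
// Hypothetical helper (not part of open-graph-scraper): look for a charset
// declaration near the start of an HTML document.
function sniffMetaCharset(bytes) {
  // latin1 maps every byte to exactly one code unit, so the ASCII markup
  // stays readable whatever the document's real encoding is.
  const head = Buffer.from(bytes).subarray(0, 1024).toString('latin1');
  // Matches both <meta charset="..."> and
  // <meta http-equiv="Content-Type" content="text/html; charset=...">.
  const match = /<meta[^>]+charset\s*=\s*["']?([\w-]+)/i.exec(head);
  return match ? match[1].toLowerCase() : null;
}

const sample = Buffer.from(
  '<html><head><meta http-equiv="Content-Type" ' +
    'content="text/html; charset=Shift_JIS"></head></html>',
  'latin1'
);
console.log(sniffMetaCharset(sample));
```

A regular expression is a shortcut here; a real implementation would more likely use an HTML parser and also check the Content-Type response header first.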

@jokester
Author

Another detail I found while debugging: I suspect there are corner cases where we cannot simply read the text with body = await res.text() and feed it to chardet.detect(Buffer.from(body)).

At least for this specific web page, Buffer.from(await res.arrayBuffer()) and Buffer.from(await res.text()) gave different bytes. Maybe res.text() did not use the correct encoding, or is not meant for this purpose.

This is a gist to show the difference of bytes and chardet.analyze(): https://gist.github.com/jokester/937c43eb8918e141ef43dc320f38b8d8
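
One likely explanation, offered as a sketch rather than a diagnosis of this particular page: in the Fetch API, res.text() always decodes the body as UTF-8, whatever the page declares. The example below imitates that step with TextDecoder on four hand-picked Shift_JIS bytes (the sample bytes are an assumption chosen for illustration, not taken from the gist).

```javascript
// Shift_JIS bytes for the two hiragana "あい" (0x82A0 and 0x82A2).
const raw = Buffer.from([0x82, 0xa0, 0x82, 0xa2]);

// Decode them as UTF-8, as res.text() would. None of these bytes starts a
// valid UTF-8 sequence, so each one is replaced by U+FFFD.
const asText = new TextDecoder('utf-8').decode(raw);

// Turning the string back into bytes yields the UTF-8 form of U+FFFD
// (EF BF BD) for every lost byte, not the original bytes.
const roundTripped = Buffer.from(asText);

console.log(raw.toString('hex'));
console.log(roundTripped.toString('hex'));
console.log(raw.equals(roundTripped));
```

Because the replacement is lossy, a detector that only sees the round-tripped bytes has little left to work with; it needs the bytes from res.arrayBuffer().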

@jokester
Author

In my use case I managed to detect encoding, convert the bytes, and use openGraphScraper({html}) to get what I needed.

Given how tricky encoding problems are, I guess a perfect fix is hard. The API was flexible enough to allow my workaround 👍🏽.
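
The conversion step of a workaround like this could be sketched as follows. The helper name decodeHtml and the sample bytes are assumptions made for illustration, and the sketch relies on a Node build with full ICU so that TextDecoder knows the shift_jis label.

```javascript
// Hypothetical helper: turn raw response bytes into a JavaScript string,
// given a charset label found elsewhere (header, meta tag, or a detector).
function decodeHtml(bytes, charset) {
  let decoder;
  try {
    decoder = new TextDecoder(charset);
  } catch (err) {
    // TextDecoder throws a RangeError for labels it does not recognise;
    // fall back to UTF-8 in that case.
    decoder = new TextDecoder('utf-8');
  }
  return decoder.decode(bytes);
}

// "<title>あい</title>" with the two hiragana encoded as Shift_JIS.
const sjisPage = Buffer.concat([
  Buffer.from('<title>', 'latin1'),
  Buffer.from([0x82, 0xa0, 0x82, 0xa2]),
  Buffer.from('</title>', 'latin1'),
]);

const html = decodeHtml(sjisPage, 'Shift_JIS');
console.log(html);
```

The resulting string is what would then be handed to openGraphScraper({html}), so the library never has to guess the encoding.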

@jshemas
Owner

jshemas commented Oct 22, 2023

I've updated the charset fallback in open-graph-scraper@6.3.2. I'm also getting weird/different results between Buffer.from(await res.arrayBuffer()) and Buffer.from(await res.text()) for this page, but other ShiftJIS pages seem to work just fine. Are you seeing this issue with other sites?

@jokester
Author

Sorry, I don't have other similar cases at hand. Thanks for the fix, it should make this library more complete 👍🏽

@jokester
Author

I had another look at the "corrupted" ShiftJIS text in the gist. In the suspicious res.text() bytes, many of the Japanese characters have been replaced by U+FFFD "Replacement Character".

@cm-dyoshikawa
Contributor

@jshemas @jokester

#206

I'm working on this issue.

I hope that users of openGraphScraper won't have to worry about character sets. Therefore, I will suggest implementing a feature to check the character set and decode it to UTF-8 when fetching a website.

@jshemas
Owner

jshemas commented Feb 15, 2024

@jokester @cm-dyoshikawa the fix is live in open-graph-scraper@6.4.0!

@cm-dyoshikawa
Contributor

With this change, users of openGraphScraper should no longer need to be aware of character encodings. This will be very useful since I am in a Japanese-speaking country and still have Shift_JIS sites. Thank you.


3 participants