
.WordCount not accurate for Japanese pages #1266

Closed
RickCogley opened this Issue Jul 11, 2015 · 9 comments

Contributor

RickCogley commented Jul 11, 2015

Hello - when I look at .WordCount results for an English page and a Japanese page, the count is correct for English, of course, but incorrect for Japanese, returning a very small number.

I assume it is counting words using spaces as the delimiter, but Japanese and other CJK languages are inherently space-less. This also causes trouble for search engines, as an aside.

I am wondering if .WordCount could detect the language, and count characters for CJK language, instead of returning an incorrect number.

Best regards,
Rick

@bep bep added the Bug label Jul 11, 2015

Member

bep commented Jul 11, 2015

It can hardly detect the language ...? but should be able to use the languageCode. What is the correct way to count words in Japanese?


Contributor

RickCogley commented Jul 11, 2015

Hi @bep, it's not straightforward since there are so many combinations, and no spaces. One would need a really large dictionary file of all possible combinations of kanji characters, and then run that against text to try to guess the number of words. You could not even guess what a "word" was, since sometimes what looks like a four-character combination is actually two two-character combinations.

I think it is just better to stick to counting the double-byte characters and giving a count of those.

Edit: noting also that it's possible to intersperse single-byte English and Japanese, as well, among Japanese characters.


Contributor

RickCogley commented Jul 11, 2015

An interesting aside: today I was working on SEO and social partials, and discovered that the .languageCode var is used for the RSS feed, which in turn uses en-US for English but ja for Japanese (not ja-JP).

And that's not usable directly as a "locale" because:

  • Facebook Open Graph (og) uses an underscore, like en_US, ja_JP
  • Schema.org uses a hyphen, like en-US, ja-JP
  • (Twitter cards use no locale)

What I ended up doing was to settle on the hyphen version in locale in site and page params, then use the replace function like {{ replace . "-" "_" }} to change to the underscore version, for Facebook og, as needed.


Member

bep commented Jul 11, 2015

@RickCogley one part of me just loves having a language guy like you on the team coming up with problems like these, the other part ...


Member

bep commented Jul 11, 2015

Reading what you say, I guess what we need here is to skip the discussion about what a word is -- and export a new method on page: RuneCount.

https://golang.org/pkg/unicode/utf8/#RuneCount
http://blog.golang.org/strings


Contributor

RickCogley commented Jul 12, 2015

@bep, hehe, "sorry". :-)
RuneCount, yes! That would work, because we can check for locale and then show either WordCount or RuneCount as appropriate.


Member

bep commented Jul 12, 2015

@RickCogley just to check: In my head it wouldn't make sense to include whitespace in that count, right?

@bep bep closed this in 77c60a3 Jul 12, 2015

bep added a commit that referenced this issue Jul 12, 2015

Optimize RuneCount
Do not create it unless used.

See #1266

Contributor

RickCogley commented Jul 12, 2015

@bep, no, I don't think we need whitespace counted.

Japanese can use a normal ASCII space, and there is also a double-byte (full-width) space. Sometimes we use them in names:

田中 太郎
田中　太郎

The first has a single-byte space and the second a double-byte space between the family and given names.

But this usage is pretty rare.
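A sketch of what a whitespace-excluding rune count could look like in Go (`runeCountNoSpace` is my illustrative helper, not Hugo's implementation); `unicode.IsSpace` covers the ASCII space, tabs, newlines, and the full-width ideographic space (U+3000) used in the name example above:

```go
package main

import (
	"fmt"
	"unicode"
)

// runeCountNoSpace counts runes, skipping anything Unicode classifies as
// whitespace (ASCII space, tab, newline, and the ideographic space U+3000).
func runeCountNoSpace(s string) int {
	n := 0
	for _, r := range s {
		if !unicode.IsSpace(r) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(runeCountNoSpace("田中 太郎"))  // single-byte space: 4
	fmt.Println(runeCountNoSpace("田中　太郎")) // double-byte space: 4
}
```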


Member

bep commented Jul 12, 2015

Whitespace also includes newlines, tabs, etc., so I think it would give a skewed count for small texts with lots of paragraphs. I will keep it as implemented.

tychoish added a commit to tychoish/hugo that referenced this issue Aug 13, 2017


Optimize RuneCount
Do not create it unless used.

See #1266