.WordCount not accurate for Japanese pages #1266
Hello - when I look at .WordCount results against an English page and a Japanese page, the count is correct for English of course, but incorrect for Japanese, returning a very small number.
I assume it is counting words using spaces as the delimiter, but Japanese and other CJK languages are inherently space-less. This also causes trouble for search engines, as an aside.
I am wondering if .WordCount could detect the language and count characters for CJK languages, instead of returning an incorrect number.
Hi @bep, it's not straightforward since there are so many combinations, and no spaces. One would need a really large dictionary file of all possible combinations of kanji characters, and then run that against text to try to guess the number of words. You could not even guess what a "word" was, since sometimes what looks like a four-character combination is actually two two-character combinations.
I think it is just better to stick to counting the double-byte characters and giving a count of those.
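The character-counting idea above can be sketched in Go. This is only a heuristic sketch, not Hugo's actual implementation, and the function name is mine: split on whitespace as usual, but count any field containing multi-byte runes by its rune count instead of as one word.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// wordCount splits on whitespace, then counts each multi-byte
// (e.g. CJK) field by its rune count rather than as one word.
// It is a heuristic: it over-counts multi-byte Latin text and
// cannot know true Japanese word boundaries.
func wordCount(s string) int {
	count := 0
	for _, field := range strings.Fields(s) {
		runes := utf8.RuneCountInString(field)
		if runes == len(field) {
			count++ // pure single-byte ASCII: one word
		} else {
			count += runes // multi-byte: count characters
		}
	}
	return count
}

func main() {
	fmt.Println(wordCount("Hello world"))   // 2
	fmt.Println(wordCount("これはペンです")) // 7: one per character
	fmt.Println(wordCount("Hello 世界"))    // 3: 1 word + 2 CJK runes
}
```

Mixed English/Japanese text (the case noted in the edit below) falls out of this naturally, since each whitespace-separated field is classified independently.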
Edit: noting also that it's possible to intersperse single-byte English among Japanese characters.
An interesting aside: today I was working on SEO or social partials, and discovered that the .languageCode var is used for the RSS feed, which in turn uses en-US for English, but ja for Japanese (not ja-JP).
And that's not conducive to using for "locale" because:
What I ended up doing was to settle on the hyphen version in
@bep, no, I don't think we need whitespace counted.
Japanese can use a normal ASCII space, and there is also a double-byte space (the ideographic space, U+3000). Sometimes we use them in names:
Those have a single-byte space and a double-byte space, respectively, between the last and first names.
But this usage is pretty rare.