Make number_of_words respect CJK characters #7813
CJK languages generally consider every single character a word. This commit changed Liquid filter "number_of_words" so it counts CJK characters correctly.
@iBug Thanks for the contribution! It looks like there's one more change to make for rubocop, and then this should be good to merge |
Co-Authored-By: Matt Rogers <mattr-@github.com>
@mattr- Thank you for your guidance. I've reviewed RuboCop output and applied your suggested changes. I'm rather new to Ruby actually. By the way, I've updated the documentation text as well. Hope it helps. |
@iBug Thank you for submitting this enhancement. |
Code taken from iBug. jekyll/jekyll#7813
Sorry for the hibernation of this PR. I spent a lot of time learning Ruby development and Jekyll. With the last commit 360ecda, |
@iBug Thank you for returning to this. The test looks sufficient to me. |
I'm not sure whether including Katakana, Hiragana and Hangul (which all encode syllables, not words) in the regex is an ideal solution, but I do agree that it's better than the status quo. Also note that while Japanese (Katakana, Hiragana) doesn't make use of ASCII spaces in writing, Korean (Hangul) does. |
@doersino I'm aware of that, but in general CJK languages don't count "words", but "characters" instead (think how Microsoft Word does it). For ASCII spaces, since CJK chars are normalized into spaces before western (Latin, Cyrillic etc.) "words" are split, they don't matter here. |
@ashmaroli Any chance someone could take a look at this? I think this is in any case better than before and thus worth having.
Yes, I believe this is better than current |
Looking at this again...
In other words, a situation that was already bloated with Since, the percentage of users that would have CJK chars in their site's contents would be much lower than the percentage of users without CJK chars in their site's contents, I'm withholding approval on the change. @iBug, I recommend that you come up with a solution acceptable for both sides of the table. |
Suggestion:
|
That is one solution. Benchmarking (and optionally profiling) the two alternatives can help decide which would be the better choice. |
@ashmaroli @doersino I've pushed a patch for this. Thank you for your precious analyses and suggestions! A summary of the new commit:
|
- Split with CJK characters only when present
- Avoid bloating from String#gsub by using String#split with a regex directly
@ashmaroli With the last commit 21bf5a5 pushed, I've changed the code to the "Overhead Mode" (optional but disabled by default). I've also updated the tests so both the new behavior and the old behavior are tested (essentially testing the optional parameter). On my local environment both
Again, I'm very grateful to you two for being super helpful and patient in guiding me through this seemingly simple PR. It has certainly been a wonderful experience participating in this great open-source project, and there's no denying that I've learned a lot from it.
You're welcome @iBug

def number_of_words(input, mode = nil)
  cjk_regex = %r![\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}]!
  word_regex = %r![^\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}\s]+!
  case mode
  when "cjk"
    input.scan(cjk_regex).length + input.scan(word_regex).length
  when "auto"
    cjk_count = input.scan(cjk_regex).length
    cjk_count.zero? ? input.split.length : cjk_count + input.scan(word_regex).length
  else
    input.split.length
  end
end
@ashmaroli Sure. I've put the new test code on top of the original one in order to reduce clutter (avoid making this thread too long). The new results deserve a separate post so here they are. (Keep in mind that each function is run
Still, I prefer the "auto" mode and would like to have it as the default behavior. The benchmark results show an increase of less than 1 ms per 20 KB of source processed. A penalty this small is well below the normal run-to-run variation of a machine, so it is not going to be observable in real runs of Jekyll.
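For anyone who wants to check the overhead on their own machine, a comparison along these lines could be used. This is only a sketch and not the benchmark script from this thread: the sample text, the iteration count, and the helper names `count_default` and `count_auto` are assumptions, and the numbers it prints will vary by machine.

```ruby
require "benchmark"

CJK_REGEX  = /[\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}]/
WORD_REGEX = /[^\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}\s]+/

# Existing behavior: plain whitespace split.
def count_default(input)
  input.split.length
end

# "auto" behavior: scan for CJK characters first, fall back to split if none.
def count_auto(input)
  cjk_count = input.scan(CJK_REGEX).length
  cjk_count.zero? ? input.split.length : cjk_count + input.scan(WORD_REGEX).length
end

# Assumed sample: 27 characters repeated 800 times, roughly 21 KB of Latin text.
SAMPLE = ("lorem ipsum dolor sit amet " * 800).freeze
ITERATIONS = 200 # assumed; raise it for more stable numbers

default_time = Benchmark.realtime { ITERATIONS.times { count_default(SAMPLE) } }
auto_time    = Benchmark.realtime { ITERATIONS.times { count_auto(SAMPLE) } }

puts format("default: %.4f ms per call", default_time / ITERATIONS * 1000)
puts format("auto:    %.4f ms per call", auto_time / ITERATIONS * 1000)
```

The Latin-only sample is the interesting case here, since it measures what the extra `scan` costs sites that contain no CJK text at all.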
@iBug I guess you should add some tests on CJK languages other than Chinese. :)
@iBug The takeaway from my suggestion above is:
|
@crispgm Will you be able to provide sample texts in Japanese or Korean? |
Is it possible to make the default option a site-wide setting? Let's say, something like
|
That would delay this PR a lot.
|
@ashmaroli okay, here they are
and powered by Google Translate. |
@crispgm Ah! good ol' Google Translate. :) But you forgot to mention how many words they are. |
That's true. I've updated my previous posts. :)
@iBug Please update the documentation as well so that users can differentiate when to use the two modes. Once again, thank you for your patience and effort. |
@ashmaroli I was already planning to do so, but I failed to identify which pieces of files I should look into. |
Adding to what you have in |
@ashmaroli With the latest commit, I think it's almost there. |
Thank you @iBug and everyone that participated here. |
Thanks to @ashmaroli's help: jekyll/jekyll#7813 (comment)
[b3e0b34] Use new number_of_words plugin Thanks to @ashmaroli's help: jekyll/jekyll#7813 (comment)
This is a 🙋 feature or enhancement.
Summary
CJK languages generally consider every single character a word.
This PR modifies Liquid filter "number_of_words" so it counts CJK characters correctly.
Context
My personal blog uses the Minimal Mistakes theme, which relies on this filter to generate a "reading time". However, I write in both English and Chinese, and Chinese doesn't use spaces to separate words, so my Chinese articles get a "1 minute read" regardless of how much I write. This is wrong.
I am currently using a hand-crafted plugin to resolve this issue, but I hope that it can be solved on the Jekyll side. The change is mostly a port of my own plugin.
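The "1 minute read" effect can be reproduced with a short sketch. The formula below is an assumption made for illustration (word count divided by a reading speed, rounded up, with a minimum of one minute) and is not copied from the theme; `WORDS_PER_MINUTE`, `reading_time_minutes`, and the sample text are hypothetical.

```ruby
# Assumed reading speed; themes typically make this configurable.
WORDS_PER_MINUTE = 200

# Hypothetical reading-time formula: round up, never below one minute.
def reading_time_minutes(word_count)
  [(word_count.to_f / WORDS_PER_MINUTE).ceil, 1].max
end

# 6 Han characters repeated 500 times: 3000 characters with no whitespace.
article = "今天天气很好" * 500

naive_count = article.split.length           # whitespace split sees one "word"
char_count  = article.scan(/\p{Han}/).length # one word per Han character

puts reading_time_minutes(naive_count) # 1
puts reading_time_minutes(char_count)  # 15
```

With a whitespace split the whole article counts as a single word, so the estimate stays at one minute no matter how long the text is.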