Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make number_of_words respect CJK characters #7813

Merged
merged 15 commits into from May 22, 2020
Merged

Make number_of_words respect CJK characters #7813

merged 15 commits into from May 22, 2020

Conversation

iBug
Copy link
Contributor

@iBug iBug commented Sep 5, 2019

This is a 馃檵 feature or enhancement.

Summary

CJK languages generally consider every single character a word.

This PR modifies Liquid filter "number_of_words" so it counts CJK characters correctly.

Context

My personal blog uses Minimal Mistakes theme, which relies on this filter to generate a "reading time". However, as I write both English and Chinese, the latter of which doesn't use spaces to separate words, my Chinese articles get a "1 minute read" as a result regardless of how much I write. This is wrong.

I am currently using a hand-crafted plugin to resolve this issue, but I hope that it can be solved on the Jekyll side. The changes is mostly a port of my own plugin.

CJK languages generally consider every single character a word.
This commit changed Liquid filter "number_of_words"
so it counts CJK characters correctly.
@mattr-
Copy link
Member

@mattr- mattr- commented Sep 5, 2019

@iBug Thanks for the contribution!

It looks like there's one more change to make for rubocop, and then this should be good to merge

mattr-
mattr- previously requested changes Sep 5, 2019
lib/jekyll/filters.rb Outdated Show resolved Hide resolved
@iBug
Copy link
Contributor Author

@iBug iBug commented Sep 5, 2019

@mattr- Thank you for your guidance. I've reviewed RuboCop output and applied your suggested changes. I'm rather new to Ruby actually.

By the way, I've updated the documentation text as well. Hope it helps.

@ashmaroli
Copy link
Member

@ashmaroli ashmaroli commented Sep 5, 2019

@iBug Thank you for submitting this enhancement.
But I feel it'd be better if there were tests as well. The necessity being to add coverage for the regular expression used. If any future complaints arise regarding the filter with CJK chars, one can always refer to those tests to understand the expected results.

@mattr- mattr- dismissed their stale review Sep 5, 2019

Requested changes were made

weirane added a commit to weirane/blog that referenced this issue Jan 1, 2020
@iBug
Copy link
Contributor Author

@iBug iBug commented Mar 7, 2020

Sorry for the hibernation of this PR. I spent a lot of time learning Ruby development and Jekyll. With the last commit 360ecda, bundle exec rake test gives no errors. Is that enough?

@ashmaroli
Copy link
Member

@ashmaroli ashmaroli commented Mar 7, 2020

@iBug Thank you for returning to this. The test looks sufficient to me.

@iBug iBug requested a review from mattr- Mar 8, 2020
@doersino
Copy link

@doersino doersino commented Apr 21, 2020

I'm not sure whether including Katakana, Hiragana and Hangul (which all encode syllables, not words) in the regex is an ideal solution, but I do agree that it's better than the status quo.

Also note that while Japanese (Katakana, Hiragana) doesn't make use of ASCII spaces in writing, Korean (Hangul) does.

@iBug
Copy link
Contributor Author

@iBug iBug commented Apr 21, 2020

@doersino I'm aware of that, but in general CJK languages don't count "words", but "characters" instead (think how Microsoft Word does it). For ASCII spaces, since CJK chars are normalized into spaces before western (Latin, Cyrillic etc.) "words" are split, they don't matter here.

@iBug
Copy link
Contributor Author

@iBug iBug commented May 12, 2020

@ashmaroli Any chance someone could take a look at this? I think this is whatsoever better than before and thus worth having.

@ashmaroli
Copy link
Member

@ashmaroli ashmaroli commented May 12, 2020

Yes, I believe this is better than current master.
Pinging @mattr- for input.

@ashmaroli ashmaroli removed the request for review from mattr- May 21, 2020
@ashmaroli
Copy link
Member

@ashmaroli ashmaroli commented May 21, 2020

Looking at this again...
Say, I were to have {{ page.content | number_of_words }} in my layout, with the page containing only latin characters.
Based on the current proposed implementation:

input.scan(cjk_regex).length + input.gsub(cjk_regex, " ").split.length

input.scan(cjk_regex).length would allocate an empty array.
input.gsub(cjk_regex, " ") would duplicate input or page.content in our sample case (String#gsub duplicates given string irrespective of substitution) and then
input.gsub(cjk_regex, " ").split will generate an array of all the words in page.

In other words, a situation that was already bloated with master version (page.content is usually a large string) is going to bloat further with no apparent gain for the user.

Since, the percentage of users that would have CJK chars in their site's contents would be much lower than the percentage of users without CJK chars in their site's contents, I'm withholding approval on the change.

@iBug, I recommend that you come up with a solution acceptable for both sides of the table.

@ashmaroli ashmaroli marked this pull request as draft May 21, 2020
@doersino
Copy link

@doersino doersino commented May 21, 2020

Suggestion:

  1. Check whether the input contains any CJK characters. This could be implemented relatively cheaply.
  2. If so, do what this pull request proposes.
  3. Otherwise, do what's currently implemented.

@iBug iBug marked this pull request as ready for review May 21, 2020
@ashmaroli
Copy link
Member

@ashmaroli ashmaroli commented May 21, 2020

Check whether the input contains any CJK characters

That is one solution.
Another would be to have an optional parameter (set to ignore CJK chars by default). That way, the input string need not be checked on each run (input string can be long; checking the parameter would be a constant effort; the number of times this filter is called can be large in large sites.)

Benchmarking (and optionally profiling) the two alternatives can help decide which would be the better choice.

@iBug
Copy link
Contributor Author

@iBug iBug commented May 21, 2020

@ashmaroli @doersino I've pushed a patch for this. Thank you for your precious analyses and suggestions!

A summary of the new commit:

  • According to @doersino 's suggestions, I first check for CJK characters, and go the old way if none is found
  • According to @ashmaroli 's analyses, String#gsub is heavily bloated for creating copies of (usually) large strings. Using String#split with a regex as the argument (presumably) solves this by avoiding the big copy.

- Split with CJK characters only when present
- Avoid bloating from String#gub by using String#split with regex directly
@iBug
Copy link
Contributor Author

@iBug iBug commented May 22, 2020

@ashmaroli With the last commit 21bf5a5 pushed, I've changed the code to the "Overhead Mode" (optional but disabled by default).

I've also updated the tests so both the new behavior and the old behavior are tested (essentially testing the optional parameter). On my local environment both rake test and rubocop -D passes (of course preceded by bundle exec). Please let me know if any extra work is required.

Again, I'm very grateful to you two for being super helpful and patient on guiding me through this seemingly simple PR. It's certainly been a wonderful experience participating in this great open-source project and there's no denying that I've learned a lot from this.

@ashmaroli
Copy link
Member

@ashmaroli ashmaroli commented May 22, 2020

You're welcome @iBug
If you still have the benchmark script on hand, try the following version:

def number_of_words(input, mode = nil)
  cjk_regex = %r![\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}]!
  word_regex = %r![^\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}\s]+!
  case mode
  when "cjk"
    input.scan(cjk_regex).length + input.scan(word_regex).length
  when "auto"
    cjk_count = input.scan(cjk_regex).length
    cjk_count.zero? ? input.split.length : cjk_count + input.scan(word_regex).length
  else
    input.split.length
  end
end

@iBug
Copy link
Contributor Author

@iBug iBug commented May 22, 2020

@ashmaroli Sure. I've put the new test code on top of the original one in order to reduce clutter (avoid making this thread too long). The new results deserve a separate post so here they are.

(Keep in mind that each function is run 10000.times)

                 user     system      total        real
Original 1   4.099273   0.000369   4.099642 (  4.099745)
Original 2   4.565014   0.000003   4.565017 (  4.565135)
Original 3   1.514981   0.000000   1.514981 (  1.515023)
Auto 1      11.171122   0.000000  11.171122 ( 11.171594)
Auto 2      17.005940   0.000001  17.005941 ( 17.006658)
Auto 3      16.100305   0.000000  16.100305 ( 16.100708)
User1- 1     4.443806   0.000000   4.443806 (  4.444016)
User1- 2     4.111661   0.000000   4.111661 (  4.111847)
User1- 3     1.607525   0.000000   1.607525 (  1.607591)
User1+ 1    20.131011   0.000000  20.131011 ( 20.131919)
User1+ 2    16.699124   0.000000  16.699124 ( 16.699834)
User1+ 3    16.022139   0.000000  16.022139 ( 16.022594)
User2- 1     4.704363   0.000000   4.704363 (  4.704513)
User2- 2     3.944270   0.000000   3.944270 (  3.944377)
User2- 3     2.134202   0.000000   2.134202 (  2.134379)
User2* 1    11.197304   0.000000  11.197304 ( 11.197632)
User2* 2    16.641564   0.000000  16.641564 ( 16.642144)
User2* 3    16.273725   0.000000  16.273725 ( 16.274143)
User2+ 1    19.845742   0.000000  19.845742 ( 19.846762)
User2+ 2    16.796103   0.000000  16.796103 ( 16.796858)
User2+ 3    15.658962   0.003737  15.662699 ( 15.663274)

Though, I still prefer the "auto mode" and would like to have it as the default behavior. Benchmark results show an increase of less than 1 ms increase per 20 KB source processed. This level of performance penalty falls far behind the deviation from computers' intrinsics. To be exact, this is not going to be observable in real runs of Jekyll.

@crispgm
Copy link
Member

@crispgm crispgm commented May 22, 2020

@iBug i guess you should add some tests on CJK languages other than Chinese. :)

@ashmaroli
Copy link
Member

@ashmaroli ashmaroli commented May 22, 2020

@iBug The takeaway from my suggestion above is:

  • The default behavior will be the existing behavior.
  • I'm willing to ship the auto mode option as well.
  • Using %r!...! is better than Regexp.new('...') (the latter generates a regex on each call).

@ashmaroli
Copy link
Member

@ashmaroli ashmaroli commented May 22, 2020

@crispgm Will you be able to provide sample texts in Japanese or Korean?

@iBug
Copy link
Contributor Author

@iBug iBug commented May 22, 2020

Is it possible to make the default option a site-wide setting? Let's say, something like site.number_of_words_mode. So it works this way:

  • If the key is not present in _config.yml, Ruby defaults it to nil, which matches the old behavior (or the planned update with mode = nil. This would hold true for nearly all users (no one sets a random key in their configs).
  • If the key exists, then it's used as the default value for mode parameter.

@ashmaroli
Copy link
Member

@ashmaroli ashmaroli commented May 22, 2020

Is it possible to make the default option a site-wide setting?

That would delay this PR a lot.
Better alternative would be to use include files.

// sample include file

{{ include.text | number_of_words: 'auto' }}
// template or page

{% assign foobar = 'some content' %}
{% include number_of_words.html text = foobar %}

@crispgm
Copy link
Member

@crispgm crispgm commented May 22, 2020

@crispgm Will you be able to provide sample texts in Japanese or Korean?

@ashmaroli okay, here they are

銇撱倱銇仭銇佷笘鐣岋紒 // number_of_words = 9
鞎堧厱頃橃劯鞖 靹胳儊! // number_of_words = 8

and powered by Google Translate.

@ashmaroli
Copy link
Member

@ashmaroli ashmaroli commented May 22, 2020

@crispgm Ah! good ol' Google Translate. :) But you forgot to mention how many words they are.
We can't know if the output is correct until there's something to check against.. 馃槈

@crispgm
Copy link
Member

@crispgm crispgm commented May 22, 2020

@crispgm Ah! good ol' Google Translate. :) But you forgot to mention how many words they are.
We can't know if the output is correct until there's something to check against.. 馃槈

This is true. I updated on last posts. :)

@ashmaroli
Copy link
Member

@ashmaroli ashmaroli commented May 22, 2020

@iBug Please update the documentation as well so that users can differentiate when to use the two modes. Once again, thank you for your patience and effort.

@iBug
Copy link
Contributor Author

@iBug iBug commented May 22, 2020

@ashmaroli I was already planning to do so, but I failed to identify which pieces of files I should look into.

@ashmaroli
Copy link
Member

@ashmaroli ashmaroli commented May 22, 2020

which pieces of files

Adding to what you have in docs/_data/jekyll_filters.yml should be sufficient.

@iBug
Copy link
Contributor Author

@iBug iBug commented May 22, 2020

@ashmaroli With the latest commit, I think it's almost there.

@ashmaroli
Copy link
Member

@ashmaroli ashmaroli commented May 22, 2020

Thank you @iBug and everyone that participated here.
@jekyllbot: merge +minor

@jekyllbot jekyllbot merged commit 13b7291 into jekyll:master May 22, 2020
7 checks passed
iBug added a commit to iBug/iBug-source that referenced this issue May 22, 2020
iBug pushed a commit to iBug-web/iBug-web.github.io that referenced this issue May 22, 2020
[b3e0b34] Use new number_of_words plugin

Thanks to @ashmaroli's help:
jekyll/jekyll#7813 (comment)
iBug added a commit to iBug/iBug-source that referenced this issue Dec 18, 2020
@jekyll jekyll locked and limited conversation to collaborators May 22, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants