
[WIP] Optimize utf8 string length and byte offset functions #4529

Closed
wants to merge 2 commits into from

Conversation

lopopolo
Contributor

Replace existing utf8 strlen and byte offset functions with the UTF-8
length algorithm developed by George Pollard and discussed at

https://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html

The algorithm assumes that the String is valid UTF-8, which I believe is
enforced by utf8_strlen. The algorithm uses an ASCII fastpath and skips
characters based on their UTF-8 continuation bytes. This approach can be
more than twice as fast as the existing implementation.

The core algorithm is adapted for the following functions:

  • utf8len
  • mrb_utf8_len
  • chars2bytes
  • bytes2chars

This patch allows removing the utf8len_codepage static array.

Benchmarks (default build with MRB_UTF8_STRING defined)

("aaa\n\nbbbbb\n\n\n\n\ncccc" * 100000000).reverse!

master: 20.99s
patched: 7.36s

("aaa\n\nbbbbb\n\n\n\n\ncccc" * 100000000)[1000000 - 1]

master: 18.89s
patched: 8.14s

("aaa\n\nbbbbb\n\n\n\n\ncccc" * 100000).each_line("\n") { }

master: [does not complete in 4 minutes wall clock time]
patched: 0.02s

NOTE: I did not write the implementation of the algorithm, I only
adapted it to fit within the mruby APIs.

Fixes GH-4522.

@lopopolo
Contributor Author

cc @dearblue

@lopopolo
Contributor Author

lopopolo commented Jun 23, 2019

I've generated flamegraphs with dtrace comparing master and this PR on the following benchmark, over a 60-second sample:

./bin/mruby -e '100000.times { ("aaa\n\nbbbbb\n\n\n\n\ncccc" * 100000).each_line("\n") { } }'

master:

[flamegraph screenshot]

PR 4529:

[flamegraph screenshot]

@shuujii
Contributor

shuujii commented Jun 24, 2019

This is interesting. However, it doesn't seem to work correctly in some cases:

Before the patch, with MRB_UTF8_STRING (same as Ruby):

$ bin/mruby -e '["\xe3\x81", "a\x00z"].each{|s| p s.size}'
2
3

After the patch, with MRB_UTF8_STRING:

$ bin/mruby -e '["\xe3\x81", "a\x00z"].each{|s| p s.size}'
1
1

@dearblue
Contributor

@lopopolo

As a premise, my understanding is that mruby's policy is to handle binary data and strings without distinction.
Therefore, I think that even strings containing bytes that are invalid as UTF-8 need to be handled without problems.
I would also like to hear other people's opinions on this.


I ran some adversarial tests against this patch.

  • The character count of broken UTF-8 strings differs (similar to what @shuujii showed):

    % ./build/host/bin/mruby -e 'p ("\xf1" * 5).size'
    2
    % ruby26 -e 'str = "\xf1" * 5; p str.encoding, str.size'
    #<Encoding:UTF-8>
    5
  • Reversing a string that is broken as UTF-8 crashes:

    % ./build/host/bin/mruby -e 'p ("\xf1" * 5).reverse'
    zsh: segmentation fault (core dumped)  ./mruby@test-utf8 -e 'p ("\xf1" * 5).reverse'
  • If trailing continuation bytes are missing, it makes an out-of-range read:

    % ./build/host/bin/mruby -e 'p "\xf1"[0]'
    "\xf1\x00\x00\x00"

@dearblue
Contributor

mrbtest does not stop...

% CFLAGS='-O0 -Wall -DMRB_UTF8_STRING' ./minirake -j10 clean `pwd`/build/test/bin/mrbtest
...snip...
% ./build/test/bin/mrbtest
mrbtest - Embeddable Ruby Test

.F..............................................................................F..F...X?..................F..F....F.......F.F.FFFF....F...^C

@lopopolo
Contributor Author

Oh no, it appears I was running the tests without MRB_UTF8_STRING defined. Thanks for the tip @dearblue. This patch is not ready for review. I'll mark it as WIP.

@lopopolo lopopolo changed the title Optimize utf8 string length and byte offset functions [WIP] Optimize utf8 string length and byte offset functions Jun 24, 2019
@lopopolo
Contributor Author

I've pushed my latest experiment. Not being able to store the encoding on the RString means we must process each character in the char * to determine the length. The existing approach with the lookup table is faster than what I've been able to come up with.

It looks like CRuby is able to fast-path UTF-8 strlen by storing whether the string has a "broken" encoding or not.

I'm going to abandon this PR. Thanks for taking a look.

@lopopolo lopopolo closed this Jun 24, 2019
@lopopolo lopopolo deleted the utf8-strlen-optimize branch December 17, 2021 04:36
Successfully merging this pull request may close these issues.

String#each_line is slow with MRB_UTF8_STRING