
[WIP] Optimize utf8 string length and byte offset functions #4529

Closed
wants to merge 2 commits into from

Conversation

lopopolo
Contributor

Replace existing utf8 strlen and byte offset functions with the UTF-8
length algorithm developed by George Pollard and discussed at

https://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html

The algorithm assumes that the String is valid UTF-8, which I believe is
enforced by utf8_strlen. The algorithm uses an ASCII fastpath and skips
characters based on their UTF-8 continuation bytes. This approach can be
more than twice as fast as the existing implementation.

The core algorithm is adapted for the following functions:

  • utf8len
  • mrb_utf8_len
  • chars2bytes
  • bytes2chars

This patch allows removing the utf8len_codepage static array.

Benchmarks (default build with MRB_UTF8_STRING defined)

("aaa\n\nbbbbb\n\n\n\n\ncccc" * 100000000).reverse!

master: 20.99s
patched: 7.36s

("aaa\n\nbbbbb\n\n\n\n\ncccc" * 100000000)[1000000 - 1]

master: 18.89s
patched: 8.14s

("aaa\n\nbbbbb\n\n\n\n\ncccc" * 100000).each_line("\n") { }

master: [does not complete in 4 minutes wall clock time]
patched: 0.02s

NOTE: I did not write the implementation of the algorithm, I only
adapted it to fit within the mruby APIs.

Fixes GH-4522.

@lopopolo
Contributor Author

cc @dearblue

@lopopolo
Contributor Author

lopopolo commented Jun 23, 2019

I've generated flamegraphs with dtrace comparing master and this PR on the following benchmark, over a 60-second sample:

./bin/mruby -e '100000.times { ("aaa\n\nbbbbb\n\n\n\n\ncccc" * 100000).each_line("\n") { } }'

master:

[flamegraph screenshot]

PR 4529:

[flamegraph screenshot]

@shuujii
Contributor

shuujii commented Jun 24, 2019

This is interesting. However, it doesn't seem to work correctly in some cases:

Before the patch, with MRB_UTF8_STRING (same as Ruby):

$ bin/mruby -e '["\xe3\x81", "a\x00z"].each{|s| p s.size}'
2
3

After the patch, with MRB_UTF8_STRING:

$ bin/mruby -e '["\xe3\x81", "a\x00z"].each{|s| p s.size}'
1
1

@dearblue
Contributor

@lopopolo

As a premise, my understanding is that mruby's policy is to handle binary data and strings without distinction.
Therefore, I think that even strings containing bytes that are invalid as UTF-8 need to be handled without problems.
I would also like to hear other people's opinions on this.


I ran some adversarial tests against this patch.

  • The character count of broken UTF-8 strings differs (similar to what @shuujii showed):

    % ./build/host/bin/mruby -e 'p ("\xf1" * 5).size'
    2
    % ruby26 -e 'str = "\xf1" * 5; p str.encoding, str.size'
    #<Encoding:UTF-8>
    5
  • Reversing a string that is broken as UTF-8 crashes:

    % ./build/host/bin/mruby -e 'p ("\xf1" * 5).reverse'
    zsh: segmentation fault (core dumped)  ./mruby@test-utf8 -e 'p ("\xf1" * 5).reverse'
  • If trailing continuation bytes are missing, it makes an out-of-range read:

    % ./build/host/bin/mruby -e 'p "\xf1"[0]'
    "\xf1\x00\x00\x00"

@dearblue
Contributor

mrbtest does not stop...

% CFLAGS='-O0 -Wall -DMRB_UTF8_STRING' ./minirake -j10 clean `pwd`/build/test/bin/mrbtest
...snip...
% ./build/test/bin/mrbtest
mrbtest - Embeddable Ruby Test

.F..............................................................................F..F...X?..................F..F....F.......F.F.FFFF....F...^C

@lopopolo
Contributor Author

Oh no, it appears I was running the tests without MRB_UTF8_STRING defined. Thanks for the tip @dearblue. This patch is not ready for review. I'll mark it as WIP.

@lopopolo lopopolo changed the title Optimize utf8 string length and byte offset functions [WIP] Optimize utf8 string length and byte offset functions Jun 24, 2019
@lopopolo
Contributor Author

I've pushed my latest experiment. Not being able to store the encoding on the RString means we must process each character in the char * to determine the length. The existing approach with the lookup table is faster than what I've been able to come up with.

It looks like CRuby is able to fast-path UTF-8 strlen by storing whether the string has a "broken" encoding or not.

I'm going to abandon this PR. Thanks for taking a look.

@lopopolo lopopolo closed this Jun 24, 2019
@lopopolo lopopolo deleted the utf8-strlen-optimize branch December 17, 2021 04:36
Successfully merging this pull request may close these issues.

String#each_line is slow with MRB_UTF8_STRING