[WIP] Optimize utf8 string length and byte offset functions #4529
Conversation
Replace the existing utf8 strlen and byte offset functions with the UTF-8 length algorithm developed by George Pollard and discussed at https://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html. The algorithm assumes that the string is valid UTF-8, which I believe is enforced by `utf8_strlen`. It uses an ASCII fast path and skips characters based on their UTF-8 continuation bytes. This approach can be more than twice as fast as the existing implementation.

The core algorithm is adapted for the following functions:

- `utf8len`
- `mrb_utf8_len`
- `chars2bytes`
- `bytes2chars`

This patch also allows removing the `utf8len_codepage` static array.

Benchmarks (default build with `MRB_UTF8_STRING` defined):

`("aaa\n\nbbbbb\n\n\n\n\ncccc" * 100000000).reverse!`
- master: 20.99s
- patched: 7.36s

`("aaa\n\nbbbbb\n\n\n\n\ncccc" * 100000000)[1000000 - 1]`
- master: 18.89s
- patched: 8.14s

`("aaa\n\nbbbbb\n\n\n\n\ncccc" * 100000).each_line("\n") { }`
- master: does not complete in 4 minutes of wall-clock time
- patched: 0.02s

**NOTE**: I did not write the implementation of the algorithm; I only adapted it to fit within the mruby APIs.
cc @dearblue
It is interesting. However, it doesn't seem to work well in some cases.
As a premise, my understanding is that mruby's policy is to handle binary data and strings without distinction. I ran some adversarial tests against this patch:
```
% CFLAGS='-O0 -Wall -DMRB_UTF8_STRING' ./minirake -j10 clean `pwd`/build/test/bin/mrbtest
...snip...
% ./build/test/bin/mrbtest
mrbtest - Embeddable Ruby Test
.F..............................................................................F..F...X?..................F..F....F.......F.F.FFFF....F...^C
```
Oh no, it appears I was running the tests without
Tests do not pass.
I've pushed my latest experiment. Not being able to store the encoding state on the string makes this difficult. It looks like CRuby is able to fast-path UTF-8 strlen by storing whether the string has a "broken" encoding or not. I'm going to abandon this PR. Thanks for taking a look.
Fixes GH-4522.