std: Speed up str::is_utf8 #8237

bluss · 2013-08-02T21:43:48Z

Use unchecked vec indexing since the vector bounds are checked by the
loop. Iterators are not easy to use in this case since we skip 1-4 bytes
each lap. This part of the commit speeds up is_utf8 for ASCII input.

Check codepoint ranges by checking the byte ranges manually instead of
computing a full decoding for multibyte encodings. This is easy to read
and corresponds to the UTF-8 syntax in the RFC.

No changes to what we accept. A comment notes that surrogate halves are
accepted.

Before:

test str::bench::is_utf8_100_ascii ... bench: 165 ns/iter (+/- 3)
test str::bench::is_utf8_100_multibyte ... bench: 218 ns/iter (+/- 5)

After:
test str::bench::is_utf8_100_ascii ... bench: 130 ns/iter (+/- 1)
test str::bench::is_utf8_100_multibyte ... bench: 156 ns/iter (+/- 3)

An improvement upon the previous pull #8133

Use unchecked vec indexing since the vector bounds are checked by the loop. Iterators are not easy to use in this case since we skip 1-4 bytes each lap. This part of the commit speeds up is_utf8 for ASCII input. Check codepoint ranges by checking the byte ranges manually instead of computing a full decoding for multibyte encodings. This is easy to read and corresponds to the UTF-8 syntax in the RFC. No changes to what we accept. A comment notes that surrogate halves are accepted. Before: test str::bench::is_utf8_100_ascii ... bench: 165 ns/iter (+/- 3) test str::bench::is_utf8_100_multibyte ... bench: 218 ns/iter (+/- 5) After: test str::bench::is_utf8_100_ascii ... bench: 130 ns/iter (+/- 1) test str::bench::is_utf8_100_multibyte ... bench: 156 ns/iter (+/- 3)

bluss · 2013-08-02T21:44:56Z

Not addressing surrogate halves, only performance. With the code laid out in the new way, it's easy to see how we can disallow surrogates if wanted.

brson · 2013-08-02T21:55:18Z

Thanks for putting benchmark numbers in the commit message and pr.

bluss · 2013-08-02T22:03:40Z

thanks

Use unchecked vec indexing since the vector bounds are checked by the loop. Iterators are not easy to use in this case since we skip 1-4 bytes each lap. This part of the commit speeds up is_utf8 for ASCII input. Check codepoint ranges by checking the byte ranges manually instead of computing a full decoding for multibyte encodings. This is easy to read and corresponds to the UTF-8 syntax in the RFC. No changes to what we accept. A comment notes that surrogate halves are accepted. Before: test str::bench::is_utf8_100_ascii ... bench: 165 ns/iter (+/- 3) test str::bench::is_utf8_100_multibyte ... bench: 218 ns/iter (+/- 5) After: test str::bench::is_utf8_100_ascii ... bench: 130 ns/iter (+/- 1) test str::bench::is_utf8_100_multibyte ... bench: 156 ns/iter (+/- 3) An improvement upon the previous pull #8133

bors closed this Aug 4, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

std: Speed up str::is_utf8 #8237

std: Speed up str::is_utf8 #8237

bluss commented Aug 2, 2013

bluss commented Aug 2, 2013

brson commented Aug 2, 2013

bluss commented Aug 2, 2013

std: Speed up str::is_utf8 #8237

std: Speed up str::is_utf8 #8237

Conversation

bluss commented Aug 2, 2013

bluss commented Aug 2, 2013

brson commented Aug 2, 2013

bluss commented Aug 2, 2013