std: Speed up str::is_utf8 #8237

Closed

bluss
Member

@bluss bluss commented Aug 2, 2013

Use unchecked vec indexing since the vector bounds are checked by the
loop. Iterators are not easy to use in this case since we skip 1-4 bytes
each lap. This part of the commit speeds up is_utf8 for ASCII input.

Check codepoint ranges by checking the byte ranges manually instead of
computing a full decoding for multibyte encodings. This is easy to read
and corresponds to the UTF-8 syntax in the RFC.

No changes to what we accept. A comment notes that surrogate halves are
accepted.
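
As an illustration of the approach described above, here is a minimal sketch in present-day Rust. It is not the code from this commit: the name `is_utf8_sketch` and the exact set of accepted byte ranges are assumptions for illustration. It follows the UTF-8 syntax of RFC 3629, except that surrogate halves (lead byte 0xED followed by 0xA0..0xBF) are deliberately accepted, matching the note above.

```rust
/// Hypothetical sketch, not the function from this pull request.
/// Validates UTF-8 by checking byte ranges directly instead of decoding
/// each code point. Surrogate halves are accepted on purpose.
fn is_utf8_sketch(v: &[u8]) -> bool {
    let len = v.len();
    let mut i = 0;
    while i < len {
        // SAFETY: the loop condition guarantees i < len.
        let b0 = unsafe { *v.get_unchecked(i) };
        if b0 < 0x80 {
            // ASCII fast path: one byte per lap.
            i += 1;
            continue;
        }
        // Sequence width from the lead byte; 0xC0, 0xC1 and 0xF5..=0xFF
        // never start a valid sequence.
        let width = match b0 {
            0xC2..=0xDF => 2,
            0xE0..=0xEF => 3,
            0xF0..=0xF4 => 4,
            _ => return false,
        };
        if i + width > len {
            return false; // truncated sequence
        }
        // SAFETY: i + width <= len was checked just above.
        let b1 = unsafe { *v.get_unchecked(i + 1) };
        // The second byte carries the code point range restrictions.
        let second_ok = match (b0, b1) {
            (0xE0, 0xA0..=0xBF) => true, // rejects overlong 3-byte forms
            (0xF0, 0x90..=0xBF) => true, // rejects overlong 4-byte forms
            (0xF4, 0x80..=0x8F) => true, // rejects values above U+10FFFF
            (0xE0 | 0xF0 | 0xF4, _) => false,
            (_, 0x80..=0xBF) => true,    // includes 0xED: surrogates accepted
            _ => false,
        };
        if !second_ok {
            return false;
        }
        // Any remaining bytes are plain continuation bytes (10xxxxxx).
        for k in 2..width {
            // SAFETY: i + k < i + width <= len.
            let b = unsafe { *v.get_unchecked(i + k) };
            if b & 0xC0 != 0x80 {
                return false;
            }
        }
        i += width;
    }
    true
}
```

In this sketch the unchecked reads are justified by the `i < len` loop condition and the single `i + width > len` check per sequence, which is the sense in which the loop already checks the bounds.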

Before:

test str::bench::is_utf8_100_ascii ... bench: 165 ns/iter (+/- 3)
test str::bench::is_utf8_100_multibyte ... bench: 218 ns/iter (+/- 5)

After:

test str::bench::is_utf8_100_ascii ... bench: 130 ns/iter (+/- 1)
test str::bench::is_utf8_100_multibyte ... bench: 156 ns/iter (+/- 3)

An improvement upon the previous pull #8133

@bluss
Member Author

bluss commented Aug 2, 2013

This PR does not address surrogate halves, only performance. With the code laid out in the new way, it's easy to see how we can disallow surrogates if wanted.
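
To make that concrete, here is a rough sketch of what disallowing surrogates could look like once the second-byte ranges are written out explicitly. The helper name `three_byte_ok_no_surrogates` is hypothetical and this is not code from the PR. The only change from accepting surrogates is that lead byte 0xED restricts the second byte to 0x80..0x9F.

```rust
/// Hypothetical helper, not part of this pull request.
/// Validates a single three-byte UTF-8 sequence and rejects the
/// surrogate range U+D800..=U+DFFF.
fn three_byte_ok_no_surrogates(b0: u8, b1: u8, b2: u8) -> bool {
    let second_ok = match (b0, b1) {
        (0xE0, 0xA0..=0xBF) => true,        // excludes overlong forms
        (0xE1..=0xEC, 0x80..=0xBF) => true,
        (0xED, 0x80..=0x9F) => true,        // 0xA0..=0xBF would encode surrogates
        (0xEE..=0xEF, 0x80..=0xBF) => true,
        _ => false,
    };
    // The third byte is an ordinary continuation byte.
    second_ok && (0x80..=0xBF).contains(&b2)
}
```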

@brson
Contributor

brson commented Aug 2, 2013

Thanks for putting benchmark numbers in the commit message and PR.

@bluss
Member Author

bluss commented Aug 2, 2013

thanks

bors added a commit that referenced this pull request Aug 4, 2013
@bors bors closed this Aug 4, 2013