Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deny overlong encodings in UTF-8 #8133

Closed
wants to merge 4 commits into from
Closed

Deny overlong encodings in UTF-8 #8133

wants to merge 4 commits into from

Conversation

bluss
Copy link
Member

@bluss bluss commented Jul 30, 2013

Fix is_utf8 and UTF-8 char width functions to deny non-canonical 'overlong encodings' in UTF-8.

We address the function is_utf8 to make it more strict and correct, but no changes are made to the handling of invalid UTF-8.

Fixes issue #3787

blake2-ppc added 4 commits July 30, 2013 17:25
Bytes 0xC0, 0xC1 can only be used to start 2-byte codepoint encodings,
that are 'overlong encodings' of codepoints below 128.

The reference given in a comment -- https://tools.ietf.org/html/rfc3629
-- does in fact already exclude these bytes, so no additional comment
should be needed in the code.
An 'overlong encoding' is a codepoint encoded non-minimally using the
utf-8 format. Denying these enforce each codepoint to have only one
valid representation in utf-8.

An example is byte sequence 0xE0 0x80 0x80 which could be interpreted as
U+0, but it's an overlong encoding since the canonical form is just
0x00.

Another example is 0xE0 0x80 0xAF which was previously accepted and is
an overlong encoding of the solidus "/". Directory traversal characters
like / and . form the most compelling argument for why this commit is
security critical.

Factor out common UTF-8 decoding expressions as macros. This commit will
partly duplicate UTF-8 decoding, so it is now present in both
fn is_utf8() and .char_range_at(); the latter using an assumption of
a valid str.
static variables are pub by default, which is not reflected in our code
(we need to use priv).
@bluss
Copy link
Member Author

bluss commented Jul 30, 2013

The commit logs have more info, but I forgot to underline that the new codepoint decoding code in is_utf8 is only active in the case of 3- and 4-byte encodings, so it should not affect the fastpath of the function.

bors added a commit that referenced this pull request Jul 30, 2013
Fix is_utf8 and UTF-8 char width functions to deny non-canonical 'overlong encodings' in UTF-8.

We address the function is_utf8 to make it more strict and correct, but no changes are made to the handling of invalid UTF-8.

Fixes issue #3787
@bors bors closed this Jul 30, 2013
@bluss bluss mentioned this pull request Aug 2, 2013
bors added a commit that referenced this pull request Aug 4, 2013
Use unchecked vec indexing since the vector bounds are checked by the
loop. Iterators are not easy to use in this case since we skip 1-4 bytes
each lap. This part of the commit speeds up is_utf8 for ASCII input.

Check codepoint ranges by checking the byte ranges manually instead of
computing a full decoding for multibyte encodings. This is easy to read
and corresponds to the UTF-8 syntax in the RFC.

No changes to what we accept. A comment notes that surrogate halves are
accepted.

Before:

	test str::bench::is_utf8_100_ascii ... bench: 165 ns/iter (+/- 3)
	test str::bench::is_utf8_100_multibyte ... bench: 218 ns/iter (+/- 5)

After:
	test str::bench::is_utf8_100_ascii ... bench: 130 ns/iter (+/- 1)
	test str::bench::is_utf8_100_multibyte ... bench: 156 ns/iter (+/- 3)

An improvement upon the previous pull #8133
flip1995 pushed a commit to flip1995/rust that referenced this pull request Dec 30, 2021
Fix 8128

Fixes rust-lang#8128

changelog: Fix  error suggestion of `skip(..).next()` for immutable variable.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants