Deny overlong encodings in UTF-8 #8133

bluss · 2013-07-30T17:40:09Z

Fix is_utf8 and UTF-8 char width functions to deny non-canonical 'overlong encodings' in UTF-8.

This makes the function is_utf8 stricter and more correct, but makes no changes to how invalid UTF-8 is handled.

Fixes issue #3787

Bytes 0xC0 and 0xC1 can only be used to start 2-byte encodings that are 'overlong encodings' of codepoints below 128. The reference given in a comment -- https://tools.ietf.org/html/rfc3629 -- does in fact already exclude these bytes, so no additional comment should be needed in the code.
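As a rough illustration of the lead-byte rule above, here is a minimal sketch of a width lookup that follows the RFC 3629 syntax. The name `utf8_sequence_len` is hypothetical and this is not the code from the commits; it only shows why 0xC0 and 0xC1 never map to a valid width.

```rust
/// Expected sequence length for a UTF-8 lead byte, or `None` if the byte
/// can never start a well-formed sequence under the RFC 3629 syntax.
/// Hypothetical helper for illustration, not the function from this PR.
fn utf8_sequence_len(lead: u8) -> Option<usize> {
    match lead {
        // ASCII: one byte.
        0x00..=0x7F => Some(1),
        // 0x80..=0xBF are continuation bytes. 0xC0 and 0xC1 could only
        // start overlong 2-byte encodings of codepoints below 128, so the
        // 2-byte range starts at 0xC2.
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        // 0xF5 and above would encode values beyond U+10FFFF.
        0xF0..=0xF4 => Some(4),
        _ => None,
    }
}
```

Note that the lead byte alone is not enough for 3- and 4-byte sequences: 0xE0 and 0xF0 are valid lead bytes, but they still need a restricted second byte to rule out overlong forms.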

An 'overlong encoding' is a codepoint encoded non-minimally using the UTF-8 format. Denying these ensures that each codepoint has only one valid representation in UTF-8.

An example is the byte sequence 0xE0 0x80 0x80, which could be interpreted as U+0, but it's an overlong encoding since the canonical form is just 0x00. Another example is 0xE0 0x80 0xAF, which was previously accepted and is an overlong encoding of the solidus "/". Directory traversal characters like / and . form the most compelling argument for why this commit is security critical.

Factor out common UTF-8 decoding expressions as macros. This commit will partly duplicate UTF-8 decoding, so it is now present in both fn is_utf8() and .char_range_at(); the latter assumes a valid str.
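To make the overlong examples above concrete, the following sketch decodes a 3-byte sequence by masking and shifting only, then compares the result with the smallest codepoint that actually needs three bytes (U+0800). The helper names are hypothetical and are not taken from the commits.

```rust
/// Decode a 3-byte sequence by masking and shifting only, without any
/// validity checks. Hypothetical helper for illustration.
fn naive_decode_3(b: [u8; 3]) -> u32 {
    ((b[0] as u32 & 0x0F) << 12) | ((b[1] as u32 & 0x3F) << 6) | (b[2] as u32 & 0x3F)
}

/// A 3-byte sequence is overlong if the decoded codepoint would have
/// fit in one or two bytes, i.e. it is below U+0800.
fn is_overlong_3(b: [u8; 3]) -> bool {
    naive_decode_3(b) < 0x800
}
```

With this sketch, `naive_decode_3([0xE0, 0x80, 0xAF])` yields 0x2F, the solidus, which is why a decoder that skips the minimality check can be tricked into producing "/" from bytes that a filter looking for 0x2F would miss.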

Mark the static constants in str.rs as private: static variables are pub by default, which is not reflected in our code (we need to use priv).

bluss · 2013-07-30T17:47:24Z

The commit logs have more info, but I forgot to underline that the new codepoint decoding code in is_utf8 is only active in the case of 3- and 4-byte encodings, so it should not affect the fastpath of the function.

Use unchecked vec indexing since the vector bounds are checked by the loop. Iterators are not easy to use in this case since we skip 1-4 bytes each lap. This part of the commit speeds up is_utf8 for ASCII input.

Check codepoint ranges by checking the byte ranges manually instead of computing a full decoding for multibyte encodings. This is easy to read and corresponds to the UTF-8 syntax in the RFC.

No changes to what we accept. A comment notes that surrogate halves are accepted.

Before:

test str::bench::is_utf8_100_ascii ... bench: 165 ns/iter (+/- 3)
test str::bench::is_utf8_100_multibyte ... bench: 218 ns/iter (+/- 5)

After:

test str::bench::is_utf8_100_ascii ... bench: 130 ns/iter (+/- 1)
test str::bench::is_utf8_100_multibyte ... bench: 156 ns/iter (+/- 3)

An improvement upon the previous pull #8133
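The byte-range approach described above can be sketched roughly as follows, in present-day Rust. This is an assumption-laden illustration rather than the code from the pull request: the name `is_utf8_sketch` is hypothetical, it uses bounds-checked access instead of unchecked indexing, and it follows the RFC 3629 syntax strictly, so unlike the code described here it rejects surrogate halves (the 0xED row).

```rust
/// Validate UTF-8 by checking byte ranges directly, following the
/// RFC 3629 syntax, without computing the decoded codepoint.
/// Hypothetical sketch: safe indexing, and surrogates are rejected.
fn is_utf8_sketch(v: &[u8]) -> bool {
    // True if a byte exists at index `j` and lies in `lo..=hi`.
    let in_range = |j: usize, lo: u8, hi: u8| v.get(j).map_or(false, |&b| lo <= b && b <= hi);
    // Shorthand for an ordinary continuation byte at index `j`.
    let tail = |j: usize| in_range(j, 0x80, 0xBF);

    let mut i = 0;
    while i < v.len() {
        // Each arm returns how many bytes the sequence occupies.
        let step = match v[i] {
            // ASCII fast path.
            0x00..=0x7F => 1,
            // 2 bytes; 0xC0 and 0xC1 are excluded (overlong).
            0xC2..=0xDF if tail(i + 1) => 2,
            // 3 bytes; 0xE0 needs a second byte >= 0xA0 (else overlong).
            0xE0 if in_range(i + 1, 0xA0, 0xBF) && tail(i + 2) => 3,
            0xE1..=0xEC | 0xEE..=0xEF if tail(i + 1) && tail(i + 2) => 3,
            // 0xED with a second byte above 0x9F would be a surrogate.
            0xED if in_range(i + 1, 0x80, 0x9F) && tail(i + 2) => 3,
            // 4 bytes; 0xF0 needs a second byte >= 0x90 (else overlong).
            0xF0 if in_range(i + 1, 0x90, 0xBF) && tail(i + 2) && tail(i + 3) => 4,
            0xF1..=0xF3 if tail(i + 1) && tail(i + 2) && tail(i + 3) => 4,
            // 0xF4 with a second byte above 0x8F would exceed U+10FFFF.
            0xF4 if in_range(i + 1, 0x80, 0x8F) && tail(i + 2) && tail(i + 3) => 4,
            _ => return false,
        };
        i += step;
    }
    true
}
```

Because every match arm corresponds to one row of the RFC's UTF-8 syntax, the overlong cases fall out of the restricted second-byte ranges for 0xE0 and 0xF0 rather than from a separate decoding step.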

blake2-ppc added 4 commits July 30, 2013 17:25

std: Add from_bytes test for utf-8 using codepoints above 0xffff

aa89325

std: Mark the static constants in str.rs as private

8f9014c

static variables are pub by default, which is not reflected in our code (we need to use priv).

bors closed this Jul 30, 2013

bluss mentioned this pull request Aug 2, 2013

std: Speed up str::is_utf8 #8237

Closed

flip1995 pushed a commit to flip1995/rust that referenced this pull request Dec 30, 2021

Auto merge of rust-lang#8133 - surechen:fix_8128, r=xFrednet

bb7b6be

Fix 8128 Fixes rust-lang#8128 changelog: Fix error suggestion of `skip(..).next()` for immutable variable.
