Incrementally parsed invalid sequences spanning multiple chunks write data #52

cgaebel · 2014-10-29T21:16:28Z

    #[test]
    fn test_invalid_multibyte_span() {
        use std::mem;
        let mut d = UTF8Encoding.decoder();
        // "ef bf be" is an invalid sequence.
        assert_feed_ok!(d, [], [0xef, 0xbf], "");
        let input: [u8, ..1] = [ 0xbe ];
        let (_, _, buf) = unsafe { d.test_feed(mem::transmute(input.as_slice())) };
        // Make sure no data was written to the buffer.
        assert_eq!(buf, String::new());
        // task 'codec::utf_8::tests::test_invalid_multibyte_span' failed at 'assertion failed: `(left == right) && (right == left)` (left: `�`, right: ``)', /Users/cgaebel/code/rust-encoding/src/codec/utf_8.rs:529
    }

This test successfully reports an error, but when it does it writes an invalid code sequence into the buffer.

(Side note: GitHub's markup is eating the invalid UTF-8 char in `left`. Rest assured SOMETHING is in there.)


lifthrasiir · 2014-10-30T01:17:17Z

U+FFFE is a noncharacter, but that doesn't make the corresponding UTF-8 sequence (EF BF BE) invalid! Quoting section 23.7 of the Unicode Standard 7.0:

Applications are free to use any of these noncharacter code points internally. They have no standard interpretation when exchanged outside the context of internal use. However, they are not illegal in interchange, nor does their presence cause Unicode text to be ill-formed. The intent of noncharacters is that they are permanently prohibited from being assigned interchangeable meanings by the Unicode Standard. They are not prohibited from occurring in valid Unicode strings which happen to be interchanged. This distinction, which might be seen as too finely drawn, ensures that noncharacters are correctly preserved when "interchanged" internally, as when used in strings in APIs, in other interprocess protocols, or when stored.

There are a number of other noncharacters as well, including U+FDD0..FDEF in the Arabic Presentation Forms-A block, and none of them is prohibited in UTF-8. Rust's char happily accepts them. (Try '\ufffe' :)
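To make the point concrete, here is a minimal sketch that checks it against the Rust standard library rather than rust-encoding. It is written in current Rust syntax, not the 2014 syntax used in the test above, and the helper names `decode_code_points` and `is_noncharacter` are made up for illustration.

```rust
use std::str;

/// Decodes `bytes` as UTF-8 and returns the code points it contains,
/// or `None` if the bytes are not well-formed UTF-8.
/// (Hypothetical helper, not part of rust-encoding.)
fn decode_code_points(bytes: &[u8]) -> Option<Vec<u32>> {
    let s = str::from_utf8(bytes).ok()?;
    Some(s.chars().map(|c| c as u32).collect())
}

/// True if `cp` is one of the 66 noncharacters: U+FDD0..=U+FDEF, or any
/// code point up to U+10FFFF whose low 16 bits are FFFE or FFFF.
/// (Hypothetical helper, not part of rust-encoding.)
fn is_noncharacter(cp: u32) -> bool {
    cp <= 0x10FFFF && ((0xFDD0..=0xFDEF).contains(&cp) || (cp & 0xFFFE) == 0xFFFE)
}
```

With these, `decode_code_points(&[0xEF, 0xBF, 0xBE])` yields the single code point U+FFFE, and `is_noncharacter(0xFFFE)` is true: the sequence is a noncharacter and well-formed UTF-8 at the same time.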

cgaebel · 2014-10-30T01:22:25Z

Ah. I tried looking up invalid UTF-8 and that's what I found. Silly me! Can you give me an example of something which is invalid UTF-8?

lifthrasiir · 2014-10-30T01:25:25Z

@cgaebel Rust-encoding has a full test suite for the invalid UTF-8 sequences.
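For readers without the test suite at hand, a few well-known ill-formed sequences are sketched below. These are not copied from rust-encoding's tests; they are checked against the standard library's `str::from_utf8` in current Rust, and the names `ILL_FORMED`, `is_well_formed` and `needs_more_input` are illustrative only.

```rust
use std::str;

/// A few byte sequences that are ill-formed UTF-8, each with the reason.
/// (Illustrative list, not taken from rust-encoding.)
const ILL_FORMED: &[(&[u8], &str)] = &[
    (&[0x80], "lone continuation byte"),
    (&[0xC0, 0x80], "overlong encoding of U+0000"),
    (&[0xED, 0xA0, 0x80], "encoded surrogate U+D800"),
    (&[0xF4, 0x90, 0x80, 0x80], "code point above U+10FFFF"),
    (&[0xFF], "byte that never appears in UTF-8"),
    (&[0xEF, 0xBF], "three-byte sequence cut off after two bytes"),
];

/// True if `bytes` is well-formed UTF-8 as a complete input.
fn is_well_formed(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

/// True if `bytes` fails only because it ends in the middle of a sequence,
/// so an incremental decoder should wait for the next chunk before
/// reporting anything.
fn needs_more_input(bytes: &[u8]) -> bool {
    matches!(str::from_utf8(bytes), Err(e) if e.error_len().is_none())
}
```

The last entry is the interesting one for chunked input: `[0xEF, 0xBF]` alone is ill-formed as a complete input, but `needs_more_input` reports it as merely incomplete, and appending `0xBE` turns it into the valid encoding of U+FFFE.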

cgaebel · 2014-10-30T01:36:50Z

Ahhh I missed the processed > 0 condition when looking for this "bug". Thanks for pointing me in the right direction!

lifthrasiir closed this as completed Oct 30, 2014
