Incrementally parsed invalid sequences spanning multiple chunks write data #52

Closed
cgaebel opened this issue Oct 29, 2014 · 4 comments

Comments

@cgaebel
Contributor

cgaebel commented Oct 29, 2014

    #[test]
    fn test_invalid_multibyte_span() {
        use std::mem;
        let mut d = UTF8Encoding.decoder();
        // "ef bf be" is an invalid sequence.
        assert_feed_ok!(d, [], [0xef, 0xbf], "");
        let input: [u8, ..1] = [ 0xbe ];
        let (_, _, buf) = unsafe { d.test_feed(mem::transmute(input.as_slice())) };
        // Make sure no data was written to the buffer.
        assert_eq!(buf, String::new());
        // task 'codec::utf_8::tests::test_invalid_multibyte_span' failed at 'assertion failed: `(left == right) && (right == left)` (left: `�`, right: ``)', /Users/cgaebel/code/rust-encoding/src/codec/utf_8.rs:529
    }

This test successfully reports an error, but when it does it writes an invalid code sequence into the buffer.

(Side note: GitHub markup is eating the invalid UTF-8 char in left. Rest assured, something is in there.)

@lifthrasiir
Owner

U+FFFE is a noncharacter, but that doesn't make the corresponding UTF-8 sequence (EF BF BE) invalid! Quoting section 23.7 of the Unicode Standard 7.0:

Applications are free to use any of these noncharacter code points internally. They have no standard interpretation when exchanged outside the context of internal use. However, they are not illegal in interchange, nor does their presence cause Unicode text to be ill-formed. The intent of noncharacters is that they are permanently prohibited from being assigned interchangeable meanings by the Unicode Standard. They are not prohibited from occurring in valid Unicode strings which happen to be interchanged. This distinction, which might be seen as too finely drawn, ensures that noncharacters are correctly preserved when "interchanged" internally, as when used in strings in APIs, in other interprocess protocols, or when stored.

There are also a number of noncharacters, including U+FDD0..FDEF reserved for the Arabic processing, and none of them are prohibited in UTF-8. Rust's char happily accepts them. (Try '\ufffe' :)
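To make the point above concrete, here is a minimal sketch that uses only the Rust standard library (not rust-encoding) to check that the bytes for two noncharacters decode cleanly. The helper name `decode_single` is hypothetical and exists only for this illustration; the escape syntax shown is the current `'\u{fffe}'` form.

```rust
/// Hypothetical helper for illustration: decodes a byte slice that is
/// expected to contain exactly one Unicode scalar value.
/// Returns None if the bytes are not well-formed UTF-8 or hold
/// zero or more than one character.
fn decode_single(bytes: &[u8]) -> Option<char> {
    let s = std::str::from_utf8(bytes).ok()?;
    let mut chars = s.chars();
    let first = chars.next()?;
    if chars.next().is_none() {
        Some(first)
    } else {
        None
    }
}

/// EF BF BE is the well-formed UTF-8 encoding of the noncharacter U+FFFE.
fn fffe_is_well_formed() -> bool {
    decode_single(&[0xEF, 0xBF, 0xBE]) == Some('\u{FFFE}')
}

/// EF B7 90 is the well-formed UTF-8 encoding of the noncharacter U+FDD0.
fn fdd0_is_well_formed() -> bool {
    decode_single(&[0xEF, 0xB7, 0x90]) == Some('\u{FDD0}')
}
```

Both functions return `true`: noncharacters are valid scalar values, so the standard library accepts them in `char` and `str`.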

@cgaebel
Contributor Author

cgaebel commented Oct 30, 2014

Ah. I tried looking up invalid UTF-8 and that's what I found. Silly me! Can you give me an example of something which is invalid UTF-8?

@lifthrasiir
Owner

@cgaebel Rust-encoding has a full test suite for the invalid UTF-8 sequences.
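As a quick answer to the question above, the following sketch lists a few byte sequences that are ill-formed UTF-8 and checks them with the standard library's `std::str::from_utf8`. It is not taken from the rust-encoding test suite; the names `ill_formed_examples` and `is_well_formed` are hypothetical and used only here.

```rust
/// Hypothetical helper: true if `bytes` is well-formed UTF-8 as a whole.
fn is_well_formed(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// A few classic ill-formed sequences, each with a short description.
fn ill_formed_examples() -> Vec<(&'static str, Vec<u8>)> {
    vec![
        ("lone continuation byte", vec![0x80]),
        ("byte that never appears in UTF-8", vec![0xFF]),
        ("overlong two-byte encoding of U+0000", vec![0xC0, 0x80]),
        ("overlong three-byte encoding of U+0000", vec![0xE0, 0x80, 0x80]),
        ("encoded surrogate U+D800", vec![0xED, 0xA0, 0x80]),
        ("value above U+10FFFF", vec![0xF4, 0x90, 0x80, 0x80]),
        ("three-byte sequence cut off after two bytes", vec![0xEF, 0xBF]),
    ]
}
```

Every entry returned by `ill_formed_examples()` fails `is_well_formed`, while the sequence from the original report, `[0xEF, 0xBF, 0xBE]`, passes it.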

@cgaebel
Contributor Author

cgaebel commented Oct 30, 2014

Ahhh I missed the processed > 0 condition when looking for this "bug". Thanks for pointing me in the right direction!
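For readers who want to reproduce the two-chunk scenario from the original report without rust-encoding, here is a minimal sketch of a chunked decoder built on the standard library. It is an assumption-laden illustration, not rust-encoding's implementation: the type `ChunkedUtf8`, its `feed` method, and the `Malformed` error are all hypothetical names. It relies on `Utf8Error::valid_up_to` and `Utf8Error::error_len`, where `error_len() == None` means the input ended in the middle of a sequence.

```rust
use std::str;

/// Hypothetical error type: the input contained a malformed sequence.
#[derive(Debug, PartialEq)]
struct Malformed;

/// Hypothetical chunked decoder: keeps the bytes of an incomplete
/// trailing sequence until the next chunk arrives.
struct ChunkedUtf8 {
    pending: Vec<u8>,
}

impl ChunkedUtf8 {
    fn new() -> Self {
        ChunkedUtf8 { pending: Vec::new() }
    }

    /// Appends the decodable part of `chunk` to `out`.
    /// An incomplete trailing sequence is buffered and writes nothing.
    fn feed(&mut self, chunk: &[u8], out: &mut String) -> Result<(), Malformed> {
        self.pending.extend_from_slice(chunk);

        // Find how many leading bytes are valid, and whether the rest is
        // definitely malformed (as opposed to merely incomplete).
        let (valid, malformed) = match str::from_utf8(&self.pending) {
            Ok(_) => (self.pending.len(), false),
            Err(e) => (e.valid_up_to(), e.error_len().is_some()),
        };

        // The prefix was validated above, so this cannot fail.
        out.push_str(str::from_utf8(&self.pending[..valid]).expect("validated prefix"));

        if malformed {
            self.pending.clear();
            return Err(Malformed);
        }
        // Keep only the incomplete tail (possibly empty) for the next chunk.
        self.pending.drain(..valid);
        Ok(())
    }
}
```

Feeding `[0xEF, 0xBF]` and then `[0xBE]` leaves the output empty after the first call and appends U+FFFE after the second, with no error, which matches the conclusion of this thread. Feeding `[0xEF, 0xBF]` followed by `[0x41]` returns `Err(Malformed)` instead.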
