`bytes_regex` should permit generation of byte sequences that are invalid UTF-8 #336

zackw · 2023-06-29T03:06:47Z

This strategy function

fn invalid_ts() -> impl Strategy<Value = Vec<u8>> {
    prop::string::bytes_regex(
        r"(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)"
    ).unwrap()
}

is intended to generate, among other things, invalid UTF-8 byte sequences, because it's for testing a parser that works directly from data on disk that cannot be trusted. What it actually does is crash on the unwrap() with

thread 'attrs::test::parse_xattr_ts_invalid' panicked at 'called `Result::unwrap()` on an `Err` value:
    RegexSyntax(Translate(Error {
        kind: InvalidUtf8,
        pattern: "(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\\.[0-9]*[^0-9].*)",
        span: Span(Position(o: 7, l: 1, c: 8), Position(o: 13, l: 1, c: 14))
    }))'

Looking at the code, I believe the change needed is for bytes_regex to have its own version of regex_to_hir that calls ParserBuilder::allow_invalid_utf8(true).

This change should have no visible effect on any existing code that uses bytes_regex, since one must also opt into generation of invalid UTF-8 within the regex itself (that's one of the things the (?s-u: does) and any existing regex that uses that flag must be using it in a way that actually can't generate invalid UTF-8, or else they'd hit the same crash I'm hitting.

The text was updated successfully, but these errors were encountered:

It is desirable to be able to generate, from a regex, byte sequences that are not necessarily valid UTF-8. For example, suppose you have a parser that accepts any string generated by the regex ``` [0-9]+(\.[0-9]*)? ``` Then, in your test suite, you might want to generate strings from the complementary regular language, which you could do with the regex ``` (?s:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*) ``` However, this will still only generate valid UTF-8 strings. Maybe you are parsing directly from byte sequences read from disk, in which case you want to test the parser’s ability to reject invalid UTF-8 _as well as_ valid UTF-8 but not within the accepted language. Then you want this slight variation: ``` (?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*) ``` But this regex will be rejected by `bytes_regex`, because by default `regex_syntax::Parser` errors out on any regex that potentially matches invalid UTF-8. The application — i.e. proptest — must opt into use of such regexes. This patch makes proptest do just that, for `bytes_regex` only. There should be no change to the behavior of any existing test suite, because opting to allow use of `(?-u)` does not change the semantics of any regex that _doesn’t_ contain `(?-u)`, and any existing regex that _does_ contain `(?-u)` must be incapable of generating invalid UTF-8 for other reasons, or `regex_syntax::Parser` would be rejecting it. (For example, `(?-u:[a-z])` cannot generate invalid UTF-8.) This patch also adds a bunch of tests for `bytes_regex`, which AFAICT was not being tested at all. Some of these use the new functionality and others don’t. There is quite a bit of code duplication in the test helper functions — `do_test` and `do_test_bytes` are almost identical, as are `generate_values_matching_regex` and `generate_byte_values_matching_regex`. I am not good enough at generic metaprogramming in Rust to factor out the duplication.

matthew-russo · 2023-07-15T17:56:36Z

Hey sorry for the delay in getting to this. Gave this a look -- bytes_regex doesn't allow this but there is a public method bytes_regex_parsed that just takes in &regex_syntax::hir::Hir, allowing you to generate that yourself with a customized regex_syntax::ParserBuilder which should address your needs. Given the workaround is straightforward and works generally for other regex parsing options, I'm inclined to leave things as is and not introduce a new, specific helper but I'm open to hearing arguments for that.

I'm going to close this but feel free to reopen if there are outstanding questions/discussions to be had

The proptest method:
https://docs.rs/proptest/latest/proptest/string/fn.bytes_regex_parsed.html

A working example with your pattern:

╰─⠠⠵ cat Cargo.toml
[package]
name = "proptest-playground"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
proptest = { git = "https://github.com/proptest-rs/proptest.git", branch = "master" }
proptest-derive = { git = "https://github.com/proptest-rs/proptest.git", branch = "master" }
proptest-state-machine = { git = "https://github.com/proptest-rs/proptest.git", branch = "master" }
regex-syntax = { version = "0.7" }

╰─⠠⠵ cat src/main.rs
fn main() {
    println!("Hello, world!");
}

#[cfg(test)]
mod test {
    use proptest::prelude::*;
    use regex_syntax::hir::Hir;
    use regex_syntax::{ParserBuilder, Error};

    fn regex_to_hir(pattern: &str) -> Result<Hir, Error> {
        let mut parser = ParserBuilder::new().utf8(false).build();
        Ok(parser.parse(pattern)?)
    }

    fn invalid_ts() -> impl Strategy<Value = Vec<u8>> {
        let pattern = r"(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)";
        let hir = regex_to_hir(pattern).unwrap();
        prop::string::bytes_regex_parsed(&hir).unwrap()
    }

    proptest!(
        #![proptest_config(ProptestConfig::with_cases(3))]

        #[test]
        fn simple(bytes in invalid_ts()) {
            println!("{:?}", bytes);
        }
    );
}

╰─⠠⠵ cargo test -- --nocapture
   Compiling proptest-playground v0.1.0 (/Users/matthewrusso/projects/user/proptest-playground)
    Finished test [unoptimized + debuginfo] target(s) in 0.23s
     Running unittests src/main.rs (target/debug/deps/proptest_playground-e32ba8f312713645)

running 1 test
[57, 57, 56, 50, 53, 49, 57, 54, 48, 48, 49, 50, 54, 48, 48, 49, 55, 56, 52, 51, 56, 57, 51, 52, 55, 50, 52, 47, 188, 55, 82, 114, 39, 118, 18, 8, 205, 202, 3, 115, 54, 181, 186, 253, 21, 165, 164, 197, 71, 172, 43, 208, 128, 183]
[54, 54, 56, 57, 55, 53, 55, 48, 52, 51, 48, 52, 51, 56, 50, 51, 55, 56, 51, 53, 49, 56, 51, 56, 46, 53, 55, 35, 35, 11, 196, 222, 134, 231, 202, 233, 40, 95, 229, 91, 72, 4, 119, 112, 41, 25, 251, 237, 184, 36, 157, 51, 128, 215]
[57, 53, 57, 51, 55, 51, 49, 57, 52, 56, 48, 56, 56, 56, 49, 49, 55, 49, 50, 49, 56, 51, 55, 57, 49, 54, 47, 248, 53, 205, 23, 39, 137, 171, 14, 133, 3, 102, 253, 209, 228, 29, 98]
[52, 53, 5, 201, 75, 108, 145, 227, 247, 55, 123]
test test::simple ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

zackw · 2023-07-18T17:36:35Z

I'm going to ask you to reconsider closing this, and instead to merge #337, for two reasons:

bytes_regex already exists. Allowing people to use (?-u) in bytes_regex regexes makes it significantly more useful. I can't presently think of any situation where I would want to use bytes_regex without (?-u), although that may be biased by the kind of tests I'm currently writing (e.g. https://git.sr.ht/~zackw/file-integrity-rs/tree/c8ec4a33d3b4d75648e7568587eeb7a0f4b6c529/item/file_integrity/src/test/attrs.rs#L106).
The authors of the regex_syntax crate describe it as "an implementation detail of the regex crate" and "may experience breaking changes at [an unpredictable] cadence". (See https://github.com/rust-lang/regex/tree/master/regex-syntax#motivation.) proptest is already committed to keeping up with those changes, but users of proptest should not need to grow a dependency on an (allegedly) unstable API for something this simple.

matthew-russo · 2023-07-18T17:52:08Z

Thanks for the follow up -- I'll give the PR a look some time this week.

And just for clarity, the intention wasn't to bury any requests, just if there was a satisfactory solution without adding new code, I'll generally opt for that. It's clear that the existing solution is not satisfactory

It is desirable to be able to generate, from a regex, byte sequences that are not necessarily valid UTF-8. For example, suppose you have a parser that accepts any string generated by the regex ``` [0-9]+(\.[0-9]*)? ``` Then, in your test suite, you might want to generate strings from the complementary regular language, which you could do with the regex ``` (?s:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*) ``` However, this will still only generate valid UTF-8 strings. Maybe you are parsing directly from byte sequences read from disk, in which case you want to test the parser’s ability to reject invalid UTF-8 _as well as_ valid UTF-8 but not within the accepted language. Then you want this slight variation: ``` (?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*) ``` But this regex will be rejected by `bytes_regex`, because by default `regex_syntax::Parser` errors out on any regex that potentially matches invalid UTF-8. The application — i.e. proptest — must opt into use of such regexes. This patch makes proptest do just that, for `bytes_regex` only. There should be no change to the behavior of any existing test suite, because opting to allow use of `(?-u)` does not change the semantics of any regex that _doesn’t_ contain `(?-u)`, and any existing regex that _does_ contain `(?-u)` must be incapable of generating invalid UTF-8 for other reasons, or `regex_syntax::Parser` would be rejecting it. (For example, `(?-u:[a-z])` cannot generate invalid UTF-8.) This patch also adds a bunch of tests for `bytes_regex`, which AFAICT was not being tested at all. Some of these use the new functionality and others don’t. There is quite a bit of code duplication in the test helper functions — `do_test` and `do_test_bytes` are almost identical, as are `generate_values_matching_regex` and `generate_byte_values_matching_regex`. I am not good enough at generic metaprogramming in Rust to factor out the duplication.

* Permit use of (?-u) in byte-regex strategies (#336) It is desirable to be able to generate, from a regex, byte sequences that are not necessarily valid UTF-8. For example, suppose you have a parser that accepts any string generated by the regex ``` [0-9]+(\.[0-9]*)? ``` Then, in your test suite, you might want to generate strings from the complementary regular language, which you could do with the regex ``` (?s:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*) ``` However, this will still only generate valid UTF-8 strings. Maybe you are parsing directly from byte sequences read from disk, in which case you want to test the parser’s ability to reject invalid UTF-8 _as well as_ valid UTF-8 but not within the accepted language. Then you want this slight variation: ``` (?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*) ``` But this regex will be rejected by `bytes_regex`, because by default `regex_syntax::Parser` errors out on any regex that potentially matches invalid UTF-8. The application — i.e. proptest — must opt into use of such regexes. This patch makes proptest do just that, for `bytes_regex` only. There should be no change to the behavior of any existing test suite, because opting to allow use of `(?-u)` does not change the semantics of any regex that _doesn’t_ contain `(?-u)`, and any existing regex that _does_ contain `(?-u)` must be incapable of generating invalid UTF-8 for other reasons, or `regex_syntax::Parser` would be rejecting it. (For example, `(?-u:[a-z])` cannot generate invalid UTF-8.) This patch also adds a bunch of tests for `bytes_regex`, which AFAICT was not being tested at all. Some of these use the new functionality and others don’t. There is quite a bit of code duplication in the test helper functions — `do_test` and `do_test_bytes` are almost identical, as are `generate_values_matching_regex` and `generate_byte_values_matching_regex`. I am not good enough at generic metaprogramming in Rust to factor out the duplication. * [to squash] Correct for API change in regex-syntax 0.7 Commit rust-lang/regex@706b07d renamed ParserBuilder::allow_invalid_utf8 to ParserBuilder::utf8 and inverted the sense of its argument. Separate commit for review purposes; should be squashed before landing to preserve bisectability of trunk.

matthew-russo · 2023-09-24T05:39:31Z

Closed with merge of: #337

zackw mentioned this issue Jun 30, 2023

bytes_regex() only allows regexs that match valid UTF-8 #238

Closed

matthew-russo closed this as completed Jul 15, 2023

matthew-russo added the help-request This issue is asking for advice/help on using proptest label Jul 15, 2023

matthew-russo reopened this Jul 18, 2023

matthew-russo closed this as completed Sep 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`bytes_regex` should permit generation of byte sequences that are invalid UTF-8 #336

`bytes_regex` should permit generation of byte sequences that are invalid UTF-8 #336

zackw commented Jun 29, 2023

matthew-russo commented Jul 15, 2023

zackw commented Jul 18, 2023 •

edited

Loading

matthew-russo commented Jul 18, 2023

matthew-russo commented Sep 24, 2023

bytes_regex should permit generation of byte sequences that are invalid UTF-8 #336

bytes_regex should permit generation of byte sequences that are invalid UTF-8 #336

Comments

zackw commented Jun 29, 2023

matthew-russo commented Jul 15, 2023

zackw commented Jul 18, 2023 • edited Loading

matthew-russo commented Jul 18, 2023

matthew-russo commented Sep 24, 2023

`bytes_regex` should permit generation of byte sequences that are invalid UTF-8 #336

`bytes_regex` should permit generation of byte sequences that are invalid UTF-8 #336

zackw commented Jul 18, 2023 •

edited

Loading