Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spurious matches with substring matching and non-ASCII #34

Closed
deifactor opened this issue Dec 22, 2023 · 8 comments
Closed

Spurious matches with substring matching and non-ASCII #34

deifactor opened this issue Dec 22, 2023 · 8 comments

Comments

@deifactor
Copy link

deifactor commented Dec 22, 2023

        let needle = Utf32String::from("lying");
        let haystack = Utf32String::from("Flibbertigibbet / イタズラっ子たち");
        let mut matcher = Matcher::new(Config::DEFAULT);
        assert_eq!(
            matcher.substring_match(haystack.slice(..), needle.slice(..)),
            None
        )

This should pass, but it fails with a score of 30; running with indices indicates that only the first codepoint in the haystack matches. If I get rid of the Japanese text the match goes away as expected. Fuzzy, postfix, and prefix match all indicate that there is no match; it's only substring match that breaks.

edit: If I use Utf32String::Unicode("lying".chars().collect()) there's no match, so I think the 'ascii needle, unicode haystack' codepath is the one with the problem.

@pascalkuthe
Copy link
Member

good find, I did have cases for checking that unicode substring search works for matches but not rejection test cases for cases where patterns did not match. The bug was a small oversight and has been fixed in 34553f0.

I included this in #33 just in time for the new release

@deifactor
Copy link
Author

I'm actually getting the opposite problem now where now it's not matching even though it should be.

@pascalkuthe
Copy link
Member

hmm yeah that is an orthogonal bug, fixed in c7893db

@deifactor
Copy link
Author

Now I'm getting a crash:

        let needle = Utf32String::from("adi");
        let haystack =
            Utf32String::from("At the Road’s End - Seeming - SOL: A Self-Banishment Ritual");
        let mut matcher = Matcher::new(Config::DEFAULT);
        assert_ne!(
            matcher.substring_match(haystack.slice(..), needle.slice(..)),
            None
        )
range end index 60 out of range for slice of length 59
stack backtrace:
   0: rust_begin_unwind
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panicking.rs:597:5
   1: core::panicking::panic_fmt
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/panicking.rs:72:14
   2: core::slice::index::slice_end_index_len_fail_rt
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/slice/index.rs:76:5
   3: core::slice::index::slice_end_index_len_fail
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/slice/index.rs:68:9
   4: <core::ops::range::Range<usize> as core::slice::index::SliceIndex<[T]>>::index
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/slice/index.rs:411:13
   5: core::slice::index::<impl core::ops::index::Index<I> for [T]>::index
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/slice/index.rs:18:9
   6: nucleo_matcher::exact::<impl nucleo_matcher::Matcher>::substring_match_non_ascii
             at /home/vector/.cargo/git/checkouts/nucleo-fe29e1ee969779b0/c7893db/matcher/src/exact.rs:248:28
   7: nucleo_matcher::Matcher::substring_match_impl
             at /home/vector/.cargo/git/checkouts/nucleo-fe29e1ee969779b0/c7893db/matcher/src/lib.rs:485:17
   8: nucleo_matcher::Matcher::substring_match
             at /home/vector/.cargo/git/checkouts/nucleo-fe29e1ee969779b0/c7893db/matcher/src/lib.rs:422:9

@deifactor
Copy link
Author

Seems like the index generation is also really glitchy:

image

@gabydd
Copy link
Member

gabydd commented Dec 22, 2023

Now I'm getting a crash:

        let needle = Utf32String::from("adi");
        let haystack =
            Utf32String::from("At the Road’s End - Seeming - SOL: A Self-Banishment Ritual");
        let mut matcher = Matcher::new(Config::DEFAULT);
        assert_ne!(
            matcher.substring_match(haystack.slice(..), needle.slice(..)),
            None
        )
range end index 60 out of range for slice of length 59
stack backtrace:
   0: rust_begin_unwind
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panicking.rs:597:5
   1: core::panicking::panic_fmt
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/panicking.rs:72:14
   2: core::slice::index::slice_end_index_len_fail_rt
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/slice/index.rs:76:5
   3: core::slice::index::slice_end_index_len_fail
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/slice/index.rs:68:9
   4: <core::ops::range::Range<usize> as core::slice::index::SliceIndex<[T]>>::index
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/slice/index.rs:411:13
   5: core::slice::index::<impl core::ops::index::Index<I> for [T]>::index
             at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/slice/index.rs:18:9
   6: nucleo_matcher::exact::<impl nucleo_matcher::Matcher>::substring_match_non_ascii
             at /home/vector/.cargo/git/checkouts/nucleo-fe29e1ee969779b0/c7893db/matcher/src/exact.rs:248:28
   7: nucleo_matcher::Matcher::substring_match_impl
             at /home/vector/.cargo/git/checkouts/nucleo-fe29e1ee969779b0/c7893db/matcher/src/lib.rs:485:17
   8: nucleo_matcher::Matcher::substring_match
             at /home/vector/.cargo/git/checkouts/nucleo-fe29e1ee969779b0/c7893db/matcher/src/lib.rs:422:9

For this crash it looks like above the code in the commit that Pascal refrenced where it is enumerating to the end of the haystack it should be enumerating to the end minus the length of the needle.

@pascalkuthe
Copy link
Member

pascalkuthe commented Dec 22, 2023

Yeah I have a fix for that but I dont want to rush another fix (altough it is yet another existing bug, I guess I just need to weite moee unicode-unicode suabteing tests. I never type unicode so I rarely run into these).

@deifactor
Copy link
Author

Yeah, I don't type Unicode usually but I have a music collection with a lot of foreign stuff or weird smart quotes and I'm writing a music player so I've been hitting the ascii-needle/unicode haystack codepath a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants