fix: change offset calculation order to work with prefix matches #1

tazjin · 2023-02-03T23:22:00Z

If a match was found exactly at the beginning of a haystack (see the new supplied test), the scanner would panic when trying to subtract pat_len from pos, as pat_len would always be larger than pos in this case.

In case you're wondering "Why is someone randomly sending a PR to this 7 year old, seemingly unmaintained repo?": We are working on a Rust implementation of Nix, called Tvix. In a part of this project we have a potentially huge number of needles which are all equal length, very random strings (hashes of other data), and need to find their occurences in a set of other matches. The set of needles frequently changes and we need to do this scanning hundreds/thousands of times during a single run.

Especially the fact that the set changes quickly makes automaton-based solutions fairly inefficient, as we're doing expensive construction and throwing them away all the time. We tried the famous aho-corasick Rust implementation - it increased runtime by 5x, and ended up being most of what the program did (apart from disk IO). After reading some of the literature, I found that the Wu-Manber algorithm is likely a good fit for our problem space, and your crate was the only one implementing it already (I wasn't quite convinced enough to start a from-scratch implementation, though). It is significantly better performing for us already, but we had to squash this bug :)

Thanks for publishing this!

If a match was found exactly at the beginning of a haystack (see the new supplied test), the scanner would panic when trying to subtract `pat_len` from `pos`, as `pat_len` would always be larger than `pos` in this case.

Our fork fixes a small bug (jneem/wu-manber#1) but it's not clear whether upstream will accept patches, so for now lets point this directly at our fork. Change-Id: Iccdcedae3e9a8b783241431787c952561d032694 Reviewed-on: https://cl.tvl.fyi/c/depot/+/8031 Reviewed-by: tazjin <tazjin@tvl.su> Autosubmit: tazjin <tazjin@tvl.su> Tested-by: BuildkiteCI

tazjin · 2023-07-31T07:54:15Z

I'm closing this PR as this repo is unmaintained, and we are starting to land other changes in the master branch of our fork, on which this PR is based.

Feel free to pull our commits in the future if you are interested in the improvements!

fix: change offset calculation order to work with prefix matches

e77628c

If a match was found exactly at the beginning of a haystack (see the new supplied test), the scanner would panic when trying to subtract `pat_len` from `pos`, as `pat_len` would always be larger than `pos` in this case.

find(): support passing in &[u8] as well

0d5b22b

tazjin closed this Jul 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: change offset calculation order to work with prefix matches #1

fix: change offset calculation order to work with prefix matches #1

tazjin commented Feb 3, 2023

tazjin commented Jul 31, 2023

fix: change offset calculation order to work with prefix matches #1

fix: change offset calculation order to work with prefix matches #1

Conversation

tazjin commented Feb 3, 2023

tazjin commented Jul 31, 2023