Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: change offset calculation order to work with prefix matches #1

Closed
wants to merge 2 commits into from

Conversation

tazjin
Copy link

@tazjin tazjin commented Feb 3, 2023

If a match was found exactly at the beginning of a haystack (see the new supplied test), the scanner would panic when trying to subtract pat_len from pos, as pat_len would always be larger than pos in this case.


In case you're wondering "Why is someone randomly sending a PR to this 7 year old, seemingly unmaintained repo?": We are working on a Rust implementation of Nix, called Tvix. In a part of this project we have a potentially huge number of needles which are all equal length, very random strings (hashes of other data), and need to find their occurences in a set of other matches. The set of needles frequently changes and we need to do this scanning hundreds/thousands of times during a single run.

Especially the fact that the set changes quickly makes automaton-based solutions fairly inefficient, as we're doing expensive construction and throwing them away all the time. We tried the famous aho-corasick Rust implementation - it increased runtime by 5x, and ended up being most of what the program did (apart from disk IO). After reading some of the literature, I found that the Wu-Manber algorithm is likely a good fit for our problem space, and your crate was the only one implementing it already (I wasn't quite convinced enough to start a from-scratch implementation, though). It is significantly better performing for us already, but we had to squash this bug :)

Thanks for publishing this!

If a match was found exactly at the beginning of a haystack (see the
new supplied test), the scanner would panic when trying to subtract
`pat_len` from `pos`, as `pat_len` would always be larger than `pos`
in this case.
tvlbot pushed a commit to tvlfyi/tvix that referenced this pull request Feb 4, 2023
Our fork fixes a small bug (jneem/wu-manber#1)
but it's not clear whether upstream will accept patches, so for now
lets point this directly at our fork.

Change-Id: Iccdcedae3e9a8b783241431787c952561d032694
Reviewed-on: https://cl.tvl.fyi/c/depot/+/8031
Reviewed-by: tazjin <tazjin@tvl.su>
Autosubmit: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
tazjin added a commit to tvlfyi/tvix that referenced this pull request Feb 28, 2023
Our fork fixes a small bug (jneem/wu-manber#1)
but it's not clear whether upstream will accept patches, so for now
lets point this directly at our fork.

Change-Id: Iccdcedae3e9a8b783241431787c952561d032694
Reviewed-on: https://cl.tvl.fyi/c/depot/+/8031
Reviewed-by: tazjin <tazjin@tvl.su>
Autosubmit: tazjin <tazjin@tvl.su>
Tested-by: BuildkiteCI
@tazjin
Copy link
Author

tazjin commented Jul 31, 2023

I'm closing this PR as this repo is unmaintained, and we are starting to land other changes in the master branch of our fork, on which this PR is based.

Feel free to pull our commits in the future if you are interested in the improvements!

@tazjin tazjin closed this Jul 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants