Returning the match including the wildcards #23
Hi, I think byteseek should be able to do what you want. My documentation is not very good though! You almost got what's needed. First, instead of using a Parser, use a Compiler. The parser only parses the syntax and produces a parse tree from it. This itself isn't executable - but the compiler will turn it into something you can use to match or search with. And we should use a Matcher compiler, rather than the RegexCompiler. I did say my documentation is pretty bad - none of this is clear.
If you just want to match that byte sequence at a particular position, you can call match methods on a SequenceMatcher directly. However, you want to search in the file for that sequence. To do this efficiently, we need to use a Searcher. These use efficient search algorithms which significantly outperform simply matching at each position in turn. The Horspool searcher is generally the fastest of the algorithms currently in byteseek.
Now you can use the methods on the Searcher to search over the file. It will return SearchResult objects that tell you where a match is located and the length of the match. One more thing - instead of using the …
Hope that helps. If you have any other questions, please feel free to ask.
One thing byteseek won't do is return the actual data that matched. It returns the position and length of a match, but not the data itself. You would have to extract those byte sequences from the file once you have found a match. It's not a bad idea to build that capability in - I'll consider adding that for a future release.
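As a rough illustration of that extraction step, here is a minimal sketch using only the Java standard library. It assumes you already have a match position and a length from a search; `MatchExtractor` and `readMatch` are hypothetical names chosen for this example and are not part of byteseek.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of byteseek: reads the bytes of a match
// from a file, given the position and length that a search reported.
final class MatchExtractor {

    /** Reads `length` bytes starting at `position`; throws EOFException if the file is too short. */
    static byte[] readMatch(Path file, long position, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] data = new byte[length];
            raf.seek(position);   // jump straight to the start of the match
            raf.readFully(data);  // read exactly `length` bytes or fail
            return data;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("match-demo", ".bin");
        try {
            Files.write(tmp, new byte[] {0, 1, 2, 3, 4, 5, 6, 7});
            // Pretend a search reported a match at position 2 with length 3.
            byte[] match = readMatch(tmp, 2, 3);
            System.out.println(Arrays.toString(match)); // prints [2, 3, 4]
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

If there are many matches, it would likely be better to open the file once and reuse it for every read rather than reopening it per match.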
I just realised that there is a problem with searching for your pattern, caused by some algorithmic issues. It's not hard to solve though. You're searching for …
The 512 wildcard bytes at the end are essentially impossible to search for efficiently with most sub-linear search algorithms. The way this is usually dealt with is to search for the non-wildcard prefix, then extract the bytes after it. So you should search for …
If you're interested, the reason it's hard to search efficiently for wildcards at the end of a sequence is that most sub-linear search algorithms work from the end of a search pattern, rather than the start. Since the wildcard pattern at the end matches everywhere it looks, it prevents the algorithmic optimisations from skipping ahead in the file. Conversely, a wildcard pattern at the start of a sequence doesn't really impact performance at all.
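The "search for the prefix, then take the bytes after it" approach can be sketched in plain Java as follows. This is only an illustration over an in-memory byte array: it uses a naive position-by-position scan in place of a byteseek Searcher, and the class name, method names, header bytes and payload length are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not byteseek code: find every occurrence of a fixed
// header, then copy out the fixed-length payload that follows it.
final class PrefixThenPayload {

    /** Returns the payload after each header occurrence that has payloadLength bytes following it. */
    static List<byte[]> payloadsAfter(byte[] data, byte[] header, int payloadLength) {
        List<byte[]> payloads = new ArrayList<>();
        // Last start position where header plus payload still fit in the data.
        int lastStart = data.length - header.length - payloadLength;
        for (int pos = 0; pos <= lastStart; pos++) {
            if (matchesAt(data, header, pos)) {
                int from = pos + header.length;
                payloads.add(Arrays.copyOfRange(data, from, from + payloadLength));
            }
        }
        return payloads;
    }

    /** True if header occurs in data starting at pos (caller guarantees it fits). */
    static boolean matchesAt(byte[] data, byte[] header, int pos) {
        for (int i = 0; i < header.length; i++) {
            if (data[pos + i] != header[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up data: header 0A 0B followed by a 3-byte payload, twice.
        byte[] data = {9, 0x0A, 0x0B, 1, 2, 3, 9, 0x0A, 0x0B, 4, 5, 6};
        byte[] header = {0x0A, 0x0B};
        for (byte[] payload : payloadsAfter(data, header, 3)) {
            System.out.println(Arrays.toString(payload));
        }
    }
}
```

For the case in this issue the payload length would be 512, and the naive scan would be replaced by a proper search for the header only.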
So - better documentation is needed, but also a higher-level interface, more like a normal regex, so that specialist knowledge isn't required to use it safely.
One more thought. It's actually faster with the Horspool algorithm to search for longer patterns than shorter ones. So if you could expand the number of bytes to recognise beyond …
This is because if the search finds something that isn't in your pattern, it can skip ahead the entire length of the pattern. Essentially, the longer the pattern, the longer the skip you can get, up to a few hundred bytes at least before the advantage goes away.
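To make the skipping behaviour concrete, here is a small textbook-style Horspool sketch in plain Java. It is not byteseek's implementation; `HorspoolSketch` and its methods are hypothetical names for this example. It counts how many alignments of the pattern are tried, so a short and a long pattern can be compared on the same data.

```java
import java.util.Arrays;

// Textbook Horspool over a byte array, written for illustration only.
final class HorspoolSketch {

    /** shift[b] = how far the pattern may move when byte b sits under its last position. */
    static int[] buildShiftTable(byte[] pattern) {
        int[] shift = new int[256];
        // A byte that does not occur in the pattern allows a skip of the full length.
        Arrays.fill(shift, pattern.length);
        for (int i = 0; i < pattern.length - 1; i++) {
            shift[pattern[i] & 0xFF] = pattern.length - 1 - i;
        }
        return shift;
    }

    /** Returns {index of first match or -1, number of alignments tried}. */
    static int[] search(byte[] data, byte[] pattern) {
        int[] shift = buildShiftTable(pattern);
        int last = pattern.length - 1;
        int alignments = 0;
        int pos = 0;
        while (pos <= data.length - pattern.length) {
            alignments++;
            int i = last;
            // Compare from the end of the pattern backwards.
            while (i >= 0 && data[pos + i] == pattern[i]) {
                i--;
            }
            if (i < 0) {
                return new int[] {pos, alignments};
            }
            // Skip ahead based on the data byte under the pattern's last position.
            pos += shift[data[pos + last] & 0xFF];
        }
        return new int[] {-1, alignments};
    }

    public static void main(String[] args) {
        // 64 zero bytes followed by the bytes 1..8.
        byte[] data = new byte[72];
        for (int k = 0; k < 8; k++) {
            data[64 + k] = (byte) (k + 1);
        }
        int[] shortResult = search(data, new byte[] {1, 2});
        int[] longResult = search(data, new byte[] {1, 2, 3, 4, 5, 6, 7, 8});
        System.out.println("short: index " + shortResult[0] + ", alignments " + shortResult[1]);
        System.out.println("long:  index " + longResult[0] + ", alignments " + longResult[1]);
    }
}
```

On this data both patterns are found at index 64, but the 2-byte pattern needs 33 alignments while the 8-byte pattern needs only 9, because each mismatch on a zero byte lets the longer pattern skip 8 positions instead of 2.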
Thanks a lot for all the help. This is what I have now:
All matches are saved in …
Great, I'm glad it's working for you.
Hello
First, thanks for this library; it could make my task a lot easier.
I'm trying to search through a large binary file and match a header pattern (in hex). Following that header are 512 bytes of data that I'm actually interested in.
That's how far I got. Sadly, I couldn't get the pattern from the regexParser working, and the regular ByteMatcher only gives me a boolean indicating whether the pattern is included in the file.
Is it possible to search through the file with the commented regex and return all the matches (as byte[] or char[]) found in the binary, including the wildcard (.{512}) data?