Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Returning the match including thr wildcards #23

Closed
Yaldabaoth64 opened this issue Nov 12, 2020 · 7 comments
Closed

Returning the match including thr wildcards #23

Yaldabaoth64 opened this issue Nov 12, 2020 · 7 comments

Comments

@Yaldabaoth64
Copy link

Hello
First thanks for this library, it could make my task a lot easier.

Im trying to search through a large binary file and match a header (in hex) pattern. Following that header are 512 bytes of data that im actually interested in.

Thats how far i got. Sadly i couldnt get the pattern from the regexParser working, and the regular ByteMatcher only gives me a boolean, if the pattern is included in the file.

InputStreamReader inputStreamReader = new InputStreamReader(new FileInputStream(new File("test.bin")));

            //RegexParser regexParser = new RegexParser();
            //ParseTree tree = regexParser.parse("64 84 71 94 00 18 .{512});

            byte[] bytes = new byte[6];
            bytes[0] = (byte) 0x64;
            bytes[1] = (byte) 0x84;
            bytes[2] = (byte) 0x71;
            bytes[3] = (byte) 0x94;
            bytes[4] = (byte) 0x00;
            bytes[5] = (byte) 0x18;
            //...

            ByteMatcherCompiler byteMatcherCompiler = new ByteMatcherCompiler();
            ByteMatcher byteMatcher = ByteMatcherCompiler.compileFrom(bytes);

            System.out.println(byteMatcher.matches(inputStreamReader, 0));

Is it possible to search through the file with the commented regex and return all the matches (as byte[] or char[]) found in the binary including the wildcard ( .{512} )data?

@nishihatapalmer
Copy link
Owner

nishihatapalmer commented Nov 12, 2020

Hi,

I think byteseek should be able to do what you want. My documentation is not very good though! You almost got what's needed.

First, instead of using a Parser, use a Compiler. The parser only parses the syntax and produces a parse tree from it. This itself isn't executable - but the compiler will turn it into something you can use to match or search with. And we should use a Matcher compiler, rather than the RegexCompiler. I did say my documentation is pretty bad - none of this is clear.

SequenceMatcherCompiler compiler = new SequenceMatcherCompiler();
SequenceMatcher matcher = compiler.compile("64 84 71 94 00 18 .{512}");

If you just want to match that byte sequence at a particular position, you can call match methods on a SequenceMatcher directly. However, you want to search in the file for that sequence. To do this efficiently, we need to use a Searcher. These use efficient search algorithms which significantly outperform simply matching at each position in turn. The Horspool searcher is generally the fastest of the algorithms currently in byteseek.

HorspoolFinalFlagSearcher searcher = new HorspoolFinalFlagSearcher(matcher);

Now you can use the methods on the Searcher to search over the file. It will return SearchResult objects that tell you where a match is located and the length of the match.

One more thing - instead of using the InputStreamReader, why not use the FileReader? Having direct random access to the file is usually more efficient than copying it into a stream first.

Hope that helps, any other questions please feel free to ask.

@nishihatapalmer
Copy link
Owner

One thing byteseek won't do is return the actual data for you that matched. It returns the match position and length of a match, but not the data itself. You would have to extract those byte sequences from the file once you found a match.

It's not a bad idea to build that capability in - I'll consider adding that for a future release.

@nishihatapalmer
Copy link
Owner

I just realised that there is a problem with searching for your pattern, and it's because of some algorithmic issues. It's not hard to solve though. You're searching for 64 84 71 94 00 18 .{512}.

The 512 wildcard bytes at the end are essentially impossible to search for efficiently with most sub-linear search algorithms. The way this is usually dealt with is to search for the non wildcard prefix, then extract the bytes after it.

So you should search for 64 84 71 94 00 18 and if you find a match, extract the 512 bytes after it (assuming there are still 512 bytes, of course).

If you're interested, the reason why it's hard to efficiently search for wildcards at the end of an sequence is because most sub-linear search algorithms work from the end of a search pattern, rather than the start. Since all of the wildcard pattern at the end matches everywhere it looks, it prevents the algorithmic optimisations from skipping ahead in the file. Conversely, a wildcard pattern at the start of a sequence doesn't really impact performance at all.

@nishihatapalmer
Copy link
Owner

So - better documentation is needed, but also a higher level interface, more like a normal regex so specialist knowledge isn't required to use it safely.

@nishihatapalmer
Copy link
Owner

nishihatapalmer commented Nov 12, 2020

One more thought. It's actually faster in the horspool algorithm to search for longer patterns than shorter ones.

So if you could expand the number of bytes to recognise beyond 64 84 71 94 00 18, the search would work a lot faster.

This is because if the search finds something that isn't in your pattern, it can skip ahead the entire length of the pattern. Essentially, the longer the pattern, the longer the skip you can get, up to a few hundred bytes at least before the advantage goes away.

@Yaldabaoth64
Copy link
Author

Thanks a lot for that much help.
I could extend the pattern a little bit, but after that there would just follow more wildcards. But the results are pretty good already.

Thats how much i got now:


            FileReader reader = new FileReader(new File("test.bin"));

            SequenceMatcherCompiler smc = new SequenceMatcherCompiler();
            SequenceMatcher matcher = smc.compile("64 84 71 94 00 18 60 8c 0e 86 71 94");

            HorspoolFinalFlagSearcher searcher = new HorspoolFinalFlagSearcher(matcher);

            ArrayList<Long> results = new ArrayList<>();

            List<SearchResult<SequenceMatcher>> result = searcher.searchForwards(reader);
            while(result.size()>0){
                results.add(result.get(0).getMatchPosition());
                result = searcher.searchForwards(reader, results.get(results.size()-1)+1);
            }

            for(int i = 0; i< results.size(); i++){
                System.out.println(results.get(i));
            }

            //get packages from offset
            ArrayList<byte[]> packages = new ArrayList<>();
            File file = new File("test.bin");
            RandomAccessFile raf = new RandomAccessFile(file, "r");

            for(int i = 0; i< results.size(); i++){
                packages.add(new byte[512]);
                raf.seek(results.get(i));
                raf.read(packages.get(packages.size()-1), 0, 512);
            }

            for(int i = 0; i< packages.size(); i++){
                System.out.println(new String(packages.get(i)));
            }

All matches are saved in results and extracted with a RandomAccessFile.
Its really fast. Thanks a lot for that.

@nishihatapalmer
Copy link
Owner

Great, I'm glad it working for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants