Database compilation memory leak? #9
Comments
It should not be a memory leak, but we have made a few changes in that time so it is possible that the way those patterns are compiled has changed. 100,000 patterns is a lot of patterns, though, if they are complex. Is it the same pattern set as 2 years ago? Are you able to describe the style of patterns? You can always contact us directly, or via the Hyperscan mailing list (on 01.org), if you prefer. |
The pattern set consists of randomly generated patterns of 6-30 random letters followed by “[0-9]{2,10}” just to get a little RE into it. An example pattern: /kqvqfalogr[0-9]{2,10}/ Is that a particularly “bad” pattern for the compilation? I just tried it with just strings of random letters, and that only took 25 seconds to compile. |
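For anyone who wants to reproduce a pattern set of this shape, here is a minimal sketch in C. It only builds the strings: 6-30 random lowercase letters followed by `[0-9]{2,10}`. The function names (`make_pattern`, `make_patterns`, `free_patterns`), the seed parameter, and the use of `rand()` are my own assumptions for illustration and not the generator actually used for the test above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MIN_LETTERS 6
#define MAX_LETTERS 30
static const char SUFFIX[] = "[0-9]{2,10}";

/* Buffer size that fits the longest pattern plus its terminating NUL. */
#define PATTERN_BUF (MAX_LETTERS + sizeof(SUFFIX))

/* Hypothetical helper: frees the first `count` patterns and the array. */
void free_patterns(char **pats, size_t count)
{
    if (!pats)
        return;
    for (size_t i = 0; i < count; i++)
        free(pats[i]);
    free(pats);
}

/* Writes one pattern into `out` and returns the number of letters used. */
size_t make_pattern(char *out, size_t out_size)
{
    assert(out_size >= PATTERN_BUF);
    size_t n = MIN_LETTERS + (size_t)(rand() % (MAX_LETTERS - MIN_LETTERS + 1));
    for (size_t i = 0; i < n; i++)
        out[i] = (char)('a' + rand() % 26);   /* assumes an ASCII-like charset */
    memcpy(out + n, SUFFIX, sizeof(SUFFIX));  /* also copies the NUL */
    return n;
}

/* Builds `count` patterns; returns NULL if an allocation fails. */
char **make_patterns(size_t count, unsigned int seed)
{
    srand(seed);
    char **pats = calloc(count, sizeof(*pats));
    if (!pats)
        return NULL;
    for (size_t i = 0; i < count; i++) {
        pats[i] = malloc(PATTERN_BUF);
        if (!pats[i]) {
            free_patterns(pats, i);
            return NULL;
        }
        make_pattern(pats[i], PATTERN_BUF);
    }
    return pats;
}
```

Calling `make_patterns(100000, seed)` gives an array of C strings that could be handed to hs_compile_multi() as its expression list (in C a cast to `const char *const *` is needed); release it afterwards with `free_patterns`.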
Very interesting. No, that's not a particularly bad pattern at all - the problem arises because there are many patterns with similarity, and we have an analysis phase that is spending a lot of time and memory trying to optimise for these patterns. We are now investigating improvements to this particular phase. |
We've pushed commit 7bcd2b0 to the develop branch that improves this particular case. It now takes a couple of minutes and a lot less memory to compile. This will be included in the next Hyperscan release. |
Thanks Matt, that sounds great. Looking forward to the release. |
I hit the same issue when compiling our regex file, even with the newest version of the master branch. The regex file has about 270,000 lines, and the lines all look similar. Compiling the database ran for about 2 hours on a machine with a 1.8 GHz CPU and 64 GB of memory, and then the process was killed for running out of memory. A piece of the regex file is below:
compile flags:
Are these regexes too complicated for Hyperscan, or is something else wrong? |
Hi @YueHonghui, Your issue is a little bit different from the earlier one in this report, actually, and should probably become its own separate issue on GitHub -- can you create a new issue? (Or if you would prefer to contact us directly, feel free to email hyperscan@intel.com.) Although they are short, the regex patterns you quote are made complex because of their use of Unicode properties like While Hyperscan is still able to handle these, 270,000 patterns of this form is a very large case -- it's not surprising to see very large memory requirements here. Can you describe what your application is doing with these patterns in a bit more detail? Is the sub-pattern Finally, if you can share a larger sample of your patterns, either in a GitHub issue or via email, that would make it easier to see if there are improvements we can make that reduce the memory requirements for this case. |
@jviiret Thank you for your reply, I've sent an email to hyperscan@intel.com. |
Compiling a database using hs_compile_multi() uses a lot of memory. I tried to compile 100,000 patterns on a system with 128 GB of memory and it ran out of memory before finishing.
In a previous experience with Hyperscan (~2 years ago) it was able to handle that number of patterns.