
Database compilation memory leak? #9

Closed
madswj opened this issue Nov 20, 2015 · 8 comments
@madswj
madswj commented Nov 20, 2015

Compiling a database using hs_compile_multi() uses a lot of memory. I tried to compile 100,000 patterns on a system with 128G memory and it ran out of memory before finishing.
When I last used Hyperscan (~2 years ago), it was able to handle that number of patterns.

@mdb256
mdb256 commented Nov 20, 2015

It should not be a memory leak, but we have made a few changes in that time so it is possible that the way those patterns are compiled has changed. 100,000 patterns is a lot of patterns, though, if they are complex.

Is it the same pattern set as 2 years ago? Are you able to describe the style of patterns? You can always contact us directly, or via the Hyperscan mailing list (on 01.org), if you prefer.

@madswj
madswj commented Nov 23, 2015

The pattern set consists of randomly generated patterns of 6-30 random letters followed by “[0-9]{2,10}” just to get a little RE into it. An example pattern: /kqvqfalogr[0-9]{2,10}/

Is that a particularly “bad” pattern for compilation? I tried it with plain strings of random letters, and that took only 25 seconds to compile.
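For reference, a minimal Python sketch of a pattern generator matching the description above (random literal of 6-30 lowercase letters followed by "[0-9]{2,10}"; the function name and the fixed seed are my own choices, not from the original test):

```python
import random
import string

def make_pattern(rng):
    # 6-30 random lowercase letters followed by a small bounded repeat,
    # e.g. kqvqfalogr[0-9]{2,10}
    literal = "".join(rng.choices(string.ascii_lowercase, k=rng.randint(6, 30)))
    return literal + "[0-9]{2,10}"

rng = random.Random(42)
patterns = [make_pattern(rng) for _ in range(100_000)]
print(patterns[0])
```

Because every pattern shares the same "[0-9]{2,10}" suffix, a set like this contains a great deal of structural similarity, which is what the analysis phase discussed below ends up chewing on.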

@mdb256
mdb256 commented Nov 24, 2015

Very interesting. No, that's not a particularly bad pattern at all - the problem arises because there are many similar patterns, and we have an analysis phase that spends a lot of time and memory trying to optimise for them.

We are now investigating improvements to this particular phase.

@mdb256 mdb256 self-assigned this Nov 24, 2015
@mdb256
mdb256 commented Dec 7, 2015

We've pushed commit 7bcd2b0 to the develop branch that improves this particular case. It now takes a couple of minutes and a lot less memory to compile.

This will be included in the next Hyperscan release.

@mdb256 mdb256 closed this as completed Dec 7, 2015
@madswj
madswj commented Dec 7, 2015

Thanks Matt, that sounds great. Looking forward to the release.

@YueHonghui
I faced the same issue when compiling our regex file, even with the newest version of the master branch. The regex file has about 270,000 lines, and the lines are all similar in style. Compiling the database took about 2 hours on a machine with a 1.8 GHz CPU and 64G of memory, and then the process was killed because it ran out of memory. A piece of the regex file is below:

76870   中(\p{Han}?|\P{Han}{0,5})文(\p{Han}?|\P{Han}{0,5})汉(\p{Han}?|\P{Han}{0,5})字
76871   一(\p{Han}?|\P{Han}{0,5})个(\p{Han}?|\P{Han}{0,5})例(\p{Han}?|\P{Han}{0,5})子
76872   就(\p{Han}?|\P{Han}{0,5})像(\p{Han}?|\P{Han}{0,5})这(\p{Han}?|\P{Han}{0,5})样(\p{Han}?|\P{Han}{0,5})的

compile flags:

HS_FLAG_CASELESS|HS_FLAG_DOTALL|HS_FLAG_UTF8|HS_FLAG_UCP

Are these regexes too complicated for Hyperscan, or is something else wrong?

@jviiret
jviiret commented Aug 8, 2016

Hi @YueHonghui,

Your issue is a little different from the earlier one in this report and should probably become its own separate issue on GitHub -- can you create a new issue? (Or, if you would prefer to contact us directly, feel free to email hyperscan@intel.com.)

Although they are short, the regex patterns you quote are made complex because of their use of Unicode properties like \p{Han} and \P{Han}. During pattern compilation, Hyperscan expands these constructs into the byte sequences that can match them, which leads to very large pattern graphs -- especially when these constructs are wrapped in bounded repeats like {0,5}.
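To get a feel for the scale of that expansion, here is a small Python illustration using just the main CJK Unified Ideographs block (U+4E00..U+9FFF) as a stand-in for the full Han property (the real \p{Han} class covers more blocks than this):

```python
# Measure how one CJK block fans out in UTF-8: many codepoints, all
# encoding to 3-byte sequences spread across several lead bytes.
codepoints = range(0x4E00, 0xA000)
encodings = [chr(cp).encode("utf-8") for cp in codepoints]

print(len(encodings))                  # 20992 codepoints in this block alone
print({len(e) for e in encodings})     # {3} - every one is a 3-byte sequence
print(len({e[0] for e in encodings}))  # 6 distinct lead bytes
```

A byte-oriented matcher has to represent all of those byte sequences in its pattern graph, and wrapping such a class in a bounded repeat like {0,5} multiplies that structure again.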

While Hyperscan is still able to handle these, 270,000 patterns of this form is a very large case -- it's not surprising to see very large memory requirements here.

Can you describe what your application is doing with these patterns in a bit more detail? Is the sub-pattern (\p{Han}?|\P{Han}{0,5}) precisely what you need to match, or could it be weakened to a simpler sub-pattern? Do your patterns need separate IDs, or could they use a single ID or small group of IDs? (This might allow for some merging of data structures at pattern compile time.)

Finally, if you can share a larger sample of your patterns, either in a GitHub issue or via email, that would make it easier to see whether there are improvements we can make to reduce the memory requirements for this case.

@YueHonghui
@jviiret Thank you for your reply. I've sent an email to hyperscan@intel.com.
