
Database compilation memory leak? #9

Closed
madswj opened this issue Nov 20, 2015 · 8 comments
@madswj
madswj commented Nov 20, 2015

Compiling a database using hs_compile_multi() uses a lot of memory. I tried to compile 100,000 patterns on a system with 128G memory and it ran out of memory before finishing.
When I last used Hyperscan (~2 years ago), it was able to handle that number of patterns.

@mdb256
mdb256 commented Nov 20, 2015

It should not be a memory leak, but we have made a few changes in that time so it is possible that the way those patterns are compiled has changed. 100,000 patterns is a lot of patterns, though, if they are complex.

Is it the same pattern set as 2 years ago? Are you able to describe the style of patterns? You can always contact us directly, or via the Hyperscan mailing list (on 01.org), if you prefer.

@madswj
madswj commented Nov 23, 2015

The pattern set consists of randomly generated patterns of 6-30 random letters followed by “[0-9]{2,10}” just to get a little RE into it. An example pattern: /kqvqfalogr[0-9]{2,10}/

Is that a particularly “bad” pattern for compilation? I tried it with plain strings of random letters, and that took only 25 seconds to compile.
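For reference, a minimal Python sketch of a pattern generator matching the description above (random literal of 6-30 lowercase letters followed by "[0-9]{2,10}"; the function name and the fixed seed are my own choices, not from the original test):

```python
import random
import string

def make_pattern(rng):
    # 6-30 random lowercase letters followed by a small bounded repeat,
    # e.g. kqvqfalogr[0-9]{2,10}
    literal = "".join(rng.choices(string.ascii_lowercase, k=rng.randint(6, 30)))
    return literal + "[0-9]{2,10}"

rng = random.Random(42)
patterns = [make_pattern(rng) for _ in range(100_000)]
print(patterns[0])
```

Because every pattern shares the same "[0-9]{2,10}" suffix, a set like this contains a great deal of structural similarity, which is what the analysis phase discussed below ends up chewing on.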

@mdb256
mdb256 commented Nov 24, 2015

Very interesting. No, that's not a particularly bad pattern at all - the problem arises because there are many similar patterns, and we have an analysis phase that spends a lot of time and memory trying to optimise for them.

We are now investigating improvements to this particular phase.

@mdb256 mdb256 self-assigned this Nov 24, 2015
@mdb256
mdb256 commented Dec 7, 2015

We've pushed commit 7bcd2b0 to the develop branch that improves this particular case. It now takes a couple of minutes and a lot less memory to compile.

This will be included in the next Hyperscan release.

@mdb256 mdb256 closed this as completed Dec 7, 2015
@madswj
madswj commented Dec 7, 2015

Thanks Matt, that sounds great. Looking forward to the release.

@YueHonghui
I faced the same issue when compiling our regex file, even with the newest version of the master branch. The regex file has about 270,000 lines, and the lines are all similar in style. Compiling the database took about 2 hours on a machine with a 1.8 GHz CPU and 64G of memory, and then the process was killed because it ran out of memory. A piece of the regex file is below:

76870   中(\p{Han}?|\P{Han}{0,5})文(\p{Han}?|\P{Han}{0,5})汉(\p{Han}?|\P{Han}{0,5})字
76871   一(\p{Han}?|\P{Han}{0,5})个(\p{Han}?|\P{Han}{0,5})例(\p{Han}?|\P{Han}{0,5})子
76872   就(\p{Han}?|\P{Han}{0,5})像(\p{Han}?|\P{Han}{0,5})这(\p{Han}?|\P{Han}{0,5})样(\p{Han}?|\P{Han}{0,5})的

compile flags:

HS_FLAG_CASELESS|HS_FLAG_DOTALL|HS_FLAG_UTF8|HS_FLAG_UCP

Are these regexes too complicated for Hyperscan, or is something else wrong?

@jviiret
jviiret commented Aug 8, 2016

Hi @YueHonghui,

Your issue is a little different from the earlier one in this report and should probably become its own separate issue on GitHub -- can you create a new issue? (Or, if you would prefer to contact us directly, feel free to email hyperscan@intel.com.)

Although they are short, the regex patterns you quote are made complex because of their use of Unicode properties like \p{Han} and \P{Han}. During pattern compilation, Hyperscan expands these constructs into the byte sequences that can match them, which leads to very large pattern graphs -- especially when these constructs are wrapped in bounded repeats like {0,5}.
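To get a feel for the scale of that expansion, here is a small Python illustration using just the main CJK Unified Ideographs block (U+4E00..U+9FFF) as a stand-in for the full Han property (the real \p{Han} class covers more blocks than this):

```python
# Measure how one CJK block fans out in UTF-8: many codepoints, all
# encoding to 3-byte sequences spread across several lead bytes.
codepoints = range(0x4E00, 0xA000)
encodings = [chr(cp).encode("utf-8") for cp in codepoints]

print(len(encodings))                  # 20992 codepoints in this block alone
print({len(e) for e in encodings})     # {3} - every one is a 3-byte sequence
print(len({e[0] for e in encodings}))  # 6 distinct lead bytes
```

A byte-oriented matcher has to represent all of those byte sequences in its pattern graph, and wrapping such a class in a bounded repeat like {0,5} multiplies that structure again.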

While Hyperscan is still able to handle these, 270,000 patterns of this form is a very large case -- it's not surprising to see very large memory requirements here.

Can you describe what your application is doing with these patterns in a bit more detail? Is the sub-pattern (\p{Han}?|\P{Han}{0,5}) precisely what you need to match, or could it be weakened to a simpler sub-pattern? Do your patterns need separate IDs, or could they use a single ID or small group of IDs? (This might allow for some merging of data structures at pattern compile time.)

Finally, if you can share a larger sample of your patterns, either in a GitHub issue or via email, that would make it easier to see whether there are improvements we can make to reduce the memory requirements for this case.

@YueHonghui
@jviiret Thank you for your reply. I've sent an email to hyperscan@intel.com.
