-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make the regexp engine respect via Thread.interrupt(). #6
Conversation
The use case is that some regexp executions can take exponential time. A documented case and explanation is found here: http://swtch.com/~rsc/regexp/regexp1.html In cases where exponential runtime is undesirable, it would be nice if we could interrupt the engine. To do this, running the pattern in a separate thread and then interrupting that thread when it doesn't complete on schule.
@@ -186,6 +186,9 @@ protected final int matchAt(int range, int sstart, int sprev) { | |||
|
|||
final int[]code = this.code; | |||
while (true) { | |||
if (Thread.interrupted()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth doing some performance tests before this patch is merged. I don't want to have the Thread.interrupted() check impact much, if possible.
Maybe? I can line those up if you think it's worth doing.
Did some benchmarks on a regexp I know to execute long - SummaryThese tests were done on my home workstation which was not necessarily idle during the tests; so I would pick the 'best case' performance from each run. 13.72 seconds w/o patch, 14.37 w/ patch. Rough hand-waving, I'd say this incurred a 4% performance penalty, which isn't bad. WIthout this patch:
With this patch
CodeThe purpose of this code was to spend a long time in the regexp engine to see what the interrupted() check costs. ``ruby require "benchmark" Benchmark.bmbm do |x|
|
I see the travis-ci failures now, I will fix those. |
(still pending me fixing the tests, which is mostly just going to be adding 'throws InterruptedException' anywhere that complains) |
Sorry about the lack of progress on getting this in. I talked about it with @enebo and @lopex a bit. We all like the idea, but there's some concerns about excessive interrupt checking. Ideally we'd only need to do it on backtracking or backward branches, since otherwise the regexp is moving forward and will eventually complete. @enebo can probably explain this a little better, since he talked to @lopex before I did. So in short, we want the feature in, but we need to reduce the perf impact a bit. |
Ahh, totally makes sense. I can work in the interrupted checks to occur only at the branches you mentioned. |
1058b09 fixes the tests. |
Any status here? We're looking to push 1.7.4 toward end of week. |
I'll see what I can do to get a patch your way in a day or so. |
Make the regexp engine respect via Thread.interrupt(). Will move check elsewhere but most of the commit is good as-is.
Re: checking interrupt() only at the necessary places, I had a go reading the opcodes and state machine, but I didn't have enough time to fully sink it into my brain. If there's worry about performance, I'll spend more time on it when I can. |
I am working on improving the impact of this. For Java 7 the impact is not bad but for Java 6 it is nasty (15s versus 90s per match). I am going to perhaps only check every 20 instructions executed. I am wondering if that still gives reasonable granularity for this class of regexp? |
I think so. My main goal is to be able to interrupt it at all, every 20 or 100 instructions is probably a good start. I looked to see where backtracking occurred but I couldn't find anything obvious - trying to visualize the state machine from the opcode list left me a bit confused ;) |
I wrote a Java unit test for this and even only checking once every 30_000_000 instrs for your supplied regexp it still interrupts in 10ms. I think this is only one class of pathological regexp and I am wondering if you have any others which take a long time? |
I'll scan through the logstash bug reports related to this to see if I come up with a real-world example. |
If you have one against a large string it would also be helpful. Large strings can do lots of large scans between instructions potentially. I look at matches against huge strings as a second potential class of regexp to interrupt. We are less worried about interrupting someone doing a complicated regexp against a 10Mb string. That case seems more like user error, but if we can easily handle it we will... |
I went with once every 30_000 which is way overkill for the submitted regexp but should be more sensitive to regexps with less instrs. I also fixed a bug where when we throw we need to reset the interrupt state by calling interrupted (bonus: two more unit tests for this). Additionally, not covered in original patch was that we now will return -2 on interrupt for non *Interrupt versions of search and match. This is a significant enough change to change the major version on joni. Users using <0 will still know the regexp did not match but it might have been due to thread being interrupted. We can spin more releases to address newly found regexps which are not interrupting properly. I think the current performance is almost 10% slower on Java 6 for that regexp but nearly 0 for Java 7+. Probably more than worth it to be able to interrupt bad regexps. |
The use case is that some regexp executions can take exponential time.
A documented case and explanation is found here:
http://swtch.com/~rsc/regexp/regexp1.html
In cases where exponential runtime is undesirable, it would be nice if
we could interrupt the engine. To do this, running the pattern in a
separate thread and then interrupting that thread when it doesn't
complete on schule.