Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the regexp engine respect via Thread.interrupt(). #6

Merged
merged 2 commits into from
May 9, 2013

Conversation

jordansissel
Copy link
Contributor

The use case is that some regexp executions can take exponential time.
A documented case and explanation is found here:
http://swtch.com/~rsc/regexp/regexp1.html

In cases where exponential runtime is undesirable, it would be nice if
we could interrupt the engine. To do this, running the pattern in a
separate thread and then interrupting that thread when it doesn't
complete on schule.

The use case is that some regexp executions can take exponential time.
A documented case and explanation is found here:
  http://swtch.com/~rsc/regexp/regexp1.html

In cases where exponential runtime is undesirable, it would be nice if
we could interrupt the engine. To do this, running the pattern in a
separate thread and then interrupting that thread when it doesn't
complete on schule.
@@ -186,6 +186,9 @@ protected final int matchAt(int range, int sstart, int sprev) {

final int[]code = this.code;
while (true) {
if (Thread.interrupted()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth doing some performance tests before this patch is merged. I don't want to have the Thread.interrupted() check impact much, if possible.

Maybe? I can line those up if you think it's worth doing.

@jordansissel
Copy link
Contributor Author

Did some benchmarks on a regexp I know to execute long -

Summary

These tests were done on my home workstation which was not necessarily idle during the tests; so I would pick the 'best case' performance from each run. 13.72 seconds w/o patch, 14.37 w/ patch.

Rough hand-waving, I'd say this incurred a 4% performance penalty, which isn't bad.

WIthout this patch:

Rehearsal ----------------------------------------------------
/a{#count}...../  13.880000   0.000000  13.880000 ( 13.846000)
/a{#count}...../  13.970000   0.000000  13.970000 ( 13.959000)
/a{#count}...../  14.810000   0.000000  14.810000 ( 14.821000)
/a{#count}...../  15.040000   0.000000  15.040000 ( 15.058000)
/a{#count}...../  16.660000   0.000000  16.660000 ( 16.679000)
------------------------------------------ total: 74.360000sec

                       user     system      total        real
/a{#count}...../  13.720000   0.000000  13.720000 ( 13.737000)
/a{#count}...../  15.560000   0.000000  15.560000 ( 15.571000)
/a{#count}..../ . 15.350000   0.000000  15.350000 ( 15.374000)
/a{#count}...../  13.990000   0.000000  13.990000 ( 14.007000)
/a{#count}...../  15.650000   0.000000  15.650000 ( 15.676000)
------------------------------------------ total: 76.120000sec

With this patch

Rehearsal ----------------------------------------------------
/a{#count}...../  16.920000   0.000000  16.920000 ( 16.891000)
/a{#count}...../  14.370000   0.000000  14.370000 ( 14.357000)
/a{#count}...../  15.390000   0.000000  15.390000 ( 15.413000)
/a{#count}...../  14.800000   0.000000  14.800000 ( 14.812000)
/a{#count}...../  14.640000   0.000000  14.640000 ( 14.654000)
------------------------------------------ total: 76.120000sec

                       user     system      total        real
/a{#count}...../  15.560000   0.000000  15.560000 ( 15.574000)
/a{#count}...../  19.620000   0.000000  19.620000 ( 19.643000)
/a{#count}...../  18.230000   0.010000  18.240000 ( 18.256000)
/a{#count}...../  18.920000   0.000000  18.920000 ( 18.934000)
/a{#count}...../  19.320000   0.000000  19.320000 ( 

Code

The purpose of this code was to spend a long time in the regexp engine to see what the interrupted() check costs.

``ruby
count = 29
str = "a" * count # aaaa...
re = Regexp.new("a?" * count + "a" * count) # a?a?a?a?aaaa ...

require "benchmark"

Benchmark.bmbm do |x|
5.times do
x.report("/a{#count}...../") { re.match(str) }
end
end

@jordansissel
Copy link
Contributor Author

I see the travis-ci failures now, I will fix those.

@jordansissel
Copy link
Contributor Author

(still pending me fixing the tests, which is mostly just going to be adding 'throws InterruptedException' anywhere that complains)

@headius
Copy link
Member

headius commented May 1, 2013

Sorry about the lack of progress on getting this in.

I talked about it with @enebo and @lopex a bit. We all like the idea, but there's some concerns about excessive interrupt checking. Ideally we'd only need to do it on backtracking or backward branches, since otherwise the regexp is moving forward and will eventually complete. @enebo can probably explain this a little better, since he talked to @lopex before I did.

So in short, we want the feature in, but we need to reduce the perf impact a bit.

@jordansissel
Copy link
Contributor Author

Ahh, totally makes sense. I can work in the interrupted checks to occur only at the branches you mentioned.

@jordansissel
Copy link
Contributor Author

1058b09 fixes the tests.

@headius
Copy link
Member

headius commented May 7, 2013

Any status here? We're looking to push 1.7.4 toward end of week.

@jordansissel
Copy link
Contributor Author

I'll see what I can do to get a patch your way in a day or so.

enebo added a commit that referenced this pull request May 9, 2013
Make the regexp engine respect via Thread.interrupt().  Will move check elsewhere but most of the commit is good as-is.
@enebo enebo merged commit 1ec842a into jruby:master May 9, 2013
@jordansissel
Copy link
Contributor Author

Re: checking interrupt() only at the necessary places, I had a go reading the opcodes and state machine, but I didn't have enough time to fully sink it into my brain.

If there's worry about performance, I'll spend more time on it when I can.

@jordansissel jordansissel deleted the respect-thread-interrupt branch May 9, 2013 19:34
@enebo
Copy link
Member

enebo commented May 10, 2013

I am working on improving the impact of this. For Java 7 the impact is not bad but for Java 6 it is nasty (15s versus 90s per match). I am going to perhaps only check every 20 instructions executed. I am wondering if that still gives reasonable granularity for this class of regexp?

@jordansissel
Copy link
Contributor Author

I think so. My main goal is to be able to interrupt it at all, every 20 or 100 instructions is probably a good start. I looked to see where backtracking occurred but I couldn't find anything obvious - trying to visualize the state machine from the opcode list left me a bit confused ;)

@enebo
Copy link
Member

enebo commented May 10, 2013

I wrote a Java unit test for this and even only checking once every 30_000_000 instrs for your supplied regexp it still interrupts in 10ms. I think this is only one class of pathological regexp and I am wondering if you have any others which take a long time?

@jordansissel
Copy link
Contributor Author

I'll scan through the logstash bug reports related to this to see if I come up with a real-world example.

@enebo
Copy link
Member

enebo commented May 10, 2013

If you have one against a large string it would also be helpful. Large strings can do lots of large scans between instructions potentially. I look at matches against huge strings as a second potential class of regexp to interrupt. We are less worried about interrupting someone doing a complicated regexp against a 10Mb string. That case seems more like user error, but if we can easily handle it we will...

@enebo
Copy link
Member

enebo commented May 10, 2013

I went with once every 30_000 which is way overkill for the submitted regexp but should be more sensitive to regexps with less instrs. I also fixed a bug where when we throw we need to reset the interrupt state by calling interrupted (bonus: two more unit tests for this).

Additionally, not covered in original patch was that we now will return -2 on interrupt for non *Interrupt versions of search and match. This is a significant enough change to change the major version on joni. Users using <0 will still know the regexp did not match but it might have been due to thread being interrupted.

We can spin more releases to address newly found regexps which are not interrupting properly. I think the current performance is almost 10% slower on Java 6 for that regexp but nearly 0 for Java 7+. Probably more than worth it to be able to interrupt bad regexps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants