Make the regexp engine respect via Thread.interrupt(). #6

jordansissel · 2013-04-02T17:31:24Z

The use case is that some regexp executions can take exponential time.
A documented case and explanation is found here:
http://swtch.com/~rsc/regexp/regexp1.html

In cases where exponential runtime is undesirable, it would be nice if
we could interrupt the engine. To do this, running the pattern in a
separate thread and then interrupting that thread when it doesn't
complete on schule.

The use case is that some regexp executions can take exponential time. A documented case and explanation is found here: http://swtch.com/~rsc/regexp/regexp1.html In cases where exponential runtime is undesirable, it would be nice if we could interrupt the engine. To do this, running the pattern in a separate thread and then interrupting that thread when it doesn't complete on schule.

jordansissel · 2013-04-02T17:32:27Z

src/org/joni/ByteCodeMachine.java

@@ -186,6 +186,9 @@ protected final int matchAt(int range, int sstart, int sprev) {

        final int[]code = this.code;
        while (true) {
+            if (Thread.interrupted()) {


I think it's worth doing some performance tests before this patch is merged. I don't want to have the Thread.interrupted() check impact much, if possible.

Maybe? I can line those up if you think it's worth doing.

jordansissel · 2013-04-15T00:11:39Z

Did some benchmarks on a regexp I know to execute long -

Summary

These tests were done on my home workstation which was not necessarily idle during the tests; so I would pick the 'best case' performance from each run. 13.72 seconds w/o patch, 14.37 w/ patch.

Rough hand-waving, I'd say this incurred a 4% performance penalty, which isn't bad.

WIthout this patch:

Rehearsal ----------------------------------------------------
/a{#count}...../  13.880000   0.000000  13.880000 ( 13.846000)
/a{#count}...../  13.970000   0.000000  13.970000 ( 13.959000)
/a{#count}...../  14.810000   0.000000  14.810000 ( 14.821000)
/a{#count}...../  15.040000   0.000000  15.040000 ( 15.058000)
/a{#count}...../  16.660000   0.000000  16.660000 ( 16.679000)
------------------------------------------ total: 74.360000sec

                       user     system      total        real
/a{#count}...../  13.720000   0.000000  13.720000 ( 13.737000)
/a{#count}...../  15.560000   0.000000  15.560000 ( 15.571000)
/a{#count}..../ . 15.350000   0.000000  15.350000 ( 15.374000)
/a{#count}...../  13.990000   0.000000  13.990000 ( 14.007000)
/a{#count}...../  15.650000   0.000000  15.650000 ( 15.676000)
------------------------------------------ total: 76.120000sec

With this patch

Rehearsal ----------------------------------------------------
/a{#count}...../  16.920000   0.000000  16.920000 ( 16.891000)
/a{#count}...../  14.370000   0.000000  14.370000 ( 14.357000)
/a{#count}...../  15.390000   0.000000  15.390000 ( 15.413000)
/a{#count}...../  14.800000   0.000000  14.800000 ( 14.812000)
/a{#count}...../  14.640000   0.000000  14.640000 ( 14.654000)
------------------------------------------ total: 76.120000sec

                       user     system      total        real
/a{#count}...../  15.560000   0.000000  15.560000 ( 15.574000)
/a{#count}...../  19.620000   0.000000  19.620000 ( 19.643000)
/a{#count}...../  18.230000   0.010000  18.240000 ( 18.256000)
/a{#count}...../  18.920000   0.000000  18.920000 ( 18.934000)
/a{#count}...../  19.320000   0.000000  19.320000 (

Code

The purpose of this code was to spend a long time in the regexp engine to see what the interrupted() check costs.

``ruby
count = 29
str = "a" * count # aaaa...
re = Regexp.new("a?" * count + "a" * count) # a?a?a?a?aaaa ...

require "benchmark"

Benchmark.bmbm do |x|
5.times do
x.report("/a{#count}...../") { re.match(str) }
end
end

jordansissel · 2013-04-15T00:12:31Z

I see the travis-ci failures now, I will fix those.

jordansissel · 2013-05-01T18:32:57Z

(still pending me fixing the tests, which is mostly just going to be adding 'throws InterruptedException' anywhere that complains)

headius · 2013-05-01T18:36:50Z

Sorry about the lack of progress on getting this in.

I talked about it with @enebo and @lopex a bit. We all like the idea, but there's some concerns about excessive interrupt checking. Ideally we'd only need to do it on backtracking or backward branches, since otherwise the regexp is moving forward and will eventually complete. @enebo can probably explain this a little better, since he talked to @lopex before I did.

So in short, we want the feature in, but we need to reduce the perf impact a bit.

jordansissel · 2013-05-01T18:40:55Z

Ahh, totally makes sense. I can work in the interrupted checks to occur only at the branches you mentioned.

jordansissel · 2013-05-01T18:47:21Z

1058b09 fixes the tests.

headius · 2013-05-07T20:34:01Z

Any status here? We're looking to push 1.7.4 toward end of week.

jordansissel · 2013-05-07T21:36:44Z

I'll see what I can do to get a patch your way in a day or so.

Make the regexp engine respect via Thread.interrupt(). Will move check elsewhere but most of the commit is good as-is.

jordansissel · 2013-05-09T19:34:14Z

Re: checking interrupt() only at the necessary places, I had a go reading the opcodes and state machine, but I didn't have enough time to fully sink it into my brain.

If there's worry about performance, I'll spend more time on it when I can.

enebo · 2013-05-10T13:59:08Z

I am working on improving the impact of this. For Java 7 the impact is not bad but for Java 6 it is nasty (15s versus 90s per match). I am going to perhaps only check every 20 instructions executed. I am wondering if that still gives reasonable granularity for this class of regexp?

jordansissel · 2013-05-10T16:33:34Z

I think so. My main goal is to be able to interrupt it at all, every 20 or 100 instructions is probably a good start. I looked to see where backtracking occurred but I couldn't find anything obvious - trying to visualize the state machine from the opcode list left me a bit confused ;)

enebo · 2013-05-10T16:47:18Z

I wrote a Java unit test for this and even only checking once every 30_000_000 instrs for your supplied regexp it still interrupts in 10ms. I think this is only one class of pathological regexp and I am wondering if you have any others which take a long time?

jordansissel · 2013-05-10T18:51:38Z

I'll scan through the logstash bug reports related to this to see if I come up with a real-world example.

enebo · 2013-05-10T18:55:39Z

If you have one against a large string it would also be helpful. Large strings can do lots of large scans between instructions potentially. I look at matches against huge strings as a second potential class of regexp to interrupt. We are less worried about interrupting someone doing a complicated regexp against a 10Mb string. That case seems more like user error, but if we can easily handle it we will...

enebo · 2013-05-10T21:35:56Z

I went with once every 30_000 which is way overkill for the submitted regexp but should be more sensitive to regexps with less instrs. I also fixed a bug where when we throw we need to reset the interrupt state by calling interrupted (bonus: two more unit tests for this).

Additionally, not covered in original patch was that we now will return -2 on interrupt for non *Interrupt versions of search and match. This is a significant enough change to change the major version on joni. Users using <0 will still know the regexp did not match but it might have been due to thread being interrupted.

We can spin more releases to address newly found regexps which are not interrupting properly. I think the current performance is almost 10% slower on Java 6 for that regexp but nearly 0 for Java 7+. Probably more than worth it to be able to interrupt bad regexps.

jordansissel reviewed Apr 2, 2013
View reviewed changes

Add InterruptedException awareness to the tests.

1058b09

enebo added a commit that referenced this pull request May 9, 2013

Merge pull request #6 from jordansissel/respect-thread-interrupt

1ec842a

Make the regexp engine respect via Thread.interrupt(). Will move check elsewhere but most of the commit is good as-is.

enebo merged commit 1ec842a into jruby:master May 9, 2013

jordansissel deleted the respect-thread-interrupt branch May 9, 2013 19:34

colinsurprenant mentioned this pull request Jul 19, 2019

[meta] kv and grok filters timeout handling elastic/logstash#10976

Closed

headius mentioned this pull request Nov 30, 2021

Limit Date.parse input length and make interruptible jruby/jruby#6951

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make the regexp engine respect via Thread.interrupt(). #6

Make the regexp engine respect via Thread.interrupt(). #6

jordansissel commented Apr 2, 2013

jordansissel Apr 2, 2013

jordansissel commented Apr 15, 2013

jordansissel commented Apr 15, 2013

jordansissel commented May 1, 2013

headius commented May 1, 2013

jordansissel commented May 1, 2013

jordansissel commented May 1, 2013

headius commented May 7, 2013

jordansissel commented May 7, 2013

jordansissel commented May 9, 2013

enebo commented May 10, 2013

jordansissel commented May 10, 2013

enebo commented May 10, 2013

jordansissel commented May 10, 2013

enebo commented May 10, 2013

enebo commented May 10, 2013

Make the regexp engine respect via Thread.interrupt(). #6

Make the regexp engine respect via Thread.interrupt(). #6

Conversation

jordansissel commented Apr 2, 2013

jordansissel Apr 2, 2013

Choose a reason for hiding this comment

jordansissel commented Apr 15, 2013

Summary

WIthout this patch:

With this patch

Code

jordansissel commented Apr 15, 2013

jordansissel commented May 1, 2013

headius commented May 1, 2013

jordansissel commented May 1, 2013

jordansissel commented May 1, 2013

headius commented May 7, 2013

jordansissel commented May 7, 2013

jordansissel commented May 9, 2013

enebo commented May 10, 2013

jordansissel commented May 10, 2013

enebo commented May 10, 2013

jordansissel commented May 10, 2013

enebo commented May 10, 2013

enebo commented May 10, 2013