
regexp: port RE2's DFA matcher to the regexp package #11646

Open
michaelmatloob opened this issue Jul 9, 2015 · 35 comments

Comments


@michaelmatloob michaelmatloob commented Jul 9, 2015

The regexp package currently chooses between the standard NFA matcher, onepass, or the backtracker. This proposal is for porting over RE2's DFA matcher to be used as an option by exec.

@michaelmatloob michaelmatloob changed the title from "proposal: port RE2's DFA matcher to the regexp package" to "proposal: regexp: port RE2's DFA matcher to the regexp package" Jul 9, 2015
@ianlancetaylor ianlancetaylor added this to the Unplanned milestone Jul 10, 2015

@ianlancetaylor ianlancetaylor commented Jul 10, 2015

CC @rsc


@bradfitz bradfitz commented Jul 13, 2015

IIRC, @rsc and @robpike were discussing this a few weeks ago and @rsc said he'd hit some design stumbling block where the C++ implementation didn't have an obvious Go translation.

Also, you were mentioning that the UTF-8-edness of Go's regexp package and its runes makes the 256-entry tables (which are just bytes in the C++ version) potentially tricky. Not sure whether restricting the DFA optimization to ASCII-only regexps helps or not.


@michaelmatloob michaelmatloob commented Jul 26, 2015

One problem @rsc mentioned was the allocation of state data, but he gave me a solution so I don't think that will be a big problem.

After looking some more at the rune vs. byte issue, I don't think we would need to skip the DFA for non-ASCII regexps. The DFA builds a set of character ranges and indexes the output transitions by range. We could do the same range lookup for input runes < 256 (or some other arbitrary table size) and do a slower lookup for larger runes.

Of course, there are a number of other things that don't map easily. I think most of those will have replacements, and for any that don't we'll be able to do something slightly slower than RE2 but still much faster than the NFA matcher.
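
A minimal sketch of that split lookup (names and layout are hypothetical, not taken from any CL): transitions for runes below a fixed table size use a dense array, and larger runes fall back to a binary search over the compiled ranges.

    package dfa

    import "sort"

    const tableSize = 256 // arbitrary cutoff between the dense table and the range lookup

    type rangeEntry struct {
        lo, hi rune // inclusive rune range
        next   int  // index of the next DFA state
    }

    type dfaState struct {
        small  [tableSize]int // next state for runes < tableSize; -1 means no transition
        ranges []rangeEntry   // sorted, non-overlapping ranges for runes >= tableSize
    }

    func (s *dfaState) next(r rune) int {
        if r < tableSize {
            return s.small[r] // fast path: a single array index
        }
        // Slow path: binary search for the first range whose upper bound is >= r.
        i := sort.Search(len(s.ranges), func(i int) bool { return s.ranges[i].hi >= r })
        if i < len(s.ranges) && s.ranges[i].lo <= r {
            return s.ranges[i].next
        }
        return -1 // no transition for this rune
    }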


@robpike robpike commented Jul 27, 2015

I would first like to understand why the NFA is not performant. I am afraid that adding a DFA (presumably for performance reasons) may hit the same problems.

@mikioh mikioh added the Proposal label Aug 13, 2015

@rsc rsc commented Oct 24, 2015

@michaelmatloob and I talked about this at Gophercon. I'm okay with adding this provided:

  1. A bound on the space used by the DFA can be kept.
  2. When the DFA cache runs out of space and must be flushed, the allocated memory can be reused directly. (That is, we don't allocate a new cache and rely on a GC cycle to reclaim the old cache memory.)
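
A rough sketch of a cache that satisfies both conditions (the types and the limit here are invented for illustration): states live in a slice allocated once at full capacity, and on overflow the slice is truncated and the lookup map is emptied in place, so a flush reuses the existing memory instead of handing the old cache to the GC.

    package dfa

    const maxStates = 1 << 12 // arbitrary bound on the number of cached DFA states

    type state struct {
        next [256]int32 // transition table indexed by input byte; -1 = not yet computed
    }

    type cache struct {
        states []state          // backing array, allocated once with capacity maxStates
        index  map[string]int32 // canonical key for an NFA state set -> index into states
    }

    func newCache() *cache {
        return &cache{
            states: make([]state, 0, maxStates),
            index:  make(map[string]int32),
        }
    }

    // lookup returns the cached DFA state for key, adding it if there is room.
    // When the bound is reached, the cache is reset in place and filling restarts.
    func (c *cache) lookup(key string) int32 {
        if i, ok := c.index[key]; ok {
            return i
        }
        if len(c.states) == cap(c.states) {
            c.reset()
        }
        c.states = c.states[:len(c.states)+1]
        i := int32(len(c.states) - 1)
        for b := range c.states[i].next {
            c.states[i].next[b] = -1 // clear any stale data left over from before a reset
        }
        c.index[key] = i
        return i
    }

    // reset reuses the allocated memory directly: the slice is truncated and the
    // map entries are deleted, keeping the map's storage for the next round.
    func (c *cache) reset() {
        c.states = c.states[:0]
        for k := range c.index {
            delete(c.index, k)
        }
    }

A real matcher would also have to restart the in-progress search when a flush invalidates the state it was standing in; that detail is omitted here.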
@rsc rsc changed the title from "proposal: regexp: port RE2's DFA matcher to the regexp package" to "regexp: port RE2's DFA matcher to the regexp package" Oct 24, 2015

@matloob matloob commented Mar 4, 2016

I'm working on this and hope to have it ready for 1.7.

(How do I add myself as an assignee?)

@matloob matloob self-assigned this Mar 4, 2016

@gopherbot gopherbot commented Apr 18, 2016

CL https://golang.org/cl/22189 mentions this issue.

@matloob matloob modified the milestones: Go1.7Early, Unplanned Apr 18, 2016

@gopherbot gopherbot commented Apr 19, 2016

CL https://golang.org/cl/22246 mentions this issue.

@matloob matloob removed this from the Go1.7Early milestone Apr 27, 2016

@matloob matloob commented Apr 27, 2016

This isn't going to get in for Go 1.7.

@bradfitz bradfitz added this to the Go1.8 milestone Apr 27, 2016

@matloob matloob commented Apr 27, 2016

For those interested in trying it out, I've got a standalone copy of my work thus far in matloob.io/regexp. It passes all the regexp tests and is 2-4x faster than regexp for hard regexps and large inputs. It's 5-10x faster for the new BenchmarkMatchHard1 benchmark.


@cespare cespare commented Aug 17, 2016

@matloob, do you plan on sending new CLs this cycle?


@matloob matloob commented Aug 17, 2016

Ooh I really want to... but...

rsc had some ideas that I want to look at before moving forward with this. I think things will need to be on hold till he gets back.

In the meantime I'll sync the CL to tip and check in an even harder benchmark.

@rsc rsc modified the milestones: Go1.9, Go1.8 Oct 18, 2016

@olekukonko olekukonko commented Jan 13, 2017

@matloob can you share some of @rsc's ideas?


@matloob matloob commented Jan 13, 2017

Russ would like to implement the ideas in the following paper: https://doi.org/10.1002/spe.2436. I haven't had a chance to read the paper yet (and it's behind a paywall).


@junyer junyer commented Jan 17, 2017

If it helps, there is a video of the seminar that Chivers held about one month before the paper was published. The "preview" idea seems elegant and more appealing (to me, at least) than dealing with SIMD instructions for various architectures.

The first page of the paper mentions the use of loop counting to implement counted repetitions. If that idea is of interest, there are at least two papers about "extended" finite automata that predate Chivers' work. I must also point out that Smith, Estan and Jha patented their work.


@olekukonko olekukonko commented Jan 18, 2017

@matloob @jubalh thanks for the links ...

@bradfitz bradfitz modified the milestones: Go1.10, Go1.9 Jul 22, 2017
@bradfitz bradfitz modified the milestones: Go1.10, Go1.11 Nov 15, 2017

@bradfitz bradfitz commented Nov 15, 2017

@matloob, any update here? Still want to do this sometime?


@matloob matloob commented Nov 17, 2017

I haven't had the chance to review the Chivers paper, and don't know when I'd be able to dedicate the time to investigating it further.

On the other hand, bringing in the DFA matcher is far easier, and I'd be able to do that work if we decided it's something we want.


@smasher164 smasher164 commented Nov 22, 2017

Forgive the length of this post. TLDR: This post summarizes Chivers' paper and my thoughts on DFA implementation for Go in general.

Chivers discusses optimization techniques implemented in the jsre regular expression library. Sections 3-5 are the pertinent ones for this post.

  • 3.1: Executable Architecture: NFA simulation is implemented with a bytecode virtual machine, where the states, transition functions, and parallel simulation are represented by the program counter, a set of instructions, and separate execution threads.
  • 3.3: Counted repetitions and NFA state: In order to support arbitrarily complex repeated subexpressions, jsre supports counted loops, similar to an assembly translation with a loop control variable and a conditional jump. However, Chivers states that in order to avoid merging threads that are not in the same NFA state, one must “label states using both the program counter and the loop counter. Threads may be merged if they are at the same point in the input stream and both the program counter and any loop counters have the same value.”[1]
  • 4.1: Preview Definition: A preview is a sequence of sets, where the ith symbol of every string in the language is a member of the ith set in the preview. For example:
    1. The language denoted by the regular expression abcd|ef is {abcd, ef}. The preview for this language is {{a,e}, {b,f}, {c}, {d}}.
    2. The minimum length of a string described by a preview is called its validity length (or validity index). The validity length of this preview is 2.
    3. A preview generalizes the one-byte lookahead common to existing regular expression libraries.
  • 4.2: Preview Construction:
    1. Base case: Regular expression has single symbol x. Preview is {x}.
    2. Union: Preview(R|S)_i = Preview(R)_i ∪ Preview(S)_i for i ∈ ℕ
      1. ValidityLength(Preview(R|S)) = min(ValidityLength(Preview(R)), ValidityLength(Preview(S)))
    3. Concatenation: Denote ValidityLength(Preview(R)) by V_R. Then Preview(RS)_{V_R+i} = Preview(R)_{V_R+i} ∪ Preview(S)_i for i ∈ ℕ (positions before V_R come from Preview(R) alone).
      1. ValidityLength(Preview(RS)) = ValidityLength(Preview(R)) + ValidityLength(Preview(S))
    4. Kleene Star: Concatenate a preview to itself repeatedly, e.g. Preview(RR), Preview(Preview(RR)R), Preview(Preview(Preview(RR)R)R), …
      1. ValidityLength = 0 because the star admits the empty string.
  • 4.3: From preview to DFAs:
    1. The ith set in the preview serves as the transition from state i to state i+1. (A sketch of a preview used as a filter follows this list.)
    2. Produces false positives, but not false negatives.
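
As a concrete transcription of 4.1 and 4.3 (my own sketch, not code from the paper or any CL), a preview can be held as a slice of rune sets plus its validity length, and the candidate check tests the first validity-length runes for membership; a failed check is a definite rejection, a passed check only nominates a candidate.

    package preview

    type Preview struct {
        Sets     []map[rune]bool // Sets[i] over-approximates the i-th rune of any match
        Validity int             // minimum length of any string in the language
    }

    // Candidate reports whether a match could start at the beginning of input.
    func (p Preview) Candidate(input []rune) bool {
        if len(input) < p.Validity {
            return false
        }
        for i := 0; i < p.Validity && i < len(p.Sets); i++ {
            if !p.Sets[i][input[i]] {
                return false // definite rejection: no false negatives
            }
        }
        return true // only a candidate: false positives are possible
    }

    // For abcd|ef, the preview from 4.1 is {{a,e}, {b,f}, {c}, {d}} with validity 2:
    //
    //   p := Preview{
    //       Sets: []map[rune]bool{
    //           {'a': true, 'e': true},
    //           {'b': true, 'f': true},
    //           {'c': true},
    //           {'d': true},
    //       },
    //       Validity: 2,
    //   }
    //   p.Candidate([]rune("efgh")) // true  ("ef" really matches)
    //   p.Candidate([]rune("abzz")) // true  (false positive, still only a candidate)
    //   p.Candidate([]rune("xbcd")) // false (definite rejection)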

5: Optimization with previews:

I found that the paper was lacking in examples, so I added one for each optimization.

  • 5.1: Anchor preview scan: An unanchored match typically tries successive positions in the input as anchor points from which to match a substring.

    1. One simple and lightweight optimization is to use a preview to eliminate non-candidates before trying to match the expression from that anchor point.
    2. Chivers provides an example of an RE for an IP address: ([0-9]{1,3}\.){3}[0-9]{1,3}
    3. The preview for the RE would be the following: {{[0-9]}, {\., [0-9]}, {\., [0-9]}}
    4. Therefore, based on the preview, the following input strings are filtered accordingly
      8.6.7.123         candidate
      192.16.7.123      candidate
      .1.2.234          not a candidate
      abc.234.532.445   not a candidate
      123.b.~.445       candidate
      
    5. Chivers shows a speedup of 5x from this optimization when tested on a large corpus. (A sketch of such an anchor filter appears after this list.)
  • 5.3: Matching multiple alternate expressions: A common use case for regular expressions is to union multiple different expressions and test each alternative on a different thread.

    1. Before accepting the overhead of dispatching another thread, previews allow one to test that a (sub) expression may match.
    2. For example, let’s say we wanted to match the expression (asking|aardvark|barhopping|bartender|barring)
    3. Here’s a preview of size 3: {{a,b},{s,a},{k,t,r}}
    4. Chivers shows that “[i]t is therefore straightforward to arrange an efficient search strategy by re-ordering the alternative expressions into a tree and using preview tests at each node.[1]” Note that if the preview check fails, we reject (or do not use that anchor point in the case of unanchored matches).
    5. Based on this example, here is a tree that I have marked up to include the preview checks and thread dispatch locations.
      $
      │
      {a,b}?
      │
      ├── a
      │   │
      │   {s,a}?
      │   │
      │   ├── s─{k,t,r}?─k─(new thread)─i──n──g
      │   │
      │   └── a─{k,t,r}?─r─(new thread)─d──v──a──r──k
      │
      └── b─{s,a}?─a─{k,t,r}?─r
                              ├─(new thread)─ h──o──p──p──i──n──g
                              ├─(new thread)─ t──e──n──d──e──r
                              └─(new thread)─ r──i──n──g
      
  • 5.5: Loop Optimization: For a given RE of the form [c]*e, where c is a character set and e is an expression

    1. If c and First(Preview(e)) have no members in common (that is, they are disjoint), we can terminate the loop for [c]* as soon as we encounter [^c].
    2. Normally, one has to dispatch a thread on each iteration to check whether e also matches starting there. However, if the sets are disjoint, e cannot start matching while the current character is still in [c], so no thread needs to be dispatched until the loop exits.
    3. Consider the RE [abc]*d+
    4. Preview(d+) = {{d}, {d}, …}
    5. First(Preview(d+)) = {d}
    6. {a,b,c} is disjoint from {d}, so as soon as we encounter an element that is not in the character set [abc], we exit the loop.
  • 5.6: String Prefix Optimization: This optimization utilizes the disjoint property from 5.5. Chivers proves that for a repeated character-set test [x] with a disjoint suffix y, as in [x]*y, if the character test stops and we then fail to match the suffix, any anchor point between the original anchor point and the position where the test stopped will also stop there and fail to match the suffix.

    1. Given this input:   aaaaxaaxy
    2. Match attempt 1:  aaaaxy
    3. The match fails on the sixth character. Any anchor point chosen before the first x will also fail at the same position.
    4. The next logical anchor point is after the first x, allowing us to optimally select anchor points rather than exhaustively trying each one.
    5. Input:                    aaaaxaaxy
    6. Match attempt 2:          aaxy
    7. Chivers notes that this optimization can be applied to arbitrary sub-expressions, where all that is needed is a preview of the (sub)expression and the disjoint property. This optimization can lead to substantial performance improvements on unanchored searches in expressions with Kleene closures.
  • 5.8: Avoiding Quadratic Complexity: Trying an anchor point at every position in the input results in roughly O(n²) runtime.

    1. A common solution to this is to prefix the expression with .*, which effectively starts a new match attempt (a new thread) at every input position. This exploits the thread merging of NFA simulation, but has the disadvantage that if the beginning of the expression can match a large character set, the number of live threads can approach the number of input positions consumed.

      For the regular expression x{5}. 
      xxxxx
      01234  // states written out
      Input: xaxxxaxxxxxabc
      
      Note: f denotes failure.
      Exhaustive anchor point testing. (Each line is a new anchor point)
      xaxxxaxxxxxabc
      0f
       f
        012f
         01f
          0f
           f
            01234    // success
             0123f
              012f
               01f
                0f
                 f
                  f
                   f
      
      
      
      Using .* prefix to exploit thread merging. (Each line is a new thread)
      .*xxxxx
      0 12345
      
      xaxxxaxxxxxabc
      00000000000000
      1f
        123f
            12345    // success
             1234f
      
    2. Chivers states that “[a]t best thread handling may result in slower evaluation because of thread overheads; at worst this may result in early exhaustion of the available thread space.”[1]

    3. An alternative solution applies to Kleene closures embedded anywhere in the expression. If we fail to match from the first anchor position using parallel simulation, we end up with a “guarded area” that corresponds to all of the characters matched by the Kleene closure. When we try a second anchor position and we reach the same state of the Kleene closure, if our position in the input is in the guarded area, then we know that the expression evaluated from this anchor will fail.

      Input: abcabczzzxyy
      Expression: abc.*xyz
      States:     0123 456
      
      Note: G denotes the guarded area.
      f denotes failure.
      Each line after the colon denotes another thread.
      
      Attempt 1:
      Input:   abcabczzzxyy
      States: 012333333333
            :          45f
      Guarded:   GGGGGGGGG
      
      Attempt 2:
      Input:  abcabczzzxyy
      Guarded:   GGGGGGGGG
      States:    012f    // because 3 would land in the guarded area
      
    4. Chivers shows that “in addition to providing linear search performance, this approach minimizes thread usage in the NFA simulation.”[2]
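
The anchor-filter sketch referenced under 5.1 (illustration only; the preview is passed as its sets plus validity length): candidate anchor points are screened by the preview before the full matcher, whatever it is, gets dispatched.

    package preview

    // scanAnchors returns the first position at which tryMatch succeeds, trying
    // only positions that pass the preview test; -1 means no match.
    func scanAnchors(sets []map[rune]bool, validity int, input []rune, tryMatch func(pos int) bool) int {
        for pos := 0; pos+validity <= len(input); pos++ {
            candidate := true
            for i := 0; i < validity && i < len(sets); i++ {
                if !sets[i][input[pos+i]] {
                    candidate = false // preview rejected: no match can start here
                    break
                }
            }
            if candidate && tryMatch(pos) { // full NFA/DFA attempt only at candidate anchors
                return pos
            }
        }
        return -1
    }

    // With the IP-address preview {{[0-9]}, {., [0-9]}, {., [0-9]}} from 5.1,
    // position 0 of ".1.2.234" is rejected without a match attempt, while
    // "8.6.7.123" (and the false positive "123.b.~.445") go on to tryMatch.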

Thoughts on implementation

Considering Go aims for robust handling of Unicode, previews would allow characters with multi-byte encodings to be matched more efficiently. Given that Go's regexp compiler also outputs bytecode, it could be extended to support a preview-scan instruction, which I assume records a flag denoting the ith character's membership in the ith set of the preview. Perhaps this feature would also allow Go's implementation to operate on bytes instead of runes, considering it essentially provides k-character lookahead.

It seems that the primary advantage of having previews is to minimize thread usage during NFA simulation. New threads are not spawned until the input matches the preview, and invalid anchor positions can be quickly eliminated, preventing new threads from being spawned.

A preview of size i for a language R contains at most i sets, and the set of strings it admits is a superset of R (compared on their first i characters). Because of this, a rejection by the preview implies a rejection by R, but not the other way around. Additionally, a preview can be represented as a DFA, with the ith subset acting as the transition function from state i to state i+1. Construction of previews would happen during compilation (or lazily; see below), but we'd have to do additional testing to determine their optimal sizes.

A balanced approach to DFA construction seems to be the lazy subset construction described in Russ's article[3]. I also agree with @rsc's comment that we should keep a bound on the space used by the DFA. However, if we run out of space, we should not throw away the automaton. Instead, we construct a partial DFA with an upper bound on the allotted space. If the DFA doesn't require all of the allotted space, that's great: we have a complete DFA. If we do run out of space, only a certain fraction of the machine will be deterministic[4], which still offers an average-case speedup for the deterministic portion of the machine.
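
A rough sketch of that partial-DFA behaviour (types are hypothetical; the NFA is abstracted behind an interface): DFA states are built lazily from sets of NFA states, and once the budget is spent we stop adding states and fall back to stepping the NFA for the uncached transitions instead of flushing what has already been built.

    package dfa

    type nfa interface {
        Step(states []int, b byte) []int // one NFA step over input byte b
        Key(states []int) string         // canonical key for a set of NFA states
    }

    type dstate struct {
        nfaStates []int
        next      [256]*dstate // nil = transition not built yet
    }

    type partialDFA struct {
        n      nfa
        cache  map[string]*dstate
        budget int // maximum number of DFA states we are willing to build
    }

    // step consumes one byte. It returns the next DFA state when one is cached or
    // can still be built, or nil plus the raw NFA state set once the budget is
    // spent, leaving the caller to continue that stretch as plain NFA simulation.
    func (d *partialDFA) step(s *dstate, b byte) (*dstate, []int) {
        if t := s.next[b]; t != nil {
            return t, t.nfaStates // fully deterministic fast path
        }
        states := d.n.Step(s.nfaStates, b)
        key := d.n.Key(states)
        if t, ok := d.cache[key]; ok {
            s.next[b] = t
            return t, t.nfaStates
        }
        if len(d.cache) >= d.budget {
            return nil, states // budget exhausted: this fraction stays non-deterministic
        }
        t := &dstate{nfaStates: states}
        d.cache[key] = t
        s.next[b] = t
        return t, t.nfaStates
    }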

I wonder if the current implementation has the notion of composable machines (or subroutines). An enhancement to finite state machines that doesn't add any language-recognition power but offers a practical performance benefit is the ability to feed the result of a separate machine or subroutine into the first. This allows one to take the union and intersection of machines, and therefore their languages, simply by accepting if one or both of the machines accept, and vice versa for rejection. In the context of subroutines, we would still have access to the same thread state, allowing the machine to record start and end positions of matches. Additionally, when we compute the preview of a subexpression that has multiple instances inside an expression, we can reuse that preview for each instance, provided that the subexpression has a separate machine or subroutine.

Finally, partial DFAs might be useful for previews as well. Similar to cached DFA construction, we can compute the previews only for code paths that are reachable. The first time an input character is tested as a member of a preview set, we construct the preview and cache it for lookup later. Although preview sizes are small, a large number of subexpressions in the language may demand many previews to the point where some of them will be NFAs constructed on the fly.

Overall, a reasonable strategy for implementation would be:

  1. Cached Partial DFA Construction
  2. Instruction for preview scan
    • Optimization for alternate expressions to improve Unicode code point matching
    • Optimization for Anchor Points
    • Loop optimization
    • String Prefix Optimization
    • Guarded area anchor point
      We could settle here if performance improvements were notable enough.
  3. Also construct preview DFAs lazily
  4. Explore the idea of subroutines.

[1] Chivers, H. (2016). Optimising Unicode regular expression evaluation with previews. Software: Practice and Experience, 47(5), 669-688. doi:10.1002/spe.2436

[2] Chivers, H. (2016, August 15). York CS Seminars [Video]. Retrieved from https://youtu.be/tIRpyQpJ3Ug

[3] Cox, R. (2007, January). Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...). Retrieved from https://swtch.com/~rsc/regexp/regexp1.html

[4] Navarro, G. (1997). A Partial Deterministic Automaton for Approximate String Matching.


@ramon-garcia ramon-garcia commented Dec 19, 2017

I have checked this myself, and the performance of the Go regexp library is horrible.

In a simple benchmark, RE2 was 10 times faster. That benchmark might not be representative.

RE2's compiler is 1283 lines long; Go's regexp compiler is 290 lines long.

Porting the RE2 code should be enough.

@Mygod Mygod mentioned this issue Feb 25, 2018

@adsouza adsouza commented Apr 18, 2018

The Teddy algorithm used in Hyperscan is much better than even RE2's DFA:
https://01.org/hyperscan/blogs/jpviiret/2017/regex-set-scanning-hyperscan-and-re2set

Perhaps somebody can translate it from the Rust implementation at https://github.com/jneem/teddy


@BurntSushi BurntSushi commented Apr 18, 2018

@adsouza Note that the Teddy algorithm is a specific form of literal optimization that only really applies to a small number of small literals.


@junyer junyer commented May 6, 2018

Has anyone evaluated the DFA implementation in https://github.com/google/codesearch/blob/master/regexp/match.go yet?

@bradfitz bradfitz modified the milestones: Go1.11, Unplanned May 30, 2018
@bradfitz bradfitz added the NeedsFix label May 30, 2018

@smasher164 smasher164 commented Jun 5, 2018

@junyer The implementation in codesearch keeps a 256-entry cache for each DFA state (even though the alphabet spans the full range of a rune). A cache like this could be placed into a sync.Pool to play nicely with the GC. We would still be swapping a lot, given that a single DFA cache is 2K bytes, so it would be nice to reduce the size of the bitmap.

I'm trying to understand toByteProg, which I believe modifies a syntax.Prog to break up a >1-byte-range rune instruction into multiple 1-byte-range rune instructions (correct me if I'm wrong). I think @rsc mentions this idea of constructing smaller FSMs to handle Unicode in https://swtch.com/~rsc/regexp/regexp1.html under "Character sets".

A preview approach might be to use Unicode character ranges to create a preview DFA of depth 3 to filter out invalid byte sequences. This presents a tradeoff between the number of states (from the depth value) and the size of the DFA cache, which we could tune to our liking.
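
One way to read the sync.Pool suggestion (a sketch under invented names; dfaCache stands in for whatever per-match scratch state the DFA needs): each match borrows a cache from the pool and returns it afterwards, so concurrent matches reuse a few caches instead of allocating and discarding one on every call.

    package regexpdfa

    import "sync"

    type dfaCache struct {
        next [256]int32 // per-state transition cache; the layout here is invented
    }

    var cachePool = sync.Pool{
        New: func() interface{} { return new(dfaCache) },
    }

    func match(input []byte) bool {
        c := cachePool.Get().(*dfaCache)
        defer cachePool.Put(c) // hand the cache back for reuse; the GC may still reclaim idle ones
        // ... run the DFA over input using c as scratch space ...
        return false // placeholder result
    }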


@BurntSushi BurntSushi commented Jun 5, 2018

> I'm trying to understand toByteProg, which I believe modifies a syntax.Prog to break up a >1-byte-range rune instruction into multiple 1-byte-range rune instructions (correct me if I'm wrong).

If you can read Rust, then this is exactly what the utf8-ranges crate does: https://github.com/BurntSushi/utf8-ranges There isn't much code, so it should be pretty easy to port. The idea is itself inspired by what RE2 does.

This approach is how you get the alphabet size down to 256, even while supporting all of Unicode. You can further reduce the alphabet size by merging the symbols in your alphabet that are indistinguishable from each other for a specific regex. @rsc talks about this in his third regexp article IIRC.
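
As a concrete illustration of that byte-level rewrite (my example, not taken from utf8-ranges or RE2), a single rune range splits at the UTF-8 encoding-length boundary into alternatives built from one-byte ranges:

    [\x{00}-\x{7FF}]   becomes   [\x00-\x7F]                 (1-byte encodings)
                                 | [\xC2-\xDF][\x80-\xBF]    (2-byte encodings)

A byte-at-a-time DFA built over alternatives like these never needs more than 256 symbols, which is what makes fixed 256-entry state tables workable for full Unicode.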


@kamilgregorczyk kamilgregorczyk commented Aug 12, 2018

Regexps are still slow as hell #26943


@bradfitz bradfitz commented Aug 12, 2018

@kamilgregorczyk, you can file constructive bug reports without saying things are "slow as hell" or "unacceptable". People are more likely to deprioritize bug reports from people who seem rude, and more likely to help people who are polite.

@kamilgregorczyk kamilgregorczyk commented Aug 12, 2018 (this comment has been minimized)

@cznic cznic commented Aug 12, 2018

What do you even mean by 'unacceptable performance'? Do you care to define/explain? Do you mean the happy case, the average one, or the worst one? For each of those you get different answers. Do you know that Go is actually orders of magnitude faster than PCRE in some worst cases? Have you heard about https://en.wikipedia.org/wiki/ReDoS, for example?

What the best mix of performance/safety characteristics is in the different cases is a design choice, not a simple or universal "truth" as falsely claimed.

@kamilgregorczyk kamilgregorczyk commented Aug 12, 2018 (this comment has been minimized)


@bradfitz bradfitz commented Aug 13, 2018

You could say something like "I find its performance unacceptable" or "It's unacceptable for my application", but saying that it's flat-out "unacceptable" is what's coming across as rude in your tone. It's been almost 9 years of people largely accepting its performance, even if there are always things that could be better. The bug is one such thing.


@mohanraj-r mohanraj-r commented Oct 17, 2019

Since

  • @matloob's PR seems to have stalled, per the conversation in this issue,
  • and there don't seem to be any regexp optimizations in Go 1.12 (or Go 1.13), as @rsc mentioned in issue #26943 (comment) (referenced from this issue),

should this issue be closed in favor of #26623, @bradfitz?


@bradfitz bradfitz commented Oct 17, 2019

Let's keep this open. This is a specific task with good discussion and history.
