Use frak to reduce backtracking #28

noprompt · 2013-07-31T20:10:28Z

Hey buddy. Remember when I said

Part of me feels like going the extra mile and optimizing the generated expressions to limit the amount of backtracking - which is going to be through the roof for clojureRegexpUnicodeCharClass.

in #19?

Well guess what? 😄

Those crazy expressions haunted me up until last weekend when I finally decided to come up with a solution (frak). If you look at the diff you'll notice those gigantic expressions are significantly smaller. Two are a tad larger, but that's because they're fully expanded (i.e., [abc] -> %(a|b|c)).
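For anyone reading along, here is a rough sketch of the general idea (factoring shared prefixes so the engine doesn't retry the same leading characters for every alternative). It is written in Java against java.util.regex syntax, not Vim syntax, and it is not frak's actual code: the names PrefixFactor, buildTrie, render, and escape are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a tiny prefix-factoring pattern generator (hypothetical names).
class PrefixFactor {
    // Trie node: sorted children plus a flag marking that a word ends here.
    static final class Node {
        final TreeMap<Character, Node> children = new TreeMap<>();
        boolean terminal;
    }

    static Node buildTrie(Collection<String> words) {
        Node root = new Node();
        for (String w : words) {
            Node n = root;
            for (char c : w.toCharArray()) {
                n = n.children.computeIfAbsent(c, k -> new Node());
            }
            n.terminal = true;
        }
        return root;
    }

    // Backslash-escape anything that is not a letter or digit.
    static String escape(char c) {
        return Character.isLetterOrDigit(c) ? String.valueOf(c) : "\\" + c;
    }

    // Render everything below a node as a regex fragment.
    static String render(Node n) {
        List<String> alts = new ArrayList<>();
        for (Map.Entry<Character, Node> e : n.children.entrySet()) {
            alts.add(escape(e.getKey()) + render(e.getValue()));
        }
        if (alts.isEmpty()) return "";
        // A single continuation that is not optional needs no group.
        if (alts.size() == 1 && !n.terminal) return alts.get(0);
        String group = "(?:" + String.join("|", alts) + ")";
        // If a word already ends at this node, the continuation is optional.
        return n.terminal ? group + "?" : group;
    }

    static String pattern(Collection<String> words) {
        return render(buildTrie(words));
    }

    public static void main(String[] args) {
        System.out.println(pattern(Arrays.asList("Cc", "Cf", "Ll", "Lt", "Lu")));
        System.out.println(pattern(Arrays.asList("foo", "foobar", "fob")));
    }
}
```

The output is fully expanded alternation, which is why short sets can come out slightly longer than a hand-written character class.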

I know this sounds (and looks) crazy, but these expressions should definitely be more performant. All of the tests pass and, from what I can tell, everything seems to highlight fine.

See, I still care about Vim (I'm not completely evil). 😉

I had some new tests + bugfixes some months back, but then Bram introduced the new NFA regexp engine late in the 7.3 patch series and everything was broken for a while, so I decided to get out of Dodge until the dust settled. Now that things are settling down, I've pulled these changes from my git stash. None of them are complete, and they all require some thought.

guns · 2013-08-02T03:49:07Z

Hey, hey, hey. Back from the dark side.

Well, to start off: great work! I can tell you had a bit of fun writing Frak (great name). You have clearly fallen into the rabbit hole of regular expression optimization; I think you will soon find there's only one thing left to do: write your own regexp engine.

To be honest, I didn't like this patch at first; I guess I thought losing the greppability of things like "UnifiedCanadianAboriginalSyllabics" was a big loss, so I decided to look into the actual performance impact of Frak.

I had always assumed that regexp engines essentially do what Frak does, so when in doubt, write the plainest pattern you can. I've never looked into this, and I was apparently intensely curious.

Vim 7.4 has a new NFA style regexp engine (as you probably know), and it also verbosely logs every single step of its operation when in debug mode.

I tried using these log files to get a sense of the opcount of the match operations, but the files quickly grow to many gigabytes of data, so they were nearly impossible to diff (OOM errors).

One nice thing the logs do provide is a full dump of the regexp engine's state. They look like this:

(I suggest cloning the repo so you can vimdiff the log files, old vs new)

https://gist.github.com/guns/6137147

Awesome!

As you can see, the regexp engine does what it says on the tin, without Frak-like optimizations. In the best case (the beastly Unicode block pattern), Frak reduces the size of the state table by ~35%; in the case of java, the reduction is ~11%.

Since the default NFA engine logging produces A LOT of data, I patched Vim to get a high-level count of the number of match operations:

https://raw.github.com/guns/vim-clojure-static/development/clj/vim/custom-nfa-log.patch

(I also have a branch containing the changes here:)

https://github.com/guns/vim/tree/custom-nfa-log

If you apply the patch and compile with -DDEBUG, you can use the following function to dump/compare the regexp logs for different syntax files:

https://github.com/guns/vim-clojure-static/blob/development/clj/src/vim_clojure_static/test.clj#L82
https://github.com/guns/vim-clojure-static/blob/development/clj/src/vim_clojure_static/test.clj#L119

Running this on the old syntax file vs the new one, there is a corresponding drop in the number of operations of 10-30% (don't quote me on this! consider the error bars to be ±20%). What's even more impressive is the reduction in memory usage: the allocated memory sometimes drops by over 50%!

High memory usage is low on my list of complaints about Vim, but given that vim is found on routers and Raspberry Pis, this is actually significant.

So in the end, despite my pessimism, the proof is in the pudding: Frak is a win.

BTW, while we're shaving this mighty yak, I noticed that the pattern A[abc] appears to be more efficient than the pattern A%(a|b|c). If you're interested, it would be nice to determine if it is actually faster, and if so, to use that style for category classes since patterns like \p{Lu} and \p{Cc} are more common than the longer properties.

Oh, and I added some tests for covering some ground we missed earlier:

82276e2

These now fail after merging your changes; would you mind adding support for these cases?

All the new changes are in the development branch.

Thanks!

noprompt · 2013-08-04T20:52:22Z

Wow! This is truly fascinating. I would have imagined some overall benefits, but this is quite impressive, especially with regard to the memory savings. I'm happy.

I had always assumed that regexp engines essentially do what Frak does...

frak was more or less an experiment to see if this was the case and if it was possible to produce better patterns (provided you know your inputs). Considering what I've read and my understanding of backtracking, I had a hunch there was only so much optimization an engine could do without sacrificing performance. The fact that patterns will only be as good as your ability to write them doesn't completely surprise me.

As always, you have gone above and beyond to research the problem at hand and I think you understand a bit more about the underlying mechanics than I do. You're always teaching me new things. I'd like to know: how do you get the state table information in Java? This would be very useful to me because I've only used benchmarking to get a feel for whether or not the patterns are more performant.

Speaking of performance I think you're right about [abc] being superior to (a|b|c). I ran some benchmarks and found this to be the case (although the performance boost is small). This actually came as a surprise because my understanding was that [abc] gets expanded to (a|b|c) under the hood.

At any rate I think I can make this happen in frak. I was planning to include more branch comparison and optimization (think suffixes) in the next minor releases anyway. I just need a little time to make sure I get it right because it's a bit tricky.

These now fail after merging your changes; would you mind adding support for these cases?

Of course. 😄

guns · 2013-08-05T06:31:06Z

I'd like to know: how do you get the state table information in Java? This would be very useful to me because I've only used benchmarking to get a feel for whether or not the patterns are more performant.

With Vim 7.4, it just so happens that Bram added a ton of logging to the new regexp engine, so all we had to do was turn it on. We got lucky :)

Taking a look at java.util.regex.Pattern, there isn't a completely obvious way to get the same information, but there is a single private debugging method: printObjectTree.

In order to play around with this, I added a copy of Pattern.java to the project, prepending it to the classpath:

https://github.com/guns/vim-clojure-static/commit/53a7e427c998c7f74114c5a578b6d63f1e8d9fe9.patch

Then I hacked up printObjectTree() to print more prettily, recursively expand branches, and dump TreeInfo data (which calculates min and max path lengths, as well as whether the state machine is deterministic or not):

02bc9f2

Now the debug-Pattern branch is ready to give us more insight!

$ lein trampoline repl :headless 2>tmp/pattern.log &
$ less +F tmp/pattern.log

Now that we're tailing the debugging output from our custom Pattern class, we can compile two different versions of a pattern and compare:

(Pattern/compile "C[cfnos]")

[TreeInfo] min: 2, max: 2, maxValid: true, deterministic: true
BEGIN: Dumping object tree: C[cfnos]
java.util.regex.Pattern$Start@73c83d69
java.util.regex.Pattern$Single@5f37f3e1
java.util.regex.Pattern$BitClass@526c699d
java.util.regex.Pattern$LastNode@70147758
Accept Node

(Pattern/compile "C(?:c|f|n|o|s)")

[TreeInfo] min: 2, max: 2, maxValid: true, deterministic: false
BEGIN: Dumping object tree: C(?:c|f|n|o|s)
java.util.regex.Pattern$Start@4958774c
java.util.regex.Pattern$Single@4f004432
java.util.regex.Pattern$GroupHead@1a7d5723
java.util.regex.Pattern$Branch@1ae3c86b
  Branch 0
    java.util.regex.Pattern$Single@607af697
    java.util.regex.Pattern$BranchConn@28d364fd
    java.util.regex.Pattern$GroupTail@4e8b32fb
    Tail next is java.util.regex.Pattern$LastNode@70147758
  Branch 1
    java.util.regex.Pattern$Single@14c02506
    java.util.regex.Pattern$BranchConn@28d364fd
    java.util.regex.Pattern$GroupTail@4e8b32fb
    Tail next is java.util.regex.Pattern$LastNode@70147758
  Branch 2
    java.util.regex.Pattern$Single@52beb78e
    java.util.regex.Pattern$BranchConn@28d364fd
    java.util.regex.Pattern$GroupTail@4e8b32fb
    Tail next is java.util.regex.Pattern$LastNode@70147758
  Branch 3
    java.util.regex.Pattern$Single@6704f612
    java.util.regex.Pattern$BranchConn@28d364fd
    java.util.regex.Pattern$GroupTail@4e8b32fb
    Tail next is java.util.regex.Pattern$LastNode@70147758
  Branch 4
    java.util.regex.Pattern$Single@76b74c94
    java.util.regex.Pattern$BranchConn@28d364fd
    java.util.regex.Pattern$GroupTail@4e8b32fb
    Tail next is java.util.regex.Pattern$LastNode@70147758
Accept Node

Now, we can't make a direct judgment about the efficiency of each pattern from the length of the log output because I added special handling for expanding Branch nodes, but not for expanding CharProperty nodes, etc. (Branch was the easiest one to expand)

However, if we look closely, we can see that both [c] and (?:c) create a pattern node of type Single. Furthermore, the interior of [cfnos] is represented by a BitClass instance, which, like it sounds, is backed by a bitfield! It appears that the Pattern class (in openjdk anyway) aggressively optimizes simple cases like [cfnos], which is awesome.

On the other hand, the (?:c|f|n|o|s) group is at least optimized with each branch being handled by a Single node, as opposed to the more generic Slice type. This is nice, but the other pattern is more heavily optimized and lacks branches, so will likely execute faster.

In Vim 7.4, [abc] groups are not backed by bitfields or anything awesome like that, but I did notice that the match iterations were done in a tighter loop than with %(a|b|c), so I imagined that the first version would be ever so slightly faster.
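If you want to poke at the Java side of this without the patched Pattern class, something like the sketch below would do. It only uses the public java.util.regex API, so it can't show the node tree; it checks that the two forms accept the same inputs and gives a very crude wall-clock comparison. The class and method names (ClassVsAlternation, agree, timeNanos) are made up for this example, and the timing has no proper JIT warmup control, so treat the numbers as a hint at best.

```java
import java.util.regex.Pattern;

// Illustrative only: compares the two spellings discussed above (hypothetical names).
class ClassVsAlternation {
    static final Pattern CHAR_CLASS = Pattern.compile("C[cfnos]");
    static final Pattern ALTERNATION = Pattern.compile("C(?:c|f|n|o|s)");

    // True if both patterns give the same answer for "C" + x, for every x in a..z.
    static boolean agree() {
        for (char c = 'a'; c <= 'z'; c++) {
            String s = "C" + c;
            boolean a = CHAR_CLASS.matcher(s).matches();
            boolean b = ALTERNATION.matcher(s).matches();
            if (a != b) return false;
        }
        return true;
    }

    // Crude timing loop. The input is expected to match the pattern; the hit
    // count is checked so the loop body cannot be optimized away.
    static long timeNanos(Pattern p, String input, int iterations) {
        long start = System.nanoTime();
        int hits = 0;
        for (int i = 0; i < iterations; i++) {
            if (p.matcher(input).matches()) hits++;
        }
        long elapsed = System.nanoTime() - start;
        if (hits != iterations) {
            throw new IllegalStateException("input did not match: " + input);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        System.out.println("agree: " + agree());
        String input = "Cs"; // last alternative, so the alternation tries every branch
        int n = 1_000_000;
        // One throwaway pass each as a rough warmup.
        timeNanos(CHAR_CLASS, input, n);
        timeNanos(ALTERNATION, input, n);
        System.out.println("char class ns:  " + timeNanos(CHAR_CLASS, input, n));
        System.out.println("alternation ns: " + timeNanos(ALTERNATION, input, n));
    }
}
```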

At any rate I think I can make this happen in frak. I was planning to include more branch comparison and optimization (think suffixes) in the next minor releases anyway. I just need a little time to make sure I get it right because it's a bit tricky.

That sounds really great! I wonder if we're at too high of a level to affect CPU cache misses / branch mispredictions.

guns · 2013-08-11T18:17:59Z

Hey @noprompt,

Would you like me to merge your changes into master, or wait for more patches?

Now that the bug that cemerick found in #30 is fixed, I'd love to ship a new release to vim_dev in the next week.

If you're close to adding some final patches, I would like to wait for them, but if you're a little busy ATM, I think it would be just fine to ship with the original frak 0.1.1 changes (which contain the majority of the win).

No pressure! Everything is great.

noprompt · 2013-08-11T20:09:10Z

No pressure at all!

I will admit I have been unusually busy the past few weeks and it's been affecting my turnaround just a bit. However, aside from the extra tests you wanted me to add, I think it would be fine to merge the initial work on this patch. Do you have a day in mind this week you're planning to release?

guns · 2013-08-11T21:51:59Z

Do you have a day in mind this week you're planning to release?

No, I think any time during the next week would be nice. I pushed some changes to development, so please rebase off the new HEAD when you are ready to hack!

noprompt · 2013-08-15T23:39:30Z

Just a heads up, I'll be working on some improvements to frak today and tomorrow. I should have a new version tested and ready to go by Saturday A.M. (PST). Once that's done, I'll pull in the new code and update those test cases.

guns · 2013-08-15T23:50:09Z

When I saw frak on the Github trending list and on HN, I was pretty excited: all those new eyeballs were definitely going to result in a finished patch.

:) congrats!

noprompt · 2013-08-16T00:34:27Z

Thanks! Honestly, I was a little surprised by the amount of attention it got. I didn't even have an HN account before this week.

If it inspires people to learn Clojure (or add badass regular expression highlighting to their vim runtime file project), I'll be really happy.

noprompt · 2013-08-17T22:00:52Z

Just a bit behind schedule. Trying to kill two birds with one stone.

noprompt · 2013-08-25T18:08:28Z

I pushed v0.1.3 today. As soon as I have a chance I will apply the updates to this patch. Sorry for the delay! I would have been able to get this out way sooner but I've been without the internet most of the week working on my new home. Thanks again for the awesome CLI stuff. It's made frak a more interesting and fun project.

noprompt · 2013-09-07T07:32:51Z

I guess it's like me to take a month to wrap up a PR? 😄 Seriously though, it's been a crazy month for me, but I'm glad to be sending you these patches. I keep looking at the comparisons between the old regexes and the new ones and it kind of gives me the chills (in a good way). The transformation is remarkable.

Could you double check that everything looks right? I rebased off the development branch but I'm a bit of a hack when it comes to git.

guns · 2013-09-07T20:12:17Z

This looks awesome! Thanks for tying up this loose end for us; the final result is really great. I'll cut a release later today and send it upstream.

I look forward to collaborating with you in the future, even if it won't be Vim related. :)

noprompt and others added 11 commits July 31, 2013 12:43

Add frak

6c5d76d

Generate *CharClass patterns with frak

3e768e8

Small correction to clojureRegexpUnicodeCharClass

8542162

Add complete char prop regexp generator

f22abed

Fix failing test for ^ and $ boundaries

3c3eebc

This is an old failure; I believe more special characters need to be checked for preceding backslashes.

Support more abbreviated Unicode category classes

82276e2

This became apparent after opening the new regexp optimization test buffer.

Add Vim NFA regexp engine dump log helpers

742216c

Move vim-nfa-dump to test and add convenience fn

74c7682

Merge branch 'noprompt-regex-backtracking' into development

320a11d

* noprompt-regex-backtracking:
  Small correction to clojureRegexpUnicodeCharClass
  Generate *CharClass patterns with frak
  Add frak

Conflicts:
  clj/src/vim_clojure_static/generate.clj
  syntax/clojure.vim

Generate new regexp definitions with Frak

e8a9907

There are some regressions due to the newly added support for abbrev. character classes

Fix custom-nfa-log.patch

797e9ac

Oops

guns added 2 commits August 11, 2013 12:55

Merge branch 'master' into development

484e052

* master:
  Refactor \Q…\E region
  Add tests for new regexp char escapes
  Remove unused clojureRegexpEscapes
  Escape more special characters in regexps

Remove leading backslash check from clojureRegexpBoundary

4d81bee

Unnecessary due to changes in master

noprompt added 4 commits September 6, 2013 23:07

Update frak to v0.1.2 for character sets

a0f1b13

Upgrade to frak 0.1.3

de6f51c

Fix accidentally damaged char class, tidy up

4096113

Replace commented comment macro

dadefa7

guns merged commit dadefa7 into guns:master Sep 7, 2013

noprompt deleted the noprompt-regex-backtracking branch September 8, 2013 01:13
