Use frak to reduce backtracking #28

Merged
merged 18 commits into guns:master from noprompt:noprompt-regex-backtracking Sep 7, 2013

Contributor

noprompt commented Jul 31, 2013

Hey buddy. Remember when I said

Part of me feels like going the extra mile and optimizing the generated expressions to limit the amount of backtracking - which is going to be through the roof for clojureRegexpUnicodeCharClass.

in #19?

Well guess what? 😄

Those crazy expressions haunted me up until last weekend, when I finally decided to come up with a solution (frak). If you look at the diff you'll notice those gigantic expressions are significantly smaller. Two are a tad larger, but that's because they're fully expanded (i.e. [abc] -> %(a|b|c)).
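For readers unfamiliar with the idea, the kind of prefix factoring a trie-based generator can perform might be sketched as follows. This is an illustration under assumptions, not frak's actual implementation: the class `PrefixTrieRegex` and its methods are hypothetical names, the output uses Java's `(?:...)` group syntax rather than Vim's `%(...)`, and no suffix or character-class optimization is attempted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of trie-based prefix factoring; not frak's real code.
final class PrefixTrieRegex {
    // One trie node: children keyed by next character, plus an end-of-word flag.
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean terminal;
    }

    /** Builds a pattern string that matches exactly the given words. */
    static String build(List<String> words) {
        Node root = new Node();
        for (String word : words) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.terminal = true;
        }
        return render(root);
    }

    // Renders everything below a node as a regex fragment.
    private static String render(Node node) {
        List<String> branches = new ArrayList<>();
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            branches.add(escape(e.getKey()) + render(e.getValue()));
        }
        if (branches.isEmpty()) {
            return "";
        }
        // A group is needed for real alternation, or to make a suffix optional
        // when a shorter word ends at this node.
        boolean needsGroup = branches.size() > 1 || node.terminal;
        String body = String.join("|", branches);
        if (needsGroup) {
            body = "(?:" + body + ")";
        }
        return node.terminal ? body + "?" : body;
    }

    // Backslash-escapes regex metacharacters so words are matched literally.
    private static String escape(char c) {
        return "\\[](){}.*+?^$|".indexOf(c) >= 0 ? "\\" + c : String.valueOf(c);
    }

    public static void main(String[] args) {
        // prints (?:ba(?:r|z)|foo|quux)
        System.out.println(build(Arrays.asList("foo", "bar", "baz", "quux")));
    }
}
```

The shared prefix `ba` is emitted once, so a matcher that fails on `b` does not have to retry it for every alternative.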

I know this sounds (and looks) crazy, but these expressions should definitely be more performant. All of the tests pass and, from what I can tell, everything seems to highlight fine.

See, I still care about Vim (I'm not completely evil). 😉

noprompt commented on clj/src/vim_clojure_static/generate.clj in 3e768e8 Jul 31, 2013

frak doesn't compile Vim-compatible regexes out of the box (yet), so this is a small workaround.

noprompt commented on clj/src/vim_clojure_static/generate.clj in 3e768e8 Jul 31, 2013

The grouping is dropped since frak is handling that.

noprompt commented on syntax/clojure.vim in 3e768e8 Jul 31, 2013

Whoops! I broke this. Will fix.

noprompt and others added some commits Jul 31, 2013

Add old TODOs from git stash
    I had some new tests + bugfixes some months back, but then Bram
    introduced the new NFA regexp engine late into 7.9** and everything was
    broken for a while, so I decided to get out of Dodge until the dust
    settled.

    Now that things are settling down, I've pulled these changes from my git
    stash. None of them are complete, and require some thought.

Fix failing test for ^ and $ boundaries
    This is an old failure; I believe more special characters need to be
    checked for preceding backslashes.

Support more abbreviated Unicode category classes
    This became apparent after opening the new regexp optimization test
    buffer.

Merge branch 'noprompt-regex-backtracking' into development
    * noprompt-regex-backtracking:
      Small correction to clojureRegexpUnicodeCharClass
      Generate *CharClass patterns with frak
      Add frak

    Conflicts:
    	clj/src/vim_clojure_static/generate.clj
    	syntax/clojure.vim

Generate new regexp definitions with Frak
    There are some regressions due to the newly added support for abbrev.
    character classes

guns commented Aug 2, 2013

Hey, hey, hey. Back from the dark side.

Well, to start off: great work! I can tell you had a bit of fun writing Frak (great name). You have clearly fallen into the rabbit hole of regular expression optimization; I think you will soon find there's only one thing left to do: write your own regexp engine.

To be honest, I didn't like this patch at first; I guess I thought losing the greppability of things like "UnifiedCanadianAboriginalSyllabics" was a big loss, so I decided to look into the actual performance impact of Frak.

I had always assumed that regexp engines essentially do what Frak does, so when in doubt, write the plainest pattern you can. I've never looked into this, and I was apparently intensely curious.

Vim 7.4 has a new NFA style regexp engine (as you probably know), and it also verbosely logs every single step of its operation when in debug mode.

I tried using these log files to get a sense of the opcount of the match operations, but the files quickly grow to many gigabytes of data, so they were nearly impossible to diff (OOM errors).

One nice thing the logs do provide is full dumps of the regexp engine's state. They look like this:

(I suggest cloning the repo so you can vimdiff the log files, old vs new)

https://gist.github.com/guns/6137147

Awesome!

As you can see, the regexp engine does what's on the tin, without Frak-like optimizations. In the best case (the beastly Unicode block pattern), Frak reduces the size of the state table by ~35%; in the case of java, the reduction is ~11%.

Since the default NFA engine logging produces a lot of data, I patched Vim to get a high-level count of the number of match operations:

https://raw.github.com/guns/vim-clojure-static/development/clj/vim/custom-nfa-log.patch

(I also have a branch containing the changes here:)

https://github.com/guns/vim/tree/custom-nfa-log

If you apply the patch and compile with -DDEBUG, you can use the following function to dump/compare the regexp logs for different syntax files:

https://github.com/guns/vim-clojure-static/blob/development/clj/src/vim_clojure_static/test.clj#L82
https://github.com/guns/vim-clojure-static/blob/development/clj/src/vim_clojure_static/test.clj#L119

Running this on the old syntax file vs. the new one, there is a corresponding drop in the number of operations of 10-30% (don't quote me on this! Consider the error bars to be ±20%). What's even more impressive is the reduction in memory usage: the allocated memory sometimes drops by over 50%!

High memory usage is low on my list of complaints about Vim, but given that vim is found on routers and Raspberry Pis, this is actually significant.

So in the end, despite my pessimism, the proof is in the pudding: Frak is a win.

BTW, while we're shaving this mighty yak, I noticed that the pattern A[abc] appears to be more efficient than the pattern A%(a|b|c). If you're interested, it would be nice to determine if it is actually faster, and if so, to use that style for category classes since patterns like \p{Lu} and \p{Cc} are more common than the longer properties.
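The rewrite suggested here could be done as a post-processing step over a generated pattern. The sketch below is only an assumption about how that might look: `AlternationToClass.collapse` is a hypothetical helper, it works on Java-style `(?:a|b|c)` groups rather than Vim's `%(a|b|c)`, and it only handles branches that are single word characters. It does not understand escaped parentheses or groups that appear inside an existing character class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rewrites (?:a|b|c) into [abc] when every branch is one word character.
final class AlternationToClass {
    // A non-capturing group whose branches are all single word characters,
    // e.g. (?:c|f|n|o|s). Longer branches and escapes are deliberately left alone.
    private static final Pattern SINGLE_CHAR_GROUP =
        Pattern.compile("\\(\\?:(\\w(?:\\|\\w)+)\\)");

    static String collapse(String regex) {
        Matcher m = SINGLE_CHAR_GROUP.matcher(regex);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Drop the '|' separators and wrap the remaining characters in brackets.
            String chars = m.group(1).replace("|", "");
            m.appendReplacement(out, Matcher.quoteReplacement("[" + chars + "]"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapse("C(?:c|f|n|o|s)"));    // prints C[cfnos]
        System.out.println(collapse("(?:ba(?:r|z)|foo)")); // prints (?:ba[rz]|foo)
    }
}
```

Groups with multi-character branches pass through untouched, so the step is safe to run over a whole factored pattern.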

Oh, and I added some tests for covering some ground we missed earlier:

82276e2

These now fail after merging your changes; would you mind adding support for these cases?

All the new changes are in the development branch.

Thanks!

noprompt commented Aug 4, 2013

Wow! This is truly fascinating. I would have imagined some overall benefits, but this is quite impressive, especially with regard to the memory savings. I'm happy.

I had always assumed that regexp engines essentially do what Frak does...

frak was more or less an experiment to see if this was the case and if it was possible to produce better patterns (provided you know your inputs). Considering what I've read and my understanding of backtracking, I had a hunch there was only so much optimization an engine could do without sacrificing performance. The fact that patterns will only be as good as your ability to write them doesn't completely surprise me.

As always, you have gone above and beyond to research the problem at hand and I think you understand a bit more about the underlying mechanics than I do. You're always teaching me new things. I'd like to know: how do you get the state table information in Java? This would be very useful to me because I've only used benchmarking to get a feel for whether or not the patterns are more performant.

Speaking of performance I think you're right about [abc] being superior to (a|b|c). I ran some benchmarks and found this to be the case (although the performance boost is small). This actually came as a surprise because my understanding was that [abc] gets expanded to (a|b|c) under the hood.
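A rough way to try this kind of comparison on the JVM is sketched below. It is not the benchmark that was run in this thread: the class `ClassVsAlternationBench`, the input text, and the iteration counts are all arbitrary assumptions, and hand-rolled `System.nanoTime` loops are noisy, so a harness such as JMH would give more trustworthy numbers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical micro-benchmark sketch; timings vary by JVM and machine.
final class ClassVsAlternationBench {
    /** Counts matches of the pattern in the input; this is the timed workload. */
    static int countMatches(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    /** Returns elapsed nanoseconds for `rounds` full scans of the input. */
    static long time(Pattern pattern, String input, int rounds) {
        long start = System.nanoTime();
        int sink = 0;
        for (int i = 0; i < rounds; i++) {
            sink += countMatches(pattern, input);
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) {
            throw new IllegalStateException("unexpected overflow"); // keeps `sink` live
        }
        return elapsed;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append("Cc Cf Cx Cn Lu Co Cs ");
        }
        String input = sb.toString();
        Pattern charClass = Pattern.compile("C[cfnos]");
        Pattern alternation = Pattern.compile("C(?:c|f|n|o|s)");

        // Warm-up rounds so the JIT has compiled both paths before timing.
        time(charClass, input, 50);
        time(alternation, input, 50);

        System.out.printf("class:       %d ns%n", time(charClass, input, 200));
        System.out.printf("alternation: %d ns%n", time(alternation, input, 200));
    }
}
```

Both patterns find the same matches, so any difference in the printed times comes from how the engine executes them rather than from the amount of work requested.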

At any rate I think I can make this happen in frak. I was planning to include more branch comparison and optimization (think suffixes) in the next minor releases anyway. I just need a little time to make sure I get it right because it's a bit tricky.

These now fail after merging your changes; would you mind adding support for these cases?

Of course. 😄

Owner

guns commented Aug 5, 2013

I'd like to know: how do you get the state table information in Java? This would be very useful to me because I've only used benchmarking to get a feel for whether or not the patterns are more performant.

With Vim 7.4, it just so happens that Bram added a ton of logging to the new regexp engine, so all we had to do was turn it on. We got lucky :)

Taking a look at java.util.regex.Pattern, there isn't a completely obvious way to get the same information, but there is a single private debugging method: printObjectTree.

In order to play around with this, I added a copy of Pattern.java to the project, prepending it to the classpath:

https://github.com/guns/vim-clojure-static/commit/53a7e427c998c7f74114c5a578b6d63f1e8d9fe9.patch

Then I hacked up printObjectTree() to print more prettily, recursively expand branches, and dump TreeInfo data (which calculates min and max path lengths, as well as whether the state machine is deterministic or not):

02bc9f2

Now the debug-Pattern branch is ready to give us more insight!

$ lein trampoline repl :headless 2>tmp/pattern.log &
$ less +F tmp/pattern.log

Now that we're tailing the debugging output from our custom Pattern class, we can compile two different versions of a pattern and compare:

(Pattern/compile "C[cfnos]")
[TreeInfo] min: 2, max: 2, maxValid: true, deterministic: true
BEGIN: Dumping object tree: C[cfnos]
java.util.regex.Pattern$Start@73c83d69
java.util.regex.Pattern$Single@5f37f3e1
java.util.regex.Pattern$BitClass@526c699d
java.util.regex.Pattern$LastNode@70147758
Accept Node
(Pattern/compile "C(?:c|f|n|o|s)")
[TreeInfo] min: 2, max: 2, maxValid: true, deterministic: false
BEGIN: Dumping object tree: C(?:c|f|n|o|s)
java.util.regex.Pattern$Start@4958774c
java.util.regex.Pattern$Single@4f004432
java.util.regex.Pattern$GroupHead@1a7d5723
java.util.regex.Pattern$Branch@1ae3c86b
  Branch 0
    java.util.regex.Pattern$Single@607af697
    java.util.regex.Pattern$BranchConn@28d364fd
    java.util.regex.Pattern$GroupTail@4e8b32fb
    Tail next is java.util.regex.Pattern$LastNode@70147758
  Branch 1
    java.util.regex.Pattern$Single@14c02506
    java.util.regex.Pattern$BranchConn@28d364fd
    java.util.regex.Pattern$GroupTail@4e8b32fb
    Tail next is java.util.regex.Pattern$LastNode@70147758
  Branch 2
    java.util.regex.Pattern$Single@52beb78e
    java.util.regex.Pattern$BranchConn@28d364fd
    java.util.regex.Pattern$GroupTail@4e8b32fb
    Tail next is java.util.regex.Pattern$LastNode@70147758
  Branch 3
    java.util.regex.Pattern$Single@6704f612
    java.util.regex.Pattern$BranchConn@28d364fd
    java.util.regex.Pattern$GroupTail@4e8b32fb
    Tail next is java.util.regex.Pattern$LastNode@70147758
  Branch 4
    java.util.regex.Pattern$Single@76b74c94
    java.util.regex.Pattern$BranchConn@28d364fd
    java.util.regex.Pattern$GroupTail@4e8b32fb
    Tail next is java.util.regex.Pattern$LastNode@70147758
Accept Node

Now, we can't make a direct judgment about the efficiency of each pattern from the length of the log output because I added special handling for expanding Branch nodes, but not for expanding CharProperty nodes, etc. (Branch was the easiest one to expand)

However, if we look closely, we can see that both [c] and (?:c) create a pattern node of type Single. Furthermore, the interior of [cfnos] is represented by a BitClass instance, which, like it sounds, is backed by a bitfield! It appears that the Pattern class (in OpenJDK anyway) aggressively optimizes simple cases like [cfnos], which is awesome.

On the other hand, the (?:c|f|n|o|s) group is at least optimized with each branch being handled by a Single node, as opposed to the more generic Slice type. This is nice, but the other pattern is more heavily optimized and lacks branches, so will likely execute faster.
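To make the difference concrete, here is a conceptual sketch of the two matching strategies described above. It is not OpenJDK source: `LookupVsBranches`, `TableClass`, and `matchesByBranches` are hypothetical names, and the lookup table is a plain `boolean[]` limited to characters below 256, which is an assumption made for brevity.

```java
// Conceptual sketch only; class and method names are hypothetical, not OpenJDK's.
final class LookupVsBranches {
    // Table lookup: one array index answers "is ch in the set?" regardless of set size.
    static final class TableClass {
        private final boolean[] table = new boolean[256];

        TableClass(String members) {
            for (char c : members.toCharArray()) {
                if (c >= 256) {
                    throw new IllegalArgumentException("only characters below 256: " + c);
                }
                table[c] = true;
            }
        }

        boolean matches(char ch) {
            return ch < 256 && table[ch];
        }
    }

    // Branch-style matching: try each alternative in order until one succeeds.
    static boolean matchesByBranches(String alternatives, char ch) {
        for (char candidate : alternatives.toCharArray()) {
            if (candidate == ch) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TableClass cls = new TableClass("cfnos");
        for (char ch : "cfnosx".toCharArray()) {
            System.out.println(ch + " table=" + cls.matches(ch)
                + " branches=" + matchesByBranches("cfnos", ch));
        }
    }
}
```

Both strategies accept the same characters; the table version does a constant amount of work per character, while the branch version may test every alternative before failing.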

In Vim 7.4, [abc] groups are not backed by bitfields or anything awesome like that, but I did notice that the match iterations were done in a tighter loop than with %(a|b|c), so I imagined that the first version would be ever so slightly faster.

At any rate I think I can make this happen in frak. I was planning to include more branch comparison and optimization (think suffixes) in the next minor releases anyway. I just need a little time to make sure I get it right because it's a bit tricky.

That sounds really great! I wonder if we're at too high of a level to affect CPU cache misses / branch mispredictions.

guns added some commits Aug 11, 2013

Merge branch 'master' into development
* master:
  Refactor \Q…\E region
  Add tests for new regexp char escapes
  Remove unused clojureRegexpEscapes
  Escape more special characters in regexps

guns commented Aug 11, 2013

Hey @noprompt,

Would you like me to merge your changes into master, or wait for more patches?

Now that the bug that cemerick found in #30 is fixed, I'd love to ship a new release to vim_dev in the next week.

If you're close to adding some final patches, I would like to wait for them, but if you're a little busy ATM, I think it would be just fine to ship with the original frak 0.1.1 changes (which contain the majority of the win).

No pressure! Everything is great.

noprompt commented Aug 11, 2013

No pressure at all!

I will admit I have been unusually busy the past few weeks and it's been affecting my turnaround just a bit. However, aside from the extra tests you wanted me to add, I think it would be fine to merge the initial work on this patch. Do you have a day in mind this week you're planning to release?

guns commented Aug 11, 2013

Do you have a day in mind this week you're planning to release?

No, I think any time during the next week would be nice. I pushed some changes to development, so please rebase off the new HEAD when you are ready to hack!

noprompt commented Aug 15, 2013

Just a heads up, I'll be working on some improvements to frak today and tomorrow. I should have a new version tested and ready to go by Saturday A.M. (PST). Once that's done, I'll pull in the new code and update those test cases.

guns commented Aug 15, 2013

When I saw frak on the Github trending list and on HN, I was pretty excited: all those new eyeballs were definitely going to result in a finished patch.

:) congrats!

noprompt commented Aug 16, 2013

Thanks! Honestly, I was a little surprised by the amount of attention it got. I didn't even have an HN account before this week.

If it inspires people to learn Clojure (or add badass regular expression highlighting to their vim runtime file project), I'll be really happy.

noprompt commented Aug 17, 2013

Just a bit behind schedule. Trying to kill two birds with one stone.

noprompt commented Aug 25, 2013

I pushed v0.1.3 today. As soon as I have a chance I will apply the updates to this patch. Sorry for the delay! I would have been able to get this out way sooner but I've been without the internet most of the week working on my new home. Thanks again for the awesome CLI stuff. It's made frak a more interesting and fun project.

noprompt commented Sep 7, 2013

I guess it's like me to take a month to wrap up a PR? 😄 Seriously though, it's been a crazy month for me, but I'm glad to be sending you these patches. I keep looking at the comparisons between the old regexes and the new ones and it kind of gives me the chills (in a good way). The transformation is remarkable.

Could you double check that everything looks right? I rebased off the development branch but I'm a bit of a hack when it comes to git.

guns merged commit dadefa7 into guns:master Sep 7, 2013

guns commented Sep 7, 2013

This looks awesome! Thanks for tying up this loose end for us; the final result is really great. I'll cut a release later today and send it upstream.

I look forward to collaborating with you in the future, even if it won't be Vim related. :)

noprompt deleted the noprompt:noprompt-regex-backtracking branch Sep 8, 2013
