Use frak to reduce backtracking #28
Hey buddy. Remember when I said
Well guess what?
Those crazy expressions haunted me up until last weekend when I finally decided to come up with a solution (frak). If you look at the diff you'll notice those gigantic expressions are significantly smaller. Two are a tad larger but that's because they're fully expanded (ie.
I know this sounds (and looks) crazy, but these expressions should definitely be more performant. All of the tests pass and, from what I can tell, everything seems to highlight fine.
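For readers who have not looked at frak, the general idea of collapsing a word list into a prefix-factored pattern can be sketched roughly as follows. This is a minimal Python illustration of the technique, not frak's actual implementation (frak is written in Clojure and also emits character classes); the names `build_trie` and `trie_to_pattern` are hypothetical.

```python
import re

def build_trie(words):
    """Build a nested-dict trie; the empty-string key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root

def trie_to_pattern(node):
    """Render a trie node as a regex fragment with shared prefixes factored out."""
    ends_here = "" in node
    # One alternative per outgoing character, skipping the end-of-word marker.
    branches = [re.escape(ch) + trie_to_pattern(child)
                for ch, child in sorted(node.items()) if ch]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    # If a word may end at this node, everything after it is optional.
    return group + "?" if ends_here else group

words = ["foo", "foobar", "food", "bar", "baz"]
pattern = trie_to_pattern(build_trie(words))
print(pattern)  # (?:ba(?:r|z)|foo(?:bar|d)?)
```

Because each shared prefix appears only once, a backtracking engine that fails partway through `foo...` does not have to re-scan `foo` for every alternative.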
See, I still care about Vim (I'm not completely evil).
Hey, hey, hey. Back from the dark side.
Well, to start off: great work! I can tell you had a bit of fun writing Frak (great name). You have clearly fallen into the rabbit hole of regular expression optimization; I think you will soon find there's only one thing left to do: write your own regexp engine.
To be honest, I didn't like this patch at first; I guess I thought losing the greppability of things like "UnifiedCanadianAboriginalSyllabics" was a big loss, so I decided to look into the actual performance impact of Frak.
I had always assumed that regexp engines essentially do what Frak does, so when in doubt, write the plainest pattern you can. I've never looked into this, and I was apparently intensely curious.
Vim 7.4 has a new NFA style regexp engine (as you probably know), and it also verbosely logs every single step of its operation when in debug mode.
I tried using these log files to get a sense of the opcount of the match operations, but the files quickly grow to many gigabytes of data, so they were nearly impossible to diff (OOM errors).
One nice thing the logs do provide is full dumps of the regexp engine's state. They look like this:
(I suggest cloning the repo so you can vimdiff the log files, old vs new)
As you can see, the regexp engine does what's on the tin, without Frak-like optimizations. In the best case (the beastly Unicode block pattern), Frak reduces the size of the state table by ~35%; in the case of java, the reduction is ~11%.
Since the default NFA engine logging produces a lot of data, I patched Vim to get a high-level count of the number of match operations:
(I also have a branch containing the changes here:)
If you apply the patch and compile with -DDEBUG, you can use the following function to dump/compare the regexp logs for different syntax files:
Running this on the old syntax file vs the new one, there is a corresponding drop of 10-30% in the number of operations (don't quote me on this! consider the error bars to be ±20%). What's even more impressive is the reduction in memory usage: the allocated memory sometimes drops by over 50%!
High memory usage is low on my list of complaints about Vim, but given that Vim is found on routers and Raspberry Pis, this is actually significant.
So in the end, despite my pessimism, the proof is in the pudding: Frak is a win.
BTW, while we're shaving this mighty yak, I noticed that the pattern
Oh, and I added some tests for covering some ground we missed earlier:
These now fail after merging your changes; would you mind adding support for these cases?
All the new changes are in the
Wow! This is truly fascinating. I would have imagined some overall benefits, but this is quite impressive, especially with regard to the memory savings. I'm happy.
frak was more or less an experiment to see if this was the case and if it was possible to produce better patterns (provided you know your inputs). Considering what I've read and my understanding of backtracking, I had a hunch there was only so much optimization an engine could do without sacrificing performance. The fact that patterns will only be as good as your ability to write them doesn't completely surprise me.
As always, you have gone above and beyond to research the problem at hand and I think you understand a bit more about the underlying mechanics than I do. You're always teaching me new things. I'd like to know: how do you get the state table information in Java? This would be very useful to me because I've only used benchmarking to get a feel for whether or not the patterns are more performant.
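For context, the benchmarking-only approach mentioned above might look something like the sketch below. It is written in Python purely for illustration; Python's `re` is a backtracking engine, not Vim's NFA engine or `java.util.regex`, so any numbers it produces say nothing about those engines. The word list, the hand-factored pattern, and the helper names `same_matches` and `time_pattern` are all assumptions made for the example.

```python
import re
import timeit

words = ["foo", "foobar", "food", "bar", "baz"]

# Plain alternation, as one would write it by hand.
plain = re.compile("(?:" + "|".join(map(re.escape, words)) + ")")
# A prefix-factored equivalent of the same word list.
factored = re.compile("(?:ba(?:r|z)|foo(?:bar|d)?)")

# Inputs include near misses, which is where backtracking cost shows up.
samples = words + ["fo", "foob", "fooba", "ba", "bazz", "qux"]

def same_matches(a, b, inputs):
    """Check that both patterns accept exactly the same inputs."""
    return all((a.fullmatch(s) is None) == (b.fullmatch(s) is None)
               for s in inputs)

def time_pattern(pattern, inputs, repeat=5, number=2000):
    """Return the best-of-`repeat` time for matching every input `number` times."""
    run = lambda: [pattern.fullmatch(s) for s in inputs]
    return min(timeit.repeat(run, repeat=repeat, number=number))

if __name__ == "__main__":
    assert same_matches(plain, factored, samples)
    print("plain:   ", time_pattern(plain, samples))
    print("factored:", time_pattern(factored, samples))
```

Checking equivalence before timing matters: a factored pattern that is faster but accepts different input is not an optimization. Timings like these only give a rough feel, which is why engine state dumps are the more convincing evidence.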
Speaking of performance I think you're right about
At any rate, I think I can make this happen in frak. I was planning to include more branch comparison and optimization (think suffixes) in the next minor releases anyway. I just need a little time to make sure I get it right because it's a bit tricky.
With Vim 7.4, it just so happens that Bram added a ton of logging to the new regexp engine, so all we had to do was turn it on. We got lucky :)
Taking a look at java.util.regex.Pattern, there isn't a completely obvious way to get the same information, but there is a single private debugging method:
In order to play around with this, I added a copy of Pattern.java to the project, prepending it to the classpath:
Then I hacked up printObjectTree() to print more prettily, recursively expand branches, and dump TreeInfo data (which calculates min and max path lengths, as well as whether the state machine is deterministic or not):
Now that we're tailing the debugging output from our custom Pattern class, we can compile two different versions of a pattern and compare:
Now, we can't make a direct judgment about the efficiency of each pattern from the length of the log output because I added special handling for expanding Branch nodes, but not for expanding CharProperty nodes, etc. (Branch was the easiest one to expand)
However, if we look closely, you can see that both
On the other hand, the
In Vim 7.4,
That sounds really great! I wonder if we're at too high of a level to affect CPU cache misses / branch mispredictions.
Would you like me to merge your changes into master, or wait for more patches?
Now that the bug that cemerick found in #30 is fixed, I'd love to ship a new release to vim_dev in the next week.
If you're close to adding some final patches, I would like to wait for them, but if you're a little busy ATM, I think it would be just fine to ship with the original frak 0.1.1 changes (which contain the majority of the win).
No pressure! Everything is great.
No pressure at all!
I will admit I have been unusually busy the past few weeks and it's been affecting my turnaround just a bit. However, aside from the extra tests you wanted me to add, I think it would be fine to merge the initial work on this patch. Do you have a day in mind this week you're planning to release?
Thanks! Honestly, I was a little surprised by the amount of attention it got. I didn't even have an HN account before this week.
If it inspires people to learn Clojure (or add badass regular expression highlighting to their vim runtime file project), I'll be really happy.
I guess it's like me to take a month to wrap up a PR?
Could you double check that everything looks right? I rebased off the