Use frak to reduce backtracking #28
I had some new tests + bugfixes some months back, but then Bram introduced the new NFA regexp engine late into 7.9** and everything was broken for a while, so I decided to get out of Dodge until the dust settled. Now that things are settling down, I've pulled these changes from my git stash. None of them are complete, and they all require some thought.
This is an old failure; I believe more special characters need to be checked for preceding backslashes.
This became apparent after opening the new regexp optimization test buffer.
* noprompt-regex-backtracking:
  Small correction to clojureRegexpUnicodeCharClass
  Generate *CharClass patterns with frak
  Add frak
Conflicts:
  clj/src/vim_clojure_static/generate.clj
  syntax/clojure.vim
There are some regressions due to the newly added support for abbrev. character classes
|
Hey, hey, hey. Back from the dark side.

Well, to start off: great work! I can tell you had a bit of fun writing Frak (great name). You have clearly fallen into the rabbit hole of regular expression optimization; I think you will soon find there's only one thing left to do: write your own regexp engine.

To be honest, I didn't like this patch at first; I guess I thought losing the greppability of things like "UnifiedCanadianAboriginalSyllabics" was a big loss, so I decided to look into the actual performance impact of Frak. I had always assumed that regexp engines essentially do what Frak does, so when in doubt, write the plainest pattern you can. I've never looked into this, and I was apparently intensely curious.

Vim 7.4 has a new NFA style regexp engine (as you probably know), and it also verbosely logs every single step of its operation when in debug mode. I tried using these log files to get a sense of the opcount of the match operations, but the files quickly grow to many gigabytes of data, so they were nearly impossible to diff (OOM errors).

One nice thing the logs do provide is full dumps of the regexp engine's state. They look like this:

(I suggest cloning the repo so you can vimdiff the log files, old vs new)
https://gist.github.com/guns/6137147

Awesome! As you can see, the regexp engine does what it says on the tin, without Frak-like optimizations. In the best case (the beastly Unicode block pattern), Frak reduces the size of the state table by ~35%; in the case of java, the reduction is ~11%.
Since the default NFA engine logging produces a lot of data, I patched vim to get a high-level count of the number of match operations:
https://raw.github.com/guns/vim-clojure-static/development/clj/vim/custom-nfa-log.patch

(I also have a branch containing the changes here:)
https://github.com/guns/vim/tree/custom-nfa-log

If you apply the patch and compile with -DDEBUG, you can use the following function to dump/compare the regexp logs for different syntax files:
https://github.com/guns/vim-clojure-static/blob/development/clj/src/vim_clojure_static/test.clj#L82

Running this on the old syntax file vs the new one, there is a corresponding drop of 10-30% in the number of operations (don't quote me on this! consider the error bars to be ±20%).

What's even more impressive is the reduction in memory usage: the allocated memory sometimes drops by over 50%! High memory usage is low on my list of complaints about Vim, but given that vim is found on routers and Raspberry Pis, this is actually significant.

So in the end, despite my pessimism, the proof is in the pudding: Frak is a win.

BTW, while we're shaving this mighty yak, I noticed that the pattern

Oh, and I added some tests for covering some ground we missed earlier:

These now fail after merging your changes; would you mind adding support for these cases? All the new changes are in the

Thanks!
|
Wow! This is truly fascinating. I would have imagined some overall benefits, but this is quite impressive, especially with regard to the memory savings. I'm happy.

frak was more or less an experiment to see if this was the case and if it was possible to produce better patterns (provided you know your inputs). Considering what I've read and my understanding of backtracking, I had a hunch there was only so much optimization an engine could do without sacrificing performance. The fact that patterns will only be as good as your ability to write them doesn't completely surprise me.

As always, you have gone above and beyond to research the problem at hand, and I think you understand a bit more about the underlying mechanics than I do. You're always teaching me new things.

I'd like to know: how do you get the state table information in Java? This would be very useful to me because I've only used benchmarking to get a feel for whether or not the patterns are more performant.

Speaking of performance, I think you're right about

At any rate, I think I can make this happen in frak. I was planning to include more branch comparison and optimization (think suffixes) in the next minor releases anyway. I just need a little time to make sure I get it right because it's a bit tricky.
Of course. 😄 |
With Vim 7.4, it just so happens that Bram added a ton of logging to the new regexp engine, so all we had to do was turn it on. We got lucky :)

Taking a look at java.util.regex.Pattern, there isn't a completely obvious way to get the same information, but there is a single private debugging method:

In order to play around with this, I added a copy of Pattern.java to the project, prepending it to the classpath:
https://github.com/guns/vim-clojure-static/commit/53a7e427c998c7f74114c5a578b6d63f1e8d9fe9.patch

Then I hacked up printObjectTree() to print more prettily, recursively expand branches, and dump TreeInfo data (which calculates min and max path lengths, as well as whether the state machine is deterministic or not):

Now the

Now that we're tailing the debugging output from our custom Pattern class, we can compile two different versions of a pattern and compare:

(Pattern/compile "C[cfnos]")
(Pattern/compile "C(?:c|f|n|o|s)")

Now, we can't make a direct judgment about the efficiency of each pattern from the length of the log output because I added special handling for expanding Branch nodes, but not for expanding CharProperty nodes, etc. (Branch was the easiest one to expand.) However, if we look closely, you can see that both

On the other hand, the

In Vim 7.4,
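As a side note on the two patterns compiled above: the sketch below only checks that the character-class form and the expanded alternation accept exactly the same inputs on the stock java.util.regex.Pattern. It does not reproduce the node-tree dump, which needs the patched Pattern.java described in the comment. The class name CharClassVsAlternation and the helper agree() are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; not part of vim-clojure-static or frak.
final class CharClassVsAlternation {
    // The two spellings of the same pattern taken from the comment above.
    static final Pattern CHAR_CLASS = Pattern.compile("C[cfnos]");
    static final Pattern ALTERNATION = Pattern.compile("C(?:c|f|n|o|s)");

    /** True when both patterns give the same full-match verdict for the input. */
    static boolean agree(String input) {
        return CHAR_CLASS.matcher(input).matches()
            == ALTERNATION.matcher(input).matches();
    }

    public static void main(String[] args) {
        // A few accepted and rejected inputs; both patterns should agree on all of them.
        for (String s : new String[] {"Cc", "Cf", "Cn", "Co", "Cs", "Ca", "C", "Ccc"}) {
            System.out.println(s + " matches=" + CHAR_CLASS.matcher(s).matches()
                + " agree=" + agree(s));
        }
    }
}
```

Agreement on what is matched says nothing about how each form is compiled; that difference is what the engine dumps are for.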
That sounds really great! I wonder if we're at too high a level to affect CPU cache misses / branch mispredictions.
* master:
  Refactor \Q…\E region
  Add tests for new regexp char escapes
  Remove unused clojureRegexpEscapes
  Escape more special characters in regexps
Unnecessary due to changes in master
|
Hey @noprompt, Would you like me to merge your changes into master, or wait for more patches? Now that the bug that cemerick found in #30 is fixed, I'd love to ship a new release to vim_dev in the next week. If you're close to adding some final patches, I would like to wait for them, but if you're a little busy ATM, I think it would be just fine to ship with the original frak 0.1.1 changes (which contain the majority of the win). No pressure! Everything is great. |
|
No pressure at all! I will admit I have been unusually busy the past few weeks and it's been affecting my turnaround just a bit. However, aside from the extra tests you wanted me to add, I think it would be fine to merge the initial work on this patch. Do you have a day in mind this week you're planning to release?
No, I think any time during the next week would be nice. I pushed some changes to |
|
Just a heads up, I'll be working on some improvements to frak today and tomorrow. I should have a new version tested and ready to go by Saturday A.M. (PST). Once that's done, I'll pull in the new code and update those test cases. |
|
When I saw frak on the GitHub trending list and on HN, I was pretty excited: all those new eyeballs were definitely going to result in a finished patch. :) Congrats!
|
Thanks! Honestly, I was a little surprised by the amount of attention it got. I didn't even have an HN account before this week. If it inspires people to learn Clojure (or add badass regular expression highlighting to their vim runtime file project), I'll be really happy. |
|
Just a bit behind schedule. Trying to kill two birds with one stone. |
|
I pushed |
|
I guess it's like me to take a month to wrap up a PR? 😄 Seriously though, it's been a crazy month for me, but I'm glad to be sending you these patches. I keep looking at the comparisons between the old regexes and the new ones and it kind of gives me the chills (in a good way). The transformation is remarkable. Could you double check that everything looks right? I rebased off the |
|
This looks awesome! Thanks for tying up this loose end for us; the final result is really great. I'll cut a release later today and send it upstream. I look forward to collaborating with you in the future, even if it won't be Vim related. :) |
Hey buddy. Remember when I said
in #19?
Well guess what? 😄
Those crazy expressions haunted me up until last weekend when I finally decided to come up with a solution (frak). If you look at the diff you'll notice those gigantic expressions are significantly smaller. Two are a tad larger but that's because they're fully expanded (i.e. [abc] -> %(a|b|c)).
I know this sounds (and looks) crazy, but these expressions should definitely be more performant. All of the tests pass and, from what I can tell, everything seems to highlight fine.
See, I still care about Vim (I'm not completely evil). 😉
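For readers who have not seen frak, the general idea behind the smaller expressions can be sketched as follows: put the words into a trie, then emit one alternation per branching point so that shared prefixes are written (and matched) only once. This is a simplified illustration under stated assumptions, not frak's actual algorithm or output format. It emits Java-style (?:...) groups rather than Vim's %(...), never collapses single characters into [...] classes, and assumes a non-empty list of words made of BMP characters. The class name PrefixRegex and its methods are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch of prefix factoring; not frak's real implementation.
final class PrefixRegex {
    private static final class Node {
        // TreeMap keeps the alternatives in a stable, sorted order.
        final Map<Character, Node> children = new TreeMap<>();
        boolean terminal; // true when a word ends at this node
    }

    /** Builds a regex source string that matches exactly the given words. */
    static String build(List<String> words) {
        Node root = new Node();
        for (String word : words) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.terminal = true;
        }
        return emit(root);
    }

    /** Emits the pattern for everything that can follow the given node. */
    private static String emit(Node node) {
        if (node.children.isEmpty()) {
            return "";
        }
        String alternatives = node.children.entrySet().stream()
            .map(e -> escape(e.getKey()) + emit(e.getValue()))
            .collect(Collectors.joining("|"));
        // A single mandatory continuation needs no group at all.
        if (node.children.size() == 1 && !node.terminal) {
            return alternatives;
        }
        String group = "(?:" + alternatives + ")";
        // If a word ends here, the continuation is optional.
        return node.terminal ? group + "?" : group;
    }

    /** Backslash-escapes anything that is not a letter or digit. */
    private static String escape(char c) {
        return Character.isLetterOrDigit(c) ? String.valueOf(c) : "\\" + c;
    }

    public static void main(String[] args) {
        // Prints (?:bar|fo(?:b|o)): the shared prefix "fo" appears once.
        System.out.println(build(List.of("foo", "fob", "bar")));
    }
}
```

With the plain alternation foo|fob|bar, a backtracking matcher may re-scan the prefix "fo" for each alternative; the factored form commits to the prefix once and only branches on the final character.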