-
Couldn't load subscription status.
- Fork 50
Comprehensive regular expression support #19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Allow for highlighting of escape characters contained within clojureString. Also, update clojureCharacter to support all unicode values.
Conflicts: syntax/clojure.vim
|
Just a heads up, somehow my fork got a messed up and I'm not very good a fixing that sort of thing. How ever you decide to apply this patch is up to you. I'll have to fix my fork later. |
|
This is really awesome. No kidding about the masochism; this was pretty low on my TODO list for that reason :). I merged your changes into a branch called noprompt-regexp: https://github.com/guns/vim-clojure-static/tree/noprompt-regexp I am currently working on a test file for the syntax now that it's grown reasonably complex. I also see at least one amendment (multiple adjacent flags are not matched: But anyway, this is stellar! Let's collaborate on the noprompt-regexp branch before we merge into master. |
|
All of this sounds great. Nice catch about the multiple flags. It shouldn't be too difficult to rework the expression to handle them. Also linking to Thanks for creating the new branch and I appreciate your acceptance of these contributions. I've been coding a lot today, so I'll probably resume work on this tomorrow. |
|
Sorry for the slight lag on this. I've pushed the changes to the new branch. In addition to permitting multiple flags, I've added the optional #"(?mix:X)"
#"(?-mix:X)"
#"(?mix-mix:X)"As you've suggested I've switched most everything over to link to Also, the |
|
Outstanding. I think having a distinction between the character escapes and regexp specials is nice. Please allow me to push a test suite (using clojure.test, shelling out to vim) before we merge to master; I think it will take me a few hours tonight. Then we can identify any other corner cases, and document which regexp specials are deliberately not matched. |
|
Looking forward to it! |
Needs more discussion at this point. cf. #19
|
Okay, I've pushed up the beginnings of the syntax tests. You can start an nrepl from the I added tests for numbers (which caught some new bugs!) and some initial tests for character escapes in regular expressions (BTW I renamed clojureRegexpSpecialChar to clojureRegexpEscape to mirror clojureStringEscape). Some of the tests are currently failing, but I would like your feedback on which of these cases you are ignoring, which are difficult to get right, and which are actually errors. It would be great if you could start filling out the rest of the tests for Thanks! |
|
This is pretty sweet. Creating this test suite is nothing short of brilliant. Right off the bat I was able to fix two problems (just pushed the fixes). Nice work!
IgnoringI'm ignoring Also, I've ignored comments which are permitted inside DifficultPreviously I was ignoring syntax region clojureRegexpQuote start="\v\<@!\\Q" end="\\E"It's misleading because, and I could be wrong, but it appear's if you drop the user=> (re-find #"\Qabc\E" "abc")
"abc"
user=> (re-find #"\Qabc" "abc")
"abc"The documentation doesn't mention this behavior but I'm assuming from the results above my assumption is correct. Since I couldn't figure out a way to handle the case when the ErrorsWell a few of these are fixed already and it's just a matter of writing the remaining tests to check everything else. The convention in the test suite is clear, so I'm rolling with it. If I need to make any changes, I'll be sure to loop you in on them and get your thoughts. Again, I can't tell you how impressed I am by the test suite. I've looked at several projects working on syntax files and have maybe seen a test suite once. Anyway, I'll be pushing more code tonight. Edit: Update ignore section |
|
So I've gone ahead and written tests for all the current A few quick notes/questions:
That's it for now. I've updated the branch and breaking on this for today! :) |
In order to match only the delimiters a subregion is matched within \Q…\E offset by the length of the delimiters. This subregion is linked to the normal regexp group to signal that interpretation of the tokens within the region is supressed. cf. #19
|
I couldn't post my comment through this text field for some reason, so |
|
Thanks for solving the "\Q..\E" problem and explaining the solution. It helped a lot.
Completion is always a good thing IMHO and parsing the spec sounds fun. Is it overboard? Well, as long as it's not going to impact performance, and I doubt it will, it's probably fine. Just for kicks, I was able to use a little JavaScript from the browser console and lift the unicode categories from the Android docs with almost no effort. $($('table')[1]).
find('td:first-child').
text().
split(' ').
filter(function(x) {
return x.length > 1
}).join("|")
// => "Cn|Cc|Cf|Co|Cs|Lu|Ll|Lt|Lm|Lo|Mn|Me|Mc|Nd|Nl|No|Pd|Ps|Pe|Pc|Pi|Pf|Po|Sm|Sc|Sk|So|Zs|Zl|Zp"
I just took |
|
Going back to the spec parser. We'll probably want to extract the Regarding the I don't feel pressured at all to work on these problems at all. Regular expressions, parsing, and text manipulation problems are my favorite. Without a doubt, once this is complete, this syntax file will have arguably some of the most robust and correct regular expression pattern highlighting out there. It's very encouraging. |
|
By the way I think the problem you were having posting that last comment had to do with that special A looking character. You'll notice it's missing in my previous comment. Might be worth mentioning to the Github staff. |
Great, I'm glad you're game.
Oh you pronounce that
Interesting. Thanks for implementing that. Could you please tweak Also, it looks like you used And could you please rebase your changes off the new merge I just pushed? We've gone slightly out of sync.
Yes, makes sense.
Maybe this is platform dependent? Here is my
Then that's the last time I'll mention it. :)
Yes, totally agree. This is great stuff, and will trump ruby's implementation, which is one of the most feature-rich syntaxes I've seen. Java, lacking regexp literals, has none to begin with.
The irony of talking about Unicode issues is that sometimes you cannot actually directly mention the issues at hand. I wonder if this is JavaScript's fault; I seem to remember some hubbub about nodejs and >BMP characters a few months back. Did you have any thoughts about this:
|
Haha! Yeah. To me, pronouncing it "char" as in charbroiled seems a little strange. I always thought it made more sense to pronounce it "care" since it's short for character. That's how it works with The problem with Looks like I'm upgrading. I've fixed the issue with Thoughts on |
Makes sense :) I guess I just enjoyed saying
I'm going to start calling that "eint" to get on people's nerves.
What about linking it to String? It seems better than character or regexpEscape because those deal with individual characters, while \Q…\E marks off a whole string of chars. |
|
The idea to use |
|
Okay, then that's settled. I agree with you now about naming the POSIX, Java, and Unicode Here's another bit about the (re-find #"\p{L}" "Matches any letter") ; "M"
(re-find #"\pL\pP" "Single letter classes can omit braces!") ; "s!"This is quite a learning experience. I'll refrain from pushing to |
|
Good deal. Breaking them up will also help a bit with code generation. It certainly has been a huge learning experience for me too! Thanks for doing all the research. 👍 |
|
Whoa. Talk about crazy. Just pushed up the some work on the By they way, I updated Java so it should be printing out only what's currently valid. My incremental version looks like it's less than yours hopefully that shouldn't mean anything significant. I'm taking a break and coming back to this later. I think all that's left here is the Posix expressions, which should be a push over. Hopefully tomorrow I can start picking up the other small issues and write tests for them. Maybe we can ship this by the end of the weekend. 👍 |
Unfortunately, both work on my machine. This may actually be undefined behaviour. I think we should only allow the Also curious is that the Unicode block classes can occur with underscores, or without (but not some), and that not all the blocks in the current spec are supported: http://www.unicode.org/Public/UNIDATA/Blocks.txt Unicode actually has a section on implementing regular expression engines, but the actual Java implementation will obviously take precedence for us. I plan on kicking back a few beers with the reference implementation of java.util.regex.Pattern and the relevant Unicode sections tonight to sort this all out. Then, if you can push the final POSIX changes, I can make some informed decisions about these Unicode and Java corner cases, and we can merge with master on Monday night. Sound like a plan? |
|
Sounds like a plan to me. My allergies were kicking my ass all weekend so I haven't done any work since Friday. The valley this time of year can be rough.
It figures. I'm wondering if it has more to do with the OS than Java though (I'm running OS X). Spinning up a new Linux VM and checking is possible, but since you're volunteering to handle the edge cases I'll hold off on that for now. I'm about to head out for dinner and, now that you mention it, a beer. When I get back I'll wrap up POSIX, the Cheers! 🍻 |
It turns out I actually didn't not have the latest JDK for OS X. My version now matches yours and, sure enough, this problem has gone away. I've decided to hold off on dropping POSIX and I couldn't resist parsing the spec and generating the expressions from them. The print out, as you can imagine, is pretty crazy. Part of me feels like going the extra mile and optimizing the generated expressions to limit the amount of backtracking - which is going to be through the roof for Other than that, I think what we've come up with is more than acceptable. We've covered 99% of what most programmers will likely ever use in a regular expression and the extra 1% is icing on the cake. I think the only thing else I could ask for is comment highlighting Anyway, I'm looking forward to hearing about whatever you dig up from reference implementation! 🍰 Edit: Note about comments. |
|
Just want to let you know that I am still working on the Unicode cases. Everything looks great and hopefully I'll be able to merge and push tonight. I'd like to post to the VimClojure group for testers, but the true extent of the regexp matching will be invisible without a colorscheme that takes advantage of the new groups. Do you have a colorscheme that I could promote to the list for people interested in these changes? edit: s/syntax file/colorscheme/ |
|
Awesome man! I'm so excited.
I never thought you'd ask! 😄 My personal theme lite-brite is always being updated. Now that we've settled on highlight names and I know this is going to be merged, I'll take a few minutes to add support for the new syntax groups. |
Actually, there will be a couple of very minor changes ( lite-brite looks great. I'll ping you after I push tonight. |
|
No worries. I can always readjust things once everything is finalized and merged. |
|
How's everything coming along? Anything else I can do? |
|
I'm sorry for the holdup; I've had some trouble finding time to hack this week. I finished the majority of the changes on Tuesday but I held off committing and pushing until I had a chance to review the fringes. Thanks for being patient! Here are a summary of the changes in the new push. I'd like your input before I merge to master. Character Property ClassesUnfortunately, the Unicode specs for character property names do not exactly match the reference implementation (as you know). So now all of the Character Property Classes are now derived from reflection, and the majority from private fields. I know this is not acceptable in normal programs, but the data was just too good to pass up, especially since the private data included aliases that were not programmatically generated from the official names. Property Prefixes
Code structure
Syntax file
(So actually, there were no visible syntax group changes) I think that covers it! I haven't added new tests yet because I wanted to push sooner than later. Sorry that your work on parsing the specs got replaced; I hope you'll agree that the wordlists procured from the JVM are different enough (neither subset nor superset) from the Unicode standards to merit just prying it out from Java's greedy private fingers. |
|
Everything looks awesome to me. I like the work you've done on refactoring/renaming and, honestly, I can't think of anything to pick a bone with. Nice job!
Actually, the changes you are made fantastic. It makes more sense to lift what we need from the JVM and get it right rather than parsing the spec and only getting so close. Java isn't a language I know and reading through the new code has taught me a few things. At any rate, I'm happy with all of this. If you're ready to merge, I have no objections. Great job! |
|
Alright, then this wraps this up! Thank you very much for stepping up to the plate and getting this done. Would've taken me months to tackle it, if at all. If lite-brite is ready, I'll post a message to the VimClojure group for testers and also start adding more tests. |
|
Okay, I posted to the list: https://groups.google.com/forum/?fromgroups=#!topic/vimclojure/8V8NFa1qFu0 |
|
Awesome! I'll be pushing some updates to the theme here in a moment to take advantage of the new groups. Sorry for the lag on that, I was in Yosemite all day! |
Whew! This was masochistic. 98% of the patterns should be covered here. (I skipped
\Q..\Epatterns because I wasn't sure the best way to handle them.) Overall, I think this is much more comprehensive than other language syntax files I've seen. It will be fun updating my color scheme to take advantage of them.Most of the new syntax groups are linked to
Specialbecause that seemed to make the most sense.Alright! I'm gonna take a little break from this now and check back later.