Detect collision in regular expressions when creating a lexer #76
Happily, there is already a library for this, https://github.com/qntm/greenery, although:
I found greenery about a year ago, and was excited to use it for Lark. But it had two major problems:
Maybe I misevaluated it, either from a technical or conceptual standpoint, so feel free to correct me. But I'm not interested in adding a dependency unless it can handle, say, over 95% of my users' grammars. Btw, my interest in this feature isn't just for better error handling. Having this feature in place paves the way to a stable Earley+LALR integration, which, similarly to GLR+LALR integration, can create a huge performance boost for many Earley grammars.
I think you're entirely right, which means this might take a new library to solve 😕
After looking into greenery a bit, I figured out that the performance is acceptable, if used correctly. The main drawback in performance is not actually the calculation of the intersection, but the reconstruction of that intersection into a regex. If you skip this step (which is relatively easy), the performance is acceptable. A small example:

```python
from lark import Lark
from greenery.lego import pattern

grammar = Lark(open("grammar.lark"))
regexps = [(term, pattern.parse(term.pattern.to_regexp()))
           for term in grammar._build_lexer().terminals]
fsms = [(term, regex.to_fsm()) for term, regex in regexps]

for i, (term_a, fsm_a) in enumerate(fsms):
    for term_b, fsm_b in fsms[i+1:]:
        if not fsm_a.isdisjoint(fsm_b):
            print(f"Collision in Terminal {term_a} and {term_b}")
```

This prints all collisions between terminals in a grammar. The problem is that the regexes (and sometimes the string literals) have to be modified to be acceptable, because greenery does not support the same syntax as normal Python. There are many differences one may have to account for, but the biggest problem is that lookaheads and lookbacks are missing. If the regexes are compatible, this script works great. (For the smaller issues I am already looking for solutions.)
@MegaIng Awesome, that is very encouraging! If the slowdown that's introduced isn't too noticeable, then perhaps it's still worth doing as an optional feature. Even if it doesn't catch everything (i.e. complex regexps), it might still be useful. I have no idea how hard it would be to add look(aheads|backs) to greenery. It seems like the sort of thing regexp engines would implement as a separate operation, rather than transforming the DFA, and if that's the case then their success with the former is no indication of the difficulty of the latter. Speaking ignorantly, I'm guessing it's pretty difficult to solve. It does seem that intersection is involved there somehow too, except that the regexp has to continue where the intersection had stopped.
I think I managed to add support for lookaheads (not lookbacks). You can see an example on my fork of greenery: https://github.com/MegaIng/greenery. I think lookbacks are possible, but way harder. Btw, most of the other small problems in syntax I also fixed with a new parser for the regexes.
Welp, I forgot the edge case where the lookahead is at the end of the string. This will take a moment to implement (but will make a huge step in the direction of lookbacks).
@MegaIng That sounds fantastic! I can't wait to give it a try (but I'll have to wait, life etc.) Keep me posted on your progress, please!
Ok, I added support for lookaheads at the end of regexps (and prepared a bit for lookbacks). Now I am pretty sure that lookaheads work, except lookaheads of infinite length; I'm not sure how to handle those. The current implementation handles them, but I am not sure if that is correct. Btw, do you have a big grammar which I could use for testing?
Nice, I'll look into it pretty soon! I think the biggest real grammar I have is But it's possible to create arbitrarily big grammars if necessary
I am now using that grammar and it is showing me a few missing features (e.g. flags). I would really like to have a grammar that tests all possible edge cases (weird regex constructs, weird overlaps between regexes, etc.). I am not sure what I really need to implement. I am currently implementing most of normal python re, if possible. In that sense, the original author of greenery has listed a few things that will be impossible to implement with his backend:
Ok, I added basic flag support. I now realized that the current model of how I am checking the regexes is super inefficient in big grammars, since I include all strings. I am not sure what the right behavior is.
@MegaIng Very cool! It definitely sounds like a good starting point. Well, you don't have to check all terminals, only the regexps. Constant strings are easy. Most grammars I've seen don't realistically include more than 20 regexps, which would cap at half a second by your measurements. But that's still much slower than I hoped. I assume a large portion of the run-time comes from parsing the regexps into some internal representation? And I assume right now that is done for each possible pair? We could probably cut a lot of run-time by parsing it once and storing the internal representation. As for a smarter way, that doesn't sound likely, but I'm open to ideas.
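To illustrate why constant strings are easy: a literal collides with a regexp exactly when the regexp fully matches the literal string, so plain `re.fullmatch` suffices for those pairs and the expensive FSM intersection can be reserved for regexp/regexp pairs. A rough sketch (the helper name and data shapes are hypothetical, not part of lark or greenery):

```python
import re

def literal_regexp_collisions(literals, regexps):
    """Find literal terminals that are shadowed by regexp terminals.

    literals: mapping of terminal name -> literal string
    regexps:  mapping of terminal name -> regexp source
    A literal collides with a regexp exactly when the regexp fully
    matches the literal string, so no FSM machinery is needed here.
    """
    collisions = []
    for lit_name, lit in literals.items():
        for re_name, source in regexps.items():
            if re.fullmatch(source, lit):
                collisions.append((lit_name, re_name))
    return collisions

# The classic keyword-vs-identifier overlap:
print(literal_regexp_collisions({"IF": "if"}, {"NAME": "[a-z]+"}))
# → [('IF', 'NAME')]
```

Literal/literal pairs are even simpler (string equality), so only the regexp/regexp pairs need the FSM treatment.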
You can see my current progress at https://github.com/MegaIng/greenery (entry point is grammar_checker.py). I already did many of the obvious optimizations (including pre-computing all regexes). For 13 regexes (the number of 'real' regexes inside of the python3 grammar), this system leads to the following timing percentages:
It sounds like optimizing
I managed to save around 1 sec by changing a bit of behavior in the FSM inner workings. This is now down to around 0.7-0.8 seconds, which feels way more acceptable.
Yes, that definitely brings it to the level of usability! It's still significant enough that it might be best to keep it optional, but at the 20% slowdown range, it's definitely worth using, especially for debugging.
OK, I got it down to 0.6 secs, which is not that much above your goal. There might be many small improvements that can be done, bringing it to around 0.5 secs or lower.
Nice. If you can make it even faster, please do. Btw, what do you think are the odds that the original maintainer will merge your changes?
Some are likely, some are not likely. We could just mention @qntm and see if he has anything to say in regards to this. Some of the files I currently have on my repo are just test files without actual use in greenery. I am also thinking about a complete refactoring of the fsm file to conform to PEP-8, which probably is not in his interest.
Hallo! I'm thrilled that you're getting some genuine use out of greenery. I haven't looked at @MegaIng's code yet so I'm just reading these comments for now.

Yes, I'm afraid performance was not a major goal for greenery, I just wanted to see if I could do it. I see you have already found some ways to improve performance. One particular performance-related issue you may find is that if greenery sees e.g.

I also think there's a serious possibility that there are other, more performant and well-maintained DFA libraries for Python, which you could plumb in instead of my

Python's regular expression syntax is pretty huge, so you can imagine why I didn't add support for all (or even most) of it. I see you've added some of the more obvious omissions back. There are also some omissions in the area of subtle regex parsing quandaries, e.g. is

I believe the word boundary metacharacter

Line ending characters

On the topic of anchoring (using

(I believe it's the

On the topic of accepting changes to greenery: the real question is whether a delay on my part in accepting changes would block/slow development for you. I consider greenery to be a completed project aside from bug fixes, and have done for some years. I also don't know a whole lot about the Python ecosystem. In other words I can't guarantee to review, merge and publish major changes with a quick turnaround time, and I don't want to be a bottleneck. So, it may be that your simplest course of action is to fork my code, call your new project "greenier" or something, and handle it yourselves from there. If not, then I would at least want to bump the major version number.
Hi @qntm, thanks for the fast response. First, I did not find a DFA/FSM library that has the required tools to make the creation of them from regexes as easy as it is with your system. Also, the performance is acceptable (I got it down to 0.3-0.4 seconds, with still possibilities to improve). The ideas you listed are interesting and I will certainly look into them. But they already made me realize one central problem that currently exists (ok, two: lookbacks are still not implemented): nested lookaheads that go beyond the bounds of the group are unsupported and will just make the regex unusable. To the point of merging changes: I think it will be easiest if I basically completely untie my code from the old usage of greenery and create a new library, maybe even just as a part of lark. What do you think @erezsh? This will make it easy to implement fixes/changes/additions.
Hi @qntm, I also thank you for the quick reply, and for creating your library, which is proving very useful to us. There may be many other libraries, but most can't intersect regexps, and yours is written much better (or so was my impression a while ago, when I looked around).

It sounds to me that both you and @MegaIng agree against merging back his work. It will slow us down, and provide you with work that you don't quite consider necessary. Generally, my inclination is to integrate rather than fragment, which is why I suggested it, but I think that in this case a fork makes sense.

@MegaIng Regarding a separate library or as part of Lark: That really depends on how big the module would be, if stripped of the parts unnecessary for Lark. And also on your wishes regarding it.

Regarding regexps that we can't reason about: As long as we can throw an exception in this case, that's fine. We don't have to support every edge-case for this to be useful. But to report "no intersection" when there actually is one would in many ways defeat the purpose of this endeavor.
I found a library, pyfsa/FSA (I'm not sure which is the correct name), which has been unmaintained since 2004. It also had 'Non-goal: efficiency' in the documentation. If I wanted to use it, I would have to translate it from python2 to python3 and probably make a lot of changes to speed it up. I found a library

In any case, there seems to be no existing package that can do exactly what we need, so we need to develop a new one. I assume that this will result in 3 files, with a total of around (in the end) 1500 LOC. One of them is a util that can easily be removed or integrated into another file or package.
If it's 1500 LOC, a separate package is probably better. Also, that leaves room for expanding it in the future, perhaps returning some of the functions we removed from greenery (or adding them, if you plan to build on a different library).
@erezsh Should we create the library inside the lark-parser namespace, or should I do that on my personal profile? (Inside the lark-parser namespace it has a higher chance of being maintained forever.)
Also, does anyone have nice ideas for names? Since I am not that creative, I would currently name it
@MegaIng I think it's best if you start in your own user. When it's mature for use (even alpha stage), we can move it to lark-parser. Name suggestions:
I'll think some more and see if I can come up with something even more ridiculous :)
Ok, I think the time-overhead is now acceptable. The 13 regexes from the python3-grammar are now checked in 0.1 to 0.15 seconds. Regarding the name: I currently like 'interegular' the best and I think I will upload the package tomorrow.
I finally came around to publishing it (with the name interegular). While it shouldn't be used in production right now, you can certainly start integrating it, @erezsh. While I still have a few things to do, I think I am mostly finished with the interface. (Unless you have wishes on how to change it.)
Very cool! What do you mean by integrating?

Q: How do I check if regexes simply collide or not, most efficiently? (i.e. a boolean result) Just use

Another question: Would it be possible to test if a regex is a full subset of another one? (i.e. everything the first can match, the second can too?) Very important
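On the subset question: at the FSM level, L(A) ⊆ L(B) holds iff the product automaton of A and B never reaches a state pair where A accepts and B rejects. Below is a minimal self-contained sketch over toy dict-based complete DFAs — a made-up representation for illustration, not interegular's or greenery's actual interface:

```python
from collections import deque

def dfa_subset(dfa_a, dfa_b, alphabet):
    """Return True iff every string accepted by dfa_a is accepted by dfa_b.

    Each DFA is a (start, accepting_states, transitions) triple with a
    total transition function mapping (state, symbol) -> state.  We search
    the product automaton for a counterexample: a reachable state pair
    where A accepts but B rejects.
    """
    start = (dfa_a[0], dfa_b[0])
    seen = {start}
    queue = deque([start])
    while queue:
        a, b = queue.popleft()
        if a in dfa_a[1] and b not in dfa_b[1]:
            return False  # a string A accepts but B rejects
        for sym in alphabet:
            nxt = (dfa_a[2][(a, sym)], dfa_b[2][(b, sym)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

# Toy DFAs over {"a", "b"}: A accepts exactly "a", B accepts one or more "a"s.
A = (0, {1}, {(0, "a"): 1, (0, "b"): 2, (1, "a"): 2, (1, "b"): 2,
              (2, "a"): 2, (2, "b"): 2})
B = (0, {1}, {(0, "a"): 1, (0, "b"): 2, (1, "a"): 1, (1, "b"): 2,
              (2, "a"): 2, (2, "b"): 2})
print(dfa_subset(A, B, "ab"))  # → True
print(dfa_subset(B, A, "ab"))  # → False ("aa" is a counterexample)
```

With greenery-style FSM objects the same check can presumably be phrased with set-like operations (a difference plus an emptiness test).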
With 'integrating', I mean implementing/using it inside of lark.
Well, I thought it could be used for optimization. If regex A is contained inside regex B, it means I only have to test A on the matches of regex B. But on second thought, that's probably not very useful, and might even hurt performance in some unusual situations.
When using a standard or contextual lexer, colliding regular expressions may produce an incorrect parser, one that only produces an error when encountering the obscured terminal in the input text.
For example, the following definitions:
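A hypothetical pair of such definitions (illustrative only, not necessarily the issue's original example) would be the classic keyword-vs-identifier overlap:

```lark
IF: "if"
NAME: /[a-z]+/
```

Here everything IF can match is also matched by NAME, so the lexer has to prioritize one of them.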
Will cause the standard lexer to always choose one over the other, but it will only produce an error when it makes the wrong choice at run-time (and in some cases, will even produce an unintended parse without producing an error).
Ideally, this would be the correct behavior: If regexp1 contains regexp2 completely, use secondary lexing (i.e. the "unless" mechanism that already exists for strings), else throw an error.
Presumably, the solution lies in running an intersection on the NFAs.
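The intersection can be sketched with the standard product construction: walk both machines in lock-step over shared input symbols and report a collision as soon as both are in accepting states. A minimal self-contained illustration over toy complete DFAs (a made-up representation for illustration; real NFAs would first go through a subset construction):

```python
from collections import deque

def dfas_collide(dfa_a, dfa_b, alphabet):
    """Return True iff some string is accepted by both DFAs.

    Each DFA is a (start, accepting_states, transitions) triple with a
    total transition function mapping (state, symbol) -> state.  The
    languages intersect iff a state pair where both machines accept is
    reachable in the product automaton.
    """
    start = (dfa_a[0], dfa_b[0])
    seen = {start}
    queue = deque([start])
    while queue:
        a, b = queue.popleft()
        if a in dfa_a[1] and b in dfa_b[1]:
            return True  # both accept here, so the terminals collide
        for sym in alphabet:
            nxt = (dfa_a[2][(a, sym)], dfa_b[2][(b, sym)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Toy DFAs over {"a", "b"}: ONLY_A accepts exactly "a",
# A_PLUS accepts one or more "a"s, ONLY_B accepts exactly "b".
ONLY_A = (0, {1}, {(0, "a"): 1, (0, "b"): 2, (1, "a"): 2, (1, "b"): 2,
                   (2, "a"): 2, (2, "b"): 2})
A_PLUS = (0, {1}, {(0, "a"): 1, (0, "b"): 2, (1, "a"): 1, (1, "b"): 2,
                   (2, "a"): 2, (2, "b"): 2})
ONLY_B = (0, {1}, {(0, "a"): 2, (0, "b"): 1, (1, "a"): 2, (1, "b"): 2,
                   (2, "a"): 2, (2, "b"): 2})
print(dfas_collide(ONLY_A, A_PLUS, "ab"))  # → True
print(dfas_collide(ONLY_A, ONLY_B, "ab"))  # → False
```

Because the search stops at the first common accepting pair, a yes/no collision answer avoids the expensive FSM-to-regex reconstruction step discussed earlier in the thread.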