
Detect collision in regular expressions when creating a lexer #76

Open
erezsh opened this issue Jan 29, 2018 · 33 comments

erezsh (Member) commented Jan 29, 2018

When using a standard or contextual lexer, colliding regular expressions may produce an incorrect parser, one that only raises an error when it encounters the obscured terminal in the input text.

For example, the following definitions:

VAR: /\w+/
FILENAME: /[\w.]+/

Will cause the standard lexer to always choose one over the other, and an error will only appear when it makes the wrong choice at run time (in some cases, it will even produce an unintended parse without any error).

Ideally, this would be the correct behavior: If regexp1 contains regexp2 completely, use secondary lexing (i.e. the "unless" mechanism that already exists for strings), else throw an error.

Presumably, the solution lies in running an intersection on the NFAs.
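
For illustration, here is a minimal, hypothetical sketch of the failure mode (the grammar and inputs are invented; lexer="standard" is the option name used by Lark releases from this period, later renamed to "basic"):

from lark import Lark

# VAR and FILENAME collide: everything VAR matches, FILENAME matches too.
collision_grammar = r"""
start: assign | load
assign: VAR "=" VAR
load: "<" FILENAME ">"

VAR: /\w+/
FILENAME: /[\w.]+/
"""

parser = Lark(collision_grammar, parser="lalr", lexer="standard")

# The parser builds without complaint, but one of these two inputs will fail
# at run time, depending on which of the colliding terminals the lexer prefers.
parser.parse("<notes.txt>")
parser.parse("a=b")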

Zac-HD (Contributor) commented Feb 4, 2019

Happily, there is already a library for this, https://github.com/qntm/greenery, although:

  • It doesn't exactly match Python's regex syntax. I once tried to add support for that in qntm/greenery#37 ("Are you interested in contributions?"), but moved on when it became clear that I'd spend more time on adding validation logic than I wanted to for an evening fun project. Definitely tractable, though.
  • Python's regex, and "PCRE" generally, don't always describe regular languages. Fortunately it's trivial to detect these features from the compiled regex's parse tree (sketched below), but they still need to be handled, either by raising an error or by bailing out of the attempt to detect collisions.
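
As a rough sketch of that parse-tree inspection (not part of greenery or Lark; it leans on Python's undocumented sre_parse module, so the exact opcodes and structure are assumptions that may shift between Python versions):

import sre_parse

# Opcodes that make a pattern non-regular, and therefore impossible to turn
# into a DFA: backreferences and conditional matching.  Lookarounds (ASSERT /
# ASSERT_NOT) could be added here if the FSM backend cannot handle them.
NON_REGULAR = {"GROUPREF", "GROUPREF_EXISTS"}

def unsupported_ops(regex: str) -> set:
    found = set()

    def walk(subpattern):
        for op, arg in subpattern:
            if str(op) in NON_REGULAR:
                found.add(str(op))
            # Recurse into nested subpatterns (groups, repeats, branches, ...)
            items = arg if isinstance(arg, (tuple, list)) else [arg]
            for item in items:
                if isinstance(item, sre_parse.SubPattern):
                    walk(item)
                elif isinstance(item, (tuple, list)):
                    for sub in item:
                        if isinstance(sub, sre_parse.SubPattern):
                            walk(sub)

    walk(sre_parse.parse(regex))
    return found

print(unsupported_ops(r"(\w+)\s+\1"))  # {'GROUPREF'}
print(unsupported_ops(r"[\w.]+"))      # set()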

erezsh (Member, Author) commented Feb 4, 2019

I found greenery about a year ago, and was excited to use it for Lark. But it had two major problems:

  1. As you said, it doesn't work with Python's regexp syntax. What's more, it doesn't support common additions, such as lookaheads, which I believe are regular.

  2. I was very discouraged by its performance. To be fair, greenery tries to construct a new regexp that matches the intersection, while all I need to know is if an intersection exists. Still, non-trivial regexps could take full seconds to evaluate, which is a no-go.

Maybe I misevaluated it, either from a technical or conceptual standpoint, so feel free to correct me. But I'm not interested in adding a dependency unless it can handle, say, over 95% of my users' grammars.

Btw, my interest in this feature isn't just for better error handling. Having this feature in place paves the way to a stable Earley+LALR integration, which, similarly to GLR+LALR integration, can create a huge performance boost for many Earley grammars.

Zac-HD (Contributor) commented Feb 4, 2019

I think you're entirely right, which means this might take a new library to solve 😕

MegaIng (Member) commented Jul 21, 2019

After looking into greenery a bit, I found that the performance is acceptable if it is used correctly. The main cost is not actually the calculation of the intersection, but the reconstruction of that intersection back into a regex. If you skip this step (which is relatively easy), the performance is fine. A small example:

from lark import Lark

from greenery.lego import pattern

# Note: _build_lexer() is a private Lark API; it exposes the compiled terminal definitions.
grammar = Lark(open("grammar.lark"))
regexps = [(term, pattern.parse(term.pattern.to_regexp())) for term in grammar._build_lexer().terminals]
# Build each FSM once up front, then check every pair for a non-empty intersection.
fsms = [(term, regex.to_fsm()) for term, regex in regexps]
for i, (term_a, fsm_a) in enumerate(fsms):
    for term_b, fsm_b in fsms[i+1:]:
        if not fsm_a.isdisjoint(fsm_b):
            print(f"Collision in Terminal {term_a} and {term_b}")

This prints all collisions between terminals in a grammar. The problem is that the regexes (and sometimes the string literals) have to be modified to be acceptable, since greenery does not support the same syntax as normal Python. There are many differences one may have to account for, but the biggest problem is that lookaheads and lookbehinds are missing. If the regexes are compatible, this script works great. (For the smaller issues I am already looking for solutions.)

erezsh (Member, Author) commented Jul 21, 2019

@MegaIng Awesome, that is very encouraging! If the slowdown that's introduced isn't too noticeable, then perhaps it's still worth doing as an optional feature. Even if it doesn't catch everything (i.e. complex regexps), it might still be useful.

I have no idea how hard it would be to add look(aheads|behinds) to greenery. It seems like the sort of thing regexp engines would implement as a separate operation, rather than by transforming the DFA, and if that's the case, then their success with the former is no indication of the difficulty of the latter. Speaking ignorantly, I'm guessing it's pretty difficult to solve. It does seem that intersection is involved there somehow too, except that the regexp has to continue where the intersection stopped.

MegaIng (Member) commented Jul 22, 2019

I think I managed to add support for lookaheads (not lookbehinds). You can see an example on my fork of greenery: https://github.com/MegaIng/greenery. I think lookbehinds are possible, but way harder. Btw, I also fixed most of the other small syntax problems with a new parser for the regexes.

MegaIng (Member) commented Jul 23, 2019

Welp, I forgot the edge case where the lookahead is at the end of the string. This will take a moment to implement (but it will be a big step in the direction of lookbehinds).

erezsh (Member, Author) commented Jul 23, 2019

@MegaIng That sounds fantastic! I can't wait to give it a try (but I'll have to wait, life etc.)

Keep me posted on your progress, please!

MegaIng (Member) commented Jul 23, 2019

Ok, I added support for lookaheads at the end of regexps (and prepared a bit for lookbehinds). Now I am pretty sure that lookaheads work, except for lookaheads of infinite length; I'm not sure how to handle those. The current implementation handles them, but I am not sure if that is correct. Btw, do you have a big grammar which I could use for testing?

erezsh (Member, Author) commented Jul 23, 2019

Nice, I'll look into it pretty soon!

I think the biggest real grammar I have is examples/python3.lark

But it's possible to create arbitrarily big grammars if necessary

MegaIng (Member) commented Jul 23, 2019

I am now using that grammar and it is showing me a few missing features (e.g. flags). I would really like to have a grammar that tests all possible edge cases (weird regex constructs, weird overlaps between regexes, etc.). I am not sure what I really need to implement; I am currently implementing most of normal Python re, where possible. In that sense, the original author of greenery has listed a few things that will be impossible to implement with his backend:

  • back-references: a feature I did not know existed in Python re until I started to implement regex (something different) and looked into the documentation. I don't think I will need to implement that, and I truly believe the author in this regard.
  • conditional matching: another feature to which the same points apply: not possible, not needed.
  • He writes a paragraph about the 'metacharacters' ^ and $. I still have the exact same problem and no idea how to solve it. (The same applies to some special groups, e.g. \b and \B, as well as \A and \Z.) I have no idea how I could implement them (well, I have an idea for \b, but that requires conditional matching). I will see how it goes.

MegaIng (Member) commented Jul 24, 2019

Ok, I added basic flag support (s and i). Lookbehinds are still not done. Also, Unicode support is missing completely. I think the system can now be used, and we could start thinking about how to integrate this into lark.

I now realize that the current model of how I am checking the regexes is super inefficient in big grammars, since I include all strings. I am not sure what the right behavior is.
The time it takes is currently critically high: checking the python3 grammar (all possible terminal combinations) with its 95 terminals takes around 12 sec. At 95*94=8930 possible combinations, this means ~0.001 sec per combination, which is not much by itself, but for actual usage as default behavior in lark this is unacceptable. Maybe there is a smarter way of checking for collisions?

erezsh (Member, Author) commented Jul 24, 2019

@MegaIng Very cool! It definitely sounds like a good starting point.

Well, you don't have to check all terminals, only the regexps. Constant strings are easy. Most grammars I've seen don't realistically include more than 20 regexps, which would cap at half a second by your measurements. But that's still much slower than I hoped.

I assume a large portion of the run-time comes from parsing the regexps into some internal representation? And I assume right now that is done for each possible pair? We could probably cut a lot of run-time by parsing each regexp once and storing the internal representation.

As for a smarter way, that doesn't sound likely, but I'm open to ideas.
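
For illustration, a sketch of those two suggestions combined, reusing the entry points from the earlier snippet (pattern.parse / to_fsm / isdisjoint from MegaIng's fork, plus Lark's private _build_lexer() and PatternRE, both internal details that may differ between versions): skip string-literal terminals and build every FSM exactly once before the pairwise pass.

from itertools import combinations

from greenery.lego import pattern
from lark import Lark
from lark.lexer import PatternRE

terminals = Lark(open("grammar.lark"))._build_lexer().terminals

# Only 'real' regexp terminals take part in the quadratic comparison;
# constant strings can be checked against the regexps separately and cheaply.
regexp_terms = [t for t in terminals if isinstance(t.pattern, PatternRE)]

# Parse each regexp and build its FSM once, up front.
fsms = {t.name: pattern.parse(t.pattern.to_regexp()).to_fsm() for t in regexp_terms}

for (name_a, fsm_a), (name_b, fsm_b) in combinations(fsms.items(), 2):
    if not fsm_a.isdisjoint(fsm_b):
        print(f"Collision between {name_a} and {name_b}")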

MegaIng (Member) commented Jul 25, 2019

You can see my current progress at https://github.com/MegaIng/greenery (entry point is grammar_checker.py). I already did many of the obvious optimizations (including pre-computing all regexes). For 13 regexes (the number of 'real' regexes inside the python3 grammar), this system gives the following timing percentages:

  • 55% (around 2 secs) is used by the Lark constructor
  • The parsing of the regexes into internal representation 1 (IR 1) does not even show up in the stats
  • 40% (around 1.5 secs) is used by the complete comparison algorithm
  • This is split into two parts:
    • 35% calculating IR 2
    • 5% doing the actual comparison

As you can see, with the reduced number of regexps the timing is way better (and the process of creating IR 2 might still be optimizable).

erezsh (Member, Author) commented Jul 25, 2019

It sounds like optimizing to_fsm is the key to good performance.

MegaIng (Member) commented Jul 25, 2019

I managed to save around 1 sec by changing a bit of behavior in the FSM inner workings. This is now down to around 0.7-0.8 seconds which feels way more acceptable.

erezsh (Member, Author) commented Jul 25, 2019

Yes, definitely brings it to the level of usability!

It's still significant enough that it might be best to keep it optional, but at the 20% slowdown range, it's definitely worth using, especially for debugging.

MegaIng (Member) commented Jul 25, 2019

OK, I got it down to 0.6 secs, which is not that much above your goal. There might be many small improvements that can still be made, bringing it to around 0.5 secs or lower.

erezsh (Member, Author) commented Jul 25, 2019

Nice. If you can make it even faster, please do.

Btw, what do you think are the odds that the original maintainer will merge your changes?

MegaIng (Member) commented Jul 25, 2019

Some are likely, some are not. We could just mention @qntm and see if he has anything to say in regard to this. Some of the files I currently have in my repo are just test files without actual use in greenery. I am also thinking about a complete refactoring of the fsm file to conform to PEP 8, which probably is not in his interest.
I also made one fundamental change in the fsm file that dramatically boosts performance, but completely breaks the original use case (changing the .reduce() at the end of the basic operations).

qntm commented Jul 25, 2019

Hallo! I'm thrilled that you're getting some genuine use out of greenery. I haven't looked at @MegaIng's code yet so I'm just reading these comments for now.

Yes, I'm afraid performance was not a major goal for greenery, I just wanted to see if I could do it. I see you have already found some ways to improve performance.

One particular performance-related issue you may find is that if greenery sees e.g. \w in a regular expression, it will convert it into a finite state machine with at least 63 possible input tokens ("A" to "Z", "a" to "z", "0" to "9" and "_"), plus a 64th for anything_else. This proliferation can make operations on the resulting DFA more time-consuming than necessary. Whereas, if the whole expression was something like /\w|aa/, you would really only need three tokens: "a", "everything in \w apart from 'a'", and anything_else. So, potentially there's some room for improvement here.

I also think there's a serious possibility that there are other, more performant and well-maintained DFA libraries for Python, which you could plumb in instead of my fsm module. I don't know, though.

Python's regular expression syntax is pretty huge, so you can imagine why I didn't add support for all (or even most) of it. I see you've added some of the more obvious omissions back. There are also some omissions in the area of subtle regex parsing quandaries, e.g. is [-abc] a character class matching any of "-", "a", "b" or "c"? Python probably thinks it is, greenery considers it a syntax error for simplicity. Some of these omissions (most prominently backreferences) are impossible to remedy, since they are definitely not strictly regular. (I would have said that lookahead/lookbehind assertions were not strictly regular either, but you're saying you implemented them, so nice job!)

I believe the word boundary metacharacter \b is effectively a compound lookaround assertion: it is an alternation between (1) lookbehind assertion of the beginning of the string, or a non-word character, followed by lookahead assertion of a word character or (2) lookbehind assertion of a word character, followed by lookahead assertion of a non-word character or the end of the string. That might help...?
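
As a quick sanity check of that description with plain Python re (nothing greenery-specific; the rewritten pattern below is just an illustration):

import re

# \b as the alternation described above: either "not preceded by a word char
# (including start of string) and followed by one", or "preceded by a word
# char and not followed by one (including end of string)".
WORD_BOUNDARY = r"(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))"

for text in ["foo bar", "a.b", "__x__", ""]:
    native = [m.start() for m in re.finditer(r"\b", text)]
    rewritten = [m.start() for m in re.finditer(WORD_BOUNDARY, text)]
    assert native == rewritten, (text, native, rewritten)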

The line-anchor characters ^ and $ can also be interpreted in this way, maybe?

On the topic of anchoring (using \A and \Z to indicate the beginning or end of the string): it is possible (but annoyingly convoluted) to parse an implicitly unanchored regex such as /abc|\Adef\Z/ and convert it to an implicitly anchored regex which does not use these metacharacters, in this case /.*abc.*|def/.
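
A quick check of that example with Python's re, treating . as "any character" the way a DFA would (hence DOTALL):

import re

# An unanchored search for /abc|\Adef\Z/ should agree with a fully anchored
# match of /.*abc.*|def/ on every input.
for s in ["xxabcyy", "def", "deff", "abdef", "x\nabc", ""]:
    unanchored = bool(re.search(r"abc|\Adef\Z", s))
    anchored = bool(re.fullmatch(r".*abc.*|def", s, flags=re.DOTALL))
    assert unanchored == anchored, s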

(I believe it's the MULTILINE flag which changes ^ and $ to mean the same as \A and \Z respectively?)

On the topic of accepting changes to greenery: the real question is whether a delay on my part in accepting changes would block/slow development for you. I consider greenery to be a completed project aside from bug fixes, and have done for some years. I also don't know a whole lot about the Python ecosystem. In other words I can't guarantee to review, merge and publish major changes with a quick turnaround time, and I don't want to be a bottleneck. So, it may be that your simplest course of action is to fork my code, call your new project "greenier" or something, and handle it yourselves from there. If not, then I would at least want to bump the major version number.

MegaIng (Member) commented Jul 25, 2019

Hi @qntm, thanks for the fast response.

First, I did not find a DFA/FSM library that has the required tools to make creating them from regexes as easy as it is with your system. Also, the performance is acceptable (I got it down to 0.3-0.4 seconds, with possibilities to improve still remaining).

The ideas you listed are interesting and I will certainly look into them. But they already made me realize one central problem that currently exists (ok, two: lookbehinds are still not implemented): nested lookaheads that go beyond the bounds of their group are unsupported and will just make the regex unusable.

On the point of merging changes: I think it will be easiest if I basically completely untie my code from the old usage of greenery and create a new library, maybe even just as a part of lark. What do you think @erezsh? This will make it easy to implement fixes/changes/additions.

erezsh (Member, Author) commented Jul 25, 2019

Hi @qntm, I also thank you for the quick reply, and for creating your library, which is proving very useful to us. There may be many other libraries, but most can't intersect regexps, and yours is written much better (or so was my impression a while ago, when I looked around)

It sounds to me like both you and @MegaIng agree against merging his work back. It would slow us down, and provide you with work that you don't quite consider necessary. Generally, my inclination is to integrate rather than fragment, which is why I suggested it, but I think that in this case a fork makes sense.


@MegaIng Regarding a separate library or as part of Lark: That really depends on how big the module would be, if stripped of the parts unnecessary for Lark. And also on your wishes regarding it.

Regarding regexps that we can't reason about: As long as we can throw an exception in this case, that's fine. We don't have to support every edge-case for this to be useful. But to report "no intersection" when there actually is one, would in many ways defeat the purpose of this endeavor.

MegaIng (Member) commented Jul 26, 2019

I found a library, pyfsa/FSA (I'm not sure which is the correct name), which has been unmaintained since 2004. It also had 'Non-goal: efficiency' in the documentation. If I wanted to use it, I would have to translate it from Python 2 to Python 3 and probably make a lot of changes to speed it up.

I also found a library, whoosh.automata.fsa, which might be capable of everything we want, fast enough, but I will have to do some rework to use it. I am currently trying to do that.


In any case, there seems to be no existing package that can do exactly what we need, so we need to develop a new one. I assume that this will result in 3 files, with a total of around 1500 LOC in the end. One of them is a util file that can easily be removed or integrated into another file or package.
Since this regex-intersection checker might be useful outside of lark, I would prefer to have an external package. I would however say this package should include the FSA file, because it is quite specialized for usage in this context and will be even more so at the end of my work.
In that sense, it would probably be easiest if I create this package and lark depends on it, wouldn't it? Or should we create that package inside of lark-parser? What would you prefer, @erezsh?

erezsh (Member, Author) commented Jul 26, 2019

If it's 1500 LOC, a separate package is probably better. Also, that leaves room for expanding it in the future, perhaps returning some of the functions we removed from greenery (or adding them, if you plan to build on a different library)

MegaIng (Member) commented Jul 27, 2019

@erezsh Should we create the library inside the lark-parser namespace, or should I do that on my personal profile? (Inside the lark-parser namespace it has a higher chance of being maintained forever.)

MegaIng (Member) commented Jul 27, 2019

Also, does anyone have nice ideas for names? Since I am not that creative, I would currently name it regex_intersections.

erezsh (Member, Author) commented Jul 27, 2019

@MegaIng I think it's best if you start in your own user. When it's mature for use (even alpha stage), we can move it to lark-parser.

Name suggestions:

  1. Interegular

  2. Finito

  3. Automato

  4. Intersectomata

I'll think some more and see if I can come up with something even more ridiculous :)

MegaIng (Member) commented Jul 27, 2019

Ok, I think the time-overhead is now acceptable. The 13 regexes from the python3-grammar are now checked in 0.1 to 0.15 seconds.


Regarding the name: I currently like 'interegular' the best and I think I will upload the package tomorrow.

MegaIng (Member) commented Aug 30, 2019

I finally got around to publishing it (with the name interegular). While it shouldn't be used in production right now, you can certainly start integrating it, @erezsh. While I still have a few things to do, I think I am mostly finished with the interface (unless you have wishes on how to change it).
TODO list:

  • Create tests
  • Make sure everything unsupported throws an exception.
  • Make sure I can remove the remaining # TODO in the code

erezsh (Member, Author) commented Aug 30, 2019

Very cool!

What do you mean by integrating?

Q: How do I check if regexes simply collide or not, most efficiently? (i.e. a boolean result)

Just use any(compare_regexes(...)) ?

Another question: Would it be possible to test if a regex is a full subset of another one? (i.e. everything the first can match, the second can too?)

Make sure everything unsupported throws an exception

Very important

MegaIng (Member) commented Aug 30, 2019

With 'integrating', I mean implementing/using it inside of lark.
Yes, any(compare_regexes(...)) will work.
Yes, full-subset testing can be implemented. How do you want to use it?
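
For illustration, a sketch of what such a subset test could look like, using the pattern.parse / to_fsm entry points that appeared earlier in this thread; the issubset call (equivalently, checking that A ∩ ¬B is empty) is an assumption about the greenery-style FSM API rather than a documented interegular feature:

from greenery.lego import pattern

def regex_is_subset(regex_a: str, regex_b: str) -> bool:
    # "A is a full subset of B" means every string A accepts, B accepts too.
    fsm_a = pattern.parse(regex_a).to_fsm()
    fsm_b = pattern.parse(regex_b).to_fsm()
    return fsm_a.issubset(fsm_b)

print(regex_is_subset(r"\w+", r"[\w.]+"))  # True: every VAR is also a FILENAME
print(regex_is_subset(r"[\w.]+", r"\w+"))  # False: "a.b" only matches FILENAME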

erezsh (Member, Author) commented Aug 30, 2019

Well, I thought it could be used for optimization. If regex A is contained inside regex B, it means I only have to test A on the matches of regex B.

But on second thought, that's probably not very useful, and might even hurt performance in some unusual situations.
