Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added interegular support #1258

Merged
merged 6 commits into from
Mar 8, 2023
Merged

Conversation

MegaIng
Copy link
Member

@MegaIng MegaIng commented Mar 2, 2023

This is basically the same as #715 , with (I think) all requested changes being considered. I also added an Example to the log output that can show in what situation two terminals would collide.

I created a new PR after having a mess after a git rebase. This just seemed easier.

I guess it should be mentioned somewhere in the docs, not sure where.

Copy link
Member

@erezsh erezsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool! I'll clear some time soon to test it.

It would be nice to have an automated test for the basic use of this feature..

lark/lexer.py Show resolved Hide resolved
lark/lexer.py Show resolved Hide resolved
@MegaIng
Copy link
Member Author

MegaIng commented Mar 2, 2023

So many checks, many of which are hard to execute locally tbh:

  • tox -e mypy works, although tox needs to be installed first (well, as long as the dependencies are added in the correct places)
  • tox -e test doesn't work, fails with some mysterious git submodule error (didn't look into this) Luckily, one of the 5 other ways to execute tests does work.
  • tox -e format doesn't work with the same error message. I don't know how to manually execute this, so I have to just push and see what github says.

Oh, and the codecov workflow always fails with "nothing to run". Do you have an idea what is going on there?

@MegaIng MegaIng requested a review from erezsh March 2, 2023 22:39
@erezsh
Copy link
Member

erezsh commented Mar 2, 2023

Sorry, I didn't look into the codecov yet. I'm okay with just disabling it in the meantime.

As for tox, makes sense that mypy would pass but regular tests won't. Mypy doesn't run the tests.

@erezsh
Copy link
Member

erezsh commented Mar 7, 2023

The code looks good, but I'm not crazy about the warning message -

Collision between Terminals 'B' and 'A': Example(prefix='', main_text='', postfix='a')

I'm sure we can do better.

  • Doesn't seem like prefix and main_text are ever used? At least, I couldn't find an example where they have a value.

  • The warning should explain the consequences of the collision. e.g. "The lexer will choose between them arbitrarily."

  • And ofc better not to use the repr version of Example.

@MegaIng
Copy link
Member Author

MegaIng commented Mar 7, 2023

Doesn't seem like prefix and main_text are ever used? At least, I couldn't find an example where they have a value.

Uhm... This is then a bug in interegular. OTOH, I could have sworn the eample for A and B put the text into main_text where it belongs. (Edit: aha yeah, I know what the bug is: -0 == 0)
The idea behind the three categories is that main_text is what actually get's captured by the regex and prefix and postfix are what is needed by lookbehind/lookahead. For example, the lark grammar without the change would produce a conflict on Example(prefix='', main_text='?', postfix='_')

The warning should explain the consequences of the collision. e.g. "The lexer will choose between them arbitrarily."

Fair, although I worry that a grammar would many collision would product a lot of warning text.

And ofc better not to use the repr version of Example.

I am unsure how to best format this. What prefix and postfix mean is kinda hard to explain in a short text. I guess I can do it multiline and use ^ to underline the main part, similar to python syntax errors.

@MegaIng
Copy link
Member Author

MegaIng commented Mar 7, 2023

Ok, I am now relatively happy with the warning message. It's decently complex code, but the result is IMO very readable. The formatting should now also never break, not even for newlines or weird unicode characters. Especially newlines is the reason I was using repr before.

Copy link
Member

@erezsh erezsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

You want to add the docs in a separate PR, or add them here?

@MegaIng
Copy link
Member Author

MegaIng commented Mar 7, 2023

I can add them here. Where would you want them?

@erezsh
Copy link
Member

erezsh commented Mar 7, 2023

Hmm yeah there's no obvious spot. I'd say -

I'm open to ideas.

Anyway, once that's in, I wanted to add a strict flag, so Lark(..., strict=True) will throw an exception when there are collisions. (regex collision and also shift-reduce). So then users can also learn about it through the API docs.

@MegaIng
Copy link
Member Author

MegaIng commented Mar 7, 2023

how_to_use#debug
is certainly a place it should belong, I wasn't even really aware that that section exists.

I assume strict is a separate PR?

@MegaIng
Copy link
Member Author

MegaIng commented Mar 7, 2023

Btw, the default config parser="earley", lexer="dynamic" doesn't benefit from this at all, right? How bad are regex collisions in that context?

@erezsh
Copy link
Member

erezsh commented Mar 7, 2023

I assume strict is a separate PR?

I think so.

the default config parser="earley", lexer="dynamic" doesn't benefit from this at all, right?

Well, colliding regexes are still slower, because Earley has to try all variations. But it won't cause an error. So maybe a milder warning in that case? (assuming it doesn't complicate things too much)

@MegaIng
Copy link
Member Author

MegaIng commented Mar 7, 2023

So maybe a milder warning in that case? (assuming it doesn't complicate things too much)

I would have to look through the earley code base to find a good place to put this check. They don't construct a Lexer object after all, and especially not a ContextLexer object. So I don't know how to correctly do this.

@erezsh
Copy link
Member

erezsh commented Mar 7, 2023

Umm actually that's only true for dynamic_complete. The default dynamic can still get errors from colliding regexes. (if it chooses the wrong one first)

(I'm talking about my comment)

@erezsh
Copy link
Member

erezsh commented Mar 8, 2023

I rephrased the docs a little bit, let me know if you're okay with the changes: #1260

Also, I just realized we're only showing shift/reduce warnings when debug=True. Maybe we should use the same convention for interegular?

@MegaIng
Copy link
Member Author

MegaIng commented Mar 8, 2023

Yeah, those changes are good. I am not that skilled with words :-)

Maybe we should use the same convention for interegular?

I was thinking about this as well. I decided against it, since regex collisions are from my understanding more severe than shift/reduce collisions. I.e. for shift/reduce the automatic resolving probably will do the correct thing, for regex collision, the automatic resolving will maybe do the correct thing.

@erezsh
Copy link
Member

erezsh commented Mar 8, 2023

I think I agree. But I was also thinking about the possibility of interegular throwing exceptions, which might interfere with normal use. For example certain regex features it doesn't know about, or perhaps some edge case it wasn't expecting. If that's a possibility, then it would be nice to have a way to disable/enable it, that isn't installing/uninstalling it.

@MegaIng
Copy link
Member Author

MegaIng commented Mar 8, 2023

interegular should catch every exception it can throw and only produce a warning on it's internal logger. We probably should make the debug flag make these warnings visible. I don't believe that performance could ever be bogged down enough by these checks to be worth having a disable flag for that. If the performance is that important, create a standalone parser.

@erezsh
Copy link
Member

erezsh commented Mar 8, 2023

We probably should make the debug flag make these warnings visible

That's not a bad idea. Makes sense that it would set the appropriate logger levels to DEBUG by default.

@erezsh
Copy link
Member

erezsh commented Mar 8, 2023

Can you please show me where this exception gets caught? https://github.com/MegaIng/interegular/blob/master/interegular/fsm.py#L431

Maybe I'm missing something obvious.

@MegaIng
Copy link
Member Author

MegaIng commented Mar 8, 2023

https://github.com/MegaIng/interegular/blob/ffde88d87a1eb4209fc3bdad2153643b4944c064/interegular/fsm.py#L889

It's used internally inside of .fsm to correctly crawl the (potentially incomplete) FSM.

@erezsh
Copy link
Member

erezsh commented Mar 8, 2023

Ah, yes. Missed that it's a callback.

erezsh added a commit that referenced this pull request Mar 8, 2023
@erezsh erezsh merged commit d508d1c into lark-parser:master Mar 8, 2023
@erezsh
Copy link
Member

erezsh commented Mar 8, 2023

Thanks for the awesome new feature!

@MegaIng
Copy link
Member Author

MegaIng commented Mar 9, 2023

I also published a new version of interegular that should double the performance again. :-D

@MegaIng MegaIng deleted the interegular-two branch March 9, 2023 01:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants