How to define lark grammar for best parsing performance #1404

ShivaShankarMovidius · 2024-04-04T12:30:21Z

What is your question?

I am currently using pymlir for a project. pymlir is an open-source project that parses MLIR using Lark. My MLIR files are about 2k lines long, and parsing them takes multiple minutes. I am looking for ways to reduce the parsing time, and my target is to bring it below a minute. Are there any best practices for writing a grammar so that Lark can parse a DSL efficiently?

pymlir uses the Earley parser.
I tried switching to the LALR parser to improve performance, but I get an error when I do, so I am sticking with Earley for now.

MegaIng · 2024-04-04T12:37:44Z

The first recommendation is always to try and use parser='lalr' instead of parser='earley'. For most sane languages that should be perfectly possible, but it does require reworking the grammar.

Try to have as much of your grammar as possible be terminals rather than rules. However, you shouldn't use non-regular regex features such as lookaheads/lookbehinds.

If you have to use Earley because your grammar is ambiguous for some reason or just doesn't fit into LALR, make sure that your regex patterns are as simple and non-overlapping as possible, and that your grammar is left-recursive, i.e. a recursive rule only (or at least primarily) appears as its own first child.

You can try switching to lexer='basic' if your tokens are cleanly distinct, although I am not sure if this will give a performance benefit.

You can also switch from CPython to PyPy, which almost always gives a very large speedup.

MegaIng · 2024-04-04T12:44:09Z

Oh, also use a newer version of lark than 0.7.8

erezsh · 2024-04-04T13:28:49Z

Try to let the regex engine capture as much as possible. For example, this is VERY bad:

bare_id : (letter| underscore) (letter|digit|underscore|id_chars)*

Try making bare_id (and its dependencies) into a terminal. It should help a lot.

Avoid unnecessary right-recursion. This is very bad:

semi_affine_expr : "(" semi_affine_expr ")"                        -> semi_affine_parens
                 | semi_affine_expr "+" semi_affine_expr           -> semi_affine_add

See the calculator example.

Also, try reducing the number of rules. You have a lot of unnecessary duplication of rules.

It's not worth it to use Earley to validate every corner of the input. Just make sure it parses into the correct structure, and validate afterwards.

ShivaShankarMovidius · 2024-04-04T14:55:52Z

@erezsh, @MegaIng, thank you so much for the quick response. I'll try some of these suggestions and update the thread accordingly.

ShivaShankarMovidius · 2024-04-04T18:46:56Z

@erezsh, I have a question about the difference between rules and terminals. Is there any difference in the way Lark processes a rule and a terminal internally, specifically with respect to your suggestion to keep regexes in terminals?

erezsh · 2024-04-05T08:39:42Z

Yes, there's an important difference. A terminal turns into a single regular expression. So in the following:

X: "a" "b" "c"
x: "a" "b" "c"

X will be a single regexp "abc", while x will be 3 separate regexes, and their matching will be managed by the parser. The terminal is a lot faster to match, but the rule is more flexible and can do more things. Only use rules when it makes sense.

ShivaShankarMovidius · 2024-04-08T12:01:11Z

https://lark-parser.readthedocs.io/en/stable/grammar.html#

Hi @erezsh, the document above is where I started learning about Lark grammar definitions. Are there any other documents that describe how to define a grammar using Lark, specifically guidelines on best practices for grammar definition?

ShivaShankarMovidius · 2024-04-11T15:03:28Z

I think we have good information to move forward here. So, closing the issue for now.

ShivaShankarMovidius added the question label Apr 4, 2024

ShivaShankarMovidius changed the title ~~How to define the grammar for best parsing performance~~ How to define lark grammar for best parsing performance Apr 4, 2024

ShivaShankarMovidius mentioned this issue Apr 4, 2024

Improve parser speed spcl/pymlir#33

Open

ShivaShankarMovidius closed this as completed Apr 11, 2024
