Make lark.lark parse the same grammar as load_grammar.py, and make grammar.md document it more fully. #1388

Open
wants to merge 21 commits into
base: master
db1a5a5
Make lark.lark parse the same grammar as load_grammar.py, and make gr…
RossPatterson Feb 1, 2024
9493f81
1. Fix "Python type check / Format (pull request)" failure in test_la…
RossPatterson Feb 1, 2024
7a2880f
DOH!
RossPatterson Feb 1, 2024
83a374f
Remove unnessary anchor; coalesce ENBF item sets; fix %override grammar
RossPatterson Feb 2, 2024
fdffb5f
Revert lark.lark to its original form.
RossPatterson Feb 9, 2024
95c5742
Make lark.lark accept the same input as load_grammar.py, and provide …
RossPatterson Feb 9, 2024
200d6b5
Address some review comments.
RossPatterson Feb 9, 2024
0fb28f9
Fix review comment re: templates in terminals.
RossPatterson Feb 10, 2024
2ec5ef3
Fix review comment: Remove inlining from expansions, expansion, and v…
RossPatterson Feb 10, 2024
e9c026e
Address review comment: Make alias and expr optionals, not maybes, so…
RossPatterson Feb 10, 2024
9bf7ddf
Address review comment: Make '%declare rule' fail in post-processing …
RossPatterson Feb 10, 2024
7f02bd1
lark.lark doesn't allow backslash-nl as a line-continuation, but load…
RossPatterson Feb 13, 2024
4f7a5eb
Push optionality of rule_modifiers and priority down into rule_modifi…
RossPatterson Mar 15, 2024
40576d2
Fix bug introduced in #1018
RossPatterson Mar 15, 2024
daac65d
Issue #1388 is ready for review.
RossPatterson Mar 15, 2024
5f37365
Resolve @megalng comment re:@skipIf
RossPatterson Jun 21, 2024
697841b
Resolve @megalng comment re:tests/test_lark_validator.py
RossPatterson Jun 21, 2024
654e102
Resolve @megalng comment re:docstrings
RossPatterson Jun 21, 2024
33d7088
Resolve @erezsh comment re:typo
RossPatterson Jun 21, 2024
0d01fe2
Resolve part of @erezsh comment re: options.
RossPatterson Jun 21, 2024
20302ca
Remove obsolete 'options' parameter
RossPatterson Sep 24, 2024
124 changes: 75 additions & 49 deletions docs/grammar.md
@@ -51,53 +51,48 @@ Lark begins the parse with the rule 'start', unless specified otherwise in the o

Names of rules are always in lowercase, while names of terminals are always in uppercase. This distinction has practical effects, for the shape of the generated parse-tree, and the automatic construction of the lexer (aka tokenizer, or scanner).

## EBNF Expressions

## Terminals

Terminals are used to match text into symbols. They can be defined as a combination of literals and other terminals.

**Syntax:**

```html
<NAME> [. <priority>] : <literals-and-or-terminals>
```

Terminal names must be uppercase.
The EBNF expression in a Lark terminal definition is a sequence of items to be matched.
Each item is one of:

Literals can be one of:
* `TERMINAL` - Another terminal, which cannot be defined in terms of this terminal.
* `"string literal"` - Literal, to be matched as-is.
* `"string literal"i` - Literal, to be matched case-insensitively.
* `/regexp literal/[imslux]` - Regular expression literal. Can include the Python stdlib's `re` [flags `imslux`](https://docs.python.org/3/library/re.html#contents-of-module-re)

* `"string"`
* `/regular expression+/`
* `"case-insensitive string"i`
* `/re with flags/imulx`
* Literal range: `"a".."z"`, `"1".."9"`, etc.
* `"character".."character"` - Literal range. The range represents all values between the two literals, inclusive.
* `(item item ..)` - Group items
* `(item | item | ..)` - Alternate items.
* `[item item ..]` - Maybe. Same as `(item item ..)?`, but when `maybe_placeholders=True`, generates `None` if there is no match.
* `[item | item | ..]` - Maybe with alternates. Same as `(item | item | ..)?`, but when `maybe_placeholders=True`, generates `None` if there is no match.
* `item?` - Zero or one instances of item (a "maybe")
* `item*` - Zero or more instances of item
* `item+` - One or more instances of item
* `item ~ n` - Exactly *n* instances of item
* `item ~ n..m` - Between *n* and *m* instances of item

Terminals also support grammar operators, such as `|`, `+`, `*` and `?`.
The EBNF expression in a Lark rule definition is also a sequence of the same set of items to be matched, with one addition:

Terminals are a linear construct, and therefore may not contain themselves (recursion isn't allowed).
* `rule` - A rule, which can include recursive use of this rule.

### Templates
## Terminals

Templates are expanded when preprocessing the grammar.
Terminals are used to match text into symbols. They can be defined as a combination of literals and other terminals.

Definition syntax:
**Syntax:**

```ebnf
my_template{param1, param2, ...}: <EBNF EXPRESSION>
```html
<NAME> [. <priority>] : <items-to-match>
```

Use syntax:
Terminal names must be uppercase. They must start with an underscore (`_`) or a letter (`A` through `Z`), and may be composed of letters, underscores, and digits (`0` through `9`). Terminal names that start with "_" will not be included in the parse tree, unless the `keep_all_tokens` option is specified, or unless they are part of a containing terminal. Terminals are a linear construct, and therefore may not contain themselves (recursion isn't allowed).

```ebnf
some_rule: my_template{arg1, arg2, ...}
```
See [EBNF Expressions](#ebnf-expressions) above for the list of items that a terminal can match.

Example:
```ebnf
_separated{x, sep}: x (sep x)* // Define a sequence of 'x sep x sep x ...'
### Templates

num_list: "[" _separated{NUMBER, ","} "]" // Will match "[1, 2, 3]" etc.
```
Templates are not allowed with terminals.

### Priority

@@ -122,7 +117,7 @@ SIGNED_INTEGER: /
/x
```

Supported flags are one of: `imslux`. See Python's regex documentation for more details on each one.
Supported flags are one of: `imslux`. See Python's [regex documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax) for more details on each one.

Regexps/strings of different flags can only be concatenated in Python 3.6+

@@ -196,29 +191,19 @@ _ambig

**Syntax:**
```html
<name> : <items-to-match> [-> <alias> ]
<modifiers><name> : <items-to-match> [-> <alias> ]
| ...
```

Names of rules and aliases are always in lowercase.
Names of rules and aliases are always in lowercase. They must start with an underscore (`_`) or a letter (`a` through `z`), and may be composed of letters, underscores, and digits (`0` through `9`). Rule names that start with "_" will be inlined into their containing rule.

Rule definitions can be extended to the next line by using the OR operator (signified by a pipe: `|` ).

An alias is a name for the specific rule alternative. It affects tree construction.
An alias is a name for the specific rule alternative. It affects tree construction (see [Shaping the tree](tree_construction#shaping_the_tree)).

The effect of a rule on the parse tree can be specified by modifiers. The `!` modifier causes the rule to keep all its tokens, regardless of whether they are named or not. The `?` modifier causes the rule to be inlined if it only has a single child. The `?` modifier cannot be used on rules that are named starting with an underscore.
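
The naming and modifier rules above can be checked with plain regular expressions. The sketch below copies the `RULE_MODIFIERS` and `RULE` patterns from the `lark.lark` changes in this PR; the helper `split_rule_header` is hypothetical and is not part of Lark.

```python
import re

# Patterns copied from the RULE_MODIFIERS and RULE terminals in this PR's lark.lark.
RULE_MODIFIERS = re.compile(r"(!|![?]?|[?]!?)(?=[_a-z])")
RULE = re.compile(r"_?[a-z][_a-z0-9]*")


def split_rule_header(text):
    """Split e.g. '?!expr' into ('?!', 'expr'). Hypothetical helper, not part of Lark."""
    match = RULE_MODIFIERS.match(text)
    modifiers = match.group(0) if match else ""
    name = text[len(modifiers):]
    if not RULE.fullmatch(name):
        raise ValueError(f"not a valid rule name: {name!r}")
    # The documentation states that '?' cannot be used on underscore rules.
    if "?" in modifiers and name.startswith("_"):
        raise ValueError(f"'?' cannot modify underscore rule {name!r}")
    return modifiers, name


print(split_rule_header("?!expr"))
print(split_rule_header("_inline"))
```

Note that this only mirrors the lexical shape of a rule header; whether Lark itself reports the `?` plus underscore combination at this stage or later is not something the sketch tries to reproduce.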

Each item is one of:

* `rule`
* `TERMINAL`
* `"string literal"` or `/regexp literal/`
* `(item item ..)` - Group items
* `[item item ..]` - Maybe. Same as `(item item ..)?`, but when `maybe_placeholders=True`, generates `None` if there is no match.
* `item?` - Zero or one instances of item ("maybe")
* `item*` - Zero or more instances of item
* `item+` - One or more instances of item
* `item ~ n` - Exactly *n* instances of item
* `item ~ n..m` - Between *n* to *m* instances of item (not recommended for wide ranges, due to performance issues)
See [EBNF Expressions](#ebnf-expressions) above for the list of items that a rule can match.

**Examples:**
```perl
@@ -230,6 +215,29 @@ expr: expr operator expr
four_words: word ~ 4
```

### Templates

Templates are expanded when preprocessing rules in the grammar.

Definition syntax:

```ebnf
my_template{param1, param2, ...}: <EBNF EXPRESSION>
```

Use syntax:

```ebnf
some_rule: my_template{arg1, arg2, ...}
```

Example:
```ebnf
_separated{x, sep}: x (sep x)* // Define a sequence of 'x sep x sep x ...'

num_list: "[" _separated{NUMBER, ","} "]" // Will match "[1, 2, 3]" etc.
```

### Priority

Like terminals, rules can be assigned a priority. Rule priorities are signed
@@ -297,12 +305,24 @@ Note that `%ignore` directives cannot be imported. Imported rules will abide by

Declare a terminal without defining it. Useful for plugins.

**Syntax:**
```html
%declare <TERMINAL>
%declare <rule>
```

### %override

Override a rule or terminal, affecting all references to it, even in imported grammars.

Useful for implementing an inheritance pattern when importing grammars.

**Syntax:**
```html
%override <terminal definition>
%override <rule definition>
```

**Example:**
```perl
%import my_grammar (start, number, NUMBER)
@@ -319,6 +339,12 @@ Useful for splitting up a definition of a complex rule with many different optio

Can also be used to implement a plugin system where a core grammar is extended by others.

**Syntax:**
```html
%extend <TERMINAL> ... additional terminal alternate ...
%extend <rule> ... additional rule alternate ...
```


**Example:**
```perl
39 changes: 27 additions & 12 deletions lark/grammars/lark.lark
@@ -1,56 +1,70 @@
# Lark grammar of Lark's syntax
# Note: Lark is not bootstrapped, its parser is implemented in load_grammar.py
# This grammar matches that one, but does not enforce some rules that it does.
# If you want to enforce those, you can pass the "LarkValidator" over
# the parse tree, like this:

# from lark import Lark
# from lark.lark_validator import LarkValidator
#
# lark_parser = Lark.open_from_package("lark", "grammars/lark.lark", parser="lalr")
# parse_tree = lark_parser.parse(my_grammar)
# LarkValidator.validate(parse_tree)

start: (_item? _NL)* _item?

_item: rule
| token
| statement

rule: RULE rule_params priority? ":" expansions
token: TOKEN token_params priority? ":" expansions
rule: rule_modifiers RULE rule_params priority ":" expansions
token: TOKEN priority? ":" expansions
Member:
priority is already optional

Author:
Yes, but load_grammar.py says priority is a required element of rule, and that priority is _DOT NUMBER or null. I wanted lark.lark to produce the same parse tree as load_grammar.py.

It's different for token (term in load_grammar.py): there, load_grammar.py [says priority is optional](https://github.com/lark-parser/lark/blob/master/lark/load_grammar.py#L162-L163).

Author:
@erezsh If my comment of 2024-06-20 is acceptable, let's resolve this point.

Member:
I think what I meant was that priority can already be an empty rule, so no point in making it optional.


rule_modifiers: RULE_MODIFIERS?

rule_params: ["{" RULE ("," RULE)* "}"]
token_params: ["{" TOKEN ("," TOKEN)* "}"]

priority: "." NUMBER
priority: ("." NUMBER)?

statement: "%ignore" expansions -> ignore
| "%import" import_path ["->" name] -> import
| "%import" import_path name_list -> multi_import
| "%override" rule -> override_rule
| "%override" (rule | token) -> override
| "%declare" name+ -> declare
| "%extend" (rule | token) -> extend

!import_path: "."? name ("." name)*
name_list: "(" name ("," name)* ")"

?expansions: alias (_VBAR alias)*
expansions: alias (_VBAR alias)*

?alias: expansion ["->" RULE]
?alias: expansion ("->" RULE)?

?expansion: expr*
expansion: expr*

?expr: atom [OP | "~" NUMBER [".." NUMBER]]
?expr: atom (OP | "~" NUMBER (".." NUMBER)?)?

?atom: "(" expansions ")"
| "[" expansions "]" -> maybe
| value

?value: STRING ".." STRING -> literal_range
value: STRING ".." STRING -> literal_range
| name
| (REGEXP | STRING) -> literal
| name "{" value ("," value)* "}" -> template_usage
| RULE "{" value ("," value)* "}" -> template_usage

name: RULE
| TOKEN

_VBAR: _NL? "|"
OP: /[+*]|[?](?![a-z])/
RULE: /!?[_?]?[a-z][_a-z0-9]*/
RULE_MODIFIERS: /(!|![?]?|[?]!?)(?=[_a-z])/
RULE: /_?[a-z][_a-z0-9]*/
TOKEN: /_?[A-Z][_A-Z0-9]*/
STRING: _STRING "i"?
REGEXP: /\/(?!\/)(\\\/|\\\\|[^\/])*?\/[imslux]*/
_NL: /(\r?\n)+\s*/
BACKSLASH: /\\[ ]*\n/
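
The `OP` and `BACKSLASH` patterns above can be exercised directly with Python's `re` module. This is only a sketch of what the regular expressions match on their own; the helper `is_operator` and the sample input are hypothetical, and Lark's lexer applies these terminals in context rather than by simple substitution.

```python
import re

# Patterns copied from the OP and BACKSLASH terminals above.
OP = re.compile(r"[+*]|[?](?![a-z])")
BACKSLASH = re.compile(r"\\[ ]*\n")


def is_operator(text):
    """True if text starts with a repetition operator (hypothetical helper)."""
    return OP.match(text) is not None


# '?' directly followed by a lowercase letter is not matched as an operator,
# which leaves it available as a rule modifier such as '?expr'.
print(is_operator("? "), is_operator("?expr"), is_operator("+"))

# Dropping BACKSLASH matches joins a continued line, which is roughly
# the effect of '%ignore BACKSLASH' on the token stream.
source = 'start: "a" \\\n     | "b"\n'
joined = BACKSLASH.sub("", source)
print(repr(joined))
```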

%import common.ESCAPED_STRING -> _STRING
%import common.SIGNED_INT -> NUMBER
@@ -60,3 +74,4 @@ COMMENT: /\s*/ "//" /[^\n]/* | /\s*/ "#" /[^\n]/*

%ignore WS_INLINE
%ignore COMMENT
%ignore BACKSLASH