lark-parser · RossPatterson · Feb 1, 2024 · Feb 1, 2024 · Feb 1, 2024 · Feb 2, 2024
diff --git a/docs/grammar.md b/docs/grammar.md
@@ -51,53 +51,48 @@ Lark begins the parse with the rule 'start', unless specified otherwise in the o
 
 Names of rules are always in lowercase, while names of terminals are always in uppercase. This distinction has practical effects, for the shape of the generated parse-tree, and the automatic construction of the lexer (aka tokenizer, or scanner).
 
+## EBNF Expressions
 
-## Terminals
-
-Terminals are used to match text into symbols. They can be defined as a combination of literals and other terminals.
-
-**Syntax:**
-
-```html
-<NAME> [. <priority>] : <literals-and-or-terminals>
-```
-
-Terminal names must be uppercase.
+The EBNF expression in a Lark terminal definition is a sequence of items to be matched.
+Each item is one of:
 
-Literals can be one of:
+* `TERMINAL` - Another terminal, which cannot be defined in terms of this terminal.
+* `"string literal"` - Literal, to be matched as-is.
+* `"string literal"i` - Literal, to be matched case-insensitively.
+* `/regexp literal/[imslux]` - Regular expression literal.  Can include any of the Python stdlib's `re` [flags `imslux`](https://docs.python.org/3/library/re.html#contents-of-module-re).
 
-* `"string"`
-* `/regular expression+/`
-* `"case-insensitive string"i`
-* `/re with flags/imulx`
-* Literal range: `"a".."z"`, `"1".."9"`, etc.
+* `"character".."character"` - Literal range.  The range represents all values between the two literals, inclusive.
+* `(item item ..)` - Group items
+* `(item | item | ..)` - Alternate items.
+* `[item item ..]` - Maybe. Same as `(item item ..)?`, but when `maybe_placeholders=True`, generates `None` if there is no match.
+* `[item | item | ..]` - Maybe with alternates. Same as `(item | item | ..)?`, but when `maybe_placeholders=True`, generates `None` if there is no match.
+* `item?` - Zero or one instances of item (a "maybe")
+* `item*` - Zero or more instances of item
+* `item+` - One or more instances of item
+* `item ~ n` - Exactly *n* instances of item
+* `item ~ n..m` - Between *n* and *m* instances of item, inclusive
 
-Terminals also support grammar operators, such as `|`, `+`, `*` and `?`.
+The EBNF expression in a Lark rule definition is also a sequence of the same set of items to be matched, with one addition:
 
-Terminals are a linear construct, and therefore may not contain themselves (recursion isn't allowed).
+* `rule` - A rule, which can include recursive use of this rule.
 
-### Templates
+## Terminals
 
-Templates are expanded when preprocessing the grammar.
+Terminals are used to match text into symbols. They can be defined as a combination of literals and other terminals.
 
-Definition syntax:
+**Syntax:**
 
-```ebnf
-  my_template{param1, param2, ...}: <EBNF EXPRESSION>
+```html
+<NAME> [. <priority>] : <items-to-match>
 ```
 
-Use syntax:
+Terminal names must be uppercase.  They must start with a letter (`A` through `Z`), optionally preceded by a single underscore (`_`), and may otherwise be composed of letters, underscores, and digits (`0` through `9`).  Terminal names that start with `_` will not be included in the parse tree, unless the `keep_all_tokens` option is specified, or unless they are part of a containing terminal.  Terminals are a linear construct, and therefore may not contain themselves (recursion isn't allowed).
 
-```ebnf
-some_rule: my_template{arg1, arg2, ...}
-```
+See [EBNF Expressions](#ebnf-expressions) above for the list of items that a terminal can match.
 
-Example:
-```ebnf
-_separated{x, sep}: x (sep x)*  // Define a sequence of 'x sep x sep x ...'
+### Templates
 
-num_list: "[" _separated{NUMBER, ","} "]"   // Will match "[1, 2, 3]" etc.
-```
+Templates are not allowed with terminals.
 
 ### Priority
 
@@ -122,7 +117,7 @@ SIGNED_INTEGER: /
  /x
 ```
 
-Supported flags are one of: `imslux`. See Python's regex documentation for more details on each one.
+Supported flags are any combination of `imslux`. See Python's [regex documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax) for more details on each one.
 
 Regexps/strings of different flags can only be concatenated in Python 3.6+
 
@@ -196,29 +191,19 @@ _ambig
 
 **Syntax:**
 ```html
-<name> : <items-to-match>  [-> <alias> ]
+<modifiers><name> : <items-to-match>  [-> <alias> ]
        | ...
 ```
 
-Names of rules and aliases are always in lowercase.
+Names of rules and aliases are always in lowercase.  They must start with a letter (`a` through `z`), optionally preceded by a single underscore (`_`), and may otherwise be composed of letters, underscores, and digits (`0` through `9`).  Rule names that start with `_` will be inlined into their containing rule.
 
 Rule definitions can be extended to the next line by using the OR operator (signified by a pipe: `|` ).
 
-An alias is a name for the specific rule alternative. It affects tree construction.
+An alias is a name for the specific rule alternative. It affects tree construction (see [Shaping the tree](tree_construction#shaping_the_tree)).
 
+The effect of a rule on the parse tree can be specified by modifiers.  The `!` modifier causes the rule to keep all its tokens, regardless of whether they are named or not.  The `?` modifier causes the rule to be inlined if it only has a single child.  The `?` modifier cannot be used on rules whose names start with an underscore.
 
-Each item is one of:
-
-* `rule`
-* `TERMINAL`
-* `"string literal"` or `/regexp literal/`
-* `(item item ..)` - Group items
-* `[item item ..]` - Maybe. Same as `(item item ..)?`, but when `maybe_placeholders=True`, generates `None` if there is no match.
-* `item?` - Zero or one instances of item ("maybe")
-* `item*` - Zero or more instances of item
-* `item+` - One or more instances of item
-* `item ~ n` - Exactly *n* instances of item
-* `item ~ n..m` - Between *n* to *m* instances of item (not recommended for wide ranges, due to performance issues)
+See [EBNF Expressions](#ebnf-expressions) above for the list of items that a rule can match.
 
 **Examples:**
 ```perl
@@ -230,6 +215,29 @@ expr: expr operator expr
 four_words: word ~ 4
 ```
 
+### Templates
+
+Templates are expanded when preprocessing rules in the grammar.
+
+Definition syntax:
+
+```ebnf
+  my_template{param1, param2, ...}: <EBNF EXPRESSION>
+```
+
+Use syntax:
+
+```ebnf
+some_rule: my_template{arg1, arg2, ...}
+```
+
+Example:
+```ebnf
+_separated{x, sep}: x (sep x)*  // Define a sequence of 'x sep x sep x ...'
+
+num_list: "[" _separated{NUMBER, ","} "]"   // Will match "[1, 2, 3]" etc.
+```
+
 ### Priority
 
 Like terminals, rules can be assigned a priority. Rule priorities are signed
@@ -297,12 +305,24 @@ Note that `%ignore` directives cannot be imported. Imported rules will abide by
 
 Declare a terminal without defining it. Useful for plugins.
 
+**Syntax:**
+```html
+%declare <TERMINAL>
+%declare <rule>
+```
+
 ### %override
 
 Override a rule or terminals, affecting all references to it, even in imported grammars.
 
 Useful for implementing an inheritance pattern when importing grammars.
 
+**Syntax:**
+```html
+%override <terminal definition>
+%override <rule definition>
+```
+
 **Example:**
 ```perl
 %import my_grammar (start, number, NUMBER)
@@ -319,6 +339,12 @@ Useful for splitting up a definition of a complex rule with many different optio
 
 Can also be used to implement a plugin system where a core grammar is extended by others.
 
+**Syntax:**
+```html
+%extend <TERMINAL> ... additional terminal alternate ...
+%extend <rule> ... additional rule alternate ...
+```
+
 
 **Example:**
 ```perl

diff --git a/lark/grammars/lark.lark b/lark/grammars/lark.lark
@@ -1,56 +1,70 @@
 # Lark grammar of Lark's syntax
 # Note: Lark is not bootstrapped, its parser is implemented in load_grammar.py
+# This grammar matches that one, but does not enforce some rules that it does.
+# If you want to enforce those, you can pass the "LarkValidator" over
+# the parse tree, like this:
+
+# from lark import Lark
+# from lark.lark_validator import LarkValidator
+#
+# lark_parser = Lark.open_from_package("lark", "grammars/lark.lark", parser="lalr")
+# parse_tree = lark_parser.parse(my_grammar)
+# LarkValidator.validate(parse_tree)
 
 start: (_item? _NL)* _item?
 
 _item: rule
      | token
      | statement
 
-rule: RULE rule_params priority? ":" expansions
-token: TOKEN token_params priority? ":" expansions
+rule: rule_modifiers RULE rule_params priority ":" expansions
+token: TOKEN priority? ":" expansions
+
+rule_modifiers: RULE_MODIFIERS?
 
 rule_params: ["{" RULE ("," RULE)* "}"]
-token_params: ["{" TOKEN ("," TOKEN)* "}"]
 
-priority: "." NUMBER
+priority: ("." NUMBER)?
 
 statement: "%ignore" expansions                    -> ignore
          | "%import" import_path ["->" name]       -> import
          | "%import" import_path name_list         -> multi_import
-         | "%override" rule                        -> override_rule
+         | "%override" (rule | token)              -> override
          | "%declare" name+                        -> declare
+         | "%extend" (rule | token)                -> extend
 
 !import_path: "."? name ("." name)*
 name_list: "(" name ("," name)* ")"
 
-?expansions: alias (_VBAR alias)*
+expansions: alias (_VBAR alias)*
 
-?alias: expansion ["->" RULE]
+?alias: expansion ("->" RULE)?
 
-?expansion: expr*
+expansion: expr*
 
-?expr: atom [OP | "~" NUMBER [".." NUMBER]]
+?expr: atom (OP | "~" NUMBER (".." NUMBER)?)?
 
 ?atom: "(" expansions ")"
      | "[" expansions "]" -> maybe
      | value
 
-?value: STRING ".." STRING -> literal_range
+value: STRING ".." STRING -> literal_range
       | name
       | (REGEXP | STRING) -> literal
-      | name "{" value ("," value)* "}" -> template_usage
+      | RULE "{" value ("," value)* "}" -> template_usage
 
 name: RULE
     | TOKEN
 
 _VBAR: _NL? "|"
 OP: /[+*]|[?](?![a-z])/
-RULE: /!?[_?]?[a-z][_a-z0-9]*/
+RULE_MODIFIERS: /(!|![?]?|[?]!?)(?=[_a-z])/
+RULE: /_?[a-z][_a-z0-9]*/
 TOKEN: /_?[A-Z][_A-Z0-9]*/
 STRING: _STRING "i"?
 REGEXP: /\/(?!\/)(\\\/|\\\\|[^\/])*?\/[imslux]*/
 _NL: /(\r?\n)+\s*/
+BACKSLASH: /\\[ ]*\n/
 
 %import common.ESCAPED_STRING -> _STRING
 %import common.SIGNED_INT -> NUMBER
@@ -60,3 +74,4 @@ COMMENT: /\s*/ "//" /[^\n]/* | /\s*/ "#" /[^\n]/*
 
 %ignore WS_INLINE
 %ignore COMMENT
+%ignore BACKSLASH
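
Review note: the `lark.lark` changes above split the old `RULE` pattern into `RULE_MODIFIERS` plus a plain `RULE`. The sketch below is one way to sanity-check that split, using only Python's `re` module. The three patterns are copied from the diff; the helper `split_rule_name` is a hypothetical name introduced here for illustration and is not part of Lark or of this PR.

```python
import re

# Patterns copied verbatim from the lark.lark diff above.
RULE_MODIFIERS = re.compile(r"(!|![?]?|[?]!?)(?=[_a-z])")
RULE = re.compile(r"_?[a-z][_a-z0-9]*")
TOKEN = re.compile(r"_?[A-Z][_A-Z0-9]*")


def split_rule_name(text):
    """Split leading modifiers from a rule name (hypothetical helper).

    Returns (modifiers, name); raises ValueError if the remainder
    is not a valid RULE name according to the pattern above.
    """
    match = RULE_MODIFIERS.match(text)
    modifiers = match.group(0) if match else ""
    name = text[len(modifiers):]
    if not RULE.fullmatch(name):
        raise ValueError(f"not a valid rule name: {text!r}")
    return modifiers, name


if __name__ == "__main__":
    for sample in ["expr", "?expr", "!expr", "!?expr", "?!expr", "_sep", "?_sep"]:
        print(sample, "->", split_rule_name(sample))
```

As written, the patterns accept `?_sep`, although the updated `docs/grammar.md` says `?` cannot be used on rules whose names start with an underscore. That seems consistent with the new header comment stating that this grammar does not enforce every rule the real parser does, but it may be worth confirming that the validator covers this case.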