# Syntax Reference
This document describes the syntax of the _Ohm language_, which is a variant of parsing expression grammars (PEGs). If you have experience with PEGs, the Ohm syntax will mostly look familiar, but there are a few important differences to note:
- When naming rules, **case matters**: whitespace is implicitly skipped inside a rule application if the rule name begins with an uppercase letter. For further information, see [Syntactic vs. Lexical Rules](#syntactic-lexical).
- Grammars are purely about recognition: they do not contain semantic actions (those are defined separately) or bindings. The separation of semantic actions is one of the defining features of Ohm — we believe that it improves modularity and makes both grammars and semantics easier to understand.
- Alternation expressions support _case names_, which are used in [inline rule declarations](#inline-rule-declarations). This makes semantic actions for alternation expressions simpler and less error-prone.
- Ohm does not (yet) support semantic predicates.
Ohm is closely related to [OMeta](http://tinlizzie.org/ometa/), another PEG-based language for parsing and pattern matching. Like OMeta, Ohm supports a few features not supported by many PEG parsing frameworks:
- [Rule applications](#rule-application) can accept parameters. This makes it possible to write higher-order rules, such as the built-in `ListOf` rule.
- Grammars can be extended in an object-oriented way — see [Defining, Extending, and Overriding Rules](#defining-extending-and-overriding-rules).
## Terminology
<!-- @markscript
const ohm = require('ohm-js');
function checkGrammar(source) {
assert(ohm.grammar(source));
}
markscript.transformNextBlock(checkGrammar);
-->
```
Arithmetic {
Expr = "1 + 1"
}
```
This is a grammar named "Arithmetic", which has a single rule named "Expr". The right-hand side of _Expr_ is known as a "rule body". A rule body may be any valid _parsing expression_.
## Parsing Expressions
Here is a full list of the different kinds of parsing expressions supported by Ohm:
### Terminals
"hello there"
Matches exactly the characters contained inside the quotation marks.
#### Special characters
Special characters (`"`, `\`, and `'`) can be escaped with a backslash — e.g., `"\""` will match a literal quote character in the input stream. Other valid escape sequences include: `\b` (backspace), `\f` (form feed), `\n` (line feed), `\r` (carriage return), and `\t` (tab), as well as `\x` followed by 2 hex digits and `\u` followed by 4 hex digits, for matching characters by code point.
The <code>\u{<i>hexDigits</i>}</code> escape sequence can be used to represent _any_ Unicode code point, including code points above `0xFFFF`. E.g., `"\u{1F639}"` will match `'😹'`. (_New in Ohm v16.3.0._)
**NOTE:** For grammars defined in a JavaScript string literal (i.e., not in a separate .ohm file), it's recommended to use a [template literal with the String.raw tag](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/raw). Without `String.raw`, you'll need to use double-escaping — e.g., `\\n` rather than `\n`.
### Terminal Range
<pre><code><i>start</i>..<i>end</i></code></pre>
Matches exactly one code point whose value is between _start_ and _end_ (inclusive). E.g., `"a".."c"` will match `'a'`, `'b'`, or `'c'`. Note: _start_ and _end_ must be [Terminal](#terminals) expressions containing a single character or code point. (_Note:_ Prior to Ohm v16.3.0, terminal ranges only supported code points up to `0xFFFF`. As of v16.3.0, higher code points can be specified directly (e.g. `"😇".."😈"`) or with an escape code (`"\u{1F607}".."\u{1F608}"`).)
<!-- @markscript
assert(ohm.grammar('G{ start = "😇".."😈" }').match('😇').succeeded())
assert(ohm.grammar('G{ start = "\u{1F607}".."\u{1F608}" }').match('😇').succeeded())
-->
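As a rough illustration of terminals, escapes, and ranges used together, the sketch below defines a few lexical rules. The grammar name `Tokens` and all rule names are invented for this example. If the grammar is embedded in a JavaScript string rather than a separate .ohm file, the `String.raw` advice above applies to the escapes.

```ohm
Tokens {
  // Terminals match their characters exactly; escapes follow the rules described above.
  newline  = "\n" | "\r\n"
  quote    = "\""
  capitalA = "\x41"        // the same character as "A" and "\u0041"
  cat      = "\u{1F639}"   // needs Ohm v16.3.0 or later

  // Ranges are inclusive at both ends.
  octalDigit = "0".."7"
  lowerAToF  = "a".."f"
}
```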
### Rule Application
<pre><code><i>ruleName</i></code></pre>
Matches the body of the rule named _ruleName_. For example, the built-in rule `letter` will parse a string of length 1 that is a letter.
<pre><code><i>ruleName</i><<i>expr</i>></code></pre>
Matches the body of the _parameterized rule_ named _ruleName_, substituting the parsing expression _expr_ as its first parameter. For parameterized rules with more than one parameter, the parameters are comma-separated, e.g. `ListOf<field, ";">`.
### Repetition operators: \*, +, ?
<pre><code><i>expr</i> *</code></pre>
Matches the expression _expr_ repeated 0 or more times. E.g., `"a"*` will match `''`, `'a'`, `'aa'`, ...
Inside a _syntactic rule_ — any rule whose name begins with an upper-case letter — spaces before a match are automatically skipped. E.g., `"a"*` will match `" a a"` as well as `"aa"`. See the documentation on [syntactic and lexical rules](#syntactic-lexical) for more information.
<pre><code><i>expr</i> +</code></pre>
Matches the expression _expr_ repeated 1 or more times. E.g., `letter+` will match `'x'`, `'xA'`, ...
As with the `*` operator, spaces are skipped when used in a [syntactic rule](#syntactic-lexical).
<pre><code><i>expr</i> ?</code></pre>
Tries to match the expression _expr_, succeeding whether it matches or not. No input is consumed if it does not match.
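The three operators are often combined in a single rule. The following is a minimal sketch with hypothetical grammar and rule names:

```ohm
Numbers {
  // "-"?   an optional sign
  // digit+ one or more digits
  // (...)? an optional fractional part
  // As a start rule, this matches '7', '-12', and '3.14', but not '.5' or '1.'.
  number = "-"? digit+ ("." digit+)?

  // digit* also accepts the empty string, so this matches '' as well as '42'.
  maybeDigits = digit*
}
```

Both rules are lexical (lowercase names), so no spaces are skipped between the parts.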
### Sequence
<pre><code><i>expr1</i> <i>expr2</i></code></pre>
Matches the expression `expr1` followed by `expr2`. E.g., `"grade" letter` will match `'gradeA'`, `'gradeB'`, ...
As with the `*` and `+` operators, spaces are skipped when used in a [syntactic rule](#syntactic-lexical). E.g., `"grade" letter` will match `' grade A'` as well as `'gradeA'`.
### Alternation
<pre><code><i>expr1</i> | <i>expr2</i></code></pre>
Matches the expression `expr1`, and if that does not succeed, matches the expression `expr2`. E.g., `letter | digit` will match `'a'`, `'9'`, ...
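Because `expr2` is only tried when `expr1` does not succeed, the order of alternatives matters. The sketch below (hypothetical names) shows a common pitfall when one alternative is a prefix of another:

```ohm
Ops {
  // Alternatives are tried left to right, and the first one that succeeds wins.
  // On the input '<=', `shadowed` matches only '<', so a match of the whole input fails.
  shadowed = "<" | "<="

  // Listing the longer terminal first avoids the problem: this matches both '<' and '<='.
  relOp = "<=" | "<"
}
```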
### Lookahead: &
<pre><code>& <i>expr</i></code></pre>
Succeeds if the expression `expr` can be matched, but does not consume anything from the input stream. Usually used as part of a sequence, e.g. `letter &digit` will match `'a9'`, but only consume 'a'. `&"a" letter+` will match any string of letters that begins with 'a'.
### Negative Lookahead: ~
<pre><code>~ <i>expr</i></code></pre>
Succeeds if the expression `expr` cannot be matched, and does not consume anything from the input stream. Usually used as part of a sequence, e.g., `~"\n" any` will consume any single character that is not a new line character.
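Both lookahead operators tend to appear inside sequences and repetitions. The rules below are an illustrative sketch; the grammar and rule names are assumptions made for this example.

```ohm
Lookahead {
  // Negative lookahead: consume characters up to (but not including) the end of the line.
  lineComment = "//" (~"\n" any)*

  // Negative lookahead: everything between the quotes except another quote.
  stringLit = "\"" (~"\"" any)* "\""

  // Positive lookahead: a run of letters that must start with 'a'. The & itself consumes nothing.
  aWord = &"a" letter+
}
```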
### Lexification: <span>#</span>
<pre><code># <i>expr</i></code></pre>
Matches _expr_ as if in a lexical context. This can be used to prevent whitespace skipping before an expression that appears in the body of a syntactic rule. For further information, see [Syntactic vs. Lexical Rules](#syntactic-lexical).
### Comment
Inside an Ohm grammar, you can use both single-line (`//`) comments like
```
booleanLiteral = ("true" | "false") // TODO: Should we support "True"/"False" as well?
```
or
```
// For semantics on how decimal literals are constructed, see section 7.8.3
```
as well as multiline (`/* */`) comments like:
```
/*
Note: Punctuator and DivPunctuator (see https://es5.github.io/x7.html#x7.7) are
not currently used by this grammar.
*/
```
## Built-in Rules
(See [src/built-in-rules.ohm](https://github.com/ohmjs/ohm/blob/main/packages/ohm-js/src/built-in-rules.ohm).)
`any`: Matches the next Unicode character — i.e., a single code point — in the input stream, if one exists.
**NOTE:** A JavaScript string is a sequence of 16-bit _code units_. Some Unicode characters, such as emoji, are encoded as pairs of 16-bit values. For example, the string `'😆'` has length 2, but contains a single Unicode code point. Prior to Ohm v17, `any` always consumed a single 16-bit code unit, rather than a full Unicode character.
`letter`: Matches a single character which is a letter (either uppercase or lowercase).
`lower`: Matches a single lowercase letter.
`upper`: Matches a single uppercase letter.
`digit`: Matches a single character which is a digit from 0 to 9.
`hexDigit`: Matches a single character which is either a digit or a letter from A-F.
`alnum`: Matches a single letter or digit; equivalent to `letter | digit`.
`space`: Matches a single whitespace character (e.g., space, tab, newline, etc.)
`end`: Matches the end of the input stream. Equivalent to `~any`.
<code>caseInsensitive<<i>terminal</i>></code>: Matches _terminal_, but ignoring any differences in casing (based on the simple, single-character Unicode case mappings). E.g., `caseInsensitive<"ohm">` will match `'Ohm'`, `'OHM'`, etc.
<code>ListOf<<i>elem</i>, <i>sep</i>></code>: Matches the expression _elem_ zero or more times, separated by something that matches the expression _sep_. E.g., `ListOf<letter, ",">` will match `''`, `'a'`, and `'a, b, c'`.
<code>NonemptyListOf<<i>elem</i>, <i>sep</i>></code>: Like `ListOf`, but matches _elem_ at least one time.
<code>listOf<<i>elem</i>, <i>sep</i>></code>: Similar to `ListOf<elem, sep>` but interpreted as a [lexical rule](#syntactic-lexical).
<code id="applySyntactic">applySyntactic<<i>ruleName</i>></code>: Allows the syntactic rule _ruleName_ to be applied in a lexical context, which is otherwise not allowed. Spaces are skipped _before_ and _after_ the rule application. _New in Ohm v16.1.0._
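To show how several built-in rules fit together, here is a small sketch. The grammar name `Config` and its rule names are hypothetical.

```ohm
Config {
  // ListOf is syntactic, so spaces around elements and separators are skipped.
  // Matches '', 'a', and 'a, b ,c'.
  Names = ListOf<name, ",">

  // Like ListOf, but requires at least one element.
  Flags = NonemptyListOf<flag, "|">

  // The lexical variant does not skip spaces: matches '1.2.3' but not '1 . 2'.
  version = listOf<digit+, ".">

  // applySyntactic lets a lexical rule apply a syntactic one (Ohm v16.1.0 or later).
  bracketed = "[" applySyntactic<Names> "]"

  name = letter alnum*
  flag = caseInsensitive<"on"> | caseInsensitive<"off">
}
```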
## Grammar Syntax
### Grammar Inheritance
<pre><code><i>grammarName</i> <: <i>supergrammarName</i> { ... }</code></pre>
Declares a grammar named `grammarName` which inherits from `supergrammarName`.
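A minimal sketch of inheritance, using made-up grammar and rule names. It assumes both grammars are defined in the same source (for example, loaded together with `ohm.grammars`), so that the supergrammar is known when the subgrammar is instantiated.

```ohm
Greeting {
  Start = salutation name
  salutation = "hello" | "hi"
  name = letter+
}

// PoliteGreeting inherits all three rules from Greeting and adds one of its own.
PoliteGreeting <: Greeting {
  Farewell = "goodbye" name
}
```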
### Defining, Extending, and Overriding Rules
In the three forms below, the rule body may optionally begin with a `|` character, which will be
ignored. Also note that in rule names, [**case is significant**](#syntactic-lexical).
<pre><code><i>ruleName</i> = <i>expr</i></code></pre>
Defines a new rule named `ruleName` in the grammar, with the parsing expression `expr` as the rule body. Throws an error if a rule with that name already exists in the grammar or one of its supergrammars.
<pre><code><i>ruleName</i> := <i>expr</i></code></pre>
Defines a rule named `ruleName`, overriding a rule of the same name in a supergrammar. Throws an error if no rule with that name exists in a supergrammar.
**New in 15.3.0:** The _super-splice_ operator (`...`) can be used to append and/or prepend cases to the supergrammar rule body. E.g., if the supergrammar defines `comment = multiLineComment`, then `comment := ... | singleLineComment` is equivalent to `comment := multiLineComment | singleLineComment`.
<pre><code><i>ruleName</i> += <i>expr</i></code></pre>
Extends a supergrammar rule named `ruleName`, throwing an error if no rule with that name exists in a supergrammar. The rule body will effectively be <code><i>expr</i> | <i>oldBody</i></code>, where `oldBody` is the rule body as defined in the supergrammar.
Note that as of v15.3.0, the super-splice operator (`...`) offers a more general form of rule extension. E.g., `keyword += "def"` can also be written `keyword := "def" | ...`.
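The three forms can be seen side by side in the following sketch. The grammars `Base` and `Extended` and their rules are hypothetical. Note that the alternatives in each body have the same arity, which Ohm requires of alternation expressions.

```ohm
Base {
  Stmt = keyword ident
  keyword = "let" | "const"
  ident = letter alnum*
  comment = blockComment
  blockComment = "/*" (~"*/" any)* "*/"
}

Extended <: Base {
  // Define: this would be an error if Base already had a rule named lineComment.
  lineComment = "//" (~"\n" any)*

  // Extend: the body is effectively  "var" | "let" | "const".
  keyword += "var"

  // Override with super-splice: the original body first, then the new case.
  comment := ... | lineComment

  // Plain override: replaces the body inherited from Base.
  ident := (letter | "_") (alnum | "_")*
}
```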
#### Parameterized Rules
<pre><code><i>ruleName</i><<i>arg1</i>, ..., <i>argN</i>> = <i>expr</i></code></pre>
Defines a new rule named `ruleName` which has _n_ parameters. In the rule body _expr_, the parameter names (e.g. _arg1_) may be used as rule applications. E.g., `Repeat<x> = x x`.
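As a slightly larger sketch than `Repeat<x>`, the grammar below defines one rule with two parameters and another that passes its parameter on to the built-in `ListOf`. All names here are invented for illustration.

```ohm
Pairs {
  // Matches '(12, abc)' as well as '( 12 ,abc )', since Start and Pair are syntactic.
  Start = Pair<number, ident>

  // a and b are parameters; inside the body they are used like rule applications.
  Pair<a, b> = "(" a "," b ")"

  // A parameter can be passed along to another parameterized rule.
  Bracketed<x> = "[" ListOf<x, ","> "]"

  number = digit+
  ident = letter alnum*
}
```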
#### Rule Descriptions
Rule declarations may optionally have a description, which is a parenthesized "comment" following the name of the rule in its declaration. Rule descriptions are used to produce better error messages for end users of a language when input is not recognized. For example:
<!-- @markscript
function checkRule(source) {
assert(ohm.ohmGrammar.match(source, 'Rule').succeeded());
}
markscript.transformNextBlock(checkRule);
-->
```
ident (an identifier)
= ~keyword name
```
#### Inline Rule Declarations
<pre><code><i>expr</i> -- <i>caseName</i></code></pre>
When a parsing expression is followed by the characters `--` and a name, it signals an _inline rule declaration_. This is most commonly used in alternation expressions to ensure that each branch has the same arity. For example, the following declaration:
<!-- @markscript
markscript.transformNextBlock(checkRule);
-->
```
AddExp = AddExp "+" MulExp -- plus
| MulExp
```
is equivalent to:
```ohm
AddExp = AddExp_plus
| MulExp
AddExp_plus = AddExp "+" MulExp
```
<h3 id="syntactic-lexical">Syntactic vs. Lexical Rules</h3>
<!-- https://ohmjs.org/d/svl -->
A _syntactic rule_ is a rule whose name begins with an uppercase letter, and a _lexical rule_ is one whose name begins with a lowercase letter. The difference between lexical and syntactic rules is that syntactic rules implicitly skip whitespace characters.
The definition of "whitespace character" is anything that matches the grammar's `space` rule. The default implementation of `space` matches ' ', '\t', '\n', '\r', and any other character that is considered whitespace in the [ES5 spec](http://ecma-international.org/ecma-262/5.1/#sec-7.2).
#### How space skipping works
In the body of a syntactic rule, Ohm implicitly inserts applications of the `spaces` rule before each expression. (The `spaces` rule is defined as `spaces = space*`.) As an example, take this fragment of JSON grammar:
<!-- @markscript
let syntacticDefs;
markscript.transformNextBlock(code => {
syntacticDefs = code;
});
-->
```
Array = "[" "]" -- empty
| "[" Elements "]" -- nonEmpty
Elements = Element ("," Element)*
```
`Array` and `Elements` are both syntactic rules, since their names begin with a capital letter. Here's what a lexical version of these rules would look like, with _explicit_ space skipping:
<!-- @markscript
let lexicalDefs;
const delexifyRuleNames = str =>
str.replace(/array/g, 'Array').replace(/element/g, 'Element');
markscript.transformNextBlock(code => {
lexicalDefs = code;
assert.equal(syntacticDefs, delexifyRuleNames(lexicalDefs).replace(/spaces /g, ''));
const g = ohm.grammar(`
JSON {
${syntacticDefs}
${lexicalDefs}
lexStart = array spaces // Ensure trailing space is skipped.
Element = number
element = number
number = digit+
}
`);
assert(g.match(' [2, 33 ] ').succeeded());
assert(g.match(' [2, 33 ] ', 'lexStart').succeeded());
assert(g.match(' [ ] ').succeeded());
assert(g.match(' [ ] ', 'lexStart').succeeded());
assert(g.match('[] ').succeeded());
assert(g.match('[] ', 'lexStart').succeeded());
assert(g.match(' [12 ,2,2]').succeeded());
assert(g.match(' [12 ,2,2]', 'lexStart').succeeded());
assert(g.match(' [1 2]').failed());
assert(g.match(' [1 2]', 'lexStart').failed());
assert(g.match(' [1,]').failed());
assert(g.match(' [1,]', 'lexStart').failed());
});
-->
```
array = spaces "[" spaces "]" -- empty
| spaces "[" spaces elements spaces "]" -- nonEmpty
elements = spaces element (spaces "," spaces element)*
```
In terms of the language it accepts, this version of the rules — with explicit space skipping — is equivalent to the syntactic version above.
A few other details that are helpful to know:
1. If the start rule is a syntactic rule, both leading and trailing spaces are skipped around the top-level application.
2. When the body of a rule contains a [repetition operator](#repetition-operators---) (e.g. `+` or `*`), spaces are skipped before each match. In other words, `Names = name+` is equivalent to `names = (spaces name)+`.
3. The [lexification operator (`#`)](#lexification-) can be used in the body of a syntactic rule to prevent space skipping in specific places. For example:
<!-- @markscript
let syntacticKeyValueDef;
markscript.transformNextBlock(code => { syntacticKeyValueDef = code; });
-->
```
KeyAndValue = #(letter alnum+) ":" #(digit+)
```
is equivalent to:
<!-- @markscript
markscript.transformNextBlock(code => {
let lexicalKeyValueDef = code;
const g = ohm.grammar(`G { ${syntacticKeyValueDef} ${lexicalKeyValueDef} }`);
assert(g.match('count :33', 'keyAndValue').succeeded());
assert(g.match('count :33', 'KeyAndValue').succeeded());
assert(g.match('count: 33', 'keyAndValue').failed());
assert(g.match('count: 33', 'KeyAndValue').failed());
});
-->
```
keyAndValue = letter alnum+ spaces ":" digit+
```
Note that no space skipping occurs _inside_ or _before_ the lexical context defined by the `#` character. That means that this rule will match `'count :33'`, but _not_ `'count: 33'`.