Clarify lexing is greedy with lookahead restrictions.
GraphQL syntactical grammars are intended to be unambiguous. Lexical grammars should be as well, but historically there has been an unstated assumption that lexical parsing is greedy. This is obvious for numbers and words, but less obvious for empty block strings.

This also removes regular expression representation from the lexical grammar notation, since it wasn't always clear.

Either way, the additional clarity removes ambiguity from the spec.

Partial fix for #564

Specifically addresses #564 (comment)
leebyron committed Jul 30, 2019
1 parent 439cacf commit 27c2602
Showing 3 changed files with 149 additions and 55 deletions.
46 changes: 30 additions & 16 deletions spec/Appendix A -- Notation Conventions.md
@@ -22,8 +22,10 @@ of the sequences it is defined by, until all non-terminal symbols have been
replaced by terminal characters.

Terminals are represented in this document in a monospace font in two forms: a
specific Unicode character or sequence of Unicode characters (ex. {`=`} or {`terminal`}), and a pattern of Unicode characters defined by a regular expression
(ex {/[0-9]+/}).
specific Unicode character or sequence of Unicode characters (e.g. {`=`} or
{`terminal`}), and prose typically describing a specific Unicode code-point
{"Space (U+0020)"}. Sequences of Unicode characters only appear in syntactic
grammars and represent a {Name} token of that specific sequence.

Non-terminal production rules are represented in this document using the
following notation for a non-terminal with a single definition:
@@ -48,23 +50,25 @@ ListOfLetterA :

The GraphQL language is defined in a syntactic grammar where terminal symbols
are tokens. Tokens are defined in a lexical grammar which matches patterns of
source characters. The result of parsing a sequence of source Unicode characters
produces a GraphQL AST.
source characters. Parsing a source text sequence of Unicode characters first
produces a sequence of lexical tokens according to the lexical grammar, which
then produces an abstract syntax tree (AST) according to the syntactical
grammar.

A Lexical grammar production describes non-terminal "tokens" by
A lexical grammar production describes non-terminal "tokens" by
patterns of terminal Unicode characters. No "whitespace" or other ignored
characters may appear between any terminal Unicode characters in the lexical
grammar production. A lexical grammar production is distinguished by a two colon
`::` definition.

Word :: /[A-Za-z]+/
Word :: Letter+

A Syntactical grammar production describes non-terminal "rules" by patterns of
terminal Tokens. Whitespace and other ignored characters may appear before or
after any terminal Token. A syntactical grammar production is distinguished by a
one colon `:` definition.
terminal Tokens. {WhiteSpace} and other {Ignored} sequences may appear before or
after any terminal {Token}. A syntactical grammar production is distinguished by
a one colon `:` definition.

Sentence : Noun Verb
Sentence : Word+ `.`


## Grammar Notation
@@ -80,13 +84,11 @@ and their expanded definitions in the context-free grammar.
A grammar production may specify that certain expansions are not permitted by
using the phrase "but not" and then indicating the expansions to be excluded.

For example, the production:
For example, the following production means that the nonterminal {SafeWord} may
be replaced by any sequence of characters that could replace {Word} provided
that the same sequence of characters could not replace {SevenCarlinWords}.

SafeName : Name but not SevenCarlinWords

means that the nonterminal {SafeName} may be replaced by any sequence of
characters that could replace {Name} provided that the same sequence of
characters could not replace {SevenCarlinWords}.
SafeWord : Word but not SevenCarlinWords

A grammar may also list a number of restrictions after "but not" separated
by "or".
@@ -96,6 +98,18 @@ For example:
NonBooleanName : Name but not `true` or `false`


**Lookahead Restrictions**

A grammar production may specify that certain characters or tokens are not
permitted to follow it by using the pattern {[lookahead != NotAllowed]}.
Lookahead restrictions are often used to remove ambiguity from the grammar.

The following example makes it clear that {Letter+} must be greedy, since {Word}
cannot be followed by yet another {Letter}.

Word :: Letter+ [lookahead != Letter]
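
As an illustrative sketch only (not part of the specification), the effect of this lookahead restriction can be shown in Python. The helper names `word_candidates` and `valid_words` are hypothetical, and `Letter` is assumed to mean the ASCII letters. Without the restriction, several prefixes of the input satisfy {Letter+}; with it, only the longest one remains.

```python
import string

# Assumption: Letter is A-Z and a-z, as in the Letter production of this commit.
LETTERS = set(string.ascii_letters)


def word_candidates(source):
    """Every prefix of `source` that matches Letter+, ignoring lookahead."""
    candidates = []
    for end in range(1, len(source) + 1):
        if source[end - 1] not in LETTERS:
            break
        candidates.append(source[:end])
    return candidates


def valid_words(source):
    """Candidates that also satisfy [lookahead != Letter]."""
    return [
        word
        for word in word_candidates(source)
        # A candidate is valid at end of input or when the next character
        # is not another Letter.
        if len(word) == len(source) or source[len(word)] not in LETTERS
    ]
```

For the input `cat!`, `word_candidates` yields `c`, `ca`, and `cat`, while `valid_words` keeps only `cat`, which is the greedy match.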


**Optionality and Lists**

A subscript suffix "{Symbol?}" is shorthand for two possible sequences, one
48 changes: 36 additions & 12 deletions spec/Appendix B -- Grammar Summary.md
@@ -1,6 +1,12 @@
# B. Appendix: Grammar Summary

SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
## Source Text

SourceCharacter ::
- "U+0009"
- "U+000A"
- "U+000D"
- "U+0020–U+FFFF"


## Ignored Tokens
@@ -20,10 +26,10 @@ WhiteSpace ::

LineTerminator ::
- "New Line (U+000A)"
- "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
- "Carriage Return (U+000D)" [lookahead != "New Line (U+000A)"]
- "Carriage Return (U+000D)" "New Line (U+000A)"

Comment :: `#` CommentChar*
Comment :: `#` CommentChar* [lookahead != CommentChar]

CommentChar :: SourceCharacter but not LineTerminator

@@ -41,24 +47,41 @@ Token ::

Punctuator :: one of ! $ & ( ) ... : = @ [ ] { | }

Name :: /[_A-Za-z][_0-9A-Za-z]*/
Name ::
- NameStart NameContinue* [lookahead != NameContinue]

NameStart ::
- Letter
- `_`

NameContinue ::
- Letter
- Digit
- `_`

IntValue :: IntegerPart
Letter :: one of
`A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M`
`N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z`
`a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m`
`n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z`

Digit :: one of
`0` `1` `2` `3` `4` `5` `6` `7` `8` `9`

IntValue :: IntegerPart [lookahead != {Digit, `.`}]

IntegerPart ::
- NegativeSign? 0
- NegativeSign? NonZeroDigit Digit*

NegativeSign :: -

Digit :: one of 0 1 2 3 4 5 6 7 8 9

NonZeroDigit :: Digit but not `0`

FloatValue ::
- IntegerPart FractionalPart
- IntegerPart ExponentPart
- IntegerPart FractionalPart ExponentPart
- IntegerPart FractionalPart ExponentPart [lookahead != Digit]
- IntegerPart FractionalPart [lookahead != Digit]
- IntegerPart ExponentPart [lookahead != Digit]

FractionalPart :: . Digit+

@@ -69,7 +92,8 @@ ExponentIndicator :: one of `e` `E`
Sign :: one of + -

StringValue ::
- `"` StringCharacter* `"`
- `""` [lookahead != `"`]
- `"` StringCharacter+ `"`
- `"""` BlockStringCharacter* `"""`

StringCharacter ::
@@ -89,7 +113,7 @@ Note: Block string values are interpreted to exclude blank initial and trailing
lines and uniform indentation with {BlockStringValue()}.


## Document
## Document Syntax

Document : Definition+

110 changes: 83 additions & 27 deletions spec/Section 2 -- Language.md
@@ -7,16 +7,50 @@ common unit of composition allowing for query reuse.

A GraphQL document is defined as a syntactic grammar where terminal symbols are
tokens (indivisible lexical units). These tokens are defined in a lexical
grammar which matches patterns of source characters (defined by a
double-colon `::`).
grammar which matches patterns of source characters. In this document, syntactic
grammar productions are distinguished with a colon `:` while lexical grammar
productions are distinguished with a double-colon `::`.

Note: See [Appendix A](#sec-Appendix-Notation-Conventions) for more details about the definition of lexical and syntactic grammar and other notational conventions
used in this document.
The source text of a GraphQL document must be a sequence of {SourceCharacter}.
The character sequence must be described by a sequence of {Token} and {Ignored}
lexical grammars. The lexical token sequence, omitting {Ignored}, must be
described by a single {Document} syntactic grammar.

Note: See [Appendix A](#sec-Appendix-Notation-Conventions) for more information
about the lexical and syntactic grammar and other notational conventions used
throughout this document.

**Lexical Analysis & Syntactic Parse**

The source text of a GraphQL document is first converted into a sequence of
lexical tokens, {Token}, and ignored tokens, {Ignored}. The source text is
scanned from left to right, repeatedly taking the next possible sequence of
code-points allowed by the lexical grammar productions as the next token. This
sequence of lexical tokens is then scanned from left to right to produce an
abstract syntax tree (AST) according to the {Document} syntactical grammar.

Lexical grammar productions in this document use *lookahead restrictions* to
remove ambiguity and ensure a single valid lexical analysis. A lexical token is
only valid if not followed by a character in its lookahead restriction.

For example, an {IntValue} has the restriction {[lookahead != Digit]}, so it
cannot be followed by a {Digit}. Because of this, the sequence `123` cannot be
represented as the tokens (`12`, `3`), since `12` is followed by the {Digit}
`3`; it must represent a single token. Use {WhiteSpace} or other {Ignored}
between characters to represent multiple tokens.

Note: This typically has the same behavior as a
"[maximal munch](https://en.wikipedia.org/wiki/Maximal_munch)" longest possible
match, however some lookahead restrictions include additional constraints.
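
The left-to-right scan described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the reference implementation: it covers only {Name}, {IntValue}, {Punctuator}, and a subset of {Ignored} (white space, line terminators, commas), the names `TOKEN_RE` and `tokenize` are hypothetical, and regular expressions are used for brevity even though the grammar notation no longer uses them. The negative lookahead `(?![0-9.])` plays the role of {[lookahead != {Digit, `.`}]}.

```python
import re

# Assumption: a deliberately small subset of the lexical grammar.
TOKEN_RE = re.compile(
    r"""
    (?P<Ignored>[\t\n\r\x20,]+)
  | (?P<Name>[_A-Za-z][_0-9A-Za-z]*)
  | (?P<IntValue>-?(?:0|[1-9][0-9]*)(?![0-9.]))
  | (?P<Punctuator>\.\.\.|[!$&():=@\[\]{|}])
    """,
    re.VERBOSE,
)


def tokenize(source):
    """Scan left to right, returning (kind, text) pairs and omitting Ignored."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"Unexpected character at offset {pos}")
        if match.lastgroup != "Ignored":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

With this sketch, `tokenize("123")` produces a single {IntValue} token, `tokenize("12 3")` produces two, and `tokenize("0123")` raises `SyntaxError` because `0` is followed by a {Digit}.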


## Source Text

SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
SourceCharacter ::
- "U+0009"
- "U+000A"
- "U+000D"
- "U+0020–U+FFFF"
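
For illustration, the {SourceCharacter} production above corresponds to a simple code point check. The function name `is_source_character` is hypothetical, and the input is assumed to be a single-character Python string.

```python
def is_source_character(ch):
    """Return True if `ch` is one of the code points listed for SourceCharacter."""
    code_point = ord(ch)
    # U+0009 (tab), U+000A (new line), U+000D (carriage return),
    # or anything in the range U+0020 to U+FFFF inclusive.
    return code_point in (0x0009, 0x000A, 0x000D) or 0x0020 <= code_point <= 0xFFFF
```

Under this check, control characters such as U+0000 and code points above U+FFFF are rejected.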

GraphQL documents are expressed as a sequence of
[Unicode](https://unicode.org/standard/standard.html) characters. However, with
@@ -60,7 +94,7 @@ control tools.

LineTerminator ::
- "New Line (U+000A)"
- "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
- "Carriage Return (U+000D)" [lookahead != "New Line (U+000A)"]
- "Carriage Return (U+000D)" "New Line (U+000A)"
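
A short Python sketch of how the three {LineTerminator} alternatives might be recognized, assuming a hypothetical helper `count_line_terminators`. The point it illustrates is that a carriage return immediately followed by a new line counts as one terminator, not two.

```python
def count_line_terminators(source):
    """Count LineTerminator tokens, treating CR LF as a single terminator."""
    count = 0
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "\n":
            count += 1
        elif ch == "\r":
            count += 1
            # "Carriage Return" "New Line": consume the LF as part of this token.
            if i + 1 < len(source) and source[i + 1] == "\n":
                i += 1
        i += 1
    return count
```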

Like white space, line terminators are used to improve the legibility of source
@@ -75,7 +109,7 @@ the line number.

### Comments

Comment :: `#` CommentChar*
Comment :: `#` CommentChar* [lookahead != CommentChar]

CommentChar :: SourceCharacter but not LineTerminator

@@ -118,8 +152,7 @@ Token ::
A GraphQL document is comprised of several kinds of indivisible lexical tokens
defined here in a lexical grammar by patterns of source Unicode characters.

Tokens are later used as terminal symbols in a GraphQL Document
syntactic grammars.
Tokens are later used as terminal symbols in GraphQL syntactic grammar rules.


### Ignored Tokens
@@ -131,15 +164,16 @@ Ignored ::
- Comment
- Comma

Before and after every lexical token may be any amount of ignored tokens
including {WhiteSpace} and {Comment}. No ignored regions of a source
document are significant, however ignored source characters may appear within
a lexical token in a significant way, for example a {String} may contain white
space characters.
{Ignored} tokens are used to improve readability and provide separation between
{Token}, but are otherwise insignificant and not referenced in syntactical
grammar productions.

No characters are ignored while parsing a given token, as an example no
white space characters are permitted between the characters defining a
{FloatValue}.
Any amount of {Ignored} may appear before and after every lexical token. No
ignored regions of a source document are significant; however, ignored source
characters may appear within a lexical token in a significant way. For example,
a {String} may contain white space characters. No characters are ignored within
a {Token}; for example, no white space characters are permitted between the
characters defining a {FloatValue}.


### Punctuators
@@ -153,7 +187,26 @@ lacks the punctuation often used to describe mathematical expressions.

### Names

Name :: /[_A-Za-z][_0-9A-Za-z]*/
Name ::
- NameStart NameContinue* [lookahead != NameContinue]

NameStart ::
- Letter
- `_`

NameContinue ::
- Letter
- Digit
- `_`

Letter :: one of
`A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M`
`N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z`
`a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m`
`n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z`

Digit :: one of
`0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
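
The {Name} productions above can be read as a small hand-written scanner. The following Python sketch assumes a hypothetical function `read_name`; consuming every {NameContinue} character before stopping is what satisfies {[lookahead != NameContinue]}.

```python
import string

# NameStart :: Letter or `_`; NameContinue :: Letter, Digit, or `_`.
NAME_START = set(string.ascii_letters) | {"_"}
NAME_CONTINUE = NAME_START | set(string.digits)


def read_name(source, start=0):
    """Return the Name beginning at `start`, or None if none begins there."""
    if start >= len(source) or source[start] not in NAME_START:
        return None
    end = start + 1
    # Keep consuming so the token is never followed by a NameContinue.
    while end < len(source) and source[end] in NAME_CONTINUE:
        end += 1
    return source[start:end]
```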

GraphQL Documents are full of named things: operations, fields, arguments,
types, directives, fragments, and variables. All names must follow the same
@@ -163,8 +216,9 @@ Names in GraphQL are case-sensitive. That is to say `name`, `Name`, and `NAME`
all refer to different names. Underscores are significant, which means
`other_name` and `othername` are two different names.

Names in GraphQL are limited to this <acronym>ASCII</acronym> subset of possible
characters to support interoperation with as many other systems as possible.
Note: Names in GraphQL are limited to the Latin <acronym>ASCII</acronym> subset
of possible Source Characters in order to support interoperation with as many
other systems as possible.


## Document
@@ -666,27 +720,28 @@ specified as a variable. List and inputs objects may also contain variables (unl

### Int Value

IntValue :: IntegerPart
IntValue :: IntegerPart [lookahead != {Digit, `.`}]

IntegerPart ::
- NegativeSign? 0
- NegativeSign? NonZeroDigit Digit*

NegativeSign :: -

Digit :: one of 0 1 2 3 4 5 6 7 8 9

NonZeroDigit :: Digit but not `0`

An Int number is specified without a decimal point or exponent (ex. `1`).

An {IntValue} must not be followed by a {`.`}. If a {`.`} follows, the token
must only be interpreted as a {FloatValue}.
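
One way to picture the interaction between {IntValue} and {FloatValue} is the Python sketch below. It is an assumption-laden illustration: `read_number` and `NUMBER_RE` are hypothetical names, and a regular expression stands in for the {IntegerPart}, {FractionalPart}, and {ExponentPart} productions. The lookahead restrictions are checked explicitly after the match.

```python
import re

# IntegerPart, then optional FractionalPart, then optional ExponentPart.
NUMBER_RE = re.compile(
    r"-?(?:0|[1-9][0-9]*)(?P<frac>\.[0-9]+)?(?P<exp>[eE][+-]?[0-9]+)?"
)

FORBIDDEN_AFTER_INT = set("0123456789.")   # [lookahead != {Digit, `.`}]
FORBIDDEN_AFTER_FLOAT = set("0123456789")  # [lookahead != Digit]


def read_number(source, start=0):
    """Return (kind, text) for the number at `start`, or None if invalid."""
    match = NUMBER_RE.match(source, start)
    if match is None:
        return None
    is_float = match.group("frac") is not None or match.group("exp") is not None
    # The character after the token; empty string at end of input.
    following = source[match.end():match.end() + 1]
    forbidden = FORBIDDEN_AFTER_FLOAT if is_float else FORBIDDEN_AFTER_INT
    if following in forbidden:
        return None
    return ("FloatValue" if is_float else "IntValue", match.group())
```

Here `read_number("1.0")` is classified as a {FloatValue}, while `read_number("1.")` and `read_number("0123")` return `None` because the {IntegerPart} is followed by a forbidden character.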


### Float Value

FloatValue ::
- IntegerPart FractionalPart
- IntegerPart ExponentPart
- IntegerPart FractionalPart ExponentPart
- IntegerPart FractionalPart ExponentPart [lookahead != Digit]
- IntegerPart FractionalPart [lookahead != Digit]
- IntegerPart ExponentPart [lookahead != Digit]

FractionalPart :: . Digit+

@@ -710,7 +765,8 @@ The two keywords `true` and `false` represent the two boolean values.
### String Value

StringValue ::
- `"` StringCharacter* `"`
- `""` [lookahead != `"`]
- `"` StringCharacter+ `"`
- `"""` BlockStringCharacter* `"""`
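
The lookahead on the empty string alternative exists so that `""` at the start of `"""` is not lexed as an empty string. A minimal Python sketch of that decision, using a hypothetical `classify_string_start` helper that only inspects the opening quotes and does not scan the string body:

```python
def classify_string_start(source, start=0):
    """Decide which StringValue alternative begins at `start`, if any."""
    # Checking three quotes first is equivalent to `""` [lookahead != `"`]:
    # two quotes are an empty string only when a third does not follow.
    if source.startswith('"""', start):
        return "block"
    if source.startswith('""', start):
        return "empty"
    if source.startswith('"', start):
        return "string"
    return None
```

Under this sketch, the input `""""""` (six quotes) is classified as an empty block string rather than three empty strings.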

StringCharacter ::
