
Clarify that lexing is greedy #599

Merged
merged 1 commit into from
Jan 10, 2020
Conversation

leebyron
Collaborator

@leebyron leebyron commented Jul 23, 2019

GraphQL syntactical grammars intend to be unambiguous. While lexical grammars should be as well, there has historically been an assumption that lexical parsing is greedy. This is obvious for numbers and words, but less obvious for empty block strings.

Either way, the additional clarity removes ambiguity from the spec.

Partial fix for #564
Fixes #572

Specifically addresses #564 (comment)
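To illustrate what "greedy" means here: the lexer reads the longest sequence of characters that can form a valid token before emitting it. A minimal sketch of a greedy Name reader (hypothetical code, not graphql-js itself):

```javascript
// Hypothetical sketch of a greedy token reader (not graphql-js code).
// A greedy lexer consumes the longest possible match, so "abc123" is
// a single Name token rather than "abc" followed by "123".
function readName(source, start) {
  let end = start;
  while (end < source.length && /[_0-9A-Za-z]/.test(source[end])) {
    end += 1;
  }
  return source.slice(start, end);
}
```

For example, `readName("abc123", 0)` returns the whole string `"abc123"`, not just `"abc"`.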

@leebyron leebyron force-pushed the editorial-greedy-lexer branch 2 times, most recently from ead49a0 to 919f3f2 Compare July 23, 2019 02:15
@leebyron leebyron added the 🤷‍♀️ Ambiguity An issue/PR which identifies or fixes spec ambiguity label Jul 23, 2019
@leebyron leebyron force-pushed the editorial-greedy-lexer branch 2 times, most recently from fac60df to 92c1602 Compare July 23, 2019 02:33
Member

@IvanGoncharov IvanGoncharov left a comment


@leebyron Looks good 👍

P.S. In retrospect, I think making commas optional was a mistake, since it causes not only this issue but also complicates writing an error-recovering parser.

@leebyron
Collaborator Author

This issue is unrelated to optional commas, since the lexical grammar applies before the syntactical grammar. Other languages face the same thing, and I found that a clause like this is pretty common.

Javascript's is even more complex actually, but has a similar clause about the lexer being greedy: https://www.ecma-international.org/ecma-262/10.0/index.html#sec-ecmascript-language-lexical-grammar

@leebyron
Collaborator Author

I'd love to learn more about your exploration into error recovering parsing. I know that some strategies involve looking for a comma, semicolon or line break to attempt recovery.

The parser used in GraphiQL has some error recovery properties as well, though I'm not sure if there's any academic literature to back up the method it uses.

@IvanGoncharov
Member

> This issue is unrelated to optional commas since the lexical grammar applies before the syntactical grammar. Other languages face the same thing and a clause like this is pretty common I found.
>
> Javascript's is even more complex actually, but has a similar clause about the lexer being greedy: ecma-international.org/ecma-262/10.0/index.html#sec-ecmascript-language-lexical-grammar

@leebyron Thanks for the link, now I see how it happens with other languages.
It helped me to reformulate the problem and understand how graphql-js's implementation of a "greedy" lexer differs from other languages.
As you pointed out, JS also has a greedy lexer, but it's totally different from GraphQL in how it handles invalid tokens. For example, in Node's REPL:

> abc = 1
1
> [1,abc]
[ 1, 1 ]
> [1abc]
Thrown:
[1abc]
 ^

SyntaxError: Invalid or unexpected token

But since in GraphQL whitespace and commas are insignificant, both [1abc] and [1,abc] should be equivalent. Moreover, by the same logic, [1efg] should be equivalent to [1, efg], but in the graphql-js lexer it is interpreted as a number in scientific notation with an invalid exponent.

So if we now adopt the same definition of "greedy" lexer as JS and other languages, how should [1abc] and [1efg] now be parsed?

@leebyron
Collaborator Author

Great question. I'm actually surprised to see the JS error point to that specific location. I'll need to read more about how JS and other languages are handling this, and if this is implementation specific or specified behavior.

Given the ECMAScript spec, I would have assumed that 1abc would be parsed as 1, abc and then given a syntax error that abc was unexpected (missing comma). But perhaps there's something else going on.

As our spec is currently written, I would expect [1abc] to be valid (though unclear) and equivalent to [1, abc] while abc1 is a valid identifier.

Perhaps JS and other languages have a more complex number lexer?

@leebyron
Collaborator Author

Indeed, the ECMAScript spec has specific language for this (though, surprisingly, it's not in the formal grammar).

See the prose and note at the end of this section: https://www.ecma-international.org/ecma-262/10.0/index.html#sec-literals-numeric-literals

> The SourceCharacter immediately following a NumericLiteral must not be an IdentifierStart or DecimalDigit.
>
> NOTE: For example: 3in is an error and not the two input elements 3 and in.

I would have expected a negative lookahead in the formal grammar.
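Expressed as a hand-written lexer check rather than a grammar rule, such a follow restriction might look like this (hypothetical sketch; the function name and error message are illustrative, not any real implementation's):

```javascript
// Hypothetical sketch: an integer reader with an ECMAScript-style
// follow restriction. The numeric literal must not be immediately
// followed by an identifier character or digit, so "3in" is a
// lexing error rather than the two tokens [3, in].
function readInt(source, start) {
  let end = start;
  while (end < source.length && /[0-9]/.test(source[end])) {
    end += 1;
  }
  if (end < source.length && /[_0-9A-Za-z]/.test(source[end])) {
    throw new SyntaxError(
      `Invalid number: unexpected "${source[end]}" at position ${end}`
    );
  }
  return source.slice(start, end);
}
```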

Something like this adds value since it protects the space in case we ever wanted to use trailing letters for different kinds of numbers (seems unlikely), but certainly it helps combat the potential confusion or poor readability.

If we were to pursue adding that, we should do it as a separate RFC instead of an editorial clarification, since it's technically a (breaking) language change, and so we should go through the stages to get implementations on board.

@leebyron
Collaborator Author

leebyron commented Jul 29, 2019

Doing a little more research about how other popular languages specify this behavior and looking for specific quirks around numbers...

Maximal Munch

First, learning about the academic background. A term of art for this is "maximal munch." There's some interesting research about avoiding implementation rules and relying only on unambiguous grammars by using follow restrictions (essentially what I was expecting JS to be doing)

For example, we might want to describe the Name Lex Grammar to not be allowed to be followed by a [_0-9A-Za-z]

Today we describe some of our Lex grammars with regular expressions which use greedy * and + operators.

https://en.wikipedia.org/wiki/Maximal_munch
https://research.cs.wisc.edu/wpis/papers/toplas98b.pdf
https://link.springer.com/chapter/10.1007%2F3-540-45937-5_12
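As a hypothetical illustration of a plain maximal munch loop without any follow restrictions (the rule set is simplified and is not the spec's actual productions):

```javascript
// Hypothetical maximal munch tokenizer: at each position, try every
// rule and keep the longest match. Rule names are illustrative.
const rules = [
  ["Name", /^[_A-Za-z][_0-9A-Za-z]*/],
  ["Int", /^-?(0|[1-9][0-9]*)/],
  ["Punctuator", /^[!$():=@\[\]{|}]/],
];

function tokenize(source) {
  const tokens = [];
  let rest = source;
  while (rest.length > 0) {
    // Skip ignored characters (whitespace and commas).
    if (/^[\s,]/.test(rest)) {
      rest = rest.slice(1);
      continue;
    }
    let best = null;
    for (const [kind, re] of rules) {
      const m = re.exec(rest);
      if (m && (best === null || m[0].length > best.value.length)) {
        best = { kind, value: m[0] };
      }
    }
    if (best === null) throw new SyntaxError(`Cannot lex "${rest[0]}"`);
    tokens.push(best);
    rest = rest.slice(best.value.length);
  }
  return tokens;
}
```

Note that this naive loop happily splits 123abc into [123, abc] and 00 into [0, 0]; lookahead restrictions are exactly what rules those out.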

C

Uses a maximal munch lexer with extra rules (header preprocessors)

If the input stream has been parsed into preprocessing tokens up to a given character, the
next preprocessing token is the longest sequence of characters that could constitute a
preprocessing token.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf (page 49)

Javascript

Uses a maximal munch lexer with extra rules (for regex, template literals). Numbers are not allowed to be followed by a letter (and also a decimal digit - which is an attempt at a disambiguating follow restriction)

After reading grammars for a bunch of other languages, I think this limitation is really smart. Most programming languages offer many numeric literals, including 0b001100 and 0xAaB2C3, and C99 includes type suffixes, as in the literal 123L. These are currently not valid GraphQL literals; however, our lexer does not enforce that. Instead, 0x1F would be lexed as [0, x1F] given the current specified behavior. That could parse syntactically if it's in a List value position. While it's almost assuredly not going to be valid and will produce an error at that point, it's going to be an odd error.

I don't know the historical context for this follow restriction in Javascript, but I assume it is to produce clear errors if someone attempts to parse JS-ish C99 code and perhaps there was foresight that they might want to support something like this in the future and wanted to protect the lexical space.

http://www.ecma-international.org/ecma-262/10.0/index.html#sec-ecmascript-language-lexical-grammar
https://www.ecma-international.org/ecma-262/10.0/index.html#sec-literals-numeric-literals

Others

I also investigated the lexers for Swift, Kotlin, Python, and a couple of other languages which have specifications for their language parsing. Nearly all of them include some language about a maximal-munch-style algorithm. None of them have follow restrictions for numeric literals.

https://docs.swift.org/swift-book/ReferenceManual/LexicalStructure.html
https://kotlinlang.org/docs/reference/grammar.html
https://docs.python.org/3/reference/lexical_analysis.html

Unrelated notes & findings

After spending a night reading modern language specifications, I'm seeing some clear trends, that while unrelated to this issue, I think are interesting and I wanted to document my thoughts.

Underscores in numeric literals

There seems to be a trend in newer languages (including Python 3) to allow arbitrary underscores in a numeric literal, which are ignored during interpretation. This is to aid readability ie 0b_0010_1001 and 1_000_000. While tempting to follow a language trend, I don't think we should pursue this since integer literals can be used for ID type inputs, and underscores might actually cause confusion in this context. GraphQL is also typically not number heavy, so the benefit is less than for programming languages.

Unicode astral plane support

I made a mistake basing GraphQL's initial source and lexical parsing logic on ECMAScript's, in that it now unnecessarily suffers from many of the same Unicode related issues. We've known this for a while (there was an ill-fated attempt at fixing this a couple years back). Nearly all modern languages don't make the same mistake.

More specifically, GraphQL's spec borrowed JS's assumptions of a UTF16 (previously UCS-2) encoded world, and mixes concepts of a character (which is ambiguous), code point, and code unit. Because of UTF16, the language specifies source only allowing up to U+FFFF, but ideally we should support up to U+10FFFF. Notably, we should support emoji in string literals without requiring surrogate pair code units.

Similarly, we missed an opportunity for correct string literal unicode escape sequences.

I'll probably propose an RFC fixing most of these unicode issues in the future.

@leebyron
Collaborator Author

Significant update, so I've put this back up for review. @andimarek I'd love your eyes on this as well.

This latest update is informed by my research and makes a few significant changes:

  • Adds lookahead restrictions to many lexical token productions, which enforce that they must be greedy.
  • Adds a subsection on lexical analysis and grammar parsing explaining the tokenization process, as well as related cleanup and clarity around how Token and Ignored interact.
  • Removes regular expressions as a notational representation, since it doesn't speak to our expectations of implementation.
  • Edits to Appendix A to include the missing description of lookahead restriction notation and some related cleanup.

@leebyron
Collaborator Author

leebyron commented Aug 1, 2019

This question came up in the GraphQL working group meeting today, so I wanted to follow up briefly:

> Would this affect parsing of 00 as two zeros?

The original intent was actually for this to be a parse error! In fact there is a test in GraphQL.js for exactly this case. However the spec makes no mention of this case or of the expected "greedy" behavior of a lexer, so it would be entirely reasonable that if you only read this spec that you might expect 00 to be understood to be the token sequence [0, 0].

This change makes it clear that 00 cannot be parsed in this way since the first 0 is followed by a Digit (the second 0) and therefore not a valid token, and 00 doesn't match any token grammar, since IntValue and FloatValue both restrict leading zeros.
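A sketch of how an implementation might enforce this (hypothetical code; graphql-js's actual lexer is structured differently):

```javascript
// Hypothetical IntValue check combining the leading-zero restriction
// (the integer part is either 0 or starts with a non-zero digit) with
// the greedy lookahead restriction (no trailing digit, letter, or dot).
function lexIntValue(source) {
  const match = /^-?(0|[1-9][0-9]*)/.exec(source);
  if (match === null) throw new SyntaxError("Expected an integer");
  const next = source[match[0].length];
  // For "00" the regex matches only the first "0", and the second "0"
  // then trips this lookahead check, so "00" is a parse error.
  if (next !== undefined && /[_0-9A-Za-z.]/.test(next)) {
    throw new SyntaxError(`Invalid number "${source}"`);
  }
  return match[0];
}
```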

@andimarek
Contributor

great work @leebyron
Some of these comments here are actually worth having in the spec as an explanation imho.

@leebyron leebyron force-pushed the editorial-greedy-lexer branch 3 times, most recently from 100d8fe to 1ebfc40 Compare August 7, 2019 05:33
@leebyron
Collaborator Author

leebyron commented Aug 7, 2019

Updated to mirror the lookahead restrictions between IntValue and FloatValue and include exponent in the restriction. Also includes more prose explaining the effect of the restrictions.

Suggested test cases based on this change (thanks @robzhu for the suggestion to write these):

These test cases will most likely already pass in existing GraphQL implementations; however, they were ambiguous or under-defined in the spec before this change.

  • LineTerminator
    • U+000D U+000D U+000A U+000A should be 3 line terminators, not 4
    • U+000A U+000A U+000D U+000D should be 4 line terminators
  • Comment
    • # abc is one comment ignored token, not a comment followed by the abc token.
    • ## is one comment ignored token, not two.
  • Name
    • abc should be one token abc
    • abc123 should be one token abc123
  • IntValue
    • 123 should be one token 123
    • 00 should be a parse error, not two tokens [0, 0]
    • 01 should be a parse error, not two tokens [0, 1]
    • 123. is a parse error
    • 123e is a parse error, not two tokens [123, e]
    • 123E is a parse error, not two tokens [123, E]
  • FloatValue
    • 1.23 should be one token 1.23
    • 01.23 should be a parse error, not the two tokens [0, 1.23]
    • 1e5 should be one token 1e5, not two tokens [1, e5]
    • 1.2e3 should be one token 1.2e3, not two tokens [1.2, e3]
    • 1.2e3.4 should be a parse error, not three tokens [1.2, e, 3.4]
    • 1.23.4 should be a parse error, not two tokens [1.2, 3.4]
    • 1.2e3e should be a parse error, not two tokens [1.2e3, e]
  • StringValue
    • "" is a single empty string token
    • """ is a parse error due to an unterminated block string, not the token "" followed by an unterminated string
    • """" is a parse error, not two tokens ["", ""]
    • """""" is a single empty block string token
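A few of the numeric cases above can be spot-checked with regular expressions that encode the lookahead restrictions (a hypothetical illustration only; a spec-compliant lexer need not be written this way):

```javascript
// Hypothetical: IntValue and FloatValue with trailing negative
// lookaheads mirroring the restrictions added by this change. A
// FloatValue must have a fractional part, an exponent part, or both.
const INT_RE = /^-?(0|[1-9][0-9]*)(?![_0-9A-Za-z.])/;
const FLOAT_RE =
  /^-?(0|[1-9][0-9]*)(\.[0-9]+([eE][+-]?[0-9]+)?|[eE][+-]?[0-9]+)(?![_0-9A-Za-z.])/;

function classify(input) {
  if (FLOAT_RE.test(input)) return "FloatValue";
  if (INT_RE.test(input)) return "IntValue";
  return "parse error";
}
```

For example, classify("1.2e3") is "FloatValue", while classify("00"), classify("123e"), and classify("1.2e3.4") are all parse errors.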

@andimarek
Contributor

@leebyron @robzhu these test cases are really great!

leebyron added a commit to graphql/graphql-js that referenced this pull request Sep 11, 2019
Adds the test cases described in graphql/graphql-spec#599

Makes the lookahead restriction change necessary for the new tests to pass for numbers, all other tests are already passing.
@leebyron leebyron added 💡 Proposal (RFC 1) RFC Stage 1 (See CONTRIBUTING.md) 🚀 Next Stage? This RFC believes it is ready for the next stage labels Sep 11, 2019
leebyron added a commit to graphql/graphql-js that referenced this pull request Sep 12, 2019
Adds the test cases described in graphql/graphql-spec#599

Makes the lookahead restriction change necessary for the new tests to pass for numbers, all other tests are already passing.
@IvanGoncharov IvanGoncharov added 📄 Draft (RFC 2) RFC Stage 2 (See CONTRIBUTING.md) and removed 🚀 Next Stage? This RFC believes it is ready for the next stage 💡 Proposal (RFC 1) RFC Stage 1 (See CONTRIBUTING.md) labels Sep 13, 2019
Cito added a commit to graphql-python/graphql-core that referenced this pull request Sep 25, 2019
GraphQL syntactical grammars intend to be unambiguous. While lexical grammars should also be - there has historically been an assumption that lexical parsing is greedy. This is obvious for numbers and words, but less obvious for empty block strings.

This also removes regular expression representation from the lexical grammar notation, since it wasn't always clear.

Either way, the additional clarity removes ambiguity from the spec

Partial fix for #564

Specifically addresses #564 (comment)
@leebyron leebyron added 🏁 Accepted (RFC 3) RFC Stage 3 (See CONTRIBUTING.md) and removed 📄 Draft (RFC 2) RFC Stage 2 (See CONTRIBUTING.md) labels Jan 10, 2020
@leebyron
Collaborator Author

@leebyron leebyron merged commit a73cd6f into master Jan 10, 2020
lirsacc added a commit to lirsacc/py-gql that referenced this pull request Jan 28, 2020
Some edge cases around numbers were not handled as expected. This commit adds
test cases from the 2 RFCs clarifying the expected behaviour (
graphql/graphql-spec#601,
graphql/graphql-spec#599) and updates the Lexer to match.

This is technically a breaking change but most cases were likely to lead
to validation errors (e.g. "0xF1" being parsed as [0, xF1] when
expecting a list of integers).
@andimarek
Contributor

andimarek commented Jul 19, 2020

@leebyron Some feedback, even if it comes very late:
Overall great PR, which helps to clarify some edge cases!

I think negative lookaheads are a good way of making some details clear, but I think we could improve some details and give some context for people implementing/maintaining a GraphQL parser to make it easier. Specifically:

Some negative lookaheads are used to make a rule greedy (e.g. Comment). This is the most intuitive usage and doesn't need a lot of explanation beyond what we have right now.

Other lookaheads are more complicated and are more about removing ambiguity (for example, for StringValue) than about being "greedy".

One of the more challenging rules is BlockStringCharacter because it references SourceCharacter (which is a single character, including ") but also \""" which is multiple characters.

In ANTLR (which is the parser generator we use in GraphQL Java) we use the following rule:

StringValue:
'""'  { _input.LA(1) != '"'}? |
'"' StringCharacter+ '"' |
'"""' BlockStringCharacter*? '"""';

BlockStringCharacter:
'\\"""'|
SourceCharacter;

The BlockStringCharacter*? is a special notation which makes BlockStringCharacter* non-greedy, because otherwise it would match everything, since SourceCharacter also matches ". I suspect that other parser generators might have a similar challenge.

Hand written parsers like graphql.js don't really have this challenge as they can easily lookahead for the escaped triple quote.
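For instance, a hand-written scanner can resolve this with a direct lookahead (hypothetical sketch, not graphql-js's actual implementation):

```javascript
// Hypothetical block string body scan: advance until an unescaped
// closing """, treating \""" as part of the body. This sidesteps the
// non-greedy-repetition problem a parser generator runs into.
function readBlockStringBody(source, start) {
  let i = start; // position just past the opening """
  while (i < source.length) {
    if (source.startsWith('\\"""', i)) {
      i += 4; // escaped triple-quote stays in the body
    } else if (source.startsWith('"""', i)) {
      return { body: source.slice(start, i), end: i + 3 };
    } else {
      i += 1;
    }
  }
  throw new SyntaxError("Unterminated block string");
}
```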

@leebyron leebyron added this to the May2021 milestone Apr 6, 2021
@leebyron leebyron deleted the editorial-greedy-lexer branch July 1, 2022 18:00
anthonymedinawork added a commit to anthonymedinawork/py-gql that referenced this pull request Jan 24, 2024
Some edge cases around numbers were not handled as expected. This commit adds
test cases from the 2 RFCs clarifying the expected behaviour (
graphql/graphql-spec#601,
graphql/graphql-spec#599) and updates the Lexer to match.

This is technically a breaking change but most cases were likely to lead
to validation errors (e.g. "0xF1" being parsed as [0, xF1] when
expecting a list of integers).
Labels
🏁 Accepted (RFC 3) RFC Stage 3 (See CONTRIBUTING.md) 🤷‍♀️ Ambiguity An issue/PR which identifies or fixes spec ambiguity
Successfully merging this pull request may close these issues.

Spec and graphql-js disagree on the meaning of a leading zero in numbers
6 participants