Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace the Yacc parser with hand-written recursive descent parser #32

Merged
merged 6 commits into from
Nov 13, 2016
Merged

Replace the Yacc parser with hand-written recursive descent parser #32

merged 6 commits into from
Nov 13, 2016

Conversation

apparentlymart
Copy link
Contributor

@apparentlymart apparentlymart commented Nov 7, 2016

I'm interested in proposing some changes and enhancements to HIL, but each time I've dug in I've found myself grumpy at the yacc-based parser, since I find it hard to extend. It also produces (IMHO) inadequate error messages, confusing users of applications such as Terraform that embed HIL.

I felt inclined to hand-write a replacement in Go, and when I dug into the code I noticed several references to the idea of doing this, so I figured I wasn't the first person to have this feeling.

So here's my attempt. It includes both a hand-written recursive descent parser and a new scanner that is designed to co-operate well with it. For now I kept it compatible with most of the quirks of the old parser, such as the weird handling of negation and the total lack of operator precedence for the binary operators. There are a few minor differences, but I don't expect them to cause real-world compatibility issues:

  • I extended the definition of an identifier to allow combining diacritical marks, where previously it allowed only unicode letters and numbers. This is mostly just a "for completeness" thing.

  • The new scanner disagrees with the old scanner about where exactly it counts the start positions for some token types when populating the ast.Pos values. For example:

    • In a sequence like "foo" the opening and closing quotes are themselves tokens, and so the token representing foo has a Column value that is one greater than what the old scanner would've generated.

    • The index operator counts its position as being the location of the opening [ rather than the position of the variable being indexed. I think this makes more sense because it is the bracket that represents the index operator, not the variable it's applied to.

  • Whereas the old scanner would eat up sequences like foo.bar.*.baz into a single identifier token, the new scanner is more conservative and produces separate tokens for the periods and star. The parser then sticks these back together to produce the full variable name needed for the AST. In general I favored making the scanner just do basic matching and having the parser do the "smart stuff", since it has more context to work with and can in theory produce better diagnostics.

    • This also implies that spaces between the punctuation elements are now tolerated, so e.g. foo . bar would produce the same tokens as foo.bar, though I certainly wouldn't encourage such usage.

As noted above, my original motivation is that I'm interested in making some changes to the HIL language. I restrained myself from implementing any such changes here, preferring to just get an "as close as reasonable" port of the previous behavior, but here are some examples of things I've been thinking about that are not implemented here, but in some cases the design of the new scanner/parser is making room for them:

  • Tighter integration with HCL so that, for example, HIL can emit errors with source positions relative to the entire HCL file an expression is embedded in, rather than just from the start of the string. Right now when a HIL syntax error is detected in Terraform it can be hard to find it in a large configuration. Lots of Terraform issues trace back to this but the most recent one I've seen is Unhelpful error message when unknown variable accessed terraform#7189 .

  • Indexing as a proper operator rather than just as an extension of the variable syntax. That is, to allow expressions such as split(" ", var.something)[0] where today indexing can be applied only directly to variables.

  • Boolean operators (logical AND, OR, NOT, equality tests, ternary operator, etc) to replace hacks people do with lookup tables in Terraform, and eventually to allow for the ideas discussed in Support use cases with conditional logic terraform#1604.

  • A more intuitive operator precedence for the existing arithmetic operators. This one is likely to have some backward-compatibility concerns for people who have been depending on the existing "leftmost-first" parsing scheme, but as discussed in Inconsistent operators precedence terraform#9169 it would be much better UX to make this follow conventional precedence rules.

Previously HIL used a generated scanner and parser, but the generated
parser is finicky and hard to extend. A later commit will replace that
parser with a hand-written recursive descent parser, but as a first step
the scanner is hand-written so it can be used as input to this forthcoming
parser.

To start this scanner produces comparable output to the old scanner but
for the following exceptions:

- Multi-byte tokens are just returned verbatim as they were presented in
  the source input. It's the parser's responsibility to resolve any
  escape sequences in LITERAL and STRING tokens and to ensure that
  INT and FLOAT tokens are of the appropriate range and syntax.

- The "column" position is maintained in runes rather than bytes, so that
  UTF-8 sequences can be counted more properly. (This is still not 100%
  accurate, since combining forms will count as multiple runes, but
  closer than before.)

- Identifiers may contain any characters that Unicode considers to be
  letters or combining marks, so they can now e.g. include latin accent
  characters or letters in non-Latin scripts.

- Sequences like foo.bar.*.baz are handled as a sequence of separate
  IDENTIFIER, PERIOD and STAR tokens, which the parser should then stick
  back together to make a variable name to include in the AST. This means
  that the parser can detect incorrect sequences like "foo..bar" and
  raise a nicer error message for them than the scanner would be able to.

The tests from the old scanner are preserved and adapted to the above
changes. Some of the test cases are adjusted to exercise the UTF-8
handling and verify the correct handling of some invalid cases.
This will be useful later, when generating error messages in the parser.
When hand-writing recursive descent parsers it's convenient to be able
to "peek ahead" by one Token to decide what path to take next. The Peeker
type provides that functionality in terms of the raw Token channel
returned by the Scan function.

Peeker also provides a Close function to gracefully end the scanner's
inner goroutine if parsing terminates early, e.g. due to an error.
The yacc parser was non-reentrant, generated poor error messages, and
is annoying to maintain as the language gets more complex.

This first change is a mostly-faithful port of the yacc parser, though
it produces some different error messages and has some different opinions
about what source code positions it reports for different nodes in the
AST.

This is a hand-written recursive descent parser, so it uses native
recursion to process recursive expression structures. For the relatively-
simple expressions that are used with HIL this should not pose any
problem and it makes the parser control flow easier to follow.
HIL expressions are often embedded in other files. A sufficiently-
sophisticated caller can pass ine the location where the HIL expression
starts relative to the broader file context, thus giving the user better
error diagnostics if a problem is encountered during either parsing or
evaluation.

This is entirely optional and will be ignored if empty.
Running this turned up some crasher bugs and some other interesting input
that wasn't being handled correctly. Just adding it here in case it's
useful in the future; the "gofuzz" build tag will prevent it from being
included in regular builds.
@mitchellh
Copy link
Contributor

LGTM, merging for TF 0.8.

@mitchellh mitchellh merged commit bd8a635 into hashicorp:master Nov 13, 2016
@apparentlymart
Copy link
Contributor Author

Thanks, @mitchellh!

Are you open to changes to HIL behavior in 0.8 (such as the ones I listed in my original note above) or is just this port alone a back-compat risk enough for 0.8? (In the short term I'm particularly interested in making indexing be parsed as a distinct operator.)

@mitchellh
Copy link
Contributor

I'm open to all 4 enhancements you proposed at the bottom (the 4 bullet point). Those are great and things we've talked about.

I take backwards compatibility seriously at this stage of Terraform, so that is important, but everything other than operator precedence that you listed is BC. Operator precedence is important enough that I'm open to it though.

@mitchellh
Copy link
Contributor

Also, I found some regressions. I'll CC you on those.

@apparentlymart
Copy link
Contributor Author

For anyone that finds themselves here wondering what became of the "future enhancements" I talked about in my original summary:

At the time of writing we're working on building out HCL2, which merges HCL and HIL to create a new language that supports configuration structure, embedded expressions, and templates (a generalization of "interpolation" from HIL). As I write this it is still experimental and subject to change, both in its own syntax and in its Go API.

@apparentlymart apparentlymart deleted the go-parser branch July 14, 2018 07:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants