Replace the Yacc parser with hand-written recursive descent parser #32
Conversation
Previously HIL used a generated scanner and parser, but the generated parser is finicky and hard to extend. A later commit will replace that parser with a hand-written recursive descent parser, but as a first step the scanner is hand-written so it can be used as input to this forthcoming parser.

To start, this scanner produces comparable output to the old scanner but for the following exceptions:

- Multi-byte tokens are just returned verbatim as they were presented in the source input. It's the parser's responsibility to resolve any escape sequences in LITERAL and STRING tokens and to ensure that INT and FLOAT tokens are of the appropriate range and syntax.
- The "column" position is maintained in runes rather than bytes, so that UTF-8 sequences can be counted more properly. (This is still not 100% accurate, since combining forms will count as multiple runes, but closer than before.)
- Identifiers may contain any characters that Unicode considers to be letters or combining marks, so they can now e.g. include latin accent characters or letters in non-Latin scripts.
- Sequences like foo.bar.*.baz are handled as a sequence of separate IDENTIFIER, PERIOD and STAR tokens, which the parser should then stick back together to make a variable name to include in the AST. This means that the parser can detect incorrect sequences like "foo..bar" and raise a nicer error message for them than the scanner would be able to.

The tests from the old scanner are preserved and adapted to the above changes. Some of the test cases are adjusted to exercise the UTF-8 handling and verify the correct handling of some invalid cases.
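The Unicode-aware identifier rule described above can be sketched with Go's `unicode` package. This is a hypothetical helper, not the scanner's actual code; the real rule may differ in detail (e.g. which punctuation is permitted inside names):

```go
package main

import (
	"fmt"
	"unicode"
)

// isIdentRune sketches the identifier rule: Unicode letters and
// underscores may appear anywhere, while digits and combining marks
// (category Mn) are allowed only after the first rune.
// (Hypothetical helper for illustration only.)
func isIdentRune(r rune, first bool) bool {
	if unicode.IsLetter(r) || r == '_' {
		return true
	}
	if first {
		return false
	}
	return unicode.IsDigit(r) || unicode.Is(unicode.Mn, r)
}

func main() {
	fmt.Println(isIdentRune('é', true))  // true: accented letters are letters
	fmt.Println(isIdentRune('9', true))  // false: identifiers can't start with a digit
	fmt.Println(isIdentRune('9', false)) // true: digits allowed after the first rune
}
```

Because the check is rune-based rather than byte-based, it naturally lines up with the rune-counted column positions mentioned above.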
This will be useful later, when generating error messages in the parser.
When hand-writing recursive descent parsers it's convenient to be able to "peek ahead" by one Token to decide what path to take next. The Peeker type provides that functionality in terms of the raw Token channel returned by the Scan function. Peeker also provides a Close function to gracefully end the scanner's inner goroutine if parsing terminates early, e.g. due to an error.
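The idea can be sketched as a small wrapper over the token channel. The `Token` type and method names here are simplified stand-ins, not HIL's actual API; the real token also carries a type and source position, and the real `Close` must coordinate with the scanner goroutine:

```go
package main

import "fmt"

// Token is a simplified stand-in for the scanner's token type.
type Token struct {
	Content string
}

// Peeker provides one-token lookahead over a token channel,
// a sketch of the mechanism described above.
type Peeker struct {
	ch     <-chan Token
	peeked *Token
}

func NewPeeker(ch <-chan Token) *Peeker {
	return &Peeker{ch: ch}
}

// Peek returns the next token without consuming it.
func (p *Peeker) Peek() Token {
	if p.peeked == nil {
		tok := <-p.ch
		p.peeked = &tok
	}
	return *p.peeked
}

// Read consumes and returns the next token.
func (p *Peeker) Read() Token {
	tok := p.Peek()
	p.peeked = nil
	return tok
}

// Close drains the channel so the producing goroutine can finish
// sending and exit, even if parsing stopped early.
func (p *Peeker) Close() {
	for range p.ch {
	}
}

func main() {
	ch := make(chan Token, 3)
	ch <- Token{"foo"}
	ch <- Token{"."}
	ch <- Token{"bar"}
	close(ch)

	p := NewPeeker(ch)
	fmt.Println(p.Peek().Content) // foo (not consumed)
	fmt.Println(p.Read().Content) // foo
	fmt.Println(p.Read().Content) // .
	p.Close()
}
```

A parser built on this can branch on `Peek()` without committing, then call `Read()` once it knows which production it is in.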
The yacc parser was non-reentrant, generated poor error messages, and was annoying to maintain as the language gets more complex. This first change is a mostly-faithful port of the yacc parser, though it produces some different error messages and has some different opinions about what source code positions it reports for different nodes in the AST. This is a hand-written recursive descent parser, so it uses native recursion to process recursive expression structures. For the relatively simple expressions that are used with HIL this should not pose any problem, and it makes the parser control flow easier to follow.
HIL expressions are often embedded in other files. A sufficiently-sophisticated caller can pass in the location where the HIL expression starts relative to the broader file context, thus giving the user better error diagnostics if a problem is encountered during either parsing or evaluation. This is entirely optional and will be ignored if empty.
Running this turned up some crasher bugs and some other interesting input that wasn't being handled correctly. Just adding it here in case it's useful in the future; the "gofuzz" build tag will prevent it from being included in regular builds.
LGTM, merging for TF 0.8.
Thanks, @mitchellh! Are you open to changes to HIL behavior in 0.8 (such as the ones I listed in my original note above) or is just this port alone enough of a back-compat risk for 0.8? (In the short term I'm particularly interested in making indexing be parsed as a distinct operator.)
I'm open to all 4 enhancements you proposed at the bottom (the 4 bullet points). Those are great and things we've talked about. I take backwards compatibility seriously at this stage of Terraform, so that is important, but everything other than operator precedence that you listed is backwards-compatible. Operator precedence is important enough that I'm open to it though.
Also, I found some regressions. I'll CC you on those.
For anyone that finds themselves here wondering what became of the "future enhancements" I talked about in my original summary: At the time of writing we're working on building out HCL2, which merges HCL and HIL to create a new language that supports configuration structure, embedded expressions, and templates (a generalization of "interpolation" from HIL). As I write this it is still experimental and subject to change, both in its own syntax and in its Go API.
I'm interested in proposing some changes and enhancements to HIL, but each time I've dug in I've found myself grumpy at the yacc-based parser, since I find it hard to extend. It also produces (IMHO) inadequate error messages, confusing users of applications such as Terraform that embed HIL.
I felt inclined to hand-write a replacement in Go, and when I dug into the code I noticed several references to the idea of doing this, so I figured I wasn't the first person to have this feeling.
So here's my attempt. It includes both a hand-written recursive descent parser and a new scanner that is designed to co-operate well with it. For now I kept it compatible with most of the quirks of the old parser, such as the weird handling of negation and the total lack of operator precedence for the binary operators. There are a few minor differences, but I don't expect them to cause real-world compatibility issues:
- I extended the definition of an identifier to allow combining diacritical marks, where previously it allowed only unicode letters and numbers. This is mostly just a "for completeness" thing.
- The new scanner disagrees with the old scanner about where exactly it counts the start positions for some token types when populating the `ast.Pos` values. For example:
  - In a sequence like `"foo"`, the opening and closing quotes are themselves tokens, and so the token representing `foo` has a `Column` value that is one greater than what the old scanner would've generated.
  - The index operator counts its position as being the location of the opening `[` rather than the position of the variable being indexed. I think this makes more sense because it is the bracket that represents the index operator, not the variable it's applied to.
- Whereas the old scanner would eat up sequences like `foo.bar.*.baz` into a single identifier token, the new scanner is more conservative and produces separate tokens for the periods and star. The parser then sticks these back together to produce the full variable name needed for the AST. In general I favored making the scanner just do basic matching and having the parser do the "smart stuff", since it has more context to work with and can in theory produce better diagnostics. `foo . bar` would produce the same tokens as `foo.bar`, though I certainly wouldn't encourage such usage.

As noted above, my original motivation is that I'm interested in making some changes to the HIL language. I restrained myself from implementing any such changes here, preferring to just get an "as close as reasonable" port of the previous behavior, but here are some examples of things I've been thinking about that are not implemented here, though in some cases the design of the new scanner/parser is making room for them:
- Tighter integration with HCL so that, for example, HIL can emit errors with source positions relative to the entire HCL file an expression is embedded in, rather than just from the start of the string. Right now when a HIL syntax error is detected in Terraform it can be hard to find it in a large configuration. Lots of Terraform issues trace back to this, but the most recent one I've seen is "Unhelpful error message when unknown variable accessed" terraform#7189.
- Indexing as a proper operator rather than just as an extension of the variable syntax. That is, to allow expressions such as `split(" ", var.something)[0]`, where today indexing can be applied only directly to variables.
- Boolean operators (logical AND, OR, NOT, equality tests, ternary operator, etc.) to replace hacks people do with lookup tables in Terraform, and eventually to allow for the ideas discussed in "Support use cases with conditional logic" terraform#1604.
- A more intuitive operator precedence for the existing arithmetic operators. This one is likely to have some backward-compatibility concerns for people who have been depending on the existing "leftmost-first" parsing scheme, but as discussed in "Inconsistent operators precedence" terraform#9169 it would be much better UX to make this follow conventional precedence rules.