Replace the Yacc parser with hand-written recursive descent parser #32
I'm interested in proposing some changes and enhancements to HIL, but each time I've dug in I've found myself grumpy at the yacc-based parser, since I find it hard to extend. It also produces (IMHO) inadequate error messages, confusing users of applications such as Terraform that embed HIL.
I felt inclined to hand-write a replacement in Go, and when I dug into the code I noticed several references to the idea of doing this, so I figured I wasn't the first person to have this feeling.
So here's my attempt. It includes both a hand-written recursive descent parser and a new scanner that is designed to co-operate well with it. For now I kept it compatible with most of the quirks of the old parser, such as the weird handling of negation and the total lack of operator precedence for the binary operators. There are a few minor differences, but I don't expect them to cause real-world compatibility issues:
As noted above, my original motivation is that I'm interested in making some changes to the HIL language. I restrained myself from implementing any such changes here, preferring to just get an "as close as reasonable" port of the previous behavior. Here are some examples of things I've been thinking about; none of them are implemented here, though in some cases the design of the new scanner/parser makes room for them:
Previously HIL used a generated scanner and parser, but the generated parser is finicky and hard to extend. A later commit will replace that parser with a hand-written recursive descent parser, but as a first step the scanner is hand-written so it can be used as input to this forthcoming parser.

To start this scanner produces comparable output to the old scanner but for the following exceptions:

- Multi-byte tokens are just returned verbatim as they were presented in the source input. It's the parser's responsibility to resolve any escape sequences in LITERAL and STRING tokens and to ensure that INT and FLOAT tokens are of the appropriate range and syntax.
- The "column" position is maintained in runes rather than bytes, so that UTF-8 sequences can be counted more properly. (This is still not 100% accurate, since combining forms will count as multiple runes, but closer than before.)
- Identifiers may contain any characters that Unicode considers to be letters or combining marks, so they can now e.g. include latin accent characters or letters in non-Latin scripts.
- Sequences like foo.bar.*.baz are handled as a sequence of separate IDENTIFIER, PERIOD and STAR tokens, which the parser should then stick back together to make a variable name to include in the AST. This means that the parser can detect incorrect sequences like "foo..bar" and raise a nicer error message for them than the scanner would be able to.

The tests from the old scanner are preserved and adapted to the above changes. Some of the test cases are adjusted to exercise the UTF-8 handling and verify the correct handling of some invalid cases.
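To make the tokenization rules above concrete, here is a minimal sketch in Go. It is not the scanner from this PR: the `Token` and `TokenType` types, the `scanVarAccess` function, and the choice to also allow digits and underscores in identifiers are all assumptions made for illustration. It shows rune-based column counting, Unicode-aware identifiers, and splitting a sequence like foo.bar.*.baz into separate tokens.

```go
package main

import (
	"unicode"
	"unicode/utf8"
)

// TokenType and Token are hypothetical stand-ins for illustration,
// not HIL's real scanner types.
type TokenType string

const (
	IDENTIFIER TokenType = "IDENTIFIER"
	PERIOD     TokenType = "PERIOD"
	STAR       TokenType = "STAR"
	INVALID    TokenType = "INVALID"
)

type Token struct {
	Type    TokenType
	Content string // verbatim source text, no escapes resolved
	Column  int    // 1-based, counted in runes rather than bytes
}

// isIdentRune accepts any Unicode letter or combining mark. Digits and
// underscore are also accepted here as an assumption of this sketch.
func isIdentRune(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsMark(r) || unicode.IsDigit(r) || r == '_'
}

// scanVarAccess splits input such as "foo.bar.*.baz" into separate
// IDENTIFIER, PERIOD and STAR tokens, leaving it to the parser to decide
// whether the sequence is valid.
func scanVarAccess(src string) []Token {
	var toks []Token
	col := 1
	for len(src) > 0 {
		r, size := utf8.DecodeRuneInString(src)
		switch {
		case r == '.':
			toks = append(toks, Token{PERIOD, src[:size], col})
			src = src[size:]
			col++
		case r == '*':
			toks = append(toks, Token{STAR, src[:size], col})
			src = src[size:]
			col++
		case isIdentRune(r):
			end, runes := 0, 0
			for end < len(src) {
				r2, s2 := utf8.DecodeRuneInString(src[end:])
				if !isIdentRune(r2) {
					break
				}
				end += s2
				runes++
			}
			toks = append(toks, Token{IDENTIFIER, src[:end], col})
			src = src[end:]
			col += runes // advance by runes, not bytes
		default:
			toks = append(toks, Token{INVALID, src[:size], col})
			src = src[size:]
			col++
		}
	}
	return toks
}
```

With this sketch, `scanVarAccess("foo..bar")` yields IDENTIFIER, PERIOD, PERIOD, IDENTIFIER, so a parser consuming these tokens is in a position to report the doubled period with a specific message.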
When hand-writing recursive descent parsers it's convenient to be able to "peek ahead" by one Token to decide what path to take next. The Peeker type provides that functionality in terms of the raw Token channel returned by the Scan function. Peeker also provides a Close function to gracefully end the scanner's inner goroutine if parsing terminates early, e.g. due to an error.
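As a rough illustration of the idea, a one-token lookahead wrapper over a token channel might look like the sketch below. Only the names `Peeker`, `Token` and `Close` come from the description above; the `Peek` and `Read` method names, the struct fields, the `scanWords` producer, and the approach of draining the channel in `Close` (which assumes the producer closes the channel when it finishes) are assumptions for this example, not necessarily what the PR implements.

```go
package main

// Token is a hypothetical stand-in for the scanner's token type.
type Token struct {
	Type    string
	Content string
}

// Peeker wraps a token channel and offers one token of lookahead.
type Peeker struct {
	ch     <-chan *Token
	peeked *Token
}

func NewPeeker(ch <-chan *Token) *Peeker {
	return &Peeker{ch: ch}
}

// Peek returns the next token without consuming it. It returns nil once
// the channel has been closed and no tokens remain.
func (p *Peeker) Peek() *Token {
	if p.peeked == nil {
		p.peeked = <-p.ch
	}
	return p.peeked
}

// Read consumes and returns the next token.
func (p *Peeker) Read() *Token {
	tok := p.Peek()
	p.peeked = nil
	return tok
}

// Close discards any remaining tokens so that the producing goroutine is
// never left blocked on a send and can run to completion.
func (p *Peeker) Close() {
	p.peeked = nil
	for range p.ch {
	}
}

// scanWords is a hypothetical producer standing in for a Scan function:
// it emits one token per word from its own goroutine, then closes the
// channel.
func scanWords(words ...string) <-chan *Token {
	ch := make(chan *Token)
	go func() {
		defer close(ch)
		for _, w := range words {
			ch <- &Token{Type: "WORD", Content: w}
		}
	}()
	return ch
}
```

A parser would typically call `Peek` to choose which production to follow and `Read` once it has committed, with a deferred `Close` so an early return on error does not leak the scanner goroutine.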
The yacc parser was non-reentrant, generated poor error messages, and was annoying to maintain as the language got more complex. This first change is a mostly-faithful port of the yacc parser, though it produces some different error messages and has some different opinions about what source code positions it reports for different nodes in the AST. This is a hand-written recursive descent parser, so it uses native recursion to process recursive expression structures. For the relatively simple expressions that are used with HIL this should not pose any problem, and it makes the parser control flow easier to follow.
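For readers unfamiliar with the technique, the following is a much-reduced sketch of a recursive descent parser in Go. It is not the parser from this PR: the `parser` type, the `Parse` and `tokenize` functions, and the tiny grammar (integers, four operators, parentheses) are hypothetical. It keeps every binary operator at the same precedence level, in the spirit of the compatibility quirk mentioned in the PR description; grouping left to right is an assumption of this sketch. The output is a fully parenthesized string that makes the resulting tree shape visible.

```go
package main

import "fmt"

// parser is a hypothetical, much-reduced parser over string tokens.
type parser struct {
	toks []string
	pos  int
}

// tokenize splits src into integer, operator and parenthesis tokens.
func tokenize(src string) []string {
	var toks []string
	for i := 0; i < len(src); {
		c := src[i]
		switch {
		case c == ' ':
			i++
		case c >= '0' && c <= '9':
			j := i
			for j < len(src) && src[j] >= '0' && src[j] <= '9' {
				j++
			}
			toks = append(toks, src[i:j])
			i = j
		default:
			toks = append(toks, src[i:i+1])
			i++
		}
	}
	return toks
}

// peek returns the next token, or "" at end of input.
func (p *parser) peek() string {
	if p.pos < len(p.toks) {
		return p.toks[p.pos]
	}
	return ""
}

// next consumes and returns the next token, or "" at end of input.
func (p *parser) next() string {
	t := p.peek()
	if p.pos < len(p.toks) {
		p.pos++
	}
	return t
}

func isOperator(t string) bool {
	switch t {
	case "+", "-", "*", "/":
		return true
	}
	return false
}

// parseExpr handles a chain of binary operators, all at one precedence
// level, grouping them left to right (an assumption of this sketch).
func (p *parser) parseExpr() (string, error) {
	lhs, err := p.parseTerm()
	if err != nil {
		return "", err
	}
	for isOperator(p.peek()) {
		op := p.next()
		rhs, err := p.parseTerm()
		if err != nil {
			return "", err
		}
		lhs = "(" + lhs + " " + op + " " + rhs + ")"
	}
	return lhs, nil
}

// parseTerm handles an integer or a parenthesized expression. Nested
// expressions are handled by calling parseExpr again, i.e. by native
// recursion rather than an explicit stack.
func (p *parser) parseTerm() (string, error) {
	t := p.next()
	switch {
	case t == "(":
		inner, err := p.parseExpr()
		if err != nil {
			return "", err
		}
		if got := p.next(); got != ")" {
			return "", fmt.Errorf("expected \")\" but found %q", got)
		}
		return inner, nil
	case t != "" && t[0] >= '0' && t[0] <= '9':
		return t, nil
	default:
		return "", fmt.Errorf("expected a number or \"(\" but found %q", t)
	}
}

// Parse parses src and returns a fully parenthesized rendering of it.
func Parse(src string) (string, error) {
	p := &parser{toks: tokenize(src)}
	out, err := p.parseExpr()
	if err != nil {
		return "", err
	}
	if p.peek() != "" {
		return "", fmt.Errorf("unexpected trailing token %q", p.peek())
	}
	return out, nil
}
```

Because each grammar rule is an ordinary function, error messages can name exactly what was expected at the point of failure, which is much harder to arrange in a generated parser.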
HIL expressions are often embedded in other files. A sufficiently sophisticated caller can pass in the location where the HIL expression starts relative to the broader file context, thus giving the user better error diagnostics if a problem is encountered during either parsing or evaluation. This is entirely optional and will be ignored if empty.
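One way to picture how a starting location feeds into diagnostics is the sketch below. The `Pos` struct and `posAt` function are hypothetical and are not HIL's API. Treating an empty start position as line 1, column 1 is likewise an assumption of this example. Columns are counted in runes, matching the scanner behavior described earlier.

```go
package main

// Pos is a hypothetical source position type for illustration.
type Pos struct {
	Filename     string
	Line, Column int
}

// posAt returns the position of byte offset off within expr, given that
// expr begins at start in the enclosing file. A zero start is treated as
// line 1, column 1 (an assumption of this sketch).
func posAt(start Pos, expr string, off int) Pos {
	if start.Line == 0 {
		start.Line = 1
	}
	if start.Column == 0 {
		start.Column = 1
	}
	if off < 0 {
		off = 0
	}
	if off > len(expr) {
		off = len(expr)
	}
	pos := start
	// Ranging over a string yields runes, so multi-byte UTF-8 sequences
	// advance the column by one.
	for _, r := range expr[:off] {
		if r == '\n' {
			pos.Line++
			pos.Column = 1
		} else {
			pos.Column++
		}
	}
	return pos
}
```

For example, if an expression begins at line 10, column 20 of a configuration file, a problem four bytes into `foo.bar` would be reported at line 10, column 24 of that file rather than at column 5 of the isolated expression.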
Running this turned up some crasher bugs and some other interesting input that wasn't being handled correctly. Just adding it here in case it's useful in the future; the "gofuzz" build tag will prevent it from being included in regular builds.
Are you open to changes to HIL behavior in 0.8 (such as the ones I listed in my original note above), or is this port alone enough of a back-compat risk for 0.8? (In the short term I'm particularly interested in making indexing be parsed as a distinct operator.)
I'm open to all 4 enhancements you proposed at the bottom (the 4 bullet points). Those are great and things we've talked about.
I take backwards compatibility seriously at this stage of Terraform, so that is important, but everything other than operator precedence that you listed is BC. Operator precedence is important enough that I'm open to it though.
For anyone that finds themselves here wondering what became of the "future enhancements" I talked about in my original summary:
At the time of writing we're working on building out HCL2, which merges HCL and HIL to create a new language that supports configuration structure, embedded expressions, and templates (a generalization of "interpolation" from HIL). As I write this it is still experimental and subject to change, both in its own syntax and in its Go API.