
syntax: support a limited version of ambiguous parsing #686

Open
mvdan opened this issue Mar 17, 2021 · 3 comments

@mvdan
Owner

mvdan commented Mar 17, 2021

POSIX Shell and Bash, as languages, have a few areas of ambiguity which are partially documented at https://github.com/mvdan/sh#caveats. The two main ones are:

  1. Open parentheses. Is $(( the start of $((2 + 3)), or $((cmd1); cmd2)?

  2. Parsing indexes. Is foo=([ the start of foo=([3]=bar), or foo=([^_]*.txt)?

In both of those cases, the parsing for the two variants is pretty different. Our parser currently does not backtrack, nor does it parse multiple "decision paths" at once, so we choose an arbitrary option and error if we made the wrong choice.

Erroring like that is a bad idea: it breaks compatibility with the language, and it confuses users.

There are some other cases of ambiguity which we do handle today, like whether foo is followed by () and is thus a func declaration, or is followed by a space and is thus a simple command. Those we can deal with just fine, because they require peeking just one token, rather than a potentially unlimited number of tokens.
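To illustrate why single-token lookahead is easy, here is a toy sketch; the types and names are hypothetical and much simpler than our real parser, but the shape is the same: after reading a name, one peek is enough to decide which statement kind follows.

```go
package main

import "fmt"

// token and parser are hypothetical stand-ins for illustration only.
type token string

type parser struct {
	toks []token
	pos  int
}

// next consumes and returns the current token.
func (p *parser) next() token {
	t := p.toks[p.pos]
	p.pos++
	return t
}

// peek looks at the upcoming token without consuming it.
func (p *parser) peek() token {
	return p.toks[p.pos]
}

// classify reads a name and uses exactly one token of lookahead to
// tell a function declaration apart from a simple command.
func classify(p *parser) string {
	name := p.next()
	if p.peek() == "()" {
		return fmt.Sprintf("func decl %q", name)
	}
	return fmt.Sprintf("simple command %q", name)
}

func main() {
	fmt.Println(classify(&parser{toks: []token{"foo", "()", "{"}}))
	fmt.Println(classify(&parser{toks: []token{"foo", "bar"}}))
}
```

No state needs to be saved or restored here, which is exactly what breaks down once the ambiguity can span arbitrarily many tokens.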

There are two ways we could add proper support for ambiguity:

  1. When we reach an ambiguous point, we store the parse state, and continue parsing the common path. If we encounter an error before the ambiguity is resolved, we go back to the saved point, and try with the alternative path. If we finish the ambiguity with no error, we drop the stored parse state. This is the backtracking that Bash does, I believe.

  2. When we reach an ambiguous point, we clone the parse state into a second copy, and use the two states to parse the input until the ambiguity is resolved. If the common one succeeds, we drop the other state and continue as normal. Otherwise, if the alternative state succeeds, we continue with that one.
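Option 2 can be sketched as follows; everything here is a toy (hypothetical types, drastically simplified "grammars"), but it shows the key property: both interpretations advance in lockstep over each input byte, so no buffering or seeking is needed, and a dead branch is simply dropped.

```go
package main

import "fmt"

// interp is a hypothetical candidate interpretation of the input.
type interp struct {
	name string
	ok   func(byte) bool // does this byte fit the interpretation?
	dead bool
}

// disambiguate feeds the bytes after an ambiguous "$((" prefix to
// both interpretations at once, dropping a branch as soon as it
// fails. The input is consumed exactly once, with no rewinding.
func disambiguate(input string) string {
	arith := &interp{name: "arithmetic $((...))", ok: func(b byte) bool {
		// Arithmetic bodies allow digits, spaces, operators, parens.
		return b >= '0' && b <= '9' ||
			b == ' ' || b == '+' || b == '-' || b == '*' ||
			b == '(' || b == ')'
	}}
	subsh := &interp{name: "subshell $((...); ...)", ok: func(b byte) bool {
		return true // subshells accept nearly anything until ')'
	}}
	for i := 0; i < len(input); i++ {
		for _, in := range []*interp{arith, subsh} {
			if !in.dead && !in.ok(input[i]) {
				in.dead = true // this branch can no longer succeed
			}
		}
	}
	// Prefer the common interpretation if it survived.
	if !arith.dead {
		return arith.name
	}
	return subsh.name
}

func main() {
	fmt.Println(disambiguate("2 + 3))"))
	fmt.Println(disambiguate("cmd1); cmd2)"))
}
```

In the real parser each "interpretation" would be a full clone of the parse state (position, next token, and so on) rather than a byte predicate, but the lockstep structure is the same.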

The problem with option 1 is that we could potentially hold onto a very big buffer in memory, if the ambiguity spans megabytes or gigabytes of input. Setting a limit there could be reasonable, but any limit is going to be too small for some people and too large for others. If it's small, adding a comment inside an ambiguous node could result in a parse error. If it's big, large ambiguous nodes could make the memory usage grow unexpectedly.

I talked about those problems in #350, where I also mentioned an alternative using io.Seek. Ultimately, I decided against both of those implementations.

I don't think option 1 can be implemented with our stream-based parser. Let's now consider option 2.

The upside is that we don't need buffering or seeking. The downside is the added CPU cost: cloning of the state (position, next token, etc) plus tokenizing and parsing the ambiguous bytes twice. I reckon that's reasonable, though; resolving ambiguity will always have a cost.

We can also limit the added CPU cost to a maximum of 2x by not supporting nested ambiguity, such as:

$(( # first fork; 2x the cost
$(( # second fork; 4x the cost!
$(( # third fork; 8x the cost...!

In those cases, we could just take the common path for the nested ambiguities, or we could error with a message like "reached the maximum level of nested ambiguity". Or a mix of both, like trying the common case, and if that fails, adding text to the end of the error like "refusing to disambiguate because we reached the maximum level of ambiguity".

I think this should also be configurable, for example with a maximum level of ambiguity to support. The default could be 1, limiting the CPU cost to 2x. One could set it to 0 to go back to the old behavior of never supporting ambiguity, or go higher to 2 or even 3 if the improved compatibility is more important than the exponentially increasing CPU cost.
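A minimal sketch of that configuration knob, assuming a hypothetical Parser type and fork helper (our real parser's fields will differ):

```go
package main

import (
	"errors"
	"fmt"
)

// Parser is a hypothetical stand-in; MaxAmbiguity caps how many
// nested ambiguous points we are willing to fork on, since each
// level doubles the worst-case CPU cost.
type Parser struct {
	MaxAmbiguity int // 0 = old behavior: never fork on ambiguity
	depth        int // current nesting level of live forks
}

var errTooAmbiguous = errors.New("reached the maximum level of nested ambiguity")

// fork runs fn with one extra level of ambiguity, refusing to fork
// once the configured limit has been reached.
func (p *Parser) fork(fn func() error) error {
	if p.depth >= p.MaxAmbiguity {
		return errTooAmbiguous
	}
	p.depth++
	defer func() { p.depth-- }()
	return fn()
}

func main() {
	p := &Parser{MaxAmbiguity: 1}
	// One level of ambiguity is within the default limit.
	fmt.Println("first fork:", p.fork(func() error { return nil }))
	// A nested fork exceeds MaxAmbiguity = 1 and is refused.
	fmt.Println("nested fork:", p.fork(func() error {
		return p.fork(func() error { return nil })
	}))
}
```

With MaxAmbiguity set to 0 the parser would behave like today, and raising it to 2 or 3 trades exponentially more CPU for better compatibility, as described above.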

This will be a big refactor, but one that I think is necessary if we want the parser to be fully compatible with the language spec. It's hard to estimate how long it would take to implement, but it's safe to say it would be at least a week's worth of work.

@mvdan
Owner Author

mvdan commented Mar 17, 2021

There is also an option 3 which I didn't mention: peek up to N tokens to disambiguate. In practice, however, this has many of the same problems as option 1, because we still have to choose an arbitrary limit. Plus, I think we would have to fully parse the peeked tokens to disambiguate, so it's not cheaper than either of the other two options.

@kaey
Contributor

kaey commented Mar 17, 2021

As related work, https://tree-sitter.github.io/tree-sitter/ is an example of a branching parser: it creates branches for ambiguous syntax and later discards the wrong ones.

@mvdan
Owner Author

mvdan commented Mar 17, 2021

I see; I think this general idea is https://en.wikipedia.org/wiki/GLR_parser.

Granted, we're using a recursive descent parser, which is less powerful than something like an LR parser that could be upgraded to GLR. But I still think the "branching" would be possible in our Go implementation. A naive approach would be to add one goroutine per extra branch, since we use the call stack to store part of the state.
