
syntax: support a limited version of ambiguous parsing #686

Open
mvdan opened this issue Mar 17, 2021 · 3 comments

@mvdan
Owner

mvdan commented Mar 17, 2021

POSIX Shell and Bash, as languages, have a few areas of ambiguity which are partially documented at https://github.com/mvdan/sh#caveats. The two main ones are:

  1. Open parentheses. Is $(( the start of $((2 + 3)), or $((cmd1); cmd2)?

  2. Parsing indexes. Is foo=([ the start of foo=([3]=bar), or foo=([^_]*.txt)?

In both of those cases, the parsing for the two variants is pretty different. Our parser currently does not backtrack, nor does it parse multiple "decision paths" at once, so we choose an arbitrary option and error if we made the wrong choice.

Erroring like that is a bad idea: it breaks compatibility with the language, and it confuses users.

There are some other cases of ambiguity which we do handle today, like whether foo is followed by () and is thus a func declaration, or is followed by a space and is thus a simple command. Those we can deal with just fine, because they require peeking just one token, rather than a potentially unlimited number of tokens.
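To illustrate why single-token lookahead is easy, here is a toy sketch; the types and names are hypothetical and much simpler than our real parser, but the shape is the same: after reading a name, one peek is enough to decide which statement kind follows.

```go
package main

import "fmt"

// token and parser are hypothetical stand-ins for illustration only.
type token string

type parser struct {
	toks []token
	pos  int
}

// next consumes and returns the current token.
func (p *parser) next() token {
	t := p.toks[p.pos]
	p.pos++
	return t
}

// peek looks at the upcoming token without consuming it.
func (p *parser) peek() token {
	return p.toks[p.pos]
}

// classify reads a name and uses exactly one token of lookahead to
// tell a function declaration apart from a simple command.
func classify(p *parser) string {
	name := p.next()
	if p.peek() == "()" {
		return fmt.Sprintf("func decl %q", name)
	}
	return fmt.Sprintf("simple command %q", name)
}

func main() {
	fmt.Println(classify(&parser{toks: []token{"foo", "()", "{"}}))
	fmt.Println(classify(&parser{toks: []token{"foo", "bar"}}))
}
```

No state needs to be saved or restored here, which is exactly what breaks down once the ambiguity can span arbitrarily many tokens.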

There are two ways we could add proper support for ambiguity:

  1. When we reach an ambiguous point, we store the parse state, and continue parsing the common path. If we encounter an error before the ambiguity is resolved, we go back to the saved point, and try with the alternative path. If we finish the ambiguity with no error, we drop the stored parse state. This is the backtracking that Bash does, I believe.

  2. When we reach an ambiguous point, we clone the parse state into a second copy, and use the two states to parse the input until the ambiguity is resolved. If the common one succeeds, we drop the other state and continue as normal. Otherwise, if the alternative state succeeds, we continue with that one.
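Option 2 can be sketched as follows; everything here is a toy (hypothetical types, drastically simplified "grammars"), but it shows the key property: both interpretations advance in lockstep over each input byte, so no buffering or seeking is needed, and a dead branch is simply dropped.

```go
package main

import "fmt"

// interp is a hypothetical candidate interpretation of the input.
type interp struct {
	name string
	ok   func(byte) bool // does this byte fit the interpretation?
	dead bool
}

// disambiguate feeds the bytes after an ambiguous "$((" prefix to
// both interpretations at once, dropping a branch as soon as it
// fails. The input is consumed exactly once, with no rewinding.
func disambiguate(input string) string {
	arith := &interp{name: "arithmetic $((...))", ok: func(b byte) bool {
		// Arithmetic bodies allow digits, spaces, operators, parens.
		return b >= '0' && b <= '9' ||
			b == ' ' || b == '+' || b == '-' || b == '*' ||
			b == '(' || b == ')'
	}}
	subsh := &interp{name: "subshell $((...); ...)", ok: func(b byte) bool {
		return true // subshells accept nearly anything until ')'
	}}
	for i := 0; i < len(input); i++ {
		for _, in := range []*interp{arith, subsh} {
			if !in.dead && !in.ok(input[i]) {
				in.dead = true // this branch can no longer succeed
			}
		}
	}
	// Prefer the common interpretation if it survived.
	if !arith.dead {
		return arith.name
	}
	return subsh.name
}

func main() {
	fmt.Println(disambiguate("2 + 3))"))
	fmt.Println(disambiguate("cmd1); cmd2)"))
}
```

In the real parser each "interpretation" would be a full clone of the parse state (position, next token, and so on) rather than a byte predicate, but the lockstep structure is the same.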

The problem with option 1 is that we could potentially hold onto a very big buffer in memory, if the ambiguity spans megabytes or gigabytes of input. Setting a limit there could be reasonable, but any limit is going to be too small for some people and too large for others. If it's small, adding a comment inside an ambiguous node could result in a parse error. If it's big, large ambiguous nodes could make the memory usage grow unexpectedly.

I talked about those problems in #350, where I also mentioned an alternative using io.Seek. Ultimately, I decided against both of those implementations.

I don't think option 1 can be implemented with our stream-based parser. Let's now consider option 2.

The upside is that we don't need buffering or seeking. The downside is the added CPU cost: cloning of the state (position, next token, etc) plus tokenizing and parsing the ambiguous bytes twice. I reckon that's reasonable, though; resolving ambiguity will always have a cost.

We can also limit the added CPU cost to a maximum of 2x by not supporting nested ambiguity, such as:

$(( # first fork; 2x the cost
$(( # second fork; 4x the cost!
$(( # third fork; 8x the cost...!

In those cases, we could just take the common path for the nested ambiguities, or we could error with a message like "reached the maximum level of nested ambiguity". Or a mix of both, like trying the common case, and if that fails, adding text to the end of the error like "refusing to disambiguate because we reached the maximum level of ambiguity".

I think this should also be configurable, for example with a maximum level of ambiguity to support. The default could be 1, limiting the CPU cost to 2x. One could set it to 0 to go back to the old behavior of never supporting ambiguity, or go higher to 2 or even 3 if the improved compatibility is more important than the exponentially increasing CPU cost.
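A minimal sketch of that configuration knob, assuming a hypothetical Parser type and fork helper (our real parser's fields will differ):

```go
package main

import (
	"errors"
	"fmt"
)

// Parser is a hypothetical stand-in; MaxAmbiguity caps how many
// nested ambiguous points we are willing to fork on, since each
// level doubles the worst-case CPU cost.
type Parser struct {
	MaxAmbiguity int // 0 = old behavior: never fork on ambiguity
	depth        int // current nesting level of live forks
}

var errTooAmbiguous = errors.New("reached the maximum level of nested ambiguity")

// fork runs fn with one extra level of ambiguity, refusing to fork
// once the configured limit has been reached.
func (p *Parser) fork(fn func() error) error {
	if p.depth >= p.MaxAmbiguity {
		return errTooAmbiguous
	}
	p.depth++
	defer func() { p.depth-- }()
	return fn()
}

func main() {
	p := &Parser{MaxAmbiguity: 1}
	// One level of ambiguity is within the default limit.
	fmt.Println("first fork:", p.fork(func() error { return nil }))
	// A nested fork exceeds MaxAmbiguity = 1 and is refused.
	fmt.Println("nested fork:", p.fork(func() error {
		return p.fork(func() error { return nil })
	}))
}
```

With MaxAmbiguity set to 0 the parser would behave like today, and raising it to 2 or 3 trades exponentially more CPU for better compatibility, as described above.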

This will be a big refactor, but one that I think is necessary if we want the parser to be fully compatible with the language spec. It's hard to estimate how long it would take to implement, but it's safe to say it would be at least a week's worth of work.

@mvdan
Owner Author

mvdan commented Mar 17, 2021

There is also an option 3 which I didn't mention: peek up to N tokens to disambiguate. In practice, however, this has many of the same problems as option 1, because we still have to choose an arbitrary limit. Plus, I think we would have to fully parse the peeked tokens to disambiguate, so it's not cheaper than either of the other two options.

@kaey
Contributor

kaey commented Mar 17, 2021

As related work, https://tree-sitter.github.io/tree-sitter/ is an example of a branching parser: it creates branches for ambiguous syntax and later discards the wrong ones.

@mvdan
Owner Author

mvdan commented Mar 17, 2021

I see; I think this general idea is https://en.wikipedia.org/wiki/GLR_parser.

Granted, we're using a recursive descent parser, which is less powerful than something like an LR parser that could be upgraded to GLR. But I still think the "branching" would be possible in our Go implementation. A naive approach would be to add one goroutine per extra branch, since we use the call stack to store part of the state.
