Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ambiguity in mangling grammar around type qualifiers #179

Open
statham-arm opened this issue Mar 19, 2024 · 8 comments
Open

Ambiguity in mangling grammar around type qualifiers #179

statham-arm opened this issue Mar 19, 2024 · 8 comments

Comments

@statham-arm
Copy link

I've been playing around recently with feeding this ABI's mangled-name grammar to an Earley parser, with the aim of using it as a reference decoder for mangled names, in cases where a production demangler is misbehaving or two demanglers disagree on the interpretation of a name.

This exercise caused me to spot a couple of bugs in the grammar. One is already reported here (#120) so I won't go into that one further. The other is a cycle in the grammar, related to type qualifiers. The following productions exist in the grammar:

<type> ::= <qualified-type>
<qualified-type> ::= <qualifiers> <type>

And <qualifiers> can derive the empty string, because it expands to a series of things each of which is optional (any number of extended-qualifiers including 0, then optional r, V and K in that order).

Therefore, a derivation can contain the sequence <type><qualified-type><qualifiers> <type>, and <qualifiers> can be empty, which leads you back to <type> matching exactly the same sequence of tokens.

This is a problem for algorithmic parsing, because there are multiple formal parse trees for the same input, each going round the same pointless cycle a different number of times (although it's a benign ambiguity, since all those parse trees describe the same semantics).

But this cycle in the grammar also causes some unwanted things to be legal, because going round the cycle more than once permits more than one <qualifiers> nonterminal to appear before <type>. For example, my test parser accepts PKVi as a valid type, by going round the cycle twice, with the outer <qualified-type> consuming the K and the inner one consuming the V. The text alongside these grammar productions suggests that that was not intentional, and that PVKi is the only correct description of a pointer to volatile const int.

I think the following redesign eliminates the ambiguity, causing the formal grammar to reflect the intent expressed by the text:

<type> ::= <qualifiers> <unqualified-type>
<unqualified-type> ::= [all the other productions currently listed for <type>]

This structure forces exactly one instance of <qualifiers> to appear in a <type>, eliminating the ambiguity and the unwanted reorderings. (But we keep the property that <qualifiers> can be empty, which allows plain i and so on to still work.)

(However, I haven't checked what this change does to the question of what things are substitution candidates.)

@rjmccall
Copy link
Collaborator

Both the fully-qualified and unqualified types need to introduce substitution candidates, but we shouldn't do it twice for unqualified types. I'm hesitant to do some major refactor of the tree, but if you can find a clean solution, I'm open to taking it.

@statham-arm
Copy link
Author

Hmm, yes, that is awkward to specify in standard CFG notation, because you really want to define <qualifiers> in a way that makes it not allowed to expand to the empty string: you want to say that every single thing in it is optional, but one of them has to be there.

If you defined <qualifiers> so that it had to be non-empty, you could do something like this:

<type> ::= <qualifiers> <subtype>
       ::= <unqualified-type>
<subtype> ::= <unqualified-type>
<unqualified-type> ::= [big list of more specific productions]

and then designate <type> and <subtype> as the nonterminals that are substitution candidates. That way there's always an SC for the overall type, and an extra one for the unqualified type only if at least one qualifier is present.

But in plain CFG notation it's pretty awkward to write <qualifiers> in a way that forces it to be non-empty. The only approach I can think of is to break it up into subcases based on the first non-empty part:

<qualifiers> ::= <extended-qualifier>+ [r] [V] [K]
             ::=                        r  [V] [K]
             ::=                            V  [K]
             ::=                                K

and, really, yuck!

@rjmccall
Copy link
Collaborator

Yeah. Unfortunately, I think this is a situation where creating a formally correct and unambiguous grammar is actually in tension with clarity for human readers, which seems like the more important goal. I'm inclined to just close this, but I can leave it open if you'd like to continue thinking about it for a while.

@statham-arm
Copy link
Author

No argument there – of course I agree that clarity for humans is more important. But it would be nice if we could also have a reference parser that verifiably matches the spec.

Would you be open to a compromise in which the grammar is refactored as in the first snippet above, but with the RHS reference to <qualifiers> simply adding a note that it must be non-empty?

<type> ::= <qualifiers> <subtype> # but only if <qualifiers> is non-empty
       ::= <unqualified-type>
<subtype> ::= <unqualified-type>
<unqualified-type> ::= [big list of more specific productions]

That loses the really ugly part, which is the cumbersome redefinition of <qualifiers> via lots of horrible repetitive productions to force it to be non-empty.

I don't know that you could make an LR parser generator accept a grammar in this unusual representation (even if you wrote a custom one), but I do know that the simpler and more brute-force Earley algorithm would cope with it fine. Perhaps that would still be adequately clear to humans, while also permitting at least one parsing technology to handle the grammar?

@rjmccall
Copy link
Collaborator

Yeah, I think a note saying that <qualifiers> must be non-empty (and maximal?) would be totally reasonable.

@statham-arm
Copy link
Author

I'd prefer just "non-empty", and to handle "maximal" by reorganising the grammar to ensure it can't generate two adjacent <qualifiers>. "Maximal" is a harder property to check for automatically!

@rjmccall
Copy link
Collaborator

Do you have a suggestion of how to do that which doesn't look awful?

@statham-arm
Copy link
Author

I thought the suggestion in my last-but-one comment was reasonably minimal, just rearranging the top-level rule or two for <type>, and annotating the one use of <qualifiers> with a note saying this rule is only valid if it's non-empty. With that, <qualifiers> itself wouldn't have to be modified at all.

(It does mean there's no formal representation of that non-emptiness requirement. But I don't think there'd be any clean way to do that unless you introduced some kind of exciting new syntax for grammar rules, like "lhs ::= rhs1 AND NOT rhs2"!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants