Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How is ambiguity defined? What obligations do processors have in cases of ambiguity? #26

Closed
cmsmcq opened this issue Jan 11, 2022 · 8 comments
Assignees
Milestone

Comments

@cmsmcq
Copy link
Contributor

cmsmcq commented Jan 11, 2022

See https://lists.w3.org/Archives/Public/public-ixml/2022Jan/0030.html and ensuing thread.

In the current spec, two passages are relevant:

If more than one parse results, one is chosen; it is not defined how this choice is made, but the resulting parse must be marked as ambiguous ...

and

If more than one parse tree describes the input, the processor must serialize one of them. It is not defined how this choice is made, but the resulting parse should be marked as ambiguous ...

Several questions arise which appear to need answers:

  • What does "more than one parse results" mean? What does it mean for a "parse tree" to "describe" the input?
  • Is the parse, or parse tree, the XML output specified by the ixml grammar, or some other structure?
  • If some other structure, what structure / what tree?
  • Is a processor required to report ambiguity, or only advised to do so? (MUST or SHOULD?)
  • Is a processor required or advised to search or test for ambiguity and to report accordingly? Or is a processor free to check for ambiguity only in special cases?
@ndw ndw added this to the Version 1.0 milestone Apr 3, 2022
@cmsmcq
Copy link
Contributor Author

cmsmcq commented Apr 12, 2022

The draft of 29 March says, in the section on conformance of processors:

Note: the requirements require that grammars be processed by an algorithm that accepts and parses any context-free grammar; known algorithms of this class include [Earley], [Unger], [CYK], [GLR], and [GLL]; see also [Grune]. Different algorithms may differ slightly in how ambiguity is defined.

I do not believe the last sentence is correct, because when applied to context-free grammars, ambiguity is not at all dependent on the parsing algorithm.

The variation we have encountered is due to variations in ixml processors, not in parsing algorithms. The place to mention it is in the section headed "Parsing", after the paragraph reading:

Processors must accept and parse any conforming grammar, and produce at least one parse of supplied input that matches the grammar starting at the root symbol. If more than one parse results, one is chosen; it is not defined how this choice is made, but the resulting serialization should including the attribute ixml:state="ambiguous" on the document element. The ixml namespace URI is "http://invisiblexml.org/NS".

I would propose to add a note:

Note: owing to variations in the way ixml processors may prepare grammars for parsing, different processors may vary in whether some sentences are flagged as ambiguous or not.

I also note that the main text classes the reporting of ambiguity as a should (which is what I think we agreed on, in the call of 22 March), but the conformance section continues to list it as a must.

@ndw
Copy link
Contributor

ndw commented Apr 12, 2022

From the 12 April meeting, some discussion of where the text on ambiguity should go. "In the section on parsing" is proposed.

@cmsmcq
Copy link
Contributor Author

cmsmcq commented Apr 25, 2022

The minutes of 12 April are a little sketchy on the discussion of this issue and the nature of my concerns. For the record, I will try to record them more precisely here.

The current (2022-04-15) conformance section follows the references to the Earley, Unger, CYK, GLL, and GLR algorithms with the sentence

Different algorithms may differ slightly in how ambiguity is defined.

This suggests incorrectly that the definition of ambiguity is somehow tied up with the algorithm used by a parser, which is false, and that Earley, Unger, CYK, GLL, and GLR do not all use the same definition, which is not even false because they don't rely on or use any definition of ambiguity at all. Conventional treatments of automata theory define ambiguity for BNF grammars, and while the form of words may vary the various formal definitions are all provably equivalent. Our difficulties with ambiguity result from the fact that ixml is an EBNF notation, for which the conventional definitions do not apply directly, and we have imposed a conformance requirement (weakened in some places in the spec to a recommendation) for the identification of ambiguity without having any definition of what we mean by "ambiguity". So the inaccuracy of the current text is exacerbated by its attempt to pin the responsibility for the problem on other people instead of admitting that it's our own responsibility.

@ndw
Copy link
Contributor

ndw commented May 10, 2022

At the 10 May 2022 CG meeting, Steven asserted this was completed. If it isn't, I propose that a new issue be raised (or reopen this one).

@ndw ndw closed this as completed May 10, 2022
@cmsmcq
Copy link
Contributor Author

cmsmcq commented May 17, 2022

Not only the information about variable results in the detection of ambiguity, but also the mention of general parsing algorithms, has moved from the section on processor conformance to the section "Parsing". Personally, I thought the mention of the algorithms worked slightly better where it was, but its location is an editorial issue and I defer to the editor's judgement.

But the statement about the definition of ambiguity continues to be erroneous and misleading. The current text of the Parsing section reads in part:

If more than one parse results, one is chosen; it is not defined how this choice is made, but the resulting serialization should including the attribute ixml:state="ambiguous" on the document element. The ixml namespace URI is "http://invisiblexml.org/NS". Known algorithms that accept and parse any context-free grammar include [Earley], [Unger], [CYK], [GLR], and [GLL]; see also [Grune]; different algorithms may differ slightly in how ambiguity is defined.

The section on processor conformance reads in part:

If more than one parse tree describes the input, the processor must serialize one of them. It is not defined how this choice is made, but the resulting serialization must by default include the attribute ixml:state="ambiguous" on the document element. Processors may provide a user option to suppress that attribute; they may also provide a user option to produce more than one parse tree.

There are problems here:

  1. For "should including" read "should by default include".
  2. It is not the case that the parsing algorithms mentioned differ in how ambiguity is defined. None of them define ambiguity. This is not a matter of nuance. It is not a judgement call. Our spec should not be spreading misinformation about grammars and parsing.
  3. The current wording fails to convey the crucial information, which is that the presence or absence of ixml:state="ambiguous" for a given input grammar and input string may vary among conforming ixml processors.
  4. The main text and the list of conformance requirements should agree on the choice of "must" or "should".

I propose two changes. In the section on parsing, in the text quoted above insert "by default" after "should" (to agree with the section on conformance), change "including" to "include" (to make the sentence grammatical), delete the erroneous statement about parsing algorithms, and insert an explicit statement that different processors may vary in flagging a given sentence as ambiguous, so that the text quoted above is replaced by:

If more than one parse results, one is chosen; it is not defined how this choice is made, but the resulting serialization should by default include the attribute ixml:state="ambiguous" on the document element. D The ixml namespace URI is "http://invisiblexml.org/NS".
Owing to variations in the way ixml processors may prepare grammars for parsing, different processors may vary in whether a given sentence is flagged as ambiguous or not.
Known algorithms that accept and parse any context-free grammar include [Earley], [Unger], [CYK], [GLR], and [GLL]; see also [Grune].

In the passage quoted above from the section on parser conformance, replace the word "must" with the word "should", so the text reads

If more than one parse tree describes the input, the processor must serialize one of them. It is not defined how this choice is made, but the resulting serialization should by default include the attribute ixml:state="ambiguous" on the document element. Processors may provide a user option to suppress that attribute; they may also provide a user option to produce more than one parse tree.

Because the current text does not resolve the problem identified by this issue report, I am reopening the issue.

@cmsmcq cmsmcq reopened this May 17, 2022
@ndw
Copy link
Contributor

ndw commented May 17, 2022

Thank you, Michael. FWIW, your proposed changes look fine to me.

@cmsmcq
Copy link
Contributor Author

cmsmcq commented May 17, 2022

An even simpler solution for the first change (to the section on parsing) is to say less and make the passage quoted read just:

If more than one parse results, one is chosen; it is not defined how this choice is made, but the resulting serialization may include the attribute ixml:state="ambiguous" on the document element. The ixml namespace URI is "http://invisiblexml.org/NS". Known algorithms that accept and parse any context-free grammar include [Earley], [Unger], [CYK], [GLR], and [GLL]; see also [Grune].

An analogous change would then be necessary in the conformance section, to make the passage quoted there read:

If more than one parse tree describes the input, the processor must serialize one of them. It is not defined how this choice is made, but the resulting serialization may include the attribute ixml:state="ambiguous" on the document element. Processors may provide a user option to suppress that attribute; they may also provide a user option to produce more than one parse tree.

@ndw
Copy link
Contributor

ndw commented May 31, 2022

Agreed resolved at the 31 May 2022 CG meeting.

@ndw ndw closed this as completed May 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants