https://github.com/invisibleXML/ixml/blob/master/samples/URI/rfc-3987.ixml #139

Open
spemberton opened this issue Aug 12, 2022 · 11 comments

Comments

@spemberton
Member

spemberton commented Aug 12, 2022

Although this is straight out of the RFC, it is not good enough for proper use.
HEXDIG should include "a"-"f"
ipchar, iunreserved and ucschar should have a "-" before the rule.
The grammar is ambiguous, but that needs work to investigate (on it).

@spemberton
Member Author

Also unused rules:
**** Unused rules: {"CR"; "DQUOTE"; "IRI-reference"; "LF"; "SP"; "absolute-IRI"; "ipath"; "reserved"}

@ndw
Contributor

ndw commented Aug 12, 2022

Regarding "a"-"f", the ABNF doesn't include the lowercase versions, but the relevant part of RFC 2234 is apparently:

   NOTE:     ABNF strings are case-insensitive and
             the character set for these strings is us-ascii.

So all quoted strings in the ABNF form have to be changed to support mixed case.
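
As a rough illustration of what that rewrite involves, the Python sketch below turns one ABNF quoted string into an ixml expression that accepts either case of each ASCII letter. The function name `ci_literal` is hypothetical, and the emitted form (a two-character set such as `["Aa"]` per letter, joined by commas) is only one assumed way to spell this in ixml; it is not taken from the sample grammar.

```python
def ci_literal(text):
    """Render an ABNF quoted string as a case-insensitive ixml sequence.

    Assumption: each ASCII letter becomes a character set holding both
    cases; every other character stays an ordinary quoted literal.
    """
    if not text:
        raise ValueError("an empty string has no ixml literal form")
    parts = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            # e.g. "F" -> ["Ff"], which matches "F" or "f"
            parts.append(f'["{ch.upper()}{ch.lower()}"]')
        else:
            # ixml escapes a double quote inside a string by doubling it
            escaped = ch.replace('"', '""')
            parts.append(f'"{escaped}"')
    return ", ".join(parts)


if __name__ == "__main__":
    for abnf_string in ["A", "F", "v"]:
        print(abnf_string, "->", ci_literal(abnf_string))
```

Applied to the six letters listed for HEXDIG, this would yield one two-character set per letter, which could then be merged by hand into a single set.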

@spemberton
Member Author

Removing unused rules, more rules become unused:
**** Unused rules: {"IRI-reference"; "gen-delims"}
**** Unused rule: {"irelative-ref"}
**** Unused rule: {"irelative-part"}
**** Unused rule: {"ipath-noscheme"}
**** Unused rule: {"isegment-nz-nc"}

@spemberton
Member Author

One source of ambiguity is:
The input from line.pos 9.8 to 9.19 can be interpreted as 'ihost' in 2 different ways:
1: ihost[9.8:]: IPv4address[:9.19]
2: ihost[9.8:]: ireg-name[:9.19]

This is because a string such as "192.168.0.org" is a valid ireg-name, and the RFC grammar makes no attempt to distinguish the two cases.

That is, "192.168.0.0" matches ireg-name anyway.

And that is because they are lazy and don't distinguish subdomains, simply allowing a host to be any mixture of ALPHA | DIGIT | "-" | "." | "_" | "~" | ucschar (which I believe isn't syntactically valid).
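
The overlap can be reproduced outside an ixml processor. The Python sketch below approximates the two alternatives of ihost with regular expressions; the pattern for ireg-name is a simplification that omits ucschar, and the helper name `ihost_readings` is hypothetical.

```python
import re

# dec-octet as in RFC 3986: 0-255 without leading zeros
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4 = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")

# Simplified ireg-name: ASCII unreserved, pct-encoded, sub-delims.
# Assumption: ucschar is left out to keep the sketch short.
IREG_NAME = re.compile(r"(?:[A-Za-z0-9\-._~]|%[0-9A-Fa-f]{2}|[!$&'()*+,;=])*")


def ihost_readings(host):
    """List which ihost alternatives accept the whole string."""
    readings = []
    if IPV4.fullmatch(host):
        readings.append("IPv4address")
    if IREG_NAME.fullmatch(host):
        readings.append("ireg-name")
    return readings


if __name__ == "__main__":
    for host in ["192.168.0.0", "192.168.0.org"]:
        print(host, ihost_readings(host))
```

Any string accepted by the IPv4 pattern is also accepted by the ireg-name pattern, since digits and "." are both allowed there, which is the source of the two readings.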

@spemberton
Member Author

> Regarding "a"-"f", the ABNF doesn't include the lowercase versions, but the relevant part of RFC 2234 is apparently:
>
>    NOTE:     ABNF strings are case-insensitive and
>              the character set for these strings is us-ascii.
>
> So all quoted strings in the ABNF form have to be changed to support mixed case.

I believe all other parts of the grammar already support mixed case.

@ndw
Contributor

ndw commented Aug 12, 2022

It might be useful to write test cases against the sample grammars. My processor in --pedantic mode would have flagged the unused productions, I think.

@spemberton
Member Author

Commenting out the use of ipv4 in ihost makes all my test examples (not a huge number) unambiguous.

@cmsmcq
Contributor

cmsmcq commented Aug 12, 2022

Some of the suggestions in this issue seem to me to make sense; others do not.

Our judgement may depend on what we think the purpose of the exercise is. My goal was an ixml translation of the grammar in the RFC, with marks to make the XML nicer (for some subjective judgement of 'niceness'). I did not think the goal was to suggest improvements to the normative grammar in the RFC.

I don't object in principle to a sample grammar that deviates in well defined ways from the normative grammar for the language in question, but I think it needs to be strongly motivated and the deviations clearly explained. If we think, for example, that the ixml grammar would be more useful if we made host and ihost unambiguous, or if ireg-name were defined as

ireg-name = label ++ ".".
label = ...

or as

ireg-name = (sub-domain ** ".", ".")?, TLD.
sub-domain = label.
TLD = label.
-label = ...

then we can do so, but we need to explain (first to each other and then to the public) why we think that's more helpful and what class of domain names will be grammatical in the normative grammar but ungrammatical in ours, or vice versa, and why we think deviating from the normative spec for those domain names will probably not matter in practice. So far, I haven't seen any reason to change my understanding of the goal of these grammars.
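
To make the question of "which names differ" concrete, the Python sketch below compares a simplified version of the permissive ireg-name with a stricter form in the spirit of `label ++ "."`. The thread leaves `label` unspecified, so the `LABEL` pattern here is purely a hypothetical placeholder, and the permissive pattern again omits ucschar.

```python
import re

# Simplified normative ireg-name (assumption: ucschar omitted).
RFC_IREG_NAME = re.compile(
    r"(?:[A-Za-z0-9\-._~]|%[0-9A-Fa-f]{2}|[!$&'()*+,;=])*"
)

# Hypothetical label definition, chosen only for illustration.
LABEL = r"[A-Za-z0-9\-]+"
# One or more labels separated by ".", mirroring label ++ "."
STRICT_IREG_NAME = re.compile(rf"{LABEL}(?:\.{LABEL})*")


def compare(host):
    """Return (accepted by permissive form, accepted by strict form)."""
    return (
        bool(RFC_IREG_NAME.fullmatch(host)),
        bool(STRICT_IREG_NAME.fullmatch(host)),
    )


if __name__ == "__main__":
    for host in ["example.org", "a..b", "example.org.", ""]:
        print(repr(host), compare(host))
```

Under these assumptions, names with empty labels, a trailing dot, or no characters at all are accepted by the permissive form only; that is the kind of difference that would need to be documented.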

  • Eliminating unused nonterminals

    I am reluctant to do this, at least for some nonterminals.

    Both RFC 3986 and RFC 3987 use the same set of production rules to define multiple objects. Implicitly, they each define multiple grammars with distinct root symbols and the same set of production rules. Which start symbol you use depends on what you are trying to do. For that reason, I am reluctant to remove (say) the definition for IRI-reference or absolute-IRI (or even ipath) from the grammar.

    I don't have a very strong opinion about the low-level rules imported from RFC 2234. The spec explicitly imports the rules shown, and my recollection is that I kept them all even though some of them are not actually used, because that seemed to me a more accurate reflection of the RFC. Retaining them seems less important to me than retaining the alternative roots for the grammar. I don't, however, see that they do anyone any harm.

  • Hiding low-level nonterminals.

    Agreed.

    It would probably be better to make IRI-reference the start symbol for the IRI grammar, analogous to the choice made in the translation of 3986. It's clear, looking at the grammars, that when I did the translation I spent more time tweaking 3986. More generally, I think it would be helpful to align the two grammars better by using IRI-reference as the root and hiding the productions SP mentions (and any others which correspond to nonterminals hidden in the 3986 grammar, including IRI-reference).

  • Allowing lower-case hex characters

    Agreed. Thank you; good catch.

  • Test cases.

    Agreed.

    The directory contains a file with 110 URIs gathered from the examples used in the specification to illustrate various syntactic possibilities. Turning that into a set of tests for the test collection strikes me as a good idea, as does making similar test collections for the other sample grammars. For at least a few grammars, I think it would be good to have a thorough set of positive and negative test cases; currently, I believe we achieve that (or come close) for the specification grammar, but not for any others. The real-world grammars in our samples directory are the best candidates for that treatment.

My apologies if anything in this comment seems terse or ungenerous; my ego seems to be reacting with less equanimity than one could wish to some of the wording in the comments on this issue.

@spemberton
Member Author

spemberton commented Aug 12, 2022 via email

@spemberton
Member Author

spemberton commented Aug 13, 2022 via email

@spemberton
Member Author

spemberton commented Oct 11, 2022 via email
