
We should document the XML tag set that results from parsing an ixml grammar with the ixml specification grammar #137

Open
ndw opened this issue Aug 3, 2022 · 4 comments

Comments

@ndw
Contributor

ndw commented Aug 3, 2022

No description provided.

@cmsmcq
Contributor

cmsmcq commented Aug 12, 2022

I think that would be a good idea.

One question is: what form should the documentation take? The two possibilities I see are:

  • A styled copy of the RNG file

    In some projects, I have documented schemas by just using the facilities in RNG to insert documentation into the RNG file and supplying an XSLT stylesheet. Alas, I don't see any of those schemas on the web and I don't see a way to attach a screen shot to this comment, but the basic result is, of course, just a rendered form of the RNG schema, with commentary styled however you like. It's not literate programming (the schema order is necessarily preserved), but it's better than nothing.

[Edit, 9 October. It turns out one can attach an image after all. Here is a screen shot of an RNG schema with embedded TEI-encoded documentation.]
[Image: Screenshot from 2022-10-09 16-26-36]
[For what it's worth, here is the XML form of the section on the tei:c element.]
[Image: Screenshot from 2022-10-09 16-34-00]

  • Alphabetical reference documentation (with maybe a little prose at the front)

    Ever since I saw the Formex manual many years ago, I have liked the idea of a prose description followed by an alphabetical sequence of reference documentation on the elements and attributes of the vocabulary, along the lines of the JATS tag set documentation or the TEI reference documentation.

    If documentation along the lines of the test-catalog reference documentation is acceptable, we can get it by using the TEI vocabulary for tag set documentation. (The TEI documentation typically assumes that you want to generate the schema from the same source file, but as that example shows, it is possible to work around that.)

    I suspect there may be a Docbook-based way to do this, also. (Of course there is.)

    So if we do it this way, it's just a question of whether we prefer to use TEI or Docbook or something else to write the reference documentation. (Or maybe it's just a question of whether Norm gets around to generating the skeleton document first, or I do.)

@cmsmcq
Contributor

cmsmcq commented Nov 15, 2022

We discussed this on the call of 15 November; NDTW suggested that we could avoid having to choose between TEI and Docbook as the basis by doing the entire thing in XHTML, which was accepted as a Solomonic decision. He took an action to build a prototype.

@ndw
Contributor Author

ndw commented Nov 16, 2022

Where are the tools that build the RNG grammar from the ixml grammar? I'm a bit confused by some of the output, for example:

   <rng:define name="e.version">
      <rng:element name="version">
         <rng:ref name="extension-attributes"/>
         <rng:interleave>
            <rng:ref name="extension-elements"/>
            <rng:group>
               <rng:ref name="RS"/>
               <rng:ref name="RS"/>
               <rng:ref name="string"/>
               <rng:ref name="s"/>
            </rng:group>
         </rng:interleave>
      </rng:element>
   </rng:define>

Why is RS duplicated? (And elsewhere, why is s duplicated?)

I wonder if a mechanically generated grammar will ever be simple enough to be usefully documented.

I'm finding this, for example, hard to follow and difficult to imagine documenting:

whitespace = h.whitespace
h.whitespace =
  # alt with no realized children
  empty
  | tab
  | lf
  | cr
tab = h.tab
h.tab =
  # alt with no realized children
  empty

@cmsmcq
Contributor

cmsmcq commented Nov 16, 2022

The RNG is generated by running the Gingersnap stylesheet ixml-to-rng.xsl on ixml.xml (using Saxon HE). The RNC is generated from the RNG using trang.
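For anyone who wants to reproduce those two steps, here is a minimal sketch that only builds the command lines. The jar path `saxon-he.jar`, the output names `ixml.rng` and `ixml.rnc`, and the assumption that `trang` is available as a command on the PATH are placeholders, not taken from this thread. The Saxon options `-s:`, `-xsl:` and `-o:` are its usual command-line options, and trang infers the input and output formats from the file extensions.

```python
from typing import List


def saxon_command(source: str, stylesheet: str, output: str,
                  saxon_jar: str = "saxon-he.jar") -> List[str]:
    """Build a Saxon command line that applies `stylesheet` to `source`."""
    # saxon_jar is a placeholder path (an assumption, adjust to your install).
    return ["java", "-jar", saxon_jar,
            f"-s:{source}", f"-xsl:{stylesheet}", f"-o:{output}"]


def trang_command(rng_file: str, rnc_file: str) -> List[str]:
    """Build a trang command line converting RNG (XML syntax) to RNC."""
    # Assumes a `trang` wrapper script is on the PATH.
    return ["trang", rng_file, rnc_file]


def build_pipeline() -> List[List[str]]:
    """Return the two commands in the order they would be run."""
    # The output file names are assumptions chosen for illustration.
    return [
        saxon_command("ixml.xml", "ixml-to-rng.xsl", "ixml.rng"),
        trang_command("ixml.rng", "ixml.rnc"),
    ]
```

Each list can be handed to `subprocess.run(cmd, check=True)` to actually run the step.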

The design principle of the transform, for what it is worth, is to stay as close to the structure of the ixml as possible. (That's not an end in itself, but it does help keep the transform simple by eliminating the temptation to perform simplifications of various kinds. RNG is well suited for simplification, so I just shoved responsibility for all simplifications and normalizations onto RNG tools.) The material quoted in Norm's comment may become clearer with (a) consideration of the corresponding ixml and (b) some explanation of the naming conventions used to handle the marks and tmarks of the ixml.

The ixml rule for whitespace is:

-whitespace: -[Zs]; tab; lf; cr.

Since the default marking for whitespace is -, any reference elsewhere in the grammar to whitespace (without a mark) will have the same effect as a reference to -whitespace. It will be, in effect, a reference to hidden whitespace (as opposed to whitespace-as-attribute or whitespace-as-element, which would also be possible). We record that with a definition: whitespace without a prefix means the same as h.whitespace (the prefix h. being used to render the mark - which hides a nonterminal).

whitespace = h.whitespace

The four right-hand sides of the ixml rule for whitespace turn into four disjuncts in the definition of h.whitespace. Since -[Zs] will produce the empty string in the visible-XML grammar, it is rendered with the RNC keyword empty; the three nonterminals are rendered as they appear.

h.whitespace =
  # alt with no realized children
  empty
  | tab
  | lf
  | cr

Next, we come to the definition of tab. The ixml rule is

-tab: -#9.

which says first that an unqualified reference to tab is hidden (so we will need to say that tab = h.tab, i.e. that by default the nonterminal tab is hidden), and second that that hidden nonterminal will dominate the empty string (since the terminal -#9 is hidden).

tab = h.tab
h.tab =
  # alt with no realized children
  empty
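The naming convention walked through above can be sketched in a few lines of Python. This is not the Gingersnap stylesheet, only an illustration of the two rules described here: a rule whose default mark is - gets an alias to its h. form, and a hidden terminal contributes nothing, so an alternative with nothing left becomes empty. The tuple-based rule representation and the function names are invented for the example, and the sketch does not emit the "alt with no realized children" comment.

```python
from typing import List, Tuple

# A term is (kind, text). Only two kinds are handled in this sketch:
#   "ref"             an unmarked reference to a nonterminal
#   "hidden-terminal" a literal or character set marked with '-'
Term = Tuple[str, str]

HIDDEN_PREFIX = "h."  # prefix used for hidden nonterminals, as in h.tab


def render_alternative(terms: List[Term]) -> str:
    """Render one ixml alternative as an RNC pattern."""
    parts = []
    for kind, text in terms:
        if kind == "ref":
            parts.append(text)          # nonterminals are rendered as they appear
        elif kind == "hidden-terminal":
            continue                    # hidden terminals produce no XML
        else:
            raise ValueError(f"term kind not covered by this sketch: {kind}")
    return ", ".join(parts) if parts else "empty"


def render_hidden_rule(name: str, alternatives: List[List[Term]]) -> str:
    """Render a rule whose default mark is '-' as an alias plus its h. definition."""
    alias = f"{name} = {HIDDEN_PREFIX}{name}"
    body = "\n  | ".join(render_alternative(alt) for alt in alternatives)
    return f"{alias}\n{HIDDEN_PREFIX}{name} =\n  {body}"


# The two rules discussed above, written in the sketch's own representation.
WHITESPACE = [[("hidden-terminal", "[Zs]")], [("ref", "tab")],
              [("ref", "lf")], [("ref", "cr")]]
TAB = [[("hidden-terminal", "#9")]]

if __name__ == "__main__":
    print(render_hidden_rule("whitespace", WHITESPACE))
    print(render_hidden_rule("tab", TAB))
```

Apart from the missing comment lines, running it should reproduce the two pairs of definitions quoted above.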

I hope it is now slightly easier to follow.

It is probably not any easier to imagine documenting it, but since none of the nonterminals involved here turn into elements or attributes in the visible-XML form of the grammar, I don't think it needs to be documented. What need to be documented are the elements and attributes that can appear in an ixml grammar written in XML, and their content models. Because hidden nonterminals can easily get in the way and make the schema harder to understand, it may be best to start from a version of the schema in which many of the definitions have been expanded in place. I spent some time fiddling with Eric van der Vlist's sequence of XSLT stylesheets which implement the simplification / rewriting rules of the RNG spec, before discovering that the simplest way to get a schema in a form suitable for this kind of documentation is to use the undocumented -s option to Jing:

jing -s my-schema.rng > my-schema.simplified.rng 

I would show you the simplified form of the bit quoted above, if it existed, but in the simplified form of the schema, the whitespace nonterminal / definition has disappeared entirely. Which is, I think, as it should be.

However, the definition of version shown does look a bit troubling. What's with the double RS?

The ixml spec grammar specifies:

version: -"ixml", RS, -"version", RS, string, s, -'.' .

The three hidden terminals disappear in the XML (they are hidden), so the right-hand side turns into RS, RS, string, s. If all of these were visible nonterminals, the ixml grammar would require that there be two RS elements before the string element; as it happens, RS is hidden, as are all of its potential descendants except for comment. The Jing-simplified version of ixml.rnc defines version as:

version =
  element version {
    attribute * - local:* { text }*,
    (_1*
     & ((comment?)+,
        (comment?)+,
        attribute string { text },
        (comment?)+))
  }

where _1 is a nameless definition for extension elements, and the double RS is visible in (comment?)+, (comment?)+.
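To make the path from the ixml rule to that content model concrete, here is a small sketch that drops the hidden terminals from the right-hand side of version and then expands each remaining reference in place. The expansion table is simply read off the Jing output quoted above rather than computed, and the representation and function names are made up for the illustration.

```python
from typing import Dict, List, Tuple

# Right-hand side of the ixml rule for version, one (kind, text) pair per term.
VERSION_RHS: List[Tuple[str, str]] = [
    ("hidden-terminal", '"ixml"'),
    ("ref", "RS"),
    ("hidden-terminal", '"version"'),
    ("ref", "RS"),
    ("ref", "string"),
    ("ref", "s"),
    ("hidden-terminal", "'.'"),
]

# Expansions copied from the Jing-simplified model (an assumption of this
# sketch; a real tool would derive them from the definitions of RS, s, string).
EXPANSIONS: Dict[str, str] = {
    "RS": "(comment?)+",
    "s": "(comment?)+",
    "string": "attribute string { text }",
}


def visible_refs(rhs: List[Tuple[str, str]]) -> List[str]:
    """Keep only the nonterminal references; hidden terminals disappear."""
    return [text for kind, text in rhs if kind == "ref"]


def expand_in_place(refs: List[str], expansions: Dict[str, str]) -> List[str]:
    """Replace each reference by its expansion, leaving unknown names alone."""
    return [expansions.get(ref, ref) for ref in refs]


if __name__ == "__main__":
    refs = visible_refs(VERSION_RHS)
    print(", ".join(refs))
    print(",\n".join(expand_in_place(refs, EXPANSIONS)))
```

The first step leaves RS, RS, string, s, and the second shows where the repeated (comment?)+ comes from: each RS is expanded separately and nothing merges the two adjacent runs.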

Hmm. This is OK for me to eyeball when I am trying to look to see which children an element can have, but maybe some further simplification in the content model would help. It would be nice if the documentation could say the content model of the version element is something like

version = element version { 
    foreign-attributes*, 
    (foreign-elements* 
    & 
    (comment*, string, comment*)) 
}

or even something simpler which just ignores the foreign attributes and foreign elements (e.g. by starting the simplification with ixml-strict.rng, not ixml.rng).

But for that we appear to need to go back to Eric's transforms and figure out which ones will give us something closer to what we want. (Unless we can find a different undocumented option to Jing.)
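As a sanity check on the simpler content model proposed above, the following sketch compares (comment?)+, (comment?)+ with comment* by treating a run of child elements as a string and the two patterns as ordinary regular expressions. That is only an analogy for this one fragment, not a RELAX NG implementation: the letter c stands for a comment element and x for any other child, both chosen for the example.

```python
import itertools
import re

# 'c' stands for one comment element, 'x' for any other child element.
JING_FORM = re.compile(r"(?:c?)+(?:c?)+")   # (comment?)+, (comment?)+
SHORT_FORM = re.compile(r"c*")              # comment*


def same_language(max_len: int = 6) -> bool:
    """Check that both patterns accept exactly the same strings up to max_len."""
    for length in range(max_len + 1):
        for letters in itertools.product("cx", repeat=length):
            candidate = "".join(letters)
            jing_ok = JING_FORM.fullmatch(candidate) is not None
            short_ok = SHORT_FORM.fullmatch(candidate) is not None
            if jing_ok != short_ok:
                return False
    return True


if __name__ == "__main__":
    print(same_language())
```

The check is bounded, so it is evidence rather than proof, but it matches the intuition that one or more optional comments, twice over, is just zero or more comments.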

I hope this helps.
