Make reference.docx pass validation #9263

Merged
merged 11 commits into from Dec 17, 2023

edwintorok
Contributor

I've used 2 docx validators:

They spotted some genuine errors (an extra > after a close tag, I assume a typo?), but there are also a lot of annoying errors about "Element is not expected". Took me a while to figure out that they complain about the order of XML tags!
Both the XSD and RELAX NG schemas for ISO/IEC 29500 use xs:sequence in a lot of places, which demands a particular ordering for the XML tags. I don't know whether any application actually cares about this order, but it is better to fix them; otherwise the real validation errors are difficult to see due to all the noise.
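
To illustrate what an `xs:sequence` ordering violation looks like, here is a minimal Python sketch. The helper `first_out_of_order` and the list `CT_STYLE_ORDER` are hypothetical names; the list is a hand-copied subset based on the `wml.xsd` excerpt and error message quoted in the commits, not something read from the schema.

```python
# Hypothetical helper: `schema_order` is a hand-copied subset of an xs:sequence,
# not something parsed from wml.xsd.
def first_out_of_order(children, schema_order):
    """Return the first child that appears after a later-ranked sibling, else None."""
    rank = {name: i for i, name in enumerate(schema_order)}
    highest = -1
    for name in children:
        r = rank.get(name)
        if r is None:
            continue  # element not covered by the partial list; ignore it
        if r < highest:
            return name
        highest = r
    return None


# Subset of CT_Style's sequence: qFormat comes before pPr, which comes before rPr.
CT_STYLE_ORDER = ["qFormat", "pPr", "rPr"]

print(first_out_of_order(["pPr", "qFormat"], CT_STYLE_ORDER))  # qFormat
print(first_out_of_order(["qFormat", "pPr"], CT_STYLE_ORDER))  # None
```

A real validator works from the full schema; this only shows why a document with all the right elements can still be rejected.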

With the changes in this PR the reference.docx now validates with both of the above validators, and at least LibreOffice and Google Docs can still open the .docx files.
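
The error logs in the commits refer to pretty-printed copies of the package parts (for example `styles-pretty.xml`), since line numbers are only useful when the XML is not on a single line. How those files were produced is not stated; the following is one possible sketch using only the Python standard library. The function name `pretty_parts` and the default list of part names are assumptions.

```python
import io
import zipfile
from xml.dom import minidom


def pretty_parts(docx_bytes, names=("word/styles.xml", "word/document.xml", "word/settings.xml")):
    """Return {part name: pretty-printed XML text} for the parts present in the package."""
    out = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as pkg:
        present = set(pkg.namelist())
        for name in names:
            if name in present:
                # A .docx file is a zip archive; each part is a standalone XML document.
                out[name] = minidom.parseString(pkg.read(name)).toprettyxml(indent="  ")
    return out
```

Usage would be along the lines of `pretty_parts(open("reference.docx", "rb").read())`, writing each value to a file before passing it to a schema validator.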

Bugs fixed:

  • various XML tag ordering issues in reference.docx
  • duplicate XML tags that should be unique (pStyle)
  • value of tblW (the format of this value changed between various revisions of the ISO standard)
  • a missing required attribute on cnfStyle
  • a trailing > after a w:color (i.e. >>)

The actual output of pandoc doesn't always validate, and I haven't looked at validating anything other than docx, but let's start by fixing the reference docx in this PR.

The changes here are one commit per error fixed, with one test file regeneration commit at the end. If you prefer, I can squash them into a single commit or rebase and regenerate at each step (to retain bisectability post merge).

```
./tmp/styles-pretty.xml:30: element qFormat: Schemas validity error : Element '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}qFormat': This element is not expected. Expected is one of ( {http://schemas.openxmlformats.org/wordprocessingml/2006/main}rPr, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblPr, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}trPr, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tcPr, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblStylePr ).
```

According to `wml.xsd` it must come before `pPr`:
```
<xsd:complexType name="CT_Style">
    <xsd:sequence>
      [...]
      <xsd:element name="qFormat" type="CT_OnOff" minOccurs="0"/>
      [...]
      <xsd:element name="pPr" type="CT_PPrGeneral" minOccurs="0" maxOccurs="1"/>
```
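
The commits in this PR fix the order by editing the reference files directly. Purely as an illustration, the same reordering could be sketched programmatically with a stable sort keyed on schema position. `reorder_children` is a hypothetical helper, and the order list passed to it is a subset taken from the excerpt above.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def reorder_children(elem, schema_order):
    """Stable-sort elem's children by their position in schema_order (local names)."""
    rank = {f"{{{W}}}{name}": i for i, name in enumerate(schema_order)}
    # Children missing from the list keep their relative order but move to the end,
    # so a real tool would need the complete sequence from the schema.
    elem[:] = sorted(elem, key=lambda child: rank.get(child.tag, len(rank)))


style = ET.fromstring(f'<w:style xmlns:w="{W}"><w:pPr/><w:qFormat/></w:style>')
reorder_children(style, ["qFormat", "pPr"])
print([child.tag.split("}")[1] for child in style])  # ['qFormat', 'pPr']
```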

Signed-off-by: Edwin Török <edwin@etorok.net>

```
./tmp/styles-pretty.xml:111: element spacing: Schemas validity error : Element '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}spacing': This element is not expected. Expected is one of ( {http://schemas.openxmlformats.org/wordprocessingml/2006/main}textDirection, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}textAlignment, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}textboxTightWrap, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}outlineLvl, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}divId, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}cnfStyle, {http://schemas.openxmlformats.org/wordprocessingml/2006/main}pPrChange ).
```

According to `wml.xsd` `spacing` must be placed before `jc`:
```
 <xsd:sequence>
      <xsd:element name="spacing" type="CT_Spacing" minOccurs="0"/>
      [...]
      <xsd:element name="jc" type="CT_Jc" minOccurs="0"/>

```

Signed-off-by: Edwin Török <edwin@etorok.net>

There was an extra `>` which showed up as "character content" in the XML:
```
/tmp/styles-pretty.xml:113: element rPr: Schemas validity error : Element '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rPr': Character content other than whitespace is not allowed because the content type is 'element-only'.
```
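
A stray `>` is still well-formed XML, which is why only schema validation catches it. As a rough sketch, this kind of debris can be found by looking for non-whitespace text inside elements that also contain child elements. `stray_text` is a hypothetical helper, the sample fragment is made up for illustration, and treating "has children" as "element-only" is a simplifying assumption rather than what the schema says.

```python
import xml.etree.ElementTree as ET


def stray_text(root):
    """Yield (tag, text) for non-whitespace text in elements that also have children."""
    for elem in root.iter():
        if len(elem) == 0:
            continue  # leaf elements may legitimately hold text (e.g. w:t)
        # Text can sit before the first child (elem.text) or after any child (child.tail).
        pieces = [elem.text] + [child.tail for child in elem]
        for piece in pieces:
            if piece and piece.strip():
                yield elem.tag, piece.strip()


doc = ET.fromstring('<rPr><color val="FF0000"/>><b/></rPr>')
print(list(stray_text(doc)))  # [('rPr', '>')]
```

The heuristic misses stray text in an element that has no children at all, so it complements a schema validator rather than replacing one.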

Signed-off-by: Edwin Török <edwin@etorok.net>

According to `wml.xsd` the order must be:
```
<xsd:sequence>
      <xsd:element name="tcBorders" type="CT_TcBorders" minOccurs="0" maxOccurs="1"/>
      [...]
      <xsd:element name="vAlign" type="CT_VerticalJc" minOccurs="0"/>
```

Signed-off-by: Edwin Török <edwin@etorok.net>

```
./tmp/document-pretty.xml:260: element tblW: Schemas validity error : Element '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}tblW', attribute '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}w': '0.0' is not a valid value of the union type '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}ST_MeasurementOrPercent'.
```

See http://officeopenxml.com/WPtableWidth.php: the standard versions
disagree on whether a `%` is required when type=`pct`. The default is 0
when the entry is omitted, so just delete it.
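
To show why `0.0` is rejected, here is a rough Python sketch of a union-type check. The three patterns are assumptions that approximate the member types of `ST_MeasurementOrPercent` (an integer, a number with `%`, or a number with a unit); they should be checked against the `wml.xsd` edition actually used for validation. `matching_member` is a hypothetical helper.

```python
import re

# Assumed approximations of the union's member types; verify against the
# wml.xsd edition you validate with.
_MEMBER_PATTERNS = {
    "integer": re.compile(r"[+-]?[0-9]+"),
    "percentage": re.compile(r"-?[0-9]+(\.[0-9]+)?%"),
    "universal measure": re.compile(r"-?[0-9]+(\.[0-9]+)?(mm|cm|in|pt|pc|pi)"),
}


def matching_member(value):
    """Return the name of the first member type the value matches, else None."""
    for name, pattern in _MEMBER_PATTERNS.items():
        if pattern.fullmatch(value):
            return name
    return None


for v in ["0.0", "0", "50%", "2.5cm"]:
    print(v, "->", matching_member(v))
```

Under these assumed patterns `0.0` matches no member, because it is neither an integer nor carries a `%` or unit suffix.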

Signed-off-by: Edwin Török <edwin@etorok.net>

Error from OOXMLValidator:
```
  {
        "Description": "The required attribute 'val' is missing.",
        "Path": {
            "NamespacesDefinitions": [
                "xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\""
            ],
            "Namespaces": {

            },
            "XPath": "/w:document[1]/w:body[1]/w:tbl[1]/w:tr[1]/w:trPr[1]/w:cnfStyle[1]",
            "PartUri": "/word/document.xml"
        },
        "Id": "Sch_MissRequiredAttribute",
        "ErrorType": "Schema"
    },
```

The `val` attribute is a bitmask whose first bit means 'first row'; that
flag is already set as a separate attribute.
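
As a small sketch of how such a bitmask string could be built from named flags: only "first bit means first row" comes from the commit message. The string width of 12 and the other three positions are assumptions to check against the specification, and `cnf_val` is a hypothetical helper.

```python
# Only "first bit = first row" is taken from the commit message; the width and
# the remaining positions are assumptions.
CNF_WIDTH = 12
CNF_POSITIONS = ["firstRow", "lastRow", "firstColumn", "lastColumn"]


def cnf_val(**flags):
    """Build the bitmask string, setting one character per enabled flag."""
    bits = ["0"] * CNF_WIDTH
    for name, on in flags.items():
        if on:
            bits[CNF_POSITIONS.index(name)] = "1"  # raises ValueError on unknown names
    return "".join(bits)


print(cnf_val(firstRow=True))  # 100000000000
```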

Signed-off-by: Edwin Török <edwin@etorok.net>

```
{
        "Description": "The element has unexpected child element 'http://schemas.openxmlformats.org/wordprocessingml/2006/main:doNotTrackMoves'.",
        "Path": {
            "NamespacesDefinitions": [
                "xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\""
            ],
            "Namespaces": {

            },
            "XPath": "/w:settings[1]",
            "PartUri": "/word/settings.xml"
        },
        "Id": "Sch_UnexpectedElementContentExpectingComplex",
        "ErrorType": "Schema"
    }
```

According to `wml.xsd` the order is:
```
 <xsd:complexType name="CT_Settings">
    <xsd:sequence>
      <xsd:element name="doNotTrackMoves" type="CT_OnOff" minOccurs="0"/>
      [...]
      <xsd:element name="footnotePr" type="CT_FtnDocProps" minOccurs="0"/>
```

Signed-off-by: Edwin Török <edwin@etorok.net>

From OOXMLValidator:
```
    {
        "Description": "The element has unexpected child element 'http://schemas.openxmlformats.org/wordprocessingml/2006/main:b'.",
        "Path": {
            "NamespacesDefinitions": [
                "xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\""
            ],
            "Namespaces": {

            },
            "XPath": "/w:styles[1]/w:style[9]/w:rPr[1]",
            "PartUri": "/word/styles.xml"
        },
        "Id": "Sch_UnexpectedElementContentExpectingComplex",
        "ErrorType": "Schema"
    },
```

Signed-off-by: Edwin Török <edwin@etorok.net>

```
  {
        "Description": "The element has unexpected child element 'http://schemas.openxmlformats.org/wordprocessingml/2006/main:bCs'.",
        "Path": {
            "NamespacesDefinitions": [
                "xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\""
            ],
            "Namespaces": {

            },
            "XPath": "/w:styles[1]/w:style[15]/w:rPr[1]",
            "PartUri": "/word/styles.xml"
        },
        "Id": "Sch_UnexpectedElementContentExpectingComplex",
        "ErrorType": "Schema"
    },
```

Signed-off-by: Edwin Török <edwin@etorok.net>

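
Since many of the OOXMLValidator errors above are ordering noise with the same `Id`, grouping the report can make the remaining errors easier to spot. The sketch below assumes the report is a JSON array of objects shaped like the excerpts in these commits; `summarize` is a hypothetical helper, and the sample records are abbreviated to the `Id` and `PartUri` fields seen above.

```python
import json
from collections import Counter


def summarize(report_json):
    """Count validator errors by (Id, PartUri)."""
    errors = json.loads(report_json)
    return Counter((e["Id"], e["Path"]["PartUri"]) for e in errors)


# Abbreviated records using Id and PartUri values from the excerpts in this PR.
sample = json.dumps([
    {"Id": "Sch_MissRequiredAttribute", "Path": {"PartUri": "/word/document.xml"}},
    {"Id": "Sch_UnexpectedElementContentExpectingComplex", "Path": {"PartUri": "/word/styles.xml"}},
    {"Id": "Sch_UnexpectedElementContentExpectingComplex", "Path": {"PartUri": "/word/styles.xml"}},
])

for (error_id, part), count in summarize(sample).most_common():
    print(count, error_id, part)
```
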
Using `make test TESTARGS=--accept`

Signed-off-by: Edwin Török <edwin@etorok.net>
@jgm
Owner

jgm commented Dec 17, 2023

Many thanks! This is great. I had no idea these elements had to go in a certain order.
Please let us know if there are other validation issues affecting pandoc-generated documents.

@jgm jgm merged commit 5875de3 into jgm:main Dec 17, 2023
9 of 12 checks passed
@edwintorok
Contributor Author

edwintorok commented Dec 17, 2023

Thanks, I've opened a separate issue, #9264: the reference doc validates now, but an empty doc created by pandoc does not. I guess the XML order is lost, luckily only in settings.xml, but I don't know enough about how Haskell's XML library works (it will probably need to sort the tags to keep them in the original order?).

@edwintorok edwintorok deleted the validation branch December 17, 2023 19:00