DOCX reader should handle table caption created in non-English Microsoft Word #9518

Closed
rgaiacs opened this issue Feb 27, 2024 · 0 comments
rgaiacs commented Feb 27, 2024

As pointed out by @jgm, instead of checking the styleId directly, the DOCX reader should look up the style by its id and check whether that style's <w:name> element is "caption".

Previously discussed in #9515.
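
The lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not pandoc's actual (Haskell) implementation. The `styles.xml` fragment is a hypothetical, trimmed example written for this sketch rather than copied from the attached file, and the helper names `style_names` and `is_caption_style` are made up.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, trimmed word/styles.xml fragment (an assumption for this sketch):
# the styleId is localized, while <w:name> holds the language-independent name.
STYLES_XML = f"""<w:styles xmlns:w="{W}">
  <w:style w:type="paragraph" w:styleId="Beschriftung">
    <w:name w:val="caption"/>
  </w:style>
  <w:style w:type="paragraph" w:styleId="Standard">
    <w:name w:val="Normal"/>
  </w:style>
</w:styles>"""


def style_names(styles_xml: str) -> dict[str, str]:
    """Map each w:styleId to the w:val of its <w:name> child."""
    root = ET.fromstring(styles_xml)
    names = {}
    for style in root.iter(f"{{{W}}}style"):
        style_id = style.get(f"{{{W}}}styleId")
        name = style.find(f"{{{W}}}name")
        if style_id is not None and name is not None:
            names[style_id] = name.get(f"{{{W}}}val", "")
    return names


def is_caption_style(style_id: str, names: dict[str, str]) -> bool:
    """Decide by the style's name, not by its (localized) id."""
    return names.get(style_id, "").lower() == "caption"


names = style_names(STYLES_XML)
print(is_caption_style("Beschriftung", names))
print(is_caption_style("Standard", names))
```

With a check like this, a paragraph whose `<w:pStyle>` is "Beschriftung" would be treated the same as one whose `<w:pStyle>` is "Caption", provided both styles carry the name "caption".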

My Microsoft Word installation is set to German, but my document is in English.

Screenshot 2024-02-26 170602

I create a table using the Microsoft Word built-in interface.

Screenshot 2024-02-26 170752

And I add a caption using the Microsoft Word built-in dialogue window.

Screenshot 2024-02-26 170852

Because my document is in English, Word automatically sets the caption label to "Table".

The final minimal working example is mwe-using-german-word.docx.

When I run pandoc --from docx --to html mwe-using-german-word.docx, the output is

<p>Lorem ipsum</p>
<p>Table 1 Example</p>
<table>
<colgroup>
<col style="width: 50%" />
<col style="width: 50%" />
</colgroup>
<thead>
<tr class="header">
<th>A</th>
<th>B</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>C</td>
<td>D</td>
</tr>
</tbody>
</table>

instead of

<p>Lorem ipsum</p>
<table>
<caption><p>Example</p></caption>
<colgroup>
<col style="width: 50%" />
<col style="width: 50%" />
</colgroup>
<thead>
<tr class="header">
<th>A</th>
<th>B</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>1</td>
<td>2</td>
</tr>
</tbody>
</table>

that is produced by the same command (pandoc --from docx --to html) but using mwe-using-english-word.docx as input.

XML of non-English Document

The caption is

    <w:p w14:paraId="1FADD07B" w14:textId="3660CC9A" w:rsidR="00917377" w:rsidRDefault="00917377" w:rsidP="00917377">
      <w:pPr>
        <w:pStyle w:val="Beschriftung"/>
        <w:keepNext/>
      </w:pPr>
      <w:r>
        <w:t xml:space="preserve">Table </w:t>
      </w:r>
      <w:r>
        <w:fldChar w:fldCharType="begin"/>
      </w:r>
      <w:r>
        <w:instrText xml:space="preserve"> SEQ Table \* ARABIC </w:instrText>
      </w:r>
      <w:r>
        <w:fldChar w:fldCharType="separate"/>
      </w:r>
      <w:r>
        <w:rPr>
          <w:noProof/>
        </w:rPr>
        <w:t>1</w:t>
      </w:r>
      <w:r>
        <w:fldChar w:fldCharType="end"/>
      </w:r>
      <w:r>
        <w:t xml:space="preserve"> </w:t>
      </w:r>
      <w:proofErr w:type="spellStart"/>
      <w:r>
        <w:t>Example</w:t>
      </w:r>
      <w:proofErr w:type="spellEnd"/>
    </w:p>

XML of English Document

    <w:p w14:paraId="5DE3A68F" w14:textId="153D5F3C" w:rsidR="000E6255" w:rsidRDefault="000E6255" w:rsidP="000E6255">
      <w:pPr>
        <w:pStyle w:val="Caption"/>
        <w:keepNext/>
      </w:pPr>
      <w:r>
        <w:t xml:space="preserve">Table </w:t>
      </w:r>
      <w:fldSimple w:instr=" SEQ Table \* ARABIC ">
        <w:r>
          <w:rPr>
            <w:noProof/>
          </w:rPr>
          <w:t>1</w:t>
        </w:r>
      </w:fldSimple>
      <w:r>
        <w:t xml:space="preserve"> Example</w:t>
      </w:r>
    </w:p>
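
To make the comparison concrete, here is a small Python sketch that reads trimmed copies of the two caption paragraphs above (the `w14` and `rsid` attributes, `<w:keepNext/>`, `<w:rPr>` and `<w:proofErr>` are dropped for brevity). It suggests that the visible text is the same in both files and only the `<w:pStyle>` value differs. The helper names `paragraph_text` and `paragraph_style_id` are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Trimmed copy of the caption paragraph written by the German Word (complex field).
GERMAN = f"""<w:p xmlns:w="{W}">
  <w:pPr><w:pStyle w:val="Beschriftung"/></w:pPr>
  <w:r><w:t xml:space="preserve">Table </w:t></w:r>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> SEQ Table \\* ARABIC </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>1</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
  <w:r><w:t xml:space="preserve"> </w:t></w:r>
  <w:r><w:t>Example</w:t></w:r>
</w:p>"""

# Trimmed copy of the caption paragraph written by the English Word (simple field).
ENGLISH = f"""<w:p xmlns:w="{W}">
  <w:pPr><w:pStyle w:val="Caption"/></w:pPr>
  <w:r><w:t xml:space="preserve">Table </w:t></w:r>
  <w:fldSimple w:instr=" SEQ Table \\* ARABIC ">
    <w:r><w:t>1</w:t></w:r>
  </w:fldSimple>
  <w:r><w:t xml:space="preserve"> Example</w:t></w:r>
</w:p>"""


def paragraph_text(xml: str) -> str:
    """Concatenate every <w:t> in document order; <w:instrText> is skipped."""
    para = ET.fromstring(xml)
    return "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))


def paragraph_style_id(xml: str) -> str:
    """Return the w:val of the paragraph's <w:pStyle>, or '' if absent."""
    style = ET.fromstring(xml).find("w:pPr/w:pStyle", NS)
    return style.get(f"{{{W}}}val", "") if style is not None else ""


for label, xml in (("German Word", GERMAN), ("English Word", ENGLISH)):
    print(label, repr(paragraph_style_id(xml)), repr(paragraph_text(xml)))
```

So a check keyed on the literal styleId "Caption" would miss the German paragraph even though its content is equivalent.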

rgaiacs added the bug label Feb 27, 2024
jgm closed this as completed in 6f87c9e Feb 28, 2024
jgm added a commit that referenced this issue Feb 28, 2024
Normally these occur outside the table element itself, but they
should still be parsed as captions in this case.

Closes #9518.