DOCX table with rowspan to RST produces invalid syntax #4059

rasky · 2017-11-11T11:56:41Z

The attached minimal DOCX contains what appears to be a single table with 1 column and 1 header row. When generating a RST output, a malformed syntax is produced: the header row has 1 column, but the other rows have 2 columns. I don't know whether the bug is in the DOCX reader or RST writer.

Command line: pandoc --from=docx --to=rst single-table.docx --output=single-table.rst

Verified with pandoc 2.0.1.1.

single-table.docx

The text was updated successfully, but these errors were encountered:

mb21 · 2017-11-11T12:12:55Z

Pandoc currently does not support col- or row-spans. All rows need the same number of tables which this example doesn't have (header row has one cell, the rows below have two cells).

rasky · 2017-11-11T12:32:27Z

Looks like it's #1024 then

mb21 · 2017-11-11T12:36:42Z

Duplicate of #1024

rasky · 2017-11-11T15:23:58Z

@mb21 thinking again, the output produced by the RST writer is malformed (invalid syntax). I think the writer should at least strive to always produce a well formed RST.

In this case, the writer could fallback to generating a table where the number of columns is the max of the columns defined for each row. The table would still be wrong compared to the original, but at least the RST would be well-formed.

mb21 · 2017-11-11T16:21:06Z

Fair enough... I'm not sure whether we'll get around to that before #1024 though, and whether the docx reader or rst writer is to blame (see #3648)

For the record, your docx input produces something like the following with -to native:

Table [] [AlignDefault] [0.0]
 [[]]
 [[[Plain [Str "lista"]]]
 ,[[]
  ,[Para [Str "Eliminati""]
   ,Para [Str "Sono"]]]
  ]

danse · 2017-11-14T13:08:50Z

i don't see the connection with #3648. as far as i understand, here the translation goes through a native table where the number of cells is not consistent. i see three places where we could try to fix this:

in the reader
in a layer that is meant to transform the native representation after reading and before writing
in the writer

i would exclude 3 because it would require to work on all writers, and i am not sure that a layer like 2 exists in the code base. therefore if you agree i would try to reproduce this behaviour in the test suite for the DOCX reader. i tried to look into the source of the DOCX and reducing it to a minimal case doesn't seem easy to me, so i would add the document here to test/docx as it is

mb21 · 2017-11-14T13:18:44Z

@danse I guess I agree with you. The reader should be changed to always produce a table representation where all lists have the same length (this was how 3648 was resolved as well).

i would add the document here to test/docx as it is

I would at least make sure it's only two rows with one word in each cell, to keep the tests somewhat readable.

danse · 2017-11-14T13:41:33Z

@rasky when i try to modify the DOCX file it gets sanitised, and editing it manually would be awkward. would it be easy for you to reduce it to the minimum, so that we can include it in a test case? or do you have any hints about how to do it?

danse · 2017-11-14T17:44:14Z

the problem in the document seems to be simply an inconsistent number of w:tc tags within a w:tr tag. the parser does what seems more natural, it just reads the document without making any assumption nor enforcing any number of cells

what do we want to do?

in order to produce well-formed tables we could count the maximum row size and pad with empty cells in rows that don't reach that number. whether the padding should go to the beginning or the end of the row is arbitrary

where do we want to do it?

modifying the reader would probably mean adding such logic here, but to be honest this is generic logic which could be applied to all the tables read by any reader, so we could opt to keep it outside the reader in a sanitisation layer which would reduce malformed files produced by any writer. since this has a performance cost, we could think of adding a --sanitise or --fix flag

mb21 · 2017-11-14T18:02:24Z

I guess @jgm will have to chime in on this. Not sure we want to invest too much time on this with #1024 hopefully being resolved soon...

rasky · 2017-11-14T19:29:37Z

@danse sorry I can't help you on reducing the DOCX. I don't know how it was produced, but it's a "real document" that I reduced by simply deleting everything but the table.

What I don't understand is: does the native format allow for tables where each row has a different number of columns? If that's well-defined, then it's a bug in the writer because RST cannot represent it, and we probably need to add empty columns. If instead the native format doesn't allow them, then it is a bug in the DOCX read that produces them... but then I wonder why there is no sanity check that an internal malformed table cannot be created in the first place.

danse · 2017-11-15T11:49:35Z

Not sure we want to invest too much time on this with #1024 hopefully being resolved soon...

as far as i understand, #1024 is about something different

@danse sorry I can't help you on reducing the DOCX. I don't know how it was produced, but it's a "real document" that I reduced by simply deleting everything but the table.

okay i can reduce the document manually

What I don't understand is: does the native format ...

the header and every cell is a distinct list in the native format. sanitisation can be introduced in different points, i guess that it would be better to wait for @jgm to point to a direction

jgm · 2017-11-15T18:48:41Z

Probably the easiest immediate fix would be to change the table builders (in Text.Pandoc.Builder) so they add empty cells to ensure that the table is of uniform width. +++ Francesco Occhipinti [Nov 15 17 11:49 ]:

…

Not sure we want to invest too much time on this with [1]#1024 hopefully being resolved soon... as far as i understand, [2]#1024 is about something different ***@***.*** sorry I can't help you on reducing the DOCX. I don't know how it was produced, but it's a "real document" that I reduced by simply deleting everything but the table. okay i can reduce the document manually What I don't understand is: does the native format ... the header and every cell is a distinct list in [4]the native format. sanitisation can be introduced in different points, i guess that it would be better to wait for ***@***.*** to point to a direction — You are receiving this because you were mentioned. Reply to this email directly, [6]view it on GitHub, or [7]mute the thread. References 1. #1024 2. #1024 3. https://github.com/danse 4. https://github.com/jgm/pandoc-types/blob/master/Text/Pandoc/Definition.hs#L229-L233 5. https://github.com/jgm 6. #4059 (comment) 7. https://github.com/notifications/unsubscribe-auth/AAAL5Ac61jrmMdurw4MgpYsLL9Tcl4Luks5s2s_QgaJpZM4Qad4D

danse · 2017-11-22T14:55:18Z

i wrote this test to replicate the problem on pandoc-types, let me know about any feedback so that i can adjust early. i see that in the table builder there is already some logic to extract the max length, so i expect the fix to be easy

danse · 2017-11-22T15:12:19Z

found out that i have to adjust a few details in the test, anyway i'd welcome feedback about the general direction

danse · 2017-11-23T18:48:20Z

okay, i adjusted the test and found a minimal solution which i propose in jgm/pandoc-types#36. since i'm new to the project, i understand that there might be many parts of my pull request that require additional work, please point me to the desired improvements. i also noticed that we might apply similar padding easily to the columns

danse · 2018-04-11T15:40:51Z

finally closed in Pandoc 2.1.3! 🙌

the lack of support for spans causes an output different from what one could expect, but there is no data loss nor syntax error

+-----------------------------------+-----------------------------------+
| **lista dei principali            |                                   |
| cambiamenti rispetto la revisione |                                   |
| precedente:**                     |                                   |
+-----------------------------------+-----------------------------------+
|                                   | Eliminati tutti i riferimenti     |
|                                   | all'articolo 5, comma 1, lettera  |
|                                   | b) del CAD (abolito con dlgs 26   |
|                                   | agosto 2016, n. 179).             |
|                                   |                                   |
|                                   | Sono interessai più paragrafi del |
|                                   | testo.                            |
+-----------------------------------+-----------------------------------+
|                                   | Sostituito l'intero capitolo 2    |
|                                   | "**Errore. L'origine riferimento  |
|                                   | non è stata trovata.**".          |
|                                   |                                   |
|                                   | L'aggiornamento riguarda          |
|                                   | unicamente le modifiche           |
|                                   | introdotte con la monografia      |
|                                   | "Utilizzo del codice IUV per Enti |
|                                   | pluri-intermediati", documento    |
|                                   | sottoposto a consultazione        |
|                                   | pubblica nel mese di maggio 2016. |
+-----------------------------------+-----------------------------------+
|                                   | Sostituito il paragrafo **Errore. |
|                                   | L'origine riferimento non è stata |
|                                   | trovata.** con eliminazione dei   |
|                                   | riferimenti al vecchio testo      |
|                                   | dell'articolo 5 del CAD.          |
+-----------------------------------+-----------------------------------+
|                                   |                                   |
+-----------------------------------+-----------------------------------+
|                                   |                                   |
+-----------------------------------+-----------------------------------+
|                                   |                                   |
+-----------------------------------+-----------------------------------+
|                                   |                                   |
+-----------------------------------+-----------------------------------+

pad table headers up to max row length to avoid syntax errors, closes…

mb21 marked this as a duplicate of #1024 Nov 11, 2017

mb21 closed this as completed Nov 11, 2017

mb21 reopened this Nov 11, 2017

mb21 changed the title ~~DOCX/RST: invalid table~~ DOCX table with rowspan to RST produces invalid syntax Nov 11, 2017

mb21 added format:Docx reader labels Nov 14, 2017

danse added a commit to danse/pandoc-types that referenced this issue Nov 22, 2017

add a test case for jgm/pandoc#4059 (to be squashed)

953bf35

danse mentioned this issue Dec 4, 2017

pad table headers up to max row length to avoid syntax errors, closes… jgm/pandoc-types#36

Merged

danse added a commit to danse/pandoc-types that referenced this issue Jan 15, 2018

build tables with consistent lengths, see jgm/pandoc#4059

442d858

danse added a commit to danse/pandoc-types that referenced this issue Jan 15, 2018

build tables with consistent lengths, see jgm/pandoc#4059

ed19be7

danse added a commit to danse/pandoc-types that referenced this issue Jan 15, 2018

build tables with consistent lengths, see jgm/pandoc#4059

e354416

danse added a commit to italia/pandoc-types that referenced this issue Mar 19, 2018

build tables with consistent lengths, see jgm/pandoc#4059

fc45711

jgm closed this as completed Apr 11, 2018

mb21 pushed a commit to mb21/pandoc-types that referenced this issue Sep 16, 2018

Merge pull request jgm#36 from danse/jgm/pandoc#4059

5fead2b

pad table headers up to max row length to avoid syntax errors, closes…

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DOCX table with rowspan to RST produces invalid syntax #4059

DOCX table with rowspan to RST produces invalid syntax #4059

rasky commented Nov 11, 2017

mb21 commented Nov 11, 2017

rasky commented Nov 11, 2017

mb21 commented Nov 11, 2017

rasky commented Nov 11, 2017

mb21 commented Nov 11, 2017 •

edited

Loading

danse commented Nov 14, 2017

mb21 commented Nov 14, 2017

danse commented Nov 14, 2017

danse commented Nov 14, 2017

mb21 commented Nov 14, 2017

rasky commented Nov 14, 2017

danse commented Nov 15, 2017

jgm commented Nov 15, 2017 via email

danse commented Nov 22, 2017

danse commented Nov 22, 2017

danse commented Nov 23, 2017

danse commented Apr 11, 2018

DOCX table with rowspan to RST produces invalid syntax #4059

DOCX table with rowspan to RST produces invalid syntax #4059

Comments

rasky commented Nov 11, 2017

mb21 commented Nov 11, 2017

rasky commented Nov 11, 2017

mb21 commented Nov 11, 2017

rasky commented Nov 11, 2017

mb21 commented Nov 11, 2017 • edited Loading

danse commented Nov 14, 2017

mb21 commented Nov 14, 2017

danse commented Nov 14, 2017

danse commented Nov 14, 2017

what do we want to do?

where do we want to do it?

mb21 commented Nov 14, 2017

rasky commented Nov 14, 2017

danse commented Nov 15, 2017

jgm commented Nov 15, 2017 via email

danse commented Nov 22, 2017

danse commented Nov 22, 2017

danse commented Nov 23, 2017

danse commented Apr 11, 2018

mb21 commented Nov 11, 2017 •

edited

Loading