Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOCX table with rowspan to RST produces invalid syntax #4059

Closed
rasky opened this issue Nov 11, 2017 · 17 comments
Closed

DOCX table with rowspan to RST produces invalid syntax #4059

rasky opened this issue Nov 11, 2017 · 17 comments

Comments

@rasky
Copy link

rasky commented Nov 11, 2017

The attached minimal DOCX contains what appears to be a single table with 1 column and 1 header row. When generating a RST output, a malformed syntax is produced: the header row has 1 column, but the other rows have 2 columns. I don't know whether the bug is in the DOCX reader or RST writer.

Command line: pandoc --from=docx --to=rst single-table.docx --output=single-table.rst

Verified with pandoc 2.0.1.1.

single-table.docx

@mb21
Copy link
Collaborator

mb21 commented Nov 11, 2017

Pandoc currently does not support col- or row-spans. All rows need the same number of tables which this example doesn't have (header row has one cell, the rows below have two cells).

@rasky
Copy link
Author

rasky commented Nov 11, 2017

Looks like it's #1024 then

@mb21
Copy link
Collaborator

mb21 commented Nov 11, 2017

Duplicate of #1024

@mb21 mb21 marked this as a duplicate of #1024 Nov 11, 2017
@mb21 mb21 closed this as completed Nov 11, 2017
@rasky
Copy link
Author

rasky commented Nov 11, 2017

@mb21 thinking again, the output produced by the RST writer is malformed (invalid syntax). I think the writer should at least strive to always produce a well formed RST.

In this case, the writer could fallback to generating a table where the number of columns is the max of the columns defined for each row. The table would still be wrong compared to the original, but at least the RST would be well-formed.

@mb21
Copy link
Collaborator

mb21 commented Nov 11, 2017

Fair enough... I'm not sure whether we'll get around to that before #1024 though, and whether the docx reader or rst writer is to blame (see #3648)

For the record, your docx input produces something like the following with -to native:

Table [] [AlignDefault] [0.0]
 [[]]
 [[[Plain [Str "lista"]]]
 ,[[]
  ,[Para [Str "Eliminati""]
   ,Para [Str "Sono"]]]
  ]

@mb21 mb21 reopened this Nov 11, 2017
@mb21 mb21 changed the title DOCX/RST: invalid table DOCX table with rowspan to RST produces invalid syntax Nov 11, 2017
@danse
Copy link
Contributor

danse commented Nov 14, 2017

i don't see the connection with #3648. as far as i understand, here the translation goes through a native table where the number of cells is not consistent. i see three places where we could try to fix this:

  1. in the reader
  2. in a layer that is meant to transform the native representation after reading and before writing
  3. in the writer

i would exclude 3 because it would require to work on all writers, and i am not sure that a layer like 2 exists in the code base. therefore if you agree i would try to reproduce this behaviour in the test suite for the DOCX reader. i tried to look into the source of the DOCX and reducing it to a minimal case doesn't seem easy to me, so i would add the document here to test/docx as it is

@mb21
Copy link
Collaborator

mb21 commented Nov 14, 2017

@danse I guess I agree with you. The reader should be changed to always produce a table representation where all lists have the same length (this was how 3648 was resolved as well).

i would add the document here to test/docx as it is

I would at least make sure it's only two rows with one word in each cell, to keep the tests somewhat readable.

@danse
Copy link
Contributor

danse commented Nov 14, 2017

@rasky when i try to modify the DOCX file it gets sanitised, and editing it manually would be awkward. would it be easy for you to reduce it to the minimum, so that we can include it in a test case? or do you have any hints about how to do it?

@danse
Copy link
Contributor

danse commented Nov 14, 2017

the problem in the document seems to be simply an inconsistent number of w:tc tags within a w:tr tag. the parser does what seems more natural, it just reads the document without making any assumption nor enforcing any number of cells

what do we want to do?

in order to produce well-formed tables we could count the maximum row size and pad with empty cells in rows that don't reach that number. whether the padding should go to the beginning or the end of the row is arbitrary

where do we want to do it?

modifying the reader would probably mean adding such logic here, but to be honest this is generic logic which could be applied to all the tables read by any reader, so we could opt to keep it outside the reader in a sanitisation layer which would reduce malformed files produced by any writer. since this has a performance cost, we could think of adding a --sanitise or --fix flag

@mb21
Copy link
Collaborator

mb21 commented Nov 14, 2017

I guess @jgm will have to chime in on this. Not sure we want to invest too much time on this with #1024 hopefully being resolved soon...

@rasky
Copy link
Author

rasky commented Nov 14, 2017

@danse sorry I can't help you on reducing the DOCX. I don't know how it was produced, but it's a "real document" that I reduced by simply deleting everything but the table.

What I don't understand is: does the native format allow for tables where each row has a different number of columns? If that's well-defined, then it's a bug in the writer because RST cannot represent it, and we probably need to add empty columns. If instead the native format doesn't allow them, then it is a bug in the DOCX read that produces them... but then I wonder why there is no sanity check that an internal malformed table cannot be created in the first place.

@danse
Copy link
Contributor

danse commented Nov 15, 2017

Not sure we want to invest too much time on this with #1024 hopefully being resolved soon...

as far as i understand, #1024 is about something different

@danse sorry I can't help you on reducing the DOCX. I don't know how it was produced, but it's a "real document" that I reduced by simply deleting everything but the table.

okay i can reduce the document manually

What I don't understand is: does the native format ...

the header and every cell is a distinct list in the native format. sanitisation can be introduced in different points, i guess that it would be better to wait for @jgm to point to a direction

@jgm
Copy link
Owner

jgm commented Nov 15, 2017 via email

danse added a commit to danse/pandoc-types that referenced this issue Nov 22, 2017
@danse
Copy link
Contributor

danse commented Nov 22, 2017

i wrote this test to replicate the problem on pandoc-types, let me know about any feedback so that i can adjust early. i see that in the table builder there is already some logic to extract the max length, so i expect the fix to be easy

@danse
Copy link
Contributor

danse commented Nov 22, 2017

found out that i have to adjust a few details in the test, anyway i'd welcome feedback about the general direction

@danse
Copy link
Contributor

danse commented Nov 23, 2017

okay, i adjusted the test and found a minimal solution which i propose in jgm/pandoc-types#36. since i'm new to the project, i understand that there might be many parts of my pull request that require additional work, please point me to the desired improvements. i also noticed that we might apply similar padding easily to the columns

@danse
Copy link
Contributor

danse commented Apr 11, 2018

finally closed in Pandoc 2.1.3! 🙌

the lack of support for spans causes an output different from what one could expect, but there is no data loss nor syntax error

+-----------------------------------+-----------------------------------+
| **lista dei principali            |                                   |
| cambiamenti rispetto la revisione |                                   |
| precedente:**                     |                                   |
+-----------------------------------+-----------------------------------+
|                                   | Eliminati tutti i riferimenti     |
|                                   | all'articolo 5, comma 1, lettera  |
|                                   | b) del CAD (abolito con dlgs 26   |
|                                   | agosto 2016, n. 179).             |
|                                   |                                   |
|                                   | Sono interessai più paragrafi del |
|                                   | testo.                            |
+-----------------------------------+-----------------------------------+
|                                   | Sostituito l'intero capitolo 2    |
|                                   | "**Errore. L'origine riferimento  |
|                                   | non è stata trovata.**".          |
|                                   |                                   |
|                                   | L'aggiornamento riguarda          |
|                                   | unicamente le modifiche           |
|                                   | introdotte con la monografia      |
|                                   | "Utilizzo del codice IUV per Enti |
|                                   | pluri-intermediati", documento    |
|                                   | sottoposto a consultazione        |
|                                   | pubblica nel mese di maggio 2016. |
+-----------------------------------+-----------------------------------+
|                                   | Sostituito il paragrafo **Errore. |
|                                   | L'origine riferimento non è stata |
|                                   | trovata.** con eliminazione dei   |
|                                   | riferimenti al vecchio testo      |
|                                   | dell'articolo 5 del CAD.          |
+-----------------------------------+-----------------------------------+
|                                   |                                   |
+-----------------------------------+-----------------------------------+
|                                   |                                   |
+-----------------------------------+-----------------------------------+
|                                   |                                   |
+-----------------------------------+-----------------------------------+
|                                   |                                   |
+-----------------------------------+-----------------------------------+

@jgm jgm closed this as completed Apr 11, 2018
mb21 pushed a commit to mb21/pandoc-types that referenced this issue Sep 16, 2018
pad table headers up to max row length to avoid syntax errors, closes…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants