DOCX table with rowspan to RST produces invalid syntax #4059
Comments
Pandoc currently does not support col- or row-spans. All rows need the same number of cells, which this example doesn't have (the header row has one cell, the rows below have two cells). |
Looks like it's #1024 then |
Duplicate of #1024 |
@mb21 thinking again, the output produced by the RST writer is malformed (invalid syntax). I think the writer should at least strive to always produce well-formed RST. In this case, the writer could fall back to generating a table whose number of columns is the maximum of the cell counts across all rows. The table would still be wrong compared to the original, but at least the RST would be well-formed. |
Fair enough... I'm not sure whether we'll get around to that before #1024 though, or whether the docx reader or the rst writer is to blame (see #3648). For the record, your docx input produces something like the following with
|
i don't see the connection with #3648. as far as i understand, here the translation goes through a native table where the number of cells is not consistent. i see three places where we could try to fix this:
i would exclude 3 because it would require to work on all writers, and i am not sure that a layer like 2 exists in the code base. therefore if you agree i would try to reproduce this behaviour in the test suite for the DOCX reader. i tried to look into the source of the DOCX and reducing it to a minimal case doesn't seem easy to me, so i would add the document here to |
@danse I guess I agree with you. The reader should be changed to always produce a table representation where all lists have the same length (this is how #3648 was resolved as well).
I would at least make sure it's only two rows with one word in each cell, to keep the tests somewhat readable. |
@rasky when i try to modify the DOCX file it gets sanitised, and editing it manually would be awkward. would it be easy for you to reduce it to the minimum, so that we can include it in a test case? or do you have any hints about how to do it? |
the problem in the document seems to be simply an inconsistent number of
what do we want to do?
in order to produce well-formed tables we could count the maximum row size and pad with empty cells in rows that don't reach that number. whether the padding should go to the beginning or the end of the row is arbitrary
where do we want to do it?
modifying the reader would probably mean adding such logic here, but to be honest this is generic logic which could be applied to all the tables read by any reader, so we could opt to keep it outside the reader in a sanitisation layer which would reduce malformed files produced by any writer. since this has a performance cost, we could think of adding a |
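The padding idea above can be sketched in a few lines of Haskell. This is only an illustration on plain lists: `PadSide` and `padRows` are hypothetical names, not part of pandoc or pandoc-types, and the choice of side is left to the caller since the comment notes it is arbitrary.

```haskell
-- Hypothetical helper (not pandoc API): pad ragged rows to a uniform width.
data PadSide = PadStart | PadEnd

-- Pads every row with `filler` until it is as long as the longest row.
padRows :: PadSide -> a -> [[a]] -> [[a]]
padRows side filler rows = map pad rows
  where
    -- Prepending 0 keeps `maximum` total when there are no rows at all.
    width = maximum (0 : map length rows)
    pad row =
      let missing = replicate (width - length row) filler
      in case side of
           PadStart -> missing ++ row
           PadEnd   -> row ++ missing
```

Padding at the end keeps existing cells in their original columns, which is probably the less surprising default for a table whose first row is simply too short.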
@danse sorry I can't help you on reducing the DOCX. I don't know how it was produced, but it's a "real document" that I reduced by simply deleting everything but the table. What I don't understand is: does the native format allow for tables where each row has a different number of columns? If that's well-defined, then it's a bug in the writer because RST cannot represent it, and we probably need to add empty columns. If instead the native format doesn't allow them, then it is a bug in the DOCX reader that produces them... but then I wonder why there is no sanity check to ensure that a malformed internal table cannot be created in the first place. |
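For illustration, the kind of sanity check asked about here could look like a smart constructor that rejects ragged tables. This is a sketch on a simplified stand-in type: `SimpleTable` and `mkTable` are hypothetical and do not reflect how pandoc-types defines or constructs tables.

```haskell
-- Simplified stand-in for a table: one header row plus body rows of text cells.
data SimpleTable = SimpleTable [String] [[String]]
  deriving (Eq, Show)

-- Hypothetical smart constructor: succeeds only when every body row has
-- exactly as many cells as the header.
mkTable :: [String] -> [[String]] -> Either String SimpleTable
mkTable header rows
  | all ((== width) . length) rows = Right (SimpleTable header rows)
  | otherwise =
      Left ("expected every row to have " ++ show width ++ " cell(s)")
  where
    width = length header
```

With the table from this issue (one header cell, two cells in each body row), `mkTable` would return a `Left`, so the inconsistency would surface when the table is built rather than when it is written out.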
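regarding "Not sure we want to invest too much time on this with #1024 hopefully being resolved soon...": as far as i understand, #1024 is about something different
regarding reducing the DOCX: okay, i can reduce the document manually
regarding "does the native format ...": the header and every cell is a distinct list in the native format (https://github.com/jgm/pandoc-types/blob/master/Text/Pandoc/Definition.hs#L229-L233). sanitisation can be introduced at different points, i guess that it would be better to wait for @jgm to point to a direction |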
Probably the easiest immediate fix would be to change the
table builders (in Text.Pandoc.Builder) so they add empty
cells to ensure that the table is of uniform width.
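A minimal sketch of what such a builder-level fix might do, written against stand-in types so it does not depend on the real `Text.Pandoc.Builder` signatures: `Cell` and `uniformTable` are hypothetical names, and a cell is modelled as a list so that the empty list can serve as the empty cell.

```haskell
-- Stand-in for a table cell: a list of blocks, here just strings.
type Cell = [String]

-- Hypothetical normaliser: pad the header and every body row with empty
-- cells so that all of them share the width of the widest one.
uniformTable :: [Cell] -> [[Cell]] -> ([Cell], [[Cell]])
uniformTable header rows = (pad header, map pad rows)
  where
    -- `header : rows` is never empty, so `maximum` is safe here.
    width = maximum (map length (header : rows))
    pad cells = cells ++ replicate (width - length cells) []
```

Applied to the shape reported in this issue, a one-cell header above two-cell rows, the header gains one empty cell and the body rows are left unchanged.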
|
i wrote this test to replicate the problem on pandoc-types, let me know about any feedback so that i can adjust early. i see that in the |
found out that i have to adjust a few details in the test, anyway i'd welcome feedback about the general direction |
okay, i adjusted the test and found a minimal solution which i propose in jgm/pandoc-types#36. since i'm new to the project, i understand that there might be many parts of my pull request that require additional work, please point me to the desired improvements. i also noticed that we might apply similar padding easily to the columns |
finally closed in Pandoc 2.1.3! 🙌 the lack of support for spans causes an output different from what one could expect, but there is no data loss nor syntax error
|
pad table headers up to max row length to avoid syntax errors, closes…
The attached minimal DOCX contains what appears to be a single table with 1 column and 1 header row. When generating RST output, malformed syntax is produced: the header row has 1 column, but the other rows have 2 columns. I don't know whether the bug is in the DOCX reader or the RST writer.
Command line:
pandoc --from=docx --to=rst single-table.docx --output=single-table.rst
Verified with pandoc 2.0.1.1.
single-table.docx