pandoc does not pick up the figure which pandoc-crossref specifies using "<figure" #9720

Closed
jiucenglou opened this issue May 3, 2024 · 22 comments

Comments

@jiucenglou

jiucenglou commented May 3, 2024

Explain the problem.

I have a folder tree of

user@localhost:/mnt/d/mwe3$ tree .
.
├── Ch3
│   ├── Ch3.md
│   ├── Ch3_tmp.docx
│   ├── Ch3_tmp.md
│   └── img
│       └── mech.jpg
├── pandoc
└── pandoc-crossref

Ch3.md is

# title

## results

![mech scheme.](Ch3/./img/mech.jpg){#fig:mech height=12.09cm }

and I am using the following two runs to get docx

./pandoc  -F pandoc-crossref Ch3/Ch3.md --resource-path=Ch3 -o Ch3/Ch3_tmp.md
./pandoc  Ch3/Ch3_tmp.md -o Ch3/Ch3_tmp.docx

As shown below for the intermediate Ch3_tmp.md, the latest pandoc and pandoc-crossref now emit the figure as a raw <figure element.
However, the second run above then generates a docx file without the figure in it.
With pandoc 2.19 and the compatible pandoc-crossref, <figure is not used and the resulting docx file contains the figure.
Could you suggest how I can use the latest pandoc to generate a docx file with the figure in it?
Many thanks!

# title

## results

<figure id="fig:mech">
<img src="Ch3/./img/mech.jpg" style="height:12.09cm"
alt="mech scheme." />
<figcaption>Figure 1: mech scheme.</figcaption>
</figure>

Pandoc version?
latest (3.1.13)

@jiucenglou jiucenglou added the bug label May 3, 2024
@jgm
Owner

jgm commented May 3, 2024

Please report this to pandoc-crossref instead.

@jgm jgm closed this as completed May 3, 2024
@jiucenglou
Author

> Please report this to pandoc-crossref instead.

Thank you for the pointer! Are you suggesting that pandoc-crossref should not generate <figure in Markdown output in the first place? Does pandoc ignore <figure in Markdown input?

@jgm
Owner

jgm commented May 3, 2024

OK. Actually, this may point to something that can be done in pandoc.

@jgm jgm reopened this May 3, 2024
@jgm
Owner

jgm commented May 3, 2024

I imagine that pandoc-crossref is inserting something like this into the AST:

[ Figure
    ( "fig:mech" , [] , [] )
    (Caption
       Nothing [ Plain [ Str "The" , Space , Str "caption" ] ])
    [ Plain
        [ Image
            ( "" , [] , [ ( "style" , "height:12.09cm" ) , ( "alt" , "alt text" ) ] )
            [ Str "scheme" ]
            ( "myfig.jpg" , "" )
        ]
    ]
]

The problem is that pandoc's markdown writer will render this as HTML. And then, if you try to go from that markdown to docx, the raw HTML will disappear.

Why does the markdown writer use raw HTML here? I'm not sure. You can disable raw HTML, though, with -t markdown-raw_html and then you'll get something like

:::: {#fig:mech .figure}
![mech scheme.](Ch3/./img/mech.jpg){style="height:12.09cm"}

::: caption
Figure 1: mech scheme.
:::
::::

and that, I think, will go through to docx.

I think the markdown writer should probably just generate a standard implicit_figures style figure here, so let's consider this a change request for the markdown writer.
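
For anyone landing here with the same setup, the -t markdown-raw_html workaround from the comment above could be wrapped roughly like this. It is only a sketch: the function name md_roundtrip_no_raw_html and the PANDOC override variable are made up for illustration, and whether the fenced-div figure survives into docx is, as stated above, expected rather than verified.

```shell
# Sketch: keep Markdown as the intermediate format, but disable the
# raw_html extension in the writer so the figure is written as fenced
# divs instead of a raw <figure> element.
# PANDOC is an assumed override; it defaults to "pandoc" on PATH.
md_roundtrip_no_raw_html() {
  # $1 = Markdown source, $2 = intermediate Markdown, $3 = final docx
  "${PANDOC:-pandoc}" -F pandoc-crossref "$1" \
    --resource-path="$(dirname "$1")" \
    -t markdown-raw_html -o "$2" &&
  "${PANDOC:-pandoc}" "$2" -o "$3"
}
```

With the layout from the original report, this would be called as `md_roundtrip_no_raw_html Ch3/Ch3.md Ch3/Ch3_tmp.md Ch3/Ch3_tmp.docx`.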

@jgm
Owner

jgm commented May 3, 2024

In retrospect I don't think this is a problem for pandoc-crossref, so you can cancel any request you made there.

@jgm
Owner

jgm commented May 3, 2024

OK, I see what is going on here.

The HTML you display above was probably the result of rendering this AST element (inserted by pandoc-crossref):

[ Figure
    ( "fig:mech" , [] , [] )
    (Caption
       Nothing
       [ Plain
           [ Str "Figure"
           , Space
           , Str "1:"
           , Space
           , Str "mech"
           , Space
           , Str "scheme."
           ]
       ])
    [ Plain
        [ Image
            ( "" , [] , [ ( "style" , "height:12.09cm" ) ] )
            [ Str "mech" , Space , Str "scheme." ]
            ( "Ch3/./img/mech.jpg" , "" )
        ]
    ]
]

In deciding whether to use an implicit figure, the markdown writer tries to determine whether this representation would capture all of the information in this Figure element. One case in which it wouldn't is the case where the image has an image description/alt text that is different from the figure's caption. (An implicit figure just takes the caption from what would otherwise be the image's alt text.) So the writer tests for this. Notice that the caption and the image description are almost the same in this case: the difference is that the caption also includes the label "Figure 1:". Anyway, it's because of that that we fall back to raw HTML.

I suppose one way around this would be to just check that the suffix of the Caption matches the image description. This might lead to some false positives, but it's probably fairly reliable.
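
To make the suffix idea concrete, here is a rough string-level sketch in shell. It is not pandoc's code: the real check lives in the Haskell Markdown writer and compares AST inlines rather than plain strings, and the function name caption_ends_with_alt is hypothetical.

```shell
# caption_ends_with_alt CAPTION ALT
# Succeeds (exit status 0) when ALT is non-empty and CAPTION ends with ALT.
caption_ends_with_alt() {
  # An empty description would match any caption, so reject it up front.
  [ -n "$2" ] || return 1
  case "$1" in
    *"$2") return 0 ;;   # quoted part is matched literally, not as a glob
    *)     return 1 ;;
  esac
}
```

Called with the strings from this issue, `caption_ends_with_alt "Figure 1: mech scheme." "mech scheme."` succeeds. So does `caption_ends_with_alt "A new mech scheme." "mech scheme."`, which is the kind of false positive mentioned above.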

@jgm
Owner

jgm commented May 3, 2024

What I'm not sure about is what we should do in the case where the suffix matches. Should the image description in the implicit figure include the "Figure 1:" part or not? If it does, then we might get bad results in formats that add a figure number (e.g. latex/pdf).

@jiucenglou
Author

> What I'm not sure about is what we should do in the case where the suffix matches. Should the image description in the implicit figure include the "Figure 1:" part or not? If it does, then we might get bad results in formats that add a figure number (e.g. latex/pdf).

It seems to me that pandoc-crossref, when invoked, is responsible for naming the figures (and the tables and the equations). Would this information help with the decision? :D

@jgm
Owner

jgm commented May 4, 2024

I think we need feedback from @lierdakil on this.

@lierdakil
Contributor

lierdakil commented May 4, 2024

I'm a bit confused by the premise: converting to Markdown through pandoc-crossref then converting the output to docx. I don't know what you're trying to do, but it sounds like using native/json as intermediary format would resolve this, no?

@lierdakil
Contributor

Honestly, Markdown-to-Markdown conversions were never a target, and in Pandoc, Markdown is not guaranteed to round-trip in the first place. I could make a patch changing the alt text to match the caption though 🤷

@jiucenglou
Author

jiucenglou commented May 4, 2024

> I'm a bit confused by the premise: converting to Markdown through pandoc-crossref then converting the output to docx. I don't know what you're trying to do, but it sounds like using native/json as intermediary format would resolve this, no?

The reason my workflow depends on intermediate Markdown files is chapter-wise references (more precisely, chapter-wise bibliographies, if I remember correctly) :D. A few years ago I read about this idea in the Google discussion group (I cannot find the thread now, since the group is not accessible...).

@lierdakil
Contributor

> chapter-wise bibliography

I don't necessarily see if that would prevent you from using native instead of markdown as an intermediary format. Does it? Because if not, that's an overall more robust approach, while roundtrips via Markdown are not guaranteed, native is essentially a snapshot of the AST.

@lierdakil
Contributor

lierdakil commented May 4, 2024

Anyway, probably worth making the change regardless.

This should work: lierdakil/pandoc-crossref@5f2b087

There is a bit of a twist, however. In some cases, pandoc-crossref will add attributes on the Figure element. If that happens, the resulting figure is impossible to represent in Markdown any more, so Pandoc will go back to representing it as raw HTML (if enabled) or nested divs. This does require explicit opt-in via pandoc-crossref configuration, and I don't really see a workaround, so I'm inclined to leave it be.

@jiucenglou if you could test this commit for your use case and report back, that would be nice. Automatic builds will (edit: well, should, can't promise that, CI is a bit flaky) become available at the following links once CI finishes (in an hour or two probably):

P.S. I'll make a release proper probably tomorrow lest I forget.

@jiucenglou
Author

> > chapter-wise bibliography
>
> I don't necessarily see if that would prevent you from using native instead of markdown as an intermediary format. Does it? Because if not, that's an overall more robust approach, while roundtrips via Markdown are not guaranteed, native is essentially a snapshot of the AST.

Many thanks! Using native as the intermediate format, with the commands below, seems to work very well.

./pandoc  -F pandoc-crossref Ch3/Ch3.md --resource-path=Ch3 -t native -o Ch3/Ch3_tmp.txt
./pandoc -f native  Ch3/Ch3_tmp.txt -o Ch3/Ch3_tmp.docx

@jiucenglou
Author

> Anyway, probably worth making the change regardless.
>
> This should work: lierdakil/pandoc-crossref@5f2b087
>
> There is a bit of a twist, however. In some cases, pandoc-crossref will add attributes on the Figure element. If that happens, the resulting figure is impossible to represent in Markdown any more, so Pandoc will go back to representing it as raw HTML (if enabled) or nested divs. This does require explicit opt-in via pandoc-crossref configuration, and I don't really see a workaround, so I'm inclined to leave it be.
>
> @jiucenglou if you could test this commit for your use case and report back, that would be nice. Automatic builds will (edit: well, should, can't promise that, CI is a bit flaky) become available at the following links once CI finishes (in an hour or two probably):
>
> P.S. I'll make a release proper probably tomorrow lest I forget.

I can test and report back. Would you suggest that I keep using native as the intermediate format even with the new patch?

@lierdakil
Contributor

> Would you suggest that I keep using native as the intermediate format even with the new patch?

I don't know the particulars of your setup, so it's up to you. If you don't really care about the intermediate format, native or json would be the best choice if it works, as they're guaranteed to preserve the AST. OTOH, if you want to do some postprocessing on the intermediate files (not with pandoc filters), use whatever you can postprocess 🤷
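
A sketch of the same two-step workflow with JSON as the intermediate format, for comparison with the native commands earlier in the thread. The function name via_json and the PANDOC variable are placeholders, not anything pandoc defines; -t json and -f json are pandoc's regular JSON AST formats.

```shell
# Sketch: run pandoc-crossref once, store the resulting AST as JSON,
# then render the stored AST to the final format in a second run.
# PANDOC is an assumed override; it defaults to "pandoc" on PATH.
via_json() {
  # $1 = Markdown source, $2 = intermediate JSON file, $3 = final output
  "${PANDOC:-pandoc}" -F pandoc-crossref "$1" \
    --resource-path="$(dirname "$1")" \
    -t json -o "$2" &&
  "${PANDOC:-pandoc}" -f json "$2" -o "$3"
}
```

For the folder layout in the report this would be `via_json Ch3/Ch3.md Ch3/Ch3_tmp.json Ch3/Ch3_tmp.docx`. The intermediate file is not meant to be edited by hand, which matches the caveat above about postprocessing.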

@jiucenglou
Author

jiucenglou commented May 4, 2024

> > chapter-wise bibliography
>
> I don't necessarily see if that would prevent you from using native instead of markdown as an intermediary format. Does it? Because if not, that's an overall more robust approach, while roundtrips via Markdown are not guaranteed, native is essentially a snapshot of the AST.

In my real use case, the two command lines look like

"${Pandoc}"  "${Header}"  "${TmpMd2}"  -F pandoc-crossref  --citeproc    --csl="${CiteStyle}"  -t markdown-citations  -o "${TmpMd3}"  --wrap=preserve --resource-path=$(dirname "${TmpMd2}")
"${Pandoc}"  "${Header}"  "${TmpMd3}"  --fail-if-warnings  -L Dry12_for_docx.lua  -L skip_placeholder.lua  -L mhchem.lua  --reference-doc="${RefWordDocx}"  -s -o "${MSWord}"

To be clear, the first run uses the -t markdown-citations option to generate the chapter-wise bibliography. Could you tell me whether plain -t native would work, or whether I should use something like -t native-citations? Many thanks!

@lierdakil
Contributor

As native preserves the whole AST, it also preserves the result of --citeproc. So it shouldn't need any qualifiers. For example, the command pandoc --citeproc -t native /tmp/test.md | pandoc -f native -t docx -o /tmp/test.docx produces the following docx:
image

test.md is as follows:

---
references:
- type: article-journal
  id: WatsonCrick1953
  author:
  - family: Watson
    given: J. D.
  - family: Crick
    given: F. H. C.
  issued:
    date-parts:
    - - 1953
      - 4
      - 25
  title: 'Molecular structure of nucleic acids: a structure for
    deoxyribose nucleic acid'
  title-short: Molecular structure of nucleic acids
  container-title: Nature
  volume: 171
  issue: 4356
  page: 737-738
  DOI: 10.1038/171737a0
  URL: https://www.nature.com/articles/171737a0
  language: en-GB
---

@WatsonCrick1953

@jiucenglou
Author

jiucenglou commented May 4, 2024

As shown below, I tried native on my real use case and got a "couldn't read native" error on the second run.
I have not been able to produce a minimal working example yet, but I will post again once I have one.

      "${Pandoc}"  "${Header}"  "${TmpMd2}"  -F pandoc-crossref  --citeproc    --csl="${CiteStyle}"  -t markdown-citations  -o "${TmpMd3}"  --wrap=preserve --resource-path=$(dirname "${TmpMd2}")
      "${Pandoc}"  "${Header}"  "${TmpMd3}"  --fail-if-warnings  -L Dry12_for_docx.lua  -L skip_placeholder.lua  -L mhchem.lua  --reference-doc="${RefWordDocx}"  -s -o "${MSWord}"
      "${Pandoc}"  "${Header}"  "${TmpMd2}"  -F pandoc-crossref  --citeproc    --csl="${CiteStyle}"  -t native  -o "${TmpMd3}"  --wrap=preserve --resource-path=$(dirname "${TmpMd2}")
      "${Pandoc}"  "${Header}"  "${TmpMd3}"  -f native --fail-if-warnings  -L Dry12_for_docx.lua  -L skip_placeholder.lua  -L mhchem.lua  --reference-doc="${RefWordDocx}"  -s -o "${MSWord}"

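One thing that may be worth checking while building the minimal example (this is a guess, nothing in the thread confirms it as the cause): pandoc concatenates all input files and parses them with the single reader selected by -f, so with -f native every input, "${Header}" included, has to be valid native. The sketch below tries each input on its own; check_native_inputs and the PANDOC variable are hypothetical names.

```shell
# Sketch: report which of the given files pandoc cannot parse as native.
# Each file is read separately and the output is discarded.
# PANDOC is an assumed override; it defaults to "pandoc" on PATH.
check_native_inputs() {
  for f in "$@"; do
    if ! "${PANDOC:-pandoc}" -f native "$f" -t native -o /dev/null; then
      echo "not readable as native: $f"
    fi
  done
}
```

For the second command above, the call would be `check_native_inputs "${Header}" "${TmpMd3}"`. If only the header file is reported, the intermediate native file itself is probably fine.
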
@lierdakil
Contributor

Meanwhile, turns out I forgot to update some tests, so that CI build failed. Anyway, I'll just cut a release I guess, and we can do another one if this doesn't work out for some reason. For future reference, 0.3.17.1 (artefacts not yet built, but this time CI should finish fine 🤞)

@jgm
Owner

jgm commented May 5, 2024

Thanks @lierdakil - it looks like this isn't going to require pandoc changes, so I'll close this issue.

@jgm jgm closed this as completed May 5, 2024