pandoc does not pick up the figure which pandoc-crossref specifies using "<figure" #9720

jiucenglou · 2024-05-03T19:51:08Z

Explain the problem.

I have a folder tree of

user@localhost:/mnt/d/mwe3$ tree .
.
├── Ch3
│   ├── Ch3.md
│   ├── Ch3_tmp.docx
│   ├── Ch3_tmp.md
│   └── img
│       └── mech.jpg
├── pandoc
└── pandoc-crossref

Ch3.md is

# title

## results

![mech scheme.](Ch3/./img/mech.jpg){#fig:mech height=12.09cm }

and I am using the following two runs to get docx

./pandoc  -F pandoc-crossref Ch3/Ch3.md --resource-path=Ch3 -o Ch3/Ch3_tmp.md
./pandoc  Ch3/Ch3_tmp.md -o Ch3/Ch3_tmp.docx

As the intermediate Ch3_tmp.md below shows, the latest pandoc and pandoc-crossref now write the figure as a raw HTML <figure> element.
However, the second run above then generates a docx file without the figure in it.
With pandoc 2.19 and a compatible pandoc-crossref, <figure> is not used and the resulting docx file contains the figure.
Could you suggest how I can use the latest pandoc to generate a docx file with the figure in it?
Many thanks!

# title

## results

<figure id="fig:mech">
<img src="Ch3/./img/mech.jpg" style="height:12.09cm"
alt="mech scheme." />
<figcaption>Figure 1: mech scheme.</figcaption>
</figure>

Pandoc version?
latest 3.13

jgm · 2024-05-03T20:12:45Z

Please report this to pandoc-crossref instead.

jiucenglou · 2024-05-03T20:34:23Z

> Please report this to pandoc-crossref instead.

Thank you for the pointer! Do you mean that pandoc-crossref should not generate <figure> in Markdown output in the first place? Does pandoc ignore <figure> in Markdown input?

jgm · 2024-05-03T21:34:07Z

OK. Actually, this may point to something that can be done in pandoc.

jgm · 2024-05-03T21:39:15Z

I imagine that pandoc-crossref is inserting something like this into the AST:

[ Figure
    ( "fig:mech" , [] , [] )
    (Caption
       Nothing [ Plain [ Str "The" , Space , Str "caption" ] ])
    [ Plain
        [ Image
            ( "" , [] , [ ("style", "height:12.09cm"), ("alt", "alt text")])
            [ Str "scheme" ]
            ( "myfig.jpg" , "" )
        ]
    ]
]

The problem is that pandoc's markdown writer will render this as HTML. And then, if you try to go from that markdown to docx, the raw HTML will disappear.

Why does the markdown writer use raw HTML here? I'm not sure. You can disable raw HTML, though, with -t markdown-raw_html and then you'll get something like

:::: {#fig:mech .figure}
![mech scheme.](Ch3/./img/mech.jpg){style="height:12.09cm"}

::: caption
Figure 1: mech scheme.
:::
::::

and that, I think, will go through to docx.
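
A minimal sketch of the workaround described above, assuming pandoc and pandoc-crossref are installed and the file layout from the original report. The PANDOC variable and the function name md_to_docx_no_raw_html are hypothetical helpers for illustration, not part of pandoc, and the thread does not confirm that the fenced-div form survives the second pass as a true figure.

```shell
#!/bin/sh
# Hypothetical wrapper; override the binary with e.g. PANDOC=./pandoc.
PANDOC="${PANDOC:-pandoc}"

# md_to_docx_no_raw_html SRC TMP OUT
md_to_docx_no_raw_html() {
  src="$1"; tmp="$2"; out="$3"
  # Pass 1: apply pandoc-crossref and write Markdown with the raw_html
  # extension disabled, so the figure is written as fenced divs
  # instead of a raw <figure> element.
  "$PANDOC" -F pandoc-crossref "$src" -t markdown-raw_html -o "$tmp" || return 1
  # Pass 2: convert the intermediate Markdown to docx.
  "$PANDOC" "$tmp" -o "$out"
}
```

With the layout from the report, a call could look like: PANDOC=./pandoc md_to_docx_no_raw_html Ch3/Ch3.md Ch3/Ch3_tmp.md Ch3/Ch3_tmp.docx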

I think the markdown writer should probably just generate a standard implicit_figures style figure here, so let's consider this a change request for the markdown writer.

jgm · 2024-05-03T21:40:25Z

In retrospect I don't think this is a problem for pandoc-crossref, so you can cancel any request you made there.

jgm · 2024-05-03T22:25:01Z

OK, I see what is going on here.

The HTML you display above was probably the result of rendering this AST element (inserted by pandoc-crossref):

[ Figure
    ( "fig:mech" , [] , [] )
    (Caption
       Nothing
       [ Plain
           [ Str "Figure"
           , Space
           , Str "1:"
           , Space
           , Str "mech"
           , Space
           , Str "scheme."
           ]
       ])
    [ Plain
        [ Image
            ( "" , [] , [ ( "style" , "height:12.09cm" ) ] )
            [ Str "mech" , Space , Str "scheme." ]
            ( "Ch3/./img/mech.jpg" , "" )
        ]
    ]
]

In deciding whether to use an implicit figure, the markdown writer tries to determine whether this representation would capture all of the information in this Figure element. One case in which it wouldn't is the case where the image has an image description/alt text that is different from the figure's caption. (An implicit figure just takes the caption from what would otherwise be the image's alt text.) So the writer tests for this. Notice that the caption and the image description are almost the same in this case: the difference is that the caption also includes the label "Figure 1:". Anyway, it's because of that that we fall back to raw HTML.

I suppose one way around this would be to just check that the suffix of the Caption matches the image description. This might lead to some false positives, but it's probably fairly reliable.
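
The suffix idea can be illustrated at the string level. This is only a sketch: pandoc's Markdown writer is written in Haskell and compares AST inlines, not plain strings, and the function name alt_is_caption_suffix is hypothetical.

```shell
#!/bin/sh
# alt_is_caption_suffix CAPTION ALT
# Succeeds (exit status 0) when ALT is non-empty and CAPTION ends with ALT.
alt_is_caption_suffix() {
  caption="$1"; alt="$2"
  [ -n "$alt" ] || return 1        # an empty suffix would match everything
  case "$caption" in
    *"$alt") return 0 ;;           # quoted part is matched literally
    *)       return 1 ;;
  esac
}

# Example with the caption and image description from this issue.
if alt_is_caption_suffix "Figure 1: mech scheme." "mech scheme."; then
  echo "suffix matches"
fi
```

A check like this accepts any caption that merely happens to end with the image description, which is the false-positive risk mentioned above.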

jgm · 2024-05-03T22:27:06Z

What I'm not sure about is what we should do in the case where the suffix matches. Should the image description in the implicit figure include the "Figure 1:" part or not? If it does, then we might get bad results in formats that add a figure number (e.g. latex/pdf).

jiucenglou · 2024-05-04T09:36:48Z

> What I'm not sure about is what we should do in the case where the suffix matches. Should the image description in the implicit figure include the "Figure 1:" part or not? If it does, then we might get bad results in formats that add a figure number (e.g. latex/pdf).

It seems to me that pandoc-crossref, when invoked, is responsible for labelling the figures (and the tables and equations). Would this information help with the decision? :D

jgm · 2024-05-04T16:58:33Z

I think we need feedback from @lierdakil on this.

lierdakil · 2024-05-04T17:09:03Z

I'm a bit confused by the premise: converting to Markdown through pandoc-crossref then converting the output to docx. I don't know what you're trying to do, but it sounds like using native/json as intermediary format would resolve this, no?

lierdakil · 2024-05-04T17:13:49Z

Honestly, Markdown-to-Markdown conversions were never a target, and in Pandoc, Markdown is not guaranteed to round-trip in the first place. I could make a patch changing the alt text to match the caption though 🤷

jiucenglou · 2024-05-04T17:52:17Z

> I'm a bit confused by the premise: converting to Markdown through pandoc-crossref then converting the output to docx. I don't know what you're trying to do, but it sounds like using native/json as intermediary format would resolve this, no?

My workflow depends on intermediate Markdown files because of chapter-wise references (chapter-wise bibliography, if I remember correctly) :D. A few years ago I read about this idea in the Google discussion group (I cannot find the post now, since the group is not accessible).

lierdakil · 2024-05-04T17:59:46Z

> chapter-wise bibliography

I don't see why that would prevent you from using native instead of Markdown as the intermediate format. Does it? If not, native is the more robust approach overall: round trips via Markdown are not guaranteed, while native is essentially a snapshot of the AST.

lierdakil · 2024-05-04T18:00:44Z

Anyway, probably worth making the change regardless.

This should work: lierdakil/pandoc-crossref@5f2b087

There is a bit of a twist, however. In some cases, pandoc-crossref will add attributes on the Figure element. If that happens, the resulting figure is impossible to represent in Markdown any more, so Pandoc will go back to representing it as raw HTML (if enabled) or nested divs. This does require explicit opt-in via pandoc-crossref configuration, and I don't really see a workaround, so I'm inclined to leave it be.

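@jiucenglou if you could test this commit for your use case and report back, that would be nice. Automatic builds will (edit: well, should, can't promise that, CI is a bit flaky) become available at the following links once CI finishes (in an hour or two probably):

https://github.com/lierdakil/pandoc-crossref/releases/download/nightlies/pandoc-crossref-master-Linux-20240504-5f2b087.tar.xz

https://github.com/lierdakil/pandoc-crossref/releases/download/nightlies/pandoc-crossref-master-macOS-20240504-5f2b087.tar.xz

https://github.com/lierdakil/pandoc-crossref/releases/download/nightlies/pandoc-crossref-master-Windows-20240504-5f2b087.7z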

P.S. I'll make a release proper probably tomorrow lest I forget.

jiucenglou · 2024-05-04T18:37:12Z

> chapter-wise bibliography
>
> I don't necessarily see if that would prevent you from using native instead of markdown as an intermediary format. Does it? Because if not, that's an overall more robust approach, while roundtrips via Markdown are not guaranteed, native is essentially a snapshot of the AST.

Many thanks! Using native as the intermediate format, with the commands below, seems to work very well:

./pandoc  -F pandoc-crossref Ch3/Ch3.md --resource-path=Ch3 -t native -o Ch3/Ch3_tmp.txt
./pandoc -f native  Ch3/Ch3_tmp.txt -o Ch3/Ch3_tmp.docx

jiucenglou · 2024-05-04T18:41:24Z

> Anyway, probably worth making the change regardless.
>
> This should work: lierdakil/pandoc-crossref@5f2b087
>
> There is a bit of a twist, however. In some cases, pandoc-crossref will add attributes on the Figure element. If that happens, the resulting figure is impossible to represent in Markdown any more, so Pandoc will go back to representing it as raw HTML (if enabled) or nested divs. This does require explicit opt-in via pandoc-crossref configuration, and I don't really see a workaround, so I'm inclined to leave it be.
>
> @jiucenglou if you could test this commit for your use case and report back, that would be nice. Automatic builds will (edit: well, should, can't promise that, CI is a bit flaky) become available at the following links once CI finishes (in an hour or two probably):
>
> https://github.com/lierdakil/pandoc-crossref/releases/download/nightlies/pandoc-crossref-master-Linux-20240504-5f2b087.tar.xz
>
> https://github.com/lierdakil/pandoc-crossref/releases/download/nightlies/pandoc-crossref-master-macOS-20240504-5f2b087.tar.xz
>
> https://github.com/lierdakil/pandoc-crossref/releases/download/nightlies/pandoc-crossref-master-Windows-20240504-5f2b087.7z
>
> P.S. I'll make a release proper probably tomorrow lest I forget.

I can test and report back. Would you recommend that I keep using native as the intermediate format even with the new patch?

lierdakil · 2024-05-04T18:44:51Z

> Would you suggest to keep using native as intermediate format even with the new patch ?

I don't know the particulars of your setup, so it's up to you. If you don't really care about the intermediate format, native or json would be the best choice if it works, as they're guaranteed to preserve the AST. OTOH, if you want to do some postprocessing on the intermediate files (not with pandoc filters), use whatever you can postprocess 🤷

jiucenglou · 2024-05-04T18:47:05Z

> chapter-wise bibliography
>
> I don't necessarily see if that would prevent you from using native instead of markdown as an intermediary format. Does it? Because if not, that's an overall more robust approach, while roundtrips via Markdown are not guaranteed, native is essentially a snapshot of the AST.

In my real use case, the two command lines look like this:

"${Pandoc}"  "${Header}"  "${TmpMd2}"  -F pandoc-crossref  --citeproc    --csl="${CiteStyle}"  -t markdown-citations  -o "${TmpMd3}"  --wrap=preserve --resource-path=$(dirname "${TmpMd2}")
"${Pandoc}"  "${Header}"  "${TmpMd3}"  --fail-if-warnings  -L Dry12_for_docx.lua  -L skip_placeholder.lua  -L mhchem.lua  --reference-doc="${RefWordDocx}"  -s -o "${MSWord}"

That is, the first run uses -t markdown-citations to generate the chapter-wise bibliography. Could you advise whether plain -t native would work, or whether I should use something like -t native-citations? Many thanks!

lierdakil · 2024-05-04T18:58:21Z

As native preserves the whole AST, it also preserves the result of --citeproc. So it shouldn't need any qualifiers. For example, the command pandoc --citeproc -t native /tmp/test.md | pandoc -f native -t docx -o /tmp/test.docx produces the following docx:

[screenshot of the resulting docx]

test.md is as follows:

---
references:
- type: article-journal
  id: WatsonCrick1953
  author:
  - family: Watson
    given: J. D.
  - family: Crick
    given: F. H. C.
  issued:
    date-parts:
    - - 1953
      - 4
      - 25
  title: 'Molecular structure of nucleic acids: a structure for
    deoxyribose nucleic acid'
  title-short: Molecular structure of nucleic acids
  container-title: Nature
  volume: 171
  issue: 4356
  page: 737-738
  DOI: 10.1038/171737a0
  URL: https://www.nature.com/articles/171737a0
  language: en-GB
---

@WatsonCrick1953

jiucenglou · 2024-05-04T19:27:14Z

I tried native in my real use case, as shown below (the original pair of commands first, then the native variant), and got a "couldn't read native" error on the second run.
I cannot produce a minimal working example for the time being, but will post again if I manage to.

      "${Pandoc}"  "${Header}"  "${TmpMd2}"  -F pandoc-crossref  --citeproc    --csl="${CiteStyle}"  -t markdown-citations  -o "${TmpMd3}"  --wrap=preserve --resource-path=$(dirname "${TmpMd2}")
      "${Pandoc}"  "${Header}"  "${TmpMd3}"  --fail-if-warnings  -L Dry12_for_docx.lua  -L skip_placeholder.lua  -L mhchem.lua  --reference-doc="${RefWordDocx}"  -s -o "${MSWord}"

      "${Pandoc}"  "${Header}"  "${TmpMd2}"  -F pandoc-crossref  --citeproc    --csl="${CiteStyle}"  -t native  -o "${TmpMd3}"  --wrap=preserve --resource-path=$(dirname "${TmpMd2}")
      "${Pandoc}"  "${Header}"  "${TmpMd3}"  -f native --fail-if-warnings  -L Dry12_for_docx.lua  -L skip_placeholder.lua  -L mhchem.lua  --reference-doc="${RefWordDocx}"  -s -o "${MSWord}"

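One possible cause, not confirmed anywhere in this thread: with -f native, pandoc parses every input file with the native reader, and the second run above still passes "${Header}" as an input. If that file is Markdown or a YAML metadata block (an assumption, since its contents are not shown), the native reader would reject it. The sketch below keeps the header in the first run only and uses JSON as the intermediate format, which carries the document metadata along with the body. The function name chapter_to_docx, its argument order, and the PANDOC variable are hypothetical; the --csl, Lua filter, and --reference-doc options from the real commands are left out for brevity.

```shell
#!/bin/sh
# Hypothetical wrapper; override the binary with e.g. PANDOC=./pandoc.
PANDOC="${PANDOC:-pandoc}"

# chapter_to_docx HEADER SRC TMP_JSON OUT
chapter_to_docx() {
  header="$1"; src="$2"; tmp="$3"; out="$4"
  # Pass 1: header and chapter are both read as Markdown; pandoc-crossref
  # and citeproc run here, and the resulting AST is saved as JSON.
  "$PANDOC" "$header" "$src" -F pandoc-crossref --citeproc \
    -t json -o "$tmp" || return 1
  # Pass 2: the JSON file is the only input, so no Markdown file is
  # handed to the JSON reader.
  "$PANDOC" -f json "$tmp" -o "$out"
}
```

If the error persists with the header removed from the second run, the cause lies elsewhere and a minimal example would still be needed.
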
lierdakil · 2024-05-04T21:44:18Z

Meanwhile, turns out I forgot to update some tests, so that CI build failed. Anyway, I'll just cut a release I guess, and we can do another one if this doesn't work out for some reason. For future reference, 0.3.17.1 (artefacts not yet built, but this time CI should finish fine 🤞)

jgm · 2024-05-05T02:48:48Z

Thanks @lierdakil - it looks like this isn't going to require pandoc changes, so I'll close this issue.

jiucenglou added the bug label May 3, 2024

jgm closed this as completed May 3, 2024

jgm reopened this May 3, 2024

jgm added format:Markdown writer labels May 3, 2024

jiucenglou mentioned this issue May 4, 2024

pandoc could not pick up the figure which pandoc-crossref specifies using "<figure" lierdakil/pandoc-crossref#434

Closed

jgm closed this as completed May 5, 2024
