DocBook reader ignores the `id` attribute of `formalpara` #8666

tombolano · 2023-03-03T14:39:37Z

Explain the problem.

DocBook reader ignores the id attribute of formalpara elements. This attribute is needed for cross-references.

I found this problem when trying to convert an AsciiDoc document that references code blocks. Since pandoc cannot read AsciiDoc directly, I used the DocBook backend of asciidoctor to generate a DocBook document. When I then converted the DocBook document to other formats, the references to the code blocks were broken.

For a minimal example, consider this asciidoc code:

= My document

My code is in <<my_code_id>>.

.Code caption
[#my_code_id,bash]
----
echo "hello world"
----

When converting to DocBook with asciidoctor -b docbook example.adoc, the following DocBook is produced:

<?xml version="1.0" encoding="UTF-8"?>
<?asciidoc-toc?>
<?asciidoc-numbered?>
<article xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en">
<info>
<title>My document</title>
<date>2023-03-03</date>
</info>
<simpara>My code is in <xref linkend="my_code_id"/>.</simpara>
<formalpara xml:id="my_code_id">
<title>Code caption</title>
<para>
<programlisting language="bash" linenumbering="unnumbered">echo "hello world"</programlisting>
</para>
</formalpara>
</article>

Then, when pandoc reads the DocBook code with the command pandoc -t native -f docbook, the following AST is returned:

[ Para
    [ Str "My"
    , Space
    , Str "code"
    , Space
    , Str "is"
    , Space
    , Str "in"
    , Space
    , Link
        ( "" , [] , [] )
        [ Str "formalpara_title" ]
        ( "#my_code_id" , "" )
    , Str "."
    ]
, Div
    ( "" , [ "formalpara-title" ] , [] )
    [ Para [ Strong [ Str "Code" , Space , Str "caption" ] ] ]
, CodeBlock ( "" , [ "bash" ] , [] ) "echo \"hello world\""
]

The problem here is that the Div element in the AST has an empty identifier, so the Link to #my_code_id has no target and the reference to the code block is broken. The expected Div is:

Div
    ( "my_code_id" , [ "formalpara-title" ] , [] )
    [ Para [ Strong [ Str "Code" , Space , Str "caption" ] ] ]
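
To make the broken reference concrete, here is a minimal sketch that checks whether every internal link target is defined by some block. It uses a toy AST, not pandoc's real types: Inline, Block, doc and danglingLinks are hypothetical names introduced only for illustration, and the Strong wrapper around the title is omitted.

```haskell
import Data.List (nub)

-- Toy stand-ins for pandoc's AST (hypothetical and much simplified).
data Inline = Str String | Space | Link String -- Link keeps only its target
  deriving (Show, Eq)

data Block
  = Para [Inline]
  | Div String [String] [Block]      -- identifier, classes, children
  | CodeBlock String [String] String -- identifier, classes, code
  deriving (Show, Eq)

-- Collect every non-empty identifier defined by a block.
blockIds :: Block -> [String]
blockIds (Para _) = []
blockIds (Div ident _ bs) = [ident | not (null ident)] ++ concatMap blockIds bs
blockIds (CodeBlock ident _ _) = [ident | not (null ident)]

-- Collect internal link targets ("#foo" becomes "foo").
blockTargets :: Block -> [String]
blockTargets (Para inls) = [t | Link ('#' : t) <- inls]
blockTargets (Div _ _ bs) = concatMap blockTargets bs
blockTargets (CodeBlock _ _ _) = []

-- Internal link targets that no block defines.
danglingLinks :: [Block] -> [String]
danglingLinks bs = nub [t | t <- targets, t `notElem` ids]
  where
    ids = concatMap blockIds bs
    targets = concatMap blockTargets bs

-- The document from this report, parameterised by the identifier of the title Div.
doc :: String -> [Block]
doc divId =
  [ Para [Str "My", Space, Str "code", Space, Str "is", Space, Str "in", Space, Link "#my_code_id", Str "."]
  , Div divId ["formalpara-title"] [Para [Str "Code", Space, Str "caption"]]
  , CodeBlock "" ["bash"] "echo \"hello world\""
  ]

current, expected :: [Block]
current = doc ""            -- what the reader returns today
expected = doc "my_code_id" -- what this report asks for

main :: IO ()
main = do
  print (danglingLinks current)  -- ["my_code_id"]
  print (danglingLinks expected) -- []
```

With the empty identifier the link target is reported as dangling; once the Div carries my_code_id the list is empty.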

Pandoc version?
Pandoc development version

Possible fix
I have never programmed in Haskell, but I looked around the code a bit and found a working solution. This is the diff:

diff --git a/src/Text/Pandoc/Readers/DocBook.hs b/src/Text/Pandoc/Readers/DocBook.hs
index e11da4253..cf08d04d6 100644
--- a/src/Text/Pandoc/Readers/DocBook.hs
+++ b/src/Text/Pandoc/Readers/DocBook.hs
@@ -858,7 +858,7 @@ parseBlock (Elem e) =
         "para"  -> parseMixed para (elContent e)
         "formalpara" -> do
            tit <- case filterChild (named "title") e of
-                        Just t  -> divWith ("",["formalpara-title"],[]) .
+                        Just t  -> divWith (attrValue "id" e,["formalpara-title"],[]) .
                                    para .  strong <$> getInlines t
                         Nothing -> return mempty
            (tit <>) <$> parseMixed para (elContent e)

This fixes the id attribute. Note that there is also a related issue, #3657, about role attributes not being preserved.


tombolano · 2023-03-06T21:48:21Z

I looked a bit more into the DocBook standard and into the Pandoc DocBook reader code and I think that this might be better solved by using the already defined parseAdmonition function.

Moreover, I realized that DocBook also has the example and sidebar elements. These elements themselves are currently not parsed (only their contents pass through), but they are similar to formalpara, i.e., they are containers with a title, so they can also be parsed with the parseAdmonition function. Note that because these two elements are not parsed, their id attributes and title elements are lost in the conversion.

Thus, consider this DocBook example (example.xml):

<?xml version="1.0" encoding="UTF-8"?>
<?asciidoc-toc?>
<?asciidoc-numbered?>
<article xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en">
<info>
<title>My Document</title>
<date>2023-03-06</date>
</info>
<formalpara xml:id="my_code_id">
<title>Code title</title>
<para>
<programlisting language="bash" linenumbering="unnumbered">echo "hello world!"</programlisting>
</para>
</formalpara>
<example xml:id="my_example_id">
<title>Example title</title>
<simpara>example content</simpara>
</example>
<sidebar xml:id="my_sidebar_id">
<title>Sidebar title</title>
<simpara>sidebar content</simpara>
</sidebar>
</article>

With the current pandoc development version, the AST obtained with pandoc -f docbook -t native is the following:

[ Div
    ( "" , [ "formalpara-title" ] , [] )
    [ Para [ Strong [ Str "Code" , Space , Str "title" ] ] ]
, CodeBlock ( "" , [ "bash" ] , [] ) "echo \"hello world!\""
, Para [ Str "example" , Space , Str "content" ]
, Para [ Str "sidebar" , Space , Str "content" ]
]

Note that the id of the formalpara is lost, and that both the ids and the titles of the example and sidebar elements are lost.

Now if we modify the code to use the parseAdmonition function with these changes:

diff --git a/src/Text/Pandoc/Readers/DocBook.hs b/src/Text/Pandoc/Readers/DocBook.hs
index 855f1d188..521f9ec89 100644
--- a/src/Text/Pandoc/Readers/DocBook.hs
+++ b/src/Text/Pandoc/Readers/DocBook.hs
@@ -786,6 +786,9 @@ blockTags = Set.fromList $
 admonitionTags :: [Text]
 admonitionTags = ["caution","danger","important","note","tip","warning"]
 
+titledBlockElements :: [Text]
+titledBlockElements = ["example", "formalpara", "sidebar"]
+
 -- Trim leading and trailing newline characters
 trimNl :: Text -> Text
 trimNl = T.dropAround (== '\n')
@@ -849,12 +852,6 @@ parseBlock (Elem e) =
         "toc"   -> skip -- skip TOC, since in pandoc it's autogenerated
         "index" -> skip -- skip index, since page numbers meaningless
         "para"  -> parseMixed para (elContent e)
-        "formalpara" -> do
-           tit <- case filterChild (named "title") e of
-                        Just t  -> divWith ("",["formalpara-title"],[]) .
-                                   para .  strong <$> getInlines t
-                        Nothing -> return mempty
-           (tit <>) <$> parseMixed para (elContent e)
         "simpara"  -> parseMixed para (elContent e)
         "ackno"  -> parseMixed para (elContent e)
         "epigraph" -> parseBlockquote
@@ -899,6 +896,7 @@ parseBlock (Elem e) =
         "refsect3" -> sect 3
         "refsection" -> gets dbSectionLevel >>= sect . (+1)
         l | l `elem` admonitionTags -> parseAdmonition l
+        l | l `elem` titledBlockElements -> parseAdmonition l
         "area" -> skip
         "areaset" -> skip
         "areaspec" -> skip

the resulting AST obtained with pandoc -f docbook -t native is the following:

[ Div
    ( "my_code_id" , [ "formalpara" ] , [] )
    [ Div
        ( "" , [ "title" ] , [] )
        [ Plain [ Str "Code" , Space , Str "title" ] ]
    , CodeBlock ( "" , [ "bash" ] , [] ) "echo \"hello world!\""
    ]
, Div
    ( "my_example_id" , [ "example" ] , [] )
    [ Div
        ( "" , [ "title" ] , [] )
        [ Plain [ Str "Example" , Space , Str "title" ] ]
    , Para [ Str "example" , Space , Str "content" ]
    ]
, Div
    ( "my_sidebar_id" , [ "sidebar" ] , [] )
    [ Div
        ( "" , [ "title" ] , [] )
        [ Plain [ Str "Sidebar" , Space , Str "title" ] ]
    , Para [ Str "sidebar" , Space , Str "content" ]
    ]
]

Note that in the above AST both the title and the content of the formalpara are inside a single Div with class formalpara, which I think is much better than before, where a separate title Div and sibling content blocks were created.
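
The difference between the two output shapes can be sketched as follows. This is a toy model under stated assumptions, not the code of parseAdmonition or of the DocBook reader: Block, siblingShape, wrappedShape and topLevelIds are hypothetical names, titles are plain strings, and the Strong wrapper is omitted.

```haskell
-- Toy stand-ins for pandoc's AST (hypothetical and much simplified).
data Block
  = Plain String
  | Para String
  | Div String [String] [Block] -- identifier, classes, children
  deriving (Show, Eq)

-- Old shape: the title becomes a sibling Div and the element's id is dropped.
siblingShape :: String -> Maybe String -> [Block] -> [Block]
siblingShape _ident mtitle body = titleDiv ++ body
  where
    titleDiv = maybe [] (\t -> [Div "" ["formalpara-title"] [Para t]]) mtitle

-- Proposed shape: one outer Div carries the id and uses the element name
-- as its class; the title and the content both live inside it.
wrappedShape :: String -> String -> Maybe String -> [Block] -> [Block]
wrappedShape tag ident mtitle body = [Div ident [tag] (titleDiv ++ body)]
  where
    titleDiv = maybe [] (\t -> [Div "" ["title"] [Plain t]]) mtitle

-- Non-empty identifiers of the top-level Divs.
topLevelIds :: [Block] -> [String]
topLevelIds bs = [i | Div i _ _ <- bs, not (null i)]

main :: IO ()
main = do
  let body = [Para "echo \"hello world!\""]
      title = Just "Code title"
  print (topLevelIds (siblingShape "my_code_id" title body))              -- []
  print (topLevelIds (wrappedShape "formalpara" "my_code_id" title body)) -- ["my_code_id"]
```

In the wrapped shape the identifier survives and a writer can treat title and content as one unit, which is the property the AST above shows for formalpara, example and sidebar.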

tombolano added the bug label Mar 3, 2023

snwnde mentioned this issue Mar 3, 2023

DocBook reader ignores the id attribute of informalequation #8664

Open

atomfrede mentioned this issue Jan 22, 2024

Verweise gehen bei Word-Konvertierung verloren (References are lost during Word conversion) java-aktuell/java-aktuell-asciidoc-template#4

Open
