DocBook reader ignores the id attribute of formalpara #8666

Open
tombolano opened this issue Mar 3, 2023 · 1 comment

tombolano commented Mar 3, 2023

Explain the problem.

The DocBook reader ignores the id attribute (xml:id in DocBook 5) of formalpara elements. This attribute is needed for cross-references.

I found this problem when trying to convert an AsciiDoc document that references code blocks. Since pandoc does not support direct AsciiDoc conversion, I used the DocBook backend of Asciidoctor to generate a DocBook document. When I then converted the DocBook document to other formats, the references to the code blocks were broken.

For a minimal example, consider this AsciiDoc source (example.adoc):

= My document

My code is in <<my_code_id>>.

.Code caption
[#my_code_id,bash]
----
echo "hello world"
----

Converting it to DocBook with asciidoctor -b docbook example.adoc produces the following:

<?xml version="1.0" encoding="UTF-8"?>
<?asciidoc-toc?>
<?asciidoc-numbered?>
<article xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en">
<info>
<title>My document</title>
<date>2023-03-03</date>
</info>
<simpara>My code is in <xref linkend="my_code_id"/>.</simpara>
<formalpara xml:id="my_code_id">
<title>Code caption</title>
<para>
<programlisting language="bash" linenumbering="unnumbered">echo "hello world"</programlisting>
</para>
</formalpara>
</article>

Then, when pandoc reads this DocBook with the command pandoc -t native -f docbook, the following AST is returned:

[ Para
    [ Str "My"
    , Space
    , Str "code"
    , Space
    , Str "is"
    , Space
    , Str "in"
    , Space
    , Link
        ( "" , [] , [] )
        [ Str "formalpara_title" ]
        ( "#my_code_id" , "" )
    , Str "."
    ]
, Div
    ( "" , [ "formalpara-title" ] , [] )
    [ Para [ Strong [ Str "Code" , Space , Str "caption" ] ] ]
, CodeBlock ( "" , [ "bash" ] , [] ) "echo \"hello world\""
]

The problem is that the Div element in the AST has an empty id, so the earlier Link to #my_code_id points at nothing and the reference to the code block is broken. The expected Div is:

Div
    ( "my_code_id" , [ "formalpara-title" ] , [] )
    [ Para [ Strong [ Str "Code" , Space , Str "caption" ] ] ]
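
To make the breakage concrete, here is a minimal sketch of a dangling-link check. It uses toy stand-ins for the AST types (an assumption, heavily simplified from pandoc's real Text.Pandoc.Definition), and the names sampleDoc and danglingTargets are hypothetical:

```haskell
import Data.List (nub)

-- Toy stand-ins for pandoc's AST types (assumption: heavily simplified,
-- not the real Text.Pandoc.Definition).
type Attr = (String, [String], [(String, String)])

data Inline
  = Str String
  | Link Attr [Inline] String -- attributes, link text, target URL
  deriving (Show, Eq)

data Block
  = Para [Inline]
  | Div Attr [Block]
  | CodeBlock Attr String
  deriving (Show, Eq)

-- Collect every non-empty identifier defined by a block.
blockIds :: Block -> [String]
blockIds (Para _) = []
blockIds (Div (i, _, _) bs) = [i | not (null i)] ++ concatMap blockIds bs
blockIds (CodeBlock (i, _, _) _) = [i | not (null i)]

-- Collect internal link targets: "#foo" contributes "foo".
inlineTargets :: Inline -> [String]
inlineTargets (Str _) = []
inlineTargets (Link _ txt ('#' : t)) = t : concatMap inlineTargets txt
inlineTargets (Link _ txt _) = concatMap inlineTargets txt

blockTargets :: Block -> [String]
blockTargets (Para is) = concatMap inlineTargets is
blockTargets (Div _ bs) = concatMap blockTargets bs
blockTargets (CodeBlock _ _) = []

-- Internal targets that no block in the document defines.
danglingTargets :: [Block] -> [String]
danglingTargets doc = nub [t | t <- concatMap blockTargets doc, t `notElem` ids]
  where
    ids = concatMap blockIds doc

-- The document from this report, parameterised by the id on the title Div.
sampleDoc :: String -> [Block]
sampleDoc divId =
  [ Para
      [ Str "My code is in "
      , Link ("", [], []) [Str "formalpara_title"] "#my_code_id"
      , Str "."
      ]
  , Div (divId, ["formalpara-title"], []) [Para [Str "Code caption"]]
  , CodeBlock ("", ["bash"], []) "echo \"hello world\""
  ]
```

In GHCi, danglingTargets (sampleDoc "") evaluates to ["my_code_id"], which models the current output, while danglingTargets (sampleDoc "my_code_id") evaluates to [], which models the expected one.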

Pandoc version?
Pandoc development version

Possible fix
I have never programmed in Haskell, but I looked around the code a bit and found a working solution. This is the diff:

diff --git a/src/Text/Pandoc/Readers/DocBook.hs b/src/Text/Pandoc/Readers/DocBook.hs
index e11da4253..cf08d04d6 100644
--- a/src/Text/Pandoc/Readers/DocBook.hs
+++ b/src/Text/Pandoc/Readers/DocBook.hs
@@ -858,7 +858,7 @@ parseBlock (Elem e) =
         "para"  -> parseMixed para (elContent e)
         "formalpara" -> do
            tit <- case filterChild (named "title") e of
-                        Just t  -> divWith ("",["formalpara-title"],[]) .
+                        Just t  -> divWith (attrValue "id" e,["formalpara-title"],[]) .
                                    para .  strong <$> getInlines t
                         Nothing -> return mempty
            (tit <>) <$> parseMixed para (elContent e)

This fixes the id attribute, but note that there is also the related issue #3657, which reports that role attributes are not preserved.
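
The one-line change relies on the attribute lookup finding xml:id when asked for "id", and on it yielding an empty string when the attribute is absent. The following is a rough sketch of that behaviour on a toy element type. The types and the names lookupLocal and titleDivAttr are hypothetical, and the assumption that the lookup matches on the local name only is inferred from the output above rather than from the reader's source:

```haskell
import Data.Maybe (fromMaybe)

-- Toy model of an XML attribute: optional prefix, local name, value.
data XmlAttr = XmlAttr
  { attrPrefix :: Maybe String
  , attrLocal  :: String
  , attrVal    :: String
  }

-- Toy model of an element: tag name plus attributes.
data Element = Element
  { elName  :: String
  , elAttrs :: [XmlAttr]
  }

-- Pandoc-style attribute triple: (identifier, classes, key-value pairs).
type Attr = (String, [String], [(String, String)])

-- Hypothetical analogue of the reader's lookup: match on the local name
-- (so xml:id matches "id") and fall back to "" when nothing matches.
lookupLocal :: String -> Element -> String
lookupLocal name e =
  fromMaybe "" (lookup name [(attrLocal a, attrVal a) | a <- elAttrs e])

-- Attributes for the title Div, mirroring the patched line in the diff.
titleDivAttr :: Element -> Attr
titleDivAttr e = (lookupLocal "id" e, ["formalpara-title"], [])
```

With this model, a formalpara carrying xml:id="my_code_id" gives ("my_code_id", ["formalpara-title"], []), and one without an id gives the same triple as before the patch, so documents without ids are unaffected.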

@tombolano
Author

I looked further into the DocBook standard and the pandoc DocBook reader code, and I think this might be better solved by using the already defined parseAdmonition function.

Moreover, I realized that DocBook also has the example and sidebar elements. The reader currently has no dedicated handling for them, but they are similar to formalpara, i.e., they are containers with a title, so they can also be parsed with the parseAdmonition function. Because these two elements are not handled, their ids and titles are lost in the conversion.

Thus, consider this DocBook example (example.xml):

<?xml version="1.0" encoding="UTF-8"?>
<?asciidoc-toc?>
<?asciidoc-numbered?>
<article xmlns="http://docbook.org/ns/docbook" xmlns:xl="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en">
<info>
<title>My Document</title>
<date>2023-03-06</date>
</info>
<formalpara xml:id="my_code_id">
<title>Code title</title>
<para>
<programlisting language="bash" linenumbering="unnumbered">echo "hello world!"</programlisting>
</para>
</formalpara>
<example xml:id="my_example_id">
<title>Example title</title>
<simpara>example content</simpara>
</example>
<sidebar xml:id="my_sidebar_id">
<title>Sidebar title</title>
<simpara>sidebar content</simpara>
</sidebar>
</article>

With the current pandoc development version, the AST obtained with pandoc -f docbook -t native is the following:

[ Div
    ( "" , [ "formalpara-title" ] , [] )
    [ Para [ Strong [ Str "Code" , Space , Str "title" ] ] ]
, CodeBlock ( "" , [ "bash" ] , [] ) "echo \"hello world!\""
, Para [ Str "example" , Space , Str "content" ]
, Para [ Str "sidebar" , Space , Str "content" ]
]

Note that all three ids are lost, and the titles of example and sidebar are dropped as well.

Now, if we modify the code to use the parseAdmonition function with these changes:

diff --git a/src/Text/Pandoc/Readers/DocBook.hs b/src/Text/Pandoc/Readers/DocBook.hs
index 855f1d188..521f9ec89 100644
--- a/src/Text/Pandoc/Readers/DocBook.hs
+++ b/src/Text/Pandoc/Readers/DocBook.hs
@@ -786,6 +786,9 @@ blockTags = Set.fromList $
 admonitionTags :: [Text]
 admonitionTags = ["caution","danger","important","note","tip","warning"]
 
+titledBlockElements :: [Text]
+titledBlockElements = ["example", "formalpara", "sidebar"]
+
 -- Trim leading and trailing newline characters
 trimNl :: Text -> Text
 trimNl = T.dropAround (== '\n')
@@ -849,12 +852,6 @@ parseBlock (Elem e) =
         "toc"   -> skip -- skip TOC, since in pandoc it's autogenerated
         "index" -> skip -- skip index, since page numbers meaningless
         "para"  -> parseMixed para (elContent e)
-        "formalpara" -> do
-           tit <- case filterChild (named "title") e of
-                        Just t  -> divWith ("",["formalpara-title"],[]) .
-                                   para .  strong <$> getInlines t
-                        Nothing -> return mempty
-           (tit <>) <$> parseMixed para (elContent e)
         "simpara"  -> parseMixed para (elContent e)
         "ackno"  -> parseMixed para (elContent e)
         "epigraph" -> parseBlockquote
@@ -899,6 +896,7 @@ parseBlock (Elem e) =
         "refsect3" -> sect 3
         "refsection" -> gets dbSectionLevel >>= sect . (+1)
         l | l `elem` admonitionTags -> parseAdmonition l
+        l | l `elem` titledBlockElements -> parseAdmonition l
         "area" -> skip
         "areaset" -> skip
         "areaspec" -> skip

The resulting AST obtained with pandoc -f docbook -t native is the following:

[ Div
    ( "my_code_id" , [ "formalpara" ] , [] )
    [ Div
        ( "" , [ "title" ] , [] )
        [ Plain [ Str "Code" , Space , Str "title" ] ]
    , CodeBlock ( "" , [ "bash" ] , [] ) "echo \"hello world!\""
    ]
, Div
    ( "my_example_id" , [ "example" ] , [] )
    [ Div
        ( "" , [ "title" ] , [] )
        [ Plain [ Str "Example" , Space , Str "title" ] ]
    , Para [ Str "example" , Space , Str "content" ]
    ]
, Div
    ( "my_sidebar_id" , [ "sidebar" ] , [] )
    [ Div
        ( "" , [ "title" ] , [] )
        [ Plain [ Str "Sidebar" , Space , Str "title" ] ]
    , Para [ Str "sidebar" , Space , Str "content" ]
    ]
]

Note that, for the formalpara element above, both the title and the content are now inside a single Div with class formalpara. I think this is much better than before, where a separate title Div and a separate content block were created.
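
The nesting shown in that output can be summarised with a small sketch. This is a toy model of the resulting shape only, not pandoc's parseAdmonition, and the Titled record and wrapTitled function are hypothetical names:

```haskell
-- Pandoc-style attribute triple: (identifier, classes, key-value pairs).
type Attr = (String, [String], [(String, String)])

-- Toy block type (assumption: text is a plain String here, whereas
-- pandoc uses lists of inlines).
data Block
  = Plain String
  | Para String
  | Div Attr [Block]
  deriving (Show, Eq)

-- Hypothetical description of a titled DocBook container after its
-- children have been parsed.
data Titled = Titled
  { tag   :: String       -- element name, e.g. "example"
  , ident :: String       -- value of xml:id, "" if absent
  , title :: Maybe String -- text of the <title> child, if any
  , body  :: [Block]      -- the remaining parsed content
  }

-- Wrap title and content in one outer Div that carries the id and uses
-- the element name as its class; the title gets its own inner Div.
wrapTitled :: Titled -> Block
wrapTitled t = Div (ident t, [tag t], []) (titleBlocks ++ body t)
  where
    titleBlocks = case title t of
      Just s  -> [Div ("", ["title"], []) [Plain s]]
      Nothing -> []
```

Keeping the id on the outer Div means a cross-reference resolves to the whole block, title and content together, rather than to the title alone.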
