HTML Reader: Drops any tags inside <pre><code> #5333

sky-y · 2019-02-27T07:28:20Z

The HTML reader (pandoc -f html) drops any tags inside <pre><code>.

I would like it to keep the tags inside <pre><code>. How can I fix this?

Version

macOS Mojave 10.14

$ pandoc -v
pandoc 2.6
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2, skylighting 0.7.5

Example: Current (v2.6)

$ echo '<pre><code><b>aaa</b></code></pre>' | pandoc -f html -t html
<pre><code>aaa</code></pre>
$ echo '<pre><code><div class="foo">aaa</div></code></pre>' | pandoc -f html -t html
<pre><code>aaa</code></pre>
$ echo '<pre><code><pre><code>aaa</code></pre></code></pre>' | pandoc -f html -t html
<pre><code>aaa</code></pre>

Example: Expected

$ echo '<pre><code><b>aaa</b></code></pre>' | pandoc -f html -t html
<pre><code><b>aaa</b></code></pre>
$ echo '<pre><code><div class="foo">aaa</div></code></pre>' | pandoc -f html -t html
<pre><code><div class="foo">aaa</div></code></pre>
$ echo '<pre><code><pre><code>aaa</code></pre></code></pre>' | pandoc -f html -t html
<pre><code><pre><code>aaa</code></pre></code></pre>

Reference: Code in jgm/pandoc

The call to tagToString in pCodeBlock (src/Text/Pandoc/Readers/HTML.hs in jgm/pandoc) may be the cause.

pCodeBlock :: PandocMonad m => TagParser m Blocks
pCodeBlock = try $ do
  TagOpen _ attr' <- pSatisfy (matchTagOpen "pre" [])
  let attr = toStringAttr attr'
  contents <- manyTill pAnyTag (pCloses "pre" <|> eof)
  let rawText = concatMap tagToString contents
  -- drop leading newline if any
  let result' = case rawText of
                     '\n':xs -> xs
                     _       -> rawText
  -- drop trailing newline if any
  let result = case reverse result' of
                    '\n':_ -> init result'
                    _      -> result'
  return $ B.codeBlockWith (mkAttr attr) result

tagToString :: Tag Text -> String
tagToString (TagText s)      = T.unpack s
tagToString (TagOpen "br" _) = "\n"
tagToString _                = ""

I think these two clauses are responsible:

tagToString (TagText s)      = T.unpack s  -- keeps only the text inside <pre><code>?
tagToString _                = ""          -- drops every tag inside <pre><code>?
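
The effect of those clauses can be modelled outside pandoc. The sketch below is a rough Python analogue, not pandoc's code: it assumes the standard-library html.parser as a stand-in for the tag stream, and the names TextOnly and code_text are made up for illustration. Like tagToString, it keeps text nodes, turns <br> into a newline, and discards every other tag.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical analogue of tagToString: keep text, map <br> to a newline."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Only <br> contributes output; every other opening tag is dropped.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Text nodes pass through unchanged (entities arrive already decoded).
        self.parts.append(data)


def code_text(fragment):
    """Return the text a tag-dropping reader would keep from an HTML fragment."""
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)


print(code_text("<b>aaa</b>"))
print(code_text('<div class="foo">aaa</div>'))
```

Both calls print aaa, which matches the current pandoc output shown in the examples above.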

sky-y · 2019-02-27T08:01:08Z

I solved this myself: escape the tags as HTML entities.

$ echo '<b>aaa</b>' | ruby -r 'cgi' -ne 'print CGI.escapeHTML $_'
&lt;b&gt;aaa&lt;/b&gt;
$ echo '<pre><code>&lt;b&gt;aaa&lt;/b&gt;</code></pre>' | pandoc -f html -t html
<pre><code>&lt;b&gt;aaa&lt;/b&gt;</code></pre>
$ echo '<pre><code>&lt;b&gt;aaa&lt;/b&gt;</code></pre>' | pandoc -f html -t native
[CodeBlock ("",[],[]) "<b>aaa</b>"]
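
The same workaround could be applied to a whole HTML file before it reaches pandoc. A minimal sketch, assuming code blocks are written exactly as <pre><code>...</code></pre> with no attributes and no nesting; the name escape_code_blocks is made up for illustration, and Python's html.escape plays the role of CGI.escapeHTML above.

```python
import html
import re

# Assumption: blocks look exactly like <pre><code>...</code></pre>,
# without attributes on either tag and without nested blocks.
BLOCK = re.compile(r"(<pre><code>)(.*?)(</code></pre>)", re.DOTALL)


def escape_code_blocks(source):
    """Escape markup inside each <pre><code> block so it survives as text."""
    return BLOCK.sub(
        lambda m: m.group(1) + html.escape(m.group(2), quote=False) + m.group(3),
        source,
    )


print(escape_code_blocks("<pre><code><b>aaa</b></code></pre>"))
```

The printed string is the escaped input used in the commands above. Text that already contains entities would be escaped a second time (& becomes &amp;), and the nested <pre><code> example from the original report is not handled by this pattern.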

sky-y · 2019-02-28T04:55:24Z

My goal is to manipulate the AST of <pre><code><b>aaa</b></code></pre> in a filter. As a first step, I want to get the output in JSON format.

But my comment above was mistaken. I cannot use that workaround directly in a filter, because the outputs below are escaped HTML strings.

$ echo '<pre><code>&lt;b&gt;aaa&lt;/b&gt;</code></pre>' | pandoc -f html -t html
<pre><code>&lt;b&gt;aaa&lt;/b&gt;</code></pre>
$ echo '<pre><code>&lt;b&gt;aaa&lt;/b&gt;</code></pre>' | pandoc -f html -t native
[CodeBlock ("",[],[]) "<b>aaa</b>"]

In this case I would need to write another filter to unescape the HTML entities.

I am reopening this issue because it is not solved.

jgm · 2019-02-28T05:58:36Z

Please see this paragraph from the beginning of the manual:

Because pandoc’s intermediate representation of a document is less expressive than many of the formats it converts between, one should not expect perfect conversions between every format and every other. Pandoc attempts to preserve the structural elements of a document, but not formatting details such as margin size. And some document elements, such as complex tables, may not fit into pandoc’s simple document model. While conversions from pandoc’s Markdown to all formats aspire to be perfect, conversions from formats more expressive than pandoc’s Markdown can be expected to be lossy.

Pandoc's internal representation of a code block is just a big literal string, so no formatting is allowed. Yes, the conversion is lossy in this case. That is to be expected, as noted above.
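
The native output earlier in the thread suggests that the CodeBlock string already holds the unescaped markup (<b>aaa</b>), and that it is the HTML writer that escapes it again. Under that reading, a filter would not need to unescape anything, but it could swap each CodeBlock for a raw HTML block so that the writer passes the markup through. A minimal sketch in Python, assuming pandoc's JSON AST uses {"t": ..., "c": ...} nodes, with CodeBlock carrying [attr, text] and RawBlock carrying [format, text]. The names code_to_raw and run_filter are illustrative, the demo document is abridged, and the block's attributes are discarded.

```python
import json
import sys


def code_to_raw(node):
    """Recursively replace CodeBlock nodes with raw HTML blocks."""
    if isinstance(node, list):
        return [code_to_raw(item) for item in node]
    if isinstance(node, dict):
        if node.get("t") == "CodeBlock":
            _attr, text = node["c"]
            # Assumption: the text is trusted HTML; it is emitted unescaped.
            return {"t": "RawBlock",
                    "c": ["html", "<pre><code>" + text + "</code></pre>"]}
        return {key: code_to_raw(value) for key, value in node.items()}
    return node


def run_filter(source=sys.stdin, sink=sys.stdout):
    """Read a pandoc JSON document, rewrite it, and write it back."""
    json.dump(code_to_raw(json.load(source)), sink)


# Demo on a hand-written, abridged AST; a real filter would call run_filter().
doc = {"blocks": [{"t": "CodeBlock", "c": [["", [], []], "<b>aaa</b>"]}]}
print(json.dumps(code_to_raw(doc)))
```

Saved as an executable script (say filter.py, a hypothetical name) with the demo lines replaced by a call to run_filter(), this could be tried with pandoc -f html -t html --filter ./filter.py. The conversion stays lossy in the sense described above: the filter only changes how the stored string is written out.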

jgm · 2019-02-28T05:59:59Z

See #221

sky-y closed this as completed Feb 27, 2019

sky-y reopened this Feb 28, 2019

jgm closed this as completed Feb 28, 2019

linuxmail mentioned this issue Mar 11, 2019

everything between pre is lost outofcontrol/mediawiki-to-gfm#7

Closed
