HTML Reader: Drops any tags inside <pre><code> #5333

Closed
sky-y opened this issue Feb 27, 2019 · 4 comments

sky-y commented Feb 27, 2019

The HTML reader (pandoc -f html) drops any tags nested inside <pre><code>.

I want it to keep the tags inside <pre><code>. How can I do that?

Version

  • macOS Mojave 10.14
$ pandoc -v
pandoc 2.6
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2, skylighting 0.7.5

Example: Current (v2.6)

$ echo '<pre><code><b>aaa</b></code></pre>' | pandoc -f html -t html
<pre><code>aaa</code></pre>
$ echo '<pre><code><div class="foo">aaa</div></code></pre>' | pandoc -f html -t html
<pre><code>aaa</code></pre>
$ echo '<pre><code><pre><code>aaa</code></pre></code></pre>' | pandoc -f html -t html
<pre><code>aaa</code></pre>

Example: Expected

$ echo '<pre><code><b>aaa</b></code></pre>' | pandoc -f html -t html
<pre><code><b>aaa</b></code></pre>
$ echo '<pre><code><div class="foo">aaa</div></code></pre>' | pandoc -f html -t html
<pre><code><div class="foo">aaa</div></code></pre>
$ echo '<pre><code><pre><code>aaa</code></pre></code></pre>' | pandoc -f html -t html
<pre><code><pre><code>aaa</code></pre></code></pre>

Reference: Code in jgm/pandoc

The tagToString helper used by pCodeBlock in src/Text/Pandoc/Readers/HTML.hs (jgm/pandoc) might be the cause.

pCodeBlock :: PandocMonad m => TagParser m Blocks
pCodeBlock = try $ do
  TagOpen _ attr' <- pSatisfy (matchTagOpen "pre" [])
  let attr = toStringAttr attr'
  contents <- manyTill pAnyTag (pCloses "pre" <|> eof)
  let rawText = concatMap tagToString contents
  -- drop leading newline if any
  let result' = case rawText of
                     '\n':xs -> xs
                     _       -> rawText
  -- drop trailing newline if any
  let result = case reverse result' of
                    '\n':_ -> init result'
                    _      -> result'
  return $ B.codeBlockWith (mkAttr attr) result

tagToString :: Tag Text -> String
tagToString (TagText s)      = T.unpack s
tagToString (TagOpen "br" _) = "\n"
tagToString _                = ""

I think these two clauses are responsible:

tagToString (TagText s)      = T.unpack s  -- text inside <pre><code> without tags?
tagToString _                = ""          -- drops tags inside <pre><code>?
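
To make the suspected behaviour concrete, here is a small Python model of the three tagToString clauses. It is not pandoc code: `CodeBlockText`, `code_block_text`, and the `keep_tags` switch are hypothetical names for illustration, and Python's `html.parser` stands in for the tag stream that pCodeBlock consumes. With `keep_tags=False` it reproduces the first two "Current" outputs; with `keep_tags=True` it yields the text requested under "Expected".

```python
from html.parser import HTMLParser


class CodeBlockText(HTMLParser):
    """Python model of the tagToString clauses quoted above (not pandoc code)."""

    def __init__(self, keep_tags=False):
        # convert_charrefs=True delivers entities such as &lt; as decoded text
        super().__init__(convert_charrefs=True)
        self.keep_tags = keep_tags  # hypothetical switch: re-emit tags as text
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")  # mirrors: TagOpen "br" _ -> "\n"
        elif self.keep_tags:
            self.parts.append(self.get_starttag_text())
        # otherwise the tag is dropped, mirroring: tagToString _ -> ""

    def handle_endtag(self, tag):
        if self.keep_tags and tag != "br":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)  # mirrors: TagText s -> s


def code_block_text(inner_html, keep_tags=False):
    """Return the literal string a code block would hold for inner_html."""
    parser = CodeBlockText(keep_tags)
    parser.feed(inner_html)
    parser.close()
    return "".join(parser.parts)
```

For example, `code_block_text('<b>aaa</b>')` returns `'aaa'`, while `code_block_text('<b>aaa</b>', keep_tags=True)` returns `'<b>aaa</b>'`.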

sky-y (Author) commented Feb 27, 2019

I solved it myself: escape the tags as HTML entities.

$ echo '<b>aaa</b>' | ruby -r 'cgi' -ne 'print CGI.escapeHTML $_'
&lt;b&gt;aaa&lt;/b&gt;
$ echo '<pre><code>&lt;b&gt;aaa&lt;/b&gt;</code></pre>' | pandoc -f html -t html
<pre><code>&lt;b&gt;aaa&lt;/b&gt;</code></pre>
$ echo '<pre><code>&lt;b&gt;aaa&lt;/b&gt;</code></pre>' | pandoc -f html -t native
[CodeBlock ("",[],[]) "<b>aaa</b>"]
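
The same pre-escaping step can be sketched in Python with only the standard library. `wrap_as_code_block` is a hypothetical helper name; `html.escape` plays the role of `CGI.escapeHTML` in the Ruby one-liner above.

```python
import html


def wrap_as_code_block(snippet):
    """Escape markup so an HTML reader sees it as text, then wrap it in <pre><code>."""
    return f"<pre><code>{html.escape(snippet)}</code></pre>"
```

Calling `wrap_as_code_block('<b>aaa</b>')` produces `<pre><code>&lt;b&gt;aaa&lt;/b&gt;</code></pre>`, which is the input piped to pandoc in the commands above.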

sky-y closed this as completed Feb 27, 2019

sky-y (Author) commented Feb 28, 2019

My goal is to manipulate the AST of <pre><code><b>aaa</b></code></pre> in a filter. As a first step, I want to get the output in JSON format.

However, my earlier conclusion was wrong. I cannot use this approach in a filter directly, because the outputs below are escaped HTML strings.

$ echo '<pre><code>&lt;b&gt;aaa&lt;/b&gt;</code></pre>' | pandoc -f html -t html
<pre><code>&lt;b&gt;aaa&lt;/b&gt;</code></pre>
$ echo '<pre><code>&lt;b&gt;aaa&lt;/b&gt;</code></pre>' | pandoc -f html -t native
[CodeBlock ("",[],[]) "<b>aaa</b>"]

In this case I need to write another filter to unescape the HTML entities.

I am reopening this issue because it is not solved.
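
As a rough sketch of what such a filter would see, the following standard-library Python walks a pandoc JSON document and collects the text of each CodeBlock. The names `walk`, `code_block_strings`, and `SAMPLE_DOC` are hypothetical, and `SAMPLE_DOC` is hand-written to mirror the native output above; its exact shape is an assumption about what `pandoc -t json` emits for pandoc-types 1.17. Going by that native output, the string already holds literal `<b>aaa</b>`, so an extra unescaping pass may turn out to be a no-op.

```python
import html
import json

# Hand-written stand-in for `pandoc -f html -t json` output (assumed shape).
SAMPLE_DOC = json.loads(
    '{"pandoc-api-version": [1, 17, 5, 4], "meta": {}, '
    '"blocks": [{"t": "CodeBlock", "c": [["", [], []], "<b>aaa</b>"]}]}'
)


def walk(node, visit):
    """Call visit(element) on every {"t": ..., "c": ...} element, depth first."""
    if isinstance(node, list):
        for item in node:
            walk(item, visit)
    elif isinstance(node, dict):
        if "t" in node:
            visit(node)
        for value in node.values():
            walk(value, visit)


def code_block_strings(doc):
    """Collect the literal text of every CodeBlock in a pandoc JSON document."""
    found = []

    def visit(element):
        if element["t"] == "CodeBlock":
            _attr, text = element["c"]  # c = [[id, classes, key-values], text]
            found.append(text)

    walk(doc["blocks"], visit)
    return found


texts = code_block_strings(SAMPLE_DOC)
# html.unescape leaves text without entities unchanged
unescaped = [html.unescape(text) for text in texts]
```

In a real filter the document would come from standard input, for instance via `json.load(sys.stdin)`, rather than from `SAMPLE_DOC`. The tags are still plain characters inside one string at this point, not AST nodes.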

sky-y reopened this Feb 28, 2019

jgm (Owner) commented Feb 28, 2019

Please see this paragraph from the beginning of the manual:

Because pandoc’s intermediate representation of a document is less expressive than many of the formats it converts between, one should not expect perfect conversions between every format and every other. Pandoc attempts to preserve the structural elements of a document, but not formatting details such as margin size. And some document elements, such as complex tables, may not fit into pandoc’s simple document model. While conversions from pandoc’s Markdown to all formats aspire to be perfect, conversions from formats more expressive than pandoc’s Markdown can be expected to be lossy.

Pandoc's internal representation of a code block is just a big literal string, so no formatting is allowed. Yes, the conversion is lossy in this case. That is to be expected, as noted above.

jgm closed this as completed Feb 28, 2019

jgm (Owner) commented Feb 28, 2019

See #221
