Pandoc escapes characters over-aggressively when writing markdown #6259

Closed
khatchad opened this issue Apr 7, 2020 · 16 comments

@khatchad

khatchad commented Apr 7, 2020

Suppose I want to use pandoc to convert between markdown flavors:

$ echo "# Header #1" | pandoc -t markdown

I get the following output with the #1 escaped:

Header \#1
==========

How do I prevent pandoc from doing this?

Version Info

$ pandoc --version
pandoc 2.5
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2.2, skylighting 0.7.7
Default user data directory: /home/rk1424/.pandoc
Copyright (C) 2006-2018 John MacFarlane
Web:  http://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose
@jgm
Owner

jgm commented Apr 7, 2020

There isn't a way to prevent this at the moment, but the escaping isn't "incorrect," only unnecessary. This is perfectly valid markdown. # is escaped to avoid interpretation as a markdown control character. For example, there's a difference between

# Header #

which yields

<h1>Header</h1>

and

# Header \#

which yields

<h1>Header #</h1>

In this particular case the escaping isn't necessary. Pandoc could probably be smarter about detecting such cases, but this isn't a bug.
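
As a rough illustration of the distinction described above, here is a minimal Python sketch of how an ATX heading line can be read. It follows a simplified version of the CommonMark closing-sequence rule and is not pandoc's actual reader; the function name `atx_heading_text` is made up for this example.

```python
import re
from typing import Optional

def atx_heading_text(line: str) -> Optional[str]:
    """Return the heading text of an ATX heading line, or None if it is not one.

    Simplified reading of the CommonMark rule (an assumption for illustration):
    1-6 leading '#', then whitespace, then content; an optional closing run of
    '#' is dropped only when it is preceded by whitespace and not escaped.
    """
    m = re.match(r"^ {0,3}(#{1,6})(?:[ \t]+|$)(.*)$", line)
    if m is None:
        return None
    content = m.group(2)
    # Drop an optional closing sequence such as " #" or " ###".
    content = re.sub(r"(?:^|[ \t]+)#+[ \t]*$", "", content)
    # A backslash-escaped '#' is literal text, so remove only the backslash.
    return content.strip().replace("\\#", "#")

print(atx_heading_text("# Header #"))    # closing '#' is dropped
print(atx_heading_text("# Header \\#"))  # escaped '#' is kept as text
print(atx_heading_text("# Header #1"))   # '#1' is not a closing sequence
```

Under this reading, `# Header #1` keeps its `#1` even without a backslash, because `#1` is not a closing run of `#` characters. That is why the escape in the original report is valid but unnecessary.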

@khatchad khatchad changed the title Pandoc incorrectly escapes characters when converting to markdown from markdown Pandoc undesirably escapes characters when converting to markdown from markdown Apr 7, 2020
@khatchad
Author

khatchad commented Apr 7, 2020

Thanks for the feedback. I changed the title, but I still think it's an open question whether this is a bug, for one reason:

[Markdown's] key design goal is readability – that the language be readable as-is, without looking like it has been marked up with tags or formatting instructions ... -- Markdown, https://en.wikipedia.org/w/index.php?title=Markdown&oldid=946233394 (last visited Apr. 7, 2020).

Thus, the readability of the produced markdown is questionable.

@jgm
Owner

jgm commented Apr 7, 2020

I certainly agree that it would be better not to put in backslashes except when necessary.
We could try to improve the heuristics the markdown writer is currently using.

@jayrobwilliams

Depending on your use case, I've found a potential workaround for the time being. I'm using R Markdown to render .Rmd files to .md for Jekyll. If you create a file fn.md

This is a footenote`[^1]:`{=html}.

`[^1]:`{=html} And here is the content of the footnote.

If you then render it to gfm, pandoc passes the raw 'html' through and drops the attribute tags, so pandoc fn.md -t gfm results in:

This is a footenote[^1]:.

[^1]: And here is the content of the footnote.

effectively preventing the markdown from getting backslash escaped, and giving me working footnotes for Jekyll. The key is to render it to gfm, because regular markdown will keep the attribute tags; pandoc fn.md -t markdown yields:

This is a footenote`[^1]:`{=html}.

`[^1]:`{=html} And here is the content of the footnote.

@laoshaw

laoshaw commented May 16, 2021

It does the same thing when converting from HTML to markdown. It does not break anything, but it's truly an eyesore.

Is there a list of what it escapes, so I can use sed or similar to remove the added escapes (\#, \., \[, \], ...) in a second processing stage?

@jgm jgm changed the title Pandoc undesirably escapes characters when converting to markdown from markdown Pandoc escapes characters over-aggressively when writing markdown May 16, 2021
@jgm
Owner

jgm commented May 16, 2021

<, >, \, `, *, _, [, ], #
@ if citations extension is enabled
| if pipe_tables enabled
^ if superscript enabled
~ if strikeout or subscript enabled
$ if tex_math_dollars enabled
. (when followed by .), ", ', - (when followed by -) if smart enabled
_ if necessary
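
For the second-stage cleanup asked about above, one option is a small filter over pandoc's output. The sketch below uses Python rather than sed, and `strip_escapes` is a hypothetical helper, not part of pandoc. It assumes you have already decided which characters are safe to unescape in your documents: removing a backslash that was actually needed changes the meaning of the markdown, and the sketch makes no attempt to skip code spans or code blocks.

```python
import re

# Core characters from the list above that pandoc may backslash-escape.
# The extension-dependent ones (@ | ^ ~ $ and so on) are left out on purpose.
CORE_ESCAPED = "<>\\`*_[]#"

def strip_escapes(text: str, chars: str = "#") -> str:
    """Remove the backslash in front of any character listed in `chars`.

    Escape pairs are scanned left to right, so an escaped backslash
    followed by '#' (written as two backslashes and a '#') is left alone.
    """
    def replace(match: "re.Match[str]") -> str:
        escaped = match.group(1)
        # Keep the whole pair unless the caller asked to unescape this character.
        return escaped if escaped in chars else match.group(0)

    return re.sub(r"\\(.)", replace, text)

print(strip_escapes("Header \\#1"))
print(strip_escapes("\\*not emphasis\\*"))
```

By default only `#` is unescaped. Passing a wider set such as `chars="#>"` is possible, but each added character is one more place where the output could be reinterpreted as markup.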

@jgm
Owner

jgm commented May 16, 2021

@jayrobwilliams this seems unduly complex, given that pandoc has built in support for this style of footnotes. Have you tried -t gfm+footnotes?

jgm added a commit that referenced this issue May 16, 2021
@jgm
Owner

jgm commented May 16, 2021

I've pushed a change that should reduce unnecessary escapes for # and >.

@laoshaw

laoshaw commented May 17, 2021

<, >, \, `, *, _, [, ], #
@ if citations extension is enabled
| if pipe_tables enabled
^ if superscript enabled
~ if strikeout or subscript enabled
$ if tex_math_dollars enabled
. (when followed by .), ", ', - (when followed by -) if smart enabled
_ if necessary

I think ., + and - are also affected, and fenced code blocks seem to be escaped unnecessarily as well.

@jgm
Owner

jgm commented May 17, 2021

+ is only escaped if it occurs at the start of a line (and is followed by whitespace), because if unescaped it would start a list.

- is only escaped in a potential list context (see + above) or in the context -- (where it would be an en dash if smart is enabled)

., as noted, is only escaped in the context .. (if smart is enabled).
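
The list-marker rule described above can be sketched as a simple check. This is only an illustration of the rule as stated in this comment, not pandoc's implementation, and both function names are invented for the example.

```python
import re

# A '+' or '-' at the very start of a line, followed by whitespace,
# would be read as a bullet list marker if left unescaped.
_LIST_MARKER = re.compile(r"^[+-]\s")

def would_start_list(line: str) -> bool:
    """True if the line would be parsed as a list item when written as-is."""
    return _LIST_MARKER.match(line) is not None

def escape_list_marker(line: str) -> str:
    """Add a backslash only where the leading '+' or '-' is ambiguous."""
    return "\\" + line if would_start_list(line) else line

for sample in ["+ one more thing", "+1 from me", "a + b", "- not a list"]:
    print(repr(sample), "->", repr(escape_list_marker(sample)))
```

A `+` or `-` in the middle of a line, or one not followed by whitespace, is left alone, which matches the point that these characters are escaped only in a potential list context.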

@jayrobwilliams

@jayrobwilliams this seems unduly complex, given that pandoc has built in support for this style of footnotes. Have you tried -t gfm+footnotes?

@jgm works perfectly! I totally missed in the documentation that you can append extensions to output formats; I didn't even think to look for that, since footnotes work natively with standard markdown. Thank you!

@aslmx

aslmx commented Oct 31, 2021

I'm also having issues with character escaping.

I want my template to allow for inclusion of a PDF that is put before the PDF that is produced by the markdown.

To keep it simple (it could probably be optimized), I have two variables: one is checked to decide whether the pages should be included, and the second is the file name. (The use case is including the assignment you are solving, which explains the variable names ;))

assignment:
  include: 1 
  file: "assignment/task_2.pdf"

I have the following code in the template, using the package pdfpages

%debug: include assignment? $assignment.include$ $assignment.file$ 

$if(assignment.include)$
% include assignment seems on?
\includepdf[pages=-]{$assignment.file$}
$else$
%include assignment was off?
$endif$

However, if the filename contains an underscore, as shown above, the intermediate .tex file has it escaped, like this:

assignment/task\_2.pdf

The conversion to PDF then fails.

It works fine without underscores.

I have not yet found out how to either unescape this in latex (suboptimal I'd say) or (better) to not have pandoc escape this in the first place.

Any ideas? Is it possible?

I have tried to put the variable value into quotes, double quotes, ticks (` )... to no avail.

Thanks

@jgm
Owner

jgm commented Oct 31, 2021

There's an easy solution in your case @aslmx :

assignment:
  include: 1 
  file:  '`assignment/task_2.pdf`{=latex}'

(This is the "raw attribute" and will cause the content to be passed to LaTeX unmodified.)
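
If such metadata values are generated by a script, the raw-attribute wrapping can be produced mechanically. The following is a minimal Python sketch; `raw_inline` and `yaml_single_quoted` are hypothetical helper names, and the sketch deliberately rejects values containing a backtick rather than guessing at how longer delimiters should be handled.

```python
def raw_inline(text: str, fmt: str = "latex") -> str:
    """Wrap `text` as a pandoc raw inline, e.g. `path`{=latex}."""
    if "`" in text:
        # Assumption: keep the sketch simple and refuse backticks outright.
        raise ValueError("text containing a backtick is not handled here")
    return "`" + text + "`{=" + fmt + "}"

def yaml_single_quoted(value: str) -> str:
    """Quote a value as a YAML single-quoted scalar ('' escapes a quote)."""
    return "'" + value.replace("'", "''") + "'"

path = "assignment/task_2.pdf"
print("assignment:")
print("  include: 1")
print("  file: " + yaml_single_quoted(raw_inline(path)))
```

The single quotes on the YAML side keep the backticks and braces from being interpreted by the YAML parser, so pandoc sees the raw inline unchanged.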

@aslmx

aslmx commented Nov 2, 2021

This is the "raw attribute" and will cause the content to be passed to LaTeX unmodified.

Thanks. I was looking for something like this. I will try it and report back if it does not work, but I assume it works fine.

thanks & br

@jgm jgm closed this as completed Nov 2, 2021
aslmx pushed a commit to aslmx/pandoc-wbh-template that referenced this issue Nov 3, 2021
kreativmonkey pushed a commit to wbh-community/pandoc-wbh-template that referenced this issue Nov 3, 2021
* first draft of includepdf

* beispiel.md: added comment for skipfirstpage

* integrate changes as suggested in pandoc issue: jgm/pandoc#6259 (comment)

* Modified readme.md to cater for the possibility to include ranges from PDF files

* Forgot the exmaple block at begin of README.md - updated it m(

* typo inf README.md - updated it m(

Co-authored-by: Sebastian / sebbo <sebastian@1337lounge.de>
@jeffkimbrel

jeffkimbrel commented Mar 8, 2023

In case anyone finds this useful: it wasn't immediately obvious to me how to get @jayrobwilliams's solution to work for multiline content, as I was initially trying to add the {=html} after the last three backticks. Adding it after the first three backticks, as though you were specifying a language for syntax highlighting, works for me and keeps the lines separate.

For example, a multiline quote written as normal markdown gets collapsed to one line...

> line 1
> line 2

turns to

> line1 line2

But this works correctly after pandoc conversion...

```{=html}
> test 1
> test 2
```
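
When many such passages need protecting, the wrapping can be scripted. Below is a small Python sketch; `raw_block` is a hypothetical helper, and choosing a fence longer than any backtick run in the content is a precaution assumed here, not something stated in this thread.

```python
import re

def raw_block(text: str, fmt: str = "html") -> str:
    """Wrap multi-line text in a fenced raw block with the attribute on the opening fence."""
    # Use a fence longer than any run of backticks inside the content.
    longest = max((len(run) for run in re.findall(r"`+", text)), default=0)
    fence = "`" * max(3, longest + 1)
    body = text.rstrip("\n")
    return fence + "{=" + fmt + "}\n" + body + "\n" + fence + "\n"

print(raw_block("> test 1\n> test 2"), end="")
```

The attribute goes on the opening fence only, which is the placement that the comment above reports as working.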

@W1Real

W1Real commented Aug 21, 2023

For anyone looking for a simple compromise or workaround: if your goal is just readability, you can use rst (reStructuredText). Be warned that someone mentioned it doesn't deal well with links; I haven't tested links. For a plain .docx (text only, no formatting) it worked well for me.

Shout out to this StackOverflow answer to the question Pandoc Markdown to Plain Text Formatting: https://stackoverflow.com/a/61622727
