Malformed HTML can cause pandoc to "hang" never generating output. #6247

Closed
michael-conrad opened this issue Apr 1, 2020 · 3 comments

michael-conrad commented Apr 1, 2020

~$ pandoc --version
pandoc 2.9.2.1
Compiled with pandoc-types 1.20, texmath 0.12.0.1, skylighting 0.8.3.2
Default user data directory: /home/muksihs/.local/share/pandoc or /home/muksihs/.pandoc
Copyright (C) 2006-2020 John MacFarlane
Web:  https://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.
pandoc -o 2301.md --to=commonmark --from=html 2301.html

Sample malformed HTML which causes hang:

2301.html.tar.gz

jgm changed the title from 'Malformed HTML can cause pandoc to "hange" never generating output.' to 'Malformed HTML can cause pandoc to "hang" never generating output.' on Apr 2, 2020

jgm (Owner) commented Apr 2, 2020

I tried this and it didn't hang; it completed in 6 seconds (which is slow, admittedly) with 2MB maximum memory use.

michael-conrad (Author) commented Apr 2, 2020

Sorry, that was the wrong file (I have a lot of files).
Please check this one:

0101.html.tar.gz

It is currently sitting at:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
24266 muksihs   20   0  1.000t  53432  35832 R 100.0  0.2   1:44.97 pandoc

jgm closed this as completed in 792f1a6 on Apr 2, 2020

jgm (Owner) commented Apr 2, 2020

OK, I fixed the issue; the file now takes 0.13s to convert.

However, some other issues:

  • If you use -t commonmark, you'll just get HTML tables, since commonmark doesn't have a table syntax.
  • You should ideally be able to rectify this by using -t gfm, which has pipe tables, but these tables don't render as pipe tables. The problem is the <br> elements, which require line breaks: pipe table cell content must fit on a single line. If you strip out the <br> elements with sed or perl before passing the HTML to pandoc, you should get good results.
  • -t markdown will give you decent pandoc markdown tables, but they won't be GitHub-compatible, if that's important to you.
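
The <br> workaround from the second bullet could look roughly like this. It is a sketch rather than something posted in the thread: the function name strip_br and the choice to replace each <br> with a single space are assumptions, and it only handles <br> tags that fit on one line.

```shell
# strip_br (hypothetical helper): read HTML on stdin, write it to stdout
# with every <br>, <br/>, <br /> or <BR ...> tag replaced by one space.
# Assumption: a space is an acceptable substitute for the line break.
strip_br() {
  # [bB][rR] matches the tag name in either case without relying on
  # GNU-only flags; the optional group allows attributes or " /".
  sed -E 's/<[bB][rR]([[:space:]][^>]*)?\/?>/ /g'
}
```

Something like `strip_br < 0101.html | pandoc --from=html --to=gfm -o 0101.md` would then feed the cleaned HTML to pandoc. Whether every table comes out as a pipe table still depends on the rest of the markup.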
