Very slow performance with some markdown options #2730

Closed
wch opened this issue Feb 19, 2016 · 4 comments


wch commented Feb 19, 2016

When converting some files from Markdown to HTML, performance can be very slow, depending on the markdown variant and options selected. The time grows exponentially, as shown in the graph below.

For this example, I have a very basic input -- it's just raw HTML with some JSON content embedded in a <script> tag. (We're using markdown as an input format because sometimes the HTML is intermingled with markdown. But in this example, the actual content is just HTML.)

index.html:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
</head>
<body>
<script>{"x": "blah blah blah"}</script>
</body>
</html>

This is paired with a minimal template file:

$body$

And it's run through pandoc with:

pandoc index.html --from markdown_strict --output output.html --template template.html

The problem is that, with the content I have (the "blah blah blah" is replaced with a bunch of R code in a string), pandoc is extremely slow. Here's a graph of time, with 50KB, 100KB, and 150KB of text in the <script></script> tags, with various flavors of markdown. Note the log y scale:

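[Image: graph of conversion time against input size (50KB, 100KB, 150KB) for each markdown flavor, log y scale]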

For markdown_strict, the time for 50KB is 0.37 seconds; for 100KB, it's 3.1 seconds, and for 150KB, it's 25.5 seconds. If the input is a megabyte in size, the conversion time with this exponential growth rate would be about 30,000,000,000,000,000 seconds. My actual data is over two megabytes, so there would be many more zeros on there. :)
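The extrapolation above can be reproduced roughly as follows. This is only a sketch: it assumes a plain least-squares fit of log(time) against input size over the three reported timings, and it takes one megabyte as 1000 KB. The function names are made up for illustration, and the exact estimate shifts with either choice.

```python
import math

# Timings reported above for markdown_strict: input size in KB -> seconds.
timings = {50: 0.37, 100: 3.1, 150: 25.5}


def fit_log_linear(points):
    """Least-squares fit of ln(seconds) = intercept + slope * size_kb."""
    xs = list(points)
    ys = [math.log(points[x]) for x in xs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


def extrapolate(points, size_kb):
    """Predict seconds at size_kb, assuming the exponential trend continues."""
    intercept, slope = fit_log_linear(points)
    return math.exp(intercept + slope * size_kb)


intercept, slope = fit_log_linear(timings)
growth_per_50kb = math.exp(slope * 50)
estimate_1mb = extrapolate(timings, 1000)  # assumption: 1 MB = 1000 KB

print(f"growth per extra 50KB: x{growth_per_50kb:.1f}")
print(f"estimated time at 1MB: {estimate_1mb:.1e} seconds")
```

With these three data points the time multiplies by a little over 8 for every additional 50KB, which is what makes the projected time for a megabyte-sized input so absurdly large.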

In the graph, I've also compared it to markdown and commonmark, which are much faster, as well as markdown-markdown_in_html_blocks and markdown+markdown_attribute, which are just as slow as markdown_strict. I would have expected the markdown-markdown_in_html_blocks and markdown+markdown_attribute options to be faster than markdown, but the opposite appears to be true.
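A comparison like this could be scripted along the following lines. It is a minimal sketch, not the script used to produce the graph: it assumes `pandoc` is on the PATH and that `template.html` sits in the working directory, and the helper names and the 120-second timeout are hypothetical. The command line itself is the one shown earlier, with only the `--from` value swapped.

```python
import subprocess
import time

# Variants compared in the graph above.
FORMATS = [
    "markdown_strict",
    "markdown",
    "commonmark",
    "markdown-markdown_in_html_blocks",
    "markdown+markdown_attribute",
]


def build_command(input_path, fmt, template="template.html", output="output.html"):
    """Return the pandoc invocation from the report with the input format swapped in."""
    return ["pandoc", input_path, "--from", fmt, "--output", output, "--template", template]


def time_conversion(input_path, fmt, timeout=120):
    """Run pandoc once; return elapsed seconds, or None if the timeout is exceeded."""
    start = time.perf_counter()
    try:
        subprocess.run(build_command(input_path, fmt), check=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
    return time.perf_counter() - start


def time_all(input_path, formats=FORMATS):
    """Time every variant against the same input file."""
    return {fmt: time_conversion(input_path, fmt) for fmt in formats}
```

Calling `time_all("index.html")` from inside one of the example directories would give one timing per variant. The timeout is there because the slow variants may not finish in any reasonable time on the larger inputs.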

The example input files are in https://github.com/wch/pandoc-hang, with a subdirectory for each input file size. For example, the 100KB input file is in:
https://github.com/wch/pandoc-hang/tree/master/simplified-100kb

I also tried changing the specific content in the <script> tags, and that makes a big difference in speed. In my use case, it's R code in a string, but when I replace it with just blank spaces, the conversion is fast for all of those settings. So there's something about that particular content that slows it down.
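To check how much the content matters, test inputs of a given size can be generated with different filler. The sketch below reuses the HTML skeleton from index.html above. The R-like filler line is a hypothetical stand-in for the real content (which is in the linked repository), chosen only because it contains `<` characters. The function name and sizes are illustrative.

```python
import json

# Same skeleton as index.html above, with a placeholder for the JSON payload.
HTML_TEMPLATE = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
</head>
<body>
<script>{payload}</script>
</body>
</html>
"""

# Hypothetical stand-in for the R code in the real files; it contains "<".
R_LIKE_LINE = "x <- if (a < b) a else b\n"


def make_input(size_kb, line=R_LIKE_LINE):
    """Build an HTML document whose script payload holds about size_kb KB of filler."""
    target = size_kb * 1024
    repeats = target // len(line) + 1
    text = (line * repeats)[:target]
    payload = json.dumps({"x": text})  # JSON-escape quotes and newlines
    return HTML_TEMPLATE.format(payload=payload)
```

`make_input(100)` gives a document with many `<` characters inside the script tag, while `make_input(100, line=" ")` gives the same size filled with blank spaces, so the two can be run through the same pandoc command and compared.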

Owner

jgm commented Feb 19, 2016

+++ Winston Chang [Feb 19 16 09:37 ]:

> fast for all of those settings. So there's something about that
> particular content that slows it down.

Does it contain < characters?

Author

wch commented Feb 19, 2016

Yes, many of them.

jgm closed this as completed in 1534052 Feb 20, 2016
Owner

jgm commented Feb 20, 2016

Thanks for the excellent, detailed bug report. I think this commit fixes the problem (I tested on your files). But let me know if it doesn't.

jgm added a commit that referenced this issue Feb 21, 2016
This should give better performance.

See #2730.
Author

wch commented Feb 22, 2016

Great, thanks for the quick fix!

c-forster pushed a commit to c-forster/pandoc that referenced this issue Mar 4, 2016
This version avoids an exponential performance problem with `<script>` tags,
and it should be faster in general.

Closes jgm#2730.
c-forster pushed a commit to c-forster/pandoc that referenced this issue Mar 4, 2016
This should give better performance.

See jgm#2730.