Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exponential parsing performance for nested brackets #1735

Open
jgm opened this issue Nov 3, 2014 · 9 comments
Open

Exponential parsing performance for nested brackets #1735

jgm opened this issue Nov 3, 2014 · 9 comments

Comments

@jgm
Copy link
Owner

jgm commented Nov 3, 2014

This concerns the Markdown reader.

% time python -c 'count = 10; print "[" * count; print "]" * count' | pandoc
<p>[[[[[[[[[[]]]]]]]]]]</p>
python -c 'count = 10; print "[" * count; print "]" * count'  0.02s user 0.01s system 97% cpu 0.031 total
pandoc  0.32s user 0.02s system 96% cpu 0.346 total
% time python -c 'count = 11; print "[" * count; print "]" * count' | pandoc
<p>[[[[[[[[[[[]]]]]]]]]]]</p>
python -c 'count = 11; print "[" * count; print "]" * count'  0.02s user 0.01s system 97% cpu 0.026 total
pandoc  0.66s user 0.02s system 98% cpu 0.694 total
% time python -c 'count = 12; print "[" * count; print "]" * count' | pandoc
<p>[[[[[[[[[[[[]]]]]]]]]]]]</p>
python -c 'count = 12; print "[" * count; print "]" * count'  0.02s user 0.01s system 98% cpu 0.030 total
pandoc  1.37s user 0.03s system 99% cpu 1.414 total

Parsing time doubles for each additional nesting.
See commonmark/commonmark-spec#166.

@jgm
Copy link
Owner Author

jgm commented Apr 19, 2015

Here's the root of the problem (trace added to charsInBalancedBrackets). On input [[[[a]]]]:

charsInBalancedBrackets at "[[[a]]]]\n\n\n"
charsInBalancedBrackets at "[[a]]]]\n\n\n"
charsInBalancedBrackets at "[a]]]]\n\n\n"
charsInBalancedBrackets at "a]]]]\n\n\n"
charsInBalancedBrackets at "a]]]"
charsInBalancedBrackets at "[a]]]"
charsInBalancedBrackets at "a]]]"
charsInBalancedBrackets at "a]"
charsInBalancedBrackets at "[[a]]]"
charsInBalancedBrackets at "[a]]]"
charsInBalancedBrackets at "a]]]"
charsInBalancedBrackets at "a]]"
charsInBalancedBrackets at "[a]]"
charsInBalancedBrackets at "a]]"
charsInBalancedBrackets at "a]"
charsInBalancedBrackets at "a]]]]\n\n\n"
charsInBalancedBrackets at "[a]]]]\n\n\n"
charsInBalancedBrackets at "a]]]]\n\n\n"
charsInBalancedBrackets at "a]"
charsInBalancedBrackets at "[[a]]]]\n\n\n"
charsInBalancedBrackets at "[a]]]]\n\n\n"
charsInBalancedBrackets at "a]]]]\n\n\n"
charsInBalancedBrackets at "a]]"
charsInBalancedBrackets at "[a]]"
charsInBalancedBrackets at "a]]"
charsInBalancedBrackets at "a]"
charsInBalancedBrackets at "[[[a]]]]\n\n\n"
charsInBalancedBrackets at "[[a]]]]\n\n\n"
charsInBalancedBrackets at "[a]]]]\n\n\n"
charsInBalancedBrackets at "a]]]]\n\n\n"
charsInBalancedBrackets at "a]]]"
charsInBalancedBrackets at "[a]]]"
charsInBalancedBrackets at "a]]]"
charsInBalancedBrackets at "a]"
charsInBalancedBrackets at "[[a]]]"
charsInBalancedBrackets at "[a]]]"
charsInBalancedBrackets at "a]]]"
charsInBalancedBrackets at "a]]"
charsInBalancedBrackets at "[a]]"
charsInBalancedBrackets at "a]]"
charsInBalancedBrackets at "a]"

jgm added a commit that referenced this issue Apr 20, 2015
This version should be a bit more efficient.

This doesn't help with #1735, however.
@jgm
Copy link
Owner Author

jgm commented Apr 20, 2015

More diagnostics (with a trace inside inlinesInBalancedBrackets):

%  python -c 'print "[a]"' | dist/build/pandoc/pandoc 
parsing inlines in "a"
parsing inlines in "a"
<p>[a]</p>
%  python -c 'print "[[a]]"' | dist/build/pandoc/pandoc
parsing inlines in "[a]"
parsing inlines in "a"
parsing inlines in "a"
parsing inlines in "[a]"
parsing inlines in "a"
<p>[[a]]</p>
%  time python -c 'print "[[[a]]]"' | dist/build/pandoc/pandoc
parsing inlines in "[[a]]"
parsing inlines in "a"
parsing inlines in "[a]"
parsing inlines in "a"
parsing inlines in "a"
parsing inlines in "[a]"
parsing inlines in "a"
parsing inlines in "[[a]]"
parsing inlines in "a"
parsing inlines in "[a]"
parsing inlines in "a"
<p>[[[a]]]</p>

@jgm
Copy link
Owner Author

jgm commented Apr 20, 2015

The problem is that, if we get something like

[[foo]]

we first try to parse [[foo]] as a link, which requires parsing [foo] with stateAllowLinks set to False. If this fails, we then need to parse [foo] again, this time with stateAllowLinks set to True.

All of this is completely fixed in CommonMark. We worked hard on a set of rules for links that could be parsed efficiently, and cmark can handle thousands-deep nested brackets without batting an eye. So it may be that the best solution here would be to retrofit pandoc to use something like the same algorithm. This would be a pretty large change, though, and might break things in some corner cases.

I am open to other ideas. There may be a good way of fixing this in pandoc without any substantive revisions, but I haven't thought of it.

@jgm
Copy link
Owner Author

jgm commented Apr 20, 2015

A fallback measure might be putting some kind of fixed limit on charsInBalancedBrackets (hence, on the number of nested brackets that can be contained in a link or image alt text). That should be enough to prevent this DoS vector, though it's ugly and unprincipled.

@jgm
Copy link
Owner Author

jgm commented May 14, 2015

Note; a fixed limit on depth for charsInBalancedBrackets isn't sufficient; we still get exponential growth.

@jgm jgm removed this from the 1.14 milestone May 20, 2015
jgm added a commit that referenced this issue Sep 16, 2017
Eventually we'll add `processEmphasis` and `processBracketed`
to this.

This will allow us to conform to CommonMark rules and
fix #3903 and #1735.
jgm added a commit that referenced this issue Jan 2, 2018
The rewrite is much more direct, avoiding parseFromString.
And it performs significantly better; unfortunately, parsing
time still increases exponentially.

See #1735.
@jgm jgm mentioned this issue Jun 14, 2019
9 tasks
@jgm
Copy link
Owner Author

jgm commented May 14, 2021

Similar case from #7283:

![0[(../../link0.svg){width=100%}
![1[(../../link1.svg){width=100%}
![2[(../../link2.svg){width=100%}
![3[(../../link3.svg){width=100%}
![4[(../../link4.svg){width=100%}
![5[(../../link5.svg){width=100%}
![6[(../../link6.svg){width=100%}
![7[(../../link7.svg){width=100%}
![8[(../../link8.svg){width=100%}
![9[(../../link9.svg){width=100%}

@davidmerfield
Copy link

davidmerfield commented Sep 8, 2021

I came across a further example of what I believe to be this problem, similar to #7283. Converting a sizeable Markdown document containing a large number of internal links. Two of these links were incorrectly formed, lacking a closing bracket for the link text, for example:

[thermography[(#thermography)

Input:

glossary.txt on Gist

Arguments:

+RTS -M50M -RTS -f markdown+autolink_bare_uris-simple_tables-multiline_tables-tex_math_dollars-yaml_metadata_block-implicit_figures+lists_without_preceding_blankline-blank_before_header-blank_before_blockquote-citations -t html5 --columns 1000 --no-highlight --email-obfuscation=none

Version:

pandoc 2.11.3.2
Compiled with pandoc-types 1.22, texmath 0.12.1, skylighting 0.10.2,
citeproc 0.3.0.3, ipynb 0.1.0.1

@hoijui
Copy link

hoijui commented Nov 15, 2023

An other example of this:

�[0m�[33m[WARN] �[0m

(log output with special sequences for coloring on the terminal)

@hoijui
Copy link

hoijui commented Nov 15, 2023

Would it be possible/sensible to issue a log message/warning, when reaching level 3 or so of [ depth?
That would help people a lot to find the culprit. I am sure there are 10 times more people then visible in this issue that came across this, but did not end up here.
It would also make it much easier to identify duplicate bug reports in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants