Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catastrophic backtracking with large inline image #399

Open
herndlm opened this issue Mar 11, 2024 · 2 comments
Open

Catastrophic backtracking with large inline image #399

herndlm opened this issue Mar 11, 2024 · 2 comments

Comments

@herndlm
Copy link

herndlm commented Mar 11, 2024

Hi,

with both Markdown and MarkdownExtra there are issues with very long content like inline images.

$text = preg_replace_callback(
'{
(^.+?) # $1: Header text
(?:[ ]+ ' . $this->id_class_attr_catch_re . ' )? # $3 = id/class attributes
[ ]*\n(=+|-+)[ ]*\n+ # $3: Header footer
}mx',
array($this, '_doHeaders_callback_setext'), $text);
and also the simpler
$text = preg_replace_callback('{ ^(.+?)[ ]*\n(=+|-+)[ ]*\n+ }mx',
array($this, '_doHeaders_callback_setext'), $text);
lead to "Backtrack limit exhausted" errors and empty content with the default pcre.backtrack_limit of 1000000. I was about to create a re-producer, but I basically see the same issue on regex101.com and it might be simpler to play around with there: https://regex101.com/r/BcWE82/1

Do you think this is something that can be improved or is it suggested to increase pcre.backtrack_limit to deal with this? This could be the same as #149

@michelf
Copy link
Owner

michelf commented Mar 11, 2024

In general, excessive backtracking is solved by fixing the regular expressions to avoid doing unnecessary things. Backtracking is prone to result in combinatory explosions where there's no reasonable limit that would work… at best you'll just hang for a few minutes/hours/days depending on the input.

But in this case I believe the problem is that it backtracks a fixed number of time per character but since you have about 720K characters it reaches the limit quite fast, despite being constant-time. So maybe increasing the limit would work fine in this case.

Ideally we'd eliminate this unnecessary backtracking. I played with the regex you put in the regex101 link and came up with this reasonable replacement:

^((?>.+))\n(=+|-+)[ ]*\n+

or this equivalent one:

^([^\n]+)\n(=+|-+)[ ]*\n+

Neither will cause backtracking. Either one can be used as a replacement. More work is needed to improve on the one in the MarkdownExtra parser which is a bit more complex.

It's likely others regular expressions could cause excessive backtracking like this. I don't test with 720K character lines so I don't know.

@herndlm
Copy link
Author

herndlm commented Mar 12, 2024

you're right, I could make it work by increasing pcre.backtrack_limit and it seems to scale linear with the line length basically. your regexes here seem to work fine too, feel free to decide what to do. Adapting that regex sounds like a good idea, but you're right MarkdownExtra should be adapted/improved too most likely. if possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants