Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Template parsing (lexing) can have quadratic time complexity #857

Closed
petee-d opened this issue May 27, 2018 · 0 comments
Closed

Template parsing (lexing) can have quadratic time complexity #857

petee-d opened this issue May 27, 2018 · 0 comments

Comments

@petee-d
Copy link

petee-d commented May 27, 2018

Expected Behavior

Parsing the following template should have linear complexity and be done in reasonable time.

from jinja2 import Environment
Environment().from_string(' ' * 20000)

Actual Behavior

It has quadratic complexity and takes a few seconds at 20000, a minute at 100k.

In general, it appears any large segments of whitespace (" \t\n") without jinja tags that consume whitespace will degrade template parsing performance significantly.

# This parses quickly due to "-"
Environment().from_string('{{ 0 -}}' + ' ' * 20000)
Environment().from_string(' ' * 20000 + '{{- 0 }}')
# This is very slow
Environment().from_string('{{ 0 }}' + ' ' * 20000)
Environment().from_string(' ' * 20000 + '{{ 0 }}')

Cause

I investigated the cause of the slow rendering and it lies in two pathological regular expressions used in Jinja lexer, "root" and TOKEN_RAW_BEGIN (not illustrated in examples but suffers from the same issue). In the extremely simplified version of the root lexer regex shown below, notice whitespace may be consumed either by (.*?) or by \s*, with the latter having priority due to being greedy, but only in case the block of whitespace is followed by variable (or block/comment) opening tag with the "-" modifier (that forces left-stripping whitespace). This is a pathological case for the built-in Python re module and will have quadratic complexity due to backtracking done after consuming every single whitespace character (don't quote me on that though).

r'(.*?)(?:(?P<variable_begin>\s*{{-|{{))'

Jinja lexer does that to implement the whitespace stripping feature on the left side of tags, but this approach is extremely inefficient with built in Python re module (note the regex library doesn't suffer from this, but it is slower in general).

Proposed fix

I created a fix for this issue by re-implementing the way striping whitespace on the left side of tags is done (including the lstrip_blocks environment flag) by doing this outside of regular expressions, which in my opinion simply are not the right tool for this task. I didn't change the implementation of striping whitespace to the right of tags as the "-" character is matched before any whitespace, changing the state of the regex automaton before any backtracking needs to be done.

#858

Implications

The most significant implication is Jinja being extremely slow when parsing templates with segments of whitespace in tens of thousands. Such templates should be rare, but may occur. The application where I discovered this issue has almost 100k Jinja templates created by semi-technical users and a few of them contained such large segments of whitespace, most likely added by accident.

The more frequent implication is Jinja template parsing being sub optimal, as most Jinja templates will contains blocks of whitespace in tens or hundreds of characters. The performance impact of quadratic complexity on such small scale won't be that dramatic, but I measured over 2x average improvement in template parsing speeds after applying the fix on the above mentioned almost 100k different templates. I will post a histogram in the PR description.

Environment

  • Python version: tested with 2.7, 3.6, apparently any Python version
  • Jinja version: 2.8.1, current master, apparently any Jinja version in last few years
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant