Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix lexer being extremely slow on large amounts of whitespace. Fixes #857 #858

Merged
merged 3 commits into from Oct 7, 2019

Conversation

@petee-d
Copy link

petee-d commented May 27, 2018

Fix for #857, where parsing whitespace in templates usually has quadratic time complexity.

It addresses two pathological regular expressions used in Jinja lexer, "root" and TOKEN_RAW_BEGIN (not illustrated in examples but suffers from the same issue). In the extremely simplified version of the root lexer regex shown below, notice whitespace may be consumed either by (.*?) or by \s*, with the latter having priority due to being greedy, but only in case the block of whitespace is followed by variable (or block/comment) opening tag with the "-" modifier (that forces left-stripping whitespace). This is a pathological case for the built-in Python re module and will have quadratic complexity due to backtracking done after consuming every single whitespace character (don't quote me on that though).

r'(.*?)(?:(?P<variable_begin>\s*{{-|{{))'

Jinja lexer does that to implement the whitespace stripping feature on the left side of tags, but this approach is extremely inefficient with built in Python re module (note the regex library doesn't suffer from this, but it is slower in general).

My approach uses .rstrip or .rfind + another regular expression to remove the whitespace inside the tokenizer.

Potential improvements

  • I found the code to be quite hacky, especially the inspection of type of tokens in method tokeniter (shouldn't it be tokenizer btw?), when I needed to add a flag for a rule to attempt l-stripping, I saw no better way to continue in that pattern if I wanted to avoid larger refactoring. Any ideas?
  • Is this project still sticking to 80 character maximum per line, as lexer.py apparently does? It makes the code very ugly and I see some modules no longer adhere to that limit. Is 120 character line maximum acceptable?
  • It feels like the code somewhat avoids private helper methods. I especially feel like my new code block for l-stripping should be in such a separate method and I will be happy to move it to one unless you are actually trying to avoid the overhead of calling methods (which in my opinion should be negligible here).

Testing and performance measurement

I tested the fix on almost 100k of different Jinja2 templates, created by users with different technical skills and used for various purposes (from templates without any jinja tags to huge personalized email templates). Parsing all templates produced the exact same parsing tree both before and after my fix. The Jinja2 test suite of course passes as well.

The table below shows the parsing times of the templates I tested the fix with, each parsing attempt before or after was added to a bucket based on rounded log2 of the parsing time. You can see the 4 extremely slow cases being fixed and a general improvement of parsing time.

Seconds | Before | After
0.00006 |    873 |  1775
0.00012 |  19690 | 21348
0.00024 |   8805 |  9151
0.00049 |   5294 |  5331
0.00098 |   3603 |  2146
0.00195 |   2002 |  4824
0.00391 |   3109 |  8404
0.00781 |   7778 |  7759
0.01563 |   8455 |  3419
0.03125 |   4889 |  2234
0.06250 |   1711 |    63
0.12500 |    177 |     3
0.25000 |     67 |     0
0.50000 |      0 |     0
1.00000 |      0 |     0
2.00000 |      0 |     0
4.00000 |      2 |     0
8.00000 |      2 |     0

I didn't add any test case. I could add a test that parses something like (' ' * 40000) and asserts the time it took didn't cross a reasonably large boundary, but I generally dislike such tests as they may be fragile to CI environment issues. Let me know what you think.

@petee-d

This comment has been minimized.

Copy link
Author

petee-d commented May 27, 2018

Note the build failed on something unrelated to my changes (Python 3.3.6 environment creation and it passes locally for me), could someone with repo access try to restart it or fix the underlying issue?

@kevin-brown kevin-brown force-pushed the exponea:857-lexer-whitespace-performance branch from 50aab7a to 7d00a40 May 6, 2019
@davidism davidism added this to the 2.11.0 milestone Oct 5, 2019
@davidism

This comment has been minimized.

Copy link
Member

davidism commented Oct 5, 2019

Sorry for letting this sit so long! I've marked this for 2.11. Unfortunately, there have been some other changes to the lexer and lstrip_blocks, and this doesn't merge cleanly. If you see this and can address this, let me know, as you understand the changes better than me.

@davidism

This comment has been minimized.

Copy link
Member

davidism commented Oct 5, 2019

I rebased, and it seems like your fix for this also addressed #748, or at least it passes the tests added by #1014. I'll check a little more.

@petee-d

This comment has been minimized.

Copy link
Author

petee-d commented Oct 6, 2019

Oh, hey there. I'm afraid I only barely remember the changes I have done almost year and a half ago, but can help if needed. I understand you already rebased this (without pushing I guess), so I won't dive into it unless you say you'd like me to.

BTW, I would be quite happy if this got merged (or otherwise fixed), our project was using a fork with my fix ever since I made it and being able to update Jinja would be nice.

@davidism

This comment has been minimized.

Copy link
Member

davidism commented Oct 6, 2019

No problem, I think I have a handle on it, I'll try to push and merge later today.

davidism added 2 commits Oct 7, 2019
@davidism

This comment has been minimized.

Copy link
Member

davidism commented Oct 7, 2019

For some reason I couldn't push my rebase to the source repo for this PR, but GitHub happily let me merge and commit to it using the online editor. 🤷‍♂

I reorganized the code a bit and added a ton of comments in case I ever need to revisit this.

@davidism davidism merged commit 562e178 into pallets:master Oct 7, 2019
10 checks passed
10 checks passed
Tests Build #20191007.2 succeeded
Details
Tests (Job Docs) Job Docs succeeded
Details
Tests (Job PyPy 3 Linux) Job PyPy 3 Linux succeeded
Details
Tests (Job Python 2.7 Linux) Job Python 2.7 Linux succeeded
Details
Tests (Job Python 2.7 Windows) Job Python 2.7 Windows succeeded
Details
Tests (Job Python 3.5 Linux) Job Python 3.5 Linux succeeded
Details
Tests (Job Python 3.6 Linux) Job Python 3.6 Linux succeeded
Details
Tests (Job Python 3.7 Linux) Job Python 3.7 Linux succeeded
Details
Tests (Job Python 3.7 Mac) Job Python 3.7 Mac succeeded
Details
Tests (Job Python 3.7 Windows) Job Python 3.7 Windows succeeded
Details
davidism added a commit that referenced this pull request Oct 7, 2019
jinja2/lexer.py Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.