Make parsing of text be non-quadratic. #579

Open. Wants to merge 1 commit into base: master.

Commits on Feb 27, 2024

  1. Make parsing of text be non-quadratic.

    In Python, appending to a string is not guaranteed to be constant-time,
    since strings are documented to be immutable.  In some corner cases,
    CPython is able to make these operations constant-time, but appending
    to text stored on ETree objects is not such a case.
    
    This makes parse time quadratic in the size of the input text in
    pathological cases where parsing outputs a large number of adjacent
    text nodes which must be combined (e.g. HTML-escaped values).
    Specifically, we expect doubling the size of the input to roughly
    double the time to parse; instead, we observe quadratic behavior:
    
    ```
    In [1]: import html5lib
    
    In [2]: %timeit -n1 -r5 html5lib.parse("<" * 200000)
    2.99 s ± 269 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
    
    In [3]: %timeit -n1 -r5 html5lib.parse("<" * 400000)
    6.7 s ± 242 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
    
    In [4]: %timeit -n1 -r5 html5lib.parse("<" * 800000)
    19.5 s ± 1.48 s per loop (mean ± std. dev. of 5 runs, 1 loop each)
    ```
    
    Switch from appending to the internal `str` to appending text to an
    array of text chunks, since appends to an array can be done in
    constant time.  Using `bytearray` is a similar solution, but it
    benchmarks slightly worse because the strings must be encoded before
    being appended.
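    
    As a minimal sketch of the chunk-accumulation idea described above (not
    the code in this commit): `TextHolder`, `append_text`, and `text` are
    hypothetical names standing in for whatever object holds a node's text.
    Each piece is appended to a list, and the pieces are joined into a
    single `str` only when the text is read.
    
    ```python
    class TextHolder:
        """Hypothetical stand-in for a tree node that stores text."""
    
        def __init__(self):
            # Pieces of text, kept separate until someone asks for the whole.
            self._chunks = []
    
        def append_text(self, data):
            # list.append is amortized constant-time, unlike repeatedly
            # rebuilding an immutable str with `self.text += data`.
            self._chunks.append(data)
    
        @property
        def text(self):
            # A single join copies each character once, so the total work
            # is linear in the length of the combined text.
            return "".join(self._chunks)
    
    
    node = TextHolder()
    for _ in range(5):
        node.append_text("<")
    result = node.text
    ```
    
    The trade-off in this sketch is that every read of `text` performs a
    join, so a caller that reads the text after each append would
    reintroduce the quadratic cost.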
    
    This noticeably improves parsing of text-heavy documents; parse time
    now roughly doubles when the input size doubles:
    
    ```
    In [1]: import html5lib
    
    In [2]: %timeit -n1 -r5 html5lib.parse("<" * 200000)
    2.3 s ± 373 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
    
    In [3]: %timeit -n1 -r5 html5lib.parse("<" * 400000)
    3.85 s ± 29.7 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
    
    In [4]: %timeit -n1 -r5 html5lib.parse("<" * 800000)
    8.04 s ± 317 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
    ```
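    
    One way to read the two sets of timings is to estimate a growth
    exponent between consecutive input sizes: a value near 1 suggests
    linear scaling, a value near 2 suggests quadratic scaling.  The sketch
    below uses only the mean times reported above; the helper name
    `growth_exponents` is illustrative, and the results are rough because
    the measurements carry noticeable run-to-run variance.
    
    ```python
    import math
    
    # Mean parse times in seconds, keyed by input length, from the runs above.
    before = {200_000: 2.99, 400_000: 6.7, 800_000: 19.5}
    after = {200_000: 2.3, 400_000: 3.85, 800_000: 8.04}
    
    
    def growth_exponents(timings):
        """Estimate k in time ~ size**k for each consecutive pair of sizes."""
        sizes = sorted(timings)
        return [
            math.log(timings[b] / timings[a]) / math.log(b / a)
            for a, b in zip(sizes, sizes[1:])
        ]
    
    
    before_exp = growth_exponents(before)
    after_exp = growth_exponents(after)
    print("before:", [round(k, 2) for k in before_exp])
    print("after: ", [round(k, 2) for k in after_exp])
    ```
    
    On these numbers the exponent climbs with input size before the change
    and stays close to 1 after it.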
    alexmv committed Feb 27, 2024
    075cb7c