Make parsing of text be non-quadratic. #579

alexmv · 2024-02-27T19:49:35Z

In Python, appending to a string is not guaranteed to be constant-time: strings are documented to be immutable, so each append may copy the entire existing string. In some corner cases, CPython is able to make these operations constant-time, but appending to a string stored on an ETree object is not such a case.

This leads to parse times being quadratic in the size of the text in the input in pathological cases where parsing outputs a large number of adjacent text nodes which must be combined (e.g. HTML-escaped values). Specifically, we expect doubling the size of the input to result in approximately doubling the time to parse; instead, we observe quadratic behavior:

In [1]: import html5lib

In [2]: %timeit -n1 -r5 html5lib.parse("&lt;" * 200000)
2.99 s ± 269 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [3]: %timeit -n1 -r5 html5lib.parse("&lt;" * 400000)
6.7 s ± 242 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [4]: %timeit -n1 -r5 html5lib.parse("&lt;" * 800000)
19.5 s ± 1.48 s per loop (mean ± std. dev. of 5 runs, 1 loop each)

Switch from appending to the internal `str` to appending text to an array of text chunks, as those appends can be done in constant time. Using `bytearray` is a similar solution, but it benchmarks slightly worse because the strings must be encoded before being appended.

This noticeably improves parsing of text-heavy documents, and parse time now roughly doubles when the input doubles:

In [1]: import html5lib

In [2]: %timeit -n1 -r5 html5lib.parse("&lt;" * 200000)
2.3 s ± 373 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [3]: %timeit -n1 -r5 html5lib.parse("&lt;" * 400000)
3.85 s ± 29.7 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [4]: %timeit -n1 -r5 html5lib.parse("&lt;" * 800000)
8.04 s ± 317 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
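
The difference between the two strategies can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ChunkedText`, `build_by_concatenation`, and `build_by_chunks` are hypothetical names, and a plain `xml.etree.ElementTree.Element` stands in for the tree builder's node.

```python
import xml.etree.ElementTree as ET


class ChunkedText:
    """Hypothetical helper: collect text pieces in a list and join them once."""

    def __init__(self):
        self._chunks = []

    def append(self, data):
        # list.append is amortized constant time; no existing text is copied.
        self._chunks.append(data)

    def finish(self):
        # A single pass over all chunks, linear in the total text length.
        return "".join(self._chunks)


def build_by_concatenation(pieces):
    """Append each piece directly to the element's str attribute."""
    elem = ET.Element("p")
    elem.text = ""
    for piece in pieces:
        # Each += on the attribute may copy the whole accumulated string.
        elem.text += piece
    return elem


def build_by_chunks(pieces):
    """Accumulate pieces in a chunk list, then assign the joined str once."""
    elem = ET.Element("p")
    buf = ChunkedText()
    for piece in pieces:
        buf.append(piece)
    elem.text = buf.finish()
    return elem


# Many adjacent one-character text pieces, like the decoded "&lt;" entities.
pieces = ["<"] * 1000
slow = build_by_concatenation(pieces)
fast = build_by_chunks(pieces)
```

Both functions produce the same text. In this sketch the joined result is assigned back as a plain `str`; whether that is done eagerly, lazily, or at all is a design choice the sketch does not settle.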

Old flamegraph:

New flamegraph:


andersk · 2024-02-28T01:33:43Z

This solution can’t work, as it’s a breaking change to the public API. Before:

>>> html5lib.parse("hello")[1].text
'hello'

After:

>>> html5lib.parse("hello")[1].text
<html5lib.treebuilders.etree.TextBuffer object at 0x7ff2e31268d0>
