Should we remove the C extension? #433

Closed
davidism opened this issue Apr 17, 2024 · 7 comments
Comments

@davidism
Member

davidism commented Apr 17, 2024

It turns out that Python 3.12 is much faster at string operations than the Python that was current when MarkupSafe and its speedups were written. I ran some benchmarks on short and long strings, with and without HTML, on Python 3.8 and Python 3.12, with and without the speedups. The results show that the speedups now offer only a fractional improvement, and in some cases are slower than the native implementation. Removing the speedups would make maintenance, releases, and installs much simpler, at a performance cost that is largely offset by general Python improvements.

Benchmark script
# bench.py
import subprocess
import sys

for name, s in (
    ("short escape", '"<strong>Hello, World!</strong>"'),
    ("long escape", '"Hello, World!" * 1000'),
    ("short plain", '"Hello, World!"'),
    ("long plain", '"Hello, World!" * 1000'),
    ("long suffix", '"<strong>Hello, World!</strong>" + "x" * 100_000')
):
    for mod in "native", "speedups":
        subprocess.run(
            [
                sys.executable,
                "-m",
                "pyperf",
                "timeit",
                "--compare-to",
                ".venv38/bin/python3.8",
                "--name",
                f"{name} {mod}",
                "-s",
                f"from markupsafe._{mod} import escape\ns = {s}",
                "escape(s)",
            ]
        )
Benchmark output
$ .venv38/bin/python3.8 bench.py
.....................
short escape native: Mean +- std dev: 678 ns +- 7 ns
.....................
short escape speedups: Mean +- std dev: 487 ns +- 12 ns
.....................
long escape native: Mean +- std dev: 22.9 us +- 0.2 us
.....................
long escape speedups: Mean +- std dev: 16.1 us +- 0.2 us
.....................
short plain native: Mean +- std dev: 538 ns +- 10 ns
.....................
short plain speedups: Mean +- std dev: 414 ns +- 1 ns
.....................
long plain native: Mean +- std dev: 22.8 us +- 0.4 us
.....................
long plain speedups: Mean +- std dev: 16.1 us +- 0.4 us
.....................
long suffix native: Mean +- std dev: 176 us +- 2 us
.....................
long suffix speedups: Mean +- std dev: 180 us +- 1 us
$ .venv312/bin/python3.12 bench.py
.....................
short escape native: Mean +- std dev: 425 ns +- 4 ns
.....................
short escape speedups: Mean +- std dev: 472 ns +- 2 ns
.....................
long escape native: Mean +- std dev: 19.8 us +- 0.1 us
.....................
long escape speedups: Mean +- std dev: 15.6 us +- 0.0 us
.....................
short plain native: Mean +- std dev: 314 ns +- 1 ns
.....................
short plain speedups: Mean +- std dev: 405 ns +- 10 ns
.....................
long plain native: Mean +- std dev: 20.5 us +- 0.6 us
.....................
long plain speedups: Mean +- std dev: 15.9 us +- 0.9 us
.....................
long suffix native: Mean +- std dev: 154 us +- 2 us
.....................
long suffix speedups: Mean +- std dev: 178 us +- 3 us
name          3.8 native  3.8 speedups  3.12 native  3.12 speedups
short escape  678 ns      487 ns        425 ns       472 ns
long escape   22.9 us     16.1 us       19.8 us      15.6 us
short plain   538 ns      414 ns        314 ns       405 ns
long plain    22.8 us     16.1 us       20.5 us      15.9 us
long suffix   176 us      180 us        154 us       178 us

I couldn't go back further than 3.8 on my machine, but I vaguely remember the benchmarks being much more strongly in favor of the C extension on even older versions, on the order of 5x rather than 1.25x.

If the speedups had never existed, users today would still see a significant speedup compared to five years ago from Python improvements alone. The speedups are now only faster when many thousands of characters throughout a long string need to be escaped. For short strings, or long strings that need only minor escaping, the speedups are now slower than the Python implementation. Python continues to make performance improvements, such as the JIT compiler being added in 3.13, which may improve MarkupSafe's native performance even more.

I have a feeling that most "real" data falls into either short strings like usernames, or medium strings that are mostly non-HTML, like blog comments, in which case the speedups are not actually faster. Longer data, like blog comments, is probably being preprocessed with Markdown anyway, in which case it isn't being escaped by MarkupSafe at all.

I vaguely understand how the C extension works, but I would have trouble maintaining it if a change were needed. It also makes building and publishing much more complicated, requiring 20+ minutes to build 50+ wheels. While the C extension is fairly straightforward, and it's unlikely there's a memory issue, it does deal directly with the internals of Python's unicode representation (which is also why it can't be shipped as a single abi3 wheel). Removing it would simplify the code, the build and install process, and perhaps improve the project's perceived security.
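For context, the "native" implementation being benchmarked is essentially the __html__ protocol check followed by a chain of str.replace calls. The sketch below is a simplification rather than the exact module code; the name escape_native is only used here to distinguish it from the C version.

# Simplified sketch of the pure-Python ("native") escape path. The real
# markupsafe._native module differs in details; the point is that the hot
# loop is str.replace, which recent CPython versions keep making faster.
from markupsafe import Markup  # str subclass that marks text as already safe

def escape_native(s) -> Markup:
    # objects can provide their own safe representation via __html__
    if hasattr(s, "__html__"):
        return Markup(s.__html__())
    # "&" must be replaced first so the entities added below are not re-escaped
    return Markup(
        str(s)
        .replace("&", "&amp;")
        .replace(">", "&gt;")
        .replace("<", "&lt;")
        .replace("'", "&#39;")
        .replace('"', "&#34;")
    )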

@davidism
Member Author

@carsonburr has written an optimized version in Rust, and some initial benchmarks show it to be around 2x faster than Python or less, similar to the C version. I'm waiting to hear back about running these more detailed benchmarks to see whether it's more consistent than the C. If so, switching to Rust could be an alternative to C, although its implementation is just as complex as the C one because of the various optimization tricks it uses.

@MostAwesomeDude

Somebody's gotta say it: pure Python is easier to integrate with PyPy, for even bigger speedups. If you really want to adopt Rust, consider ensuring that your crate can bind via cffi so that PyPy compatibility is still possible.
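For what it's worth, an ABI-mode cffi binding against a Rust cdylib exporting a plain C interface could look roughly like the sketch below. The library name and the escape_html/free_escaped symbols are made up for illustration; nothing like this exists in the repo today.

# Hypothetical sketch of an ABI-mode cffi binding to a Rust cdylib. The
# library name and exported symbols are assumptions, not an existing crate.
from cffi import FFI

ffi = FFI()
# assumed Rust exports: an escaper plus a matching deallocator for its result
ffi.cdef("char *escape_html(const char *input); void free_escaped(char *ptr);")
lib = ffi.dlopen("./libmarkupsafe_rs.so")

def escape(s: str) -> str:
    raw = lib.escape_html(s.encode("utf-8"))
    try:
        return ffi.string(raw).decode("utf-8")
    finally:
        lib.free_escaped(raw)  # the Rust side owns the allocation, hand it back

On CPython, ABI-mode calls like this go through libffi and carry per-call overhead, but PyPy's JIT can trace through cffi calls, which is where the compatibility win would come from.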

@davidism
Member Author

davidism commented Apr 18, 2024

PyPy already gets a build with only the native implementation. We don't build a platform wheel for it, so it falls back to installing from sdist, which skips compiling the speedups for PyPy.

@tonybaloney
Contributor

If I run the native extension's escape() function 1 million times with debug symbols, ~40% of the time is spent checking whether the input has a __html__ attribute (if the input is a string, we should shortcut that; PyUnicode_CheckExact is a very fast call). Another ~40% is spent running Python code.

In short, only 4% of the execution time is in the compiled C code.

[screenshot: profiler breakdown of escape(), 2024-04-19]
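In Python terms, the shortcut being suggested is to check for an exact str before doing the attribute lookup, so the common case never pays for it. A rough illustration of the idea, not the actual patch; escape_inner here is only a stand-in for the real escaping routine.

# Rough Python-level illustration of the suggested fast path: exact str input
# skips the __html__ attribute lookup entirely. escape_inner is a stand-in
# for the real escaping routine, not MarkupSafe's actual code.
from markupsafe import Markup

def escape_inner(s: str) -> str:
    return (
        s.replace("&", "&amp;")
        .replace(">", "&gt;")
        .replace("<", "&lt;")
        .replace("'", "&#39;")
        .replace('"', "&#34;")
    )

def escape(s) -> Markup:
    if type(s) is str:               # the C-level analogue is PyUnicode_CheckExact
        return Markup(escape_inner(s))
    if hasattr(s, "__html__"):       # only non-str objects reach the slow lookup
        return Markup(s.__html__())
    return Markup(escape_inner(str(s)))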

@tonybaloney
Contributor

Submitted a PR with an optimisation that might help these figures.

@davidism
Member Author

davidism commented Apr 19, 2024

Great catch; that greatly improved the performance of short escape and short plain on both Python 3.8 and 3.12. long suffix is still slower with the speedups on 3.12. I wonder what changed starting in 3.10 that made the speedups slower in those cases.

Benchmark
$ python3.12 bench.py
.....................
short escape native: Mean +- std dev: 426 ns +- 2 ns
.....................
short escape speedups: Mean +- std dev: 265 ns +- 2 ns
.....................
long escape native: Mean +- std dev: 20.2 us +- 0.1 us
.....................
long escape speedups: Mean +- std dev: 15.5 us +- 0.0 us
.....................
short plain native: Mean +- std dev: 316 ns +- 2 ns
.....................
short plain speedups: Mean +- std dev: 192 ns +- 1 ns
.....................
long plain native: Mean +- std dev: 20.4 us +- 0.2 us
.....................
long plain speedups: Mean +- std dev: 15.5 us +- 0.2 us
.....................
long suffix native: Mean +- std dev: 156 us +- 1 us
.....................
long suffix speedups: Mean +- std dev: 179 us +- 1 us
$ python3.8 bench.py
.....................
short escape native: Mean +- std dev: 696 ns +- 6 ns
.....................
short escape speedups: Mean +- std dev: 366 ns +- 5 ns
.....................
long escape native: Mean +- std dev: 23.3 us +- 0.1 us
.....................
long escape speedups: Mean +- std dev: 16.2 us +- 0.3 us
.....................
short plain native: Mean +- std dev: 540 ns +- 8 ns
.....................
short plain speedups: Mean +- std dev: 287 ns +- 6 ns
.....................
long plain native: Mean +- std dev: 23.4 us +- 0.1 us
.....................
long plain speedups: Mean +- std dev: 16.2 us +- 0.2 us
.....................
long suffix native: Mean +- std dev: 180 us +- 1 us
.....................
long suffix speedups: Mean +- std dev: 184 us +- 3 us

Well, the good news is that the speedups are clearly still an improvement. The bad news is that we still have the C code and the 20-minute publish workflow.

@davidism
Member Author

While removing the C extension would make maintenance and releases easier, the speedups have now themselves been sped up and are clearly still useful. The good news is that the release is automated, so I can at least kick it off and come back rather than monitoring it. I'm still interested in a Rust implementation with close to the same performance, or perhaps in exploring mypyc or Cython, if anyone's looking for an interesting exploratory task.
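For anyone who does want to poke at the mypyc route, the build hook itself is small. A hedged sketch of a setup.py fragment follows, assuming the pure-Python module lives at src/markupsafe/_native.py; adjust the path to the actual layout.

# Hypothetical setup.py fragment: compile the pure-Python implementation with
# mypyc instead of shipping hand-written C. The module path is an assumption
# about the project layout, not a reference to an existing build config.
from setuptools import setup
from mypyc.build import mypycify

setup(
    name="markupsafe",
    ext_modules=mypycify(["src/markupsafe/_native.py"]),
)

Note that this would still produce per-platform wheels, so it trades maintaining hand-written C for maintaining generated C rather than removing the long publish workflow.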

davidism closed this as not planned on Apr 19, 2024
github-actions bot locked as resolved and limited conversation to collaborators on May 4, 2024