Should we remove the C extension? #433

Closed
davidism opened this issue Apr 17, 2024 · 7 comments
Comments

@davidism
Member

davidism commented Apr 17, 2024

It turns out that Python 3.12 is much faster at string operations than the Python that was current when MarkupSafe and its speedups were written. I ran some benchmarks on short and long strings, with and without HTML, on Python 3.8 and Python 3.12, with and without the speedups. The results show that the speedups now offer only a fractional improvement, and in some cases are slower than the native implementation. Removing the speedups would make maintenance, releases, and installs much simpler, at a performance cost that is largely offset by general Python improvements.

Benchmark script
# bench.py
import subprocess
import sys

for name, s in (
    ("short escape", '"<strong>Hello, World!</strong>"'),
    ("long escape", '"Hello, World!" * 1000'),
    ("short plain", '"Hello, World!"'),
    ("long plain", '"Hello, World!" * 1000'),
    ("long suffix", '"<strong>Hello, World!</strong>" + "x" * 100_000')
):
    for mod in "native", "speedups":
        subprocess.run(
            [
                sys.executable,
                "-m",
                "pyperf",
                "timeit",
                "--compare-to",
                ".venv38/bin/python3.8",
                "--name",
                f"{name} {mod}",
                "-s",
                f"from markupsafe._{mod} import escape\ns = {s}",
                "escape(s)",
            ]
        )
Benchmark output
$ .venv38/bin/python3.8 bench.py
.....................
short escape native: Mean +- std dev: 678 ns +- 7 ns
.....................
short escape speedups: Mean +- std dev: 487 ns +- 12 ns
.....................
long escape native: Mean +- std dev: 22.9 us +- 0.2 us
.....................
long escape speedups: Mean +- std dev: 16.1 us +- 0.2 us
.....................
short plain native: Mean +- std dev: 538 ns +- 10 ns
.....................
short plain speedups: Mean +- std dev: 414 ns +- 1 ns
.....................
long plain native: Mean +- std dev: 22.8 us +- 0.4 us
.....................
long plain speedups: Mean +- std dev: 16.1 us +- 0.4 us
.....................
long suffix native: Mean +- std dev: 176 us +- 2 us
.....................
long suffix speedups: Mean +- std dev: 180 us +- 1 us
$ .venv312/bin/python3.12 bench.py
.....................
short escape native: Mean +- std dev: 425 ns +- 4 ns
.....................
short escape speedups: Mean +- std dev: 472 ns +- 2 ns
.....................
long escape native: Mean +- std dev: 19.8 us +- 0.1 us
.....................
long escape speedups: Mean +- std dev: 15.6 us +- 0.0 us
.....................
short plain native: Mean +- std dev: 314 ns +- 1 ns
.....................
short plain speedups: Mean +- std dev: 405 ns +- 10 ns
.....................
long plain native: Mean +- std dev: 20.5 us +- 0.6 us
.....................
long plain speedups: Mean +- std dev: 15.9 us +- 0.9 us
.....................
long suffix native: Mean +- std dev: 154 us +- 2 us
.....................
long suffix speedups: Mean +- std dev: 178 us +- 3 us
name          3.8 native  3.8 speedups  3.12 native  3.12 speedups
short escape  678 ns      487 ns        425 ns       472 ns
long escape   22.9 us     16.1 us       19.8 us      15.6 us
short plain   538 ns      414 ns        314 ns       405 ns
long plain    22.8 us     16.1 us       20.5 us      15.9 us
long suffix   176 us      180 us        154 us       178 us

I couldn't go back further than 3.8 on my machine, but I vaguely remember the benchmarks being much more strongly in favor of the C extension on even older versions, on the order of 5x rather than 1.25x.

If the speedups had never existed, users today would still see a significant speedup compared to five years ago from Python improvements alone. The speedups are now only faster when many thousands of characters throughout a long string need to be escaped. For short strings, or long strings that need only minor escaping, the speedups are now slower than the Python implementation. Python continues to make performance improvements, such as the JIT compiler being added in 3.13, which may improve MarkupSafe's native performance even more.

I have a feeling that most "real" data falls into either short strings like usernames, or medium strings that are mostly non-HTML, like blog comments, in which case the speedups are not actually faster. Longer data, like blog comments, is probably being preprocessed with Markdown anyway, in which case it isn't being escaped by MarkupSafe at all.

I vaguely understand how the C extension works, but I would have trouble maintaining it if a change were needed. It also makes building and publishing much more complicated, requiring 20+ minutes to build 50+ wheels. While the C extension is fairly straightforward, and it's unlikely there's a memory issue, it does deal directly with the internals of Python's unicode representation (which is also why it can't be shipped as a single abi3 wheel). Removing it would simplify the code, the build and install process, and perhaps improve the project's perceived security.
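For context, the "native" implementation being benchmarked is essentially the __html__ protocol check followed by a chain of str.replace calls. The sketch below is a simplification rather than the exact module code; the name escape_native is only used here to distinguish it from the C version.

# Simplified sketch of the pure-Python ("native") escape path. The real
# markupsafe._native module differs in details; the point is that the hot
# loop is str.replace, which recent CPython versions keep making faster.
from markupsafe import Markup  # str subclass that marks text as already safe

def escape_native(s) -> Markup:
    # objects can provide their own safe representation via __html__
    if hasattr(s, "__html__"):
        return Markup(s.__html__())
    # "&" must be replaced first so the entities added below are not re-escaped
    return Markup(
        str(s)
        .replace("&", "&amp;")
        .replace(">", "&gt;")
        .replace("<", "&lt;")
        .replace("'", "&#39;")
        .replace('"', "&#34;")
    )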

@davidism
Member Author

@carsonburr has written an optimized version in Rust, and some initial benchmarks show it to be around 2x faster than Python or less, similar to the C version. I'm waiting to hear back about running these more detailed benchmarks to see whether it's more consistent than the C. If so, switching to Rust could be an alternative to C, although its implementation is just as complex as the C one because of the various optimization tricks it uses.

@MostAwesomeDude

Somebody's gotta say it: pure Python is easier to integrate with PyPy, for even bigger speedups. If you really want to adopt Rust, consider ensuring that your crate can bind via cffi so that PyPy compatibility is still possible.
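For what it's worth, an ABI-mode cffi binding against a Rust cdylib exporting a plain C interface could look roughly like the sketch below. The library name and the escape_html/free_escaped symbols are made up for illustration; nothing like this exists in the repo today.

# Hypothetical sketch of an ABI-mode cffi binding to a Rust cdylib. The
# library name and exported symbols are assumptions, not an existing crate.
from cffi import FFI

ffi = FFI()
# assumed Rust exports: an escaper plus a matching deallocator for its result
ffi.cdef("char *escape_html(const char *input); void free_escaped(char *ptr);")
lib = ffi.dlopen("./libmarkupsafe_rs.so")

def escape(s: str) -> str:
    raw = lib.escape_html(s.encode("utf-8"))
    try:
        return ffi.string(raw).decode("utf-8")
    finally:
        lib.free_escaped(raw)  # the Rust side owns the allocation, hand it back

On CPython, ABI-mode calls like this go through libffi and carry per-call overhead, but PyPy's JIT can trace through cffi calls, which is where the compatibility win would come from.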

@davidism
Member Author

davidism commented Apr 18, 2024

PyPy already gets a build with only the native implementation. We don't build a platform wheel for it, so it falls back to installing from sdist, which skips compiling the speedups for PyPy.

@tonybaloney
Contributor

If I run the native extension's escape() function 1 million times with debug symbols, ~40% of the time is spent checking whether the input has a __html__ attribute (if the input is a string, we should shortcut that; PyUnicode_CheckExact is a very fast call). Another ~40% is spent running Python code.

In short, only 4% of the execution time is in the compiled C code.

[screenshot: profiler breakdown of escape(), 2024-04-19]
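In Python terms, the shortcut being suggested is to check for an exact str before doing the attribute lookup, so the common case never pays for it. A rough illustration of the idea, not the actual patch; escape_inner here is only a stand-in for the real escaping routine.

# Rough Python-level illustration of the suggested fast path: exact str input
# skips the __html__ attribute lookup entirely. escape_inner is a stand-in
# for the real escaping routine, not MarkupSafe's actual code.
from markupsafe import Markup

def escape_inner(s: str) -> str:
    return (
        s.replace("&", "&amp;")
        .replace(">", "&gt;")
        .replace("<", "&lt;")
        .replace("'", "&#39;")
        .replace('"', "&#34;")
    )

def escape(s) -> Markup:
    if type(s) is str:               # the C-level analogue is PyUnicode_CheckExact
        return Markup(escape_inner(s))
    if hasattr(s, "__html__"):       # only non-str objects reach the slow lookup
        return Markup(s.__html__())
    return Markup(escape_inner(str(s)))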

@tonybaloney
Contributor

Submitted a PR with an optimisation that might help these figures.

@davidism
Member Author

davidism commented Apr 19, 2024

Great catch; that greatly improved the performance of short escape and short plain on both Python 3.8 and 3.12. long suffix is still slower with the speedups on 3.12. I wonder what changed starting in 3.10 that made the speedups slower in those cases.

Benchmark
$ python3.12 bench.py
.....................
short escape native: Mean +- std dev: 426 ns +- 2 ns
.....................
short escape speedups: Mean +- std dev: 265 ns +- 2 ns
.....................
long escape native: Mean +- std dev: 20.2 us +- 0.1 us
.....................
long escape speedups: Mean +- std dev: 15.5 us +- 0.0 us
.....................
short plain native: Mean +- std dev: 316 ns +- 2 ns
.....................
short plain speedups: Mean +- std dev: 192 ns +- 1 ns
.....................
long plain native: Mean +- std dev: 20.4 us +- 0.2 us
.....................
long plain speedups: Mean +- std dev: 15.5 us +- 0.2 us
.....................
long suffix native: Mean +- std dev: 156 us +- 1 us
.....................
long suffix speedups: Mean +- std dev: 179 us +- 1 us
$ python3.8 bench.py
.....................
short escape native: Mean +- std dev: 696 ns +- 6 ns
.....................
short escape speedups: Mean +- std dev: 366 ns +- 5 ns
.....................
long escape native: Mean +- std dev: 23.3 us +- 0.1 us
.....................
long escape speedups: Mean +- std dev: 16.2 us +- 0.3 us
.....................
short plain native: Mean +- std dev: 540 ns +- 8 ns
.....................
short plain speedups: Mean +- std dev: 287 ns +- 6 ns
.....................
long plain native: Mean +- std dev: 23.4 us +- 0.1 us
.....................
long plain speedups: Mean +- std dev: 16.2 us +- 0.2 us
.....................
long suffix native: Mean +- std dev: 180 us +- 1 us
.....................
long suffix speedups: Mean +- std dev: 184 us +- 3 us

Well, the good news is that the speedups are clearly still an improvement. The bad news is that we still have the C code and the 20-minute publish workflow.

@davidism
Member Author

While removing the C extension would make maintenance and releases easier, the speedups have now themselves been sped up and are clearly still useful. The good news is that the release is automated, so I can at least kick it off and come back rather than monitoring it. I'm still interested in a Rust implementation with close to the same performance, or perhaps in exploring mypyc or Cython, if anyone's looking for an interesting exploratory task.
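For anyone who does want to poke at the mypyc route, the build hook itself is small. A hedged sketch of a setup.py fragment follows, assuming the pure-Python module lives at src/markupsafe/_native.py; adjust the path to the actual layout.

# Hypothetical setup.py fragment: compile the pure-Python implementation with
# mypyc instead of shipping hand-written C. The module path is an assumption
# about the project layout, not a reference to an existing build config.
from setuptools import setup
from mypyc.build import mypycify

setup(
    name="markupsafe",
    ext_modules=mypycify(["src/markupsafe/_native.py"]),
)

Note that this would still produce per-platform wheels, so it trades maintaining hand-written C for maintaining generated C rather than removing the long publish workflow.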

davidism closed this as not planned on Apr 19, 2024
github-actions bot locked as resolved and limited conversation to collaborators on May 4, 2024