Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode incorrectly escaped on PyPy 7.3.1 #176

Closed
ShadowJonathan opened this issue Jan 30, 2021 · 26 comments · Fixed by #184
Closed

Unicode incorrectly escaped on PyPy 7.3.1 #176

ShadowJonathan opened this issue Jan 30, 2021 · 26 comments · Fixed by #184

Comments

@ShadowJonathan
Copy link

ShadowJonathan commented Jan 30, 2021

On PyPy (specifically, the wheel with _speedups.pypy36-pp73-x86_64-linux-gnu.so), when using markupsafe.escape(), this unicode below is incorrectly escaped;

Input: 'https://x?<ab c>&q"+%3D%2B"="fö%26=o"'

_native:
'https://x?&lt;ab c&gt;&amp;q&#34;+%3D%2B&#34;=&#34;fö%26=o&#34;'
_speedups (on pypy):
'https://x?&lt;ab c&gt;&amp;q&#34;+%3D%2B&#34;=&#34;fö%26=o'

This was caught while testing PyPy support for synapse, in matrix-org/synapse#9123, in which one of the tests hooked behind here.

I am no expert in C, but I have a feeling one of the Unicode APIs used in _speedups.c is either being used incorrectly, or has an implementation mismatch compared to cPython

@davidism
Copy link
Member

Going to need some help on this one, interactions of C and PyPy are beyond me. @mattip is there a PyPy person who might be able to spot something?

@mattip
Copy link

mattip commented Jan 30, 2021

I thought the C-extension is not used on PyPy? I will take a look.

@davidism
Copy link
Member

Wheels can now be built against PyPy, I actually tested it a while ago and only just pushed them since I was doing the other new builds.

I guess I'll have to delete the 1.1.1 wheels, I can't replace them even if we fix the issue, I'd have to make a 1.1.2 release. I probably won't do that, I'll just leave PyPy wheels disabled for 2.0 until we figure it out.

@davidism
Copy link
Member

That said, @ShadowJonathan can you test with a build from master? 2.0 will remove Python 2 support, which removed about half the C code, so it should be much easier to reason about. Just want to make sure it's not already fixed somehow.

@ShadowJonathan
Copy link
Author

I probably won't do that, I'll just leave PyPy wheels disabled for 2.0 until we figure it out.

I recommend just disabling building extensions on PyPy, instead of disabling wheels, as installing wheels is much faster than building from source (especially if that source tries to build extensions)

That said, @ShadowJonathan can you test with a build from master? 2.0 will remove Python 2 support, which removed about half the C code, so it should be much easier to reason about. Just want to make sure it's not already fixed somehow.

Currently trying that, but keep in mind that markupsafe is required by jinja2 instead of us directly, and it's possible they dont support version 2 (yet).

I thought the C-extension is not used on PyPy? I will take a look.

@mattip there's a .so file in the pypy wheel, so yeah, it is.

@ShadowJonathan
Copy link
Author

Inserting "markupsafe @ git+https://github.com/pallets/markupsafe" does not build the wheel, as per the setup.py's internal instructions (not to build on pypy). It falls back to _native, which then just works as intended.

@mattip
Copy link

mattip commented Jan 30, 2021

weird, when I do pypy setup.py bdist_wheel I get a pure-python wheel. I wonder where the .so is coming from?

@ShadowJonathan
Copy link
Author

setup.py has

if os.environ.get("CIBUILDWHEEL", "0") == "1":
    run_setup(True)

which gives cibuildwheel a carta blanca to build any kind of wheel it wants, so the released wheels contain .so files.

@davidism
Copy link
Member

You have to set CIBUILDWHEEL=1 right now, otherwise PyPy is ignored from setup.py. I was actually planning to remove that ignore before this issue was opened.

@ShadowJonathan
Copy link
Author

@davidism i just did that, while installing from master, but the issue persists.

I think disabling building extensions on pypy at all would help a lot.

@davidism
Copy link
Member

Yeah, I'll just adjust it for now so the extension isn't built directly or with cibuildwheel on PyPy. Unfortunately I'll just have to delete the 1.1.1 wheels and people will still have to install from the sdist in that case, but that was already the case until yesterday anyway.

@mattip
Copy link

mattip commented Jan 30, 2021

Of course, the wheel building the so is no excuse, PyPy should not fail like that. But it would be more performant in pure python

@davidism
Copy link
Member

Interesting, if you think PyPy wouldn't benefit from the C extension anyway then that's another reason to disable it for now. I'm also interested in, but don't know much about, the limited C API and HPy, so I'm happy to revisit it later if someone figures those out.

@mattip
Copy link

mattip commented Jan 30, 2021

HPy still hasn't released an alpha version, so it might be a while, but it is good to know there is interest. Thanks for trying to support PyPy, ping me if you need more help. I am trying to reproduce the failure from the top of this issue, will get back when I have a fix.

@davidism
Copy link
Member

davidism commented Jan 30, 2021

OK, I deleted the 1.1.1 pp36 and pp37 wheels, PyPy users will have to install from the sdist until 2.0 (which was the case anyway). I'll disable building the extension during cibuildwheel later, leaving this issue open for that. I don't think I'll be able to mark just the PyPy wheels "pure Python", so future versions will still have platform tags even though they don't have the C extension and are equivalent.

@mattip
Copy link

mattip commented Jan 31, 2021

It turns out the synapse issue mentioned above was very likely due to using an older version of pypy3.6 (released in April 2020), the 7.3.3 version released in November 2020 and available from the downloads, conda-forge, github actions, cibuildwheel, and multibuild correctly parses the input in the issue. Could someone change the title to add "... on PyPy 7.3.1" ?

@davidism
Copy link
Member

davidism commented Jan 31, 2021

As far as I know I just let cibuildwheel setup PyPy, and you say cibuildwheel uses the correct version, so I'm not sure what's going on. cc @joerick again :-)

@davidism davidism changed the title Unicode incorrectly escaped Unicode incorrectly escaped on PyPy 7.3.1 Jan 31, 2021
@davidism
Copy link
Member

davidism commented Jan 31, 2021

Although that might be better as an issue elsewhere, seems like for MarkupSafe the answer is to omit the C extension regardless since you said it's not actually faster on PyPy.

@mattip
Copy link

mattip commented Jan 31, 2021

Sorry, I wasn't clear. The issue here is that the reporter tried to build and test synapse locally with an older version of PyPy.

@davidism
Copy link
Member

Ah, I see, but now I'm not clear if there was ever an issue with MarkupSafe. Should I still avoid building the C extension for PyPy wheels?

@mattip
Copy link

mattip commented Jan 31, 2021

Both types of wheels should work properly for current versions of PyPy.

As for the c-extension: I don't know if there are real-world benchmarks for MarkupSafe. That would be the definitive test whether (additional overhead of calling the C-API + savings by using C) beats the ability of the PyPy JIT to speed up the pure-python code. My intuition, based on the many templating benchmarks at speed.pypy.org, would say that the pure-python code should be faster, but intuition and benchmarks often clash.

@davidism
Copy link
Member

davidism commented Feb 7, 2021

I ran a benchmark of native Python vs compiled speedups on PyPy. Seems we should stick with the native version for now, speedups were about twice as slow. Note that these numbers are for ridiculously large strings, most use cases will probably be reasonably fast no matter what Python implementation or speedups are used. Here's the results on PyPy 3.7 7.3.3:

.........
native: Mean +- std dev: 341 ms +- 4 ms
.........
speedups: Mean +- std dev: 770 ms +- 6 ms

PyPy was still much slower than CPython. Maybe this just isn't a task PyPy is suited to speeding up, although native escape is pretty much just 5 str.replace calls. Or maybe I got something wrong with the benchmark. I included the code I used below. Here's the results on CPython 3.7:

.....................
native: Mean +- std dev: 78.3 ms +- 0.8 ms
.....................
speedups: Mean +- std dev: 42.8 ms +- 0.7 ms

I used the following as a benchmark. Generated 10,000 lines of random UTF-8 characters, each between 1000 to 10,000 characters, wrote to a file to keep the same input across runs.

import random

printable = tuple(c for c in (chr(i) for i in range(32, 0x110000)) if c.isprintable() and not c.isspace())

with open("lines.txt", "w", encoding="utf8") as f:
    for _ in range(10000):
        f.write("".join(random.choices(printable, k=random.randint(1000, 10000))))
        f.write("\n")

Used pyperf to call _native.escape or _speedups.escape on each line.

import pyperf

from markupsafe._native import escape as native_escape
from markupsafe._speedups import escape as speedups_escape

with open("lines.txt", encoding="utf8") as f:
    lines = f.readlines()

runner = pyperf.Runner()
runner.timeit(
    name="MarkupSafe native",
    globals={"escape": native_escape, "lines": lines},
    stmt="for line in lines: escape(line)",
)
runner.timeit(
    name="MarkupSafe speedups",
    globals={"escape": speedups_escape, "lines": lines},
    stmt="for line in lines: escape(line)",
)

@davidism
Copy link
Member

davidism commented Feb 7, 2021

@mattip just want to make sure you don't spot something I missed in the benchmark before I leave this closed.

@mattip
Copy link

mattip commented Feb 7, 2021

Thanks, I converted that to a self-contained gist and we are trying to reason about how to speed up the chain-of-replace on PyPy. It seems the JIT cannot do anything with the code, and so you see the pypy performance as if it has no JIT which is about 2x from CPython. The c code does something very different: it does all the replacing at once instead of chaining across the string 4 times.

@davidism
Copy link
Member

davidism commented Feb 7, 2021

Yeah, I was very surprised at how slow the C code was though, I'm assuming that's the overhead of calling the C API? Anyway, thanks for continuing to look into this. If you find anything, maybe starting a new issue would be better, this one sort of got off topic. :-)

@ShadowJonathan
Copy link
Author

It's the overhead of the C API, there's a lot that goes into pypy emulating CPython's C API, which slows down the language a lot

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 23, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants