bug: Linkify incorrectly parses query params starting with "&para" #670

Closed
filak opened this issue Jun 2, 2022 · 3 comments · Fixed by #692

Comments

filak commented Jun 2, 2022

  • Python Version: 3.10.4
  • Bleach Version: 5.0.0

To Reproduce

from bleach import Linker
linker = Linker()
text = 'http://test.com?a=1&par=1&parameterA=2'
print(linker.linkify(text))
## prints:   <a href="http://test.com?a=1&amp;par=1¶meterA=2" rel="nofollow">http://test.com?a=1&amp;par=1¶meterA=2</a>

Expected behavior

## prints:   <a href="http://test.com?a=1&amp;par=1&amp;parameterA=2" rel="nofollow">http://test.com?a=1&amp;par=1&amp;parameterA=2</a>

Additional context

I believe this might happen somewhere in the html5lib_shim.py / BleachHTMLSerializer class:

class BleachHTMLSerializer(HTMLSerializer):

filak added the untriaged label Jun 2, 2022
willkg added the linkify and needs-your-help labels and removed the untriaged label Jun 2, 2022
Member

willkg commented Jun 2, 2022

&para is being consumed as an entity. We fixed this in clean and I think we need to fix linkify in a similar way.
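
For illustration, the same entity consumption can be reproduced with Python's standard library alone. This is a minimal sketch, not bleach's code path: it assumes only that `html.unescape` follows the HTML5 named-character-reference table, which lists some legacy names such as `para` without a trailing semicolon.

```python
import html

# Standard-library illustration only; bleach/html5lib are not involved here.
text = "http://test.com?a=1&par=1&parameterA=2"

# "&para" is a legacy entity that HTML5 still recognises without ";",
# so the prefix of "&parameterA" is decoded and "meterA=2" is left behind.
# "&par" is untouched because only "par;" (with semicolon) is an entity.
decoded = html.unescape(text)
print(decoded)
```
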

Author

filak commented Jun 2, 2022

There are more entities with the same effect, e.g. &not and &reg:

from bleach import Linker
linker = Linker()
text = 'http://test.com?a=1&notify=1&register=2'
print(linker.linkify(text))
## prints:   <a href="http://test.com?a=1¬ify=1®ister=2" rel="nofollow">http://test.com?a=1¬ify=1®ister=2</a>
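
To get a feel for how many parameter names are affected, the sketch below scans a URL for query parameters whose names begin with a semicolon-less entity. `risky_params` and `LEGACY_ENTITIES` are hypothetical names introduced here, and the entity table is the standard library's `html.entities.html5`, which may not match exactly what html5lib uses inside bleach.

```python
from html.entities import html5
from urllib.parse import urlsplit

# Entity names that the HTML5 table lists without a trailing semicolon.
LEGACY_ENTITIES = {name for name in html5 if not name.endswith(";")}


def risky_params(url):
    """Return (param_name, entity) pairs for params that follow an "&"
    and whose name starts with a semicolon-less entity name."""
    hits = []
    query = urlsplit(url).query
    names = [pair.split("=", 1)[0] for pair in query.split("&")]
    # The first parameter follows "?" rather than "&", so it is skipped.
    for name in names[1:]:
        # Try the longest prefix first, down to two characters.
        for end in range(len(name), 1, -1):
            if name[:end] in LEGACY_ENTITIES:
                hits.append((name, name[:end]))
                break
    return hits


print(risky_params("http://test.com?a=1&notify=1&register=2"))
print(risky_params("http://test.com?a=1&par=1&parameterA=2"))
```
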

willkg added this to the 5.0.2 (tentative) milestone Oct 27, 2022
Contributor

jvanasco commented Nov 9, 2022

Adding for context:

This is related to #294. The W3C calls this "fragile syntax".

IIRC, prior to the HTML5 spec the trailing semicolon for named references was NOT required, but it has been required since then. (see "Errors involving fragile syntax constructs" in the original https://dev.w3.org/html5/spec-LC/Overview.html and the current https://html.spec.whatwg.org/#syntax-errors )
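
As a small illustration of why the explicit form is robust, the sketch below uses only the standard library (not bleach) to show that once every ampersand is written as `&amp;`, decoding returns the original URL with no stray entity matches. The assumption is that `html.escape` and `html.unescape` stand in for a serializer and a parser respectively.

```python
import html

text = "http://test.com?a=1&notify=1&register=2"

# Escape "&" explicitly; quote=False leaves quote characters alone.
escaped = html.escape(text, quote=False)
print(escaped)

# Decoding the escaped form is unambiguous: each "&amp;" becomes "&",
# and the text after it is no longer preceded by a raw ampersand.
round_tripped = html.unescape(escaped)
print(round_tripped == text)
```
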
