Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extra " and ' get changed to attribute names #419

Closed
M1ha-Shvn opened this issue Nov 12, 2018 · 13 comments · Fixed by #427
Closed

extra " and ' get changed to attribute names #419

M1ha-Shvn opened this issue Nov 12, 2018 · 13 comments · Fixed by #427
Labels

Comments

@M1ha-Shvn
Copy link

M1ha-Shvn commented Nov 12, 2018

Hi.
I've found strange things in href escaping with linkify:

  1. & symbol is converted to & if it is present in href link:
linkify('<a href="https://test.ru/ru/cross-promo-fw2018?utm_source=popup&utm_medium=ccc&utm_campaign=crosspromo-fw1819">')
#   '<a href="https://test.ru/ru/cross-promo-fw2018?utm_source=popup&amp;utm_medium=cc&amp;utm_campaign=crosspromo-fw1819" rel="nofollow"></a>'
  1. Other html-escaped characters (< at link end) are not converted:
linkify('<a href="https://test.ru/ru/cross-promo-fw2018?utm_source=popup&utm_medium=cc&utm_campaign=crosspromo-fw1819<">')
# '<a href="https://test.ru/ru/cross-promo-fw2018?utm_source=popup&amp;utm_medium=cc&amp;utm_campaign=crosspromo-fw1819<" rel="nofollow"></a>'
  1. Quote in link is converted to strange tag parameter ("=""):
linkify('<a href="https://test.ru/ru/cross-promo-fw2018?utm_source=popup&utm_medium=cc&utm_campaign=crosspromo-fw1819\"">')
# '<a "="" href="https://test.ru/ru/cross-promo-fw2018?utm_source=popup&amp;utm_medium=cc&amp;utm_campaign=crosspromo-fw1819" rel="nofollow"></a>'

linkify('<a href="https://test.ru/ru/cross-promo-fw2018?utm_source=popup&utm_medium=cc&utm_campaign=crosspromo-fw1819\'">')
# '<a href="https://test.ru/ru/cross-promo-fw2018?utm_source=popup&amp;utm_medium=cc&amp;utm_campaign=crosspromo-fw1819\'" rel="nofollow"></a>'

linkify('<a href="https://test.ru/ru/estcross-promo-fw2018?utm_source=popup&utm_medium=cc&utm_campaign=crosspromo-fw1819\"\'">')
# '<a \'"="" href="https://test.ru/ru/cross-promo-fw2018?utm_source=popup&amp;utm_medium=cc&amp;utm_campaign=crosspromo-fw1819" rel="nofollow"></a>'

How can I get a correct href without any characters encoded?

@willkg
Copy link
Member

willkg commented Nov 12, 2018

Hi!

The first one is intended. Urls shouldn't have ambiguous ampersands, so linkify and clean escape those.

The second one is in a quoted attribute value, and since linkify and clean both change values so that they're quoted, it doesn't need to get escaped.

The spec for both of those is here: https://html.spec.whatwg.org/multipage/syntax.html#syntax-attributes

The first and third examples in the third one are invalid HTML, so the tokenizer and parser are trying to fix the HTML to make it valid, but not doing a very good job here. That should get fixed.

Thank you for reporting this!

@willkg willkg added the linkify label Nov 12, 2018
@M1ha-Shvn
Copy link
Author

M1ha-Shvn commented Nov 13, 2018

I want to:

  1. parse_emails
  2. call target_blank, nofollow callbacks
  3. Live correct hrefs with parameters as they are provided (without escaping symbols)

Can I achieve this in any way?

@willkg
Copy link
Member

willkg commented Nov 13, 2018

The first two requirements are fine. I don't think you can strictly achieve the third requirement with Bleach because linkify will fix invalid HTML.

Why do you need linkify to not fix the HTML so that it's valid?

@M1ha-Shvn
Copy link
Author

My usecase is wrapping all links into other ones with redirect in order to monitor clicks from emails and site popups. So what I do is:

  1. Get original html and find all tags with linkify
  2. clean html, removing bad attributes and tags with linkify
  3. Replace original links in href with out.myservice.com/api_click?url=escaped(base_url)
  4. Catch out.myservice.com/api_click and unescape url
  5. Return redirect.

In case & is replaced with &amp; on step 2, I get double escaping on step 3, so I get something like %26amp%3B in escaped url. After decoding original link on step 4 and redirecting, link is broken, &amp; in url is not parsed as original link parameters.

@willkg
Copy link
Member

willkg commented Nov 14, 2018

Can you show me the code? Otherwise I'm guessing and I don't have anything to tinker with.

@willkg
Copy link
Member

willkg commented Nov 26, 2018

@M1hacka Can you show me the code you're using?

@Deimos
Copy link
Contributor

Deimos commented Dec 6, 2018

This is causing issues for me as well, now that I switched to adding the LinkifyFilter to a Cleaner() instance to do both in one step (which I have to because of the issue with linkify escaping valid tags).

Here's a pretty minimal demonstration:

import bleach

bleach.linkify("https://www.youtube.com/watch?v=SkcucKDrbOI&amp;feature=youtu.be")

Result (expected): <a href="https://www.youtube.com/watch?v=SkcucKDrbOI&amp;feature=youtu.be" rel="nofollow">https://www.youtube.com/watch?v=SkcucKDrbOI&amp;feature=youtu.be</a>

import bleach
from functools import partial

bleach_linkifier = partial(bleach.linkifier.LinkifyFilter)
cleaner = bleach.Cleaner(filters=[bleach_linkifier])
cleaner.clean("https://www.youtube.com/watch?v=SkcucKDrbOI&amp;feature=youtu.be")

Result (not expected, link stops at the &amp;): <a href="https://www.youtube.com/watch?v=SkcucKDrbOI">https://www.youtube.com/watch?v=SkcucKDrbOI</a>&amp;feature=<a href="http://youtu.be">youtu.be</a>

@Deimos
Copy link
Contributor

Deimos commented Dec 6, 2018

One more detail: it works if the &amp; in the link is just a bare &, like https://www.youtube.com/watch?v=SkcucKDrbOI&feature=youtu.be instead. In that case, the output of both methods is:

<a href="https://www.youtube.com/watch?v=SkcucKDrbOI&amp;feature=youtu.be">https://www.youtube.com/watch?v=SkcucKDrbOI&amp;feature=youtu.be</a>

(with an additional rel="nofollow" attr added by linkify() by default)

@willkg
Copy link
Member

willkg commented Dec 12, 2018

@Deimos I'm pretty sure the issue you're describing is a different problem and not this one. Per #419 (comment), this issue is about solving the third item in the description which is "quote in link is converted to strange tag parameter ("="")".

Can you write up a new issue for your &amp; problem?

@willkg
Copy link
Member

willkg commented Dec 13, 2018

I tinkered with the " issue and I think it's a bug in html5lib:

>>> from bleach._vendor import html5lib
>>> walker = html5lib.getTreeWalker('etree')
>>> ser = html5lib.serializer.HTMLSerializer(quote_attr_values='always', omit_optional_tags=False, sanitize=False)
>>> parser = html5lib.HTMLParser()
>>> dom = parser.parseFragment('<a href="http://example.com/"">')
>>> ser.render(walker(dom))
u'<a "="" href="http://example.com/"></a>'

I don't think it should be fixing the " that way. I'll write up a bug there.

@willkg willkg changed the title Linkify escapes & in hrefs extra " and ' get changed to attribute names Dec 13, 2018
@willkg
Copy link
Member

willkg commented Dec 13, 2018

I created html5lib/html5lib-python#407 in the html5lib project. I'll follow up on the problem there and then update Bleach accordingly.

@willkg
Copy link
Member

willkg commented Dec 14, 2018

So, seems like it's getting parsed correctly and there are two parse errors that come out of that. I think we'll have to add some code to nix attributes that have invalid names.

@willkg
Copy link
Member

willkg commented Dec 14, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants