update bleach to 1.4, html5lib to 0.999 (bug 968226) #1735

Merged
merged 1 commit into from Mar 12, 2014

Conversation

Projects
None yet
6 participants
Contributor

magopian commented Feb 6, 2014

Bleach 1.4 uses a new version of html5lib (0.999) which deprecated the use of
simpletree.

Two different places were using html5lib, with simpletree, so the code had to
be modified.

Because of the way bleach.linkify now works, the remove_links hack
isn't working properly, so I splitted the fonctionality in half:

  • has_links, used in forms to check if there's links (text or markup)
  • the "remove links" part is now in NoLinksMixin.clean_localized_string
    and uses a combination of bleach.clean and bleach.linkify

Also, the order of links attributes have changed, so tests needed to be
modified accordingly.

apps/translations/utils.py
- else:
- return sum(walk(node) for node in tree.childNodes)
- return walk(tree)
+ sum_ = 0
@vinyll

vinyll Feb 10, 2014

Contributor

No big deal, but a name like total would avoid the _

@magopian

magopian Feb 10, 2014

Contributor

excellent idea, will do that straight away

+
+ Return the truncated text and the characters left before the limit, if any.
+
+ """
@vinyll

vinyll Feb 10, 2014

Contributor

Is there a reason for those blank lines?

@magopian

magopian Feb 10, 2014

Contributor

I'm following http://www.python.org/dev/peps/pep-0257/#multi-line-docstrings convention, should I change that?

import html5lib
import jinja2
+def truncate_text(text, limit, killwords=False, end='...'):
+ """Return as many characters as possible without going over the limit.
@vinyll

vinyll Feb 10, 2014

Contributor

A newline after the """ to introduce multiline docstring?

@magopian

magopian Feb 10, 2014

Contributor

I'm following http://www.python.org/dev/peps/pep-0257/#multi-line-docstrings convention, should I change that?

apps/addons/tests/test_models.py
@@ -936,7 +936,7 @@ def test_newlines_attribute_link_doublequote(self):
parsed = self.newlines_helper(before)
- assert parsed.endswith('google.com" rel="nofollow">test</a>')
+ assert parsed.startswith('<a rel="nofollow" href="http://')
@diox

diox Feb 10, 2014

Member

Is it possible to write this test with an equality instead of startswith/endswith hacks ? (is the attribute order fixed ?)

@magopian

magopian Feb 10, 2014

Contributor

I think so, yes, i'll try by mocking the call to linkify_with_outgoing

@@ -420,39 +420,49 @@ def clean_nl(string):
cleaned text.
"""
- html_blocks = ['blockquote', 'ol', 'li', 'ul']
+ html_blocks = ['{http://www.w3.org/1999/xhtml}blockquote',
@cvan

cvan Feb 10, 2014

Member

why is this xhtml?

@magopian

magopian Feb 10, 2014

Contributor

no the slightest clue, sorry, seems to be like that by default, should I look into changing that, if it's possible?

@cvan

cvan Feb 11, 2014

Member

I'm just curious why you changed this

@magopian

magopian Feb 11, 2014

Contributor

ah, using etree instead of simpletree incurred quite a few changes, this is one of them. The nodes' tags are of this form in etree.

apps/amo/utils.py
+ if tree.tag in html_blocks:
+ if tree.text:
+ tree.text = tree.text.lstrip('\n')
+ if not len(tree):
@cvan

cvan Feb 10, 2014

Member

if not tree:

@magopian

magopian Feb 10, 2014

Contributor

I'm not sure that would work. I want to test if there's no children (I'll had a comment there), but tree isn't false, there's at least the root node if we're here.

apps/translations/utils.py
+ if text_length < limit:
+ return text, limit - text_length
+
+ # Explicitely add "end" in any case, as Jinja can't know we're truncating
@cvan

cvan Feb 10, 2014

Member

"Explicitly"

Member

cvan commented Feb 11, 2014

I'm very confused why there are so many changes for two library upgrades. if you can shed light on this, we can merge in. otherwise, I'm very confused and concerned.

Contributor

magopian commented Feb 11, 2014

@cvan, sorry about that, I guess my explanation in the PR summary isn't clear enough, let me try to reword it:

Upgrading bleach from 1.2 to 1.4 incurred several changes:

  • bleach now uses html5lib 0.999 instead of 0.95
    • bleach 0.999 deprecated simpletree, and its API is very very different from etree's API, so the two parts in our code where we used html5lib directly needed to change (a lot)
  • bleach.linkify works a bit differently now, so I had to
    • rework/split the remove_links functionality
    • change the order of the rel and href attributes on links, in the tests

Please let me know if this isn't clear enough, or if you need any more information

x = NoLinksTranslation(localized_string=s)
- eq_(x.__html__(), u'foo bar')
+ eq_(x.__html__(), u'a with markup')
@davidbgk

davidbgk Feb 17, 2014

Member

Do we really want to remove the text within the link? It can lead to weird sentences.

@magopian

magopian Feb 17, 2014

Contributor

The rationale is explained here: https://bugzilla.mozilla.org/show_bug.cgi?id=928128#c11

1/ forbid users to enter links at all (text or markup), at the form submission level
2/ if, for some weird reason, the link still goes through, remove it when storing into the DB (that's the part you're commenting)
3/ when displaying the content (for some descriptions that were saved before this PR), just display the link inner text (if it's markup) or the link itself (if it's only text)

@davidbgk

davidbgk Feb 20, 2014

Member

I'm confused, how do you display the link inner text in 3/ if you removed it in 2/?

@magopian

magopian Feb 20, 2014

Contributor

point 3/ is only for old content, already stored in the database. In this case, you could have links (text or markup) in the database, and you don't want to display them. However, in this case, to avoid "weird sentences" as you put it, we keep the inner text. In this case only.

Contributor

magopian commented Feb 20, 2014

is it ok if I merge this? @cvan @davidbgk ?

+ """Return True if links (text or markup) are found in the given html."""
+ # Call bleach.linkify to transform text links to real links, and add some
+ # content to the ``href`` attribute. If the result is different from the
+ # initial string, links were found.
@davidbgk

davidbgk Feb 21, 2014

Member

It looks tricky, I'm surprised there is no other way to do this.

@magopian

magopian Feb 21, 2014

Contributor

it's the simplest way I could think of, and it should be working in each and every edge case I thought about. Do you have any other ideas or suggestions?

@washort

washort Mar 6, 2014

Contributor

you could make this more clever by having add_to_href raise an exception and wrapping the linkify call in a try/except block that catches it; this would have the advantage of stopping on the first link found. I agree that invoking linkify is the best way to do this, no need to have duplicate link-detecting code.

@magopian

magopian Mar 7, 2014

Contributor

excellent idea, I've implemented it.

@@ -1,49 +1,57 @@
+import copy
@davidbgk

davidbgk Feb 21, 2014

Member

Missing new line.

@magopian

magopian Feb 21, 2014

Contributor

thanks, added ;)

+ # the last new lines from the children's tails.
+ if child.tail:
+ child.tail = child.tail.rstrip('\n')
+ parse_html(child)
@washort

washort Mar 6, 2014

Contributor

you can use tree.iter() to visit every node instead of having to recurse.

@magopian

magopian Mar 7, 2014

Contributor

As discussed, it's best if it stays like that ;)

- return sum(walk(node) for node in tree.childNodes)
- return walk(tree)
+ total = 0
+ for node in tree.getiterator(): # Traverse all the tree nodes.
@washort

washort Mar 6, 2014

Contributor

should be tree.iter()

@magopian

magopian Mar 7, 2014

Contributor

According to http://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.iter, Element.iter() is new in python2.7 (and getiterator is deprecated in python2.7).

I guess we'll have to wait for the switch to python2.7/django1.6 to replace this.

@washort

washort Mar 7, 2014

Contributor

doh, that's what I get for not inspecting 2.6 docs. never mind then :)

Contributor

washort commented Mar 6, 2014

r+wc, nice test improvements

Contributor

magopian commented Mar 7, 2014

@washort, I've implemented the suggested modifications, thanks, please let me know what you think before I merge (especially, I want to make sure I didn't misunderstand your suggestion about using tree.iter()).

Contributor

washort commented Mar 10, 2014

Looks good, r+.

magopian added a commit that referenced this pull request Mar 12, 2014

Merge pull request #1735 from magopian/update-bleach-html5lib
update bleach to 1.4, html5lib to 0.999 (bug 968226)

@magopian magopian merged commit ad79124 into mozilla:master Mar 12, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment