bug 1453796: fix generation of HTML when filtering noincludes #4737

escattone · 2018-04-17T00:14:20Z

This PR fixes Bugzilla #1453796. The bug occurs when the include parameter is used during document requests, for example when using the page macro. The include parameter triggers a removal of all elements with class="noinclude", which is done using pyquery, after which it re-generates the filtered HTML. During the re-generation of the HTML, any empty HTML elements (e.g., <iframe ...></iframe>) are output as self-closing elements (e.g., <iframe ... />), which in the case of iframe, is illegal. When loaded, the self-closing iframe seems to be interpreted as simply an opening tag, and so swallows all subsequent HTML.

For example, https://developer.mozilla.org/en-US/docs/Web/CSS/perspective?raw=1&macros=1&section=Setting_perspective works fine, but when the include=1 is added, the failure appears (https://developer.mozilla.org/en-US/docs/Web/CSS/perspective?raw=1&macros=1&include=1&section=Setting_perspective).

The fix is to call doc.html(method='html') instead of doc.html() in the kuma.wiki.content.filter_out_noinclude function (see http://pyquery.readthedocs.io/en/latest/api.html#pyquery.pyquery.PyQuery.html).

This does not appear to be due to a pyquery package version update. I'm not sure why and when this bug appeared. Could it be that the iframe elements used to be populated, but then changed to being empty?

a2sheppy · 2018-04-17T19:25:58Z

Looking forward to this landing; it will let me finish up a few things. Great job!

jwhitlock · 2018-04-19T19:24:42Z

I dug in to see when this might have broke, and I can't identify a library update that would have changed this behavior. I tested a few configurations going back 2-3 years ago, and the issue persists.

pyquery uses lxml.etree.tostring() to implement .html(), and that outputs XHTML by default, where <iframe /> is valid. However, it's not valid HTML 5, and using a self-closing tag will cause issues in Firefox and other browsers (https://stackoverflow.com/a/27545787/10612). Passing method='html' is the way to get HTML instead of XML (see http://lxml.de/tutorial.html#serialisation). The HTML option was added in lxml 2.0, released in 2008.

Every year, I find something that makes me wish XHTML won.

There's a few more instances where .html() is called in the code. Most are in tests, but one in particular is in the code that extracts SEO summaries. @escattone, should we include fixing those with this PR, or save it for a new one?

jwhitlock

👍 I was able to reproduce the issue locally, and this fixed it. I think more work needs to be done for other instances of .html, but I'm happy to do that in a new PR.

jwhitlock · 2018-04-19T19:26:24Z

kuma/wiki/content.py

+    # to be closed in the opening tag. For example, without "method='html'"
+    # the output would be "<iframe/>" instead of the correct
+    # "<iframe></iframe>".
+    return doc.html(method='html')


If we decide to fix all the .html() calls, this might be done as a utility function such as to_html, rather than adding a 5-line comment to every call.

jwhitlock · 2018-04-19T19:27:42Z

kuma/wiki/tests/test_views.py

@@ -2710,6 +2710,7 @@ def test_raw_include_option(self):
              <dd>Type: <em>integer</em></dd>
              <dd>Przykłady 例 예제 示例</dd>
            </dl>
+            <p><iframe></iframe></p>


I was going to give feedback that the <iframe> test should be in it's own test, but 1) I don't want to make you convert to pytest as well, and 2) pyquery will return None if you just serialize an empty <iframe>.

a2sheppy · 2018-04-19T20:55:40Z

👍

codecov-io · 2018-04-20T19:22:01Z

Codecov Report

Merging #4737 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master    #4737   +/-   ##
=======================================
  Coverage   95.83%   95.83%           
=======================================
  Files         270      270           
  Lines       24613    24613           
  Branches     1745     1745           
=======================================
  Hits        23589    23589           
  Misses        814      814           
  Partials      210      210

Impacted Files	Coverage Δ
kuma/wiki/tests/test_views.py	`100% <ø> (ø)`	⬆️
kuma/wiki/content.py	`95.66% <100%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4ca67e0...3d1d400. Read the comment docs.

escattone · 2018-04-20T19:30:54Z

@jwhitlock This is ready for another review. Here's what I changed:

created a to_html utility function and replaced all calls to .html() on PyQuery instances
modified tests for the wiki.document_api endpoint
totally refactored the tests for the kuma.wiki.content.get_seo_description function, including testing more cases (with the expanded testing, I found/fixed what seemed like a bug in the handling of the case when not stripping markup with multiple elements with the class seoSummary)

I should add that all new/modified tests were run with and without the fix to ensure that they detected the bug.

jwhitlock · 2018-04-20T20:59:19Z

The additional code looks good as well. Thanks @escattone

escattone requested a review from jwhitlock April 17, 2018 00:14

jwhitlock approved these changes Apr 19, 2018

View reviewed changes

escattone added 2 commits April 20, 2018 12:21

bug 1453796: fix generation of HTML when filtering noincludes

99d332c

bug 1453796: fix all other html() calls on PyQuery instances

3d1d400

escattone merged commit 8635d0f into mdn:master Apr 20, 2018

escattone deleted the fix-include-handling-for-docs-1453796 branch April 20, 2018 20:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug 1453796: fix generation of HTML when filtering noincludes #4737

bug 1453796: fix generation of HTML when filtering noincludes #4737

escattone commented Apr 17, 2018

a2sheppy commented Apr 17, 2018

jwhitlock commented Apr 19, 2018

jwhitlock left a comment

jwhitlock Apr 19, 2018

jwhitlock Apr 19, 2018

a2sheppy commented Apr 19, 2018

codecov-io commented Apr 20, 2018

escattone commented Apr 20, 2018 •

edited

jwhitlock commented Apr 20, 2018

bug 1453796: fix generation of HTML when filtering noincludes #4737

bug 1453796: fix generation of HTML when filtering noincludes #4737

Conversation

escattone commented Apr 17, 2018

a2sheppy commented Apr 17, 2018

jwhitlock commented Apr 19, 2018

jwhitlock left a comment

Choose a reason for hiding this comment

jwhitlock Apr 19, 2018

Choose a reason for hiding this comment

jwhitlock Apr 19, 2018

Choose a reason for hiding this comment

a2sheppy commented Apr 19, 2018

codecov-io commented Apr 20, 2018

Codecov Report

escattone commented Apr 20, 2018 • edited

jwhitlock commented Apr 20, 2018

escattone commented Apr 20, 2018 •

edited