Skip to content
This repository has been archived by the owner on Aug 26, 2022. It is now read-only.

bug 1453796: fix generation of HTML when filtering noincludes #4737

Merged
merged 2 commits into from Apr 20, 2018
Merged

bug 1453796: fix generation of HTML when filtering noincludes #4737

merged 2 commits into from Apr 20, 2018

Conversation

escattone
Copy link
Contributor

This PR fixes Bugzilla #1453796. The bug occurs when the include parameter is used during document requests, for example when using the page macro. The include parameter triggers a removal of all elements with class="noinclude", which is done using pyquery, after which it re-generates the filtered HTML. During the re-generation of the HTML, any empty HTML elements (e.g., <iframe ...></iframe>) are output as self-closing elements (e.g., <iframe ... />), which in the case of iframe, is illegal. When loaded, the self-closing iframe seems to be interpreted as simply an opening tag, and so swallows all subsequent HTML.

For example, https://developer.mozilla.org/en-US/docs/Web/CSS/perspective?raw=1&macros=1&section=Setting_perspective works fine, but when the include=1 is added, the failure appears (https://developer.mozilla.org/en-US/docs/Web/CSS/perspective?raw=1&macros=1&include=1&section=Setting_perspective).

The fix is to call doc.html(method='html') instead of doc.html() in the kuma.wiki.content.filter_out_noinclude function (see http://pyquery.readthedocs.io/en/latest/api.html#pyquery.pyquery.PyQuery.html).

This does not appear to be due to a pyquery package version update. I'm not sure why and when this bug appeared. Could it be that the iframe elements used to be populated, but then changed to being empty?

@a2sheppy
Copy link
Contributor

Looking forward to this landing; it will let me finish up a few things. Great job!

@jwhitlock
Copy link
Contributor

I dug in to see when this might have broke, and I can't identify a library update that would have changed this behavior. I tested a few configurations going back 2-3 years ago, and the issue persists.

pyquery uses lxml.etree.tostring() to implement .html(), and that outputs XHTML by default, where <iframe /> is valid. However, it's not valid HTML 5, and using a self-closing tag will cause issues in Firefox and other browsers (https://stackoverflow.com/a/27545787/10612). Passing method='html' is the way to get HTML instead of XML (see http://lxml.de/tutorial.html#serialisation). The HTML option was added in lxml 2.0, released in 2008.

Every year, I find something that makes me wish XHTML won.

There's a few more instances where .html() is called in the code. Most are in tests, but one in particular is in the code that extracts SEO summaries. @escattone, should we include fixing those with this PR, or save it for a new one?

Copy link
Contributor

@jwhitlock jwhitlock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I was able to reproduce the issue locally, and this fixed it. I think more work needs to be done for other instances of .html, but I'm happy to do that in a new PR.

# to be closed in the opening tag. For example, without "method='html'"
# the output would be "<iframe/>" instead of the correct
# "<iframe></iframe>".
return doc.html(method='html')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to fix all the .html() calls, this might be done as a utility function such as to_html, rather than adding a 5-line comment to every call.

@@ -2710,6 +2710,7 @@ def test_raw_include_option(self):
<dd>Type: <em>integer</em></dd>
<dd>Przykłady 例 예제 示例</dd>
</dl>
<p><iframe></iframe></p>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to give feedback that the <iframe> test should be in it's own test, but 1) I don't want to make you convert to pytest as well, and 2) pyquery will return None if you just serialize an empty <iframe>.

@a2sheppy
Copy link
Contributor

👍

@codecov-io
Copy link

Codecov Report

Merging #4737 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #4737   +/-   ##
=======================================
  Coverage   95.83%   95.83%           
=======================================
  Files         270      270           
  Lines       24613    24613           
  Branches     1745     1745           
=======================================
  Hits        23589    23589           
  Misses        814      814           
  Partials      210      210
Impacted Files Coverage Δ
kuma/wiki/tests/test_views.py 100% <ø> (ø) ⬆️
kuma/wiki/content.py 95.66% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4ca67e0...3d1d400. Read the comment docs.

@escattone
Copy link
Contributor Author

escattone commented Apr 20, 2018

@jwhitlock This is ready for another review. Here's what I changed:

  • created a to_html utility function and replaced all calls to .html() on PyQuery instances
  • modified tests for the wiki.document_api endpoint
  • totally refactored the tests for the kuma.wiki.content.get_seo_description function, including testing more cases (with the expanded testing, I found/fixed what seemed like a bug in the handling of the case when not stripping markup with multiple elements with the class seoSummary)

I should add that all new/modified tests were run with and without the fix to ensure that they detected the bug.

@escattone escattone merged commit 8635d0f into mdn:master Apr 20, 2018
@escattone escattone deleted the fix-include-handling-for-docs-1453796 branch April 20, 2018 20:24
@jwhitlock
Copy link
Contributor

The additional code looks good as well. Thanks @escattone

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants