
Escape <html> elements before passing to Nokogiri::HTML::DocumentFragment.parse to avoid truncation #49

Closed
adelcambre wants to merge 1 commit

3 participants

@adelcambre

No description provided.

@adelcambre referenced this pull request in github/markup:

Filters truncate everything after a </html> #191 (Closed)

@rtomayko
Collaborator
rtomayko commented

@adelcambre Yeah, I'm down to hold off. It's definitely an edge case, but the truncation behavior seems fairly wrong as well. I'm not sure there is much to be done that isn't as hacky or worse.

@vmg
Collaborator
vmg commented

This is obviously a bug in HTMLSanitizer. It shouldn't be truncating. Fix the issue on the Sanitizer gem instead of in our pipeline -- that's the whole point of the Sanitizer gem after all.
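
(For reference, a minimal reproduction of the truncation at the filter level -- a sketch assuming the filter constant is HTML::Pipeline::SanitizationFilter, as in this gem's tests; the "before" result follows from the fragment-parsing behavior shown later in this thread:)

require "html/pipeline"

# The stray </html> makes the underlying fragment parse drop everything after
# it, before the sanitizer's whitelist is even applied.
HTML::Pipeline::SanitizationFilter.call("before</html>after").to_s
# => "before"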

@vmg
Collaborator
vmg commented

In that case, let's fix it in LibXML. We can't add an extra escaping pass to every pipeline just because our XML parser is broken. That's a no-go.

I'm going to take a look at Nokogiri and see if the bug is at the C level or what.

@adelcambre

Using DocumentFragment.parse:

irb(main):002:0> p Nokogiri::HTML::DocumentFragment.parse("before</html>/after").to_s
"before"

Using Document.parse:

irb(main):036:0> puts Nokogiri::HTML::Document.parse("before</html>after").to_s
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body><p>before</p></body>
<html><p>after</p></html>
</html>
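
(For illustration, a sketch of the escaping approach the patch below takes: escape only the html element with CGI.escapeElement from Ruby's standard library, then hand the result to the fragment parser. The outputs shown are expectations rather than a captured session, but they line up with what the new test asserts:)

irb(main):003:0> require "cgi"
irb(main):004:0> escaped = CGI.escapeElement("before</html>after", "html")
irb(main):005:0> p escaped
"before&lt;/html&gt;after"
irb(main):006:0> p Nokogiri::HTML::DocumentFragment.parse(escaped).to_s
"before&lt;/html&gt;after"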
@vmg
Collaborator
vmg commented

Yeah, I've just found that out the hard way. In all fairness, Nokogiri is not at fault here. If you have an unmatched closing </html> tag in your document, it's simply not valid XML or HTML. The fact that it's being parsed at all is a miracle -- there's no reliable way to "close" that tag.

I'd lean towards WONTFIX on this.

@vmg
Collaborator
vmg commented

Quote from the Nokogiri mailing list:

libxml generally thinks of a "fragment" as not having either <html> or <body> elements

@adelcambre

Yep, I'm happy with the investigation at this point. WONTFIX is fine with me; I'll report back to the original user that this just isn't supported.

adelcambre closed this
@vmg
Collaborator
vmg commented

Again, I don't see why truncating everything after an unmatched </html> is somehow less valid than dropping the tag altogether. I wouldn't change this behavior. It seems safe and reliable right now: that unmatched HTML tag doesn't belong anywhere.

@vmg
Collaborator
vmg commented

...Well that was easy. :p

Thanks @adelcambre! Can you give me some insight into what the original bug was? Like, what did the user want us to support?

Commits on Apr 3, 2013
  1. @adelcambre

     Escape <html> elements before passing to Nokogiri::HTML::DocumentFragment.parse to avoid truncation

     adelcambre authored
lib/html/pipeline.rb

@@ -1,6 +1,7 @@
 require "nokogiri"
 require "active_support/xml_mini/nokogiri" # convert Documents to hashes
 require "escape_utils"
+require "cgi"

 module HTML
   # GitHub HTML processing filters and utilities. This module includes a small
@@ -51,7 +52,8 @@ class Pipeline
     def self.parse(document_or_html)
       document_or_html ||= ''
       if document_or_html.is_a?(String)
-        DocumentFragment.parse(document_or_html)
+        escaped = CGI::escapeElement(document_or_html, ['html'])
+        DocumentFragment.parse(escaped)
       else
         document_or_html
       end
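
(Side note on CGI::escapeElement, since it may be unfamiliar: it escapes only the tags named in its argument list, so the rest of the markup still reaches the sanitizer untouched. A quick illustrative check -- the output shown is an expectation based on CGI's documented behavior, not something captured from this thread:)

irb(main):007:0> require "cgi"
irb(main):008:0> p CGI.escapeElement("<p>hi</p></html>bye", "html")
"<p>hi</p>&lt;/html&gt;bye"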
test/html/pipeline/sanitization_filter_test.rb

@@ -44,4 +44,10 @@ def test_github_specific_protocols_are_not_removed
     stuff = '<a href="github-windows://spillthelog">Spill this yo</a> and so on'
     assert_equal stuff, SanitizationFilter.call(stuff).to_s
   end
+
+  def test_html_tags_arent_truncated
+    stuff = "a\n<p>b</p>\n</html>\nc"
+    html = SanitizationFilter.call(stuff).to_s
+    assert_equal "a\n<p>b</p>\n&lt;/html&gt;\nc", html
+  end
 end