Don't strip html, body, and head tags #37

benbalter · 2016-03-11T15:54:53Z

benbalter · 2016-03-11T16:38:28Z

Alright, @jekyll/ecosystem, I could use some smart thinking here. The problem in jekyll/jekyll#4648 is that HTML Pipeline expects an HTML fragment, but the post_render hook is passing it a full-fledged HTML document. The culprit, I suspect, is this:

          # This is a horrible hack, but I don't care
          if tags.strip =~ /^<body/i
            path = "/html/body"
          else
            path = "/html/body/node()"
          end

That means that for HTML documents (e.g., any page or doc with a layout), we need to parse the document ourselves, pass the body fragment to the pipeline, emojify it, and then swap the result back in for the body of the already-parsed document.

Sounds simple, right? Wrong. If you've ever used Nokogiri, you know it loves to do two things:

Mangle your content by adding extra tags that are technically correct but were not there originally, like <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">. (Compare the layout in this PR to the output fixture.)
Scream 🔴y 🔪 if your HTML isn't 100% absolutely perfect (see Maruku)

So that leaves us with a few options:

Find some other way to avoid processing @mentions and emoji inside code blocks (regex?)
Try to minimize the damage by e.g., only using the document-body-replace method on pages with code blocks, otherwise using straight regex
Some other trickery (?)
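
For the regex option, a minimal sketch of the idea might look like the following. It walks the HTML once, passes `<pre>` and `<code>` elements through untouched, and substitutes emoji shortcodes only in the text outside them. `EMOJI`, `CODE_OR_SHORTCODE`, and `emojify_outside_code` are hypothetical names for illustration, not part of jemoji or HTML::Pipeline, and the pattern assumes well-formed code elements.

```ruby
# Hypothetical shortcode table for illustration only.
EMOJI = { "smile" => "😄", "+1" => "👍" }.freeze

# Group 1: a whole <pre>...</pre> or <code>...</code> element.
# Group 3: the name inside a :shortcode: found outside such an element.
CODE_OR_SHORTCODE = %r{(<(pre|code)\b[^>]*>.*?</\2>)|:([a-z0-9_+-]+):}mi

def emojify_outside_code(html)
  html.gsub(CODE_OR_SHORTCODE) do
    code_block = Regexp.last_match(1)
    name       = Regexp.last_match(3)
    # Code elements are returned verbatim; unknown shortcodes are left as-is.
    code_block || EMOJI.fetch(name, Regexp.last_match(0))
  end
end
```

Because the code-element alternative is tried first at each position, a shortcode inside `<pre>` or `<code>` is consumed as part of the element and never reaches the lookup. This avoids an HTML parser entirely, at the cost of being fooled by malformed or unusually nested markup.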

Thoughts? I'm going to go ahead and downgrade GitHub Pages in the interim, to give us the time to find the right solution here.

envygeeks · 2016-03-11T17:15:25Z

@benbalter I don't see the problem with using Nokogiri in a plugin.

parkr · 2016-03-11T22:40:08Z

@benbalter did you try changing this to a :pre_render plugin and setting doc.content instead of doc.output? That might help a significant number of cases where the doc doesn't have the <html><head>...

parkr · 2016-03-16T22:27:49Z

Fixes #36, too.

Reference issue: jekyll/jemoji#37

parkr · 2016-03-17T19:03:08Z

lib/jemoji.rb

+        if doc.output =~ /<body/
+          parsed_doc = Nokogiri::HTML::Document.parse(doc.output)
+          body       = parsed_doc.at_css('body')
+          body.replace filter_with_emoji(src).call(body.to_html)[:output]


Let's remove this .to_html – from what I could see in the docs, it can operate on the fragment, too.

I think we're going to have to pass it as a string... body at that point is a Nokogiri::XML::Element, not a document, and thus has no parent method (and errors out on has_ancestor?).

Ok. Tested with the jekyllrb.com site and it strips the <body> tag classes. Is there a way to preserve those?

benbalter · 2016-03-17T19:06:29Z

did you try changing this to a :pre_render plugin and setting doc.content instead of doc.output? That might help a significant number of cases where the doc doesn't have the <html><head>...

AFAIK, that would be Markdown at that point, not HTML, meaning we couldn't parse to determine if a node was inside code or pre tags.

envygeeks · 2016-03-18T01:07:24Z

We use Kramdown, so why don't we get a bit clever with that and use its tokenizer? That just randomly dawned on me because recently we did a project where we abused Kramdown to get it to tokenize Markdown for us so we could alter it. Actually, we use Kramdown! You could even just build a plugin for it and mark this as a Kramdown-only plugin and thus make it easy for all parties to use?

parkr · 2016-03-18T22:57:41Z

@benbalter What do you think about cdb9cca? It skirts around the issue of loss of <body> tags by just replacing the children of the <body> tag. I added it to the spec and it works on my machine. ❤️
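
The difference between replacing the `<body>` element and replacing only its children can be sketched without Nokogiri. The following is a simplified, regex-based stand-in, not the code in cdb9cca: `BODY` and `replace_body_children` are hypothetical names, and the pattern assumes a single well-formed `<body>...</body>` pair whose attributes contain no `>` character.

```ruby
# Captures the opening <body ...> tag, everything inside it, and the closing tag.
BODY = %r{(<body\b[^>]*>)(.*)(</body>)}mi

# Hypothetical helper: runs the given block over the inner HTML of <body> and
# leaves the opening tag (with its attributes) and the closing tag untouched.
def replace_body_children(html)
  # Input without a <body> is treated as a fragment and filtered whole.
  return yield(html) unless html =~ BODY

  html.sub(BODY) do
    open_tag, inner, close_tag = Regexp.last_match.captures
    "#{open_tag}#{yield(inner)}#{close_tag}"
  end
end
```

Called as `replace_body_children(page) { |inner| filter(inner) }`, where `filter` stands for whatever emoji filter is in use, the `<html>` and `<head>` markup and any classes on `<body>` pass through byte for byte, since only the captured inner HTML is handed to the block.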

benbalter · 2016-03-19T15:38:54Z

We'll likely need to port the changes to @mentions as well.

parkr · 2016-03-19T16:53:56Z

@jekyllbot: merge

Merge pull request 37

benbalter added 5 commits March 11, 2016 10:26

failing tests for mangling html and head tags

55d4985

stub out the index

b392387

fix fixture

6d45997

really, really fix fixture

6ced96d

unescape fixture

f06d691

benbalter mentioned this pull request Mar 11, 2016

Jekyll is stripping <html> and <head> tags jekyll/jekyll#4648

Closed

benbalter added 2 commits March 11, 2016 11:26

its not pretty, but it works!

3144230

bail early if we dont have emoji

9fff560

benbalter changed the title ~~WIP: Don't strip html, body, and head tags~~ Don't strip html, body, and head tags Mar 11, 2016

parkr added a commit to jekyll/jekyll that referenced this pull request Mar 16, 2016

Lock jemoji to v0.5.1 while we figure out the issue with HTML::Pipeline.

ac9a724

Reference issue: jekyll/jemoji#37

benbalter mentioned this pull request Mar 17, 2016

Added gemoji parser to parse raw unicode emoji input into token form #38

Open

parkr reviewed Mar 17, 2016

parkr assigned benbalter Mar 17, 2016

Preserve <body> tags by just replacing the body's children.

cdb9cca

parkr mentioned this pull request Mar 18, 2016

Don't strip html, body, and head tags jekyll/jekyll-mentions#29

Merged

jekyllbot added a commit that referenced this pull request Mar 19, 2016

Merge pull request #37 from jekyll/layout-mangle-fix

7f8ce4a

Merge pull request 37

jekyllbot merged commit 7f8ce4a into master Mar 19, 2016

jekyllbot deleted the layout-mangle-fix branch March 19, 2016 16:53

jekyllbot added a commit that referenced this pull request Mar 19, 2016

Update history to reflect merge of #37 [ci skip]

46d02fd

jekyll locked and limited conversation to collaborators Apr 24, 2019

jekyllbot added the frozen-due-to-age label Apr 24, 2019
