Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't strip html, body, and head tags #37

Merged
merged 8 commits into from
Mar 19, 2016
Merged

Conversation

benbalter
Copy link
Contributor

Fixes (half of) jekyll/jekyll#4648.

@benbalter benbalter changed the title WIP: Don't strip html, body, and head tags Don't strip html, body, and head tags Mar 11, 2016
@benbalter
Copy link
Contributor Author

Alright, @jekyll/ecosystem, I could use some smart thinking here. The problem in jekyll/jekyll#4648, is that HTML Pipeline expects an HTML fragment, but the post_render hook is passing it a full-fledged HTML document. The culprit, I suspect is this:

          # This is a horrible hack, but I don't care
          if tags.strip =~ /^<body/i
            path = "/html/body"
          else
            path = "/html/body/node()"
          end

:hurtrealbad:

That means, that HTML documents (e.g., and page or doc with a layout), we need to parse the document ourself and pass the body fragment to pipeline, emojify it, and then swap it out for the body in the already-parsed document.

Sounds simple, right? Wrong. If you've ever used Nokigiri, you know it loves to do two things:

  1. Mangle your content, by adding extra, technically correct, but not originally there tags like <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">. (Compare the layout in this PR, to the output fixture).
  2. Scream 🔴y 🔪 if your HTML isn't 100% absolutely perfect (see Maruku)

So that leaves us with a few options:

  1. Find some other way not to @mention and emoji in code blocks (regex?)
  2. Try to minimize the damage by e.g., only using the document-body-replace method on pages with code blocks, otherwise using straight regex
  3. Some other trickery (?)

Thoughts? I'm going to go ahead and downgrade GItHub Pages in the interim, to give us the time to find the right solution here.

@envygeeks
Copy link

@benbalter I don't see the problem with using Nokogiri in a plugin.

@parkr
Copy link
Member

parkr commented Mar 11, 2016

@benbalter did you try changing this to a :pre_render plugin and setting doc.content instead of doc.output? That might help a significant number of cases where the doc doesn't have the <html><head>...

@parkr
Copy link
Member

parkr commented Mar 16, 2016

Fixes #36, too.

if doc.output =~ /<body/
parsed_doc = Nokogiri::HTML::Document.parse(doc.output)
body = parsed_doc.at_css('body')
body.replace filter_with_emoji(src).call(body.to_html)[:output]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove this .to_html – from what I could see in the docs, it can operate on the fragment, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're going to have to pass it as a string... body at that point is a Nokogiri::XML::Element, not a document, and thus has no parent method (and errors out on has_ancestor?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Tested with the jekyllrb.com site and it strips the <body> tag classes. Is there a way to preserve those?

@benbalter
Copy link
Contributor Author

did you try changing this to a :pre_render plugin and setting doc.content instead of doc.output? That might help a significant number of cases where the doc doesn't have the ...

AFAIK, that would be Markdown at that point, not HTML, meaning we couldn't parse to determine if a node was inside code or pre tags.

@envygeeks
Copy link

We use Kramdown, why don't get get a bit clever with that and use it's tokenizer? That just randomly dawned on me because recently we did a project where we abused Kramdown to get it to tokenize Markdown for us so we could alter it. Actually, we use Kramdown! You could even just build a plugin for it and mark this as a Kramdown only plugin and thus make it easy for all parties to use?

@parkr
Copy link
Member

parkr commented Mar 18, 2016

@benbalter What do you think about cdb9cca? It skirts around the issue of loss of <body> tags by just replacing the children of the <body> tag. I added it to the spec and it works on my machine. ❤️

@benbalter
Copy link
Contributor Author

:shipit:. We'll also likely need to port the changes to @mentions as well.

@parkr
Copy link
Member

parkr commented Mar 19, 2016

@jekyllbot: merge

jekyllbot added a commit that referenced this pull request Mar 19, 2016
@jekyllbot jekyllbot merged commit 7f8ce4a into master Mar 19, 2016
@jekyllbot jekyllbot deleted the layout-mangle-fix branch March 19, 2016 16:53
jekyllbot added a commit that referenced this pull request Mar 19, 2016
@jekyll jekyll locked and limited conversation to collaborators Apr 24, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants