New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plainify should remove <style> and <script> sections #3235

Open
liwenyip opened this Issue Mar 24, 2017 · 17 comments

Comments

Projects
None yet
6 participants
@liwenyip
Contributor

liwenyip commented Mar 24, 2017

plainify doesn't strip contents of <style> tags

See source of https://www.liwen.id.au/barbershop-charts/, line 34

Where

<meta property="og:description" content="{{ .Content | plainify | htmlEscape | truncate 200 }}">

is resulting in

<meta property="og:description" content="Here are some easy-to-follow flowcharts to help you remember the words to popular barbershop songs.
  figure { border: 1px solid lightgrey; } 
(Click to zoom)
Hello Mary Lou .gallery .img4 …">

plainify should probably strip out the contents of any non-displaying tags e.g. <style>, <script>, (any others?)

@bep

This comment has been minimized.

Show comment
Hide comment
@bep

bep Mar 24, 2017

Member

I think this is an issue where the tags gets escaped before they get to plainify.

Member

bep commented Mar 24, 2017

I think this is an issue where the tags gets escaped before they get to plainify.

@liwenyip

This comment has been minimized.

Show comment
Hide comment
@liwenyip

liwenyip Mar 25, 2017

Contributor

Did another test to confirm:

{{ plainify "One <span class='red'>red</span> apple. <style>.red {color:red;}</style>" }}

results in

One red apple. .red {color:red;}
Contributor

liwenyip commented Mar 25, 2017

Did another test to confirm:

{{ plainify "One <span class='red'>red</span> apple. <style>.red {color:red;}</style>" }}

results in

One red apple. .red {color:red;}
@bep

This comment has been minimized.

Show comment
Hide comment
@bep

bep Mar 25, 2017

Member

Then I don't see how we can do anything about this ...?

Member

bep commented Mar 25, 2017

Then I don't see how we can do anything about this ...?

@liwenyip

This comment has been minimized.

Show comment
Hide comment
@liwenyip

liwenyip Mar 25, 2017

Contributor

Having looked at how the StripHTML function is implemented - I guess not.

If I understand correctly (I've never programmed in Go before), StripHTML walks through the string one char at a time, so there is no way for it to tell what type of tag it has encountered. I'm guessing it's done this way for speed? (Faster than using regexp?)

I have a workaround:

{{ "One <span class='red'>red</span> apple. <style>.red {color:red;}</style>" 
    | replaceRE "(<style.+?</style>|<script.+?</script>)" "" | plainify }}

I think the docs for plainify should state that it doesn't remove the innerHTML of any tags particularly <style> and <script>. Should I make the change and submit a PR?

Contributor

liwenyip commented Mar 25, 2017

Having looked at how the StripHTML function is implemented - I guess not.

If I understand correctly (I've never programmed in Go before), StripHTML walks through the string one char at a time, so there is no way for it to tell what type of tag it has encountered. I'm guessing it's done this way for speed? (Faster than using regexp?)

I have a workaround:

{{ "One <span class='red'>red</span> apple. <style>.red {color:red;}</style>" 
    | replaceRE "(<style.+?</style>|<script.+?</script>)" "" | plainify }}

I think the docs for plainify should state that it doesn't remove the innerHTML of any tags particularly <style> and <script>. Should I make the change and submit a PR?

@bep

This comment has been minimized.

Show comment
Hide comment
@bep

bep Mar 25, 2017

Member

The main motivation is speed, of course, but in my last reply, I did not understand the real problem. Now I do. No, let us keep this open and fix it in code (in a fast way).

We should have a list of keywords that we skip completely. Let us start with style and script.

Member

bep commented Mar 25, 2017

The main motivation is speed, of course, but in my last reply, I did not understand the real problem. Now I do. No, let us keep this open and fix it in code (in a fast way).

We should have a list of keywords that we skip completely. Let us start with style and script.

@bep bep changed the title from plainify doesn't strip contents of <style> tags to plainify should remove <style> and <script> sections Mar 25, 2017

@bep bep added the Enhancement label Mar 25, 2017

@liwenyip

This comment has been minimized.

Show comment
Hide comment
@liwenyip

liwenyip Mar 25, 2017

Contributor

Let us start with style and script.

Yes I think these are the ones most likely to be a problem.

http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_web_page suggests also stripping head, object, embed, applet, noframes, noscript, noembed, but they are all either deprecated, actually contain plain text, or don't normally contain any innerHTML.

Contributor

liwenyip commented Mar 25, 2017

Let us start with style and script.

Yes I think these are the ones most likely to be a problem.

http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_html_tags_web_page suggests also stripping head, object, embed, applet, noframes, noscript, noembed, but they are all either deprecated, actually contain plain text, or don't normally contain any innerHTML.

@liwenyip

This comment has been minimized.

Show comment
Hide comment
@liwenyip

liwenyip Mar 25, 2017

Contributor

By the way I have solved my original problem by using <meta property="og:description" content="{{ .Summary }}"> which is obviously a much better solution.

But in the process, I discovered some more weird behaviour

EDIT: Raised separately as #3237

Contributor

liwenyip commented Mar 25, 2017

By the way I have solved my original problem by using <meta property="og:description" content="{{ .Summary }}"> which is obviously a much better solution.

But in the process, I discovered some more weird behaviour

EDIT: Raised separately as #3237

@bep

This comment has been minimized.

Show comment
Hide comment
@bep

bep Mar 25, 2017

Member

As you pass the .Content into truncate I'm not following your argumentation. Not one of your examples show output from .Content ... And more important, I don't see any plainify, which this issue is all about.

Member

bep commented Mar 25, 2017

As you pass the .Content into truncate I'm not following your argumentation. Not one of your examples show output from .Content ... And more important, I don't see any plainify, which this issue is all about.

@liwenyip

This comment has been minimized.

Show comment
Hide comment
@liwenyip

liwenyip Mar 25, 2017

Contributor

Sorry, I'll raise a separate issue for this. (I'm still figuring out how Github works, apologies)

Contributor

liwenyip commented Mar 25, 2017

Sorry, I'll raise a separate issue for this. (I'm still figuring out how Github works, apologies)

@Astorsoft

This comment has been minimized.

Show comment
Hide comment
@Astorsoft

Astorsoft Jul 26, 2017

Hi, any update on this issue? As a typical use case, I'm currently designing a custom RSS output which contains the article's content going through plainify. It now contains the whole podcast player javascript in plain text, which is quite ugly. I've tried the replaceRE but it's not working, I'm guessing because my <script> tag is multiline?

If you want an example of what I'm talking about here is the code

<description> {{ .Content | plainify | htmlUnescape | safeHTML }} </description>

(if I don't htmlUnescape and safeHTML after plainify, all my quotes are htmlencoded)

The output being something like:

<description> French gibberish, which is the actual content of the article, and then the content of the script tag as you can see below: window.podigee = { "options": { "theme": "default", }, "extensions": { "ChapterMarks": { "disabled": false }, "EpisodeInfo": {}, "Playlist": { "disabled": true }, "Transcript": { "disabled": true } }, "podcast": { "feed": "https:\/\/media.blubrry.com\/comptoirsecu\/s\/podcast.comptoirsecu.fr\/CSEC.SECHebdo.2017-05-30.mp3" }, "episode": { "media": { "mp3": "https:\/\/media.blubrry.com\/comptoirsecu\/s\/podcast.comptoirsecu.fr\/CSEC.SECHebdo.2017-05-30.mp3" }, "coverUrl": "http:\/\/localhost:1313\/\/images\/covers\/2017-05-30.jpg", "title": "[SECHebdo] Mai 2017 - 5", "subtitle": "Cloak \x26 Dagger, ACYMA, SAMBACry, BSoD Win8, 2FA \x26 Cloud, Pacemakers, Iris \x26 CCC, Inboxen, etc.", "description": "SECHebdo est une revue de l\x27actualité cybersécurité réalisé en live sur Youtube, généralement le mardi soir.", "chaptermarks": [ { "start": "00:01:30", "title": "Titre"}, ], } } Notre discord : https://discord.comptoirsecu.fr

Astorsoft commented Jul 26, 2017

Hi, any update on this issue? As a typical use case, I'm currently designing a custom RSS output which contains the article's content going through plainify. It now contains the whole podcast player javascript in plain text, which is quite ugly. I've tried the replaceRE but it's not working, I'm guessing because my <script> tag is multiline?

If you want an example of what I'm talking about here is the code

<description> {{ .Content | plainify | htmlUnescape | safeHTML }} </description>

(if I don't htmlUnescape and safeHTML after plainify, all my quotes are htmlencoded)

The output being something like:

<description> French gibberish, which is the actual content of the article, and then the content of the script tag as you can see below: window.podigee = { "options": { "theme": "default", }, "extensions": { "ChapterMarks": { "disabled": false }, "EpisodeInfo": {}, "Playlist": { "disabled": true }, "Transcript": { "disabled": true } }, "podcast": { "feed": "https:\/\/media.blubrry.com\/comptoirsecu\/s\/podcast.comptoirsecu.fr\/CSEC.SECHebdo.2017-05-30.mp3" }, "episode": { "media": { "mp3": "https:\/\/media.blubrry.com\/comptoirsecu\/s\/podcast.comptoirsecu.fr\/CSEC.SECHebdo.2017-05-30.mp3" }, "coverUrl": "http:\/\/localhost:1313\/\/images\/covers\/2017-05-30.jpg", "title": "[SECHebdo] Mai 2017 - 5", "subtitle": "Cloak \x26 Dagger, ACYMA, SAMBACry, BSoD Win8, 2FA \x26 Cloud, Pacemakers, Iris \x26 CCC, Inboxen, etc.", "description": "SECHebdo est une revue de l\x27actualité cybersécurité réalisé en live sur Youtube, généralement le mardi soir.", "chaptermarks": [ { "start": "00:01:30", "title": "Titre"}, ], } } Notre discord : https://discord.comptoirsecu.fr

@Astorsoft

This comment has been minimized.

Show comment
Hide comment
@Astorsoft

Astorsoft Aug 2, 2017

Ok, found a solution in https://discourse.gohugo.io/t/replacere-not-working-for-me-solved/5103/6, you just have to add (?m) (?s:YOUR_REGEXP)

{{chomp ")" ""| plainify }}{{chomp "]]>\n"}}

Astorsoft commented Aug 2, 2017

Ok, found a solution in https://discourse.gohugo.io/t/replacere-not-working-for-me-solved/5103/6, you just have to add (?m) (?s:YOUR_REGEXP)

{{chomp ")" ""| plainify }}{{chomp "]]>\n"}}

@bep bep added the Hacktoberfest label Sep 22, 2017

@uncompiled

This comment has been minimized.

Show comment
Hide comment
@uncompiled

uncompiled Oct 13, 2017

If no one has started working on this, I would be interested in tackling this issue.

uncompiled commented Oct 13, 2017

If no one has started working on this, I would be interested in tackling this issue.

@bep

This comment has been minimized.

Show comment
Hide comment
@bep

bep Oct 13, 2017

Member

@uncompiled that would be great; please don't hesitate to ask.

Member

bep commented Oct 13, 2017

@uncompiled that would be great; please don't hesitate to ask.

uncompiled added a commit to uncompiled/hugo that referenced this issue Oct 15, 2017

uncompiled added a commit to uncompiled/hugo that referenced this issue Oct 15, 2017

helpers: Remove style and script tags in plainify output
Add new stripHTMLContainers function that strips out specific
HTML tags including the contents contained within the element.

Fixes gohugoio#3235
@stale

This comment has been minimized.

Show comment
Hide comment
@stale

stale bot May 6, 2018

This issue has been automatically marked as stale because it has not had recent activity. The resources of the Hugo team are limited, and so we are asking for your help.
If this is a bug and you can still reproduce this error on the master branch, please reply with all of the information you have about it in order to keep the issue open.
If this is a feature request, and you feel that it is still relevant and valuable, please tell us why.
This issue will automatically be closed in the near future if no further activity occurs. Thank you for all your contributions.

stale bot commented May 6, 2018

This issue has been automatically marked as stale because it has not had recent activity. The resources of the Hugo team are limited, and so we are asking for your help.
If this is a bug and you can still reproduce this error on the master branch, please reply with all of the information you have about it in order to keep the issue open.
If this is a feature request, and you feel that it is still relevant and valuable, please tell us why.
This issue will automatically be closed in the near future if no further activity occurs. Thank you for all your contributions.

@stale stale bot added the Stale label May 6, 2018

@stale stale bot closed this Jun 5, 2018

@bep bep added the Keep label Jun 5, 2018

@bep bep reopened this Jun 5, 2018

@stale stale bot removed the Stale label Jun 5, 2018

@prannayk

This comment has been minimized.

Show comment
Hide comment
@prannayk

prannayk Oct 16, 2018

Is anyone working on this? I would like to jump on this

prannayk commented Oct 16, 2018

Is anyone working on this? I would like to jump on this

@moorereason

This comment has been minimized.

Show comment
Hide comment
@moorereason

moorereason Oct 16, 2018

Contributor

@prannayk, go for it.

Contributor

moorereason commented Oct 16, 2018

@prannayk, go for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment