From 487e55b3fad1fe7b541a9eee020665c322722e10 Mon Sep 17 00:00:00 2001 From: David Kitchen Date: Thu, 10 Jun 2021 21:10:22 +0100 Subject: [PATCH] Resolves #95 by allowing HTML comments --- README.md | 24 ++++++++--------- go.mod | 2 +- go.sum | 3 +++ policy.go | 21 ++++++++++++++- sanitize.go | 4 +++ sanitize_test.go | 70 ++++++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 110 insertions(+), 14 deletions(-) diff --git a/README.md b/README.md index 36d1a20..6a34473 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ bluemonday is a HTML sanitizer implemented in Go. It is fast and highly configurable. -bluemonday takes untrusted user generated content as an input, and will return HTML that has been sanitised against a whitelist of approved HTML elements and attributes so that you can safely include the content in your web page. +bluemonday takes untrusted user generated content as an input, and will return HTML that has been sanitised against an allowlist of approved HTML elements and attributes so that you can safely include the content in your web page. If you accept user generated content, and your server uses Go, you **need** bluemonday. @@ -50,11 +50,11 @@ bluemonday is heavily inspired by both the [OWASP Java HTML Sanitizer](https://c ## Technical Summary -Whitelist based, you need to either build a policy describing the HTML elements and attributes to permit (and the `regexp` patterns of attributes), or use one of the supplied policies representing good defaults. +Allowlist based, you need to either build a policy describing the HTML elements and attributes to permit (and the `regexp` patterns of attributes), or use one of the supplied policies representing good defaults. -The policy containing the whitelist is applied using a fast non-validating, forward only, token-based parser implemented in the [Go net/html library](https://godoc.org/golang.org/x/net/html) by the core Go team. +The policy containing the allowlist is applied using a fast non-validating, forward only, token-based parser implemented in the [Go net/html library](https://godoc.org/golang.org/x/net/html) by the core Go team. -We expect to be supplied with well-formatted HTML (closing elements for every applicable open element, nested correctly) and so we do not focus on repairing badly nested or incomplete HTML. We focus on simply ensuring that whatever elements do exist are described in the policy whitelist and that attributes and links are safe for use on your web page. [GIGO](http://en.wikipedia.org/wiki/Garbage_in,_garbage_out) does apply and if you feed it bad HTML bluemonday is not tasked with figuring out how to make it good again. +We expect to be supplied with well-formatted HTML (closing elements for every applicable open element, nested correctly) and so we do not focus on repairing badly nested or incomplete HTML. We focus on simply ensuring that whatever elements do exist are described in the policy allowlist and that attributes and links are safe for use on your web page. [GIGO](http://en.wikipedia.org/wiki/Garbage_in,_garbage_out) does apply and if you feed it bad HTML bluemonday is not tasked with figuring out how to make it good again. ### Supported Go Versions @@ -146,8 +146,8 @@ func main() { We ship two default policies: -1. `bluemonday.StrictPolicy()` which can be thought of as equivalent to stripping all HTML elements and their attributes as it has nothing on its whitelist. An example usage scenario would be blog post titles where HTML tags are not expected at all and if they are then the elements *and* the content of the elements should be stripped. This is a *very* strict policy. -2. `bluemonday.UGCPolicy()` which allows a broad selection of HTML elements and attributes that are safe for user generated content. Note that this policy does *not* whitelist iframes, object, embed, styles, script, etc. An example usage scenario would be blog post bodies where a variety of formatting is expected along with the potential for TABLEs and IMGs. +1. `bluemonday.StrictPolicy()` which can be thought of as equivalent to stripping all HTML elements and their attributes as it has nothing on its allowlist. An example usage scenario would be blog post titles where HTML tags are not expected at all and if they are then the elements *and* the content of the elements should be stripped. This is a *very* strict policy. +2. `bluemonday.UGCPolicy()` which allows a broad selection of HTML elements and attributes that are safe for user generated content. Note that this policy does *not* allow iframes, object, embed, styles, script, etc. An example usage scenario would be blog post bodies where a variety of formatting is expected along with the potential for TABLEs and IMGs. ## Policy Building @@ -220,7 +220,7 @@ p.AllowElements("fieldset", "select", "option") ### Inline CSS -Although it's possible to handle inline CSS using `AllowAttrs` with a `Matching` rule, writing a single monolithic regular expression to safely process all inline CSS which you wish to allow is not a trivial task. Instead of attempting to do so, you can whitelist the `style` attribute on whichever element(s) you desire and use style policies to control and sanitize inline styles. +Although it's possible to handle inline CSS using `AllowAttrs` with a `Matching` rule, writing a single monolithic regular expression to safely process all inline CSS which you wish to allow is not a trivial task. Instead of attempting to do so, you can allow the `style` attribute on whichever element(s) you desire and use style policies to control and sanitize inline styles. It is suggested that you use `Matching` (with a suitable regular expression) `MatchingEnum`, or `MatchingHandler` to ensure each style matches your needs, @@ -280,12 +280,12 @@ We provide some additional global options for safely working with links. p.RequireParseableURLs(true) ``` -If you have enabled parseable URLs then the following option will `AllowRelativeURLs`. By default this is disabled (bluemonday is a whitelist tool... you need to explicitly tell us to permit things) and when disabled it will prevent all local and scheme relative URLs (i.e. `href="localpage.html"`, `href="../home.html"` and even `href="//www.google.com"` are relative): +If you have enabled parseable URLs then the following option will `AllowRelativeURLs`. By default this is disabled (bluemonday is an allowlist tool... you need to explicitly tell us to permit things) and when disabled it will prevent all local and scheme relative URLs (i.e. `href="localpage.html"`, `href="../home.html"` and even `href="//www.google.com"` are relative): ```go p.AllowRelativeURLs(true) ``` -If you have enabled parseable URLs then you can whitelist the schemes (commonly called protocol when thinking of `http` and `https`) that are permitted. Bear in mind that allowing relative URLs in the above option will allow for a blank scheme: +If you have enabled parseable URLs then you can allow the schemes (commonly called protocol when thinking of `http` and `https`) that are permitted. Bear in mind that allowing relative URLs in the above option will allow for a blank scheme: ```go p.AllowURLSchemes("mailto", "http", "https") ``` @@ -303,7 +303,7 @@ p.RequireNoReferrerOnLinks(true) ``` -We provide a convenience method that applies all of the above, but you will still need to whitelist the linkable elements for the URL rules to be applied to: +We provide a convenience method that applies all of the above, but you will still need to allow the linkable elements for the URL rules to be applied to: ```go p.AllowStandardURLs() p.AllowAttrs("cite").OnElements("blockquote", "q") @@ -373,11 +373,11 @@ p.AllowAttrs( ) ``` -Both examples exhibit the same issue, they declare attributes but do not then specify whether they are whitelisted globally or only on specific elements (and which elements). Attributes belong to one or more elements, and the policy needs to declare this. +Both examples exhibit the same issue, they declare attributes but do not then specify whether they are allowed globally or only on specific elements (and which elements). Attributes belong to one or more elements, and the policy needs to declare this. ## Limitations -We are not yet including any tools to help whitelist and sanitize CSS. Which means that unless you wish to do the heavy lifting in a single regular expression (inadvisable), **you should not allow the "style" attribute anywhere**. +We are not yet including any tools to help allow and sanitize CSS. Which means that unless you wish to do the heavy lifting in a single regular expression (inadvisable), **you should not allow the "style" attribute anywhere**. It is not the job of bluemonday to fix your bad HTML, it is merely the job of bluemonday to prevent malicious HTML getting through. If you have mismatched HTML elements, or non-conforming nesting of elements, those will remain. But if you have well-structured HTML bluemonday will not break it. diff --git a/go.mod b/go.mod index edbd585..02cf2ea 100644 --- a/go.mod +++ b/go.mod @@ -5,5 +5,5 @@ go 1.16 require ( github.com/aymerick/douceur v0.2.0 github.com/gorilla/css v1.0.0 // indirect - golang.org/x/net v0.0.0-20210421230115-4e50805a0758 + golang.org/x/net v0.0.0-20210610132358-84b48f89b13b ) diff --git a/go.sum b/go.sum index e195d4e..930d271 100644 --- a/go.sum +++ b/go.sum @@ -4,8 +4,11 @@ github.com/gorilla/css v1.0.0 h1:BQqNyPTi50JCFMTw/b67hByjMVXZRwGha6wxVGkeihY= github.com/gorilla/css v1.0.0/go.mod h1:Dn721qIggHpt4+EFCcTLTU/vk5ySda2ReITrtgBl60c= golang.org/x/net v0.0.0-20210421230115-4e50805a0758 h1:aEpZnXcAmXkd6AvLb2OPt+EN1Zu/8Ne3pCqPjja5PXY= golang.org/x/net v0.0.0-20210421230115-4e50805a0758/go.mod h1:72T/g9IO56b78aLF+1Kcs5dz7/ng1VjMUvfKvpfy+jM= +golang.org/x/net v0.0.0-20210610132358-84b48f89b13b h1:k+E048sYJHyVnsr1GDrRZWQ32D2C7lWs9JRc0bel53A= +golang.org/x/net v0.0.0-20210610132358-84b48f89b13b/go.mod h1:9nx3DQGgdP8bBQD5qxJ1jj9UTztislL4KSBs9R2vV5Y= golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20210420072515-93ed5bcd2bfe/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20210423082822-04245dca01da/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo= golang.org/x/text v0.3.6/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ= golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ= diff --git a/policy.go b/policy.go index 9c7e662..241915d 100644 --- a/policy.go +++ b/policy.go @@ -86,6 +86,9 @@ type Policy struct { // When true, allow data attributes. allowDataAttributes bool + // When true, allow comments. + allowComments bool + // map[htmlElementName]map[htmlAttributeName]attrPolicy elsAndAttrs map[string]map[string]attrPolicy @@ -223,7 +226,7 @@ func (p *Policy) AllowAttrs(attrNames ...string) *attrPolicyBuilder { return &abp } -// AllowDataAttributes whitelists all data attributes. We can't specify the name +// AllowDataAttributes permits all data attributes. We can't specify the name // of each attribute exactly as they are customized. // // NOTE: These values are not sanitized and applications that evaluate or process @@ -238,6 +241,22 @@ func (p *Policy) AllowDataAttributes() { p.allowDataAttributes = true } +// AllowComments allows comments. +// +// Please note that only one type of comment will be allowed by this, this is the +// the standard HTML comment which includes the use of that to permit +// conditionals as per https://docs.microsoft.com/en-us/previous-versions/windows/internet-explorer/ie-developer/compatibility/ms537512(v=vs.85)?redirectedfrom=MSDN +// +// What is not permitted are CDATA XML comments, as the x/net/html package we depend +// on does not handle this fully and we are not choosing to take on that work: +// https://pkg.go.dev/golang.org/x/net/html#Tokenizer.AllowCDATA . If the x/net/html +// package changes this then these will be considered, otherwise if you AllowComments +// but provide a CDATA comment, then as per the documentation in x/net/html this will +// be treated as a plain HTML comment. +func (p *Policy) AllowComments() { + p.allowComments = true +} + // AllowNoAttrs says that attributes on element are optional. // // The attribute policy is only added to the core policy when OnElements(...) diff --git a/sanitize.go b/sanitize.go index b462f09..e203411 100644 --- a/sanitize.go +++ b/sanitize.go @@ -289,6 +289,10 @@ func (p *Policy) sanitize(r io.Reader) *bytes.Buffer { case html.CommentToken: // Comments are ignored by default + if p.allowComments { + // But if allowed then write the comment out as-is + buff.WriteString(token.String()) + } case html.StartTagToken: diff --git a/sanitize_test.go b/sanitize_test.go index 1bb45a9..3033b37 100644 --- a/sanitize_test.go +++ b/sanitize_test.go @@ -1835,3 +1835,73 @@ func TestQuotes(t *testing.T) { } wg.Wait() } + +func TestComments(t *testing.T) { + p := UGCPolicy() + + tests := []test{ + { + in: `1 3`, + expected: `1 3`, + }, + { + in: ``, + expected: ``, + }, + } + + // These tests are run concurrently to enable the race detector to pick up + // potential issues + wg := sync.WaitGroup{} + wg.Add(len(tests)) + for ii, tt := range tests { + go func(ii int, tt test) { + out := p.Sanitize(tt.in) + if out != tt.expected { + t.Errorf( + "test %d failed;\ninput : %s\noutput : %s\nexpected: %s", + ii, + tt.in, + out, + tt.expected, + ) + } + wg.Done() + }(ii, tt) + } + wg.Wait() + + p.AllowComments() + + tests = []test{ + { + in: `1 3`, + expected: `1 3`, + }, + { + in: ``, + expected: ``, + }, + } + + // These tests are run concurrently to enable the race detector to pick up + // potential issues + wg = sync.WaitGroup{} + wg.Add(len(tests)) + for ii, tt := range tests { + go func(ii int, tt test) { + out := p.Sanitize(tt.in) + if out != tt.expected { + t.Errorf( + "test %d failed;\ninput : %s\noutput : %s\nexpected: %s", + ii, + tt.in, + out, + tt.expected, + ) + } + wg.Done() + }(ii, tt) + } + wg.Wait() +}