Support for anchors (A tags with internal link syntax) #77

Closed. @rorygibson wants to merge 392 commits.

I've added support for hrefs starting with '#', so that links to internal anchors on a page don't get stripped out by the Cleaner. I need this for my day job and thought it might be useful to you.

W3C syntax for anchors states simply that they must start with a # and contain no spaces.
I've added 2 tests in the CleanerTest class that document this behaviour.
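
For illustration, the rule described above (starts with '#', contains no spaces) could be sketched roughly as follows. This is not the code from the pull request: the class and method names are hypothetical, and requiring at least one character after the '#' is an assumption.

```java
final class AnchorHrefCheck {
    // Hypothetical helper, not the pull request's implementation.
    // Accepts an href that starts with '#', has at least one character
    // after it (an assumption), and contains no whitespace.
    public static boolean isAnchorHref(String href) {
        if (href == null || href.length() < 2 || href.charAt(0) != '#') {
            return false;
        }
        for (int i = 1; i < href.length(); i++) {
            if (Character.isWhitespace(href.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAnchorHref("#footnote-1"));
        System.out.println(isAnchorHref("#two words"));
        System.out.println(isAnchorHref("http://example.com/"));
    }
}
```

A check like this would presumably sit alongside the existing protocol test, so that ordinary URLs still go through the whitelist's protocol list.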

If you could merge this in I'd be grateful; currently we're using the forked jar, but it'd be nice to stay on trunk.

and others added some commits
@jhy Implemented Element#wrap and #Elements#wrap
Also protected Node.replaceChild, removeChild, addChild.
4d91a81
@jhy New: E + F adjacent sibling selector, E ~ F preceding sibling. 465493e
@jhy Maven Sonatype setup 96ef20d
@jhy [maven-release-plugin] prepare release jsoup-0.2.1 0747d34
@jhy [maven-release-plugin] prepare for next development iteration 55e9a11
@jhy [maven-release-plugin] prepare release jsoup-0.2.1 4c55fb7
@jhy Release prep 138c1a7
@jhy [maven-release-plugin] prepare release jsoup-0.2.1a 7b0469e
@jhy [maven-release-plugin] prepare for next development iteration 55edde5
@jhy Sonatype release machinations 3185272
@jhy [maven-release-plugin] prepare release jsoup-0.2.1b 69f7fdb
@jhy [maven-release-plugin] prepare for next development iteration a9d69ab
@jhy Add addClass, removeClass, toggleClass, hasClass to Element and Elements.

Closes #2
87877c9
@jhy Improved document normalisation. ba751b9
@jhy hasText 0984e71
@jhy Improved HTML output (pretty-print) 837afba
@jhy Changelog for release prep fb5e61f
@jhy [maven-release-plugin] prepare release jsoup-0.2.2 c4f5acf
@jhy [maven-release-plugin] prepare for next development iteration f20e2ba
@jhy Corrected change note 2a00564
@jhy Merge branch 'master' of 10.68.203.26:jhy/jsoup fafbfc7
@jhy Assert attribute values are not null, not not empty.
Closes #7.
36c3b16
@jhy Changed Elements#attr(key) to scan all elements for attribute.
Closes #4.
dba761f
@jhy Implemented Elements html(), html(string), append, and prepend.
Closes #5.
83ebe5f
@jhy Changelog 25f33dd
@jhy Normalise head by prepending, not appending.
Closes #9.
f3e4e56
@jhy Cleaner.isValid() method.
Closes #6.
7d94e23
@jhy IsValid test for OK attribute 4c2c345
@jhy Test self is not descender 985b4ae
@jhy Deploy prep 4d7c7fa
@jhy Release prep bb86d89
@jhy [maven-release-plugin] prepare release jsoup-0.3.1 cd59eed
@jhy [maven-release-plugin] prepare for next development iteration 1aadd75
@jhy Allow - and _ in CSS ID selectors.
Closes #10.
22bf5ab
@jhy Changelog b3d2cbc
@jhy Changelog 858240e
@jhy Resolve relative links when cleaning.
Closes #12.
c4b96b3
@jhy Allow combinators at start of selector query
Closes #13
8b1abce
@jhy Added val() and val(string) to Element and Elements.
Treat contents of textarea as text, not data.

Closes #14
d4a4b6c
@jhy Added Node#remove and Node#replaceWith.
Closes #19
7c62959
@jhy Throw exception if trying to parse non-text content
Closes #17
1e93653
@jhy Added TextNode#text and TextNode#text(String)
Closes #18
f55d80c
@jhy Added selector support for :eq, :lt, and gt
Closes #16
ecc54c2
@uggedal uggedal String.isEmpty() and LinkedList.peekFirst() is not part of the Java 5.0 API.
d8ea84f
@jhy Updated ignore list 499c262
@jhy Preparing 1.1.1 release e9bc6c3
@jhy [maven-release-plugin] prepare release jsoup-1.1.1 739436d
@jhy [maven-release-plugin] prepare for next development iteration 685832e
@jhy Change notes 284c552
@jhy Fixed test package 253412a
@jhy Fix an issue where text order was incorrect when parsing pre-document HTML.

Fixes #23
821d3b6
@jhy Clean up the parse stack correctly when parsing data-nodes.
Fixes #22.
886e4fe
@jhy Fixed javadoc typo 0059dce
@jhy Added :has(selector) pseudo-selector.
Added Element#parents() and Elements#parents() methods.

Fixes #20
49f0e41
@jhy Chanelog release date e5716c0
@jhy Improved implicit close tag heuristic detection when parsing malformed HTML.

Fixes an issue where appending / prepending rows to a table (or  to similar implicit
element structures) would create a redundant wrapping elements.

Fixes #21
3e8cf58
@jhy Cleanup Element and Node add mechanism e65cfb3
@jhy Added .before(html) and .after(html) methods to Element and Elements, to insert sibling HTML
610cb60
@jhy Added :contains(text) selector 2bcfa10
@jhy [maven-release-plugin] prepare release jsoup-1.2.1 9883b56
@jhy [maven-release-plugin] prepare for next development iteration 8217467
@jhy Changelog release date 1ced3d7
@jhy Fixed javadoc for :eq(n) 4ed5aec
@jhy Upgraded the selector query parser to allow nested selectors like 'div:has(p:has(span))'
d8b5aa7
@jhy Updated TokenQueue so :contains(text) can be escaped, if looking
for ( or ) within text
3cacfff
@jhy Implemented :matches(regex) selector. 47b13d4
@jhy Changelog 92baf07
@jhy Parsing optimisation.
Modified TokenQueue to use a StringBuilder + offset to back the queue,
instead of a linked list. Reduces memory and CPU use.
f033e5a
@jhy Parsing performance optimisation.
Modified TokenQueue chompTo method to use indexOf to allow rapid
scan for next token.
fcb4841
@jhy Parsing performance optimisation.
Intern attribute keys (often shared), and dropped back default
bucket sizes for attributes and element children so as to conserve
memory.
af2a97c
@jhy TextNode performance tweaks 3c28ff7
@jhy Performance optimisation in parsing. 7425c6d
@jhy Use a Visitor instead of recursion for HTML and selectors. a038358
@jhy Performance tweaks. 3975b8b
@jhy Tidy 5f6a9ae
@jhy Added [key~=regex] attribute selector by regular expression 7c28911
@jhy Tidy c41390c
@jhy Changelog a35db94
@jhy Test update fb37594
@jhy [maven-release-plugin] prepare release jsoup-1.2.2 145d783
@jhy [maven-release-plugin] prepare for next development iteration cc5f37f
@jhy Automatically determine charset when parsing from URL or File. 784a31f
@jhy Auto detect charset from HTML5 <meta charset> tag if present d4c06ac
@jhy Changed DT & DD tags to block-mode tags, to follow practise over spec. a03639a
@jhy Added support for [^attributePrefix] selector query. Useful for finding
elements with HTML5 datasets: [^data]
4d0eab2
@jhy Implemented Element.dataset(), to retrieve a map of custom data attributes.
aa3c4f0
@jhy Improved tag definitions to allow limited children and excluded children.

Improved implicit table element creation, particularly around tbody tags.
1682762
@jhy Cleaned tag definitions to make head and dl parsing more generic. ab9d34e
@jhy Implicit close for <caption> tags. 0b509fd
@jhy Changelog 8f66e9c
@jhy Testcase for malformed meta http-equiv charset. 2290966
@jhy HTML5 tag support a09ef4a
@jhy Added support for namespaced elements (<fb:name>) and selectors (fb|name)
781eb0f
@jhy Improved HTML output format for empty elements and auto-detected self closing tags.

Closes #27
dacb8e4
@jhy Added support for tag names with - and _ (<abc_foo>, <abc-foo>) 05623f4
@jhy Removed obsolete nodeDepth method 479a9fa
@jhy Implemented Node.ownerDocument DOM API method. 34bc04b
@jhy Fixed support for character class regular expressions in [attr=~regex] selector
ef1bbcb
@jhy Fixed support for character class regular expressions in [attr=~regex] selector
98f3cce
@jhy Note <tag > fix d52406b
@jhy Draft implementation of Entities, for customisable entity escaping. 6bde6c8
@jhy Working on escape/unescape routine. 8bb490a
@jhy

This section will be clearer with a regex: &(#(x|X)?(\d+)|[a-zA-Z]+);?
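
As a rough sketch of what the suggested pattern does when used from Java (the class and helper names are hypothetical, and this is not the jsoup unescape routine itself):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityRegexSketch {
    // The pattern quoted in the comment above, escaped for a Java string.
    // Group 3 captures the digits of a numeric reference; the [a-zA-Z]+
    // branch matches a named entity. The trailing semicolon is optional.
    static final Pattern ENTITY = Pattern.compile("&(#(x|X)?(\\d+)|[a-zA-Z]+);?");

    // Hypothetical helper: returns the first entity-like match, or null.
    public static String firstEntity(String input) {
        Matcher m = ENTITY.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstEntity("a &amp; b"));
        System.out.println(firstEntity("price: &#36;5"));
        System.out.println(firstEntity("&#xAF;"));
    }
}
```

One thing worth noting: as quoted, the digit group is `\d+`, so a hex reference containing the letters a-f (such as `&#xAF;`) does not match. A variant with a hex digit class would be needed for those.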

and others added some commits
@jhy Simplified Entity unescaper 61b18d9
@jhy Added ability to configure the document's output charset. 78a028b
@jhy Re-ordered changelog 724c454
@jhy [maven-release-plugin] prepare release jsoup-1.2.3 d27d92e
@jhy [maven-release-plugin] prepare for next development iteration c807c32
@jhy Use jsoup escaper for attributes, not Apache's. 8180838
@jhy Optimise adding nodes to end of childnode list. 1135778
@jhy TokenQueue optimisations 57d909d
@jhy Optimised document normalisation 64f7110
@jhy Mini optimisations dbe01fe
@jhy Restored public access for Entities.EscapeMode eee3609
@jhy Javadoc fix 245e437
@jhy Removed dependency on Apache Commons-lang. Jsoup now has no external dependencies.
4384f7e
@jhy Optimised normaliseWhitespace 2587788
@jhy Optimised attribute html 6463dd7
@jhy Micro-optimise tag ancestor ce4e564
@jhy Optimised textnodes to not hold attributes or childnodes unless required on use.
9babc3b
@jhy Fixed support for case-sensitive HTML escape entities.
Fixes #31
17d07c5
@jhy Fixed issue when parsing tags with keyless attributes.
Fixes #32
856c8ef
@jhy Entity doc e5cbf67
@jhy Draft / in progress implementation of Connection 6e5e8a2
@jhy Initial implementation of Connection f737aa1
@jhy Working on http connection implementation 6942ad8
@jhy Implemented request headers 4394c86
@jhy Implemented query string from data 7eee43a
@jhy Fixed Attributes.hmtl() f4061bd
@jhy Added support for gzipped output.
Fixes #28
7d80538
@jhy Connection timeout specified in millis, not seconds 16737a6
@jhy Documented Connection interface methods 7d0015f
@jhy Tidied up Connection and Jsoup use 608599c
@jhy URL connection tests a291885
@jhy Implemented Element#ownText() a27046b
@jhy Changelog 5a29e18
@jhy Added support for non-pretty-printed HTML output, to more closely mirror the input HTML.

Fixes #8
5514f98
@jhy Changelog 2067cb7
@jhy Fixed html() method of Attribute dbc25bf
@jhy Added support for selectors :containsOwn(text) and :matchesOwn(regex), to supplement Element.ownText().
cf6bc67
@jhy Updated the link example program to use Jsoup.connect() 4082a25
@jhy Validations for Connection 0a5d0b2
@jhy Changelog release prep 333ca9f
@jhy [maven-release-plugin] prepare release jsoup-1.3.1 617a1e5
@jhy [maven-release-plugin] prepare for next development iteration 76237ff
@jhy Doc 4668d10
@jhy Treat HTTP headers as case insensitive in Jsoup.Connection. Improves compatibility for HTTP responses.
bf456f4
@jhy Tweaks 310c7bc
@jhy Improved malformed table parsing by implementing ignorable end tags. 26b1aaf
@jhy More tests for Jsoup.Connection 2958642
@jhy [maven-release-plugin] prepare release jsoup-1.3.2 eb9f5c9
@jhy [maven-release-plugin] prepare for next development iteration 7dcf96f
@jhy Implement Elements.empty() and Elements.remove(). 867a00d
@jhy Javadoc note for Elements.get(int) 7daa24b
@jhy Selector documentation tweak 5972b86
@jhy Fixed issue in Entities when unescaping &#36; ("$")
Fixes #34
f6752d3
@akr4 akr4 added EscapeMode.minimum d9f4958
@jhy Added restricted XHTML output entity option 6ce593d
@jhy Changelog 9964fd5
@jhy [maven-release-plugin] prepare release jsoup-1.3.3 88c822e
@jhy [maven-release-plugin] prepare for next development iteration 3ca6654
@jhy Implemented DataNode.setWholeData() to allow updating of script and style data contents.
a5b5ec2
@jhy Fixed support for jsoup.connect to follow redirects between http & https URLs.

Fixes #37
2d9b97b
@jhy Fixed issue in jsoup.connect when extracting character set from content-type header; now supports quoted

charset declaration.
2714d6b
@jhy Relaxed parse rules of H1 - H6 to allow nested content. 866825d
@jhy Relaxed parse rule of SPAN to treat as block, to allow nested block content.
182f903
@jhy Added ability to load and parse HTML from an input stream. c408300
@jhy Test fix 2f46ca2
@jhy Javadoc example on absUrl f70526c
@jhy Document normalisation now more correctly enforces document structure.
 - ensure only one head and one body element, both under html el
 - allow html/head/noscript/img for some site's analytic pattern

Fixes #43
85ef3f6
@jhy Support node.outerHtml() method when node has no parent.
Fixes #45
f6271f9
@jhy Fixed support for HTML entities with numbers in name (e.g. &frac34, &sup1)

Fixes #46
6d48121
@jhy Merge branch 'master' of git@github.com:jhy/jsoup
Conflicts:
	CHANGES
	src/test/java/org/jsoup/parser/ParserTest.java
5d77dbe
@clementdenis-vv clementdenis-vv Fixes IndexArrayOutOfBoundException on response with empty headers c907932
@jhy Implemented Node.clone() to create deep, independent copies of Nodes, Elements, and Documents.

Fixes #47
d255668
@jhy Testcase to confirm doctypes get cloned e9f7254
@jhy Fixed absolute URL generation from relative URLs which are only query strings.

Fixes #49
a57d8a3
@jhy Output format tweak 2212615
@jhy Added :not() selector, to find elements that do not match the selector. E.g. div:not(.logo) finds divs that

do not have the "logo" class name.

Fixes #36
b19beb7
@jhy Added Elements.not(selector) method, to remove undesired results from selector results.
ff91ded
@jhy Changelog update in launce preperation c1cf385
@jhy Changelog tweak 7babf92
@jhy [maven-release-plugin] prepare release jsoup-1.4.1 b2229d6
@jhy [maven-release-plugin] prepare for next development iteration 5824e29
@bbeck bbeck Updated OutputSettings inside of Document to be a static inner class.
This addresses the issue with using jsoup and scala 2.8 discussed here:
http://groups.google.com/group/jsoup/browse_thread/thread/3f7ec2fa41dfb87f

This change should be safe to commit and won't make a backwards incompatible
to the public interface of jsoup (you can always reference a static member
via a non-static path) for any existing users.  A recompilation of their code
won't even be necessary.

All tests continue to pass for me after this change.
9a2e19e
@kzn kzn added .clone() for Elements dc8b3fd
@kzn kzn Initial add of new generation selectors(faster than original) f4337b9
Anton Kazennikov Added attribute selectors f8e487a
@kzn kzn added AttrSelector.AttrNamePrefixSelector 2e93a70
@kzn kzn fix bug in element selector: incorrect behavior on multiple classes 6a6fa21
@kzn kzn removing as exists matching evaluator classes 45a45e1
@kzn kzn changes wrt existing Evaluator class 326646f
@kzn kzn Selectors update d3d36bb
@kzn kzn removing boxing/unboxing 5ce24c2
@kzn kzn evaluators made public 98f8292
@kzn kzn Adding evaluators tests 87891d5
@kzn kzn added new tests 67e09f8
Anton Kazennikov Renaming in some selectors 57e0209
Anton Kazennikov adding new ctors to And and Or 76a7ade
Anton Kazennikov adding toString() 2d7d62f
Anton Kazennikov added base container 22a1d10
Anton Kazennikov added .toString() to basic Evaluators 8573486
Anton Kazennikov Adding Selector parser aabd34f
@kzn kzn removing char boxing 2dd2968
@kzn kzn adding empty and .addAll 95b03ed
@kzn kzn implemented :has selector 06a6674
@kzn kzn Working parser except the root node selector.
Added basic tests
fef19b5
@kzn kzn removed unused constructor 80e420e
@kzn kzn parser update: normal order of selectors 9c00eab
@kzn kzn fix non-void element parsing such as <a href=/link/>link text</a> a59cfee
@kzn kzn Evaluator.match(Element test) ->
Evaluator.match(Element root, Element test)
change
441cc33
@kzn kzn Character -> char change 4c89a93
@kzn kzn restored all tests. 906bbf0
@kzn kzn added RootSelector
updated tree selectors wrt subtree matching
a4d3abe
@kzn kzn update evaluator wrt subtree matching 3f91b30
@kzn kzn Added RootSelector support 03abecc
@kzn kzn small optimizations 692d5f8
@kzn kzn added javadocs bb2f08b
@kzn kzn Added javadocs for Evaluators.
Updated tests.
Updated parser
9f43840
@jhy Fixed issue when using descendant regex attribute selectors.
Fixes #52
53a207d
@jhy Added a test to confirm combinators don't match in balanced contains queries
d46b89c
@jhy Merge branch 'master' of https://github.com/bbeck/jsoup into bbeck-master
7938bae
@jhy Fixed tokeniser optimisation when scanning for missing data element close tags.

Fixes #67
c755b04
@michael-simons michael-simons There are no valid (x)html tags that start with numbers 05285d0
Anton Kazennikov Merge remote branch 'upstream/master'
Conflicts:
	src/main/java/org/jsoup/parser/TokenQueue.java
06389e9
@jhy Reverted changes that only allow empty tags in pre-defined instances.
Markup like <tag /> needs to be parsed as an empty element.
dbc053f
@jhy Integrated new single-pass selector evaluators, contributed by knz (Anton Kazennikov).
a9b6f76
@tc tc Removed com.sun.xml.internal.ws.util.StringUtils to fix https://github.com/jhy/jsoup/issues/#issue/69

"jsoup/src/main/java/org/jsoup/select/selectors/AndSelector.java:[8,35] package com.sun.xml.internal.ws.util does not exist"
50a51cc
@jhy Force strict entity matching (must be &xxx; and not &xxx) in element attributes.

Fixes #71
e008ef7
@jhy Ensure that Jsoup.Connect handles relative redirects in cases where the
underlying HTTP stack doesn't automatically follow them.

Fixes #73
0a4699e
@jhy Allow Jsoup.Connect to parse application/xml and application/xhtml+xml responses.

Fixes #72
4a0f8d6
@jhy Defined U (underline) element as an inline tag. 82264c8
@jhy Cleanup of selector class files a59a878
@jhy Updated Jsoup.Connection so that cookies set on a redirect response will be included on the redirected request and response.
e70cbf2
@jhy Prevent infinite redirection loops in jsoup.connect. 096e130
@jhy Implemented TextNode.splitText e3ddbb8
@jhy Moved .wrap, .before, and .after from Element to Node for flexibility. Overriding implementations in Element still return Element.
eea130b
@jhy Don't run URL connectivity tests by default. 800d6e1
@jhy Added ability to change an element's tag with Element.tagName(String), and to change many at once with Elements.tagName(String).
64a3dea
@jhy Test to confirm that abs URL method works on img src attributes. c659826
@jhy Generify empty child list. 8c112c7
@jhy Removed redundant empty array 765eafc
@jhy Changelog updates 66343dc
@jhy Readme update 22e2513
@jhy [maven-release-plugin] prepare release jsoup-1.5.1 da25d1d
@jhy [maven-release-plugin] prepare for next development iteration 67ee7a1
@jhy Fixed issue with selector parser where some boolean AND + OR combined queries (e.g. "meta[http-equiv], meta[content]") were being parsed incorrectly as OR only queries (e.g. former as "meta, [http-equiv], meta[content]")

Fixed issue where a content-tye specified in a meta tag may not be reliably detected, due to the above issue.
d96f781
@jhy Allow <a> and <font> elements to be treated as flow/block elements, to match browser parse trees.
4744082
@jhy Updated copyright date 8b52f48
Rory Gibson Add support for A elements with hrefs starting '#' i.e. anchors. bdd810f
Rory Gibson Make relative URLs possible 46cd1e3
@analytically

This should also support single quotes: eg Content-Type:text/html; charset='utf-8'

Site that has this: http://www.roundwoodpark.herts.sch.uk/
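
A minimal sketch of how a charset lookup could tolerate both quote styles, assuming a regex-based extraction. The pattern, class, and method names here are illustrative assumptions, not jsoup's actual code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetHeaderSketch {
    // Illustrative pattern (an assumption, not jsoup's own): finds
    // "charset=" case-insensitively and allows an optional single or
    // double quote on either side of the charset name.
    static final Pattern CHARSET =
            Pattern.compile("(?i)\\bcharset\\s*=\\s*[\"']?([^\\s;\"']+)[\"']?");

    // Hypothetical helper: returns the charset name, or null if absent.
    public static String charsetFrom(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("text/html; charset='utf-8'"));
        System.out.println(charsetFrom("text/html; charset=\"UTF-8\""));
        System.out.println(charsetFrom("text/html;charset=ISO-8859-1"));
    }
}
```

The capture group excludes quotes, whitespace, and semicolons, so the returned name can be passed to a charset lookup without further trimming.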

@ishults

I noticed this was never merged -- is this fix not wanted, or were there issues with the implementation? Would it be worth submitting a new pull request for this issue?

@jhy (Owner) commented

Merged with #441, thanks

@jhy jhy closed this
4 pom.xml
@@ -5,7 +5,7 @@
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
- <version>1.5.2-SNAPSHOT</version>
+ <version>1.5.2-TECHNOPHOBIA</version>
<description>jsoup HTML parser</description>
<url>http://jsoup.org/</url>
<inceptionYear>2009</inceptionYear>
@@ -152,4 +152,4 @@
</developer>
</developers>
-</project>
+</project>
337 src/main/java/org/jsoup/safety/Whitelist.java
@@ -1,8 +1,8 @@
package org.jsoup.safety;
/*
- Thank you to Ryan Grove (wonko.com) for the Ruby HTML cleaner http://github.com/rgrove/sanitize/, which inspired
- this whitelist configuration, and the initial defaults.
+ Thank you to Ryan Grove (wonko.com) for the Ruby HTML cleaner http://github.com/rgrove/sanitize/, which inspired
+ this whitelist configuration, and the initial defaults.
*/
import org.jsoup.helper.Validate;
@@ -15,156 +15,154 @@ Thank you to Ryan Grove (wonko.com) for the Ruby HTML cleaner http://github.com/
import java.util.Map;
import java.util.Set;
-
/**
- Whitelists define what HTML (elements and attributes) to allow through the cleaner. Everything else is removed.
- <p/>
- Start with one of the defaults:
- <ul>
- <li>{@link #none}
- <li>{@link #simpleText}
- <li>{@link #basic}
- <li>{@link #basicWithImages}
- <li>{@link #relaxed}
- </ul>
- <p/>
- If you need to allow more through (please be careful!), tweak a base whitelist with:
- <ul>
- <li>{@link #addTags}
- <li>{@link #addAttributes}
- <li>{@link #addEnforcedAttribute}
- <li>{@link #addProtocols}
- </ul>
- <p/>
- The cleaner and these whitelists assume that you want to clean a <code>body</code> fragment of HTML (to add user
- supplied HTML into a templated page), and not to clean a full HTML document. If the latter is the case, either wrap the
- document HTML around the cleaned body HTML, or create a whitelist that allows <code>html</code> and <code>head</code>
- elements as appropriate.
- <p/>
- If you are going to extend a whitelist, please be very careful. Make sure you understand what attributes may lead to
- XSS attack vectors. URL attributes are particularly vulnerable and require careful validation. See
- http://ha.ckers.org/xss.html for some XSS attack examples.
-
- @author Jonathan Hedley
+ * Whitelists define what HTML (elements and attributes) to allow through the
+ * cleaner. Everything else is removed.
+ * <p/>
+ * Start with one of the defaults:
+ * <ul>
+ * <li>{@link #none}
+ * <li>{@link #simpleText}
+ * <li>{@link #basic}
+ * <li>{@link #basicWithImages}
+ * <li>{@link #relaxed}
+ * </ul>
+ * <p/>
+ * If you need to allow more through (please be careful!), tweak a base
+ * whitelist with:
+ * <ul>
+ * <li>{@link #addTags}
+ * <li>{@link #addAttributes}
+ * <li>{@link #addEnforcedAttribute}
+ * <li>{@link #addProtocols}
+ * </ul>
+ * <p/>
+ * The cleaner and these whitelists assume that you want to clean a
+ * <code>body</code> fragment of HTML (to add user supplied HTML into a
+ * templated page), and not to clean a full HTML document. If the latter is the
+ * case, either wrap the document HTML around the cleaned body HTML, or create a
+ * whitelist that allows <code>html</code> and <code>head</code> elements as
+ * appropriate.
+ * <p/>
+ * If you are going to extend a whitelist, please be very careful. Make sure you
+ * understand what attributes may lead to XSS attack vectors. URL attributes are
+ * particularly vulnerable and require careful validation. See
+ * http://ha.ckers.org/xss.html for some XSS attack examples.
+ *
+ * @author Jonathan Hedley
*/
public class Whitelist {
- private Set<TagName> tagNames; // tags allowed, lower case. e.g. [p, br, span]
- private Map<TagName, Set<AttributeKey>> attributes; // tag -> attribute[]. allowed attributes [href] for a tag.
- private Map<TagName, Map<AttributeKey, AttributeValue>> enforcedAttributes; // always set these attribute values
- private Map<TagName, Map<AttributeKey, Set<Protocol>>> protocols; // allowed URL protocols for attributes
+ private Set<TagName> tagNames; // tags allowed, lower case. e.g. [p, br,
+ // span]
+ private Map<TagName, Set<AttributeKey>> attributes; // tag -> attribute[].
+ // allowed attributes
+ // [href] for a tag.
+ private Map<TagName, Map<AttributeKey, AttributeValue>> enforcedAttributes; // always
+ // set
+ // these
+ // attribute
+ // values
+ private Map<TagName, Map<AttributeKey, Set<Protocol>>> protocols; // allowed
+ // URL
+ // protocols
+ // for
+ // attributes
+ private boolean useAbsoluteURLs = true;
/**
- This whitelist allows only text nodes: all HTML will be stripped.
-
- @return whitelist
+ * This whitelist allows only text nodes: all HTML will be stripped.
+ *
+ * @return whitelist
*/
public static Whitelist none() {
return new Whitelist();
}
/**
- This whitelist allows only simple text formatting: <code>b, em, i, strong, u</code>. All other HTML (tags and
- attributes) will be removed.
-
- @return whitelist
+ * This whitelist allows only simple text formatting:
+ * <code>b, em, i, strong, u</code>. All other HTML (tags and attributes)
+ * will be removed.
+ *
+ * @return whitelist
*/
public static Whitelist simpleText() {
- return new Whitelist()
- .addTags("b", "em", "i", "strong", "u")
- ;
+ return new Whitelist().addTags("b", "em", "i", "strong", "u");
}
/**
- This whitelist allows a fuller range of text nodes: <code>a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li,
- ol, p, pre, q, small, strike, strong, sub, sup, u, ul</code>, and appropriate attributes.
- <p/>
- Links (<code>a</code> elements) can point to <code>http, https, ftp, mailto</code>, and have an enforced
- <code>rel=nofollow</code> attribute.
- <p/>
- Does not allow images.
-
- @return whitelist
+ * This whitelist allows a fuller range of text nodes:
+ * <code>a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li,
+ ol, p, pre, q, small, strike, strong, sub, sup, u, ul</code>, and
+ * appropriate attributes.
+ * <p/>
+ * Links (<code>a</code> elements) can point to
+ * <code>http, https, ftp, mailto</code>, and have an enforced
+ * <code>rel=nofollow</code> attribute.
+ * <p/>
+ * Does not allow images.
+ *
+ * @return whitelist
*/
public static Whitelist basic() {
return new Whitelist()
- .addTags(
- "a", "b", "blockquote", "br", "cite", "code", "dd", "dl", "dt", "em",
- "i", "li", "ol", "p", "pre", "q", "small", "strike", "strong", "sub",
- "sup", "u", "ul")
+ .addTags("a", "b", "blockquote", "br", "cite", "code", "dd", "dl", "dt", "em", "i", "li", "ol", "p", "pre", "q", "small", "strike", "strong",
+ "sub", "sup", "u", "ul")
- .addAttributes("a", "href")
- .addAttributes("blockquote", "cite")
- .addAttributes("q", "cite")
+ .addAttributes("a", "href").addAttributes("blockquote", "cite").addAttributes("q", "cite")
- .addProtocols("a", "href", "ftp", "http", "https", "mailto")
- .addProtocols("blockquote", "cite", "http", "https")
+ .addProtocols("a", "href", "ftp", "http", "https", "mailto").addProtocols("blockquote", "cite", "http", "https")
.addProtocols("cite", "cite", "http", "https")
- .addEnforcedAttribute("a", "rel", "nofollow")
- ;
+ .addEnforcedAttribute("a", "rel", "nofollow");
}
/**
- This whitelist allows the same text tags as {@link #basic}, and also allows <code>img</code> tags, with appropriate
- attributes, with <code>src</code> pointing to <code>http</code> or <code>https</code>.
-
- @return whitelist
+ * This whitelist allows the same text tags as {@link #basic}, and also
+ * allows <code>img</code> tags, with appropriate attributes, with
+ * <code>src</code> pointing to <code>http</code> or <code>https</code>.
+ *
+ * @return whitelist
*/
public static Whitelist basicWithImages() {
- return basic()
- .addTags("img")
- .addAttributes("img", "align", "alt", "height", "src", "title", "width")
- .addProtocols("img", "src", "http", "https")
- ;
+ return basic().addTags("img").addAttributes("img", "align", "alt", "height", "src", "title", "width").addProtocols("img", "src", "http", "https");
}
/**
- This whitelist allows a full range of text and structural body HTML: <code>a, b, blockquote, br, caption, cite,
+ * This whitelist allows a full range of text and structural body HTML:
+ * <code>a, b, blockquote, br, caption, cite,
code, col, colgroup, dd, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, strike, strong, sub,
sup, table, tbody, td, tfoot, th, thead, tr, u, ul</code>
- <p/>
- Links do not have an enforced <code>rel=nofollow</code> attribute, but you can add that if desired.
-
- @return whitelist
+ * <p/>
+ * Links do not have an enforced <code>rel=nofollow</code> attribute, but
+ * you can add that if desired.
+ *
+ * @return whitelist
*/
public static Whitelist relaxed() {
return new Whitelist()
- .addTags(
- "a", "b", "blockquote", "br", "caption", "cite", "code", "col",
- "colgroup", "dd", "div", "dl", "dt", "em", "h1", "h2", "h3", "h4", "h5", "h6",
- "i", "img", "li", "ol", "p", "pre", "q", "small", "strike", "strong",
- "sub", "sup", "table", "tbody", "td", "tfoot", "th", "thead", "tr", "u",
- "ul")
-
- .addAttributes("a", "href", "title")
- .addAttributes("blockquote", "cite")
- .addAttributes("col", "span", "width")
- .addAttributes("colgroup", "span", "width")
- .addAttributes("img", "align", "alt", "height", "src", "title", "width")
- .addAttributes("ol", "start", "type")
- .addAttributes("q", "cite")
- .addAttributes("table", "summary", "width")
- .addAttributes("td", "abbr", "axis", "colspan", "rowspan", "width")
- .addAttributes(
- "th", "abbr", "axis", "colspan", "rowspan", "scope",
- "width")
+ .addTags("a", "b", "blockquote", "br", "caption", "cite", "code", "col", "colgroup", "dd", "div", "dl", "dt", "em", "h1", "h2", "h3", "h4",
+ "h5", "h6", "i", "img", "li", "ol", "p", "pre", "q", "small", "strike", "strong", "sub", "sup", "table", "tbody", "td", "tfoot", "th",
+ "thead", "tr", "u", "ul")
+
+ .addAttributes("a", "href", "title").addAttributes("blockquote", "cite").addAttributes("col", "span", "width")
+ .addAttributes("colgroup", "span", "width").addAttributes("img", "align", "alt", "height", "src", "title", "width")
+ .addAttributes("ol", "start", "type").addAttributes("q", "cite").addAttributes("table", "summary", "width")
+ .addAttributes("td", "abbr", "axis", "colspan", "rowspan", "width").addAttributes("th", "abbr", "axis", "colspan", "rowspan", "scope", "width")
.addAttributes("ul", "type")
- .addProtocols("a", "href", "ftp", "http", "https", "mailto")
- .addProtocols("blockquote", "cite", "http", "https")
- .addProtocols("img", "src", "http", "https")
- .addProtocols("q", "cite", "http", "https")
- ;
+ .addProtocols("a", "href", "ftp", "http", "https", "mailto").addProtocols("blockquote", "cite", "http", "https")
+ .addProtocols("img", "src", "http", "https").addProtocols("q", "cite", "http", "https");
}
/**
- Create a new, empty whitelist. Generally it will be better to start with a default prepared whitelist instead.
-
- @see #basic()
- @see #basicWithImages()
- @see #simpleText()
- @see #relaxed()
+ * Create a new, empty whitelist. Generally it will be better to start with
+ * a default prepared whitelist instead.
+ *
+ * @see #basic()
+ * @see #basicWithImages()
+ * @see #simpleText()
+ * @see #relaxed()
*/
public Whitelist() {
tagNames = new HashSet<TagName>();
@@ -174,10 +172,12 @@ public Whitelist() {
}
/**
- Add a list of allowed elements to a whitelist. (If a tag is not allowed, it will be removed from the HTML.)
-
- @param tags tag names to allow
- @return this (for chaining)
+ * Add a list of allowed elements to a whitelist. (If a tag is not allowed,
+ * it will be removed from the HTML.)
+ *
+ * @param tags
+ * tag names to allow
+ * @return this (for chaining)
*/
public Whitelist addTags(String... tags) {
Validate.notNull(tags);
@@ -190,14 +190,17 @@ public Whitelist addTags(String... tags) {
}
/**
- Add a list of allowed attributes to a tag. (If an attribute is not allowed on an element, it will be removed.)
- <p/>
- To make an attribute valid for <b>all tags</b>, use the pseudo tag <code>:all</code>, e.g.
- <code>addAttributes(":all", "class")</code>.
-
- @param tag The tag the attributes are for
- @param keys List of valid attributes for the tag
- @return this (for chaining)
+ * Add a list of allowed attributes to a tag. (If an attribute is not
+ * allowed on an element, it will be removed.)
+ * <p/>
+ * To make an attribute valid for <b>all tags</b>, use the pseudo tag
+ * <code>:all</code>, e.g. <code>addAttributes(":all", "class")</code>.
+ *
+ * @param tag
+ * The tag the attributes are for
+ * @param keys
+ * List of valid attributes for the tag
+ * @return this (for chaining)
*/
public Whitelist addAttributes(String tag, String... keys) {
Validate.notEmpty(tag);
@@ -219,16 +222,21 @@ public Whitelist addAttributes(String tag, String... keys) {
}
/**
- Add an enforced attribute to a tag. An enforced attribute will always be added to the element. If the element
- already has the attribute set, it will be overridden.
- <p/>
- E.g.: <code>addEnforcedAttribute("a", "rel", "nofollow")</code> will make all <code>a</code> tags output as
- <code>&lt;a href="..." rel="nofollow"></code>
-
- @param tag The tag the enforced attribute is for
- @param key The attribute key
- @param value The enforced attribute value
- @return this (for chaining)
+ * Add an enforced attribute to a tag. An enforced attribute will always be
+ * added to the element. If the element already has the attribute set, it
+ * will be overridden.
+ * <p/>
+ * E.g.: <code>addEnforcedAttribute("a", "rel", "nofollow")</code> will make
+ * all <code>a</code> tags output as
+ * <code>&lt;a href="..." rel="nofollow"></code>
+ *
+ * @param tag
+ * The tag the enforced attribute is for
+ * @param key
+ * The attribute key
+ * @param value
+ * The enforced attribute value
+ * @return this (for chaining)
*/
public Whitelist addEnforcedAttribute(String tag, String key, String value) {
Validate.notEmpty(tag);
@@ -250,15 +258,18 @@ public Whitelist addEnforcedAttribute(String tag, String key, String value) {
}
/**
- Add allowed URL protocols for an element's URL attribute. This restricts the possible values of the attribute to
- URLs with the defined protocol.
- <p/>
- E.g.: <code>addProtocols("a", "href", "ftp", "http", "https")</code>
-
- @param tag Tag the URL protocol is for
- @param key Attribute key
- @param protocols List of valid protocols
- @return this, for chaining
+ * Add allowed URL protocols for an element's URL attribute. This restricts
+ * the possible values of the attribute to URLs with the defined protocol.
+ * <p/>
+ * E.g.: <code>addProtocols("a", "href", "ftp", "http", "https")</code>
+ *
+ * @param tag
+ * Tag the URL protocol is for
+ * @param key
+ * Attribute key
+ * @param protocols
+ * List of valid protocols
+ * @return this, for chaining
*/
public Whitelist addProtocols(String tag, String key, String... protocols) {
Validate.notEmpty(tag);
@@ -314,22 +325,38 @@ boolean isSafeAttribute(String tagName, Element el, Attribute attr) {
return false;
}
- private boolean testValidProtocol(Element el, Attribute attr, Set<Protocol> protocols) {
+ public Whitelist setUseAbsoluteURLs(boolean useAbsoluteURLs) {
+ this.useAbsoluteURLs = useAbsoluteURLs;
+ return this;
+ }
+
+ private boolean testValidProtocol(Element el, Attribute attr, Set<Protocol> protocols) {
if (isValidAnchor(attr.getValue())) {
return true;
}
-
- // resolve relative urls to abs, and update the attribute so output html has abs.
+
+ // resolve relative urls to abs, and update the attribute so output html
+ // has abs.
// rels without a baseuri get removed
- String value = el.absUrl(attr.getKey());
+
+ String value = attr.getValue();
+
+ if (!useAbsoluteURLs) {
+ if (value.startsWith("/")) {
+ return true;
+ }
+ }
+
+ value = el.absUrl(attr.getKey());
attr.setValue(value);
-
+
for (Protocol protocol : protocols) {
String prot = protocol.toString() + ":";
if (value.toLowerCase().startsWith(prot)) {
return true;
}
}
+
return false;
}
@@ -348,7 +375,7 @@ Attributes getEnforcedAttributes(String tagName) {
}
return attrs;
}
-
+
// named types for config. All just hold strings, but here for my sanity.
static class TagName extends TypedValue {
@@ -409,13 +436,18 @@ public int hashCode() {
@Override
public boolean equals(Object obj) {
- if (this == obj) return true;
- if (obj == null) return false;
- if (getClass() != obj.getClass()) return false;
+ if (this == obj)
+ return true;
+ if (obj == null)
+ return false;
+ if (getClass() != obj.getClass())
+ return false;
TypedValue other = (TypedValue) obj;
if (value == null) {
- if (other.value != null) return false;
- } else if (!value.equals(other.value)) return false;
+ if (other.value != null)
+ return false;
+ } else if (!value.equals(other.value))
+ return false;
return true;
}
@@ -425,4 +457,3 @@ public String toString() {
}
}
}
-
6 src/test/java/org/jsoup/safety/CleanerTest.java
@@ -119,4 +119,10 @@
String clean = Jsoup.clean(html, Whitelist.basic());
assertEquals("<a rel=\"nofollow\">Link</a>", clean);
}
+
+ @Test public void allowsRelativeLinksIfConfiguredThusly() {
+ String html = "<a href='/foo'>Link</a>";
+ String clean = Jsoup.clean(html, Whitelist.basic().setUseAbsoluteURLs(false));
+ assertEquals("<a href=\"/foo\" rel=\"nofollow\">Link</a>", clean);
+ }
}
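The decision order that the testValidProtocol change introduces can be sketched in isolation. This is a minimal, self-contained approximation and not jsoup code: the class RelativeLinkSketch and the helpers isAllowed and resolve are hypothetical names, java.net.URI stands in for Element.absUrl, and the anchor test is an assumed reading of isValidAnchor (starts with '#', no whitespace). The extra rejection of values starting with "//" is an addition in this sketch, not part of the diff; without it, a protocol-relative URL such as //other-host/path would also pass the startsWith("/") check.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class RelativeLinkSketch {

    // Hypothetical stand-in for Whitelist.testValidProtocol; not part of the jsoup API.
    static boolean isAllowed(String value, String baseUri,
                             boolean useAbsoluteUrls, Set<String> protocols) {
        // Internal anchors such as "#top" pass (assumed behaviour of isValidAnchor).
        if (value.startsWith("#") && !value.matches(".*\\s.*")) {
            return true;
        }
        // Root-relative links are kept untouched when absolute URLs are not forced.
        // Assumption beyond the diff: "//host/path" is not treated as root-relative.
        if (!useAbsoluteUrls && value.startsWith("/") && !value.startsWith("//")) {
            return true;
        }
        // Otherwise resolve against the base URI and test the protocol prefix.
        String abs = resolve(baseUri, value).toLowerCase(Locale.ROOT);
        for (String protocol : protocols) {
            if (abs.startsWith(protocol + ":")) {
                return true;
            }
        }
        return false;
    }

    // Returns an absolute URL, or "" when the value cannot be resolved
    // (for example a relative value with no base URI).
    static String resolve(String baseUri, String value) {
        try {
            URI uri = new URI(value);
            if (uri.isAbsolute()) {
                return uri.toString();
            }
            if (baseUri == null || baseUri.isEmpty()) {
                return "";
            }
            return new URI(baseUri).resolve(uri).toString();
        } catch (URISyntaxException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        Set<String> protocols = new HashSet<>(Arrays.asList("http", "https"));
        System.out.println(isAllowed("/foo", "", false, protocols));
        System.out.println(isAllowed("/foo", "", true, protocols));
        System.out.println(isAllowed("/foo", "http://example.com/", true, protocols));
    }
}
```

With the flag off, a root-relative href survives even without a base URI, which is the case the new allowsRelativeLinksIfConfiguredThusly test covers. With the flag on, the same value is only accepted once it resolves to an allowed protocol.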