Support for anchors (A tags with internal link syntax) #77

wants to merge 392 commits into



I've added support for hrefs starting with '#', so that the Cleaner no longer strips links to internal anchors on a page. I need this for my day job and thought it might be useful to you too.

W3C syntax for anchors states simply that they must start with a # and contain no spaces.
I've added 2 tests in the CleanerTest class that document this behaviour.
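
For illustration, here is a minimal standalone sketch of the rule described above: an href counts as an internal anchor if it starts with '#' and contains no whitespace. This is not the code in the patch. The class `AnchorHrefCheck` and the methods `isInternalAnchor` and `isAllowedHref` are hypothetical names that use only the JDK, and the scheme check is a simplified stand-in for whatever protocol test the Cleaner applies.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not jsoup's Cleaner or whitelist code.
final class AnchorHrefCheck {

    private AnchorHrefCheck() {}

    /** True if href is an in-page anchor: starts with '#' and contains no whitespace. */
    static boolean isInternalAnchor(String href) {
        if (href == null || !href.startsWith("#")) {
            return false;
        }
        for (int i = 0; i < href.length(); i++) {
            if (Character.isWhitespace(href.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    /** Accepts in-page anchors, plus hrefs whose scheme is in allowedSchemes (assumed, e.g. "http"). */
    static boolean isAllowedHref(String href, Set<String> allowedSchemes) {
        if (isInternalAnchor(href)) {
            return true;
        }
        if (href == null) {
            return false;
        }
        String lower = href.toLowerCase(Locale.ROOT);
        for (String scheme : allowedSchemes) {
            if (lower.startsWith(scheme.toLowerCase(Locale.ROOT) + ":")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> schemes = Set.of("http", "https");
        System.out.println(isAllowedHref("#section-2", schemes));          // true
        System.out.println(isAllowedHref("# broken", schemes));            // false
        System.out.println(isAllowedHref("javascript:alert(1)", schemes)); // false
    }
}
```

Checking the anchor case before the scheme list means a '#' link is kept without needing a base URI, while anything that is neither an anchor nor an allowed scheme is still rejected.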

If you could merge this in I'd be grateful; currently we're using the forked jar, but it'd be nice to stay on trunk.

and others added some commits Feb 3, 2010
@jhy Implemented Element#wrap and Elements#wrap
Also protected Node.replaceChild, removeChild, addChild.
@jhy New: E + F adjacent sibling selector, E ~ F preceding sibling. 465493e
@jhy Maven Sonatype setup 96ef20d
@jhy [maven-release-plugin] prepare release jsoup-0.2.1 0747d34
@jhy [maven-release-plugin] prepare for next development iteration 55e9a11
@jhy [maven-release-plugin] prepare release jsoup-0.2.1 4c55fb7
@jhy Release prep 138c1a7
@jhy [maven-release-plugin] prepare release jsoup-0.2.1a 7b0469e
@jhy [maven-release-plugin] prepare for next development iteration 55edde5
@jhy Sonatype release machinations 3185272
@jhy [maven-release-plugin] prepare release jsoup-0.2.1b 69f7fdb
@jhy [maven-release-plugin] prepare for next development iteration a9d69ab
@jhy Add addClass, removeClass, toggleClass, hasClass to Element and Eleme…

Closes #2
@jhy Improved document normalisation. ba751b9
@jhy hasText 0984e71
@jhy Improved HTML output (pretty-print) 837afba
@jhy Changelog for release prep fb5e61f
@jhy [maven-release-plugin] prepare release jsoup-0.2.2 c4f5acf
@jhy [maven-release-plugin] prepare for next development iteration f20e2ba
@jhy Corrected change note 2a00564
@jhy Merge branch 'master' of fafbfc7
@jhy Assert attribute values are not null, rather than not empty.
Closes #7.
@jhy Changed Elements#attr(key) to scan all elements for attribute.
Closes #4.
@jhy Implemented Elements html(), html(string), append, and prepend.
Closes #5.
@jhy Changelog 25f33dd
@jhy Normalise head by prepending, not appending.
Closes #9.
@jhy Cleaner.isValid() method.
Closes #6.
@jhy IsValid test for OK attribute 4c2c345
@jhy Test self is not descender 985b4ae
@jhy Deploy prep 4d7c7fa
@jhy Release prep bb86d89
@jhy [maven-release-plugin] prepare release jsoup-0.3.1 cd59eed
@jhy [maven-release-plugin] prepare for next development iteration 1aadd75
@jhy Allow - and _ in CSS ID selectors.
Closes #10.
@jhy Changelog b3d2cbc
@jhy Changelog 858240e
@jhy Resolve relative links when cleaning.
Closes #12.
@jhy Allow combinators at start of selector query
Closes #13
@jhy Added val() and val(string) to Element and Elements.
Treat contents of textarea as text, not data.

Closes #14
@jhy Added Node#remove and Node#replaceWith.
Closes #19
@jhy Throw exception if trying to parse non-text content
Closes #17
@jhy Added TextNode#text and TextNode#text(String)
Closes #18
@jhy Added selector support for :eq, :lt, and :gt
Closes #16
@uggedal @jhy uggedal String.isEmpty() and LinkedList.peekFirst() are not part of the Java 5…
….0 API.
@jhy Updated ignore list 499c262
@jhy Preparing 1.1.1 release e9bc6c3
@jhy [maven-release-plugin] prepare release jsoup-1.1.1 739436d
@jhy [maven-release-plugin] prepare for next development iteration 685832e
@jhy Change notes 284c552
@jhy Fixed test package 253412a
@jhy Fix an issue where text order was incorrect when parsing pre-document…

Fixes #23
@jhy Clean up the parse stack correctly when parsing data-nodes.
Fixes #22.
@jhy Fixed javadoc typo 0059dce
@jhy Added :has(selector) pseudo-selector.
Added Element#parents() and Elements#parents() methods.

Fixes #20
@jhy Changelog release date e5716c0
@jhy Improved implicit close tag heuristic detection when parsing malforme…
…d HTML.

Fixes an issue where appending / prepending rows to a table (or to similar implicit
element structures) would create redundant wrapping elements.

Fixes #21
@jhy Cleanup Element and Node add mechanism e65cfb3
@jhy Added .before(html) and .after(html) methods to Element and Elements,…
… to insert sibling HTML
@jhy Added :contains(text) selector 2bcfa10
@jhy [maven-release-plugin] prepare release jsoup-1.2.1 9883b56
@jhy [maven-release-plugin] prepare for next development iteration 8217467
@jhy Changelog release date 1ced3d7
@jhy Fixed javadoc for :eq(n) 4ed5aec
@jhy Upgraded the selector query parser to allow nested selectors like 'di…
@jhy Updated TokenQueue so :contains(text) can be escaped, if looking
for ( or ) within text
@jhy Implemented :matches(regex) selector. 47b13d4
@jhy Changelog 92baf07
@jhy Parsing optimisation.
Modified TokenQueue to use a StringBuilder + offset to back the queue,
instead of a linked list. Reduces memory and CPU use.
@jhy Parsing performance optimisation.
Modified TokenQueue chompTo method to use indexOf to allow rapid
scan for next token.
@jhy Parsing performance optimisation.
Intern attribute keys (often shared), and dropped back default
bucket sizes for attributes and element children so as to conserve
@jhy TextNode performance tweaks 3c28ff7
@jhy Performance optimisation in parsing. 7425c6d
@jhy Use a Visitor instead of recursion for HTML and selectors. a038358
@jhy Performance tweaks. 3975b8b
@jhy Tidy 5f6a9ae
@jhy Added [key~=regex] attribute selector by regular expression 7c28911
@jhy Tidy c41390c
@jhy Changelog a35db94
@jhy Test update fb37594
@jhy [maven-release-plugin] prepare release jsoup-1.2.2 145d783
@jhy [maven-release-plugin] prepare for next development iteration cc5f37f
@jhy Automatically determine charset when parsing from URL or File. 784a31f
@jhy Auto detect charset from HTML5 <meta charset> tag if present d4c06ac
@jhy Changed DT & DD tags to block-mode tags, to follow practice over spec. a03639a
@jhy Added support for [^attributePrefix] selector query. Useful for finding
elements with HTML5 datasets: [^data]
@jhy Implemented Element.dataset(), to retrieve a map of custom data attri…
@jhy Improved tag definitions to allow limited children and excluded child…

Improved implicit table element creation, particularly around tbody tags.
@jhy Cleaned tag definitions to make head and dl parsing more generic. ab9d34e
@jhy Implicit close for <caption> tags. 0b509fd
@jhy Changelog 8f66e9c
@jhy Testcase for malformed meta http-equiv charset. 2290966
@jhy HTML5 tag support a09ef4a
@jhy Added support for namespaced elements (<fb:name>) and selectors (fb|n…
@jhy Improved HTML output format for empty elements and auto-detected self…
… closing tags.

Closes #27
@jhy Added support for tag names with - and _ (<abc_foo>, <abc-foo>) 05623f4
@jhy Removed obsolete nodeDepth method 479a9fa
@jhy Implemented Node.ownerDocument DOM API method. 34bc04b
@jhy Fixed support for character class regular expressions in [attr=~regex…
…] selector
@jhy Fixed support for character class regular expressions in [attr=~regex…
…] selector
@jhy Note <tag > fix d52406b
and others added some commits Sep 19, 2010
@jhy Changelog 9964fd5
@jhy [maven-release-plugin] prepare release jsoup-1.3.3 88c822e
@jhy [maven-release-plugin] prepare for next development iteration 3ca6654
@jhy Implemented DataNode.setWholeData() to allow updating of script and s…
…tyle data contents.
@jhy Fixed support for jsoup.connect to follow redirects between http & ht…
…tps URLs.

Fixes #37
@jhy Fixed issue in jsoup.connect when extracting character set from conte…
…nt-type header; now supports quoted

charset declaration.
@jhy Relaxed parse rules of H1 - H6 to allow nested content. 866825d
@jhy Relaxed parse rule of SPAN to treat as block, to allow nested block c…
@jhy Added ability to load and parse HTML from an input stream. c408300
@jhy Test fix 2f46ca2
@jhy Javadoc example on absUrl f70526c
@jhy Document normalisation now more correctly enforces document structure.
 - ensure only one head and one body element, both under html el
 - allow html/head/noscript/img for some site's analytic pattern

Fixes #43
@jhy Support node.outerHtml() method when node has no parent.
Fixes #45
@jhy Fixed support for HTML entities with numbers in name (e.g. &frac34, &…

Fixes #46
@jhy Merge branch 'master' of
@clementdenis-vv @jhy clementdenis-vv Fixes ArrayIndexOutOfBoundsException on response with empty headers c907932
@jhy Implemented Node.clone() to create deep, independent copies of Nodes,…
… Elements, and Documents.

Fixes #47
@jhy Testcase to confirm doctypes get cloned e9f7254
@jhy Fixed absolute URL generation from relative URLs which are only query…
… strings.

Fixes #49
@jhy Output format tweak 2212615
@jhy Added :not() selector, to find elements that do not match the selecto…
…r. E.g. div:not(.logo) finds divs that

do not have the "logo" class name.

Fixes #36
@jhy Added Elements.not(selector) method, to remove undesired results from…
… selector results.
@jhy Changelog update in launch preparation c1cf385
@jhy Changelog tweak 7babf92
@jhy [maven-release-plugin] prepare release jsoup-1.4.1 b2229d6
@jhy [maven-release-plugin] prepare for next development iteration 5824e29
@bbeck bbeck Updated OutputSettings inside of Document to be a static inner class.
This addresses the issue with using jsoup and scala 2.8 discussed here:

This change should be safe to commit and won't introduce a backwards-incompatible
change to the public interface of jsoup (you can always reference a static member
via a non-static path) for any existing users.  A recompilation of their code
won't even be necessary.

All tests continue to pass for me after this change.
@kzn kzn added .clone() for Elements dc8b3fd
@kzn kzn Initial add of new generation selectors(faster than original) f4337b9
Anton Kazennikov Added attribute selectors f8e487a
@kzn kzn added AttrSelector.AttrNamePrefixSelector 2e93a70
@kzn kzn fix bug in element selector: incorrect behavior on multiple classes 6a6fa21
@kzn kzn removing as exists matching evaluator classes 45a45e1
@kzn kzn changes wrt existing Evaluator class 326646f
@kzn kzn Selectors update d3d36bb
@kzn kzn removing boxing/unboxing 5ce24c2
@kzn kzn evaluators made public 98f8292
@kzn kzn Adding evaluators tests 87891d5
@kzn kzn added new tests 67e09f8
Anton Kazennikov Renaming in some selectors 57e0209
Anton Kazennikov adding new ctors to And and Or 76a7ade
Anton Kazennikov adding toString() 2d7d62f
Anton Kazennikov added base container 22a1d10
Anton Kazennikov added .toString() to basic Evaluators 8573486
Anton Kazennikov Adding Selector parser aabd34f
@kzn kzn removing char boxing 2dd2968
@kzn kzn adding empty and .addAll 95b03ed
@kzn kzn implemented :has selector 06a6674
@kzn kzn Working parser except the root node selector.
Added basic tests
@kzn kzn removed unused constructor 80e420e
@kzn kzn parser update: normal order of selectors 9c00eab
@kzn kzn fix non-void element parsing such as <a href=/link/>link text</a> a59cfee
@kzn kzn Evaluator.match(Element test) ->
Evaluator.match(Element root, Element test)
@kzn kzn Character -> char change 4c89a93
@kzn kzn restored all tests. 906bbf0
@kzn kzn added RootSelector
updated tree selectors wrt subtree matching
@kzn kzn update evaluator wrt subtree matching 3f91b30
@kzn kzn Added RootSelector support 03abecc
@kzn kzn small optimizations 692d5f8
@kzn kzn added javadocs bb2f08b
@kzn kzn Added javadocs for Evaluators.
Updated tests.
Updated parser
@jhy Fixed issue when using descendant regex attribute selectors.
Fixes #52
@jhy Added a test to confirm combinators don't match in balanced contains …
@jhy Merge branch 'master' of into bbeck-ma…
@jhy Fixed tokeniser optimisation when scanning for missing data element c…
…lose tags.

Fixes #67
@michael-simons @jhy michael-simons There are no valid (x)html tags that start with numbers 05285d0
Anton Kazennikov Merge remote branch 'upstream/master'
@jhy Reverted changes that only allow empty tags in pre-defined instances.
Markup like <tag /> needs to be parsed as an empty element.
@jhy Integrated new single-pass selector evaluators, contributed by kzn (A…
…nton Kazennikov).
@tc @jhy tc Removed to fix https://githu…

"jsoup/src/main/java/org/jsoup/select/selectors/[8,35] package does not exist"
@jhy Force strict entity matching (must be &xxx; and not &xxx) in element …

Fixes #71
@jhy Ensure that Jsoup.Connect handles relative redirects in cases where the
underlying HTTP stack doesn't automatically follow them.

Fixes #73
@jhy Allow Jsoup.Connect to parse application/xml and application/xhtml+xm…
…l responses.

Fixes #72
@jhy Defined U (underline) element as an inline tag. 82264c8
@jhy Cleanup of selector class files a59a878
@jhy Updated Jsoup.Connection so that cookies set on a redirect response w…
…ill be included on the redirected request and response.
@jhy Prevent infinite redirection loops in jsoup.connect. 096e130
@jhy Implemented TextNode.splitText e3ddbb8
@jhy Moved .wrap, .before, and .after from Element to Node for flexibility…
…. Overriding implementations in Element still return Element.
@jhy Don't run URL connectivity tests by default. 800d6e1
@jhy Added ability to change an element's tag with Element.tagName(String)…
…, and to change many at once with Elements.tagName(String).
@jhy Test to confirm that abs URL method works on img src attributes. c659826
@jhy Generify empty child list. 8c112c7
@jhy Removed redundant empty array 765eafc
@jhy Changelog updates 66343dc
@jhy Readme update 22e2513
@jhy [maven-release-plugin] prepare release jsoup-1.5.1 da25d1d
@jhy [maven-release-plugin] prepare for next development iteration 67ee7a1
@jhy Fixed issue with selector parser where some boolean AND + OR combined…
… queries (e.g. "meta[http-equiv], meta[content]") were being parsed incorrectly as OR only queries (e.g. former as "meta, [http-equiv], meta[content]")

Fixed issue where a content-type specified in a meta tag may not be reliably detected, due to the above issue.
@jhy Allow <a> and <font> elements to be treated as flow/block elements, t…
…o match browser parse trees.
@jhy Updated copyright date 8b52f48
Rory Gibson Add support for A elements with hrefs starting '#' i.e. anchors. bdd810f
Rory Gibson Make relative URLs possible 46cd1e3

This should also support single quotes, e.g. Content-Type:text/html; charset='utf-8'

Site that has this:

ishults commented Jul 23, 2014

I noticed this was never merged -- is this fix not wanted, or were there issues with the implementation? Would it be worth submitting a new pull request for this issue?

jhy commented Oct 2, 2014

Merged with #441, thanks

@jhy jhy closed this Oct 2, 2014
@zazi zazi added a commit to dswarm/jsoup that referenced this pull request Oct 15, 2015
@ishults @zazi ishults + zazi Attempt fix for #77. Add support for # 'protocol'. 78f361e