Support for anchors (A tags with internal link syntax) #77

wants to merge 392 commits into from

10 participants

Rory Gibson Mathias Bogaert Jonathan Hedley Eivind Uggedal Akira Ueda Clément Denis Brandon Beck Anton Kazennikov Michael Simons Tommy Chheng
Rory Gibson

I've added support for hrefs starting with '#', so that links to internal anchors on a page are no longer stripped out by the Cleaner. I need this for my day job and thought it might be useful to you.

W3C syntax for anchors states simply that they must start with a # and contain no spaces.
I've added 2 tests in the CleanerTest class that document this behaviour.
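
To illustrate the rule described above (an anchor href starts with '#' and contains no whitespace), here is a minimal, self-contained sketch. The class and method names are hypothetical and this is not the code from the pull request or jsoup's Cleaner; it only shows the check itself.

```java
// Hypothetical helper for illustration only; not part of jsoup.
final class AnchorHrefCheck {

    /**
     * Returns true when the href looks like a same-page anchor:
     * it starts with '#' and contains no whitespace characters.
     */
    static boolean isInternalAnchor(String href) {
        if (href == null || !href.startsWith("#")) {
            return false;
        }
        for (int i = 0; i < href.length(); i++) {
            if (Character.isWhitespace(href.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isInternalAnchor("#section-2"));              // true
        System.out.println(isInternalAnchor("#two words"));              // false
        System.out.println(isInternalAnchor("http://example.com/#top")); // false
    }
}
```

Note that a bare "#" passes this check; whether that should count as a valid anchor is a design choice the description above does not settle.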

If you could merge this in I'd be grateful; currently we're using the forked jar, but it'd be nice to stay on trunk.

and others added some commits February 03, 2010
Jonathan Hedley Implemented Element#wrap and #Elements#wrap
Also protected Node.replaceChild, removeChild, addChild.
Jonathan Hedley New: E + F adjacent sibling selector, E ~ F preceding sibling. 465493e
Jonathan Hedley Maven Sonatype setup 96ef20d
Jonathan Hedley [maven-release-plugin] prepare release jsoup-0.2.1 0747d34
Jonathan Hedley [maven-release-plugin] prepare for next development iteration 55e9a11
Jonathan Hedley [maven-release-plugin] prepare release jsoup-0.2.1 4c55fb7
Jonathan Hedley Release prep 138c1a7
Jonathan Hedley [maven-release-plugin] prepare release jsoup-0.2.1a 7b0469e
Jonathan Hedley [maven-release-plugin] prepare for next development iteration 55edde5
Jonathan Hedley Sonatype release machinations 3185272
Jonathan Hedley [maven-release-plugin] prepare release jsoup-0.2.1b 69f7fdb
Jonathan Hedley [maven-release-plugin] prepare for next development iteration a9d69ab
Jonathan Hedley Add addClass, removeClass, toggleClass, hasClass to Element and Eleme…

Closes #2
Jonathan Hedley Improved document normalisation. ba751b9
Jonathan Hedley hasText 0984e71
Jonathan Hedley Improved HTML output (pretty-print) 837afba
Jonathan Hedley Changelog for release prep fb5e61f
Jonathan Hedley [maven-release-plugin] prepare release jsoup-0.2.2 c4f5acf
Jonathan Hedley [maven-release-plugin] prepare for next development iteration f20e2ba
Jonathan Hedley Corrected change note 2a00564
Jonathan Hedley Merge branch 'master' of fafbfc7
Jonathan Hedley Assert attribute values are not null, rather than not empty.
Closes #7.
Jonathan Hedley Changed Elements#attr(key) to scan all elements for attribute.
Closes #4.
Jonathan Hedley Implemented Elements html(), html(string), append, and prepend.
Closes #5.
Jonathan Hedley Changelog 25f33dd
Jonathan Hedley Normalise head by prepending, not appending.
Closes #9.
Jonathan Hedley Cleaner.isValid() method.
Closes #6.
Jonathan Hedley IsValid test for OK attribute 4c2c345
Jonathan Hedley Test self is not descender 985b4ae
Jonathan Hedley Deploy prep 4d7c7fa
Jonathan Hedley Release prep bb86d89
Jonathan Hedley [maven-release-plugin] prepare release jsoup-0.3.1 cd59eed
Jonathan Hedley [maven-release-plugin] prepare for next development iteration 1aadd75
Jonathan Hedley Allow - and _ in CSS ID selectors.
Closes #10.
Jonathan Hedley Changelog b3d2cbc
Jonathan Hedley Changelog 858240e
Jonathan Hedley Resolve relative links when cleaning.
Closes #12.
Jonathan Hedley Allow combinators at start of selector query
Closes #13
Jonathan Hedley Added val() and val(string) to Element and Elements.
Treat contents of textarea as text, not data.

Closes #14
Jonathan Hedley Added Node#remove and Node#replaceWith.
Closes #19
Jonathan Hedley Throw exception if trying to parse non-text content
Closes #17
Jonathan Hedley Added TextNode#text and TextNode#text(String)
Closes #18
Jonathan Hedley Added selector support for :eq, :lt, and :gt
Closes #16
Eivind Uggedal String.isEmpty() and LinkedList.peekFirst() are not part of the Java 5.0 API.
Jonathan Hedley Updated ignore list 499c262
Jonathan Hedley Preparing 1.1.1 release e9bc6c3
Jonathan Hedley [maven-release-plugin] prepare release jsoup-1.1.1 739436d
Jonathan Hedley [maven-release-plugin] prepare for next development iteration 685832e
Jonathan Hedley Change notes 284c552
Jonathan Hedley Fixed test package 253412a
Jonathan Hedley Fix an issue where text order was incorrect when parsing pre-document…

Fixes #23
Jonathan Hedley Clean up the parse stack correctly when parsing data-nodes.
Fixes #22.
Jonathan Hedley Fixed javadoc typo 0059dce
Jonathan Hedley Added :has(selector) pseudo-selector.
Added Element#parents() and Elements#parents() methods.

Fixes #20
Jonathan Hedley Changelog release date e5716c0
Jonathan Hedley Improved implicit close tag heuristic detection when parsing malformed HTML.

Fixes an issue where appending / prepending rows to a table (or to similar implicit
element structures) would create redundant wrapping elements.

Fixes #21
Jonathan Hedley Cleanup Element and Node add mechanism e65cfb3
Jonathan Hedley Added .before(html) and .after(html) methods to Element and Elements, to insert sibling HTML
Jonathan Hedley Added :contains(text) selector 2bcfa10
Jonathan Hedley [maven-release-plugin] prepare release jsoup-1.2.1 9883b56
Jonathan Hedley [maven-release-plugin] prepare for next development iteration 8217467
Jonathan Hedley Changelog release date 1ced3d7
Jonathan Hedley Fixed javadoc for :eq(n) 4ed5aec
Jonathan Hedley Upgraded the selector query parser to allow nested selectors like 'di…
Jonathan Hedley Updated TokenQueue so :contains(text) can be escaped, if looking
for ( or ) within text
Jonathan Hedley Implemented :matches(regex) selector. 47b13d4
Jonathan Hedley Changelog 92baf07
Jonathan Hedley Parsing optimisation.
Modified TokenQueue to use a StringBuilder + offset to back the queue,
instead of a linked list. Reduces memory and CPU use.
Jonathan Hedley Parsing performance optimisation.
Modified TokenQueue chompTo method to use indexOf to allow rapid
scan for next token.
Jonathan Hedley Parsing performance optimisation.
Intern attribute keys (often shared), and dropped back default
bucket sizes for attributes and element children so as to conserve
Jonathan Hedley TextNode performance tweaks 3c28ff7
Jonathan Hedley Performance optimisation in parsing. 7425c6d
Jonathan Hedley Use a Visitor instead of recursion for HTML and selectors. a038358
Jonathan Hedley Performance tweaks. 3975b8b
Jonathan Hedley Tidy 5f6a9ae
Jonathan Hedley Added [key~=regex] attribute selector by regular expression 7c28911
Jonathan Hedley Tidy c41390c
Jonathan Hedley Changelog a35db94
Jonathan Hedley Test update fb37594
Jonathan Hedley [maven-release-plugin] prepare release jsoup-1.2.2 145d783
Jonathan Hedley [maven-release-plugin] prepare for next development iteration cc5f37f
Jonathan Hedley Automatically determine charset when parsing from URL or File. 784a31f
Jonathan Hedley Auto detect charset from HTML5 <meta charset> tag if present d4c06ac
Jonathan Hedley Changed DT & DD tags to block-mode tags, to follow practise over spec. a03639a
Jonathan Hedley Added support for [^attributePrefix] selector query. Useful for finding
elements with HTML5 datasets: [^data]
Jonathan Hedley Implemented Element.dataset(), to retrieve a map of custom data attri…
Jonathan Hedley Improved tag definitions to allow limited children and excluded child…

Improved implicit table element creation, particularly around tbody tags.
Jonathan Hedley Cleaned tag definitions to make head and dl parsing more generic. ab9d34e
Jonathan Hedley Implicit close for <caption> tags. 0b509fd
Jonathan Hedley Changelog 8f66e9c
Jonathan Hedley Testcase for malformed meta http-equiv charset. 2290966
Jonathan Hedley HTML5 tag support a09ef4a
Jonathan Hedley Added support for namespaced elements (<fb:name>) and selectors (fb|n…
Jonathan Hedley Improved HTML output format for empty elements and auto-detected self closing tags.

Closes #27
Jonathan Hedley Added support for tag names with - and _ (<abc_foo>, <abc-foo>) 05623f4
Jonathan Hedley Removed obsolete nodeDepth method 479a9fa
Jonathan Hedley Implemented Node.ownerDocument DOM API method. 34bc04b
Jonathan Hedley Fixed support for character class regular expressions in [attr=~regex] selector
Jonathan Hedley Fixed support for character class regular expressions in [attr=~regex] selector
Jonathan Hedley Note <tag > fix d52406b
Jonathan Hedley Draft implementation of Entities, for customisable entity escaping. 6bde6c8
Jonathan Hedley Working on escape/unescape routine. 8bb490a
Jonathan Hedley

This section will be clearer with a regex: &(#(x|X)?(\d+)|[a-zA-Z]+);?
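
As a rough illustration of what the suggested regex matches, here is a small sketch using java.util.regex. The pattern is copied verbatim from the comment above; the class and method names are hypothetical and this is not jsoup's Entities implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only; not part of jsoup.
final class EntityPatternDemo {

    // Pattern taken from the review comment: a numeric reference
    // (optionally prefixed with x or X) or a named entity,
    // followed by an optional semicolon.
    static final Pattern ENTITY =
            Pattern.compile("&(#(x|X)?(\\d+)|[a-zA-Z]+);?");

    /** Returns every substring of the input that the pattern matches, in order. */
    static List<String> findEntities(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = ENTITY.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [&amp;, &#36;, &lt]
        System.out.println(findEntities("a &amp; b &#36; c &lt d"));
    }
}
```

One thing the sketch makes visible: because the numeric branch uses \d+, a hex reference containing letters such as &#x3C; is only matched up to &#x3 by the pattern as written.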

and others added some commits August 03, 2010
Jonathan Hedley Simplified Entity unescaper 61b18d9
Jonathan Hedley Added ability to configure the document's output charset. 78a028b
Jonathan Hedley Re-ordered changelog 724c454
Jonathan Hedley [maven-release-plugin] prepare release jsoup-1.2.3 d27d92e
Jonathan Hedley [maven-release-plugin] prepare for next development iteration c807c32
Jonathan Hedley Use jsoup escaper for attributes, not Apache's. 8180838
Jonathan Hedley Optimise adding nodes to end of childnode list. 1135778
Jonathan Hedley TokenQueue optimisations 57d909d
Jonathan Hedley Optimised document normalisation 64f7110
Jonathan Hedley Mini optimisations dbe01fe
Jonathan Hedley Restored public access for Entities.EscapeMode eee3609
Jonathan Hedley Javadoc fix 245e437
Jonathan Hedley Removed dependency on Apache Commons-lang. Jsoup now has no external …
Jonathan Hedley Optimised normaliseWhitespace 2587788
Jonathan Hedley Optimised attribute html 6463dd7
Jonathan Hedley Micro-optimise tag ancestor ce4e564
Jonathan Hedley Optimised textnodes to not hold attributes or childnodes unless required on use.
Jonathan Hedley Fixed support for case-sensitive HTML escape entities.
Fixes #31
Jonathan Hedley Fixed issue when parsing tags with keyless attributes.
Fixes #32
Jonathan Hedley Entity doc e5cbf67
Jonathan Hedley Draft / in progress implementation of Connection 6e5e8a2
Jonathan Hedley Initial implementation of Connection f737aa1
Jonathan Hedley Working on http connection implementation 6942ad8
Jonathan Hedley Implemented request headers 4394c86
Jonathan Hedley Implemented query string from data 7eee43a
Jonathan Hedley Fixed Attributes.html() f4061bd
Jonathan Hedley Added support for gzipped output.
Fixes #28
Jonathan Hedley Connection timeout specified in millis, not seconds 16737a6
Jonathan Hedley Documented Connection interface methods 7d0015f
Jonathan Hedley Tidied up Connection and Jsoup use 608599c
Jonathan Hedley URL connection tests a291885
Jonathan Hedley Implemented Element#ownText() a27046b
Jonathan Hedley Changelog 5a29e18
Jonathan Hedley Added support for non-pretty-printed HTML output, to more closely mirror the input HTML.

Fixes #8
Jonathan Hedley Changelog 2067cb7
Jonathan Hedley Fixed html() method of Attribute dbc25bf
Jonathan Hedley Added support for selectors :containsOwn(text) and :matchesOwn(regex), to supplement Element.ownText().
Jonathan Hedley Updated the link example program to use Jsoup.connect() 4082a25
Jonathan Hedley Validations for Connection 0a5d0b2
Jonathan Hedley Changelog release prep 333ca9f
Jonathan Hedley [maven-release-plugin] prepare release jsoup-1.3.1 617a1e5
Jonathan Hedley [maven-release-plugin] prepare for next development iteration 76237ff
Jonathan Hedley Doc 4668d10
Jonathan Hedley Treat HTTP headers as case insensitive in Jsoup.Connection. Improves compatibility for HTTP responses.
Jonathan Hedley Tweaks 310c7bc
Jonathan Hedley Improved malformed table parsing by implementing ignorable end tags. 26b1aaf
Jonathan Hedley More tests for Jsoup.Connection 2958642
Jonathan Hedley [maven-release-plugin] prepare release jsoup-1.3.2 eb9f5c9
Jonathan Hedley [maven-release-plugin] prepare for next development iteration 7dcf96f
Jonathan Hedley Implement Elements.empty() and Elements.remove(). 867a00d
Jonathan Hedley Javadoc note for Elements.get(int) 7daa24b
Jonathan Hedley Selector documentation tweak 5972b86
Jonathan Hedley Fixed issue in Entities when unescaping &#36; ("$")
Fixes #34
Akira Ueda added EscapeMode.minimum d9f4958
Jonathan Hedley Added restricted XHTML output entity option 6ce593d
Jonathan Hedley Changelog 9964fd5
Jonathan Hedley [maven-release-plugin] prepare release jsoup-1.3.3 88c822e
Jonathan Hedley [maven-release-plugin] prepare for next development iteration 3ca6654
Jonathan Hedley Implemented DataNode.setWholeData() to allow updating of script and style data contents.
Jonathan Hedley Fixed support for jsoup.connect to follow redirects between http & https URLs.

Fixes #37
Jonathan Hedley Fixed issue in jsoup.connect when extracting character set from content-type header; now supports quoted charset declaration.
Jonathan Hedley Relaxed parse rules of H1 - H6 to allow nested content. 866825d
Jonathan Hedley Relaxed parse rule of SPAN to treat as block, to allow nested block c…
Jonathan Hedley Added ability to load and parse HTML from an input stream. c408300
Jonathan Hedley Test fix 2f46ca2
Jonathan Hedley Javadoc example on absUrl f70526c
Jonathan Hedley Document normalisation now more correctly enforces document structure.
 - ensure only one head and one body element, both under html el
 - allow html/head/noscript/img for some site's analytic pattern

Fixes #43
Jonathan Hedley Support node.outerHtml() method when node has no parent.
Fixes #45
Jonathan Hedley Fixed support for HTML entities with numbers in name (e.g. &frac34, &…

Fixes #46
Jonathan Hedley Merge branch 'master' of
Clément Denis Fixes IndexArrayOutOfBoundException on response with empty headers c907932
Jonathan Hedley Implemented Node.clone() to create deep, independent copies of Nodes, Elements, and Documents.

Fixes #47
Jonathan Hedley Testcase to confirm doctypes get cloned e9f7254
Jonathan Hedley Fixed absolute URL generation from relative URLs which are only query strings.

Fixes #49
Jonathan Hedley Output format tweak 2212615
Jonathan Hedley Added :not() selector, to find elements that do not match the selector. E.g. div:not(.logo) finds divs that do not have the "logo" class name.

Fixes #36
Jonathan Hedley Added Elements.not(selector) method, to remove undesired results from selector results.
Jonathan Hedley Changelog update in launch preparation c1cf385
Jonathan Hedley Changelog tweak 7babf92
Jonathan Hedley [maven-release-plugin] prepare release jsoup-1.4.1 b2229d6
Jonathan Hedley [maven-release-plugin] prepare for next development iteration 5824e29
Brandon Beck Updated OutputSettings inside of Document to be a static inner class.
This addresses the issue with using jsoup and scala 2.8 discussed here:

This change should be safe to commit and won't make a backwards-incompatible change
to the public interface of jsoup (you can always reference a static member
via a non-static path) for any existing users. A recompilation of their code
won't even be necessary.

All tests continue to pass for me after this change.
Anton Kazennikov added .clone() for Elements dc8b3fd
Anton Kazennikov Initial add of new generation selectors(faster than original) f4337b9
Added attribute selectors f8e487a
Anton Kazennikov added AttrSelector.AttrNamePrefixSelector 2e93a70
Anton Kazennikov fix bug in element selector: incorrect behavior on multiple classes 6a6fa21
Anton Kazennikov removing as exists matching evaluator classes 45a45e1
Anton Kazennikov changes wrt existing Evaluator class 326646f
Anton Kazennikov Selectors update d3d36bb
Anton Kazennikov removing boxing/unboxing 5ce24c2
Anton Kazennikov evaluators made public 98f8292
Anton Kazennikov Adding evaluators tests 87891d5
Anton Kazennikov added new tests 67e09f8
Renaming in some selectors 57e0209
adding new ctors to And and Or 76a7ade
adding toString() 2d7d62f
added base container 22a1d10
added .toString() to basic Evaluators 8573486
Adding Selector parser aabd34f
Anton Kazennikov removing char boxing 2dd2968
Anton Kazennikov adding empty and .addAll 95b03ed
Anton Kazennikov implemented :has selector 06a6674
Anton Kazennikov Working parser except the root node selector.
Added basic tests
Anton Kazennikov removed unused constructor 80e420e
Anton Kazennikov parser update: normal order of selectors 9c00eab
Anton Kazennikov fix non-void element parsing such as <a href=/link/>link text</a> a59cfee
Anton Kazennikov Evaluator.match(Element test) ->
Evaluator.match(Element root, Element test)
Anton Kazennikov Character -> char change 4c89a93
Anton Kazennikov restored all tests. 906bbf0
Anton Kazennikov added RootSelector
updated tree selectors wrt subtree matching
Anton Kazennikov update evaluator wrt subtree matching 3f91b30
Anton Kazennikov Added RootSelector support 03abecc
Anton Kazennikov small optimizations 692d5f8
Anton Kazennikov added javadocs bb2f08b
Anton Kazennikov Added javadocs for Evaluators.
Updated tests.
Updated parser
Jonathan Hedley Fixed issue when using descendant regex attribute selectors.
Fixes #52
Jonathan Hedley Added a test to confirm combinators don't match in balanced contains …
Jonathan Hedley Merge branch 'master' of into bbeck-ma…
Jonathan Hedley Fixed tokeniser optimisation when scanning for missing data element close tags.

Fixes #67
Michael Simons There are no valid (x)html tags that start with numbers 05285d0
Merge remote branch 'upstream/master'
Jonathan Hedley Reverted changes that only allow empty tags in pre-defined instances.
Markup like <tag /> needs to be parsed as an empty element.
Jonathan Hedley Integrated new single-pass selector evaluators, contributed by knz (Anton Kazennikov).
Tommy Chheng Removed to fix https://githu…

"jsoup/src/main/java/org/jsoup/select/selectors/[8,35] package does not exist"
Jonathan Hedley Force strict entity matching (must be &xxx; and not &xxx) in element …

Fixes #71
Jonathan Hedley Ensure that Jsoup.Connect handles relative redirects in cases where the
underlying HTTP stack doesn't automatically follow them.

Fixes #73
Jonathan Hedley Allow Jsoup.Connect to parse application/xml and application/xhtml+xml responses.

Fixes #72
Jonathan Hedley Defined U (underline) element as an inline tag. 82264c8
Jonathan Hedley Cleanup of selector class files a59a878
Jonathan Hedley Updated Jsoup.Connection so that cookies set on a redirect response will be included on the redirected request and response.
Jonathan Hedley Prevent infinite redirection loops in jsoup.connect. 096e130
Jonathan Hedley Implemented TextNode.splitText e3ddbb8
Jonathan Hedley Moved .wrap, .before, and .after from Element to Node for flexibility. Overriding implementations in Element still return Element.
Jonathan Hedley Don't run URL connectivity tests by default. 800d6e1
Jonathan Hedley Added ability to change an element's tag with Element.tagName(String), and to change many at once with Elements.tagName(String).
Jonathan Hedley Test to confirm that abs URL method works on img src attributes. c659826
Jonathan Hedley Generify empty child list. 8c112c7
Jonathan Hedley Removed redundant empty array 765eafc
Jonathan Hedley Changelog updates 66343dc
Jonathan Hedley Readme update 22e2513
Jonathan Hedley [maven-release-plugin] prepare release jsoup-1.5.1 da25d1d
Jonathan Hedley [maven-release-plugin] prepare for next development iteration 67ee7a1
Jonathan Hedley Fixed issue with selector parser where some boolean AND + OR combined queries (e.g. "meta[http-equiv], meta[content]") were being parsed incorrectly as OR only queries (e.g. former as "meta, [http-equiv], meta[content]")

Fixed issue where a content-type specified in a meta tag may not be reliably detected, due to the above issue.
Jonathan Hedley Allow <a> and <font> elements to be treated as flow/block elements, to match browser parse trees.
Jonathan Hedley Updated copyright date 8b52f48
Add support for A elements with hrefs starting '#' i.e. anchors. bdd810f
Make relative URLs possible 46cd1e3
Mathias Bogaert

This should also support single quotes: eg Content-Type:text/html; charset='utf-8'
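
A minimal sketch of what accepting unquoted, double-quoted, and single-quoted charset values could look like. The regex, class name, and method name are assumptions made for illustration; this is not the code jsoup uses to read the Content-Type header.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of jsoup.
final class CharsetFromContentType {

    // Assumed pattern: "charset=", an optional opening ' or ",
    // then the charset name up to whitespace, ';' or a closing quote.
    private static final Pattern CHARSET =
            Pattern.compile("(?i)\\bcharset\\s*=\\s*[\"']?([^\\s;\"']+)");

    /** Returns the charset name from a Content-Type value, or null if none is declared. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset='utf-8'"));   // utf-8
        System.out.println(charsetOf("text/html; charset=\"UTF-8\"")); // UTF-8
        System.out.println(charsetOf("text/html"));                    // null
    }
}
```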

Site that has this:
