Since 1.17.1 CDATA being added around other CDATA in XHTML documents #2078

rgevaerd · 2023-12-05T21:22:51Z

This test below using XHTML passes on 1.16.2 and fails on 1.17.1 because it adds another CDATA, generating something like: ...<script><![CDATA[<![CDATA[...]]>]]></script> and such code causes a javascript error on a browser due to the duplicated CDATA.

Document doc = Jsoup
.parse("<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\"><html><head></head><body><script><![CDATA[console.log('ok');]]></script></body></html>");
doc.outputSettings().prettyPrint(false).syntax(Syntax.xml);
assertEquals(
"<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\"><html><head></head><body><script><![CDATA[console.log('ok');]]></script></body></html>",
doc.outerHtml());

I suspect this problem was introduced by #1720

Please note that adding CDATA only if the text node doesn't start with <![CDATA[ will not be enough to fix this, because the <![CDATA[ expression can be in the middle of the text node. A text node can even have multiple CDATA nodes. But a CDATA node cannot have another CDATA node in it.

The text was updated successfully, but these errors were encountered:

jhy · 2023-12-06T02:18:46Z

A text node can even have multiple CDATA nodes. But a CDATA node cannot have another CDATA node in it.

I feel this is the crux of the issue. Logically there are 3 seperate node types in play - a TextNode, a DataNode, and a CDataNode. Each are leaf nodes -- logically a TextNode (or DataNode) could not contain a CDataNode. Either the parser parses each individually (so perhaps if this were in a <p> element, it might be TextNode, CData, TextNode, Cdata).

In a HTML parse, the contents of a <script> are tokenized in the Script Data State. There is no concept of a CData node. Only DataNodes or potentially HTML comments.

And if I run your original HTML in Chrome, I actually get a Javascript syntax error:

Uncaught SyntaxError: Unexpected token '<' (at script-data-input.html:1:155)

As it's also not tokenising the <![CDATA[]]> as anything special - just a data literal, which then blows up in the Javascript parser.

Back in the olden days I recall there was special handling in the script flow that would parse the CData section specifically (as there is for other markup types - CData section).

Have you got an example URL with script data like this (with Cdata) that a browser executes? I would like to inspect to see if something else is supposed to kick it into XML parse mode.

Otherwise, I'm not sure how to appropriately handle the situation.

rgevaerd · 2023-12-06T03:12:13Z

Hi @jhy,
The Java test code I made I focused on reproducing the scenario in the jsoup and showing the CDATA being added around the other one, and actually used a sample HTML in that test code that results in script error if rendered in a browser. Sorry about that.

Here is an example of a full XHTML using CDATA properly working on browsers:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html><head></head><body>
<script>//<![CDATA[[
	console.log('script sucessfully executed even with a special character ampersand &');
//]]></script>
</body></html>

The rationale about this style of CDATA usage is that when using XHTML, for the document to be both compatible with HTML and XML parsers it is good practice to use a commented CDATA on the <script>, as it is common for scripts to have special characters that would require escaping in XML but would not be escaped in HTML. This way those special characters does not need to be escaped and become compatible in both XML (that will take CDATA into account) and HTML (that does not take CDATA into account but will just ignore it as it is commented). As a reference of this I present you the Appendix A of XHTML media type document that describes compatibility guidelines ("This appendix summarizes design guidelines for authors who wish their XHTML documents to render on both XHTML-aware and modern HTML user agents"). The item A.4 of this appendix describes this style of CDATA usage to obtain such compatibility with embedded script and style tags. Here is a link for this item: https://www.w3.org/TR/2009/NOTE-xhtml-media-types-20090116/#C_4

With the changes made in 1.17.1, the use of jsoup to process XHTML (using jsoup to make some modifications on the XHTML for example) and then using the resulting source-code read from jsoup now generates a XHTML that fails as both XML and HTML due to that added CDATA in the script element. Until jsoup 1.16.2 I could do this without any problem.

jhy · 2023-12-06T03:58:04Z

Thanks, got it. So in context the browser is not parsing the CData as anything special and just a data literal, but the Javascript parser drops it as the comment.

Not sure the best approach here.

How are you presenting the content after your trip through jsoup? As HTML or XML?

I'm thinking that if you are presenting it as HTML you are not going to want a CData directly after the <script> tag, regardless of if the content within has a CData. You want it preserved in the form of the JS comment.

Maybe the concept of #1720 needs to change to do a deeper introspection and add / preserve a comment node. But then it's moving away from the DataNode just changing its escape manner. And it won't work directly for e.g. <style> tags.

Maybe there's another output mode required which is XHTML. But I don't really think that's a good path and think in most cases folks should move away from that. But perhaps that's the implicit goal here.

Maybe to help me understand better and weave this path - why are you setting the output syntax to XML? It's not about parsing with an XML parser, right?

rgevaerd · 2023-12-06T04:20:48Z

How are you presenting the content after your trip through jsoup? As HTML or XML?

In the end the content is presented on the browser as HTML, but as a XHTML due to the doctype. As its doctype specifies it as XHTML, it is required to follow XHTML specifications so it also passes some code validations, just being able to be rendered by modern browsers while not following the specifications would not be enough for some clients that are stricter in their validations.

Maybe to help me understand better and weave this path - why are you setting the output syntax to XML? It's not about parsing with an XML parser, right?

I set the output syntax to XML so the generated code is following the XHTML specification. XHTML requires the code to be XML compatible. Although I am not parsing it with a XML parser myself, the specification requires it to be compatible with it.
For example, while in HTML you may have unclosed tags like <meta ...> and <br>, in XHTML they must be properly closed, for example as <br/> and <meta ... />. Using output syntax HTML, he was for example generating meta and br tags unclosed.
Modern browsers will likely accept both as they are quite lenient, but in my case I am required to modify something in a a valid XHTML code expecting to generate a valid XHTML code as a result. If the result is validated by a tool such as https://validator.w3.org it is expected to pass

I also have scenario of both type of inputs, XHTML and HTML5. So I basically use jsoup to parse the input no matter which of both types it is, and then after processing, to generate the output, if the input was XHTML I use Syntax.xml, and if the input was HTML5 I use Syntax.html.

jhy · 2023-12-06T04:34:40Z

OK so that kind of connects with my thought that a OutputSyntax type of XHTML would make sense in this context. It's neither HTML nor XML. Perhaps a different question would be why have a goal of producing XHTML? As it's been dead since 2005 ~ 2007 at least. (Per your edit, makes sense; not producing new XHTML but trying to be light touch on existing older styles, I assume.)

Without introducing a new mode the best I can think of is to have the DataNode see if it's in a <script>, and if the contents don't start with something like //<!CDATA[, add that. Effectively introducing the polyglot escape if it's not already there.

@maxfortun could you take a look at this thread? Wondering what context you're using the output and why it doesn't need the //<!CDATA[ polyglot. Perhaps you're serving the content as XML and browsers are invoking their XML parser? I still feel I'm missing something.

rgevaerd · 2023-12-06T04:52:08Z

not producing new XHTML but trying to be light touch on existing older style

Yes, new code now uses HTML5 not XHTML. But to keep compatibility with some older code base that uses XHTML we still support it.

Without introducing a new mode the best I can think of is to have the DataNode see if it's in a <script>, and if the contents don't start with something like //<!CDATA[, add that. Effectively introducing the polygot escape if it's not already there.

Would it be possible for a DataNode that was not modified to just keep its text as is? Like if the decision to introduce a CDATA in it would only be needed if it is modified.

maxfortun · 2023-12-08T16:04:16Z

CDATA in CDATA... What an interesting oversight on my part. If I recall correctly at the time I used this library to convert Google web stories in AMP format, which is xhtml into an xml document for further XSL transformations, and html entities were not being converted properly into their xml equivalents. Instead of escaping the individual characters I just wrapped the whole section in CDATA, and now I see that character conversion is probably a more appropriate way to go...

maxfortun · 2023-12-08T16:10:10Z

If this is any helpful, this is my use-case:

Document document = Jsoup.parse(ampXhtml, "", Parser.xmlParser());
document.outputSettings().syntax(Document.OutputSettings.Syntax.xml).escapeMode(Entities.EscapeMode.xhtml);
String ampXml = document.toString();

rgevaerd · 2023-12-08T17:22:15Z

xhtml into an xml document for further XSL transformations

I suspect that @maxfortun 's use case, as it uses the output as XML, it would work if the input does not already contain CDATA. But if it did, it would break as it would add another CDATA around the existent one.
And if the output was used as HTML rendered in a modern browser, it would break even if it did not already contain CDATA as it would be a syntax error on javascript the uncommented CDATA.

jhy · 2023-12-17T01:08:07Z

Thanks @rgevaerd and @maxfortun for the discussion - I've fixed this now. It would be great if you could review and test (with a snapshot version) and let me know if any issues.

rgevaerd · 2023-12-21T00:34:37Z

Thanks @rgevaerd and @maxfortun for the discussion - I've fixed this now. It would be great if you could review and test (with a snapshot version) and let me know if any issues.

@jhy reviewing the code seems good to me. I could not test it, as I tried to get the snapshot from https://jsoup.org/packages/jsoup-1.17.2-SNAPSHOT.jar but it seems to be an old snapshot. Don't know where to get it updated.

jhy · 2023-12-22T03:53:38Z

Thanks -- there is no currently published snapshot but you can build one manually quite simply. See "Building from Source" on the download page - the mvn install will set up the snapshot.

rgevaerd · 2023-12-23T02:06:06Z

@jhy I tested from master and it worked fine. 👍

This MR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [flow-bin](https://github.com/flowtype/flow-bin) ([changelog](https://github.com/facebook/flow/blob/master/Changelog.md)) | devDependencies | minor | [`^0.224.0` -> `^0.225.0`](https://renovatebot.com/diffs/npm/flow-bin/0.224.0/0.225.1) | | [org.liquibase.ext:liquibase-hibernate5](https://github.com/liquibase/liquibase-hibernate/wiki) ([source](https://github.com/liquibase/liquibase-hibernate)) | build | patch | `4.25.0` -> `4.25.1` | | [org.liquibase:liquibase-maven-plugin](http://www.liquibase.org/liquibase-maven-plugin) ([source](https://github.com/liquibase/liquibase)) | build | patch | `4.25.0` -> `4.25.1` | | [org.jsoup:jsoup](https://jsoup.org/) ([source](https://github.com/jhy/jsoup)) | compile | patch | `1.17.1` -> `1.17.2` | | [io.hypersistence:hypersistence-utils-hibernate-62](https://github.com/vladmihalcea/hypersistence-utils) | compile | minor | `3.6.1` -> `3.7.0` | | [io.quarkus:quarkus-maven-plugin](https://github.com/quarkusio/quarkus) | build | patch | `3.6.3` -> `3.6.4` | | [io.quarkus:quarkus-universe-bom](https://github.com/quarkusio/quarkus-platform) | import | patch | `3.6.3` -> `3.6.4` | | [org.apache.maven.plugins:maven-compiler-plugin](https://maven.apache.org/plugins/) | build | minor | `3.11.0` -> `3.12.1` | --- ### Release Notes <details> <summary>flowtype/flow-bin</summary> ### [`v0.225.1`](flow/flow-bin@62a43fb...23ec616) [Compare Source](flow/flow-bin@62a43fb...23ec616) ### [`v0.225.0`](flow/flow-bin@e6104a1...62a43fb) [Compare Source](flow/flow-bin@e6104a1...62a43fb) </details> <details> <summary>liquibase/liquibase</summary> ### [`v4.25.1`](https://github.com/liquibase/liquibase/blob/HEAD/changelog.txt#Liquibase-4251-is-a-patch-release) [Compare Source](liquibase/liquibase@v4.25.0...v4.25.1) </details> <details> <summary>jhy/jsoup</summary> ### [`v1.17.2`](https://github.com/jhy/jsoup/blob/HEAD/CHANGES.md#1172-2023-Dec-29) ##### Improvements - **Attribute object accessors**: Added `Element.attribute(String)` and `Attributes.attribute(String)` to more simply obtain an `Attribute` object. [2069](jhy/jsoup#2069) - **Attribute source tracking**: If source tracking is on, and an Attribute's key is changed ( via `Attribute.setKey(String)`), the source range is now still tracked in `Attribute.sourceRange()`. [2070](jhy/jsoup#2070) - **Wildcard attribute selector**: Added support for the `[*]` element with any attribute selector. And also restored support for selecting by an empty attribute name prefix (`[^]`). [2079](jhy/jsoup#2079) ##### Bug Fixes - **Mixed-cased source position**: When tracking the source position of attributes, if the source attribute name was mix-cased but the parser was lower-case normalizing attribute names, the source position for that attribute was not tracked correctly. [2067](jhy/jsoup#2067) - **Source position NPE**: When tracking the source position of a body fragment parse, a null pointer exception was thrown. [2068](jhy/jsoup#2068) - **Multi-point emoji entity**: A multi-point encoded emoji entity may be incorrectly decoded to the replacement character. [2074](jhy/jsoup#2074) - **Selector sub-expressions**: (Regression) in a selector like `parent [attr=va], other`, the `, OR` was binding to `[attr=va]` instead of `parent [attr=va]`, causing incorrect selections. The fix includes a EvaluatorDebug class that generates a sexpr to represent the query, allowing simpler and more thorough query parse tests. [2073](jhy/jsoup#2073) - **XML CData output**: When generating XML-syntax output from parsed HTML, script nodes containing (pseudo) CData sections would have an extraneous CData section added, causing script execution errors. Now, the data content is emitted in a HTML/XML/XHTML polyglot format, if the data is not already within a CData section. [2078](jhy/jsoup#2078) - **Thread safety**: The `:has` evaluator held a non-thread-safe Iterator, and so if an Evaluator object was shared across multiple concurrent threads, a NoSuchElement exception may be thrown, and the selected results may be incorrect. Now, the iterator object is a thread-local. [2088](jhy/jsoup#2088) *** Older changes for versions 0.1.1 (2010-Jan-31) through 1.17.1 (2023-Nov-27) may be found in [change-archive.txt](./change-archive.txt). </details> <details> <summary>vladmihalcea/hypersistence-utils</summary> ### [`v3.7.0`](https://github.com/vladmihalcea/hypersistence-utils/blob/HEAD/changelog.txt#Version-370---December-18-2023) \================================================================================ Oracle Interval Type does not support negative intervals [#682](vladmihalcea/hypersistence-utils#682) Return original object if target and original are the same when merging [#677](vladmihalcea/hypersistence-utils#677) Add a hypersistence-utils-hibernate-63 module for Hibernate 6.3 [#657](vladmihalcea/hypersistence-utils#657) </details> <details> <summary>quarkusio/quarkus</summary> ### [`v3.6.4`](https://github.com/quarkusio/quarkus/releases/tag/3.6.4) [Compare Source](quarkusio/quarkus@3.6.3...3.6.4) ##### Complete changelog - [#37808](quarkusio/quarkus#37808) - CLI - Rework how missing commands are detected - [#37803](quarkusio/quarkus#37803) - Dev mode: add null checks to TimestampSet.isRestartNeeded() - [#37798](quarkusio/quarkus#37798) - Only update ~/.docker/config.json if it exists - [#37787](quarkusio/quarkus#37787) - Take priority into account in ConfigurationImpl - [#37775](quarkusio/quarkus#37775) - Docs: fix typo in rabbitmq reference documentation - [#37770](quarkusio/quarkus#37770) - Add SequencedCollection to BANNED_INTERFACE_TYPES - [#37768](quarkusio/quarkus#37768) - Running application build with JDK21 and target Java 17 crash with NoClassDefFoundError: java/util/SequencedCollection - [#37731](quarkusio/quarkus#37731) - Query logging is being done in io.quarkus.mongodb.panache.common.runtime.MongoOperations - [#37723](quarkusio/quarkus#37723) - Do not use CSRF cookie as the next token value - [#37717](quarkusio/quarkus#37717) - Docs: Fix incorrect link reference in Cross-Site Request Forgery Prevention guide - [#37714](quarkusio/quarkus#37714) - Remove the driver property in the documentation for Cloud SQL - [#37710](quarkusio/quarkus#37710) - Use NoStackTraceException in metrics - [#37677](quarkusio/quarkus#37677) - Bump io.quarkus:quarkus-platform-bom-maven-plugin from 0.0.100 to 0.0.101 - [#37654](quarkusio/quarkus#37654) - Make sure dev mode is properly written in doc - [#36848](quarkusio/quarkus#36848) - CSRF Token with HMAC signature gets double signed </details> <details> <summary>quarkusio/quarkus-platform</summary> ### [`v3.6.4`](quarkusio/quarkus-platform@3.6.3...3.6.4) [Compare Source](quarkusio/quarkus-platform@3.6.3...3.6.4) </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Enabled. ♻ **Rebasing**: Whenever MR is behind base branch, or you tick the rebase/retry checkbox. 👻 **Immortal**: This MR will be recreated if closed unmerged. Get [config help](https://github.com/renovatebot/renovate/discussions) if that's undesired. --- - [ ] If you want to rebase/retry this MR, check this box --- This MR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

jhy closed this as completed in 8a4bdae Dec 17, 2023

jhy self-assigned this Dec 17, 2023

jhy added bug Confirmed bug that we should fix fixed labels Dec 17, 2023

jhy added this to the 1.17.2 milestone Dec 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Since 1.17.1 CDATA being added around other CDATA in XHTML documents #2078

Since 1.17.1 CDATA being added around other CDATA in XHTML documents #2078

rgevaerd commented Dec 5, 2023

jhy commented Dec 6, 2023

rgevaerd commented Dec 6, 2023 •

edited

jhy commented Dec 6, 2023

rgevaerd commented Dec 6, 2023 •

edited

jhy commented Dec 6, 2023 •

edited

rgevaerd commented Dec 6, 2023 •

edited

maxfortun commented Dec 8, 2023 •

edited

maxfortun commented Dec 8, 2023

rgevaerd commented Dec 8, 2023 •

edited

jhy commented Dec 17, 2023

rgevaerd commented Dec 21, 2023

jhy commented Dec 22, 2023

rgevaerd commented Dec 23, 2023

Since 1.17.1 CDATA being added around other CDATA in XHTML documents #2078

Since 1.17.1 CDATA being added around other CDATA in XHTML documents #2078

Comments

rgevaerd commented Dec 5, 2023

jhy commented Dec 6, 2023

rgevaerd commented Dec 6, 2023 • edited

jhy commented Dec 6, 2023

rgevaerd commented Dec 6, 2023 • edited

jhy commented Dec 6, 2023 • edited

rgevaerd commented Dec 6, 2023 • edited

maxfortun commented Dec 8, 2023 • edited

maxfortun commented Dec 8, 2023

rgevaerd commented Dec 8, 2023 • edited

jhy commented Dec 17, 2023

rgevaerd commented Dec 21, 2023

jhy commented Dec 22, 2023

rgevaerd commented Dec 23, 2023

rgevaerd commented Dec 6, 2023 •

edited

rgevaerd commented Dec 6, 2023 •

edited

jhy commented Dec 6, 2023 •

edited

rgevaerd commented Dec 6, 2023 •

edited

maxfortun commented Dec 8, 2023 •

edited

rgevaerd commented Dec 8, 2023 •

edited