org.jsoup.nodes.Document doc = Jsoup.parse(new java.io.File("test.html"), null);
But trying to convert this to a W3C document fails:
new W3CDom().fromJsoup(doc);
Exception in thread "main" org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified.
at org.apache.xerces.dom.CoreDocumentImpl.checkQName(Unknown Source)
at org.apache.xerces.dom.ElementNSImpl.setName(Unknown Source)
at org.apache.xerces.dom.ElementNSImpl.<init>(Unknown Source)
at org.apache.xerces.dom.CoreDocumentImpl.createElementNS(Unknown Source)
at org.jsoup.helper.W3CDom$W3CBuilder.head(W3CDom.java:94)
at org.jsoup.select.NodeTraversor.traverse(NodeTraversor.java:45)
at org.jsoup.helper.W3CDom.convert(W3CDom.java:64)
at org.jsoup.helper.W3CDom.fromJsoup(W3CDom.java:45)
Is there a way to find out where the invalid or illegal XML character is in the source HTML file (test.html)?
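The stack trace shows the failure coming from `createElementNS`, which suggests that an element name in the parsed document is not a legal XML name. One possible way to locate it, as a minimal sketch only: probe each candidate name against the JDK's own DOM implementation and collect the ones it rejects. The class `InvalidXmlNameFinder` and the method `findInvalidNames` are hypothetical names made up for this illustration, and the XHTML namespace used for probing is an assumption, not necessarily what jsoup passes.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

// Hypothetical helper, not part of jsoup.
final class InvalidXmlNameFinder {

    // Assumption: probe names in the XHTML namespace.
    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Returns the names that the DOM implementation refuses to accept as element names.
    static List<String> findInvalidNames(List<String> names) {
        Document probe;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            probe = factory.newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create a DOM document for probing", e);
        }

        List<String> invalid = new ArrayList<>();
        for (String name : names) {
            try {
                // Same call that fails in the stack trace above.
                probe.createElementNS(XHTML_NS, name);
            } catch (DOMException e) {
                // Raised for names the DOM rejects, e.g. INVALID_CHARACTER_ERR.
                invalid.add(name);
            }
        }
        return invalid;
    }
}
```

With jsoup, the input list could be built by iterating over `doc.getAllElements()` and collecting `Element.tagName()` for each element. Searching test.html for any name this returns should then point to the offending markup.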
jhy added the bug label and removed the needs-more-info label on Jan 6, 2021.
Can this be fixed?
test.html can be downloaded here: https://www.dropbox.com/s/w3zayut0kryy12s/test.html?dl=0
Note: I am using the latest version of jsoup.
Note: the example on #721 works well.