INVALID_CHARACTER_ERR when converting Document to W3C (different page) #1093

dportabella · 2018-07-12T15:04:50Z

org.jsoup.nodes.Document doc = Jsoup.parse(new java.io.File("test.html"), null);
But trying to convert this to a W3C document fails:

new W3CDom().fromJsoup(doc);

Exception in thread "main" org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified. 
	at org.apache.xerces.dom.CoreDocumentImpl.checkQName(Unknown Source)
	at org.apache.xerces.dom.ElementNSImpl.setName(Unknown Source)
	at org.apache.xerces.dom.ElementNSImpl.<init>(Unknown Source)
	at org.apache.xerces.dom.CoreDocumentImpl.createElementNS(Unknown Source)
	at org.jsoup.helper.W3CDom$W3CBuilder.head(W3CDom.java:94)
	at org.jsoup.select.NodeTraversor.traverse(NodeTraversor.java:45)
	at org.jsoup.helper.W3CDom.convert(W3CDom.java:64)
	at org.jsoup.helper.W3CDom.fromJsoup(W3CDom.java:45)

Is there a way to find out where the invalid or illegal XML character is in the source HTML file (test.html)?

Can this be fixed?

test.html can be downloaded here: https://www.dropbox.com/s/w3zayut0kryy12s/test.html?dl=0

Note: I am using the latest version of jsoup.
Note: the example on #721 works well.


jhy · 2020-02-06T20:45:37Z

@dportabella apologies for the late reply. Do you still have that file or an example? Can you attach it here or on a gist, so it doesn't go away?

smatei · 2020-06-04T05:12:34Z

Hello @jhy,
I found a page that triggers this error. I am using jsoup 1.13.1.

html.zip

You can also find it here:
https://search.yahoo.co.jp/search?p=%E7%A5%9E%E7%94%B0+%E9%AB%98%E5%8F%8E%E5%85%A5+%E6%B1%82%E4%BA%BA&fr=top_ga1_sa&ei=UTF-8&mfb=P051&ts=641&aq=-1&oq=&at=&ai=0VH4FjPFQMG5Lv3SI1tFlA

The parser crashes with the error, while the browser does not care and simply displays <blabla!>.

To avoid it, I had to clean up the HTML first: find non-ASCII tags with a regex and replace them with &lt;blabla&gt;:

Pattern pattern = Pattern.compile("<([^\\x00-\\x7F]+)>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
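For anyone who wants to try the same workaround, here is a minimal sketch of how that pattern might be applied before handing the HTML to jsoup. The class and method names (NonAsciiTagEscaper, escapeNonAsciiTags) are made up for illustration, and the flags from the original pattern are dropped because they have no effect on this expression. It only catches tags made up entirely of non-ASCII characters, so it is a stopgap rather than a general fix.

```java
import java.util.regex.Pattern;

public class NonAsciiTagEscaper {
    // Same expression as above: "<", one or more non-ASCII characters, ">".
    private static final Pattern NON_ASCII_TAG = Pattern.compile("<([^\\x00-\\x7F]+)>");

    // Hypothetical helper: rewrites each match as escaped text, so an HTML
    // parser sees character data instead of an element.
    static String escapeNonAsciiTags(String html) {
        return NON_ASCII_TAG.matcher(html).replaceAll("&lt;$1&gt;");
    }

    public static void main(String[] args) {
        // \u795E\u7530 stands in for a non-ASCII "tag name".
        String html = "<p>a <\u795E\u7530> b</p>";
        System.out.println(escapeNonAsciiTags(html));
    }
}
```

The cleaned string can then be passed to Jsoup.parse(...) as usual.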

jhy · 2021-01-06T01:20:34Z

Thanks! I've fixed this. Now, if a tag name is not a legal XML name, we insert it as a text node instead.
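To illustrate the idea (this is not the actual jsoup change, just a sketch using the JDK's own DOM implementation), one way to fall back to a text node is to catch the INVALID_CHARACTER_ERR that createElement raises for an illegal name. The class and method names here are hypothetical, and the choice to render the rejected tag as "<name>" text is an assumption.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class XmlNameFallback {
    // Creates an empty W3C document with the JDK's default DOM implementation.
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: returns an element when tagName is a legal XML name,
    // otherwise a text node holding the tag as literal text (assumed format).
    static Node elementOrText(Document doc, String tagName) {
        try {
            return doc.createElement(tagName);
        } catch (DOMException e) {
            if (e.code != DOMException.INVALID_CHARACTER_ERR) {
                throw e; // some other DOM failure, not our case
            }
            return doc.createTextNode("<" + tagName + ">");
        }
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        System.out.println(elementOrText(doc, "div").getNodeType() == Node.ELEMENT_NODE);
        System.out.println(elementOrText(doc, "blabla!").getNodeType() == Node.TEXT_NODE);
    }
}
```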

jhy added the needs-more-info label Feb 6, 2020

jhy closed this as completed in 165b3c8 Jan 6, 2021

jhy added the bug label and removed the needs-more-info label Jan 6, 2021

jhy added this to the 1.14.1 milestone Jan 6, 2021

jhy added the fixed label Jan 7, 2021
