INVALID_CHARACTER_ERR when converting Document to W3C (different page) #1093

Closed
dportabella opened this issue Jul 12, 2018 · 3 comments

@dportabella

org.jsoup.nodes.Document doc = Jsoup.parse(new java.io.File("test.html"), null);
Parsing the file this way works, but converting the resulting document to a W3C document fails:

new W3CDom().fromJsoup(doc);

Exception in thread "main" org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified. 
	at org.apache.xerces.dom.CoreDocumentImpl.checkQName(Unknown Source)
	at org.apache.xerces.dom.ElementNSImpl.setName(Unknown Source)
	at org.apache.xerces.dom.ElementNSImpl.<init>(Unknown Source)
	at org.apache.xerces.dom.CoreDocumentImpl.createElementNS(Unknown Source)
	at org.jsoup.helper.W3CDom$W3CBuilder.head(W3CDom.java:94)
	at org.jsoup.select.NodeTraversor.traverse(NodeTraversor.java:45)
	at org.jsoup.helper.W3CDom.convert(W3CDom.java:64)
	at org.jsoup.helper.W3CDom.fromJsoup(W3CDom.java:45)

Is there a way to find out where the invalid or illegal XML character is in the source HTML file (test.html)?

Can this be fixed?

test.html can be downloaded here: https://www.dropbox.com/s/w3zayut0kryy12s/test.html?dl=0

Note: I am using the latest version of jsoup.
Note: the example on #721 works well.
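
On the question of locating the offending markup: the stack trace shows the exception coming from element creation, so one rough way to narrow it down is to scan the raw HTML for tag-like tokens and try each name against a W3C document. The sketch below uses only the JDK. The class name, method name, and the regex used to approximate a tag opening are assumptions made for illustration; the regex is not a real HTML tokenizer and can misfire on script contents or unusual markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

// Hypothetical helper, not part of jsoup.
class IllegalTagNameFinder {
    // Assumption: a tag opening is '<', a letter, then anything up to whitespace, '/' or '>'.
    private static final Pattern TAG_OPEN = Pattern.compile("<(\\p{L}[^\\s/>]*)");

    /** Returns "line N: name" for every tag-like token whose name the W3C DOM rejects. */
    static List<String> findIllegalTagNames(String html) {
        Document probe;
        try {
            probe = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        List<String> hits = new ArrayList<>();
        String[] lines = html.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            Matcher m = TAG_OPEN.matcher(lines[i]);
            while (m.find()) {
                try {
                    // Throws DOMException (INVALID_CHARACTER_ERR) for names that are not legal XML names.
                    probe.createElement(m.group(1));
                } catch (DOMException e) {
                    hits.add("line " + (i + 1) + ": " + m.group(1));
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String html = "<html>\n<body>\n<p>ok</p><blabla!>\n</body></html>";
        findIllegalTagNames(html).forEach(System.out::println); // prints "line 3: blabla!"
    }
}
```

Note that the stack trace goes through createElementNS, which checks qualified names and may reject a few names that createElement accepts, so treat the output as a starting point rather than an exact reproduction.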

@jhy
Owner

jhy commented Feb 6, 2020

@dportabella apologies for the late reply. Do you still have that file or an example? Can you attach it here or on a gist, so it doesn't go away?

jhy added the needs-more-info label on Feb 6, 2020
@smatei

smatei commented Jun 4, 2020

Hello @jhy,
I found a page that throws this error. I am using 1.13.1

html.zip

You can find it also here:
https://search.yahoo.co.jp/search?p=%E7%A5%9E%E7%94%B0+%E9%AB%98%E5%8F%8E%E5%85%A5+%E6%B1%82%E4%BA%BA&fr=top_ga1_sa&ei=UTF-8&mfb=P051&ts=641&aq=-1&oq=&at=&ai=0VH4FjPFQMG5Lv3SI1tFlA

The parser crashes with this error, whereas the browser does not care and simply displays <blabla!>

To avoid it, I had to clean up the HTML before parsing: find non-ASCII tags with a regex and replace them with the escaped form &lt;blabla&gt;

Pattern pattern = Pattern.compile("<([^\\x00-\\x7F]+)>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
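
For completeness, here is a minimal sketch of how that pattern might be applied as a pre-processing step before the HTML is handed to the parser. The class and method names are hypothetical, and the sample input is made up. The pattern itself is the one quoted above; it only matches angle-bracketed runs made up entirely of non-ASCII characters, so a tag that mixes ASCII and non-ASCII characters is left untouched.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating the workaround; not part of jsoup.
class NonAsciiTagEscaper {
    // Pattern copied from the comment above.
    private static final Pattern NON_ASCII_TAG =
            Pattern.compile("<([^\\x00-\\x7F]+)>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Rewrites each matching pseudo-tag as escaped text so a parser sees text, not an element. */
    static String escapeNonAsciiTags(String html) {
        // $1 is the captured content between the angle brackets.
        return NON_ASCII_TAG.matcher(html).replaceAll("&lt;$1&gt;");
    }

    public static void main(String[] args) {
        // Made-up input; the \\u escapes keep this source file ASCII-only.
        String html = "<p>before <\u6C42\u4EBA\uFF01> after</p>";
        System.out.println(escapeNonAsciiTags(html));
    }
}
```

The cleaned string can then be passed to Jsoup.parse(String) in place of the original markup.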

jhy closed this as completed in 165b3c8 on Jan 6, 2021
jhy added the bug label and removed the needs-more-info label on Jan 6, 2021
jhy added this to the 1.14.1 milestone on Jan 6, 2021
@jhy
Owner

jhy commented Jan 6, 2021

Thanks! I've fixed this. Now, if a tag has a name that is illegal in XML, we insert it as a text node instead.
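
The general idea of that fallback can be sketched with the JDK's W3C DOM alone. This is not the code from the jsoup commit; the class and method names are hypothetical, and rendering the rejected tag as the text "<name>" is an assumption made for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical illustration of "fall back to a text node"; not jsoup's implementation.
class IllegalNameFallback {
    /** Creates an element named tagName, or a text node if the name is not a legal XML name. */
    static Node elementOrText(Document doc, String tagName) {
        try {
            return doc.createElement(tagName);
        } catch (DOMException e) {
            // INVALID_CHARACTER_ERR: keep the content visible as plain text instead of failing.
            return doc.createTextNode("<" + tagName + ">");
        }
    }

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        System.out.println(elementOrText(doc, "div").getNodeName());     // prints "div"
        System.out.println(elementOrText(doc, "blabla!").getNodeName()); // prints "#text"
    }
}
```

With this approach the conversion no longer aborts on a single bad tag name, at the cost of that tag appearing as literal text in the W3C document.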

jhy added the fixed label on Jan 7, 2021