RDF/XML files with a UTF-8 BOM fail to parse #187

Closed
renarl opened this Issue May 26, 2014 · 16 comments

renarl commented May 26, 2014

If an RDF/XML file starts with a BOM, it fails to parse:

Parser: RDFXMLParser
org.xml.sax.SAXParseException; systemId: file:/test.rdf.xml;
lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

Seems that the UTF-8 BOM problem has been solved for other syntaxes (e.g. Turtle and OWLFunctional). Is there some reason why the fix is not applied to all parsers?

ignazio1977 added the bug label May 26, 2014

Contributor

ignazio1977 commented May 26, 2014

Because in the other parsers this issue depends on the char stream being created by JavaCC-generated code, while the RDF/XML parser is a SAX parser, where the input stream is managed differently. This has never been spotted before.

This problem affects a very small subset of ontologies, mostly because the BOM in a UTF-8 file is useless, and it appears never to have affected an XML-based format before. Can you add the first couple of lines of your test as an example?

Contributor

sesuncedu commented May 26, 2014

http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4508058

The Java UTF-8 decoder does not, and will not, recognize a leading BOM in UTF-8. There are readers that will strip the BOM; you can use one of these as a document source for the problematic file. Note that in XML, a BOM overrides an explicit character set encoding, so you need to check the whole XML declaration to make sure that there's no clash.

Using a BOM in UTF-8 is generally unwise: it's not illegal, but it breaks a lot of parsers. It's also usually a sign that a file has been edited in Windows Notepad.
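
To make the decoder behaviour concrete, here is a minimal sketch using only the JDK. The class and method names (`BomDecodeDemo`, `firstDecodedChar`, `withUtf8Bom`) are hypothetical and not part of OWLAPI. It decodes a byte array that starts with the UTF-8 BOM (EF BB BF) and shows that the first character handed to the consumer is U+FEFF rather than `<`, which is what an XML parser reading from a `Reader` would then see in the prolog.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class BomDecodeDemo {

    /** Decodes the bytes as UTF-8 and returns the first decoded char, or -1 if the input is empty. */
    static int firstDecodedChar(byte[] utf8Bytes) throws IOException {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            return reader.read();
        }
    }

    /** Returns the UTF-8 encoding of the text, prefixed with the UTF-8 BOM bytes EF BB BF. */
    static byte[] withUtf8Bom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[body.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(body, 0, out, 3, body.length);
        return out;
    }

    public static void main(String[] args) throws IOException {
        int first = firstDecodedChar(withUtf8Bom("<?xml version=\"1.0\"?><a/>"));
        // The BOM is decoded as an ordinary character instead of being skipped.
        System.out.printf("first char: U+%04X%n", first);
    }
}
```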

Contributor

sesuncedu commented May 26, 2014

http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/XmlStreamReader.html

is a java.io.Reader that does all the XML-specific BOM and charset fussy stuff, and can wrap an existing input stream. If this is added, it should be as an optional feature (since it may be wanted for import resolution).

Contributor

ignazio1977 commented May 26, 2014

If I get the situation right, it is as follows:

  • a file with an initial BOM is intended to be UTF-8 (from my link, it is the only use of a BOM in UTF-8, since the endianness is unaffected)
  • an XML file that does not declare a different charset is considered to be UTF-8

So,
i) file with BOM and no specified encoding: strip the BOM, use default encoding (which is already UTF-8)
ii) file without BOM, with or without explicit encoding: happily process as existing
iii) file with BOM and different encoding specified: we got problems here.

In case iii), the fact is, we're being told two contradictory things, which, depending on the actual content of the file, might even be irrelevant (e.g., the whole content is in the intersection of the two character sets). I would choose to strip the BOM and take a chance with the other, explicitly declared, character set. The parsing might fail, but the error will be more obvious than referring to a non printable character that people need to know about to even think it might be the issue.

Also, it provides a simple solution for everything: for any input stream, check the first two bytes. If it's a BOM, strip it. Proceed.

Contributor

sesuncedu commented May 26, 2014

It's only 2 bytes if it's UTF-16; the Apache Commons XmlStreamReader does everything (including setting the right decoder), so it's less work to just let it do the work. (There's also a BOMInputStream that just does BOM detection/defusing.)
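
For reference, the "check the first bytes, strip if it's a BOM" idea can be sketched with the JDK alone. This is only an illustration under stated assumptions: it handles the UTF-8 BOM only (three bytes, EF BB BF), the class name `Utf8BomStripper` is hypothetical, and it is neither the Apache Commons implementation nor the OWLAPI patch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, for illustration only.
final class Utf8BomStripper {

    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    /** Returns a stream positioned after a leading UTF-8 BOM, or at the start if there is none. */
    static InputStream strip(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int read = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or end of stream.
        while (read < head.length) {
            int n = pushback.read(head, read, head.length - read);
            if (n < 0) {
                break;
            }
            read += n;
        }
        boolean isBom = read == UTF8_BOM.length;
        for (int i = 0; isBom && i < UTF8_BOM.length; i++) {
            isBom = (head[i] & 0xFF) == UTF8_BOM[i];
        }
        if (!isBom && read > 0) {
            // Not a BOM: push the bytes back so the caller sees the stream unchanged.
            pushback.unread(head, 0, read);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        InputStream clean = strip(new ByteArrayInputStream(data));
        System.out.println((char) clean.read());
    }
}
```

The returned stream can then be handed to whatever consumes the document. Whether that is enough for XML input depends on the declared encoding, which is the case the XML-specific reader covers.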

Contributor

ignazio1977 commented May 26, 2014

BOMInputStream sounds like the right solution

Contributor

ignazio1977 commented May 26, 2014

It can be used as a workaround while a solution makes its way to the release: create a StreamDocumentSource wrapping a BOMInputStream, like in Example 1 here:
http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/BOMInputStream.html

Contributor

sesuncedu commented May 26, 2014

BOMInputStream could be the better solution, iff it would eliminate the need for special-case tokenizers for Manchester syntax (Manx) and functional syntax (FSS); otherwise the XML-specific reader is better for the XML parsers, as it follows the spec precisely, and does all the charset decoder selection and reader building transparently (less work).

Contributor

ignazio1977 commented May 26, 2014

BOMSafeJavaCharStream can use BOMInputStream; RDF/XML and OWL/XML should use the XML specific reader. This leaves Manchester OWL Syntax alone in the cold - so three entry points to be changed

renarl closed this May 26, 2014

renarl reopened this May 26, 2014

renarl commented May 26, 2014

Sorry for the close/reopen, I accidentally clicked the wrong button...

I'm aware that the BOM in a UTF-8 file is useless. But we are working on an online ontology visualization service where anyone can upload an ontology for visualization. People occasionally upload an ontology that has the BOM/UTF-8 issue.

So far OWLAPI has been extremely valuable because it could handle all of the numerous OWL syntaxes and file formats.

Do I understand it correctly, that you plan to fix the BOM/UTF-8 issue in an upcoming release?

Contributor

ignazio1977 commented May 26, 2014

Yes, that's correct. It will be fixed for 3.5.1.

renarl commented May 26, 2014

OK, thanks a lot!

Member

ansell commented May 27, 2014

The Apache Commons BOMInputStream is what Sesame uses to automagically remove any BOM markers it finds, so the owlapi-rio parsers, including the Sesame RDF/XML parser, in version 4 will not have this bug.

ignazio1977 added a commit that referenced this issue May 31, 2014

RDF/XML files with a UTF-8 BOM fail to parse #187
This patch adds an ad hoc wrapper for streams that strips off UTF8,
UTF16 and UTF32 BOMS. Bit of a kludge to avoid adding a dependency on
Commons classes that do the same job properly. To be changed in version
4 to use Apache Commons.

Contributor

ignazio1977 commented May 31, 2014

This patch should fix the issue, although I don't think it's the most efficient way of doing it.
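
As a rough illustration of what detecting UTF-8, UTF-16 and UTF-32 BOMs involves (this is not the code from the commit, and the class name `BomTable` is hypothetical), the sketch below returns the length of the BOM at the start of a byte buffer. The one subtlety is ordering: the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE), so the longer signatures are tested first. It assumes that FF FE 00 00 means UTF-32 LE, although the same bytes could in principle be a UTF-16 LE BOM followed by U+0000.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class BomTable {

    // Ordered longest first so that UTF-32 LE is not mistaken for UTF-16 LE.
    private static final byte[][] BOMS = {
        {0x00, 0x00, (byte) 0xFE, (byte) 0xFF},   // UTF-32 BE
        {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00},   // UTF-32 LE
        {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF},  // UTF-8
        {(byte) 0xFE, (byte) 0xFF},               // UTF-16 BE
        {(byte) 0xFF, (byte) 0xFE},               // UTF-16 LE
    };

    /** Returns the number of leading bytes that form a known BOM, or 0 if there is none. */
    static int bomLength(byte[] head) {
        for (byte[] bom : BOMS) {
            if (head.length >= bom.length
                    && Arrays.equals(Arrays.copyOf(head, bom.length), bom)) {
                return bom.length;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<'};
        System.out.println("bytes to skip: " + bomLength(utf8));
    }
}
```

A stream wrapper would read up to four bytes, call something like `bomLength`, and push back whatever follows the BOM.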

Contributor

ignazio1977 commented Jun 1, 2014

@renarl this should now be fixed on both main branches. If you have a chance to check out the source and see if it fixes your issue, that would be appreciated.

ignazio1977 closed this Jun 1, 2014

renarl commented Jun 2, 2014

I tried it and it fixes the issue. Thank you very much!
