The attached PDF generates an XML file that cannot be parsed by GROBID's SAX parser:
ERROR [2019-02-25 20:10:13,990] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs.
! org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.
! at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
! at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
! at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
! at org.apache.xerces.impl.XMLEntityScanner.scanLiteral(Unknown Source)
! at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanAttribute(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
! ... 80 common frames omitted
! Causing: org.xml.sax.SAXParseException: Invalid byte 2 of 3-byte UTF-8 sequence.
! at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
! at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
! at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
! at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
! at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
! at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
! at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
! at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
! at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
! at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
! at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
! at org.grobid.core.document.Document.addTokenizedDocument(Document.java:381)
! ... 70 common frames omitted
! Causing: org.grobid.core.exceptions.GrobidException: [PARSING_ERROR] Cannot parse file: /home/lopez/grobid/grobid-home/tmp/xsW7YuKt23.lxml
! at org.grobid.core.document.Document.addTokenizedDocument(Document.java:393)
! at org.grobid.core.engines.Segmentation.processing(Segmentation.java:94)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:130)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:109)
! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:474)
! at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:465)
! at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:179)
...
This should be fixed as of commit d06fa76: invalid UTF-8 sequences are now ignored. The corresponding bitmaps should still be OCRed once that feature is integrated.
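For readers hitting the same exception, here is a minimal sketch of one way to make a SAX parse tolerate invalid UTF-8 bytes. It is not the code from commit d06fa76. It assumes the approach of decoding the byte stream with a `CharsetDecoder` set to `CodingErrorAction.IGNORE` and handing the resulting `Reader` to the parser. The class name `LenientUtf8Sax`, its helper methods, and the sample `TOKEN`/`text` markup are hypothetical and only serve the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of GROBID.
final class LenientUtf8Sax {

    /** Wraps a byte stream in a Reader that drops malformed UTF-8 input instead of failing. */
    static Reader lenientUtf8Reader(InputStream in) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        return new InputStreamReader(in, decoder);
    }

    /** Parses the XML and returns the first value found for the given attribute name ("" if none). */
    static String firstAttribute(InputStream in, String attrName) throws Exception {
        StringBuilder found = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                String value = atts.getValue(attrName);
                if (value != null && found.length() == 0) {
                    found.append(value);
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // A character-stream InputSource means the parser never decodes the raw bytes itself.
        parser.parse(new InputSource(lenientUtf8Reader(in)), handler);
        return found.toString();
    }

    /** Builds a small XML document whose attribute value contains a broken 3-byte sequence. */
    static byte[] sampleWithInvalidByte() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] head = "<TOKEN text=\"ab".getBytes(StandardCharsets.UTF_8);
        byte[] tail = "cd\"/>".getBytes(StandardCharsets.UTF_8);
        out.write(head, 0, head.length);
        out.write(0xE2); // lead byte of a 3-byte sequence
        out.write(0x28); // '(' is not a valid continuation byte
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String value = firstAttribute(new ByteArrayInputStream(sampleWithInvalidByte()), "text");
        System.out.println("Parsed attribute value: " + value);
    }
}
```

The trade-off is the one noted above: the offending bytes are silently lost, so the affected text would still need to be recovered by other means, such as OCR of the corresponding bitmap.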
Attached PDF: 01 Ramadan Indexing techniques for 2016.pdf