Issue with alto file from a PDF #42

Closed
kermitt2 opened this issue Feb 25, 2019 · 2 comments
Labels
bug Something isn't working

Comments

@kermitt2
Owner

The attached PDF generates an XML file that cannot be parsed by GROBID's SAX parser:

ERROR [2019-02-25 20:10:13,990] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs. 
! org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.
! at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
! at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
! at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
! at org.apache.xerces.impl.XMLEntityScanner.scanLiteral(Unknown Source)
! at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanAttribute(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
! ... 80 common frames omitted
! Causing: org.xml.sax.SAXParseException: Invalid byte 2 of 3-byte UTF-8 sequence.
! at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
! at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
! at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
! at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
! at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
! at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
! at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
! at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
! at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
! at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
! at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
! at org.grobid.core.document.Document.addTokenizedDocument(Document.java:381)
! ... 70 common frames omitted
! Causing: org.grobid.core.exceptions.GrobidException: [PARSING_ERROR] Cannot parse file: /home/lopez/grobid/grobid-home/tmp/xsW7YuKt23.lxml
! at org.grobid.core.document.Document.addTokenizedDocument(Document.java:393)
! at org.grobid.core.engines.Segmentation.processing(Segmentation.java:94)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:130)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:109)
! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:474)
! at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:465)
! at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:179)
...

01 Ramadan Indexing techniques for 2016.pdf
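To narrow down where a file like this breaks, it can help to list the byte offsets that are not valid UTF-8 before handing the file to the parser. The following is only a minimal sketch in Python, not GROBID or pdfalto code: the function name `find_invalid_utf8` and the sample bytes are hypothetical, chosen to mimic a lead byte of a 3-byte sequence followed by a non-continuation byte inside an attribute value.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, reason) pairs for each invalid UTF-8 byte run in data."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding raises on the first malformed sequence.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice, so shift by pos.
            problems.append((pos + err.start, err.reason))
            pos += err.end
    return problems


# Hypothetical sample: 0xE2 announces a 3-byte sequence, but 0x28 is not
# a continuation byte, which is the kind of input the parser rejects.
sample = b'<String CONTENT="a\xe2\x28b"/>'
offsets = [offset for offset, _ in find_invalid_utf8(sample)]
print(offsets)
```

The slicing makes this quadratic in the worst case, which is acceptable for a one-off check on a single temporary file but not for bulk processing.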

Aazhar added the bug label Feb 26, 2019
@Aazhar
Collaborator

Aazhar commented Feb 26, 2019

This should be fixed as of commit d06fa76. Invalid UTF-8 sequences are now ignored, but the corresponding bitmaps are to be OCRed once that feature is integrated.
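For readers who want to see what "ignoring invalid UTF-8 sequences" means in practice, here is a small Python illustration. It is not the code from commit d06fa76, whose contents are not shown in this thread; `decode_ignoring_invalid` and the sample bytes are hypothetical, and the sketch only demonstrates the general idea of dropping malformed bytes instead of failing.

```python
def decode_ignoring_invalid(data: bytes) -> str:
    """Decode UTF-8, silently dropping any bytes that do not form a valid sequence."""
    return data.decode("utf-8", errors="ignore")


# Hypothetical input: 0xE2 starts a 3-byte sequence but is followed by "(",
# and 0xA1 is a stray continuation byte with no lead byte.
raw = b"ab\xe2\x28\xa1cd"

try:
    raw.decode("utf-8")  # strict decoding rejects the input
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

cleaned = decode_ignoring_invalid(raw)
print(strict_ok, cleaned)
```

Dropping bytes loses whatever glyphs they were meant to encode, which is presumably why the comment above mentions running OCR on the corresponding bitmaps later.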

@kermitt2
Owner Author

It works fine now: ALTO files are parsable in this case. Thanks!
