
Error: [PDF2XML_CONVERSION_FAILURE] PDF to XML conversion failed on pdf #185

Closed
rathancage opened this issue May 11, 2017 · 19 comments

@rathancage

>>>>>>>> GROBID_HOME=C:\grobid-master\grobid-home
[main] INFO org.grobid.core.main.LibraryLoader - Loading external native CRF library
[main] INFO org.grobid.core.main.LibraryLoader - Loading Wapiti native library...
[main] INFO org.grobid.core.main.LibraryLoader - Library crfpp loaded
[main] INFO org.grobid.core.jni.WapitiModel - Loading model: C:\grobid-master\grobid-home\models\header\model.wapiti (size: 36094028)
org.grobid.core.exceptions.GrobidException: [PDF2XML_CONVERSION_FAILURE] PDF to XML conversion failed on pdf file 1.pdf
    at org.grobid.core.document.DocumentSource.processPdf2XmlThreadMode(DocumentSource.java:184)
    at org.grobid.core.document.DocumentSource.pdf2xml(DocumentSource.java:133)
    at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:62)
    at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:49)
    at org.grobid.core.engines.HeaderParser.processing2(HeaderParser.java:84)
    at org.grobid.core.engines.Engine.processHeader(Engine.java:434)
    at org.grobid.core.engines.Engine.processHeader(Engine.java:410)
    at WeRe.Grobid.performFun(Grobid.java:25)
    at WeRe.MainClass.main(MainClass.java:12)
[Wapiti] Loading model: "C:\grobid-master\grobid-home\models\header\model.wapiti"
Model path: C:\grobid-master\grobid-home\models\header\model.wapiti

The above error shows up when I select a particular PDF. The same PDF is processed correctly for the header document through the web application. Can you tell me what the cause could be?

@kermitt2
Owner

Hello, which version of GROBID are you using?

Normally our fork of pdf2xml fixes most of the PDF parsing failures. To check, you can send me this particular PDF, by email for instance.

Header-only processing works fine because, for headers, only the first two pages of the PDF are parsed.

@rathancage
Author

rathancage commented May 11, 2017

Just checked the full document with the web application. It works fine. I am using 0.4.2-SNAPSHOT here. Also, I am on Windows with Eclipse, not Linux, if that helps.

@rathancage
Author

[pdf2xml[1.pdf]] ERROR org.grobid.core.process.ProcessRunner - IOException while launching the command [bash, -c, ulimit -Sv 6242304 && C:\grobid-master\grobid-home\pdf2xml\win-64/pdftoxml -blocks -noImageInline -fullFontName -l 2 '1.pdf' C:\grobid-master\grobid-home\tmp\c3aKfwnJye.lxml] : Cannot run program "bash": CreateProcess error=2, The system cannot find the file specified
org.grobid.core.exceptions.GrobidException: [TIMEOUT] PDF to XML conversion timed out
    at org.grobid.core.document.DocumentSource.processPdf2XmlThreadMode(DocumentSource.java:157)
    at org.grobid.core.document.DocumentSource.pdf2xml(DocumentSource.java:118)
    at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:57)
    at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:47)
    at org.grobid.core.engines.HeaderParser.processing2(HeaderParser.java:76)
    at org.grobid.core.engines.Engine.processHeader(Engine.java:428)
    at org.grobid.core.engines.Engine.processHeader(Engine.java:405)
    at WeRe.Grobid.performFun(Grobid.java:25)
    at WeRe.MainClass.main(MainClass.java:12)

I switched to 0.4.1 and this is the error I just got. Any ideas as to what could be causing it? Thanks in advance.
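
The 0.4.1 log above shows the converter being launched through `bash -c` with a `ulimit` prefix. That cannot start on a plain Windows install, where no `bash` executable is on the PATH. Below is a minimal Java sketch of choosing the launch command by operating system. It is not GROBID's actual `ProcessRunner` code: the class `Pdf2XmlCommand` and its methods are hypothetical, and only the flags and the `ulimit` value are copied from the logged command.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of GROBID: builds the argument list for pdftoxml.
public class Pdf2XmlCommand {

    /** True when an os.name value denotes Windows ("Darwin" must not match). */
    public static boolean isWindows(String osName) {
        return osName.toLowerCase(Locale.ROOT).startsWith("windows");
    }

    /**
     * Builds the command for the given OS.
     * lastPage > 0 adds "-l <lastPage>", as seen in the log for header processing.
     */
    public static List<String> build(String osName, String exe, String pdf,
                                     String out, int lastPage) {
        List<String> tool = new ArrayList<>(
                Arrays.asList(exe, "-blocks", "-noImageInline", "-fullFontName"));
        if (lastPage > 0) {
            tool.add("-l");
            tool.add(Integer.toString(lastPage));
        }
        tool.add(pdf);
        tool.add(out);

        if (isWindows(osName)) {
            // No bash and no ulimit on plain Windows: run the executable directly.
            return tool;
        }
        // Elsewhere: wrap in bash so ulimit can cap the child's virtual memory.
        StringBuilder script = new StringBuilder("ulimit -Sv 6242304 &&");
        for (String arg : tool) {
            // Single-quote each argument, escaping embedded single quotes.
            script.append(" '").append(arg.replace("'", "'\\''")).append("'");
        }
        return Arrays.asList("bash", "-c", script.toString());
    }

    public static void main(String[] args) {
        // Only prints the command for the current OS; nothing is launched.
        List<String> cmd = build(System.getProperty("os.name"),
                "pdftoxml", "1.pdf", "out.lxml", 2);
        System.out.println(cmd);
    }
}
```

The resulting list could be passed to `new ProcessBuilder(cmd)`. Building a list rather than one command string avoids quoting problems with Windows paths that contain spaces.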

@kermitt2
Owner

The PDF parsing library has been updated in version 0.4.2-SNAPSHOT; the version in 0.4.1 is less robust.

@rathancage
Author

So, with the 0.4.2-SNAPSHOT version, this is the error:
>>>>>>>> GROBID_HOME=C:\grobid-master\grobid-home
[main] INFO org.grobid.core.main.LibraryLoader - Loading external native CRF library
[main] INFO org.grobid.core.main.LibraryLoader - Loading Wapiti native library...
[main] INFO org.grobid.core.main.LibraryLoader - Library crfpp loaded
[main] INFO org.grobid.core.jni.WapitiModel - Loading model: C:\grobid-master\grobid-home\models\header\model.wapiti (size: 36094028)
org.grobid.core.exceptions.GrobidException: [PDF2XML_CONVERSION_FAILURE] PDF to XML conversion failed on pdf file 1.pdf
    at org.grobid.core.document.DocumentSource.processPdf2XmlThreadMode(DocumentSource.java:184)
    at org.grobid.core.document.DocumentSource.pdf2xml(DocumentSource.java:133)
    at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:62)
    at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:49)
    at org.grobid.core.engines.HeaderParser.processing2(HeaderParser.java:84)
    at org.grobid.core.engines.Engine.processHeader(Engine.java:434)
    at org.grobid.core.engines.Engine.processHeader(Engine.java:410)
    at WeRe.Grobid.performFun(Grobid.java:25)
    at WeRe.MainClass.main(MainClass.java:12)
[Wapiti] Loading model: "C:\grobid-master\grobid-home\models\header\model.wapiti"
Model path: C:\grobid-master\grobid-home\models\header\model.wapiti

What could be the issue?
Am I missing any particular library or anything else perhaps?

@kermitt2
Owner

I can't really say without testing the pdf myself.

@rathancage
Author

Link to the pdf I tried with

I have tried different PDFs and they all give the same error. In case it matters, I added some of the libraries individually as external JARs instead of using Maven dependencies. Thanks in advance for your reply.

@rathancage
Author

The error points to this statement, btw:
String tei = engine.processHeader(pdfPath, false, resHeader);
Hoping that helps find the cause of the error.

@kermitt2
Owner

This PDF works fine on Linux with the new pdf2xml fork, so you need to wait for a recompiled version of this new pdf2xml for Windows. I am not able to recompile it for Windows because I have no Windows machine, but hopefully another contributor will do it when version 0.4.2 of GROBID is released.

Here is the resulting TEI:
https://grobid.s3.amazonaws.com/NascimentoLSG-JCDL2011.tei.xml

It actually looks very nice ;)

@rathancage
Author

It does indeed. I will give it a try on Linux. Thanks.

@kermitt2 kermitt2 added the need help Issues where the contributors are even more incompetent than usual label May 11, 2017
@kermitt2
Owner

See #161 and #166.
We need to recompile the pdf2xml fork for Windows 64.

@rathancage
Author

Not to sound silly, but #166 states that it works well with 0.4.1, which gives an error in my case, as shown above. Is that due to the pdf2xml fork for Win 64?

@kermitt2
Owner

kermitt2 commented May 11, 2017

#166 doesn't work with 0.4.2-SNAPSHOT because pdf2xml has not been recompiled for Windows: a different symptom but the same cause ;)
All these issues will disappear once pdf2xml is updated for win64.

@rathancage
Author

Oh, okay. Thanks for the quick responses, @kermitt2. I will check it out on Linux. Great stuff @grobid btw!

@kermitt2
Owner

thanks @rathancage you're welcome!

@UntoterOstgote

UntoterOstgote commented May 16, 2017

grobid actually works out of the box in "Windows Subsystem for Linux" on Windows 10, so you can develop in Windows, deploy and run grobid as a service in Linux and call grobid-APIs from Windows again, all natively on one machine, without messing around with VMs.
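
As a rough illustration of the "call grobid-APIs from Windows" part, here is a minimal Java 11+ sketch that posts a PDF to a GROBID service over HTTP. Several things are assumptions rather than facts from this thread: the class name `GrobidServiceClient` is hypothetical, the endpoint URL is passed in by the caller because the port and path differ between GROBID versions (check the service documentation for yours), and the multipart field name `input` is what the GROBID REST documentation uses as far as I know.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical client sketch; not shipped with GROBID.
public class GrobidServiceClient {

    /** Encodes a single file as a multipart/form-data body. */
    public static byte[] multipartBody(String boundary, String field,
                                       String fileName, byte[] content) {
        byte[] head = ("--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + field
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n")
                .getBytes(StandardCharsets.UTF_8);
        byte[] tail = ("\r\n--" + boundary + "--\r\n")
                .getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(content, 0, content.length);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    /** Builds a POST request carrying the PDF in the assumed "input" field. */
    public static HttpRequest headerRequest(String endpoint, Path pdf)
            throws IOException {
        String boundary = "----grobid" + System.nanoTime();
        byte[] body = multipartBody(boundary, "input",
                pdf.getFileName().toString(), Files.readAllBytes(pdf));
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: GrobidServiceClient <endpoint-url> <file.pdf>");
            return;
        }
        HttpResponse<String> response = HttpClient.newHttpClient().send(
                headerRequest(args[0], Paths.get(args[1])),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body()); // TEI XML on success
    }
}
```

With the service running inside WSL, the endpoint argument would typically be a `localhost` URL, since WSL normally exposes listening ports to the Windows host.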


@lfoppiano
Collaborator

lfoppiano commented Aug 4, 2017

Dear @rathancage, @UntoterOstgote,
the new pdf2xml version for Windows was pushed yesterday (commit 75536cd).

You should be able to run GROBID on Windows without problems. Please bear in mind that the reference architecture for GROBID is Linux (moreover, compiling things on Windows is a real pain).

Feel free to reopen the ticket if you have further problems.

@dksanyal

Hi,
The pdf2xml binaries for lin-32 and lin-64 have not been updated, while the Windows one has been updated to version 2.0 at https://github.com/kermitt2/grobid/tree/master/grobid-home/pdf2xml.
Does this cause any issues?
Please comment on this, as some of my PDF files are not parsed correctly by my local build on CentOS while they are parsed correctly by the online portal.

This is very urgent. Any help is highly appreciated.

@yoannspace

yoannspace commented Aug 1, 2018
