Error: [PDF2XML_CONVERSION_FAILURE] PDF to XML conversion failed on pdf #185

rathancage · 2017-05-11T16:10:41Z

>>>>>>>> GROBID_HOME=C:\grobid-master\grobid-home
[main] INFO org.grobid.core.main.LibraryLoader - Loading external native CRF library
[main] INFO org.grobid.core.main.LibraryLoader - Loading Wapiti native library...
[main] INFO org.grobid.core.main.LibraryLoader - Library crfpp loaded
[main] INFO org.grobid.core.jni.WapitiModel - Loading model: C:\grobid-master\grobid-home\models\header\model.wapiti (size: 36094028)
org.grobid.core.exceptions.GrobidException: [PDF2XML_CONVERSION_FAILURE] PDF to XML conversion failed on pdf file 1.pdf
    at org.grobid.core.document.DocumentSource.processPdf2XmlThreadMode(DocumentSource.java:184)
    at org.grobid.core.document.DocumentSource.pdf2xml(DocumentSource.java:133)
    at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:62)
    at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:49)
    at org.grobid.core.engines.HeaderParser.processing2(HeaderParser.java:84)
    at org.grobid.core.engines.Engine.processHeader(Engine.java:434)
    at org.grobid.core.engines.Engine.processHeader(Engine.java:410)
    at WeRe.Grobid.performFun(Grobid.java:25)
    at WeRe.MainClass.main(MainClass.java:12)
[Wapiti] Loading model: "C:\grobid-master\grobid-home\models\header\model.wapiti"
Model path: C:\grobid-master\grobid-home\models\header\model.wapiti

The above error shows up when I select a particular PDF. The same PDF is processed correctly for the header over the web application. Do you know what could be causing the error?

kermitt2 · 2017-05-11T16:16:34Z

Hello, which version of GROBID are you using?

Normally our fork of pdf2xml fixes most PDF parsing failures. To check, you could send me this particular PDF, by email for instance.

Header-only processing works fine because, for headers, only the first two pages of the PDF are parsed.

rathancage · 2017-05-11T16:18:15Z

I just checked the full document with the web application, and it works fine. I am using 0.4.2-SNAPSHOT here. Also, I am using Windows and Eclipse, not Linux, if that helps.

rathancage · 2017-05-11T16:22:01Z

[pdf2xml[1.pdf]] ERROR org.grobid.core.process.ProcessRunner - IOException while launching the command [bash, -c, ulimit -Sv 6242304 && C:\grobid-master\grobid-home\pdf2xml\win-64/pdftoxml -blocks -noImageInline -fullFontName -l 2 '1.pdf' C:\grobid-master\grobid-home\tmp\c3aKfwnJye.lxml] : Cannot run program "bash": CreateProcess error=2, The system cannot find the file specified
org.grobid.core.exceptions.GrobidException: [TIMEOUT] PDF to XML conversion timed out
    at org.grobid.core.document.DocumentSource.processPdf2XmlThreadMode(DocumentSource.java:157)
    at org.grobid.core.document.DocumentSource.pdf2xml(DocumentSource.java:118)
    at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:57)
    at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:47)
    at org.grobid.core.engines.HeaderParser.processing2(HeaderParser.java:76)
    at org.grobid.core.engines.Engine.processHeader(Engine.java:428)
    at org.grobid.core.engines.Engine.processHeader(Engine.java:405)
    at WeRe.Grobid.performFun(Grobid.java:25)
    at WeRe.MainClass.main(MainClass.java:12)

I switched to 0.4.1, and the above is the error I just got. Any ideas as to what could be causing it? Thanks in advance.
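For readers hitting the same trace: the IOException above says the command was wrapped in `bash -c` with a `ulimit`, and Windows could not find `bash`. The following is a minimal sketch of OS-aware command construction, not GROBID's actual ProcessRunner code. The class and method names are hypothetical, the argument quoting is deliberately naive, and only the flags and the memory limit are taken from the log.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, NOT part of GROBID: shows why a bash wrapper
// cannot be launched on plain Windows and how a caller might avoid it.
final class Pdf2XmlCommand {

    /** True when the given os.name value denotes a Windows system. */
    static boolean isWindows(String osName) {
        return osName.toLowerCase(Locale.ROOT).startsWith("windows");
    }

    /**
     * Builds the argument list for launching the converter.
     * On Windows the executable is called directly, because there is no
     * bash and no ulimit. Elsewhere the call is wrapped in "bash -c" so a
     * soft virtual-memory limit (in KB) can be applied first.
     */
    static List<String> build(String osName, String executable,
                              List<String> args, long memoryLimitKb) {
        List<String> direct = new ArrayList<>();
        direct.add(executable);
        direct.addAll(args);
        if (isWindows(osName)) {
            return direct;
        }
        // Naive join: assumes no argument contains spaces or shell metacharacters.
        String joined = String.join(" ", direct);
        return Arrays.asList("bash", "-c",
                "ulimit -Sv " + memoryLimitKb + " && " + joined);
    }

    public static void main(String[] unused) {
        // Flags copied from the log above; file names are placeholders.
        List<String> args = Arrays.asList("-blocks", "-noImageInline",
                "-fullFontName", "-l", "2", "1.pdf", "out.lxml");
        System.out.println(build(System.getProperty("os.name"),
                "pdftoxml", args, 6242304L));
    }
}
```

The resulting list can be handed to `new ProcessBuilder(list)`. Passing arguments as a list on Windows also avoids shell quoting issues with backslashes in paths.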

kermitt2 · 2017-05-11T16:27:57Z

The PDF parsing library has been updated in version 0.4.2-SNAPSHOT; the version in 0.4.1 is less robust.

rathancage · 2017-05-11T16:33:05Z

So, with the 0.4.2-SNAPSHOT version, this is the error:
>>>>>>>> GROBID_HOME=C:\grobid-master\grobid-home
[main] INFO org.grobid.core.main.LibraryLoader - Loading external native CRF library
[main] INFO org.grobid.core.main.LibraryLoader - Loading Wapiti native library...
[main] INFO org.grobid.core.main.LibraryLoader - Library crfpp loaded
[main] INFO org.grobid.core.jni.WapitiModel - Loading model: C:\grobid-master\grobid-home\models\header\model.wapiti (size: 36094028)
org.grobid.core.exceptions.GrobidException: [PDF2XML_CONVERSION_FAILURE] PDF to XML conversion failed on pdf file 1.pdf
    at org.grobid.core.document.DocumentSource.processPdf2XmlThreadMode(DocumentSource.java:184)
    at org.grobid.core.document.DocumentSource.pdf2xml(DocumentSource.java:133)
    at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:62)
    at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:49)
    at org.grobid.core.engines.HeaderParser.processing2(HeaderParser.java:84)
    at org.grobid.core.engines.Engine.processHeader(Engine.java:434)
    at org.grobid.core.engines.Engine.processHeader(Engine.java:410)
    at WeRe.Grobid.performFun(Grobid.java:25)
    at WeRe.MainClass.main(MainClass.java:12)
[Wapiti] Loading model: "C:\grobid-master\grobid-home\models\header\model.wapiti"
Model path: C:\grobid-master\grobid-home\models\header\model.wapiti

What could be the issue?
Am I missing any particular library or anything else perhaps?

kermitt2 · 2017-05-11T16:35:38Z

I can't really say without testing the pdf myself.

rathancage · 2017-05-11T16:47:54Z

Link to the pdf I tried with

I have tried different PDFs and they all give the same error. In case it matters, I added some of the libraries independently as external JARs rather than as Maven dependencies. Thanks in advance for your reply.

rathancage · 2017-05-11T16:50:08Z

The error points to this statement, by the way:
String tei = engine.processHeader(pdfPath, false, resHeader);
Hopefully that leads to the cause of the error.

kermitt2 · 2017-05-11T16:59:49Z

This PDF works fine on Linux with the new pdf2xml fork, so you need to wait for a recompiled version of this new pdf2xml on Windows. I am not able to recompile it on Windows because I have no Windows machine, but hopefully another contributor will do it for the 0.4.2 release of GROBID.

Here is the resulting TEI:
https://grobid.s3.amazonaws.com/NascimentoLSG-JCDL2011.tei.xml

It actually looks very nice ;)

rathancage · 2017-05-11T17:00:56Z

It does indeed. I will give it a try on Linux. Thanks.

kermitt2 · 2017-05-11T17:05:27Z

See #161 and #166.
The pdf2xml fork needs to be recompiled for Windows 64-bit.

rathancage · 2017-05-11T17:07:05Z

Not to sound silly, but #166 states that it works well with 0.4.1, which gives an error in my case, as stated above. Is that due to the pdf2xml fork for Win 64?

kermitt2 · 2017-05-11T17:10:15Z

#166 doesn't work with 0.4.2-SNAPSHOT because pdf2xml has not been recompiled for Windows - a different symptom but the same cause ;)
All these issues will disappear once pdf2xml is updated for win64.

rathancage · 2017-05-11T17:15:18Z

Oh, okay. Thanks for the quick responses, @kermitt2. I will check it out on Linux. Great stuff, @grobid, btw!

kermitt2 · 2017-05-11T17:30:30Z

Thanks @rathancage, you're welcome!

UntoterOstgote · 2017-05-16T19:10:16Z

GROBID actually works out of the box in "Windows Subsystem for Linux" on Windows 10. You can develop in Windows, deploy and run GROBID as a service in Linux, and call the GROBID APIs from Windows again, all natively on one machine, without messing around with VMs.
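As a rough illustration of calling the service from Windows-side Java code, here is a minimal sketch using the JDK's `java.net.http` client (Java 11 or later). The endpoint path `/api/processHeaderDocument` and the multipart field name `input` follow the GROBID REST documentation as I understand it; please verify them against your GROBID version. The base URL and port depend on how the service is started and are passed in as an assumption, and the class name is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical client, not part of GROBID.
final class GrobidHeaderClient {

    /** Encodes one PDF as a multipart/form-data body with a single "input" part. */
    static byte[] multipartBody(String boundary, String fileName, byte[] pdfBytes) {
        byte[] head = ("--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"input\"; filename=\""
                + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n")
                .getBytes(StandardCharsets.UTF_8);
        byte[] tail = ("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(pdfBytes, 0, pdfBytes.length);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    /** Posts the PDF to the header endpoint and returns the response body (TEI XML). */
    static String processHeader(String baseUrl, Path pdf)
            throws IOException, InterruptedException {
        String boundary = "grobid-" + System.nanoTime();
        byte[] body = multipartBody(boundary, pdf.getFileName().toString(),
                Files.readAllBytes(pdf));
        HttpRequest request = HttpRequest
                .newBuilder(URI.create(baseUrl + "/api/processHeaderDocument"))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("GROBID service returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: GrobidHeaderClient <baseUrl> <file.pdf>");
            return;
        }
        // baseUrl is whatever host and port your GROBID service listens on.
        System.out.println(processHeader(args[0], Path.of(args[1])));
    }
}
```

With this approach the Windows side never launches pdf2xml itself, so the missing `bash` and the Windows binary are not involved.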

lfoppiano · 2017-08-04T14:23:21Z

Dear @rathancage, @UntoterOstgote,
the new pdf2xml version for Windows was pushed yesterday, commit 75536cd.

You should be able to run GROBID on Windows without problems. Please bear in mind that the reference architecture for GROBID is Linux (moreover, compiling things on Windows is a real pain).

Feel free to reopen the ticket if you have further problems.

dksanyal · 2018-01-11T14:19:16Z

Hi,
The pdf2xml binaries for lin-32 and lin-64 have not been updated, while the one for Windows has been updated to version 2.0 at https://github.com/kermitt2/grobid/tree/master/grobid-home/pdf2xml.
Does this cause any issues?
Please comment on this, as some of my PDF files are not parsed correctly by my local build on CentOS while they are parsed correctly in the online portal.

This is very urgent. Any help is highly appreciated.

yoannspace · 2018-08-01T20:59:01Z

Hi,

We talked today and saw that some files work and some don't on Windows (both work on Unix).
Here are example files for testing:

Cheers,
Yoann

This was referenced May 11, 2017

[BAD_INPUT_DATA] PDF to XML conversion failed with error code: 99 #166

Closed

PDF2XML crash #161

Closed

kermitt2 added the "need help" label May 11, 2017

This was referenced Jun 24, 2017

Error reported by XML parser: Invalid byte 2 of 3-byte UTF-8 sequence. #195

Closed

Windows 64 bit pdftoxml issue lfoppiano/grobid-quantities#30

Closed

lfoppiano self-assigned this Aug 4, 2017

lfoppiano closed this as completed Aug 4, 2017
