pdftotext.Error: Poppler error creating document #17

prakritidev · 2018-04-15T18:14:08Z

I get this error while using pdftotext with the multiprocessing module on EC2:

('read pdf file', '1004.5293.pdf')
Traceback (most recent call last):
  File "main.py", line 44, in <module>
    result = pool.map(pdf_extract, filenames)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
pdftotext.Error: Poppler error creating document

My code:

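import os
import time

import pdftotext

# `have` (names of already-extracted .txt files) and `txt_path` (output
# directory) are globals defined elsewhere in main.py.
def pdf_extract(dirs):
    paths, filename = dirs
    file = filename.replace(".pdf", ".txt")
    if file in have:
        print("file already extracted!!")
    else:
        print("read pdf file", filename)
        with open(os.path.join(paths, filename), "rb") as f:
            pdf = pdftotext.PDF(f)
            print(len(pdf))
        # Join the pages with a blank line between them.
        text = "\n\n".join(pdf)
        print("converted file")
        # The with block closes the file, so no explicit close() is needed.
        with open(txt_path + file, "w") as f:
            f.write(text)
            print("saved file")
        time.sleep(0.01)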

Link: arXiv paper


jalan · 2018-04-16T00:13:04Z

Can you attach the PDF in question, 1004.5293.pdf?

prakritidev · 2018-04-16T05:22:11Z

It's a research paper from arxiv.org.

jalan · 2018-04-16T14:00:18Z

Okay, I found it. Works fine here:

>>> import pdftotext
>>> f = open("1004.5293.pdf", "rb")
>>> pdf = pdftotext.PDF(f)
>>> len(pdf)
34

What version of poppler do you have?

prakritidev · 2018-04-17T12:50:44Z

I'm using Version: 3:4.8.5-1. I had 1 million PDF files and got this error for many of them. I'll try this again. I'm not sure whether I'm doing something wrong in my function; please let me know if you see a problem with how I'm reading the PDF files.

Thanks for the help.
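
With a batch this large, it can help to record which files fail instead of letting the first pdftotext.Error abort the whole pool.map call. The following is only a minimal sketch of that idea, not code from this thread: the names extract_text, safe_extract, and extract_all are hypothetical, and the extraction step simply mirrors the function above.

```python
import multiprocessing


def extract_text(path):
    """Stand-in for the real extraction step; returns the document text."""
    import pdftotext  # imported here so the sketch loads without the package

    with open(path, "rb") as f:
        return "\n\n".join(pdftotext.PDF(f))


def safe_extract(path, extractor=extract_text):
    """Return (path, text, error) instead of letting the exception escape."""
    try:
        return path, extractor(path), None
    except Exception as exc:  # broad on purpose; this also covers pdftotext.Error
        return path, None, "%s: %s" % (type(exc).__name__, exc)


def extract_all(paths, processes=4):
    """Run safe_extract over all paths and split the results by outcome."""
    pool = multiprocessing.Pool(processes)
    try:
        results = pool.map(safe_extract, paths)
    finally:
        pool.close()
        pool.join()
    succeeded = [(p, text) for p, text, err in results if err is None]
    failed = [(p, err) for p, text, err in results if err is not None]
    return succeeded, failed
```

Calling extract_all(list_of_paths) from the main script (under an `if __name__ == "__main__":` guard on platforms that spawn worker processes) returns the extracted texts plus a list of (path, error message) pairs, which makes it easier to collect the failing PDFs and attach one to a report.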

jalan · 2018-04-17T18:40:30Z

3:4.8.5-1 doesn't look like a poppler version. It should look something like 0.63.0. You can see some common versions at poppler.freedesktop.org under the Packaged Versions section.

Your code looks fine to me, but if you're using an older version of poppler, errors like this are more common.
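
One way to find the installed poppler version is to ask the pdftotext command-line tool from poppler-utils, which reports its version when run with -v. The sketch below assumes that tool is on the PATH; the helper names are hypothetical, and the version it reports may differ from the poppler library the Python module was built against.

```python
import re
import subprocess


def parse_poppler_version(text):
    """Pull a dotted version such as 0.63.0 out of `pdftotext -v` output."""
    match = re.search(r"version\s+(\d+(?:\.\d+)+)", text)
    return match.group(1) if match else None


def installed_poppler_version():
    """Ask the pdftotext CLI for its version; return None if it is unavailable."""
    try:
        proc = subprocess.Popen(
            ["pdftotext", "-v"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        out, err = proc.communicate()
    except OSError:
        # The pdftotext executable was not found or could not be started.
        return None
    # The version banner may go to stderr, so search both streams.
    return parse_poppler_version((out + err).decode("utf-8", "replace"))
```

A package-manager string such as 3:4.8.5-1 contains no version banner, so parse_poppler_version returns None for it.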

anuragladdha-ml · 2018-06-21T14:53:08Z

How do I install poppler on Windows? After I download the zip, which directory does it need to be placed in?

bhakti-visotrust · 2022-09-21T00:38:29Z

I'm getting this error when running pdftotext.PDF(f) on AWS Lambda (Linux), but it works with the same document when running on macOS.

Using pdftotext==2.2.2, poppler-utils==0.1.0

AlexisH · 2022-10-05T09:04:09Z

I'm getting this error too when running pdftotext.PDF(f) on Ubuntu 20.04.5 LTS. Was working fine last month. No code change.
I upgraded the pdftotext package from 2.1.5 to 2.2.2 but the issue is still there.

Edit: never mind. I found that one of the files being read was actually not a PDF file but a PNG image of the document...
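
A mislabeled file like this can be caught before it reaches pdftotext.PDF by checking for the %PDF- header marker. This is a rough sketch, not part of pdftotext: looks_like_pdf is a hypothetical helper, and the 1024-byte search window is an assumption based on how leniently common PDF readers locate the header.

```python
def looks_like_pdf(path, window=1024):
    """Return True if the '%PDF-' marker appears in the first `window` bytes."""
    with open(path, "rb") as f:
        return b"%PDF-" in f.read(window)
```

A PNG file starts with the bytes \x89PNG instead, so this check returns False for it, and the file can be skipped or logged rather than passed to poppler.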

dvtate · 2024-01-02T03:02:40Z

Same issue with same file, Arch Linux, poppler 23.12.0-1, and pdftotext 2.2.2-4

jalan closed this as completed May 20, 2018

dvtate mentioned this issue Jan 2, 2024

#17 in arch linux #120

Closed
