pdftotext.Error: Poppler error creating document #17

Closed
prakritidev opened this issue Apr 15, 2018 · 9 comments

Comments

prakritidev commented Apr 15, 2018

I get this error while using pdftotext with the multiprocessing module on EC2:

('read pdf file', '1004.5293.pdf')
Traceback (most recent call last):
  File "main.py", line 44, in <module>
    result = pool.map(pdf_extract, filenames)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
pdftotext.Error: Poppler error creating document

My code:

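import os
import time

import pdftotext

# `have` (names of already-extracted .txt files) and `txt_path` (output
# directory prefix) are assumed to be globals defined elsewhere; they are
# not shown in this post.

def pdf_extract(dirs):
    paths, filename = dirs
    file = filename.replace(".pdf", ".txt")
    if file in have:
        print("file already extracted!!")
    else:
        print("read pdf file", filename)
        with open(os.path.join(paths, filename), "rb") as f:
            pdf = pdftotext.PDF(f)
            print(len(pdf))
        # Join the pages with blank lines between them.
        text = "\n\n".join(pdf)
        print("converted file")
        # The with block closes the file, so no explicit close() is needed.
        with open(txt_path + file, "w") as f:
            f.write(text)
            print("saved file")
        time.sleep(0.01)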

Link: arXiv paper

Owner

jalan commented Apr 16, 2018

Can you attach the PDF in question, 1004.5293.pdf?

Author

prakritidev commented Apr 16, 2018

It's a research paper from arxiv.org.

Owner

jalan commented Apr 16, 2018

Okay, I found it. Works fine here:

>>> import pdftotext
>>> f = open("1004.5293.pdf", "rb")
>>> pdf = pdftotext.PDF(f)
>>> len(pdf)
34

What version of poppler do you have?

Author

prakritidev commented Apr 17, 2018

I'm using Version: 3:4.8.5-1. I had 1 million PDF files and got this error for many of them. I'll try this again. I'm not sure whether I'm doing something wrong in my function, so please let me know if you think I'm reading the PDF files incorrectly.

Thanks for the help.
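
With a batch this large, one failing file need not abort the whole run: pool.map re-raises a worker's exception in the parent process, which is what the traceback above shows. The following is a minimal sketch, assuming Python 3. The helper names try_extract, extract_with_pdftotext, and run_batch are hypothetical and not part of pdftotext; only pdftotext.PDF and pdftotext.Error come from the library.

```python
import functools
import multiprocessing


def extract_with_pdftotext(path):
    """Extract text from one PDF; raises pdftotext.Error if Poppler rejects it."""
    import pdftotext  # imported lazily so the helpers below work without it

    with open(path, "rb") as f:
        return "\n\n".join(pdftotext.PDF(f))


def try_extract(path, extract=extract_with_pdftotext):
    """Return (path, text, None) on success or (path, None, message) on failure."""
    try:
        return path, extract(path), None
    except Exception as exc:  # deliberately broad: record the failure, keep going
        return path, None, "{}: {}".format(type(exc).__name__, exc)


def run_batch(paths, extract=extract_with_pdftotext, processes=None):
    """Process all paths in a pool and return (results, failures) dicts."""
    # `extract` must be a picklable top-level function to cross process boundaries.
    worker = functools.partial(try_extract, extract=extract)
    with multiprocessing.Pool(processes) as pool:
        outcomes = pool.map(worker, paths)
    results = {p: text for p, text, err in outcomes if err is None}
    failures = {p: err for p, text, err in outcomes if err is not None}
    return results, failures
```

Calling run_batch(list_of_paths) would then return the extracted text alongside a dictionary of the files that failed and why, so the problem PDFs can be inspected individually afterwards.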

Owner

jalan commented Apr 17, 2018

3:4.8.5-1 doesn't look like a poppler version. It should look something like 0.63.0. You can see some common versions at poppler.freedesktop.org under the Packaged Versions section.
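
One way to read off a version in that form is to ask the pdftotext command-line tool from poppler-utils. This is a rough sketch under two assumptions: that the tool is installed, and that its banner has the form "pdftotext version X.Y.Z". It reports the version of the command-line tools, which may differ from the Poppler library the Python module was built against. The function names are hypothetical.

```python
import re
import subprocess


def parse_poppler_version(banner):
    """Return the dotted version from a `pdftotext -v` banner, or None."""
    match = re.search(r"pdftotext version (\d+(?:\.\d+)+)", banner)
    return match.group(1) if match else None


def installed_poppler_version():
    """Run `pdftotext -v` and parse its banner; None if the tool is missing."""
    try:
        proc = subprocess.run(
            ["pdftotext", "-v"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # the banner is usually written to stderr
            universal_newlines=True,
        )
    except OSError:  # e.g. pdftotext is not on PATH
        return None
    return parse_poppler_version(proc.stdout)
```

A distribution package string such as 3:4.8.5-1 does not match the banner pattern, so parse_poppler_version returns None for it.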

Your code looks fine to me, but if you're using an older version of poppler, errors like this are more common.

jalan closed this as completed May 20, 2018

@anuragladdha-ml

How do I install poppler on Windows? After I download the zip, which directory does it need to be placed in?

bhakti-visotrust commented Sep 21, 2022

I'm getting this error when running pdftotext.PDF(f) on AWS Lambda (Linux), but the same document works when running on macOS.

Using pdftotext==2.2.2, poppler-utils==0.1.0

AlexisH commented Oct 5, 2022

I'm getting this error too when running pdftotext.PDF(f) on Ubuntu 20.04.5 LTS. It was working fine last month, with no code change.
I upgraded the pdftotext package from 2.1.5 to 2.2.2, but the issue is still there.

Edit: never mind. I found that one of the files being read was not actually a PDF file but a PNG image of the document...
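
A quick way to catch this case before handing a file to pdftotext is to look at its first bytes instead of trusting the extension. This is a minimal sketch; sniff_file_type is a hypothetical helper, and the check is deliberately strict (some PDF readers tolerate a few leading bytes before the header, which this does not).

```python
PDF_MAGIC = b"%PDF-"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def sniff_file_type(path):
    """Guess the type from the first bytes of the file, ignoring its extension."""
    with open(path, "rb") as f:
        head = f.read(8)  # long enough for both signatures
    if head.startswith(PDF_MAGIC):
        return "pdf"
    if head.startswith(PNG_MAGIC):
        return "png"
    return "unknown"
```

Files for which sniff_file_type returns anything other than "pdf" can be skipped or logged before pdftotext.PDF is ever called on them.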

dvtate commented Jan 2, 2024

Same issue with the same file on Arch Linux, with poppler 23.12.0-1 and pdftotext 2.2.2-4.
