Default Experience Should Not Require Poppler for PDFs #20

Open
ankrgyl opened this issue Sep 7, 2022 · 1 comment

ankrgyl commented Sep 7, 2022

PDFs are rendered with Poppler to create image previews; however, these previews are unnecessary for certain models (e.g. LayoutLMv1) when the file has embedded text. We should make sure that the default scenario, in which Poppler is not available, still works.
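
One way the requested behavior could look is sketched below: check for the Poppler binaries up front and return no images instead of raising. This is only an illustration, not the project's implementation. The helper names `poppler_available` and `render_pages_or_none` are hypothetical; `pdf2image.convert_from_bytes` and `PDFInfoNotInstalledError` are the calls that appear in the traceback further down this thread.

```python
import shutil


def poppler_available() -> bool:
    """Return True if the Poppler tools used by pdf2image are on PATH."""
    return all(shutil.which(tool) is not None for tool in ("pdfinfo", "pdftoppm"))


def render_pages_or_none(pdf_bytes: bytes):
    """Hypothetical helper: page images as RGB, or None if rendering is unavailable."""
    if not poppler_available():
        # No Poppler: the caller can fall back to embedded text only.
        return None
    try:
        import pdf2image
        from pdf2image.exceptions import PDFInfoNotInstalledError
    except ImportError:
        # pdf2image itself is not installed.
        return None
    try:
        return [img.convert("RGB") for img in pdf2image.convert_from_bytes(pdf_bytes)]
    except PDFInfoNotInstalledError:
        # Poppler disappeared or is unusable despite the PATH check.
        return None
```

A caller would then treat `None` as "no page images" and rely on the embedded text for models that do not need images.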

@RamesanPP

I am getting an error from the pdf2image library telling me to install Poppler and add it to PATH.
This is my code:

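from docquery import document, pipeline

def doc_type(temp_path):
    # Build the document question-answering pipeline
    p = pipeline('document-question-answering')
    doc = document.load_document(temp_path)
    # Accessing doc.context renders page images through pdf2image
    response = p("What type of document is this?", **doc.context)
    return response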

The error I receive is:
    response = p("What type of document is this?", **doc.context)
                                                    ^^^^^^^^^^^^
  File "C:\Users\Cirruslabs\AppData\Local\Programs\Python\Python311\Lib\functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Cirruslabs\Documents\GitHub\Document-Processing-BE\venv\Lib\site-packages\docquery\document.py", line 117, in context
    images = self._images
             ^^^^^^^^^^^^
  File "C:\Users\Cirruslabs\AppData\Local\Programs\Python\Python311\Lib\functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Cirruslabs\Documents\GitHub\Document-Processing-BE\venv\Lib\site-packages\docquery\document.py", line 156, in _images
    return [x.convert("RGB") for x in pdf2image.convert_from_bytes(self.b)]
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Cirruslabs\Documents\GitHub\Document-Processing-BE\venv\Lib\site-packages\pdf2image\pdf2image.py", line 358, in convert_from_bytes
    return convert_from_path(
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Cirruslabs\Documents\GitHub\Document-Processing-BE\venv\Lib\site-packages\pdf2image\pdf2image.py", line 127, in convert_from_path
    page_count = pdfinfo_from_path(
                 ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Cirruslabs\Documents\GitHub\Document-Processing-BE\venv\Lib\site-packages\pdf2image\pdf2image.py", line 594, in pdfinfo_from_path
    raise PDFInfoNotInstalledError(
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Is there any workaround for this? I've tried installing poppler-utils and pdf2image, but it still doesn't work.
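
A possible workaround, sketched under assumptions: pdf2image is only a wrapper and shells out to the Poppler binaries (`pdfinfo`, `pdftoppm`), so those executables must be installed separately and be discoverable on PATH. Since the traceback shows `convert_from_bytes` being called without a `poppler_path` argument, one option is to extend PATH for the current process before loading the document. The helper name `add_poppler_to_path` is hypothetical, and the directory you pass is whatever folder holds the Poppler executables on your machine.

```python
import os
import shutil


def add_poppler_to_path(poppler_bin_dir: str) -> bool:
    """Prepend poppler_bin_dir to this process's PATH (once).

    Returns True if pdfinfo can be resolved afterwards.
    """
    current = os.environ.get("PATH", "")
    if poppler_bin_dir not in current.split(os.pathsep):
        os.environ["PATH"] = poppler_bin_dir + os.pathsep + current
    return shutil.which("pdfinfo") is not None
```

Call it once at startup, before `document.load_document(...)`, with the folder that contains `pdfinfo` (on Windows, the `bin` folder of the extracted Poppler download). If it returns False, the folder does not contain the Poppler tools and the same error is to be expected.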
