PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH? #16315

Closed
4 of 15 tasks
KaifAhmad1 opened this issue Jan 20, 2024 · 1 comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: doc loader Related to document loader module (not documentation)

Comments

@KaifAhmad1

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.

Example Code

Here is my issue in brief:

import os

poppler_path = 'C:\\Users\\Mohd Kaif\\Downloads\\poppler-23.08.0\\Library\\bin'
os.environ["PATH"] += os.pathsep + poppler_path
directory = '/content/drive/MyDrive/History_QA_dataset'
from pathlib import Path

def load_files(directory):
    documents = list(Path(directory).iterdir())
    return documents

documents = load_files(directory)
print(len(documents))
documents
from langchain_community.document_loaders import UnstructuredPDFLoader
loader = UnstructuredPDFLoader("/content/drive/MyDrive/History_QA_dataset/ncert_s_modern_india_bipan_chandra_old_edition-1566975158976.pdf")
pages = loader.load()

Description

Running the code above raises this error:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py in pdfinfo_from_path(pdf_path, userpw, ownerpw, poppler_path, rawdates, timeout, first_page, last_page)
    580             env["LD_LIBRARY_PATH"] = poppler_path + ":" + env.get("LD_LIBRARY_PATH", "")
--> 581         proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
    582 

14 frames
FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo'

During handling of the above exception, another exception occurred:

PDFInfoNotInstalledError                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py in pdfinfo_from_path(pdf_path, userpw, ownerpw, poppler_path, rawdates, timeout, first_page, last_page)
    605 
    606     except OSError:
--> 607         raise PDFInfoNotInstalledError(
    608             "Unable to get page count. Is poppler installed and in PATH?"
    609         )

PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

System Info

System Information:
Python version: 3.10.10
Operating System: Windows 11
Windows: 11
pip == 23.3.1
python == 3.10.10
langchain == 0.1.0
transformers == 4.36.2
sentence_transformers == 2.2.2
unstructured == 0.12.0

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

dosubot bot commented Jan 20, 2024

🤖

Hey there, @KaifAhmad1! It's good to see you again. Hope you're doing well! Let us dive into the issue you've encountered.

From the error message you provided, it seems like the issue is related to the poppler library not being found in your system's PATH. This is not directly related to the LangChain framework, but rather to the way your system is set up.

The poppler library is used by the pdf2image package, which is a dependency of the UnstructuredPDFLoader class in LangChain. This class is used to load PDF documents and convert them into a format that can be processed by LangChain.
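
As a rough illustration of the mechanism visible in the traceback (a simplified sketch, not pdf2image's actual source): the library launches an external executable with Popen, and when the binary cannot be found the resulting OSError is re-raised as a "not installed" error. The names run_tool and ToolNotInstalledError are hypothetical and used only for this example.

```python
import subprocess


class ToolNotInstalledError(Exception):
    """Hypothetical stand-in for pdf2image's PDFInfoNotInstalledError."""


def run_tool(command, env=None):
    """Run an external command and return its stdout as text.

    A missing executable makes Popen raise FileNotFoundError (a subclass
    of OSError), which is converted into ToolNotInstalledError here.
    """
    try:
        proc = subprocess.Popen(
            command, env=env, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
    except OSError as exc:
        raise ToolNotInstalledError(
            f"Unable to run {command[0]!r}. Is it installed and in PATH?"
        ) from exc
    out, _ = proc.communicate()
    return out.decode("utf-8", errors="replace")
```

Calling run_tool(["pdfinfo", "-v"]) on a machine without poppler should fail in the same way as the traceback: first a FileNotFoundError for 'pdfinfo', then the wrapped error.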

The error message suggests that poppler is not installed or not in the PATH. Here are a few things you can try:

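  1. Check if poppler is installed: poppler does not ship an executable named poppler, so look for one of its tools instead. Run where pdfinfo in your command prompt (or which pdfinfo on Linux or macOS). If poppler is installed and on your PATH, this prints the full path to the pdfinfo executable.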

  2. Check if poppler is in your PATH: You can do this by running echo %PATH% in your command prompt. This will print out all the directories in your PATH. Look for poppler's bin directory, the one that contains pdfinfo.exe.

  3. Add poppler to your PATH: If poppler is not in your PATH, you can add it by modifying your system's environment variables. Here's how you can do it:

    • Right-click on 'This PC' and click on 'Properties'.
    • Click on 'Advanced system settings'.
    • Click on 'Environment Variables'.
    • In the 'System variables' section, find the 'Path' variable, select it, and click on 'Edit'.
    • In the 'Variable value' field, append the path to poppler's bin directory (the folder containing pdfinfo.exe). Make sure to separate it from the existing paths with a semicolon (;).
  4. Reinstall poppler: If poppler is not installed, you can download it from here. After downloading, extract the files and add the bin directory to your PATH.
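
The checks above can also be run from Python, in the same process that loads the PDF. This is a minimal sketch using only the standard library; find_pdfinfo and add_to_path are hypothetical helper names, not part of LangChain, unstructured, or pdf2image.

```python
import os
import shutil


def find_pdfinfo(extra_dir=None):
    """Return the full path to the pdfinfo executable, or None if not found.

    If extra_dir is given, it is searched before the directories on PATH.
    """
    search_path = os.environ.get("PATH", "")
    if extra_dir:
        search_path = extra_dir + os.pathsep + search_path
    return shutil.which("pdfinfo", path=search_path)


def add_to_path(directory):
    """Prepend directory to this process's PATH.

    Returns False without changing PATH if the directory does not exist
    on the machine that is actually running the code.
    """
    if not os.path.isdir(directory):
        return False
    os.environ["PATH"] = directory + os.pathsep + os.environ.get("PATH", "")
    return True
```

If find_pdfinfo() returns None, poppler is not visible to the Python process. Note that add_to_path returns False when the directory is missing on the machine executing the code: the paths in the traceback (/usr/local/lib/python3.10/dist-packages) suggest a Linux runtime, where a Windows directory such as the poppler_path in the example would not exist.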

Please try these steps and let me know if the issue persists. If it does, please provide more details about your system setup and the exact steps you're following so I can better assist you.

For more information about how to use the UnstructuredPDFLoader class, you can refer to the LangChain documentation.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Jan 20, 2024
@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Apr 20, 2024
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 27, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Apr 27, 2024