Error when extracting images with PyMuPDFLoader and PyPDFLoader #294

@BennisonDevadoss

TRANSFERRED BY @mdrxy
DOCS NOTE: SEE ATTACHED CLOSED PR THAT SHOULD BE APPLIED HERE

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

  1. Use the following code to load a PDF with PyMuPDFLoader and image extraction enabled:
########################################
# PyMuPDFLoader
########################################
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("google-2024-environmental-report.pdf", extract_images=True)
pages = loader.load()

for page in pages:
    print(page.page_content)
  2. Download the PDF located at: Google 2024 Environmental Report.
  3. I also tried PyPDFLoader with the same PDF and encountered the same error.

Error Message and Stack Trace (if applicable)

Traceback (most recent call last):
  File "/mnt/c/Users/BennisonJ/Yavar/projects/zypher-2.0/backend/apps/rag/main.py", line 811, in store_doc
    data = loader.load()
           ^^^^^^^^^^^^^
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/langchain_community/document_loaders/pdf.py", line 387, in load
    return list(self._lazy_load(**kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/langchain_community/document_loaders/pdf.py", line 384, in _lazy_load
    yield from parser.lazy_parse(blob)
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 244, in lazy_parse
    yield from [
               ^
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 247, in <listcomp>
    + self._extract_images_from_page(doc, page),
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 283, in _extract_images_from_page
    return extract_from_images_with_rapidocr(imgs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 74, in extract_from_images_with_rapidocr
    result, _ = ocr(img)
                ^^^^^^^^
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/rapidocr_onnxruntime/rapid_ocr_api.py", line 80, in __call__
    dt_boxes, det_elapse = self.text_detector(img)
                           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/rapidocr_onnxruntime/ch_ppocr_v3_det/text_detect.py", line 66, in __call__
    data = transform(data, self.preprocess_op)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/rapidocr_onnxruntime/ch_ppocr_v3_det/utils.py", line 220, in transform
    data = op(data)
           ^^^^^^^^
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/rapidocr_onnxruntime/ch_ppocr_v3_det/utils.py", line 75, in __call__
    data['image'] = (img * self.scale - self.mean) / self.std
                     ~~~~~~~~~~~~~~~~~^~~~~~~~~~~
ValueError: operands could not be broadcast together with shapes (896,800) (1,1,3)

Description

I am encountering a ValueError when using either PyMuPDFLoader or PyPDFLoader to extract images from certain PDFs. The error message says that operands could not be broadcast together with shapes (896,800) (1,1,3). It occurs only when the extract_images parameter is set to True.
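To make the shape mismatch concrete, here is a small sketch of the broadcasting rule that NumPy documents, written in plain Python. It is not code from LangChain or RapidOCR. The helper name broadcast_shape and the 3-channel shape (896, 800, 3) are assumptions for illustration. Reading (896,800) as a single-channel (grayscale) image is also an assumption based on the shape alone.

```python
from itertools import zip_longest
from typing import Optional, Tuple

Shape = Tuple[int, ...]


def broadcast_shape(a: Shape, b: Shape) -> Optional[Shape]:
    """Return the broadcast result shape, or None if the shapes are incompatible.

    Illustrative re-implementation of NumPy's documented rule: compare the
    shapes axis by axis starting from the last one; each pair of sizes must
    be equal, or one of them must be 1. Missing leading axes count as 1.
    """
    result = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and x != 1 and y != 1:
            return None  # this is where NumPy raises ValueError
        result.append(y if x == 1 else x)
    return tuple(reversed(result))


gray_image = (896, 800)    # 2-D shape taken from the error message
mean = (1, 1, 3)           # shape of the second operand in the error message
rgb_image = (896, 800, 3)  # hypothetical 3-channel version of the same image

print(broadcast_shape(gray_image, mean))  # None
print(broadcast_shape(rgb_image, mean))   # (896, 800, 3)
```

The last axis of the 2-D array (800) is compared against the 3 channel entries of the second operand, so the subtraction in the final traceback frame cannot proceed. An array with a trailing axis of size 3 would pass the same check.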

Expected Behavior
The code should successfully extract text and images from the PDF without errors.

Additional Information
This issue seems to occur with specific PDFs that may have unique formatting or image properties. I would appreciate any guidance on how to resolve this issue or if there are any workarounds available.
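One possible stopgap, sketched below, is to retry without image extraction when the ValueError is raised. This does not fix the underlying problem and drops any text that OCR would have recovered from images. The function name load_with_image_fallback and its make_loader parameter are hypothetical; the only loader behavior assumed is the extract_images keyword and load() method already used in the example code.

```python
from typing import Any, Callable, List


def load_with_image_fallback(make_loader: Callable[..., Any], path: str) -> List[Any]:
    """Try loading with image extraction; on ValueError, retry with text only.

    make_loader is any callable that accepts (path, extract_images=...) and
    returns an object with a load() method, for example a loader class.
    """
    try:
        return make_loader(path, extract_images=True).load()
    except ValueError as exc:
        # Assumption: the broadcast failure surfaces as ValueError, as in the
        # traceback in this report. Other exception types are not caught.
        print(f"Image extraction failed ({exc}); retrying without images.")
        return make_loader(path, extract_images=False).load()


# Usage with the loader from this report (requires langchain-community):
# from langchain_community.document_loaders import PyMuPDFLoader
# pages = load_with_image_fallback(PyMuPDFLoader, "google-2024-environmental-report.pdf")
```

Passing the loader class as an argument keeps the helper usable with both PyMuPDFLoader and PyPDFLoader.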

System Info

langchain==0.1.16
langchain-chroma==0.1.0
langchain-community==0.0.34
langchain-core==0.1.52
langchain-text-splitters==0.0.2

PyMuPDF Version: 1.24.10
PyPDF Version: 4.2.0
Operating System: Ubuntu 22 LTS

Labels

bug (Something isn't working), integration (For docs updates for LangChain integrations), langchain (For docs changes to LangChain)