Error when extracting images with PyMuPDFLoader and PyPDFLoader

TRANSFERRED BY @mdrxy
DOCS NOTE: SEE ATTACHED CLOSED PR THAT SHOULD BE APPLIED HERE

### Checked other resources

- [X] I added a very descriptive title to this issue.
- [x] I searched the LangChain documentation with the integrated search.
- [X] I used the GitHub search to find a similar question and didn't find it.
- [X] I am sure that this is a bug in LangChain rather than my code.
- [X] The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

### Example Code

1. Use the following code to load a PDF with image extraction enabled with PyMuPDFLoader:

```python
########################################
# PyMuPDFLoader
########################################
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("google-2024-environmental-report.pdf", extract_images=True)
pages = loader.load()

for page in pages:
    print(page.page_content)
```

2. Download the PDF located at: [Google 2024 Environmental Report](https://www.gstatic.com/gumdrop/sustainability/google-2024-environmental-report.pdf).
3. I also tried `PyPDFLoader` with the same PDF and encountered the same error.

### Error Message and Stack Trace (if applicable)

```
Traceback (most recent call last):
  File "/mnt/c/Users/BennisonJ/Yavar/projects/zypher-2.0/backend/apps/rag/main.py", line 811, in store_doc
    data = loader.load()
           ^^^^^^^^^^^^^
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/langchain_community/document_loaders/pdf.py", line 387, in load
    return list(self._lazy_load(**kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/langchain_community/document_loaders/pdf.py", line 384, in _lazy_load
    yield from parser.lazy_parse(blob)
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 244, in lazy_parse
    yield from [
               ^
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 247, in <listcomp>
    + self._extract_images_from_page(doc, page),
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 283, in _extract_images_from_page
    return extract_from_images_with_rapidocr(imgs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 74, in extract_from_images_with_rapidocr
    result, _ = ocr(img)
                ^^^^^^^^
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/rapidocr_onnxruntime/rapid_ocr_api.py", line 80, in __call__
    dt_boxes, det_elapse = self.text_detector(img)
                           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/rapidocr_onnxruntime/ch_ppocr_v3_det/text_detect.py", line 66, in __call__
    data = transform(data, self.preprocess_op)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/rapidocr_onnxruntime/ch_ppocr_v3_det/utils.py", line 220, in transform
    data = op(data)
           ^^^^^^^^
  File "/home/bennison/miniconda3/envs/open-webui/lib/python3.11/site-packages/rapidocr_onnxruntime/ch_ppocr_v3_det/utils.py", line 75, in __call__
    data['image'] = (img * self.scale - self.mean) / self.std
                     ~~~~~~~~~~~~~~~~~^~~~~~~~~~~
ValueError: operands could not be broadcast together with shapes (896,800) (1,1,3)
```
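
The last frame of the traceback can be reproduced with NumPy alone. This is a minimal sketch, assuming the image that reaches the OCR preprocessing step is a 2-D (single-channel) array and that `mean` and `std` are per-channel arrays of shape `(1, 1, 3)`. The constant values below are placeholders chosen for illustration, not the values `rapidocr_onnxruntime` actually uses; only the shapes matter.

```python
import numpy as np

# Assumption: a grayscale image decoded as a 2-D array, matching the
# (896, 800) shape reported in the traceback.
img = np.zeros((896, 800), dtype=np.uint8)

# Placeholder normalization constants; only their (1, 1, 3) shape matters.
scale = 1.0 / 255.0
mean = np.array([0.5, 0.5, 0.5], dtype=np.float32).reshape(1, 1, 3)
std = np.array([0.5, 0.5, 0.5], dtype=np.float32).reshape(1, 1, 3)


def normalize(image: np.ndarray) -> np.ndarray:
    """Mirror the expression from the failing frame of the traceback."""
    return (image * scale - mean) / std


# A 2-D array cannot broadcast against (1, 1, 3): the trailing axes are 800 and 3.
try:
    normalize(img)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# An H x W x 3 array broadcasts cleanly against (1, 1, 3).
rgb_result = normalize(np.zeros((896, 800, 3), dtype=np.uint8))
print(rgb_result.shape)
```

If this assumption holds, the failure is triggered by the image's channel layout rather than by anything specific to either loader.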

### Description

I am encountering a `ValueError` when using either `PyMuPDFLoader` or `PyPDFLoader` to extract images from certain PDFs. The error message is `operands could not be broadcast together with shapes (896,800) (1,1,3)`, and the traceback shows it is raised inside the `rapidocr_onnxruntime` image normalization step. This occurs specifically when the `extract_images` parameter is set to `True`.

**Expected Behavior**
The code should successfully extract text and images from the PDF without errors.

**Additional Information**
This issue seems to occur only with specific PDFs, possibly ones whose embedded images have unusual properties; the two-dimensional shape `(896,800)` in the error suggests a single-channel image. I would appreciate any guidance on how to resolve this, or any available workarounds.
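
One possible workaround, sketched here under assumptions: if the failing images are single-channel arrays, expanding them to three channels before they are passed to the OCR call should avoid the broadcast error. `to_three_channels` is a hypothetical helper written for this illustration; it is not part of LangChain or `rapidocr_onnxruntime`, and it has not been verified against the PDF in this report.

```python
import numpy as np


def to_three_channels(img: np.ndarray) -> np.ndarray:
    """Return an H x W x 3 array for a decoded image (hypothetical helper).

    Handles the layouts most likely to appear after decoding:
    H x W (grayscale), H x W x 1, H x W x 3, and H x W x 4 (with alpha).
    """
    if img.ndim == 2:
        # Grayscale: replicate the single channel three times.
        return np.stack([img, img, img], axis=-1)
    if img.ndim == 3 and img.shape[2] == 1:
        # Grayscale with an explicit channel axis.
        return np.repeat(img, 3, axis=2)
    if img.ndim == 3 and img.shape[2] == 4:
        # Drop the alpha channel and keep the first three channels.
        return img[:, :, :3]
    # Already H x W x 3 (or an unexpected layout): return unchanged.
    return img


gray = np.zeros((896, 800), dtype=np.uint8)
print(to_three_channels(gray).shape)
```

Using it would mean patching the point where each image array is prepared before `ocr(img)` is called in `extract_from_images_with_rapidocr`. Until a fix lands, loading with `extract_images=False` avoids the OCR path entirely, at the cost of losing text contained in images.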


### System Info

```
langchain==0.1.16
langchain-chroma==0.1.0
langchain-community==0.0.34
langchain-core==0.1.52
langchain-text-splitters==0.0.2

PyMuPDF Version: 1.24.10
PyPDF Version: 4.2.0
Operating System: Ubuntu 22 LTS
```


Error when extracting images with PyMuPDFLoader and PyPDFLoader #294
