When using langchain-community, some PDF images will report errors during OCR #22892

wwbrave002 · 2024-06-14T11:02:04Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import warnings
from typing import Iterator, Optional, Union

import numpy as np
from langchain_community.document_loaders.base import BaseBlobParser
from langchain_community.document_loaders.blob_loaders import Blob
from langchain_core.documents import Document

# Helpers defined in the module that the original PyPDFParser lives in.
from langchain_community.document_loaders.parsers.pdf import (
    _PDF_FILTER_WITH_LOSS,
    _PDF_FILTER_WITHOUT_LOSS,
    extract_from_images_with_rapidocr,
)


class PyPDFParser(BaseBlobParser):
    """Load PDF using pypdf"""

    def __init__(
        self, password: Optional[Union[str, bytes]] = None, extract_images: bool = False
    ):
        self.password = password
        self.extract_images = extract_images

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:  # type: ignore[valid-type]
        """Lazily parse the blob."""
        import pypdf

        # Keep the blob so the CCITTFaxDecode fallback can reopen the PDF.
        self.pdf_blob = blob

        with blob.as_bytes_io() as pdf_file_obj:  # type: ignore[attr-defined]
            pdf_reader = pypdf.PdfReader(pdf_file_obj, password=self.password)
            yield from [
                Document(
                    page_content=page.extract_text()
                    + self._extract_images_from_page(page),
                    metadata={"source": blob.source, "page": page_number},  # type: ignore[attr-defined]
                )
                for page_number, page in enumerate(pdf_reader.pages)
            ]

    def _render_page_with_fitz(self, page: "pypdf._page.PageObject") -> bytes:
        """Render the whole page to grayscale image bytes with PyMuPDF."""
        import fitz

        with self.pdf_blob.as_bytes_io() as pdf_file_obj:  # type: ignore[attr-defined]
            with fitz.open("pdf", pdf_file_obj.read()) as doc:
                # pypdf pages expose their index as page_number, not number.
                pix = doc.load_page(page.page_number).get_pixmap(
                    matrix=fitz.Matrix(1, 1), colorspace=fitz.csGRAY
                )
                return pix.tobytes()

    def _extract_images_from_page(self, page: "pypdf._page.PageObject") -> str:
        """Extract images from page and get the text with RapidOCR."""
        if not self.extract_images or "/XObject" not in page["/Resources"].keys():
            return ""

        xObject = page["/Resources"]["/XObject"].get_object()  # type: ignore
        images = []
        for obj in xObject:
            if xObject[obj]["/Subtype"] != "/Image":
                continue
            pdf_filter = xObject[obj].get("/Filter")
            if not pdf_filter:
                warnings.warn("Can Not Find PDF Filter!")
                continue
            # "/Filter" is either a single name or a list of names.
            if isinstance(pdf_filter, str):
                filters = [pdf_filter]
            elif isinstance(pdf_filter, list):
                filters = pdf_filter
            else:
                filters = []
            for xObject_filter in filters:
                name = xObject_filter[1:]  # drop the leading "/"
                if name in _PDF_FILTER_WITHOUT_LOSS:
                    height, width = xObject[obj]["/Height"], xObject[obj]["/Width"]
                    try:
                        images.append(
                            np.frombuffer(xObject[obj].get_data(), dtype=np.uint8).reshape(
                                height, width, -1
                            )
                        )
                    except Exception:
                        if name == "CCITTFaxDecode":
                            # Added handling: OCR a rendering of the page instead.
                            images.append(self._render_page_with_fitz(page))
                        else:
                            warnings.warn(f"Reshape Error: {xObject[obj]}")
                    break
                elif name in _PDF_FILTER_WITH_LOSS:
                    images.append(xObject[obj].get_data())
                    break
                else:
                    warnings.warn(f"Unknown PDF Filter: {name}")
        return extract_from_images_with_rapidocr(images)

Error Message and Stack Trace (if applicable)

No response

Description

When I use langchain-community, some images embedded in PDFs raise errors during OCR. I worked around the problem by adding extra handling to the PyPDFParser class from the source code: when reshaping the decoded image data fails for a CCITTFaxDecode image, the page is rendered with PyMuPDF (fitz) and that rendering is passed to OCR instead. Could the maintainers consider including this change in a future release? The complete modified PyPDFParser class is shown in Example Code.
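No stack trace is given, so the following is only a sketch of one way the reshape step in the parser can fail. It assumes the error is a ValueError raised because the decoded buffer does not hold one byte (or a whole number of bytes) per pixel, as would happen with 1-bit fax-style data. The helper `try_reshape` and the synthetic buffers are hypothetical and are not part of langchain-community.

```python
import numpy as np

height, width = 4, 16


def try_reshape(data: bytes, height: int, width: int):
    """Mimic the parser's reshape; return None when the buffer does not fit."""
    try:
        return np.frombuffer(data, dtype=np.uint8).reshape(height, width, -1)
    except ValueError:
        return None


# Hypothetical 1-bit-per-pixel image: each 16-pixel row is packed into 2 bytes,
# so the buffer holds 8 bytes rather than height * width = 64.
packed = np.zeros((height, width // 8), dtype=np.uint8).tobytes()
failed = try_reshape(packed, height, width)

# An 8-bit grayscale buffer has one byte per pixel and reshapes cleanly.
gray = np.zeros((height, width), dtype=np.uint8).tobytes()
ok = try_reshape(gray, height, width)

print(failed is None, None if ok is None else ok.shape)
```

If this is what happens, rendering the page with a PDF renderer sidesteps the raw byte layout entirely, at the cost of running OCR on the full page rather than on the single image.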

System Info

langchain-community==0.2.4

dosubot bot added the 🤖:bug label Jun 14, 2024

Avgor46 mentioned this issue Jul 19, 2024

Using PyPDFLoader causes a crash #24439

dosubot bot added the stale label Sep 13, 2024

dosubot bot closed this as not planned Sep 20, 2024

dosubot bot removed the stale label Sep 20, 2024
