When using langchain-community, some PDF images will report errors during OCR #22892

Closed
5 tasks done
wwbrave002 opened this issue Jun 14, 2024 · 0 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import warnings
from typing import Iterator, Optional, Union

import numpy as np
import pypdf
from langchain_core.documents import Document

# Import paths below are assumed to match langchain-community 0.2.4.
from langchain_community.document_loaders.base import BaseBlobParser
from langchain_community.document_loaders.blob_loaders import Blob
from langchain_community.document_loaders.parsers.pdf import (
    _PDF_FILTER_WITH_LOSS,
    _PDF_FILTER_WITHOUT_LOSS,
    extract_from_images_with_rapidocr,
)


class PyPDFParser(BaseBlobParser):
    """Load PDF using pypdf"""

    def __init__(
        self, password: Optional[Union[str, bytes]] = None, extract_images: bool = False
    ):
        self.password = password
        self.extract_images = extract_images

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:  # type: ignore[valid-type]
        """Lazily parse the blob."""
        # Keep the blob so the PyMuPDF fallback can re-open the PDF later.
        self.pdf_blob = blob

        with blob.as_bytes_io() as pdf_file_obj:  # type: ignore[attr-defined]
            pdf_reader = pypdf.PdfReader(pdf_file_obj, password=self.password)
            yield from [
                Document(
                    page_content=page.extract_text()
                    + self._extract_images_from_page(page),
                    metadata={"source": blob.source, "page": page_number},  # type: ignore[attr-defined]
                )
                for page_number, page in enumerate(pdf_reader.pages)
            ]

    def _extract_images_from_page(self, page: pypdf.PageObject) -> str:
        """Extract images from page and get the text with RapidOCR."""
        if not self.extract_images or "/XObject" not in page["/Resources"].keys():
            return ""

        xObject = page["/Resources"]["/XObject"].get_object()  # type: ignore
        images = []
        for obj in xObject:
            if xObject[obj]["/Subtype"] != "/Image":
                continue
            pdf_filter = xObject[obj].get("/Filter")
            if not pdf_filter:
                warnings.warn("Can Not Find PDF Filter!")
                continue
            # "/Filter" is either a single name or a list of names; handle both alike.
            filters = [pdf_filter] if isinstance(pdf_filter, str) else pdf_filter
            for xObject_filter in filters:
                filter_name = xObject_filter[1:]  # drop the leading "/"
                if filter_name in _PDF_FILTER_WITHOUT_LOSS:
                    height, width = xObject[obj]["/Height"], xObject[obj]["/Width"]
                    try:
                        images.append(
                            np.frombuffer(
                                xObject[obj].get_data(), dtype=np.uint8
                            ).reshape(height, width, -1)
                        )
                    except Exception:
                        if filter_name == "CCITTFaxDecode":
                            # Fallback added for this issue: render the whole page
                            # with PyMuPDF and pass the PNG bytes to OCR instead.
                            import fitz  # PyMuPDF

                            with self.pdf_blob.as_bytes_io() as pdf_file_obj:  # type: ignore[attr-defined]
                                with fitz.open("pdf", pdf_file_obj.read()) as doc:
                                    pix = doc.load_page(page.page_number).get_pixmap(
                                        matrix=fitz.Matrix(1, 1),
                                        colorspace=fitz.csGRAY,
                                    )
                                    images.append(pix.tobytes())
                        else:
                            warnings.warn(f"Reshape Error: {xObject[obj]}")
                    break
                elif filter_name in _PDF_FILTER_WITH_LOSS:
                    images.append(xObject[obj].get_data())
                    break
                else:
                    warnings.warn(f"Unknown PDF Filter: {filter_name}")
        return extract_from_images_with_rapidocr(images)
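
The filter handling in the class above has to cope with "/Filter" being either a single name or a list of names. A minimal standalone sketch of that dispatch follows. The names `LOSSLESS`, `LOSSY`, `filter_names`, and `classify` are hypothetical, and the two sets are assumed copies of the filter lists in langchain-community, so check them against the installed version.

```python
from typing import List, Optional, Union

# Assumed copies of langchain-community's filter lists (verify against your version).
LOSSLESS = {
    "LZWDecode", "LZW", "FlateDecode", "Fl", "ASCII85Decode", "A85",
    "ASCIIHexDecode", "AHx", "RunLengthDecode", "RL",
    "CCITTFaxDecode", "CCF", "JBIG2Decode",
}
LOSSY = {"DCTDecode", "DCT", "JPXDecode"}


def filter_names(value: Optional[Union[str, List[str]]]) -> List[str]:
    """Return filter names without the leading "/" for a name or a list of names."""
    if value is None:
        return []
    items = [value] if isinstance(value, str) else list(value)
    return [item[1:] if item.startswith("/") else item for item in items]


def classify(value: Optional[Union[str, List[str]]]) -> str:
    """Classify by the first recognised filter, like the loop-and-break logic."""
    for name in filter_names(value):
        if name in LOSSLESS:
            return "lossless"
        if name in LOSSY:
            return "lossy"
    return "unknown"
```

Normalising to a list first means the per-filter checks only need to be written once, for both the single-name and the list case.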

Error Message and Stack Trace (if applicable)

No response

Description

When I use langchain-community, OCR fails with errors on the images in some PDFs. I added some handling to the PyPDFParser class from the source code, which works around the problem for now. Could the maintainers consider including this code in a future release? The complete modified PyPDFParser class is shown in Example Code.
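
The issue does not include a stack trace, so the following is only one plausible explanation for the failure, not a confirmed diagnosis. `reshape(height, width, -1)` on a `uint8` buffer needs a byte count that is a multiple of `height * width`, whereas a 1-bit-per-pixel image (the usual case for CCITT fax data) packs eight pixels into each byte. A minimal sketch of the arithmetic, using the hypothetical helper names `decoded_size` and `can_reshape`:

```python
import math


def decoded_size(width: int, height: int, bits_per_component: int = 8,
                 components: int = 1) -> int:
    """Bytes in a decoded PDF image stream; each row is padded to a whole byte."""
    row_bytes = math.ceil(width * components * bits_per_component / 8)
    return row_bytes * height


def can_reshape(n_bytes: int, height: int, width: int) -> bool:
    """True if a uint8 buffer of n_bytes can be reshaped to (height, width, -1)."""
    pixels = height * width
    return pixels > 0 and n_bytes > 0 and n_bytes % pixels == 0


# An 8-bit RGB image of 16 x 8 pixels reshapes cleanly into three channels.
rgb_bytes = decoded_size(16, 8, bits_per_component=8, components=3)
# A 1-bit image of the same dimensions has far fewer bytes than pixels.
fax_bytes = decoded_size(16, 8, bits_per_component=1)
print(rgb_bytes, can_reshape(rgb_bytes, 8, 16))
print(fax_bytes, can_reshape(fax_bytes, 8, 16))
```

If this is the cause, rendering the page to a pixmap, as the modified class does, sidesteps the mismatch because the OCR step receives an already rasterised image.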

System Info

langchain-community==0.2.4

@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Jun 14, 2024
@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Sep 13, 2024
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 20, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Sep 20, 2024