Integrate better PDF Loaders - PDFMiner, Textract, Azure Document Intelligence #810

Open
ishan00 opened this issue Jun 10, 2024 · 3 comments
Labels
upgrade New feature or request

Comments

@ishan00

ishan00 commented Jun 10, 2024

I looked through the code, and the PDF loader currently used is PyMuPDF. Among the free libraries, PDFMiner works better than PyMuPDF and PyPDF, so it would be good to have it. Additionally, handwritten or scanned documents require an OCR engine, which none of the above loaders support. LangChain has integrated Textract and Azure Document Intelligence loaders for the OCR use case, and it would be nice to have them in Khoj as well.

Happy to integrate if it's part of the plan.
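
As a rough sketch of what a pluggable loader setup could look like: the registry, the decorator, and the function names below are hypothetical and not Khoj's actual code. It assumes PyMuPDF (`fitz`) and pdfminer.six are optional dependencies, so each is imported only when its loader is called.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a loader name to a function that returns a PDF's text.
PDF_LOADERS: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that adds a text-extraction function to PDF_LOADERS."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        PDF_LOADERS[name] = fn
        return fn
    return wrap


@register("pymupdf")
def load_with_pymupdf(path: str) -> str:
    import fitz  # PyMuPDF, imported lazily so it stays optional

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


@register("pdfminer")
def load_with_pdfminer(path: str) -> str:
    from pdfminer.high_level import extract_text  # pdfminer.six, imported lazily

    return extract_text(path)


def get_loader(name: str) -> Callable[[str], str]:
    """Look up a loader by name, failing loudly on unknown names."""
    try:
        return PDF_LOADERS[name]
    except KeyError:
        raise ValueError(
            f"Unknown PDF loader {name!r}; choose from {sorted(PDF_LOADERS)}"
        ) from None
```

With something like this, `get_loader("pdfminer")("paper.pdf")` would return the extracted text, and an OCR-backed loader could be registered the same way without touching the callers.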

@ishan00 ishan00 added the upgrade New feature or request label Jun 10, 2024
@MythicalCow
Contributor

Hi Ishan, this is something we are thinking about. If you would like to work on this, I can take a look at your PR.

@sandesh0202

I would like to work on integrating this feature.

@debanjum
Member

debanjum commented Jul 1, 2024

Hi @ishan00, can you clarify for which use cases you find PDFMiner works better than PyMuPDF?

PyMuPDF does support OCR when used with the RapidOCR library. So Khoj can handle PDFs with scans or handwritten content.

Of course, a local OCR library may not be as good as an online service like Azure Document Intelligence. If so, we could add support for a better online OCR service when one is configured (e.g. when AZURE_API_KEY is set), while falling back to a local OCR library by default.
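
A minimal sketch of that selection logic, for illustration only. The function names and the `backends` mapping are hypothetical, and the actual OCR calls (Azure Document Intelligence or a local library) are left to whatever callables get passed in.

```python
import os
from typing import Callable, Mapping, Optional


def choose_ocr_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "azure" when AZURE_API_KEY is set and non-empty, else "local"."""
    env = os.environ if env is None else env
    return "azure" if env.get("AZURE_API_KEY", "").strip() else "local"


def extract_with_ocr(
    path: str,
    backends: Mapping[str, Callable[[str], str]],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Run the selected backend; `backends` needs "azure" and "local" entries."""
    return backends[choose_ocr_backend(env)](path)
```

Passing `env` explicitly keeps the choice easy to test. In normal use it would be omitted so the process environment decides.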
