Issue reading PDFS #88

Closed
focusai opened this issue Mar 30, 2023 · 4 comments

Comments

focusai commented Mar 30, 2023

Just to highlight that some PDFs simply aren't read during the ingest process. So if you think ingestion isn't working, the cause could actually be the PDF itself. Possibly it's just the OCR step failing? Does anyone know of a solution to this, or what would cause certain PDF files to be unreadable?

Thanks

@bschleter

I had this issue too; the files wouldn't even ingest. I got a .DS_Store error, so I deleted a few PDFs that weren't necessarily viewable, which let me ingest and solved the issue. There might be an issue with the PDF reader. I think the initial repo worked well because its PDFs were so simple, mostly text, but the PDF reader seems to struggle with PDFs that contain images and other elements.
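A minimal sketch of a pre-ingest check along these lines, assuming the documents sit in a single local folder. It skips hidden entries such as .DS_Store and flags files that do not start with the `%PDF-` signature. The helper names `looksLikePdf` and `listIngestablePdfs` are hypothetical and not part of the repository; only Node's built-in `fs` and `path` modules are used.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper: true when the file begins with the "%PDF-" signature.
// Some viewers tolerate a few leading bytes before the signature; this check is stricter.
function looksLikePdf(filePath: string): boolean {
  const fd = fs.openSync(filePath, "r");
  try {
    const header = Buffer.alloc(5);
    const bytesRead = fs.readSync(fd, header, 0, 5, 0);
    return bytesRead === 5 && header.toString("latin1") === "%PDF-";
  } finally {
    fs.closeSync(fd);
  }
}

// Hypothetical helper: split a folder into files worth ingesting and entries to skip.
function listIngestablePdfs(dir: string): { pdfs: string[]; skipped: string[] } {
  const pdfs: string[] = [];
  const skipped: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    const isHidden = entry.name.startsWith("."); // e.g. .DS_Store on macOS
    const isPdfName = path.extname(entry.name).toLowerCase() === ".pdf";
    if (entry.isFile() && !isHidden && isPdfName && looksLikePdf(fullPath)) {
      pdfs.push(fullPath);
    } else {
      skipped.push(fullPath);
    }
  }
  return { pdfs, skipped };
}
```

Passing only the `pdfs` list to the loader, and logging `skipped`, would at least show which files were left out. A file that passes this check can still be corrupted further in, or be a scanned document with no text layer.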

Author

focusai commented Mar 30, 2023

Makes sense, thanks for sharing

Owner

mayooear commented Apr 1, 2023

Yes, good point. PDFs that depend on OCR, corrupted or non-viewable PDFs, and scanned PDFs are very difficult to read, so they are another cause of errors.
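One way to keep a single unreadable file from stopping the whole run is to load each PDF independently and record which ones fail. The sketch below is only an illustration under assumptions: `ingestAll` and `IngestReport` are hypothetical names, and `loadOne` stands in for whatever loader the project actually calls to turn a file into text. Files that load but yield no text are reported separately, since that is a common symptom of a scanned PDF without a text layer.

```typescript
// Hypothetical report shape: which files produced text and which did not.
interface IngestReport {
  loaded: { file: string; text: string }[];
  failed: { file: string; reason: string }[];
}

// `loadOne` is a placeholder for the project's real PDF loader (an assumption here).
async function ingestAll(
  files: string[],
  loadOne: (file: string) => Promise<string>,
): Promise<IngestReport> {
  const report: IngestReport = { loaded: [], failed: [] };
  for (const file of files) {
    try {
      const text = await loadOne(file);
      if (text.trim().length === 0) {
        // Loaded without error but produced nothing, e.g. an image-only scan.
        report.failed.push({ file, reason: "no extractable text (possibly scanned, may need OCR)" });
      } else {
        report.loaded.push({ file, text });
      }
    } catch (err) {
      // A corrupted file throws here; record it and continue with the rest.
      const reason = err instanceof Error ? err.message : String(err);
      report.failed.push({ file, reason });
    }
  }
  return report;
}
```

Printing `report.failed` after a run gives a list of the specific PDFs to inspect, rather than having to delete files by trial and error.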

dosubot bot commented Sep 23, 2023

Hi, @focusai! I'm Dosu, and I'm helping the gpt4-pdf-chatbot-langchain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, you were experiencing problems with certain PDF files not being read during the ingest process. It seems that deleting PDFs that aren't necessarily viewable has solved the issue for some users. Additionally, bschleter mentioned that the PDF reader may struggle with PDFs containing images and other elements. mayooear also mentioned that OCR, corrupted/non-viewable PDFs, and scanned PDFs can cause errors.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the gpt4-pdf-chatbot-langchain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the gpt4-pdf-chatbot-langchain project!

dosubot bot added the stale label Sep 23, 2023
dosubot bot closed this as not planned Sep 30, 2023
dosubot bot removed the stale label Sep 30, 2023