-
Notifications
You must be signed in to change notification settings - Fork 427
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Added reader utilities for PDF files #8
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
Overall, I added comments to improve the PR, but we must add the new requirements and proper unit tests for all of this to avoid a drop in coverage!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just left a few additional comments with your latest modifications
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there! We'll need to add some unittests as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there, check my comments ;)
Additionally, for imports, add a init.py in the documents folder, and in this file, add the following: "from .reader import *"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you just remove the tensorflow dependency since it's not used by the library currently please?
And I added a small comment for memory optimization but nothing major!
imgs, names = convert_pdf_pages_to_imgs( | ||
pdf=pdf, filename=filename, page_idxs=None, num_pixels=num_pixels) | ||
return imgs, names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quick trick, let's avoid the potential memory copy by returning directly the result of convert_pdf_pages_to_imgs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some trailing whitespaces to be removed and we're good to go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the edits and the decoding investigation!
@charlesmindee I just noticed: the tensorflow dependency was wrongly removed in this PR |
pdf doc reader module as described in the issue : [documents] Add basic document reader #1