feat: Added reader utilities for PDF files #8

Merged on Jan 14, 2021 (17 commits)

Conversation

charlesmindee (Collaborator)

PDF document reader module, as described in issue #1: [documents] Add basic document reader

@fg-mindee reopened this on Jan 12, 2021
@fg-mindee added this to the 0.1.0 milestone on Jan 12, 2021
@fg-mindee added the labels "type: enhancement" (Improvement) and "module: io" (Related to doctr.io) on Jan 12, 2021
@fg-mindee (Contributor) left a comment

Thanks for the PR!
Overall, I added comments to improve the PR, but we must add the new requirements and proper unit tests for all of this to avoid a drop in coverage!

doctr/documents/reader.py (resolved)
doctr/documents/reader.py (outdated, resolved)
doctr/documents/reader.py (outdated, resolved)
doctr/documents/reader.py (outdated, resolved)
doctr/documents/reader.py (outdated, resolved)
doctr/documents/reader.py (outdated, resolved)
doctr/documents/reader.py (outdated, resolved)
doctr/documents/reader.py (outdated, resolved)
doctr/documents/reader.py (outdated, resolved)
doctr/documents/reader.py (outdated, resolved)
@fg-mindee (Contributor) left a comment

Just left a few additional comments on your latest modifications.

requirements.txt (outdated, resolved)
requirements.txt (outdated, resolved)
setup.py (outdated, resolved)
setup.py (outdated, resolved)
doctr/documents/reader.py (outdated, resolved)
doctr/documents/reader.py (outdated, resolved)
@fg-mindee (Contributor) left a comment

Almost there! We'll need to add some unit tests as well.

doctr/documents/reader.py (outdated, resolved)
setup.py (resolved)
setup.py (outdated, resolved)
requirements.txt (outdated, resolved)
requirements.txt (resolved)
@fg-mindee (Contributor) left a comment

Almost there, check my comments ;)
Additionally, for imports, add an __init__.py in the documents folder, and in this file, add the following: "from .reader import *"
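
To illustrate the re-export suggested above, here is a minimal sketch that builds a throwaway package on disk and imports it. The package name demo_documents and the function read_pdf are hypothetical stand-ins, not the actual doctr layout or API; only the "from .reader import *" line comes from the review comment.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package in a temp directory.
# "demo_documents" and "read_pdf" are hypothetical names for illustration only.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_documents"
pkg.mkdir()

# reader.py: __all__ decides which names a star import re-exports.
(pkg / "reader.py").write_text(
    "__all__ = ['read_pdf']\n"
    "\n"
    "def read_pdf(path):\n"
    "    return f'would read {path}'\n"
    "\n"
    "def _private_helper():\n"
    "    return None\n"
)

# __init__.py: the line suggested in the review comment.
(pkg / "__init__.py").write_text("from .reader import *\n")

# Make the temp directory importable, then import the package.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
demo_documents = importlib.import_module("demo_documents")

# read_pdf is now reachable from the package itself, not only from .reader.
print(demo_documents.read_pdf("sample.pdf"))
```

Defining __all__ in the reader module keeps the star import from leaking helper names into the package namespace; without it, every public (non-underscore) name in the module would be re-exported.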

doctr/documents/reader.py (outdated, resolved)
doctr/documents/reader.py (outdated, resolved)
test/test_documents.py (outdated, resolved)
doctr/documents/reader.py (resolved)
@fg-mindee (Contributor) left a comment

Could you just remove the tensorflow dependency since it's not used by the library currently please?
And I added a small comment for memory optimization but nothing major!

Comment on lines +84 to +86
imgs, names = convert_pdf_pages_to_imgs(
    pdf=pdf, filename=filename, page_idxs=None, num_pixels=num_pixels)
return imgs, names

Quick trick: let's avoid the potential memory copy by directly returning the result of convert_pdf_pages_to_imgs.
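
As a rough sketch of this suggestion, the two variants can be compared side by side. The body of convert_pdf_pages_to_imgs below is a hypothetical stand-in that returns placeholder strings, since the real helper is not shown in this conversation; the wrapper names read_with_unpacking and read_direct are also invented for the comparison.

```python
# Hypothetical stand-in for the helper named in the review; the real doctr
# function may have a different signature and return types.
def convert_pdf_pages_to_imgs(pdf, filename, page_idxs=None, num_pixels=None):
    idxs = range(len(pdf)) if page_idxs is None else page_idxs
    imgs = [f"<image of page {i}>" for i in idxs]
    names = [f"{filename}-p{i}" for i in idxs]
    return imgs, names


def read_with_unpacking(pdf, filename, num_pixels=None):
    # Variant from the diff: unpack into two names, then repack into a tuple.
    imgs, names = convert_pdf_pages_to_imgs(
        pdf=pdf, filename=filename, page_idxs=None, num_pixels=num_pixels)
    return imgs, names


def read_direct(pdf, filename, num_pixels=None):
    # Suggested variant: forward the helper's result as-is.
    return convert_pdf_pages_to_imgs(
        pdf=pdf, filename=filename, page_idxs=None, num_pixels=num_pixels)


pages = ["page-a", "page-b"]  # placeholder for a loaded PDF
print(read_direct(pages, "doc") == read_with_unpacking(pages, "doc"))
```

Both variants return equal results. In CPython the unpacking version only rebuilds a small two-element tuple and does not copy the page images themselves, so the benefit is mainly shorter code with fewer intermediate names.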

requirements.txt (outdated, resolved)
setup.py (outdated, resolved)
@fg-mindee (Contributor) left a comment

Some trailing whitespace to remove, and we're good to go.

doctr/documents/reader.py (outdated, resolved)
doctr/documents/reader.py (outdated, resolved)
@fg-mindee (Contributor) left a comment

Thanks for the edits and the decoding investigation!

@fg-mindee changed the title from "Pdf doc reader" to "feat: Added reader utilities for PDF files" on Jan 14, 2021
@charlesmindee merged commit 634c160 into main on Jan 14, 2021
@charlesmindee deleted the pdf_doc_reader branch on January 14, 2021 12:04
@fg-mindee (Contributor)

@charlesmindee I just noticed that the tensorflow dependency was wrongly removed in this PR.
We need to add it back.

Labels: module: io (Related to doctr.io), type: enhancement (Improvement)
2 participants