feat: Added reader utilities for PDF files #8

charlesmindee · 2021-01-12T09:20:36Z

PDF document reader module, as described in issue #1: [documents] Add basic document reader

fg-mindee

Thanks for the PR!
Overall, I added comments to improve the PR, but we must add the new requirements and proper unit tests for all of this to avoid a drop in coverage!

doctr/documents/reader.py

fg-mindee

Just left a few additional comments with your latest modifications

requirements.txt

setup.py

doctr/documents/reader.py

fg-mindee

Almost there! We'll need to add some unit tests as well

doctr/documents/reader.py

setup.py

requirements.txt

doctr/documents/reader.py

fg-mindee

Almost there, check my comments ;)
Additionally, for imports, add an __init__.py in the documents folder, and in this file, add the following: "from .reader import *"
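
The package-level re-export requested above can be sketched as follows. This is a minimal, self-contained illustration rather than the PR's actual code: the package name `documents_demo` and the function `read_pdf` are hypothetical stand-ins, created in a temporary directory so the snippet runs anywhere.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical package layout mirroring documents/reader.py
    pkg = Path(tmp) / "documents_demo"
    pkg.mkdir()
    (pkg / "reader.py").write_text(
        "__all__ = ['read_pdf']\n\n"
        "def read_pdf(path):\n"
        "    return f'would read {path}'\n"
    )
    # The one-line __init__.py suggested in the review
    (pkg / "__init__.py").write_text("from .reader import *\n")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        demo = importlib.import_module("documents_demo")
        # read_pdf is reachable from the package, not only from the submodule
        result = demo.read_pdf("sample.pdf")
    finally:
        # Clean up so the temporary package does not leak into later imports
        sys.path.remove(tmp)
        sys.modules.pop("documents_demo", None)
        sys.modules.pop("documents_demo.reader", None)

print(result)
```

Defining `__all__` in the submodule keeps the wildcard import from exposing helper names that are not meant to be public.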

doctr/documents/reader.py

test/test_documents.py

doctr/documents/reader.py

fg-mindee

Could you just remove the tensorflow dependency since it's not used by the library currently please?
And I added a small comment for memory optimization but nothing major!

fg-mindee · 2021-01-13T15:50:15Z

doctr/documents/reader.py

+    imgs, names = convert_pdf_pages_to_imgs(
+        pdf=pdf, filename=filename, page_idxs=None, num_pixels=num_pixels)
+    return imgs, names


Quick trick: let's avoid the potential memory copy by directly returning the result of convert_pdf_pages_to_imgs
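
The suggestion above can be sketched as follows. This is an illustration under stated assumptions: `convert_pages_stub` is a hypothetical stand-in for `convert_pdf_pages_to_imgs` (whose real body is not shown here), and the wrapper names `read_pdf_two_step` and `read_pdf_direct` are invented for the comparison.

```python
from typing import List, Optional, Tuple


def convert_pages_stub(
    pdf: bytes,
    filename: str,
    page_idxs: Optional[List[int]] = None,
    num_pixels: Optional[int] = None,
) -> Tuple[List[bytes], List[str]]:
    """Hypothetical stand-in: returns fake page images and page names."""
    n_pages = 2 if page_idxs is None else len(page_idxs)
    imgs = [pdf for _ in range(n_pages)]
    names = [f"{filename}-p{i}" for i in range(n_pages)]
    return imgs, names


def read_pdf_two_step(pdf: bytes, filename: str, num_pixels: Optional[int] = None):
    # Pattern from the diff: unpack into locals, then build a new tuple
    imgs, names = convert_pages_stub(
        pdf=pdf, filename=filename, page_idxs=None, num_pixels=num_pixels)
    return imgs, names


def read_pdf_direct(pdf: bytes, filename: str, num_pixels: Optional[int] = None):
    # Suggested pattern: hand back the callee's result as-is
    return convert_pages_stub(
        pdf=pdf, filename=filename, page_idxs=None, num_pixels=num_pixels)


imgs, names = read_pdf_direct(b"%PDF-", "doc")
print(names)
```

Both wrappers return equal results. In CPython the two-step version only rebinds references and builds a new two-element tuple, so the direct return is mostly a gain in concision.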

requirements.txt

setup.py

fg-mindee

Some trailing whitespace to remove and we're good to go

doctr/documents/reader.py

fg-mindee

Thanks for the edits and the decoding investigation!

fg-mindee · 2021-01-18T12:50:04Z

@charlesmindee I just noticed: the tensorflow dependency was wrongly removed in this PR
We need to add it back

charlesmindee added 5 commits January 11, 2021 15:34

feat: ✨ pdf reader

5329fcf

feat: ✨ add doc_to_string function

d160e9d

feat: ✨ add document reader

e7f5327

Merge branch 'main' into pdf_doc_reader

24c843f

CI_fix : 💚 changed code for flake8 CI

fbd5662

charlesmindee closed this Jan 12, 2021

fg-mindee reopened this Jan 12, 2021

fg-mindee added this to the 0.1.0 milestone Jan 12, 2021

fg-mindee added the "type: enhancement" (Improvement) and "module: io" (Related to doctr.io) labels Jan 12, 2021

fg-mindee assigned charlesmindee Jan 12, 2021

CI_fix : 💚 changed code for flake8 CI

683fbe8

charlesmindee requested a review from fg-mindee January 12, 2021 10:03

fg-mindee suggested changes Jan 12, 2021

typing: 🏷️ add typing + refactor

863da32

fg-mindee reviewed Jan 12, 2021

charlesmindee added 2 commits January 12, 2021 16:33

typing: 🏷️ add typing + refactor

bdcaf4d

Merge branch 'main' into pdf_doc_reader

90c2040

fg-mindee reviewed Jan 12, 2021

doctr/documents/reader.py (outdated, resolved)

setup.py (resolved)

setup.py (outdated, resolved)

requirements.txt (outdated, resolved)

requirements.txt (resolved)

fg-mindee reviewed Jan 12, 2021

doctr/documents/reader.py (outdated, resolved)

fg-mindee mentioned this pull request Jan 12, 2021

[documents] Add basic document reader #1

Closed

3 tasks

charlesmindee added 4 commits January 13, 2021 11:58

test: ::white_check_mark: add unit test for docs

43c4625

test: ::white_check_mark: add unit test for docs

cf6f3e5

test: ::white_check_mark: add unit test for docs

f9651b8

test: ::white_check_mark: add unit test for docs

271a095

fg-mindee suggested changes Jan 13, 2021

doctr/documents/reader.py (outdated, resolved)

doctr/documents/reader.py (outdated, resolved)

test/test_documents.py (outdated, resolved)

doctr/documents/reader.py (resolved)

charlesmindee added 2 commits January 13, 2021 15:20

refacto: ♻️ remove magic + test passed

ac351c7

resolved conflicts

c712c37

fg-mindee suggested changes Jan 13, 2021

fix: 🐛 typing

79c81fb

fg-mindee suggested changes Jan 13, 2021

doctr/documents/reader.py (outdated, resolved)

doctr/documents/reader.py (outdated, resolved)

fix: 🐛 np.float32 for bytestrings

6e7f1e2

fg-mindee approved these changes Jan 14, 2021

fg-mindee changed the title ~~Pdf doc reader~~ feat: Added reader utilities for PDF files Jan 14, 2021

charlesmindee merged commit 634c160 into main Jan 14, 2021

charlesmindee deleted the pdf_doc_reader branch January 14, 2021 12:04

fg-mindee mentioned this pull request Jan 14, 2021

docs: Added sphinx built documentation #12

Merged

fg-mindee mentioned this pull request Jan 18, 2021

chore: Fixed wrong dep removal #21

Merged
