Detection module : Preprocessor #20

charlesmindee · 2021-01-15T15:15:35Z

Preprocessor module to batch, normalize, and resize documents before they are fed to the model.

charlesmindee · 2021-01-15T15:16:28Z

doctr/models/detection/__init__.py

@@ -0,0 +1,2 @@
+from . import differentiable_binarization


Sorry, this file is deprecated.

fg-mindee

Thanks for the PR! I added a few comments. Additionally, since the preprocessor is in the models' module, could you move your tests to test/test_models.py please?

fg-mindee · 2021-01-15T17:35:27Z

doctr/models/preprocessor.py

+from typing import Union, List, Tuple, Optional, Any, Dict
+
+
+class Preprocessor():


You can remove the parentheses here.

fg-mindee · 2021-01-15T17:37:07Z

doctr/models/preprocessor.py

+        documents: Tuple[List[List[np.ndarray]], List[List[str]], List[List[Tuple[int, int]]]]
+    ) -> List[Tuple[List[np.ndarray], List[str], List[Tuple[int, int]]]]:
+        """
+        perform resizing, normalization and batching on documents
+        """
+        images, names, shapes = documents


The method signature can take several arguments, which would be much cleaner than a single tuple holding three different types of objects.

Also, the self argument is missing, by the way.
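
For illustration, here is a minimal sketch of what a multi-argument signature could look like. It is not the code from this PR: the `__call__` entry point, the `batch_size` constructor argument, and the assumption that the inputs are already flattened to one entry per page are all hypothetical, and only the batching step is shown (resizing and normalization are left out).

```python
from typing import Any, List, Sequence, Tuple

Shape = Tuple[int, int]
# One batch groups three parallel lists: images, names, shapes
Batch = Tuple[List[Any], List[str], List[Shape]]


class Preprocessor:
    """Hypothetical sketch: only the batching step is implemented."""

    def __init__(self, batch_size: int) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be a positive integer")
        self.batch_size = batch_size

    def __call__(
        self,  # the argument that was missing in the reviewed diff
        images: Sequence[Any],  # stand-in for np.ndarray pages (assumption)
        names: Sequence[str],
        shapes: Sequence[Shape],
    ) -> List[Batch]:
        if not (len(images) == len(names) == len(shapes)):
            raise ValueError("images, names and shapes must have the same length")
        batches: List[Batch] = []
        # Slice the three parallel sequences with the same window
        for start in range(0, len(images), self.batch_size):
            stop = start + self.batch_size
            batches.append(
                (list(images[start:stop]), list(names[start:stop]), list(shapes[start:stop]))
            )
        return batches
```

With separate arguments, each input gets its own type annotation and the caller no longer has to pack and unpack a tuple.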

fg-mindee · 2021-01-15T17:42:49Z

test/test_preprocessor.py

+    return fn
+
+
+def test_preprocess_documents(num_docs=10, batch_size=3):


The mock_pdf argument is missing here; the fixture needs it to work!
You can move the other function arguments inside the function body.

fg-mindee · 2021-01-15T17:44:30Z

test/test_preprocessor.py

+@pytest.fixture(scope="session")
+def mock_pdf(tmpdir_factory):
+    url = 'https://arxiv.org/pdf/1911.08947.pdf'
+    file = BytesIO(requests.get(url).content)
+    fn = tmpdir_factory.mktemp("data").join("mock_pdf_file.pdf")
+    with open(fn, 'wb') as f:
+        f.write(file.getbuffer())
+    return fn


Could you check if this works after removing this definition please?
The fixture is defined at the session level, so I guess this will work as is (or we can import it)
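
As a sketch of one common way to share the fixture: pytest discovers fixtures defined in a conftest.py file for every test module in the same directory, with no import needed, and a test requests a fixture by naming it as an argument. The example below is hypothetical: it writes placeholder bytes instead of downloading the PDF so that it stays offline, and the test name and its checks are invented for illustration.

```python
import pytest


# In a real layout this fixture would live in conftest.py (assumed location)
@pytest.fixture(scope="session")
def mock_pdf(tmpdir_factory):
    # Placeholder content; the reviewed fixture downloads a real PDF instead
    fn = tmpdir_factory.mktemp("data").join("mock_pdf_file.pdf")
    with open(fn, "wb") as f:
        f.write(b"%PDF-1.4\n")
    return str(fn)


# In the test module: the fixture is injected through the argument name
def test_mock_pdf_is_reused(mock_pdf):
    # Former default arguments, now defined inside the test body
    num_docs, batch_size = 10, 3
    paths = [mock_pdf] * num_docs
    assert len(paths) == num_docs
    assert batch_size < num_docs
    with open(mock_pdf, "rb") as f:
        assert f.read(4) == b"%PDF"
```

Because the fixture is session-scoped, the file is created once and reused by every test that requests it.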

fg-mindee · 2021-01-15T17:45:13Z

test/test_preprocessor.py

+    if num_docs > batch_size:
+        for batch in batched_docs[:-1]:
+            for i in range(len(batch)):
+                assert len(batch[i]) == batch_size


Could you add a specific test checking actual values here, since you know the document and its number of pages?
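
A value-based check could be sketched as follows, assuming the total page count is known in advance and that batching runs over a flat list of pages (the PR does not state whether batches span documents, so this is an assumption). The helper name `expected_batch_sizes` is hypothetical.

```python
from typing import List


def expected_batch_sizes(num_pages: int, batch_size: int) -> List[int]:
    """Sizes of the batches obtained by cutting num_pages items into chunks."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    full, rest = divmod(num_pages, batch_size)
    # All batches are full except possibly the last one
    return [batch_size] * full + ([rest] if rest else [])


print(expected_batch_sizes(10, 3))  # [3, 3, 3, 1]
```

A test can then compare these sizes against the length of each list in each batch, instead of only checking that every batch but the last one is full.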

fg-mindee · 2021-01-15T17:46:59Z

Also, you need to add a PR description 😅

fg-mindee · 2021-01-18T09:15:30Z

test/test_preprocessor.py

        for batch in batched_docs[:-1]:
-            for i in range(len(batch)):
-                assert len(batch[i]) == batch_size
+            for _, batch_i in enumerate(batch):
+                assert len(batch_i) == batch_size


assert all(len(batch) == batch_size for batches in batched_docs[:-1] for batch in batches)

Generally speaking, if we can avoid "for" loops, it's better ;)
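
To make the equivalence concrete, here is a small sketch comparing the nested loops with the flattened generator expression. The sample data and the function names are hypothetical; each batch is assumed to be a tuple of three parallel lists, as in the return type of the reviewed method.

```python
from typing import List, Sequence, Tuple

Batch = Tuple[List[str], List[str], List[Tuple[int, int]]]


def check_with_loops(batched_docs: Sequence[Batch], batch_size: int) -> bool:
    """Nested-loop version, mirroring the assertions in the diff."""
    for batch in batched_docs[:-1]:
        for field in batch:
            if len(field) != batch_size:
                return False
    return True


def check_flat(batched_docs: Sequence[Batch], batch_size: int) -> bool:
    """Single expression: every list of every batch but the last is full."""
    return all(
        len(field) == batch_size
        for batch in batched_docs[:-1]
        for field in batch
    )


# Hypothetical sample: two full batches and one incomplete trailing batch
batched_docs: List[Batch] = [
    (["img0", "img1", "img2"], ["a", "b", "c"], [(1, 1)] * 3),
    (["img3", "img4", "img5"], ["d", "e", "f"], [(1, 1)] * 3),
    (["img6"], ["g"], [(1, 1)]),
]
assert check_with_loops(batched_docs, 3)
assert check_flat(batched_docs, 3)
```

The trade-off is that a single `all(...)` assertion reports only that some batch failed, whereas the loop version fails on the exact offending list.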

fg-mindee · 2021-01-18T09:21:56Z

The failure on the docker job is unrelated (a PR was already merged on main to fix it), so no need to bother with it!
But the mypy issue has to be tackled.

fg-mindee

Looks good to me!

charlesmindee added 11 commits January 11, 2021 15:34

feat: ✨ pdf reader

5329fcf

feat: ✨ add doc_to_string function

d160e9d

Merge branch 'main' into detection_module

b3b69f4

feat ✨ add inference_utilities + inference for DBnet

60ee9e2

save: saving work before switching to doc reader

eb1c91a

feat: ✨ add model meta class

f48b84c

add: postprocessor

06512ee

feat ✨ preprocessor

a30d8f5

test: passed test

0cc5a06

Merge branch 'main' into detection_module

0f66142

test: passed all tests except unitest

85cd1c9

charlesmindee commented Jan 15, 2021

refacto: remove deprecated file

4485844

fg-mindee suggested changes Jan 15, 2021

fg-mindee assigned charlesmindee Jan 15, 2021

fg-mindee added the module: models Related to doctr.models label Jan 15, 2021

fg-mindee added this to the 0.1.0 milestone Jan 15, 2021

charlesmindee added 2 commits January 18, 2021 10:05

test: passed all tests

229acf7

test: passed all tests

8be1d9d

fg-mindee reviewed Jan 18, 2021

fg-mindee closed this Jan 18, 2021

fg-mindee reopened this Jan 18, 2021

charlesmindee added 4 commits January 18, 2021 11:36

test: passed tests

45bed76

test: passed tests

40f128f

test: passed tests

e61db48

test: passed tests

d92067d

fg-mindee mentioned this pull request Jan 18, 2021

[models] Add detection module #3

Closed

4 tasks

fg-mindee approved these changes Jan 18, 2021

charlesmindee merged commit b8eb11b into main Jan 18, 2021

charlesmindee deleted the detection_module branch January 18, 2021 11:38

fg-mindee mentioned this pull request Jan 21, 2021

[models] Add text recognition module #4

Closed

4 tasks
