Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detection module : Preprocessor #20

Merged
merged 18 commits into from Jan 18, 2021
Merged

Detection module : Preprocessor #20

merged 18 commits into from Jan 18, 2021

Conversation

charlesmindee
Copy link
Collaborator

@charlesmindee charlesmindee commented Jan 15, 2021

Preprocessor module to batch/norm/resize documents before injection in the model

@@ -0,0 +1,2 @@
from . import differentiable_binarization
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this file is deprecated

Copy link
Contributor

@fg-mindee fg-mindee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I added a few comments. Additionally, since the preprocessor is in the models' module, could you move your tests to test/test_models.py please?

from typing import Union, List, Tuple, Optional, Any, Dict


class Preprocessor():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove the brackets here

doctr/models/preprocessor.py Show resolved Hide resolved
doctr/models/preprocessor.py Show resolved Hide resolved
Comment on lines 27 to 32
documents: Tuple[List[List[np.ndarray]], List[List[str]], List[List[Tuple[int, int]]]]
) -> List[Tuple[List[np.ndarray], List[str], List[Tuple[int, int]]]]:
"""
perform resizing, normalization and batching on documents
"""
images, names, shapes = documents
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The input signature can take several arguments, it would be way cleaner than having a tuple of 3 different type of objects

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing a self in argument btw

return fn


def test_preprocess_documents(num_docs=10, batch_size=3):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing a mock_pdf argument for the fixture to work here!
You can move the others function arguments inside its definition

Comment on lines 8 to 15
@pytest.fixture(scope="session")
def mock_pdf(tmpdir_factory):
url = 'https://arxiv.org/pdf/1911.08947.pdf'
file = BytesIO(requests.get(url).content)
fn = tmpdir_factory.mktemp("data").join("mock_pdf_file.pdf")
with open(fn, 'wb') as f:
f.write(file.getbuffer())
return fn
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you check if this works after removing this definition please?
The fixture is defined at the session level, so I guess this will work as is (or we can import it)

if num_docs > batch_size:
for batch in batched_docs[:-1]:
for i in range(len(batch)):
assert len(batch[i]) == batch_size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a specific test checking for actual values here since you now the doc + num of pages?

@fg-mindee
Copy link
Contributor

Also, you need to add a PR description 😅

@fg-mindee fg-mindee added the module: models Related to doctr.models label Jan 15, 2021
@fg-mindee fg-mindee added this to the 0.1.0 milestone Jan 15, 2021
Comment on lines 26 to 28
for batch in batched_docs[:-1]:
for i in range(len(batch)):
assert len(batch[i]) == batch_size
for _, batch_i in enumerate(batch):
assert len(batch_i) == batch_size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert all(len(batch) == batch_size for batches in batched_docs[:-1] for batch in batches)

Generally speaking, if we can avoid "for" loops, it's better ;)

@fg-mindee
Copy link
Contributor

The fail on the docker job is unrelated (a PR was already merged on main to fix this), so no need to bother with this!
But the mypy issue has to be tackled

@fg-mindee fg-mindee closed this Jan 18, 2021
@fg-mindee fg-mindee reopened this Jan 18, 2021
@fg-mindee fg-mindee mentioned this pull request Jan 18, 2021
4 tasks
Copy link
Contributor

@fg-mindee fg-mindee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@charlesmindee charlesmindee merged commit b8eb11b into main Jan 18, 2021
@charlesmindee charlesmindee deleted the detection_module branch January 18, 2021 11:38
@fg-mindee fg-mindee mentioned this pull request Jan 21, 2021
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: models Related to doctr.models
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants