Detection module : Preprocessor #20
doctr/models/detection/__init__.py
Outdated
@@ -0,0 +1,2 @@
from . import differentiable_binarization
Sorry this file is deprecated
Thanks for the PR! I added a few comments. Additionally, since the preprocessor is in the models' module, could you move your tests to test/test_models.py please?
doctr/models/preprocessor.py
Outdated
from typing import Union, List, Tuple, Optional, Any, Dict


class Preprocessor():
You can remove the parentheses here: a class with no explicit base does not need them.
doctr/models/preprocessor.py
Outdated
    documents: Tuple[List[List[np.ndarray]], List[List[str]], List[List[Tuple[int, int]]]]
) -> List[Tuple[List[np.ndarray], List[str], List[Tuple[int, int]]]]:
    """
    perform resizing, normalization and batching on documents
    """
    images, names, shapes = documents
The input signature can take several arguments. That would be much cleaner than passing a single tuple of 3 different types of objects.
The method is also missing a self argument, btw.
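To illustrate the two comments above, here is a minimal sketch of a call signature that takes the three inputs as separate arguments and includes self. It is not the actual doctr implementation: the batch_size attribute, the use of __call__, the flattening of pages across documents, and the use of Any in place of np.ndarray are all assumptions made to keep the example self-contained.

```python
from typing import Any, List, Tuple


class Preprocessor:
    """Hypothetical sketch, not the doctr implementation."""

    def __init__(self, batch_size: int) -> None:
        # batch_size is an assumed constructor argument
        self.batch_size = batch_size

    def __call__(
        self,
        images: List[List[Any]],  # Any stands in for np.ndarray pages
        names: List[List[str]],
        shapes: List[List[Tuple[int, int]]],
    ) -> List[Tuple[List[Any], List[str], List[Tuple[int, int]]]]:
        # Flatten the per-document page lists into one list per field
        flat_images = [page for doc in images for page in doc]
        flat_names = [name for doc in names for name in doc]
        flat_shapes = [shape for doc in shapes for shape in doc]
        # Slice the flattened pages into consecutive batches;
        # only the last batch can be shorter than batch_size
        step = self.batch_size
        return [
            (flat_images[i:i + step], flat_names[i:i + step], flat_shapes[i:i + step])
            for i in range(0, len(flat_images), step)
        ]
```

With this shape, callers write preprocessor(images, names, shapes) instead of packing and unpacking a tuple, and each argument gets its own type annotation.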
test/test_preprocessor.py
Outdated
    return fn


def test_preprocess_documents(num_docs=10, batch_size=3):
The test is missing a mock_pdf argument, which is needed for the fixture to work here!
You can move the other function arguments inside the function body.
test/test_preprocessor.py
Outdated
@pytest.fixture(scope="session")
def mock_pdf(tmpdir_factory):
    url = 'https://arxiv.org/pdf/1911.08947.pdf'
    file = BytesIO(requests.get(url).content)
    fn = tmpdir_factory.mktemp("data").join("mock_pdf_file.pdf")
    with open(fn, 'wb') as f:
        f.write(file.getbuffer())
    return fn
Could you check if this works after removing this definition please?
The fixture is defined at the session level, so I guess this will work as is (or we can import it)
test/test_preprocessor.py
Outdated
if num_docs > batch_size:
    for batch in batched_docs[:-1]:
        for i in range(len(batch)):
            assert len(batch[i]) == batch_size
Could you add a specific test checking for actual values here, since you know the doc and its number of pages?
Also, you need to add a PR description 😅
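For the value check requested above, one option is to compute the expected batch sizes from the known page count and compare them with what the preprocessor returns. The helper below is a hypothetical sketch; expected_batch_sizes is not part of the PR, and the page counts used are placeholders rather than the real page count of the test PDF.

```python
import math
from typing import List


def expected_batch_sizes(num_pages: int, batch_size: int) -> List[int]:
    """Sizes of consecutive batches when num_pages are split by batch_size."""
    num_batches = math.ceil(num_pages / batch_size)
    sizes = [batch_size] * num_batches
    # Only the last batch can be incomplete
    if num_pages % batch_size:
        sizes[-1] = num_pages % batch_size
    return sizes
```

A test could then assert that the list of per-batch lengths equals expected_batch_sizes(total_pages, batch_size), which also covers the last, possibly smaller batch.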
test/test_preprocessor.py
Outdated
for batch in batched_docs[:-1]:
    for i in range(len(batch)):
        assert len(batch[i]) == batch_size
    for _, batch_i in enumerate(batch):
        assert len(batch_i) == batch_size
assert all(len(batch) == batch_size for batches in batched_docs[:-1] for batch in batches)
Generally speaking, if we can avoid "for" loops, it's better ;)
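As a small illustration of the suggestion above, the sketch below runs the nested-loop check and the single all(...) check on toy data. The batched_docs value here is a made-up stand-in, assuming each batch is a tuple of three parallel lists (images, names, shapes) as in the return annotation under review.

```python
batch_size = 3

# Toy stand-in for the preprocessor output (assumed structure)
batched_docs = [
    (["img0", "img1", "img2"], ["a", "b", "c"], [(1, 1), (2, 2), (3, 3)]),
    (["img3", "img4", "img5"], ["d", "e", "f"], [(4, 4), (5, 5), (6, 6)]),
    (["img6"], ["g"], [(7, 7)]),  # the last batch may be smaller
]

# Loop form: one assert per field of every full batch
for batch in batched_docs[:-1]:
    for field in batch:
        assert len(field) == batch_size

# Single-expression form: one flattened generator over the same fields
full_batches_ok = all(
    len(field) == batch_size
    for batch in batched_docs[:-1]
    for field in batch
)
assert full_batches_ok
```

Both forms check the same condition. The all(...) form fails with a single assertion, so it does not say which batch was wrong, which is the trade-off for the shorter test.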
The failure on the docker job is unrelated (a PR was already merged on main to fix this), so no need to bother with this!
Looks good to me!
Preprocessor module to batch/norm/resize documents before injection in the model