Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix] unify recognition dataset parts return signature #1041

Merged
merged 4 commits into from
Sep 2, 2022

Conversation

felixdittrich92
Copy link
Contributor

@felixdittrich92 felixdittrich92 commented Sep 2, 2022

This PR:

  • revert mistake for recognition dataset parts (img, label) instead of (img, dict)
  • add target check in _read_sample
  • now we have two cases in the library for datasets
  • more than one annotation type: targets has to be a dict with boxes / labels keys
  • only one annotation type: target has to be str (label) or np.ndarray (boxes)

Hopefully that this is now a match to avoid mistakes for transformations

Closes:
#935

@felixdittrich92 felixdittrich92 self-assigned this Sep 2, 2022
@felixdittrich92 felixdittrich92 linked an issue Sep 2, 2022 that may be closed by this pull request
23 tasks
@felixdittrich92 felixdittrich92 added this to the 0.6.0 milestone Sep 2, 2022
@felixdittrich92 felixdittrich92 added type: enhancement Improvement module: datasets Related to doctr.datasets framework: pytorch Related to PyTorch backend framework: tensorflow Related to TensorFlow backend labels Sep 2, 2022
@codecov
Copy link

codecov bot commented Sep 2, 2022

Codecov Report

Merging #1041 (e619807) into main (75aa42a) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1041   +/-   ##
=======================================
  Coverage   94.93%   94.94%           
=======================================
  Files         135      135           
  Lines        5625     5633    +8     
=======================================
+ Hits         5340     5348    +8     
  Misses        285      285           
Flag Coverage Δ
unittests 94.94% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
doctr/datasets/cord.py 97.72% <100.00%> (ø)
doctr/datasets/datasets/pytorch.py 100.00% <100.00%> (ø)
doctr/datasets/datasets/tensorflow.py 100.00% <100.00%> (ø)
doctr/datasets/funsd.py 97.43% <100.00%> (ø)
doctr/datasets/ic03.py 97.43% <100.00%> (ø)
doctr/datasets/ic13.py 96.77% <100.00%> (ø)
doctr/datasets/iiit5k.py 96.96% <100.00%> (ø)
doctr/datasets/imgur5k.py 92.53% <100.00%> (ø)
doctr/datasets/mjsynth.py 95.83% <100.00%> (ø)
doctr/datasets/sroie.py 97.36% <100.00%> (ø)
... and 4 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Collaborator

@charlesmindee charlesmindee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, LGTM

@felixdittrich92 felixdittrich92 merged commit f9d3d78 into mindee:main Sep 2, 2022
@felixdittrich92 felixdittrich92 deleted the unify-dataset-return branch September 2, 2022 15:39
Copy link
Collaborator

@frgfm frgfm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Small comment though :)

Comment on lines +23 to +31

# Check target
if isinstance(target, dict):
assert "boxes" in target, "Target should contain 'boxes' key"
assert "labels" in target, "Target should contain 'labels' key"
else:
assert isinstance(target, str) or isinstance(
target, np.ndarray
), "Target should be a string or a numpy array"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering that the target is unchanged since the constructor was run, that should be moved to the constructor (because otherwise it increases data reading latency)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean to check it inside every dataset ? otherwise we have only access inside _read_sample and getitem !? And i don't think it changes the runtime noticeably

Comment on lines +23 to +32

# Check target
if isinstance(target, dict):
assert "boxes" in target, "Target should contain 'boxes' key"
assert "labels" in target, "Target should contain 'labels' key"
else:
assert isinstance(target, str) or isinstance(
target, np.ndarray
), "Target should be a string or a numpy array"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
framework: pytorch Related to PyTorch backend framework: tensorflow Related to TensorFlow backend module: datasets Related to doctr.datasets type: enhancement Improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[datasets] Filter currupted and wrong annotated files in ready to use datasets
3 participants