feat: Split DocArtefacts into subsets and updated its class mapping #601

Merged
merged 10 commits into from
Nov 12, 2021

fg-mindee
Contributor

This PR introduces the following modifications:

  • updates the URL of the zip archive, which now contains a train and a val subset
  • updates the labels target: instead of a list of strings, it is now returned as a numpy.array of class indices
  • updates the docstring of the class
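
To make the second point concrete, here is a minimal sketch of how string labels could be mapped to class indices. The class names, their order, and the helper `encode_labels` are assumptions made for illustration; they are not taken from the PR diff. The sketch returns a plain Python list, whereas the dataset returns a NumPy array.

```python
from typing import List, Sequence

# Hypothetical class list for illustration; placing "background" at index 0 is an assumption
CLASSES: List[str] = ["background", "qr_code", "bar_code", "logo", "photo"]


def encode_labels(names: Sequence[str], classes: Sequence[str] = CLASSES) -> List[int]:
    """Map each class name to its position in `classes`."""
    lookup = {name: idx for idx, name in enumerate(classes)}
    return [lookup[name] for name in names]


print(encode_labels(["qr_code", "bar_code", "photo"]))  # [1, 2, 4]
```

Integer indices can be fed to a loss function directly, which a list of strings cannot.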

Any feedback is welcome!

@fg-mindee fg-mindee added the type: enhancement and module: datasets labels Nov 9, 2021
@fg-mindee fg-mindee added this to the 0.5.0 milestone Nov 9, 2021
@fg-mindee fg-mindee self-assigned this Nov 9, 2021
charlesmindee
charlesmindee previously approved these changes Nov 9, 2021
Collaborator

@charlesmindee charlesmindee left a comment

Thanks!

Contributor

@SiddhantBahuguna SiddhantBahuguna left a comment

  1. Should we add the collate function here?
  2. Also, we need to obtain absolute box coordinates. Right now we use relative ones.

@fg-mindee
Contributor Author

fg-mindee commented Nov 10, 2021

  1. Should we add the collate function here?

Oh yeah, you're right, I'll change this

  2. Also, we need to obtain absolute box coordinates. Right now we use relative ones.

But that's OK: most of our datasets are in relative coords, and we convert them if not.
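
For reference, converting relative boxes to absolute pixel coordinates is a simple scaling. The sketch below assumes boxes are ordered (xmin, ymin, xmax, ymax) and that x values scale with the image width and y values with the height; `to_absolute` is a hypothetical helper, not a function from the library.

```python
from typing import List, Sequence, Tuple


def to_absolute(
    boxes: Sequence[Sequence[float]], height: int, width: int
) -> List[Tuple[int, int, int, int]]:
    """Scale relative (xmin, ymin, xmax, ymax) boxes to pixel coordinates."""
    return [
        (round(xmin * width), round(ymin * height), round(xmax * width), round(ymax * height))
        for xmin, ymin, xmax, ymax in boxes
    ]


# One relative box on an image of height 1024 and width 800
print(to_absolute([(0.25, 0.5, 0.75, 1.0)], height=1024, width=800))
# [(200, 512, 600, 1024)]
```

Keeping targets relative means they stay valid when the image is resized, so the conversion only has to happen where pixel values are actually needed.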

@fg-mindee
Contributor Author

For the collate function, actually I just checked and it's already working @SiddhantBahuguna :

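from doctr.datasets import DataLoader, DocArtefacts

# Load the training subset (the archive is downloaded if needed)
ds = DocArtefacts(train=True, download=True)
# Wrap the dataset in a loader that yields batches of two samples
train_loader = DataLoader(ds, batch_size=2)
train_iter = iter(train_loader)
# Fetch the first batch: images and their per-image targets
x, targets = next(train_iter)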

And the results look satisfactory to me:

# Check shape of input
print(x.shape)
TensorShape([2, 1024, 800, 3])
print(len(targets))
print(targets[0])
2
{'boxes': array([[0.94625   , 0.39746094, 0.99375   , 0.4345703 ],
       [0.38375   , 0.5957031 , 0.4275    , 0.6269531 ],
       [0.41875   , 0.28027344, 0.52875   , 0.36621094],
       [0.39625   , 0.36816406, 0.6675    , 0.57910156],
       [0.26625   , 0.5332031 , 0.2975    , 0.5576172 ],
       [0.20125   , 0.1953125 , 0.2575    , 0.23925781],
       [0.105     , 0.00976562, 0.17125   , 0.06152344],
       [0.15875   , 0.078125  , 0.2425    , 0.14355469],
       [0.10125   , 0.5292969 , 0.25875   , 0.6796875 ],
       [0.73625   , 0.7138672 , 0.85375   , 0.8261719 ]], dtype=float32), 'labels': array([1, 2, 1, 1, 1, 1, 3, 3, 4, 4])}
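
The batch structure above (one image tensor, plus a list holding one target dict per image) is what a detection-style collate function typically produces. The following is a simplified sketch under that assumption, not the library's implementation; strings stand in for image tensors, and a real version would stack the images into a single tensor.

```python
from typing import Any, Dict, List, Sequence, Tuple

Sample = Tuple[Any, Dict[str, Any]]


def collate_fn(samples: Sequence[Sample]) -> Tuple[List[Any], List[Dict[str, Any]]]:
    """Group the images together and keep one target dict per image."""
    images, targets = zip(*samples)
    return list(images), list(targets)


batch = [
    ("img_0", {"boxes": [[0.1, 0.1, 0.2, 0.2]], "labels": [1]}),
    ("img_1", {"boxes": [], "labels": []}),
]
images, targets = collate_fn(batch)
print(len(images), len(targets))  # 2 2
```

Targets stay in a list rather than being stacked because the number of boxes differs from one image to the next.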

@codecov

codecov bot commented Nov 12, 2021

Codecov Report

Merging #601 (bd74611) into main (f97e92b) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head bd74611 differs from pull request most recent head 0c7c672. Consider uploading reports for the commit 0c7c672 to get more accurate results

@@           Coverage Diff           @@
##             main     #601   +/-   ##
=======================================
  Coverage   96.06%   96.06%           
=======================================
  Files         110      110           
  Lines        4265     4269    +4     
=======================================
+ Hits         4097     4101    +4     
  Misses        168      168           
Flag Coverage Δ
unittests 96.06% <100.00%> (+<0.01%) ⬆️

Impacted Files Coverage Δ
doctr/datasets/doc_artefacts.py 94.11% <100.00%> (+0.78%) ⬆️
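
The percentages in the report can be checked from the hit and line counts in the coverage diff. This short sketch assumes coverage is simply hits divided by lines; `coverage` is a helper written for this check.

```python
def coverage(hits: int, lines: int) -> float:
    """Return line coverage as a percentage."""
    return 100.0 * hits / lines


before = coverage(4097, 4265)  # main
after = coverage(4101, 4269)   # this PR
print(f"{before:.2f}% -> {after:.2f}%")  # 96.06% -> 96.06%
print(0 < after - before < 0.01)  # True
```

All four added lines are hit, which is why the diff coverage is 100% while the project total rises by less than 0.01 points.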

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ab26073...0c7c672.

Contributor

@SiddhantBahuguna SiddhantBahuguna left a comment

The output matches our expectations. The collate problem is resolved now. Thanks :)

@fg-mindee fg-mindee merged commit 400aec0 into main Nov 12, 2021
@fg-mindee fg-mindee deleted the artefact-update branch November 12, 2021 17:35