Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

[datasets] revert whitespace filtering and fix svhn reco #987

Merged
merged 3 commits into from
Jul 20, 2022

Conversation

felixdittrich92
Copy link
Contributor

@felixdittrich92 felixdittrich92 commented Jul 14, 2022

This PR:

  • reverts the whitespace filtering in FUNSD and CORD (is to much focused then to be used only inside docTR)
  • update recognition eval scripts and corresponding CI job ( batch_size = 1 as default to ensure nothing is skipped in fact of whitespaces seperated labels)
  • add fix for svhn dataset (recognition)

Any feddback is welcome 馃

@felixdittrich92 felixdittrich92 self-assigned this Jul 14, 2022
@felixdittrich92 felixdittrich92 added module: datasets Related to doctr.datasets ext: references Related to references folder type: misc Miscellaneous labels Jul 14, 2022
@felixdittrich92 felixdittrich92 added this to the 0.6.0 milestone Jul 14, 2022
@felixdittrich92
Copy link
Contributor Author

@frgfm I think it would be the best to filter the labels directly .. the dataset is fix and we know what we need to exclude feels cleaner instead of any encode-decode try/except block 馃槄 wdyt ?

@codecov
Copy link

codecov bot commented Jul 14, 2022

Codecov Report

Merging #987 (d4b981b) into main (1dafd33) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #987   +/-   ##
=======================================
  Coverage   94.88%   94.88%           
=======================================
  Files         134      134           
  Lines        5556     5556           
=======================================
  Hits         5272     5272           
  Misses        284      284           
Flag Coverage 螖
unittests 94.88% <100.00%> (酶)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage 螖
doctr/datasets/cord.py 97.72% <100.00%> (-0.06%) 猬囷笍
doctr/datasets/funsd.py 97.43% <100.00%> (酶)
doctr/datasets/svhn.py 95.34% <100.00%> (+0.11%) 猬嗭笍
doctr/transforms/modules/base.py 94.59% <0.00%> (酶)

Continue to review full report at Codecov.

Legend - Click here to learn more
螖 = absolute <relative> (impact), 酶 = not affected, ? = missing data
Powered by Codecov. Last update 1dafd33...d4b981b. Read the comment docs.

@felixdittrich92 felixdittrich92 changed the title [datasets] revert whitespace filtering [datasets] revert whitespace filtering and fix svhn reco Jul 15, 2022
Copy link
Collaborator

@frgfm frgfm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! just added a few questions

And about removing white spaces early on, we can if it helps! I was just raising the point that depending on the dataset, that filter might seriously reduce the number of samples.

Either way we need to make a well-informed decision :)

parser.add_argument('-b', '--batch_size', type=int, default=32, help='batch size for evaluation')
parser.add_argument('-b', '--batch_size', type=int, default=1, help='batch size for evaluation')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't it best to specify the batch size in the CI but leave a decent batch size for training?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only eval ;) in this case we ensure while try block that no sample is skipped which can be read (does not contain whitespace and chars in vocab)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI uses a batch size of 32 to speedup so it's not relevant if samples are skipped for testing

parser.add_argument('-b', '--batch_size', type=int, default=32, help='batch size for evaluation')
parser.add_argument('-b', '--batch_size', type=int, default=1, help='batch size for evaluation')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

@felixdittrich92 felixdittrich92 merged commit 739943a into mindee:main Jul 20, 2022
@felixdittrich92 felixdittrich92 deleted the fix-funsd-cord branch July 20, 2022 10:00
@felixdittrich92
Copy link
Contributor Author

Thanks! just added a few questions

And about removing white spaces early on, we can if it helps! I was just raising the point that depending on the dataset, that filter might seriously reduce the number of samples.

Either way we need to make a well-informed decision :)

About the whitespaces that was a bit to much focused on docTR behavior but we need to provide the whole dataset for example if a user only wants to use the integrated datasets maybe for a model which is able to recognize also whitespaces

@felixdittrich92 felixdittrich92 mentioned this pull request Sep 26, 2022
85 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ext: references Related to references folder module: datasets Related to doctr.datasets type: misc Miscellaneous
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants