[datasets] revert whitespace filtering and fix svhn reco #987
Conversation
@frgfm I think it would be best to filter the labels directly .. the dataset is fixed and we know what we need to exclude, which feels cleaner than any encode-decode try/except block 😅 wdyt?
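For illustration, filtering the labels directly at load time (rather than round-tripping through an encode-decode try/except) could look like the sketch below. The names `samples` and `vocab` are hypothetical placeholders, not docTR's actual API:

```python
# Hypothetical sketch: keep only labels that are fully readable,
# i.e. contain no whitespace and only characters present in the vocab.
vocab = set("0123456789")  # e.g. digits for SVHN recognition

samples = [("img_001.png", "123"), ("img_002.png", "4 5"), ("img_003.png", "6a")]

filtered = [
    (img, label)
    for img, label in samples
    if not any(c.isspace() for c in label) and set(label) <= vocab
]
# "4 5" is dropped for whitespace, "6a" because 'a' is not in the vocab.
```

The known exclusions are applied once when building the sample list, so no try/except is needed in the evaluation loop.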
Codecov Report
@@           Coverage Diff           @@
##             main     #987   +/-   ##
=======================================
  Coverage   94.88%   94.88%
=======================================
  Files         134      134
  Lines        5556     5556
=======================================
  Hits         5272     5272
  Misses        284      284
=======================================
Thanks! Just added a few questions.
And about removing whitespaces early on: we can, if it helps! I was just raising the point that, depending on the dataset, that filter might seriously reduce the number of samples.
Either way, we need to make a well-informed decision :)
-parser.add_argument('-b', '--batch_size', type=int, default=32, help='batch size for evaluation')
+parser.add_argument('-b', '--batch_size', type=int, default=1, help='batch size for evaluation')
Isn't it best to specify the batch size in the CI, but leave a decent batch size for training?
It's only eval ;) In this case the try block ensures that no readable sample is skipped, i.e. one that contains no whitespace and whose characters are all in the vocab.
CI uses a batch size of 32 to speed things up, so whether samples are skipped isn't relevant for testing.
-parser.add_argument('-b', '--batch_size', type=int, default=32, help='batch size for evaluation')
+parser.add_argument('-b', '--batch_size', type=int, default=1, help='batch size for evaluation')
Same here.
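The discussion above is about letting the script default stay small while CI passes a larger batch size explicitly. A minimal self-contained sketch of how that interaction works with argparse (the parser description is hypothetical, but the `-b` flag mirrors the quoted diff):

```python
import argparse

# Mirror of the flag from the diff above: default batch size of 1 for eval.
parser = argparse.ArgumentParser(description="recognition evaluation (sketch)")
parser.add_argument('-b', '--batch_size', type=int, default=1,
                    help='batch size for evaluation')

# With no CLI arguments, the script default applies:
args = parser.parse_args([])
print(args.batch_size)  # 1

# CI can still override it explicitly to speed up the run:
args = parser.parse_args(['-b', '32'])
print(args.batch_size)  # 32
```

This way an exhaustive sample-by-sample evaluation remains the default, while the CI job keeps its fast batched run.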
About the whitespaces: that was a bit too focused on docTR's behavior, but we need to provide the whole dataset, for example if a user only wants to use the integrated datasets for a model that is also able to recognize whitespaces.
This PR:
Any feedback is welcome 🤗