[datasets] revert whitespace filtering and fix svhn reco #987

felixdittrich92 · 2022-07-14T07:32:20Z

This PR:

reverts the whitespace filtering in FUNSD and CORD (it was too focused on use inside docTR only)
updates the recognition eval scripts and the corresponding CI job (batch_size = 1 as default to ensure nothing is skipped because of whitespace-separated labels)
adds a fix for the SVHN dataset (recognition)

Any feedback is welcome 🤗

felixdittrich92 · 2022-07-14T07:38:40Z

@frgfm I think it would be best to filter the labels directly. The dataset is fixed and we know what we need to exclude, so that feels cleaner than any encode-decode try/except block 😅 wdyt?

codecov · 2022-07-14T07:49:00Z

Codecov Report

Merging #987 (d4b981b) into main (1dafd33) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #987   +/-   ##
=======================================
  Coverage   94.88%   94.88%           
=======================================
  Files         134      134           
  Lines        5556     5556           
=======================================
  Hits         5272     5272           
  Misses        284      284

Flag	Coverage Δ
unittests	`94.88% <100.00%> (ø)`


Impacted Files	Coverage Δ
doctr/datasets/cord.py	`97.72% <100.00%> (-0.06%)`	⬇️
doctr/datasets/funsd.py	`97.43% <100.00%> (ø)`
doctr/datasets/svhn.py	`95.34% <100.00%> (+0.11%)`	⬆️
doctr/transforms/modules/base.py	`94.59% <0.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

frgfm

Thanks! just added a few questions

And about removing white spaces early on, we can if it helps! I was just raising the point that depending on the dataset, that filter might seriously reduce the number of samples.

Either way we need to make a well-informed decision :)

frgfm · 2022-07-20T09:35:48Z

references/recognition/evaluate_pytorch.py

-    parser.add_argument('-b', '--batch_size', type=int, default=32, help='batch size for evaluation')
+    parser.add_argument('-b', '--batch_size', type=int, default=1, help='batch size for evaluation')


isn't it best to specify the batch size in the CI but leave a decent batch size for training?

It's only eval ;) In this case the try block ensures that no readable sample (one that contains no whitespace and only chars in the vocab) is skipped.

CI uses a batch size of 32 to speed things up, so it's not relevant there if samples are skipped during testing.
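
The batch-size argument above can be illustrated with a minimal sketch. It assumes, as the comment suggests but the thread does not show, that the eval loop wraps each batch in a try/except and drops the whole batch when one label cannot be encoded. The names `VOCAB`, `encode_label`, and `count_evaluated` are hypothetical and not part of docTR.

```python
from typing import List

# Hypothetical vocab for illustration only; it contains no whitespace character.
VOCAB = set("abcdefghijklmnopqrstuvwxyz0123456789")


def encode_label(label: str) -> List[int]:
    """Map each character to an index; fail on anything outside the vocab."""
    chars = sorted(VOCAB)
    for ch in label:
        if ch not in VOCAB:
            raise ValueError(f"character {ch!r} is not in the vocab")
    return [chars.index(ch) for ch in label]


def count_evaluated(labels: List[str], batch_size: int) -> int:
    """Count the samples that survive when a failing batch is skipped as a whole."""
    evaluated = 0
    for start in range(0, len(labels), batch_size):
        batch = labels[start:start + batch_size]
        try:
            for label in batch:
                encode_label(label)
        except ValueError:
            continue  # assumption: one bad label discards the entire batch
        evaluated += len(batch)
    return evaluated


labels = ["total", "12 50", "cash", "tax", "item", "qty", "sub total", "42"]
print(count_evaluated(labels, batch_size=1))  # only the two whitespace labels are dropped
print(count_evaluated(labels, batch_size=4))  # each batch of four contains one bad label
```

Under that assumption, a larger batch size loses readable samples that happen to share a batch with an unreadable one, which is why a default of 1 gives the most complete evaluation at the cost of speed.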

frgfm · 2022-07-20T09:35:59Z

references/recognition/evaluate_tensorflow.py

-    parser.add_argument('-b', '--batch_size', type=int, default=32, help='batch size for evaluation')
+    parser.add_argument('-b', '--batch_size', type=int, default=1, help='batch size for evaluation')


felixdittrich92 · 2022-07-20T10:02:43Z

Thanks! just added a few questions

And about removing white spaces early on, we can if it helps! I was just raising the point that depending on the dataset, that filter might seriously reduce the number of samples.

Either way we need to make a well-informed decision :)

About the whitespaces: that was a bit too focused on docTR behavior, but we need to provide the whole dataset, for example if a user only wants to use the integrated datasets, maybe for a model that is also able to recognize whitespaces.
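
For users who still want whitespace-free labels from the unfiltered datasets, the filtering can be done on their side. The sketch below is a hypothetical helper, not a docTR API; it assumes the samples are available as (image path, label) pairs.

```python
from typing import List, Tuple

Sample = Tuple[str, str]  # (image path, text label); assumed layout for illustration


def drop_whitespace_labels(samples: List[Sample]) -> List[Sample]:
    """Keep only samples whose label is non-empty and contains no whitespace character."""
    return [
        (path, label)
        for path, label in samples
        if label and not any(ch.isspace() for ch in label)
    ]


samples = [("img_0.png", "TOTAL"), ("img_1.png", "SUB TOTAL"), ("img_2.png", "12.50")]
kept = drop_whitespace_labels(samples)
# Reporting the counts makes it visible how much of the dataset the filter removes.
print(f"kept {len(kept)} of {len(samples)} samples")
```

Printing the before and after counts speaks to the concern raised earlier in the thread that such a filter might noticeably shrink some datasets.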

revert whitespaces filtering and update scripts

5e4bfe6

felixdittrich92 self-assigned this Jul 14, 2022

felixdittrich92 added the module: datasets, ext: references, and type: misc labels Jul 14, 2022

felixdittrich92 added this to the 0.6.0 milestone Jul 14, 2022

felixdittrich92 requested a review from frgfm July 14, 2022 07:35

update comment

949e10d

felixdittrich92 requested a review from charlesmindee July 14, 2022 11:50

fix svhn recognition

d4b981b

felixdittrich92 changed the title ~~[datasets] revert whitespace filtering~~ [datasets] revert whitespace filtering and fix svhn reco Jul 15, 2022

felixdittrich92 mentioned this pull request Jul 15, 2022

[datasets] Filter currupted and wrong annotated files in ready to use datasets #935

Closed

23 tasks

frgfm approved these changes Jul 20, 2022

felixdittrich92 merged commit 739943a into mindee:main Jul 20, 2022

felixdittrich92 deleted the fix-funsd-cord branch July 20, 2022 10:00

felixdittrich92 mentioned this pull request Sep 26, 2022

Release tracker - v0.6.0 #791

Closed

85 tasks
