feat: add MNIST-like characters dataset generator #408

Closed
charlesmindee wants to merge 17 commits

charlesmindee (Collaborator) commented Aug 11, 2021

This PR adds a generate_character function that generates random single-character images from a vocabulary, for training our recognition backbones.

Linked to #255

Any feedback is welcome!

Samples: [image: mnist]

charlesmindee self-assigned this Aug 11, 2021
charlesmindee added the type: enhancement (Improvement) and module: datasets (Related to doctr.datasets) labels Aug 11, 2021
charlesmindee added this to the 0.4.0 milestone Aug 11, 2021

codecov bot commented Aug 11, 2021

Codecov Report

Merging #408 (e598a50) into main (bb611f5) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head e598a50 differs from pull request most recent head e478b52. Consider uploading reports for the commit e478b52 to get more accurate results.

@@            Coverage Diff             @@
##             main     #408      +/-   ##
==========================================
- Coverage   95.83%   95.82%   -0.01%     
==========================================
  Files          91       92       +1     
  Lines        3815     3833      +18     
==========================================
+ Hits         3656     3673      +17     
- Misses        159      160       +1     
| Flag | Coverage Δ |
|---|---|
| unittests | 95.82% <100.00%> (-0.01%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| doctr/datasets/__init__.py | 100.00% <100.00%> (ø) |
| doctr/datasets/character_generator.py | 100.00% <100.00%> (ø) |
| ...dels/detection/differentiable_binarization/base.py | 91.35% <0.00%> (-0.62%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bb611f5...e478b52.

fg-mindee (Contributor) left a comment

Thanks for the PR!
To make it more modular, could you please turn your function into something deterministic?

  • one to generate an image for a given character
    (if we later use this for random generation, we'll just have to pick randomly within a vocab)
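
The split suggested above can be sketched roughly as follows. This is a minimal sketch and not the PR's implementation: `render_character` and `sample_character` are hypothetical names, Pillow's built-in default font stands in for a real font file, and the 32x32 canvas is an arbitrary choice.

```python
import random
from typing import Tuple

from PIL import Image, ImageDraw, ImageFont


def render_character(char: str, size: int = 32) -> Image.Image:
    """Deterministically render one character, centered on a black square canvas.

    Hypothetical helper: the same (char, size) always yields the same image.
    """
    # Assumption: Pillow's built-in font stands in for a real font file
    font = ImageFont.load_default()
    # Draw on an oversized scratch canvas so the glyph is never clipped
    scratch = Image.new("L", (2 * size, 2 * size), color=0)
    ImageDraw.Draw(scratch).text((size // 2, size // 2), char, fill=255, font=font)

    canvas = Image.new("L", (size, size), color=0)
    bbox = scratch.getbbox()  # None when nothing was drawn (e.g. a space)
    if bbox is not None:
        glyph = scratch.crop(bbox)
        # Center the tight glyph crop on the output canvas
        offset = ((size - glyph.width) // 2, (size - glyph.height) // 2)
        canvas.paste(glyph, offset)
    return canvas


def sample_character(vocab: str, rng: random.Random, size: int = 32) -> Tuple[str, Image.Image]:
    """Random generation is just a random pick within the vocab plus the deterministic renderer."""
    char = rng.choice(vocab)
    return char, render_character(char, size)


label, image = sample_character("abcdefghijklmnopqrstuvwxyz", random.Random(0))
```

Keeping all randomness in the caller (here a seeded `random.Random`) makes the renderer easy to unit test and to type-check, since its output depends only on its arguments.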

fg-mindee (Contributor) commented

Also we need some unittests, and mypy isn't happy 😅

fg-mindee (Contributor) left a comment

A few improvement suggestions in comments!

Two review threads on doctr/datasets/character_generator.py (outdated, resolved)

fg-mindee (Contributor) commented

Also quick suggestion: let's produce straight characters (we'll be able to use transforms afterwards to rotate the characters if needed)
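
One way to read this suggestion: the generator only ever draws upright glyphs, and rotation is applied later as an optional, separate step. The sketch below is illustrative only: `random_rotation` is a hypothetical helper, and Pillow's `Image.rotate` stands in for whatever transform the project would actually use.

```python
import random

from PIL import Image, ImageDraw, ImageFont


def random_rotation(img: Image.Image, max_angle: float, rng: random.Random) -> Image.Image:
    """Rotate by an angle drawn uniformly from [-max_angle, max_angle] degrees.

    Hypothetical transform, applied after generation rather than inside it.
    """
    angle = rng.uniform(-max_angle, max_angle)
    # The canvas size is preserved; uncovered corners are filled with black
    return img.rotate(angle, fillcolor=0)


# A straight character, drawn with Pillow's built-in font (an assumption for this sketch)
straight = Image.new("L", (32, 32), color=0)
ImageDraw.Draw(straight).text((12, 10), "A", fill=255, font=ImageFont.load_default())

# Rotation is opt-in and seeded, so the augmented sample is reproducible
rotated = random_rotation(straight, max_angle=15.0, rng=random.Random(0))
```

With this separation the dataset stays clean by default, and the amount of rotation becomes a training-time choice rather than a property baked into the generated images.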

Labels: module: datasets (Related to doctr.datasets), type: enhancement (Improvement)
2 participants