feat: Added WordGenerator dataset #760

Merged
merged 8 commits into from
Dec 27, 2021



@fg-mindee fg-mindee commented Dec 26, 2021

This PR introduces the following modifications:

  • implements a WordGenerator in the same spirit as the CharGenerator, using min_chars and max_chars to bound the length of the generated words
  • updates the unit tests and documentation
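
To make the role of min_chars and max_chars concrete, here is a minimal sketch of the sampling idea. It is not the code added by this PR: `sample_word` is a hypothetical helper, and drawing the length and the characters uniformly is an assumption made for illustration.

```python
import random
from typing import Optional


def sample_word(
    vocab: str,
    min_chars: int,
    max_chars: int,
    rng: Optional[random.Random] = None,
) -> str:
    """Hypothetical helper: draw a random word with min_chars..max_chars characters."""
    if not vocab:
        raise ValueError("vocab must not be empty")
    if not 1 <= min_chars <= max_chars:
        raise ValueError("expected 1 <= min_chars <= max_chars")
    if rng is None:
        rng = random.Random()
    # Assumption: the length is uniform over the inclusive range
    length = rng.randint(min_chars, max_chars)
    # Assumption: each character is drawn independently and uniformly from vocab
    return "".join(rng.choice(vocab) for _ in range(length))


if __name__ == "__main__":
    print(sample_word("abcdefghijklmnopqrstuvwxyz", 1, 10))
```

Passing a seeded `random.Random` instance makes the generated words reproducible, which is convenient in unit tests.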

To illustrate the PR, the following snippet:

from doctr.datasets.generator.base import synthesize_text_img
import matplotlib.pyplot as plt

# Render a single character, then a full word, at font size 32
plt.imshow(synthesize_text_img('a', font_size=32))
plt.show()
plt.imshow(synthesize_text_img('abcdefgh', font_size=32))
plt.show()

produces:
[image: single_letter]
[image: full_word]

And the same in font_size=64:
[image: single_letter64]
[image: word64]

Finally, the dataset itself:

from doctr.datasets import WordGenerator, VOCABS
import matplotlib.pyplot as plt

# The font is picked randomly from the list
ds = WordGenerator(VOCABS['french'], 1, 10, num_samples=8, font_family=["FreeMono.ttf", "FreeSans.ttf", "FreeSerif.ttf"])
# Visualize the 8 samples on a 4x2 grid
_, axes = plt.subplots(4, 2)
for idx in range(8):
    img, target = ds[idx]
    row_idx, col_idx = divmod(idx, 2)
    axes[row_idx, col_idx].imshow(img.numpy())
    axes[row_idx, col_idx].set_title(target)
for ax in axes.ravel():
    ax.axis('off')
plt.show()

renders as follows:
[image: wordgen]

(Easily turned into something more appealing with transformations like ColorInversion, etc.)

Please note that some characters, such as the bitcoin symbol, are not rendered properly, and this failure is silent. This will have to be investigated.

Closes #262

@fg-mindee added the topic: documentation, type: enhancement, ext: tests, module: datasets, and topic: text recognition labels on Dec 26, 2021
@fg-mindee added this to the 0.5.0 milestone on Dec 26, 2021
@fg-mindee self-assigned this on Dec 26, 2021
codecov bot commented Dec 26, 2021

Codecov Report

Merging #760 (5455a2c) into main (56a5830) will decrease coverage by 0.16%.
The diff coverage is 83.72%.

@@            Coverage Diff             @@
##             main     #760      +/-   ##
==========================================
- Coverage   96.22%   96.05%   -0.17%     
==========================================
  Files         129      129              
  Lines        4764     4794      +30     
==========================================
+ Hits         4584     4605      +21     
- Misses        180      189       +9     

Flag        Coverage Δ
unittests   96.05% <83.72%> (-0.17%) ⬇️

Impacted Files                            Coverage Δ
doctr/datasets/generator/__init__.py      100.00% <ø> (ø)
doctr/datasets/generator/base.py          81.08% <79.41%> (ø)
doctr/datasets/__init__.py                100.00% <100.00%> (ø)
doctr/datasets/generator/pytorch.py       100.00% <100.00%> (ø)
doctr/datasets/generator/tensorflow.py    100.00% <100.00%> (ø)
doctr/models/builder.py                   96.52% <0.00%> (-1.74%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update: 56a5830...5455a2c.

@charlesmindee charlesmindee left a comment

Thanks for the PR! Indeed, we need to find a way to filter out characters that cannot be rendered. One option could be to cast them with unidecode.
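
As a rough sketch of what such filtering could look like, the snippet below keeps the characters found in a set of renderable characters, folds the others to a base form when possible, and drops whatever remains. It deliberately swaps unidecode for the standard library's `unicodedata` NFKD decomposition, which is a cruder stand-in and not equivalent. `to_renderable` and the `supported` argument are hypothetical names, and how to obtain the set of glyphs a given font can actually draw is left open.

```python
import unicodedata
from typing import Iterable


def to_renderable(text: str, supported: Iterable[str]) -> str:
    """Hypothetical helper: keep supported characters, fold or drop the rest."""
    # Assumption: the caller knows which characters the font can render
    allowed = set(supported)
    kept = []
    for char in text:
        if char in allowed:
            kept.append(char)
            continue
        # NFKD splits e.g. an accented letter into base letter + combining mark;
        # only the supported parts survive. This is a stdlib stand-in for unidecode.
        decomposed = unicodedata.normalize("NFKD", char)
        kept.extend(c for c in decomposed if c in allowed)
    return "".join(kept)


if __name__ == "__main__":
    # U+00E9 is e-acute, U+20BF is the bitcoin sign
    print(to_renderable("caf\u00e9\u20bf", "abcdefghijklmnopqrstuvwxyz"))
```

Characters with no supported decomposition, such as the bitcoin sign here, are dropped rather than rendered as a blank or wrong glyph, so the label stays consistent with the image.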

@fg-mindee merged commit 221e421 into main on Dec 27, 2021
@fg-mindee deleted the word-gen branch on December 27, 2021 at 11:16
@fg-mindee added the type: new feature and ext: docs labels and removed the type: enhancement label on Dec 31, 2021
Labels: ext: docs, ext: tests, module: datasets, topic: documentation, topic: text recognition, type: new feature
Development

Successfully merging this pull request may close these issues.

[datasets] Add a synthetic recognition dataset
2 participants