feat: Added WordGenerator dataset #760

fg-mindee · 2021-12-26T17:33:34Z

This PR introduces the following modifications:

implements a WordGenerator in the same spirit as the CharGenerator, using min_chars and max_chars to specify the word length
updates unit tests and documentation
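
As a rough illustration of the min_chars / max_chars idea (not the actual doctr implementation), a word can be sampled by drawing a length in that range and then drawing characters from the vocabulary. The helper name `sample_word` and the uniform, independent sampling are assumptions made for this sketch.

```python
import random


def sample_word(vocab: str, min_chars: int, max_chars: int, rng: random.Random) -> str:
    """Hypothetical helper: draw a random word whose characters come from `vocab`."""
    if not vocab:
        raise ValueError("vocab must not be empty")
    if not 1 <= min_chars <= max_chars:
        raise ValueError("expected 1 <= min_chars <= max_chars")
    # Pick the word length uniformly in [min_chars, max_chars] (both inclusive)
    num_chars = rng.randint(min_chars, max_chars)
    # Pick each character independently, with replacement
    return "".join(rng.choice(vocab) for _ in range(num_chars))


rng = random.Random(0)  # seeded only to make the sketch reproducible
words = [sample_word("abcdefgh", 1, 10, rng) for _ in range(5)]
print(words)
```

Passing the random generator explicitly keeps the sketch deterministic under a seed; a real dataset would more likely sample on the fly at each access.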

To illustrate the PR, the following snippet:

from doctr.datasets.generator.base import synthesize_text_img
import matplotlib.pyplot as plt

plt.imshow(synthesize_text_img('a', font_size=32)); plt.show()
plt.imshow(synthesize_text_img('abcdefgh', font_size=32)); plt.show()

produces:

And the same in font_size=64:

Finally, the dataset itself:

from doctr.datasets import WordGenerator, VOCABS
import matplotlib.pyplot as plt

# Words of 1 to 10 characters; the font will be picked randomly in the list
ds = WordGenerator(VOCABS['french'], 1, 10, num_samples=8, font_family=["FreeMono.ttf", "FreeSans.ttf", "FreeSerif.ttf"])
# Visualize the 8 samples on a 4 x 2 grid
_, axes = plt.subplots(4, 2)
for idx in range(8):
    img, target = ds[idx]
    row_idx = idx // 2
    col_idx = idx % 2
    axes[row_idx, col_idx].imshow(img.numpy())
    axes[row_idx, col_idx].set_title(target)
# Hide the axes on every subplot
for ax in axes.ravel():
    ax.axis('off')
plt.show()

renders as follows:

(Easily turned into something more appealing with transformations like ColorInversion, etc.)

Please note that some characters, such as the bitcoin symbol, silently fail to render properly; this will have to be investigated.

Closes #262

codecov · 2021-12-26T17:43:17Z

Codecov Report

Merging #760 (5455a2c) into main (56a5830) will decrease coverage by 0.16%.
The diff coverage is 83.72%.

@@            Coverage Diff             @@
##             main     #760      +/-   ##
==========================================
- Coverage   96.22%   96.05%   -0.17%     
==========================================
  Files         129      129              
  Lines        4764     4794      +30     
==========================================
+ Hits         4584     4605      +21     
- Misses        180      189       +9
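
The percentages in the table above follow from hits divided by lines. The short check below reproduces them. It assumes the displayed values are truncated to two decimals rather than rounded, which would be consistent with the table showing -0.17 while the summary line says 0.16. That is an inference from the numbers, not a statement about how Codecov computes them.

```python
import math


def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage, truncated (not rounded) to two decimals."""
    return math.floor(hits / lines * 10_000) / 100


before = coverage_pct(4584, 4764)  # main
after = coverage_pct(4605, 4794)   # this PR
print(before, after)

# Difference of the untruncated ratios, in percentage points
exact_drop = (4584 / 4764 - 4605 / 4794) * 100
print(round(exact_drop, 2))
```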

Flag	Coverage Δ
unittests	`96.05% <83.72%> (-0.17%)`	⬇️

Impacted Files	Coverage Δ
doctr/datasets/generator/__init__.py	`100.00% <ø> (ø)`
doctr/datasets/generator/base.py	`81.08% <79.41%> (ø)`
doctr/datasets/__init__.py	`100.00% <100.00%> (ø)`
doctr/datasets/generator/pytorch.py	`100.00% <100.00%> (ø)`
doctr/datasets/generator/tensorflow.py	`100.00% <100.00%> (ø)`
doctr/models/builder.py	`96.52% <0.00%> (-1.74%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 56a5830...5455a2c.

charlesmindee

Thanks for the PR! Indeed, we need to find a way to filter out characters that cannot be rendered. One option could be to cast them with unidecode.
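
To make the suggestion concrete, here is a minimal sketch of an ASCII fallback. It uses the standard library's `unicodedata` as a stand-in rather than unidecode itself, which performs a much richer transliteration. The helper name `ascii_fallback` is hypothetical, and whether this kind of casting is the right fix for the rendering issue is left open.

```python
import unicodedata


def ascii_fallback(text: str) -> str:
    """Hypothetical helper: strip accents, then drop whatever is still not ASCII."""
    # NFKD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops the marks and any other non-ASCII character
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


print(ascii_fallback("café"))
print(repr(ascii_fallback("₿")))
```

Note that characters with no decomposition are dropped entirely, so a caller would have to handle labels that become shorter or empty.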

fg-mindee added 8 commits December 26, 2021 16:45

feat: Added WordGenerator dataset

f7e139e

refactor: Renamed datasets.classification into datasets.generator

195dd18

docs: Updated dataset page

259a214

refactor: Refactored WordGenerator

3f10d73

test: Added unittests

c051976

feat: Improved image sizing

0040c0e

refactor: Refactored text synthesis

f457b36

refactor: Refactored text img synthesis

5455a2c

fg-mindee added the topic: documentation, type: enhancement, ext: tests, module: datasets, and topic: text recognition labels Dec 26, 2021

fg-mindee added this to the 0.5.0 milestone Dec 26, 2021

fg-mindee requested review from charlesmindee and SiddhantBahuguna December 26, 2021 17:33

fg-mindee self-assigned this Dec 26, 2021

charlesmindee approved these changes Dec 27, 2021

fg-mindee merged commit 221e421 into main Dec 27, 2021

fg-mindee deleted the word-gen branch December 27, 2021 11:16

fg-mindee added the type: new feature and ext: docs labels, and removed the type: enhancement label Dec 31, 2021
