fix: Fixed support of characters out of latin-1 in page synthesis #496

Merged: 1 commit merged on Sep 28, 2021

@fg-mindee (Contributor) commented Sep 28, 2021

This PR fixes the page synthesis feature. Until now, we have relied on Pillow to draw text. However, as pointed out in #495, this support is limited: characters outside latin-1, such as '€', raise a UnicodeEncodeError.

This snippet crashes:

from PIL import Image, ImageDraw
img = Image.new('RGB', (100, 100), color=(255, 255, 255))
d = ImageDraw.Draw(img)
d.text((0, 0), '€', fill=(0, 0, 0))
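The crash above can be anticipated without Pillow, assuming (as the PR title suggests) that the text gets encoded to latin-1 somewhere along the drawing path. The helper name `is_latin1` is hypothetical and is not part of Pillow or docTR; this is only a sketch for spotting affected strings.

```python
def is_latin1(text: str) -> bool:
    """Return True if every character of `text` can be encoded as latin-1.

    Hypothetical helper: it mirrors the encoding step that is assumed
    to fail in the snippet above, and does not call Pillow itself.
    """
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


# 'é' (U+00E9) is inside latin-1, '€' (U+20AC) is not
samples = ["hello", "é", "€"]
affected = [s for s in samples if not is_latin1(s)]
```

Here `affected` holds only the strings that would need normalizing before being drawn.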

This PR handles the crash by catching the UnicodeEncodeError and normalizing the string with unidecode, as follows:

from PIL import Image, ImageDraw
from unidecode import unidecode
img = Image.new('RGB', (100, 100), color=(255, 255, 255))
d = ImageDraw.Draw(img)
d.text((0, 0), unidecode('€'), fill=(0, 0, 0))  # transliterate to ASCII before drawing
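The snippet above normalizes unconditionally, while the description mentions catching the UnicodeEncodeError first. A minimal sketch of that try-then-fallback pattern could look like the following. The function name `draw_text_with_fallback` and its parameters are assumptions made for illustration, not the code merged in this PR; the drawing call and the normalizer are passed in so the sketch does not depend on a specific Pillow version.

```python
from typing import Callable, Tuple


def draw_text_with_fallback(
    draw_text: Callable[..., None],
    xy: Tuple[int, int],
    text: str,
    normalize: Callable[[str], str],
    **kwargs,
) -> str:
    """Draw `text`; if the backend cannot encode it, draw `normalize(text)`.

    Hypothetical helper. Returns the string that was actually drawn, so
    the caller can tell when a fallback happened.
    """
    try:
        draw_text(xy, text, **kwargs)
        return text
    except UnicodeEncodeError:
        # Only reached for strings the backend rejects (e.g. '€')
        fallback = normalize(text)
        draw_text(xy, fallback, **kwargs)
        return fallback
```

With the objects from the snippets above, a call might read `draw_text_with_fallback(d.text, (0, 0), '€', unidecode, fill=(0, 0, 0))`. Strings the backend accepts are drawn unchanged, so only the problematic ones are transliterated.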

Please note that in doing so, some strings will be synthesized with a different length (cf. the snippet below):

In [1]: from unidecode import unidecode
In [2]: print(unidecode('€'))
EUR
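To see which words are affected by this length change, one could compare each string before and after normalization. This is a rough sketch: `length_changes` is a hypothetical name, and the `normalize` argument stands for whichever transliteration function is used (`unidecode` in this PR).

```python
from typing import Callable, Dict, Iterable, Tuple


def length_changes(
    words: Iterable[str],
    normalize: Callable[[str], str],
) -> Dict[str, Tuple[str, int]]:
    """Map each word whose length changes to (normalized word, length delta).

    Hypothetical helper: words whose length is unchanged are left out.
    """
    changes = {}
    for word in words:
        normalized = normalize(word)
        delta = len(normalized) - len(word)
        if delta != 0:
            changes[word] = (normalized, delta)
    return changes
```

Calling `length_changes(words, unidecode)` on the words of a page would list the strings that end up longer or shorter once synthesized.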

Closes #495

Any feedback is welcome!

@fg-mindee fg-mindee added type: bug Something isn't working module: utils Related to doctr.utils labels Sep 28, 2021
@fg-mindee fg-mindee added this to the 0.4.0 milestone Sep 28, 2021
@fg-mindee fg-mindee self-assigned this Sep 28, 2021
@charlesmindee (Collaborator) left a comment

Thanks for that!

@fg-mindee fg-mindee merged commit 6517000 into main Sep 28, 2021
@fg-mindee fg-mindee deleted the synthesis-fix branch September 28, 2021 14:19

Successfully merging this pull request may close these issues.

[visualization] Some unsupported characters make page synthesis crash
2 participants