
feat: add string translating/encoding utilities #116

Merged: 7 commits into main, Mar 9, 2021
Conversation

charlesmindee
Collaborator

This PR is linked to issue #115. It implements:

  • a translate function to translate a str sequence into a str sequence of another vocabulary
  • an encode function to encode a str sequence into an array to be fed to the model
  • a decode function to decode an array of ints back into a str sequence

A dictionary of vocabularies is also provided.

Any feedback is welcome!
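As a rough illustration of the encode/decode pair described above (the names, signatures, and sample vocab here are assumptions for illustration, not the actual doctr.datasets code):

```python
# Hypothetical sketch of the described utilities; the merged doctr
# implementation may differ in names and signatures.
vocabs = {
    "digits": "0123456789",
    "lowercase_latin": "abcdefghijklmnopqrstuvwxyz",
}

def encode_sequence(text: str, vocab: str) -> list:
    """Encode a str sequence as a list of int indices into `vocab`."""
    return [vocabs[vocab].index(char) for char in text]

def decode_sequence(indices: list, vocab: str) -> str:
    """Decode a list of int indices back into a str sequence."""
    return "".join(vocabs[vocab][idx] for idx in indices)
```

For example, `encode_sequence("cab", "lowercase_latin")` yields `[2, 0, 1]`, and `decode_sequence` inverts it.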

@charlesmindee charlesmindee added the module: datasets Related to doctr.datasets label Mar 8, 2021
@charlesmindee charlesmindee added this to the 0.2.0 milestone Mar 8, 2021
@charlesmindee charlesmindee self-assigned this Mar 8, 2021
@codecov

codecov bot commented Mar 8, 2021

Codecov Report

Merging #116 (bf43dba) into main (a585f03) will decrease coverage by 0.11%.
The diff coverage is 93.93%.


@@            Coverage Diff             @@
##             main     #116      +/-   ##
==========================================
- Coverage   97.53%   97.41%   -0.12%     
==========================================
  Files          29       32       +3     
  Lines         972     1005      +33     
==========================================
+ Hits          948      979      +31     
- Misses         24       26       +2     
Flag Coverage Δ
unittests 97.41% <93.93%> (-0.12%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
doctr/datasets/utils.py 92.59% <92.59%> (ø)
doctr/datasets/__init__.py 100.00% <100.00%> (ø)
doctr/datasets/vocabs.py 100.00% <100.00%> (ø)

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a585f03...86ed165.

Contributor

@fg-mindee fg-mindee left a comment


Thanks for the PR! The general design is quite good, I added a few optimization suggestions!

doctr/datasets/utils.py: 4 review threads (outdated, resolved)
Comment on lines 50 to 53
char = unicodedata.normalize('NFD', char).encode('ascii', 'ignore').decode('ascii')
if char == '' or char not in vocabs[vocab]:
# if normalization fails or char still not in vocab, return a black square (unknown symbol)
char = '■'
Contributor


Mmmh, don't you think it would be safer for us to hardcode the conversion between vocabs?

Collaborator Author


Suppose you have a dataset containing a word like: ¢¾©téØßřůëðÛ@læ5€ë𮵶
You need to map each character to our vocab. I agree it would be ideal to control the conversion manually, but there are hundreds of different characters (Russian, Greek, math symbols, punctuation, shapes, Norwegian, Chinese, ...), and each of them has several possible matches in each vocab, so the attribution task seems hardly achievable by hand.
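The normalization fallback under discussion can be sketched as a standalone function (the `translate` name, sample vocab, and default unknown character here are assumptions for illustration, not the merged code):

```python
import unicodedata

# Sample target vocab (assumption); doctr ships a dict of such vocabs.
VOCAB = "abcdefghijklmnopqrstuvwxyz"

def translate(text: str, vocab: str = VOCAB, unknown_char: str = '■') -> str:
    """Map each character of `text` onto `vocab`, stripping accents via
    NFD normalization and falling back to `unknown_char`."""
    translated = ""
    for char in text:
        if char not in vocab:
            # Decompose accented characters, then drop the combining marks
            char = (unicodedata.normalize('NFD', char)
                    .encode('ascii', 'ignore').decode('ascii'))
            if char == '' or char not in vocab:
                # normalization failed or char still not in vocab
                char = unknown_char
        translated += char
    return translated
```

For instance, `translate("té5")` maps `é` to `e` via NFD decomposition and replaces the out-of-vocab `5` with `■`.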

doctr/datasets/utils.py: 5 review threads (outdated, resolved)
Contributor

@fg-mindee fg-mindee left a comment


I suggested a small refactoring, let me know what you think!

doctr/datasets/utils.py: 2 review threads (outdated, resolved)
Contributor

@fg-mindee fg-mindee left a comment


Let's add a default value to the unknown character and we're good to go!

doctr/datasets/utils.py: 2 review threads (outdated, resolved)
Contributor

@fg-mindee fg-mindee left a comment


Thanks for the edits!

@fg-mindee fg-mindee merged commit 3f81f5c into main Mar 9, 2021
@fg-mindee fg-mindee deleted the str_encoding branch March 9, 2021 10:35
Labels
module: datasets Related to doctr.datasets