
feat: add string translating/encoding utilities #116

Merged: 7 commits into main, Mar 9, 2021
Conversation

charlesmindee
Collaborator

This PR is linked to issue #115. It implements:

  • a translate function to translate a str sequence into a str sequence of another vocabulary
  • an encode function to encode a str sequence into an array to be fed to the model
  • a decode function to decode an array of ints back into a str sequence

A dictionary of vocabularies is also provided.

Any feedback is welcome!
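As a rough illustration of the encode/decode pair described above (the names, signatures, and sample vocab here are assumptions for illustration, not the actual doctr.datasets code):

```python
# Hypothetical sketch of the described utilities; the merged doctr
# implementation may differ in names and signatures.
vocabs = {
    "digits": "0123456789",
    "lowercase_latin": "abcdefghijklmnopqrstuvwxyz",
}

def encode_sequence(text: str, vocab: str) -> list:
    """Encode a str sequence as a list of int indices into `vocab`."""
    return [vocabs[vocab].index(char) for char in text]

def decode_sequence(indices: list, vocab: str) -> str:
    """Decode a list of int indices back into a str sequence."""
    return "".join(vocabs[vocab][idx] for idx in indices)
```

For example, `encode_sequence("cab", "lowercase_latin")` yields `[2, 0, 1]`, and `decode_sequence` inverts it.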

@charlesmindee charlesmindee added the module: datasets Related to doctr.datasets label Mar 8, 2021
@charlesmindee charlesmindee added this to the 0.2.0 milestone Mar 8, 2021
@charlesmindee charlesmindee self-assigned this Mar 8, 2021
@codecov

codecov bot commented Mar 8, 2021

Codecov Report

Merging #116 (bf43dba) into main (a585f03) will decrease coverage by 0.11%.
The diff coverage is 93.93%.


@@            Coverage Diff             @@
##             main     #116      +/-   ##
==========================================
- Coverage   97.53%   97.41%   -0.12%     
==========================================
  Files          29       32       +3     
  Lines         972     1005      +33     
==========================================
+ Hits          948      979      +31     
- Misses         24       26       +2     
Flag Coverage Δ
unittests 97.41% <93.93%> (-0.12%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
doctr/datasets/utils.py 92.59% <92.59%> (ø)
doctr/datasets/__init__.py 100.00% <100.00%> (ø)
doctr/datasets/vocabs.py 100.00% <100.00%> (ø)

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a585f03...86ed165.

Contributor

@fg-mindee fg-mindee left a comment


Thanks for the PR! The general design is quite good, I added a few optimization suggestions!

doctr/datasets/utils.py: 4 review threads (outdated, resolved)
Comment on lines 50 to 53
char = unicodedata.normalize('NFD', char).encode('ascii', 'ignore').decode('ascii')
if char == '' or char not in vocabs[vocab]:
# if normalization fails or char still not in vocab, return a black square (unknown symbol)
char = '■'
Contributor


Mmmh, don't you think it would be safer for us to hardcode the conversion between vocabs?

Collaborator Author


Suppose you have a dataset containing a word like: ¢¾©téØßřůëðÛ@læ5€ë𮵶
You need to map each character to our vocab. I agree it would be ideal to control the conversion manually, but there are hundreds of different characters (Russian, Greek, math symbols, punctuation, shapes, Norwegian, Chinese, ...), and each of them has several possible matches in each vocab, so the attribution task seems hardly achievable by hand.
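The normalization fallback under discussion can be sketched as a standalone function (the `translate` name, sample vocab, and default unknown character here are assumptions for illustration, not the merged code):

```python
import unicodedata

# Sample target vocab (assumption); doctr ships a dict of such vocabs.
VOCAB = "abcdefghijklmnopqrstuvwxyz"

def translate(text: str, vocab: str = VOCAB, unknown_char: str = '■') -> str:
    """Map each character of `text` onto `vocab`, stripping accents via
    NFD normalization and falling back to `unknown_char`."""
    translated = ""
    for char in text:
        if char not in vocab:
            # Decompose accented characters, then drop the combining marks
            char = (unicodedata.normalize('NFD', char)
                    .encode('ascii', 'ignore').decode('ascii'))
            if char == '' or char not in vocab:
                # normalization failed or char still not in vocab
                char = unknown_char
        translated += char
    return translated
```

For instance, `translate("té5")` maps `é` to `e` via NFD decomposition and replaces the out-of-vocab `5` with `■`.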

doctr/datasets/utils.py: 5 review threads (outdated, resolved)
Contributor

@fg-mindee fg-mindee left a comment


I suggested a small refactoring, let me know what you think!

doctr/datasets/utils.py: 2 review threads (outdated, resolved)
Contributor

@fg-mindee fg-mindee left a comment


Let's add a default value to the unknown character and we're good to go!

doctr/datasets/utils.py: 2 review threads (outdated, resolved)
Contributor

@fg-mindee fg-mindee left a comment


Thanks for the edits!

@fg-mindee fg-mindee merged commit 3f81f5c into main Mar 9, 2021
@fg-mindee fg-mindee deleted the str_encoding branch March 9, 2021 10:35
Labels
module: datasets Related to doctr.datasets