feat: add string translating/encoding utilities #116
Conversation
Codecov Report
@@ Coverage Diff @@
## main #116 +/- ##
==========================================
- Coverage 97.53% 97.41% -0.12%
==========================================
Files 29 32 +3
Lines 972 1005 +33
==========================================
+ Hits 948 979 +31
- Misses 24 26 +2
Thanks for the PR! The general design is quite good; I added a few optimization suggestions!
doctr/datasets/utils.py
char = unicodedata.normalize('NFD', char).encode('ascii', 'ignore').decode('ascii')
if char == '' or char not in vocabs[vocab]:
    # if normalization fails or char still not in vocab, return a black square (unknown symbol)
    char = '■'
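A minimal, self-contained sketch of how this NFD fallback behaves (the `translate_char` name and the vocab contents here are assumptions for illustration, not the PR's actual API):

```python
import unicodedata

# Hypothetical vocab registry for illustration
vocabs = {'lowercase': set('abcdefghijklmnopqrstuvwxyz')}

def translate_char(char: str, vocab: str) -> str:
    """Map a character into the target vocab, falling back to '■' when impossible."""
    if char in vocabs[vocab]:
        return char
    # Decompose accents (NFD), then drop any non-ASCII code points
    char = unicodedata.normalize('NFD', char).encode('ascii', 'ignore').decode('ascii')
    if char == '' or char not in vocabs[vocab]:
        # normalization failed or result still outside the vocab: unknown symbol
        char = '■'
    return char

print(translate_char('é', 'lowercase'))  # 'e' (accent stripped)
print(translate_char('€', 'lowercase'))  # '■' (no ASCII decomposition)
```

Accented Latin letters decompose into a base letter plus combining marks, so the ASCII re-encoding recovers them; symbols like '€' have no decomposition and hit the fallback.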
Mmmh, don't you think it would be safer for us to hardcode the conversion between vocabs?
Suppose you have a dataset with such a word: ¢¾©téØßřůëðÛ@læ5€ë𮵶
You need to map each character to our vocab. I agree it would be perfect to control it manually, but there are hundreds of different characters (Russian, Greek, math, symbols, punctuation, shapes, Norwegian, Chinese, ...), and each of them has several possible matches in each vocab, so the attribution task seems hardly achievable manually.
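To illustrate why automatic normalization scales better than a hand-written mapping table, here is a small sketch using the sample word from the comment above; it only relies on the standard `unicodedata` module:

```python
import unicodedata

word = '¢¾©téØßřůëðÛ@læ5€ë𮵶'
for char in word:
    # Same strategy as the PR: NFD decomposition, then drop non-ASCII code points
    ascii_form = unicodedata.normalize('NFD', char).encode('ascii', 'ignore').decode('ascii')
    print(repr(char), '->', repr(ascii_form) if ascii_form else 'no ASCII equivalent')
```

Characters with canonical decompositions (ř, ů, ë, Û, é, ...) fall out to their base letters for free; only the genuinely unmappable ones (¢, ß, €, ...) need an explicit fallback, which keeps the manual work bounded.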
I suggested a small refactoring, let me know what you think!
Let's add a default value to the unknown character and we're good to go!
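A default for the unknown character could look like the sketch below (the `translate` signature is an assumption based on this suggestion, not the merged API):

```python
import unicodedata

def translate(input_string: str, vocab: str, unknown_char: str = '■') -> str:
    """Translate every character of input_string into the target vocab,
    replacing unmappable characters with unknown_char."""
    translated = ''
    for char in input_string:
        if char not in vocab:
            # Try ASCII normalization before giving up on the character
            char = unicodedata.normalize('NFD', char).encode('ascii', 'ignore').decode('ascii')
            if char == '' or char not in vocab:
                char = unknown_char
        translated += char
    return translated

print(translate('café!', 'abcdefghijklmnopqrstuvwxyz'))  # 'cafe■'
```

Making `unknown_char` a keyword argument lets callers pick a placeholder that is guaranteed to be outside (or inside) their vocab as needed.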
Thanks for the edits!
This PR is linked to issue #115; it implements:
A dictionary of vocabularies is also provided.
Any feedback is welcome!