New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add toggle to turn off strip_accents
.
#88
Conversation
@stefan-it Since you are also training language models for German language (and many others): could you please also have a look onto this PR and say what you think? Many thanks. |
Are there any concerns or questions about this PR? Is there any reason not to accept it? |
LGTM, I tested it with the input: ÖÄÜ?
ßöäü Together with ['ö', '##ä', '##ü', '?']
['ß', '##ö', '##ä', '##ü'] whereas the ['o', '##au', '?']
['ß', '##oa', '##u'] |
@stefan-it many thanks for your evaluation. ;-) |
Hi Electra Team. It would be awesome if you could merge this PR or give me a hint what you are still missing. Many thanks |
Dear maintainers. Could you please have a look on this PR? Thanks... |
Hey Google Research team. Could you please check this PR? @clarkkev |
Hello dear friends from Google research - a friendly reminder to check this PR and maybe merge it. |
Hi @PhilipMay, thanks for the contribution and sorry for the delay. We will take a look shortly. |
parser.set_defaults(do_lower_case=True) | ||
parser.set_defaults(strip_accents=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we set the default to False here and other places to preserve the original behavior?
I added a comment above. I think the original behavior is preserved as it is now with
|
Original model is uncased, so yeah I think these default are correct 👍 |
Awesome - thanks for your review @stefan-it . |
Ah that's right. We stripped the accents by default. |
In some languages like German the accents are important and change the sementics. Examples:
But when doing
lower_case
they are automatically always stripped.This PR adds a toggle to make it possible to do
lower_case
but keep the accents. This conforms to thetransformers.tokenization_bert.BertTokenizerFast
which also has an boolean parameter calledstrip_accents
.