Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add toggle to turn off strip_accents. #88

Merged
merged 14 commits into from Mar 31, 2021

Conversation

PhilipMay
Copy link
Contributor

In some languages like German the accents are important and change the sementics. Examples:

  1. mochte vs. möchte
  2. musste vs. müsste
  3. etc.

But when doing lower_case they are automatically always stripped.

This PR adds a toggle to make it possible to do lower_case but keep the accents. This conforms to the transformers.tokenization_bert.BertTokenizerFast which also has an boolean parameter called strip_accents.

@PhilipMay
Copy link
Contributor Author

@stefan-it Since you are also training language models for German language (and many others): could you please also have a look onto this PR and say what you think? Many thanks.

@PhilipMay
Copy link
Contributor Author

PhilipMay commented Aug 8, 2020

Are there any concerns or questions about this PR? Is there any reason not to accept it?
Thanks
Philip

@stefan-it
Copy link
Contributor

LGTM, I tested it with the input:

ÖÄÜ?
ßöäü

Together with --do-lower-case and --no-strip-accents the output from bert_tokens = self._tokenizer.tokenize(line) looks like:

['ö', '##ä', '##ü', '?']
['ß', '##ö', '##ä', '##ü']

whereas the --do-strip-accents option outputs:

['o', '##au', '?']
['ß', '##oa', '##u']

@PhilipMay
Copy link
Contributor Author

@stefan-it many thanks for your evaluation. ;-)

@PhilipMay
Copy link
Contributor Author

Hi Electra Team. It would be awesome if you could merge this PR or give me a hint what you are still missing.
The same change has been made my be to Hugging Face Transformers (and merged): huggingface/transformers#6280

Many thanks
Philip

@PhilipMay
Copy link
Contributor Author

Dear maintainers. Could you please have a look on this PR? Thanks...

@PhilipMay
Copy link
Contributor Author

Hey Google Research team. Could you please check this PR? @clarkkev

@PhilipMay
Copy link
Contributor Author

Hello dear friends from Google research - a friendly reminder to check this PR and maybe merge it.
Thanks
Philip

@lmthang
Copy link

lmthang commented Mar 30, 2021

Hi @PhilipMay, thanks for the contribution and sorry for the delay. We will take a look shortly.

parser.set_defaults(do_lower_case=True)
parser.set_defaults(strip_accents=True)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we set the default to False here and other places to preserve the original behavior?

@PhilipMay
Copy link
Contributor Author

I added a comment above. I think the original behavior is preserved as it is now with

  • do_lower_case=True
  • strip_accents=True
    as defaults.

@stefan-it
Copy link
Contributor

Original model is uncased, so yeah I think these default are correct 👍

@PhilipMay
Copy link
Contributor Author

Awesome - thanks for your review @stefan-it .

@lmthang
Copy link

lmthang commented Mar 31, 2021

Ah that's right. We stripped the accents by default.

@clarkkev clarkkev merged commit 8a46635 into google-research:master Mar 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants