Add toggle to turn off `strip_accents`. #88

PhilipMay · 2020-08-01T09:04:35Z

In some languages like German the accents are important and change the semantics. Examples:

mochte vs. möchte
musste vs. müsste
etc.

But when do_lower_case is enabled, the accents are always stripped automatically.

This PR adds a toggle that makes it possible to lowercase the text but keep the accents. This matches transformers.tokenization_bert.BertTokenizerFast, which also has a boolean parameter called strip_accents.
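
A minimal sketch of what the toggle controls, assuming the usual BERT-style normalization (lowercase, then NFD-decompose and drop combining marks). The names `run_strip_accents` and `normalize_token` are illustrative and not taken from this PR's diff, and treating the two steps as fully independent is an assumption:

```python
import unicodedata


def run_strip_accents(text):
    """Drop combining marks (Unicode category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize_token(token, do_lower_case=True, strip_accents=True):
    """Hypothetical helper: lowercasing and accent stripping as separate steps."""
    if do_lower_case:
        token = token.lower()
    if strip_accents:
        token = run_strip_accents(token)
    return token


# With both steps on, "möchte" and "mochte" collapse to the same string.
print(normalize_token("Möchte"))
# With the toggle off, the umlaut survives lowercasing.
print(normalize_token("Möchte", strip_accents=False))
```

Note that "ß" has no Unicode decomposition, so it passes through accent stripping unchanged.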

PhilipMay · 2020-08-01T09:06:37Z

@stefan-it Since you are also training language models for German (and many other languages): could you please have a look at this PR and say what you think? Many thanks.

PhilipMay · 2020-08-08T11:35:53Z

Are there any concerns or questions about this PR? Is there any reason not to accept it?
Thanks
Philip

stefan-it · 2020-08-11T12:35:32Z

LGTM, I tested it with the input:

ÖÄÜ?
ßöäü

Together with `--do-lower-case` and `--no-strip-accents` the output from `bert_tokens = self._tokenizer.tokenize(line)` looks like:

['ö', '##ä', '##ü', '?']
['ß', '##ö', '##ä', '##ü']

whereas the `--do-strip-accents` option outputs:

['o', '##au', '?']
['ß', '##oa', '##u']

PhilipMay · 2020-08-11T18:10:50Z

@stefan-it many thanks for your evaluation. ;-)

PhilipMay · 2020-08-30T17:22:06Z

Hi Electra Team. It would be awesome if you could merge this PR or give me a hint about what is still missing.
The same change has been made by me to Hugging Face Transformers (and merged): huggingface/transformers#6280

Many thanks
Philip

PhilipMay · 2020-11-01T20:19:26Z

Dear maintainers. Could you please have a look at this PR? Thanks...

PhilipMay · 2020-12-04T18:37:44Z

Hey Google Research team. Could you please check this PR? @clarkkev

PhilipMay · 2021-03-20T15:45:42Z

Hello dear friends from Google research - a friendly reminder to check this PR and maybe merge it.
Thanks
Philip

lmthang · 2021-03-30T17:41:33Z

Hi @PhilipMay, thanks for the contribution and sorry for the delay. We will take a look shortly.

lmthang · 2021-03-30T17:43:52Z

build_openwebtext_pretraining_dataset.py

  parser.set_defaults(do_lower_case=True)
+  parser.set_defaults(strip_accents=True)


Should we set the default to False here and in other places to preserve the original behavior?

PhilipMay · 2021-03-30T19:02:33Z

I added a comment above. I think the original behavior is preserved as it is now with

do_lower_case=True
strip_accents=True
as defaults.

stefan-it · 2021-03-30T19:36:14Z

The original model is uncased, so yeah, I think these defaults are correct 👍

PhilipMay · 2021-03-30T19:58:02Z

Awesome - thanks for your review @stefan-it .

lmthang · 2021-03-31T22:38:44Z

Ah that's right. We stripped the accents by default.
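
For illustration, the defaults discussed above can be sketched with `argparse` as follows. The flag names `--do-lower-case`, `--do-strip-accents` and `--no-strip-accents` appear in this thread; `--no-lower-case`, the help texts and the `build_parser` wrapper are assumptions made for the sketch, not a copy of the merged code:

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the paired on/off flags from the thread."""
    parser = argparse.ArgumentParser(description="Sketch of the tokenizer flags")
    parser.add_argument("--do-lower-case", dest="do_lower_case",
                        action="store_true", help="Lower case input text.")
    # Assumed counterpart flag; not mentioned in this thread.
    parser.add_argument("--no-lower-case", dest="do_lower_case",
                        action="store_false", help="Don't lower case input text.")
    parser.add_argument("--do-strip-accents", dest="strip_accents",
                        action="store_true", help="Strip accents (default).")
    parser.add_argument("--no-strip-accents", dest="strip_accents",
                        action="store_false", help="Don't strip accents.")
    # Defaults keep the pre-PR behavior: lowercase and strip accents.
    parser.set_defaults(do_lower_case=True)
    parser.set_defaults(strip_accents=True)
    return parser


default_args = build_parser().parse_args([])
keep_accents_args = build_parser().parse_args(["--no-strip-accents"])
print(default_args.do_lower_case, default_args.strip_accents)
print(keep_accents_args.do_lower_case, keep_accents_args.strip_accents)
```

With no flags given, both values stay True, so existing invocations behave as before; only an explicit `--no-strip-accents` changes the result.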

PhilipMay added 14 commits July 31, 2020 13:00

No _run_strip_accents

b414724

add strip_accents param

988a077

Fix bug accessing strip_accents

13b196c

Add strip_acc. opt. to build_pretraining_dataset

229f785

Update build_pretraining_dataset.py

cde962e

Update build_pretraining_dataset.py

719d16b

Update build_pretraining_dataset.py

2923de7

Update build_pretraining_dataset.py

575f96a

rename strip-accents command line argument

3bbce09

code doc for command line params

229efd2

trim trailing whitespace

1cd5c4c

add strip_accents toggle to build_openweb

5e66ae7

command line doc

9dd0028

Docstring for accents

ce4b869

PhilipMay mentioned this pull request Aug 1, 2020

Add option to use fast HF tokenizer. deepset-ai/FARM#482

Merged


stefan-it mentioned this pull request Feb 10, 2021

Infos about the german model dbmdz/berts#26

Closed

lmthang reviewed Mar 30, 2021


lmthang approved these changes Mar 31, 2021


clarkkev merged commit 8a46635 into google-research:master Mar 31, 2021

PhilipMay mentioned this pull request Apr 3, 2021

Add toggle to turn off strip_accents. yitu-opensource/ConvBert#17

Merged
