# Add whole word mask support for lm fine-tune #7925

Merged · 24 commits · Oct 22, 2020

Conversation

@wlhgtc (Contributor) commented Oct 20, 2020

This PR adds support for whole word masking (wwm) when fine-tuning BERT-like models.
It can be divided into two parts: English model support and Chinese model support.

For English, it is simple. The original tokenizer output already contains subword symbols like '##ing', so I just use the same masking strategy as Google's implementation in data_collator.py.

For Chinese, it is harder. We need to rely on a word-level tokenizer, because BERT tokenizes Chinese at the character level. So I do the following to get word-level tokens:

  1. add the code that extracts word-segmentation info in chinese_ref.py
  2. create a new dataset that keeps the reference info in language_modeling.py
  3. create word-level references from that info in data_collator.py

After that, everything is the same as for English.

I also add two parameters (wwm and chinese_ref_path) to the language modeling script.
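For readers unfamiliar with whole word masking, here is a minimal sketch (illustrative only, not the code from this PR or from data_collator.py; names are made up) of how '##'-prefixed WordPiece tokens are grouped so that a word is always masked as a unit:

```python
import random

def whole_word_candidates(tokens):
    """Group WordPiece tokens into whole-word index groups.

    A token starting with '##' continues the previous word, so its index is
    appended to the current group instead of opening a new one.
    """
    cand_indexes = []
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):
            continue
        if cand_indexes and token.startswith("##"):
            cand_indexes[-1].append(i)
        else:
            cand_indexes.append([i])
    return cand_indexes

def whole_word_mask(tokens, mask_prob=0.15):
    """Mask whole words (never a lone sub-piece) until ~mask_prob of tokens are covered."""
    cand_indexes = whole_word_candidates(tokens)
    random.shuffle(cand_indexes)
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    masked = set()
    for word in cand_indexes:
        if len(masked) >= num_to_mask:
            break
        masked.update(word)
    return ["[MASK]" if i in masked else t for i, t in enumerate(tokens)]

print(whole_word_mask(["[CLS]", "fine", "##tun", "##ing", "is", "fun", "[SEP]"]))
```

With character-level Chinese tokens there are no '##' markers, which is exactly why the reference file described later in this thread is needed.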

@wlhgtc (Contributor, Author) commented Oct 20, 2020

It seems all tests passed (except the formatting problem). @sgugger @stas00, could you help review this PR?

@wlhgtc (Contributor, Author) commented Oct 20, 2020

Also, which version of black do you use in check_code_quality? I got errors as follows:

would reformat /home/circleci/transformers/examples/language-modeling/run_language_modeling.py
would reformat /home/circleci/transformers/src/transformers/data/datasets/language_modeling.py

I reformatted my code with black (19.10b0), and all files were left unchanged.

@sgugger (Collaborator) left a comment

Thanks a lot for your PR!

Before digging more into a detailed review, I have a general comment: I think this should be decoupled a bit more: you created a new class LineByLineWithRefDataset, and in the same vein, I think you should create a new DataCollator for the whole-word masking. This will make it clearer to read and easier to customize.

It would also be super nice if you could document in the README how to use your example with a Chinese reference file (do you pass the script you added? or use the script you added to generate a file?)

@wlhgtc (Contributor, Author) commented Oct 20, 2020

> Thanks a lot for your PR!
>
> Before digging more into a detailed review, I have a general comment: I think this should be decoupled a bit more: you created a new class LineByLineWithRefDataset, and in the same vein, I think you should create a new DataCollator for the whole-word masking. This will make it clearer to read and easier to customize.
>
> It would also be super nice if you could document in the README how to use your example with a Chinese reference file (do you pass the script you added? or use the script you added to generate a file?)

Finished ~ @sgugger

@sgugger (Collaborator) left a comment

Thanks for addressing my initial comments!
I added a few rewording suggestions, and the last thing left is to rename the argument wwm: it is not very informative, and we usually have more descriptive argument names (mlm is not ideal either; I'll fix that one in another PR). Could you replace all wwm by whole_word_mask? Thanks!

```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw
export REF_FILE=/path/to/ref.txt

python run_language_modeling.py \
```
Collaborator:

Can we leave one version with just mlm and no wwm first?

Contributor (Author):

Sure, I updated the README; with only mlm it is the same as the English version.
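For comparison, a plain MLM run without whole-word masking or a reference file might look roughly like the following (a sketch based on the existing example script; the model and paths are placeholders):

```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw

python run_language_modeling.py \
    --output_dir=output \
    --model_type=bert \
    --model_name_or_path=bert-base-uncased \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm
```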

@@ -143,6 +149,16 @@ def get_dataset(
):
def _dataset(file_path):
if args.line_by_line:
if args.chinese_ref_file:
if not args.wwm or args.mlm:
Collaborator:

The test is not consistent with the error message that wants both of those to be True.

Contributor (Author):

Fixed.
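For illustration, a check that is consistent with an error message requiring both flags could look like this (just a sketch using the later whole_word_mask argument name, not the PR's exact code):

```python
# Sketch: the Chinese reference file only makes sense when both
# whole-word masking and MLM are enabled, so fail fast otherwise.
if args.chinese_ref_file is not None:
    if not (args.whole_word_mask and args.mlm):
        raise ValueError(
            "A Chinese reference file requires both --whole_word_mask and --mlm to be set."
        )
```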

wlhgtc and others added 6 commits on October 21, 2020 (co-authored by Sylvain Gugger).
@stas00 (Contributor) commented Oct 21, 2020

> Also, which version of black do you use in check_code_quality?

$ grep black setup.py
extras["quality"] = ["black >= 20.8b1", "isort >= 5.5.4", "flake8 >= 3.8.3"]

@wlhgtc (Contributor, Author) commented Oct 21, 2020

> Also, which version of black do you use in check_code_quality?
>
> $ grep black setup.py
> extras["quality"] = ["black >= 20.8b1", "isort >= 5.5.4", "flake8 >= 3.8.3"]

Thanks!

@sgugger (Collaborator) commented Oct 21, 2020

Looking good to me except for the code quality. If you don't manage to fix it, I can force-push on your branch.

@wlhgtc (Contributor, Author) commented Oct 21, 2020

> Looking good to me except for the code quality. If you don't manage to fix it, I can force-push on your branch.

OK, I tried to fix it but failed :(

@sgugger (Collaborator) commented Oct 21, 2020

Just made the necessary change. Note that it wasn't styling that caused the issue but code quality in general: make quality was erroring and telling you to run make fix-copies (which I did).
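For anyone hitting the same CI check, the usual local sequence (assuming the repository's standard Makefile targets and the quality extras shown above) is roughly:

```bash
# install the pinned formatting/linting tools (black, isort, flake8)
pip install -e ".[quality]"

# auto-format the code, then run the same checks CI runs
make style
make quality

# if make quality complains about out-of-sync copied code, regenerate it
make fix-copies
```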

@LysandreJik (Member) left a comment

This mostly looks good to me except I don't fully understand why we need the reference file. What's LTP? Why do we need reference files? Can this be explained in the README?

@wlhgtc (Contributor, Author) commented Oct 22, 2020

> This mostly looks good to me except I don't fully understand why we need the reference file. What's LTP? Why do we need reference files? Can this be explained in the README?

Thanks for your question.
Q: Why a ref file?
A: Suppose we have a Chinese sentence like 我喜欢你。 The original Chinese BERT tokenizes it at the character level as ['我','喜','欢','你'].
But 喜欢 is actually one whole word. For the whole word masking strategy, we need a result like ['我','喜','##欢','你'].
So we need a ref file to tell the model which positions in BERT's original tokens should get a ## prefix.

Q: Why LTP?
A: Because the best-known Chinese whole-word-masking BERT is https://github.com/ymcui/Chinese-BERT-wwm, which works well on many Chinese tasks such as CLUE (the Chinese GLUE). They use LTP, so if we want to fine-tune their model, we need LTP.

@LysandreJik hope this helps.
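To make the ref-file idea concrete, here is a small self-contained sketch (illustrative only, not the PR's chinese_ref.py; the function and variable names are made up) of how a word-level segmentation can be turned into the positions that should receive a ## prefix:

```python
def ref_positions(bert_tokens, words):
    """Return indices of characters that continue a word (i.e. should get '##').

    bert_tokens: character-level tokens from a Chinese BERT tokenizer.
    words: word-level segmentation of the same sentence (e.g. from LTP).
    """
    refs = []
    idx = 0
    for word in words:
        for k, _ in enumerate(word):
            if k > 0:                 # every char after the first continues the word
                refs.append(idx)
            idx += 1
    return refs

bert_tokens = ["我", "喜", "欢", "你"]
words = ["我", "喜欢", "你"]              # word-level segmentation, e.g. from LTP
print(ref_positions(bert_tokens, words))  # [2] -> '欢' becomes '##欢'
```

The data collator can then treat each such position as a continuation of the previous token when sampling whole words to mask.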

@LysandreJik (Member) left a comment

Okay, I think I understand it better now. Thanks for adding your explanation to the README.

@johnsonice commented:

@wlhgtc ltp is not added to requirements.txt under the examples folder.

@wlhgtc (Contributor, Author) commented Oct 28, 2020

> @wlhgtc ltp is not added to requirements.txt under the examples folder.

Thanks for the notice. I forgot to add it to requirements.txt.
But this is an optional package only needed for Chinese LM fine-tuning (and it could be replaced by another tokenizer); I haven't found a good way to note that :(

@johnsonice commented:

> @wlhgtc ltp is not added to requirements.txt under the examples folder.
>
> Thanks for the notice. I forgot to add it to requirements.txt.
> But this is an optional package only needed for Chinese LM fine-tuning (and it could be replaced by another tokenizer); I haven't found a good way to note that :(

Thanks, I also just tried it: ltp requires transformers==3.2, and I have no idea why, so I had to install ltp without dependencies. Very annoying. By the way, thanks for the excellent work.

One more bug: it looks like when doing eval, the script refers to the ref file for the training data. If I set train_data = test_data, it goes through fine. Did I do something wrong? I am trying to follow your process as closely as I can.

Traceback (most recent call last):
  File "../run_language_modeling.py", line 351, in <module>
    main()
  File "../run_language_modeling.py", line 279, in main
    if training_args.do_eval
  File "../run_language_modeling.py", line 174, in get_dataset
    return _dataset(args.eval_data_file)
  File "../run_language_modeling.py", line 160, in _dataset
    ref_path=args.chinese_ref_file,
  File "/home/chengyu/anaconda3/envs/pytorch_transformer/lib/python3.7/site-packages/transformers/data/datasets/language_modeling.py", line 139, in __init__
    assert len(data) == len(ref)

@wlhgtc (Contributor, Author) commented Oct 28, 2020

  1. Yeah, the LTP version doesn't support the newest transformers. I did the same thing as you.
  2. For the error: it means your dataset has a different length from your ref file (we read both line by line, so a mismatch in line counts causes this). It seems I didn't add an eval_ref_file parameter to data_args, so eval reads train_ref_file and triggers this error.
     I will fix it soon.
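Until that is fixed, a quick pre-flight check (a hypothetical helper, not part of the example script) can confirm that each text file has exactly one ref line per data line before training:

```python
def check_line_counts(data_path, ref_path):
    """Assert the ref file has one line per line of the data file, which is the
    same invariant the dataset enforces with `assert len(data) == len(ref)`."""
    with open(data_path, encoding="utf-8") as f:
        n_data = sum(1 for line in f if line.strip())
    with open(ref_path, encoding="utf-8") as f:
        n_ref = sum(1 for line in f if line.strip())
    assert n_data == n_ref, f"{data_path}: {n_data} lines vs {ref_path}: {n_ref} lines"

# e.g. check both train and eval pairs (paths are placeholders)
check_line_counts("/path/to/train.txt", "/path/to/train_ref.txt")
check_line_counts("/path/to/eval.txt", "/path/to/eval_ref.txt")
```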

fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
* ADD: add whole word mask proxy for both eng and chinese
* MOD: adjust format
* MOD: reformat code
* MOD: update import
* MOD: fix bug
* MOD: add import
* MOD: fix bug
* MOD: decouple code and update readme
* MOD: reformat code
* Update examples/language-modeling/README.md (two commits)
* Update examples/language-modeling/run_language_modeling.py (four commits)
* change wwm to whole_word_mask
* reformat code
* reformat
* format
* Code quality
* ADD: update chinese ref readme
* MOD: small changes
* MOD: small changes2
* update readme

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020