# Add whole word mask support for lm fine-tune #7925

Merged · 24 commits · Oct 22, 2020

Conversation

@wlhgtc (Contributor) commented Oct 20, 2020

This PR adds support for whole word masking (wwm) when fine-tuning BERT-like models.
It can be divided into two parts: English model support and Chinese model support.

For English, it is simple. The original tokenizer output already contains subword symbols like '##ing', so I just use the same masking strategy as Google's implementation in data_collator.py.

For Chinese, it is harder. We need to rely on a word-level tokenizer, because BERT tokenizes Chinese at the character level. So I do the following to get word-level tokens:

  1. add the code that extracts word-segmentation info in chinese_ref.py
  2. create a new dataset that keeps the reference info in language_modeling.py
  3. create word-level references from that info in data_collator.py

After that, everything is the same as for English.

I also add two parameters (wwm and chinese_ref_path) to the language modeling script.
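For readers unfamiliar with whole word masking, here is a minimal sketch (illustrative only, not the code from this PR or from data_collator.py; names are made up) of how '##'-prefixed WordPiece tokens are grouped so that a word is always masked as a unit:

```python
import random

def whole_word_candidates(tokens):
    """Group WordPiece tokens into whole-word index groups.

    A token starting with '##' continues the previous word, so its index is
    appended to the current group instead of opening a new one.
    """
    cand_indexes = []
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):
            continue
        if cand_indexes and token.startswith("##"):
            cand_indexes[-1].append(i)
        else:
            cand_indexes.append([i])
    return cand_indexes

def whole_word_mask(tokens, mask_prob=0.15):
    """Mask whole words (never a lone sub-piece) until ~mask_prob of tokens are covered."""
    cand_indexes = whole_word_candidates(tokens)
    random.shuffle(cand_indexes)
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    masked = set()
    for word in cand_indexes:
        if len(masked) >= num_to_mask:
            break
        masked.update(word)
    return ["[MASK]" if i in masked else t for i, t in enumerate(tokens)]

print(whole_word_mask(["[CLS]", "fine", "##tun", "##ing", "is", "fun", "[SEP]"]))
```

With character-level Chinese tokens there are no '##' markers, which is exactly why the reference file described later in this thread is needed.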

@wlhgtc (Contributor, Author) commented Oct 20, 2020

It seems all tests passed (except the formatting problem). @sgugger @stas00, could you help review this PR?

@wlhgtc (Contributor, Author) commented Oct 20, 2020

Also, which version of black do you use in check_code_quality? I got errors as follows:

would reformat /home/circleci/transformers/examples/language-modeling/run_language_modeling.py
would reformat /home/circleci/transformers/src/transformers/data/datasets/language_modeling.py

I reformatted my code with black (19.10b0), and all files were left unchanged.

@sgugger (Collaborator) left a comment

Thanks a lot for your PR!

Before digging more into a detailed review, I have a general comment: I think this should be decoupled a bit more: you created a new class LineByLineWithRefDataset, and in the same vein, I think you should create a new DataCollator for the whole-word masking. This will make it clearer to read and easier to customize.

It would also be super nice if you could document in the README how to use your example with a Chinese reference file (do you pass the script you added? or use the script you added to generate a file?)

@wlhgtc (Contributor, Author) commented Oct 20, 2020

> Thanks a lot for your PR!
>
> Before digging more into a detailed review, I have a general comment: I think this should be decoupled a bit more: you created a new class LineByLineWithRefDataset, and in the same vein, I think you should create a new DataCollator for the whole-word masking. This will make it clearer to read and easier to customize.
>
> It would also be super nice if you could document in the README how to use your example with a Chinese reference file (do you pass the script you added? or use the script you added to generate a file?)

Finished ~ @sgugger

@sgugger (Collaborator) left a comment

Thanks for addressing my initial comments!
I added a few rewording suggestions, and the last thing left is to rename the argument wwm: it is not very informative, and we usually have more descriptive argument names (mlm is not ideal either; I'll fix that one in another PR). Could you replace all wwm by whole_word_mask? Thanks!

```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw
export REF_FILE=/path/to/ref.txt

python run_language_modeling.py \
```
Collaborator:

Can we leave one version with just mlm and no wwm first?

Contributor (Author):

Sure, I updated the README; with only mlm it is the same as the English version.
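For comparison, a plain MLM run without whole-word masking or a reference file might look roughly like the following (a sketch based on the existing example script; the model and paths are placeholders):

```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw

python run_language_modeling.py \
    --output_dir=output \
    --model_type=bert \
    --model_name_or_path=bert-base-uncased \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm
```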

@@ -143,6 +149,16 @@ def get_dataset(
):
def _dataset(file_path):
if args.line_by_line:
if args.chinese_ref_file:
if not args.wwm or args.mlm:
Collaborator:

The test is not consistent with the error message that wants both of those to be True.

Contributor (Author):

Fixed.
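For illustration, a check that is consistent with an error message requiring both flags could look like this (just a sketch using the later whole_word_mask argument name, not the PR's exact code):

```python
# Sketch: the Chinese reference file only makes sense when both
# whole-word masking and MLM are enabled, so fail fast otherwise.
if args.chinese_ref_file is not None:
    if not (args.whole_word_mask and args.mlm):
        raise ValueError(
            "A Chinese reference file requires both --whole_word_mask and --mlm to be set."
        )
```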

wlhgtc and others added 6 commits on October 21, 2020 (co-authored by Sylvain Gugger).
@stas00 (Contributor) commented Oct 21, 2020

> Also, which version of black do you use in check_code_quality?

$ grep black setup.py
extras["quality"] = ["black >= 20.8b1", "isort >= 5.5.4", "flake8 >= 3.8.3"]

@wlhgtc (Contributor, Author) commented Oct 21, 2020

> Also, which version of black do you use in check_code_quality?
>
> $ grep black setup.py
> extras["quality"] = ["black >= 20.8b1", "isort >= 5.5.4", "flake8 >= 3.8.3"]

Thanks!

@sgugger (Collaborator) commented Oct 21, 2020

Looking good to me except for the code quality. If you don't manage to fix it, I can force-push on your branch.

@wlhgtc (Contributor, Author) commented Oct 21, 2020

> Looking good to me except for the code quality. If you don't manage to fix it, I can force-push on your branch.

OK, I tried to fix it but failed :(

@sgugger (Collaborator) commented Oct 21, 2020

Just made the necessary change. Note that it wasn't styling that caused the issue but code quality in general: make quality was erroring and telling you to run make fix-copies (which I did).
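For anyone hitting the same CI check, the usual local sequence (assuming the repository's standard Makefile targets and the quality extras shown above) is roughly:

```bash
# install the pinned formatting/linting tools (black, isort, flake8)
pip install -e ".[quality]"

# auto-format the code, then run the same checks CI runs
make style
make quality

# if make quality complains about out-of-sync copied code, regenerate it
make fix-copies
```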

@LysandreJik (Member) left a comment

This mostly looks good to me except I don't fully understand why we need the reference file. What's LTP? Why do we need reference files? Can this be explained in the README?

@wlhgtc (Contributor, Author) commented Oct 22, 2020

> This mostly looks good to me except I don't fully understand why we need the reference file. What's LTP? Why do we need reference files? Can this be explained in the README?

Thanks for your question.
Q: Why a ref file?
A: Suppose we have a Chinese sentence like 我喜欢你。 The original Chinese BERT tokenizes it at the character level as ['我','喜','欢','你'].
But 喜欢 is actually one whole word. For the whole word masking strategy, we need a result like ['我','喜','##欢','你'].
So we need a ref file to tell the model which positions in BERT's original tokens should get a ## prefix.

Q: Why LTP?
A: Because the best-known Chinese whole-word-masking BERT is https://github.com/ymcui/Chinese-BERT-wwm, which works well on many Chinese tasks such as CLUE (the Chinese GLUE). They use LTP, so if we want to fine-tune their model, we need LTP.

@LysandreJik hope this helps.
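To make the ref-file idea concrete, here is a small self-contained sketch (illustrative only, not the PR's chinese_ref.py; the function and variable names are made up) of how a word-level segmentation can be turned into the positions that should receive a ## prefix:

```python
def ref_positions(bert_tokens, words):
    """Return indices of characters that continue a word (i.e. should get '##').

    bert_tokens: character-level tokens from a Chinese BERT tokenizer.
    words: word-level segmentation of the same sentence (e.g. from LTP).
    """
    refs = []
    idx = 0
    for word in words:
        for k, _ in enumerate(word):
            if k > 0:                 # every char after the first continues the word
                refs.append(idx)
            idx += 1
    return refs

bert_tokens = ["我", "喜", "欢", "你"]
words = ["我", "喜欢", "你"]              # word-level segmentation, e.g. from LTP
print(ref_positions(bert_tokens, words))  # [2] -> '欢' becomes '##欢'
```

The data collator can then treat each such position as a continuation of the previous token when sampling whole words to mask.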

@LysandreJik (Member) left a comment

Okay, I think I understand it better now. Thanks for adding your explanation to the README.

@johnsonice commented:

@wlhgtc ltp is not added to requirements.txt under the examples folder.

@wlhgtc (Contributor, Author) commented Oct 28, 2020

> @wlhgtc ltp is not added to requirements.txt under the examples folder.

Thanks for the notice. I forgot to add it to requirements.txt.
But this is an optional package only needed for Chinese LM fine-tuning (and it could be replaced by another tokenizer); I haven't found a good way to note that :(

@johnsonice commented:

> @wlhgtc ltp is not added to requirements.txt under the examples folder.
>
> Thanks for the notice. I forgot to add it to requirements.txt.
> But this is an optional package only needed for Chinese LM fine-tuning (and it could be replaced by another tokenizer); I haven't found a good way to note that :(

Thanks, I also just tried it: ltp requires transformers==3.2, and I have no idea why, so I had to install ltp without dependencies. Very annoying. By the way, thanks for the excellent work.

One more bug: it looks like when doing eval, the script refers to the ref file for the training data. If I set train_data = test_data, it goes through fine. Did I do something wrong? I am trying to follow your process as closely as I can.

Traceback (most recent call last):
  File "../run_language_modeling.py", line 351, in <module>
    main()
  File "../run_language_modeling.py", line 279, in main
    if training_args.do_eval
  File "../run_language_modeling.py", line 174, in get_dataset
    return _dataset(args.eval_data_file)
  File "../run_language_modeling.py", line 160, in _dataset
    ref_path=args.chinese_ref_file,
  File "/home/chengyu/anaconda3/envs/pytorch_transformer/lib/python3.7/site-packages/transformers/data/datasets/language_modeling.py", line 139, in __init__
    assert len(data) == len(ref)

@wlhgtc (Contributor, Author) commented Oct 28, 2020

  1. Yeah, the LTP version doesn't support the newest transformers. I did the same thing as you.
  2. For the error: it means your dataset has a different length from your ref file (we read both line by line, so a mismatch in line counts causes this). It seems I didn't add an eval_ref_file parameter to data_args, so eval reads train_ref_file and triggers this error.
     I will fix it soon.
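Until that is fixed, a quick pre-flight check (a hypothetical helper, not part of the example script) can confirm that each text file has exactly one ref line per data line before training:

```python
def check_line_counts(data_path, ref_path):
    """Assert the ref file has one line per line of the data file, which is the
    same invariant the dataset enforces with `assert len(data) == len(ref)`."""
    with open(data_path, encoding="utf-8") as f:
        n_data = sum(1 for line in f if line.strip())
    with open(ref_path, encoding="utf-8") as f:
        n_ref = sum(1 for line in f if line.strip())
    assert n_data == n_ref, f"{data_path}: {n_data} lines vs {ref_path}: {n_ref} lines"

# e.g. check both train and eval pairs (paths are placeholders)
check_line_counts("/path/to/train.txt", "/path/to/train_ref.txt")
check_line_counts("/path/to/eval.txt", "/path/to/eval_ref.txt")
```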

fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
* ADD: add whole word mask proxy for both eng and chinese
* MOD: adjust format
* MOD: reformat code
* MOD: update import
* MOD: fix bug
* MOD: add import
* MOD: fix bug
* MOD: decouple code and update readme
* MOD: reformat code
* Update examples/language-modeling/README.md (two commits)
* Update examples/language-modeling/run_language_modeling.py (four commits)
* change wwm to whole_word_mask
* reformat code
* reformat
* format
* Code quality
* ADD: update chinese ref readme
* MOD: small changes
* MOD: small changes2
* update readme

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020