# Add whole word mask support for lm fine-tune #7925
## Conversation
I wonder which version of black you use in check_code_quality. I reformatted my code with black (19.10b0), and all files were left unchanged.
Thanks a lot for your PR!
Before digging into a detailed review, I have a general comment: I think this should be decoupled a bit more. You created a new class LineByLineWithRefDataset, and in the same vein, I think you should create a new DataCollator for the whole-word masking. This will make it clearer to read and easier to customize.
It would also be super nice if you could document in the README how to use your example with a Chinese reference file (do you pass the script you added, or use the script you added to generate a file?)
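For readers skimming the thread: whole-word masking groups WordPiece subwords back into their original words and masks them together instead of independently. A minimal, self-contained sketch of the idea (an illustration of the concept only, not the collator this PR adds):

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    # Group token indices into words: a "##" token continues the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        if random.random() < mask_prob:
            for i in word:  # mask every subword of the chosen word together
                masked[i] = mask_token
    return masked

print(whole_word_mask(["play", "##ing", "foot", "##ball", "is", "fun"], mask_prob=0.5))
```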
Finished ~ @sgugger
Thanks for addressing my initial comments!
I added a few suggestions of rewording, and the last thing left is to rename the argument wwm: it is not very informative, and we usually have more descriptive argument names (mlm is not ideal either; I'll fix that one in another PR). Could you replace all wwm by whole_word_masking? Thanks!
```bash
export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw
export REF_FILE=/path/to/ref.txt

python run_language_modeling.py \
```
Can we leave one version with just mlm and no wwm first?
Sure, I updated my README; it's the same as the English version if only mlm is used.
```
@@ -143,6 +149,16 @@ def get_dataset(
):
    def _dataset(file_path):
        if args.line_by_line:
            if args.chinese_ref_file:
                if not args.wwm or args.mlm:
```
The test is not consistent with the error message, which wants both of those to be True.
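To spell out the inconsistency, here is a runnable sketch (under the assumption, stated in the review comment, that the error message requires both flags; the names follow the diff above):

```python
from types import SimpleNamespace

args = SimpleNamespace(wwm=True, mlm=True)

# Guard as written in the diff: parses as (not args.wwm) or args.mlm,
# so it fires even when both flags are True.
if not args.wwm or args.mlm:
    print("buggy guard fires for wwm=True, mlm=True")

# Guard consistent with an error message requiring both flags:
if not (args.wwm and args.mlm):
    raise ValueError("Using a Chinese ref file requires both --wwm and --mlm.")
```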
Fixed.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
thx!
Looking good to me except for the code quality. If you don't manage to fix it, I can force-push on your branch.
OK, I tried to fix it but failed :(
Just made the necessary change. Note that it wasn't styling that caused the issue but the code quality in general.
This mostly looks good to me except I don't fully understand why we need the reference file. What's LTP? Why do we need reference files? Can this be explained in the README?
Thanks for your question. Q: Why LTP? Because BERT tokenizes Chinese at the character level, we need a word-level segmenter such as LTP to recover the word boundaries for whole word masking. @LysandreJik, hope this helps.
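For context: LTP (Language Technology Platform) is a Chinese NLP toolkit whose word segmenter supplies the word boundaries that character-level Chinese BERT lacks; the reference file records which character positions continue a word, so the collator can treat them like "##" subwords. A rough sketch of that derivation (the exact file format is an assumption here; the script added in this PR is authoritative):

```python
def ref_indices(segmented_words):
    """Given a sentence split into words by a segmenter (e.g. LTP), return the
    character positions that continue a word (i.e. would carry a '##' mark)."""
    indices, pos = [], 0
    for word in segmented_words:
        indices.extend(range(pos + 1, pos + len(word)))  # non-initial chars
        pos += len(word)
    return indices

# "我喜欢足球" segmented by LTP as ["我", "喜欢", "足球"]:
print(ref_indices(["我", "喜欢", "足球"]))  # [2, 4]
```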
Okay, I think I understand it better now. Thanks for adding your explanation to the README.
@wlhgtc ltp is not added to the requirements.txt under the examples folder.
Thanks for the notice. I forgot to add it to requirements.txt.
Thanks, I also just tried it: ltp requires transformers==3.2, and I have no idea why, so I had to install ltp without its dependencies. Very annoying. By the way, thanks for the excellent work. One more bug: it looks like when doing eval, it refers to the ref file for the training data. If I set train_data = test_data, it goes through fine. Did I do something wrong? I am trying to follow your process as closely as I can.
* ADD: add whole word mask proxy for both eng and chinese
* MOD: adjust format
* MOD: reformat code
* MOD: update import
* MOD: fix bug
* MOD: add import
* MOD: fix bug
* MOD: decouple code and update readme
* MOD: reformat code
* Update examples/language-modeling/README.md
* Update examples/language-modeling/README.md
* Update examples/language-modeling/run_language_modeling.py
* Update examples/language-modeling/run_language_modeling.py
* Update examples/language-modeling/run_language_modeling.py
* Update examples/language-modeling/run_language_modeling.py
* change wwm to whole_word_mask
* reformat code
* reformat
* format
* Code quality
* ADD: update chinese ref readme
* MOD: small changes
* MOD: small changes2
* update readme

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
)" This reverts commit 3f8b9a5.
This PR adds support for a wwm (whole word mask) proxy when fine-tuning BERT-like models.
It can be divided into two parts: English model support and Chinese model support.
For English, it's simple: the original tokenizer output contains subword symbols like '##ing',
so I just use the same mask proxy as Google's in data_collator.py.
For Chinese, it's harder: we need to rely on a (word-level) tokenizer, because BERT is character-level in Chinese.
So I do the following to get word-level tokens:
Then it's all the same as for English.
And I added two parameters (wwm and chinese_ref_path) to the run_language_modeling.py script.
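A rough sketch of how those two options might be wired in (illustrative only; argument names vary across the thread — wwm was later renamed to whole_word_mask, and the ref argument appears as both chinese_ref_path and chinese_ref_file — and the defaults and help text here are assumptions):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--whole_word_mask", action="store_true",
                    help="Mask whole words instead of independent WordPiece tokens.")
parser.add_argument("--chinese_ref_file", type=str, default=None,
                    help="Reference file with word-boundary info produced by LTP.")

# Example invocation with both new options enabled:
args = parser.parse_args(["--whole_word_mask", "--chinese_ref_file", "ref.txt"])
print(args.whole_word_mask, args.chinese_ref_file)
```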