BERT-Base Chinese data details #155
I have some additional questions. Thank you!
@jacobdevlin-google Is there any special preprocessing for BookCorpus? For example, removing TOCs? Also, is a whole book treated as a document, or is every chapter treated as a document?
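For context, the input file that create_pretraining_data.py consumes is plain text with one sentence per line and a blank line between documents, so the "book vs. chapter" question above comes down to where the blank lines are inserted. A minimal sketch, assuming hypothetical `books` and `split_into_sentences` helpers:

```python
# Minimal sketch of writing the pretraining input text file: one sentence per
# line, with a blank line separating documents. Whether a whole book or a
# single chapter counts as one "document" is decided by where the blank lines
# are written. `books` and `split_into_sentences` are hypothetical placeholders.
def write_pretraining_text(books, split_into_sentences, out_path):
    with open(out_path, "w", encoding="utf-8") as f:
        for book in books:                      # treat each book as one document
            for sentence in split_into_sentences(book):
                f.write(sentence + "\n")
            f.write("\n")                       # blank line = document boundary
```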
Hello, I have the same question. Have you tried pre-training on 100M Chinese sentences from scratch using 8*8 GPUs? Could you tell me about the training time? Thank you.
@jacobdevlin-google One question about the wiki Chinese preprocessing: using the wiki Chinese dump, I got 12.5M lines (sentences) after pre-processing. However, the post above said you got 25M lines. Can you let me know what's wrong with my steps? What I did:
@LiweiPeng What script are you using for sentence segmentation? That might lead to a different number of sentences.
@eric-haibin-lin I used something very similar to https://blog.csdn.net/blmoistawinde/article/details/82379256. I added some extra delimiters like ';' as sentence-ending tokens.
I found the reason for my issue. I need to include both the simplified and traditional Chinese versions. That gives 25M in total.
@LiweiPeng Hi, did you successfully reproduce the results of Google's BERT? I was trying to do so, but the model I pretrained scores 1-3 points lower than Google's BERT.
I ran pretraining several times with different parameters. The best result I got was 77.0 on XNLI, very close to the published Google result.
@LiweiPeng Thanks for your reply. I would appreciate it if you could tell me which parameters you changed in your experiments. I would like to give it a try.
The parameters I adjusted are the batch size and the learning rate. The recent paper Reducing BERT Pre-Training Time from 3 Days to 76 Minutes has good research on this topic: https://arxiv.org/abs/1904.00962
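For readers tuning these two hyperparameters together, a common heuristic in the large-batch training literature (not necessarily what was used here) is to scale the learning rate with the square root of the batch-size increase relative to the published BERT pretraining defaults (batch size 256, learning rate 1e-4):

```latex
\eta_{\text{new}} \approx \eta_{\text{base}} \sqrt{\frac{B_{\text{new}}}{B_{\text{base}}}},
\qquad \eta_{\text{base}} = 10^{-4},\; B_{\text{base}} = 256 .
```

At a batch size of 2304 this heuristic would suggest roughly 3e-4, in the same range as the 2.4e-4 reported later in the thread.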
@LiweiPeng Thank you very much. I have read that paper. May I know the batch size and learning rate of the best model you trained?
The batch size I used is 2304 and the learning rate is 2.4e-4. I used 16 V100 GPUs and trained for 400k steps.
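A sketch of how those numbers might map onto run_pretraining.py flag values; the sequence length and masked-prediction count are assumptions, and the thread does not say how the 2304-example batch was split across the 16 GPUs:

```python
# Hypothetical flag values for run_pretraining.py matching the setup described
# above; max_seq_length and max_predictions_per_seq are assumed defaults, not
# values reported in this thread.
pretrain_flags = {
    "train_batch_size": 2304,        # global batch size across all GPUs
    "learning_rate": 2.4e-4,
    "num_train_steps": 400000,
    "max_seq_length": 128,           # assumption
    "max_predictions_per_seq": 20,   # assumption
}

per_gpu_batch = pretrain_flags["train_batch_size"] // 16   # 144 examples per V100
```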
@LiweiPeng Thank you. BTW, which delimiters did you use to split the wiki text (after WikiExtractor processing) into sentences? I used "re.split('([;|;|。|!|!|?|?|;])',line)" but could only get 11.4M lines. I found that the final files contain both simplified and traditional Chinese, so this is not caused by the problem you met before.
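A minimal sentence-splitting sketch along the lines of the regex quoted above. Note that inside a character class the '|' characters are literal pipes rather than alternation, so they are not needed; which punctuation marks are treated as sentence boundaries is what drives the final count. The inclusion of ';' and ';' mirrors the earlier comment that added semicolons as delimiters and is an assumption, not a description of Google's preprocessing.

```python
import re

# Chinese and ASCII sentence-ending punctuation; semicolons included as an
# assumption, following the earlier comment in this thread.
_SENT_DELIMS = re.compile(r'([。!!??;;])')

def split_sentences(line):
    """Split one extracted wiki line into sentences, keeping the delimiters."""
    parts = _SENT_DELIMS.split(line.strip())
    # Re-attach each delimiter to the text that precedes it.
    sentences = [
        (parts[i] + parts[i + 1]).strip()
        for i in range(0, len(parts) - 1, 2)
    ]
    # Keep a trailing fragment that has no closing punctuation.
    if len(parts) % 2 == 1 and parts[-1].strip():
        sentences.append(parts[-1].strip())
    return [s for s in sentences if s]

print(split_sentences("今天天气很好。你吃饭了吗?我们走吧"))
# ['今天天气很好。', '你吃饭了吗?', '我们走吧']
```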
@ItachiUchihaVictor I'm also confused about the number of sentences. Have you figured it out?
Does this mean that pre-training of the Chinese version of BERT only uses the wiki corpus (and not BookCorpus)?
Hi, I have some questions about the details of the Chinese BERT-Base model.
a) Train with a task-specific corpus.
b) Train with a task-specific corpus plus a general corpus such as Wikipedia.
Which way is better?
Thank you in advance!