
BERT-Base Chinese data details #155

Closed
htw2012 opened this issue Nov 21, 2018 · 21 comments

@htw2012

htw2012 commented Nov 21, 2018

Hi, I have some questions about the details of the Chinese BERT-Base model.

  1. Is the model trained on the entire Chinese Wikipedia raw text?
  2. Are there additional pre-processing steps for the raw corpus?
  3. How many sentences (lines) are in the pre-training samples?
  4. How long did it take you to finish the pre-training process?
  5. In addition, if we have a large domain-specific corpus, we could run pre-training in one of two ways:
    a) train with the task-specific corpus only.
    b) train with the task-specific corpus plus a general corpus such as Wikipedia.
    Which way is better?

Thank you in advance!

@jacobdevlin-google
Contributor

  1. Yes, a processed version of it that keeps only the text portions, without formatting, in both traditional and simplified Chinese.
  2. Pre-processed to remove tables/images/formatting.
  3. 25M sentences.
  4. It was done using Google's parallel processing, so it took only a few minutes. Probably a few hours if done on a single machine.
  5. It depends on how big it is. The best approach will probably be to run pre-training first on Wikipedia and then for more epochs on only your corpus. Or even better, to use the models we released and then to run pre-training for more steps (unless you want to do everything from scratch).
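
For a concrete picture of option 5, continued pre-training from the released checkpoint is driven by the repo's create_pretraining_data.py and run_pretraining.py scripts. The sketch below is only an illustration: the chinese_L-12_H-768_A-12/ directory, corpus.txt, and the /tmp/ paths are hypothetical, and the flag values are README-style defaults rather than a tuned recipe.

```python
# Sketch only: continue pre-training BERT-Base Chinese on a domain corpus.
# Paths (chinese_L-12_H-768_A-12/, corpus.txt, /tmp/...) are hypothetical.
import subprocess

BERT_DIR = "chinese_L-12_H-768_A-12"  # unpacked released checkpoint (assumed path)

# Step 1: turn raw text (one sentence per line, blank line between documents)
# into masked-LM / next-sentence-prediction TFRecords.
subprocess.run([
    "python", "create_pretraining_data.py",
    "--input_file=corpus.txt",
    "--output_file=/tmp/tf_examples.tfrecord",
    f"--vocab_file={BERT_DIR}/vocab.txt",
    # set --do_lower_case to match the released checkpoint (see the repo README)
    "--max_seq_length=128",
    "--max_predictions_per_seq=20",
    "--masked_lm_prob=0.15",
    "--dupe_factor=5",
], check=True)

# Step 2: initialise from the released weights and train for more steps.
subprocess.run([
    "python", "run_pretraining.py",
    "--input_file=/tmp/tf_examples.tfrecord",
    "--output_dir=/tmp/pretraining_output",
    "--do_train=True",
    f"--bert_config_file={BERT_DIR}/bert_config.json",
    f"--init_checkpoint={BERT_DIR}/bert_model.ckpt",
    "--train_batch_size=32",
    "--max_seq_length=128",
    "--max_predictions_per_seq=20",
    "--num_train_steps=100000",   # illustrative, not a recommendation
    "--num_warmup_steps=10000",
    "--learning_rate=2e-5",
], check=True)
```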

@chenjiasheng

I have some additional questions.

  1. How long would it take to pre-train 100M sentences (each 1~127 Chinese characters long) from scratch on a Horovod cluster with 8*8 V100 GPUs?
  2. Is it possible to speed up training by grouping sentences so that seq_length is the same within a batch but differs across batches, and making the model accept a variable seq_length?
  3. Should I remove the [CLS] token and segment_ids if my purpose is just a language model?

Thank you!

@jacobdevlin-google
Contributor

  1. Not sure, I've never trained on GPUs.
  2. I would recommend packing multiple sentences until (approximately) the max sequence length, which is what create_pretraining_data.py already does.
  3. It doesn't hurt to include them in case you might want to use the model for other stuff, but if you only care about predicting missing words then it probably doesn't matter. But keep in mind that BERT doesn't give you a true "language model", it just allows you to predict single missing wordpieces.
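
Concretely, the packing described in answer 2 amounts to something like the simplified sketch below; the real create_pretraining_data.py also respects document boundaries, reserves room for the [CLS]/[SEP] tokens, and sometimes emits deliberately shorter sequences.

```python
# Simplified sketch of packing sentences up to a max sequence length.
# Not the actual create_pretraining_data.py logic, just the core idea.
def pack_sentences(tokenized_sentences, max_seq_length=128):
    examples, current, current_len = [], [], 0
    for sent in tokenized_sentences:
        # Flush the current example if adding this sentence would overflow it.
        if current and current_len + len(sent) > max_seq_length:
            examples.append(current)
            current, current_len = [], 0
        current.extend(sent)
        current_len += len(sent)
    if current:
        examples.append(current)
    return examples

# Example: three short "sentences" of token ids packed into fixed-length chunks.
print(pack_sentences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_seq_length=6))
# -> [[1, 2, 3, 4, 5], [6, 7, 8, 9]]
```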

@Continue7777

I have some additional questions.
1. Will the BERT method work with other models, such as a text-RNN or something else?
2. Will it help if we also pre-train on the dataset we use for fine-tuning (about 5M sentences)?
3. In my short_text_classification task, max_sequence_length is only 20. Can I use BERT Chinese?

@chenjiasheng

@Continue7777
I think the answer to all three of your questions is YES.
I also pruned the vocab from 22k+ to 6500, keeping only the most frequently used Chinese characters. You can check my fork.
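
For anyone wondering what such a pruning step might look like, here is a rough sketch, assuming a hypothetical corpus.txt and output path; note that shrinking the vocab also requires slicing the matching rows out of the checkpoint's embedding matrix (or retraining the embeddings) before the model can use the new vocab.

```python
# Rough sketch: prune a BERT vocab to the most frequent characters in a corpus.
# File paths are hypothetical.
from collections import Counter

SPECIAL = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def prune_vocab(vocab_path, corpus_path, out_path, keep=6500):
    with open(vocab_path, encoding="utf-8") as f:
        original = [line.rstrip("\n") for line in f]
    # Count character frequencies over the corpus.
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.strip())
    # Keep only characters that already exist in the original vocab,
    # ordered by frequency, plus the special tokens.
    in_original = set(original)
    frequent = [ch for ch, _ in counts.most_common() if ch in in_original]
    kept = SPECIAL + frequent[: keep - len(SPECIAL)]
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(kept) + "\n")

prune_vocab("chinese_L-12_H-768_A-12/vocab.txt", "corpus.txt", "vocab_6500.txt")
```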

@eric-haibin-lin
Contributor

eric-haibin-lin commented Dec 18, 2018

@jacobdevlin-google Is there any special preprocessing for BookCorpus? For example, removing TOCs? Also, is a book treated as one document, or is every chapter treated as a document?

@y111x

y111x commented Dec 19, 2018

I have some additional questions.

  1. How long would it take to pre-train 100M sentences (each with length 1~127 Chinese characters) from scratch on a horovod cluster with 8*8 V100 GPUs?
  2. Is it possible to accelerate the training speed, by grouping sentences in the way that seq_length is the same inside a batch while differs between different batches, and making the model accept different seq_length?
  3. Should I remove the CLS token and segment_ids if my purpose is just a language model?

Thank you!

Hello, I have the same question as @chenjiasheng above. Have you tried to pre-train 100M Chinese sentences from scratch using 8*8 GPUs? Could you tell me about the training time? Thank you.

@chenjiasheng

@yyx911216
Not yet. I've been busy with the acoustic model. I'd be glad to share any info in the future.

@LiweiPeng

LiweiPeng commented Feb 19, 2019

@jacobdevlin-google One question about wiki Chinese preprocessing: using the wiki Chinese dump, I got 12.5M lines (sentences) after pre-processing. However, the above post said you got 25M lines. Can you let me know what's wrong with my steps?

What I did:

@eric-haibin-lin
Contributor

@LiweiPeng What script are you using for sentence segmentation? That might lead to a different number of sentences.

@LiweiPeng

@eric-haibin-lin I used something very similar to https://blog.csdn.net/blmoistawinde/article/details/82379256. I added some extra delimiters, like ';', as sentence tokens.
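
For anyone comparing sentence counts, a splitter along those lines might look like the sketch below; the delimiter set is only an illustrative guess and directly affects how many sentences you end up with.

```python
# Illustrative Chinese sentence splitter; the delimiter set is a guess at what
# was used here, and changing it changes the resulting sentence count.
import re

_SPLIT_RE = re.compile(r'([。！？!?；;])')

def split_sentences(paragraph):
    parts = _SPLIT_RE.split(paragraph)
    # Re-attach each delimiter to the sentence it terminates.
    sentences = ["".join(pair) for pair in zip(parts[0::2], parts[1::2] + [""])]
    return [s.strip() for s in sentences if s.strip()]

print(split_sentences("今天天气不错。我们去公园吧！好不好？"))
# -> ['今天天气不错。', '我们去公园吧！', '好不好？']
```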

@LiweiPeng

I found the reason for my issue: I need to include both the simplified and traditional Chinese versions. That makes 25M in total.

@ItachiUchihaVictor

> I found the reason for my issue: I need to include both the simplified and traditional Chinese versions. That makes 25M in total.

@LiweiPeng Hi, did you successfully reproduce the results of Google's BERT? I was trying to do so, but the model I pre-trained scores 1-3 points lower than Google's BERT.

@LiweiPeng

I ran pre-training several times with different parameters. The best result I got was XNLI 77.0, very close to the published Google result.

@ItachiUchihaVictor

> I ran pre-training several times with different parameters. The best result I got was XNLI 77.0, very close to the published Google result.

@LiweiPeng Thanks for your reply. I would appreciate it if you could tell me which parameters you changed in your experiments. I would like to give it a try.

@LiweiPeng

The parameters I adjusted were batch size and learning rate. The recent paper "Reducing BERT Pre-Training Time from 3 Days to 76 Minutes" has good research on this topic: https://arxiv.org/abs/1904.00962
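
For context, two common rules of thumb for adjusting the learning rate when the batch size changes, relative to the original BERT recipe (batch size 256, learning rate 1e-4), are linear and square-root scaling. The sketch below is purely illustrative and is neither the recipe used in this thread nor the LAMB optimizer proposed in that paper.

```python
# Rules of thumb for scaling the learning rate with batch size, relative to
# the original BERT pre-training recipe (batch 256, lr 1e-4). Illustration only.
def linear_scaled_lr(batch_size, base_lr=1e-4, base_batch=256):
    return base_lr * batch_size / base_batch

def sqrt_scaled_lr(batch_size, base_lr=1e-4, base_batch=256):
    return base_lr * (batch_size / base_batch) ** 0.5

print(linear_scaled_lr(2304))  # ~9e-4
print(sqrt_scaled_lr(2304))    # ~3e-4
```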

@ItachiUchihaVictor

> The parameters I adjusted were batch size and learning rate. The recent paper "Reducing BERT Pre-Training Time from 3 Days to 76 Minutes" has good research on this topic: https://arxiv.org/abs/1904.00962

@LiweiPeng Thank you very much. I have read that paper. May I know the batch size and learning rate of the best model you trained?

@LiweiPeng

The batch size I used was 2304, with learning rate 2.4e-4. I used 16 V100 GPUs and trained for 400k steps.

@ItachiUchihaVictor

> The batch size I used was 2304, with learning rate 2.4e-4. I used 16 V100 GPUs and trained for 400k steps.

@LiweiPeng Thank you. BTW, which delimiters did you use to split the wiki text (after WikiExtractor processing) into sentences? I used "re.split('([;|;|。|!|!|?|?|;])',line)" but could only get 11.4M lines. I found that the final files contain both simplified and traditional Chinese, so this is not caused by the problem you ran into before.

@light8lee

light8lee commented May 23, 2019

@ItachiUchihaVictor I'm also confused about the number of sentences. Have you figured it out?

@aslicedbread

aslicedbread commented May 24, 2019

> @jacobdevlin-google One question about wiki Chinese preprocessing: using the wiki Chinese dump, I got 12.5M lines (sentences) after pre-processing. However, the above post said you got 25M lines. Can you let me know what's wrong with my steps?

> I found the reason for my issue: I need to include both the simplified and traditional Chinese versions. That makes 25M in total.

@LiweiPeng

Does this mean that pre-training of the Chinese version of BERT only uses the wiki corpus (and not BookCorpus)?
