Wiki Chinese dump preprocessing: #lines not matching #447

Closed
LiweiPeng opened this issue Feb 19, 2019 · 7 comments

Comments

@LiweiPeng

The BERT Chinese model was trained on a Wikipedia dump. According to #155, the original BERT model got 25M lines/sentences after pre-processing.

However, I got 12.5M lines (sentences) after pre-processing. Can someone let me know what could be wrong with my steps?

What I did:

@LiweiPeng
Author

I found the reason for my issue: I need to include both the simplified and traditional Chinese versions. Together that comes to 25M lines.
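For reference, a quick way to check the counts behind this issue is to count lines of the preprocessed files directly (file names are hypothetical; one sentence per line is assumed, as in BERT's preprocessing format):

# Count sentences per version and in total; both versions together should land
# near the ~25M figure reported in #155.
wc -l wiki_simplified.txt wiki_traditional.txt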

@VictorSanh


Hey @LiweiPeng,
I am not sure where to get the simplified Chinese version of Wikipedia.
I can't find a language code (like zh) for simplified Chinese; do I need to use an automatic converter?

Victor

@LiweiPeng
Author

@VictorSanh I used the OpenCC tool (https://github.com/BYVoid/OpenCC) to convert all of the Chinese Wikipedia text into a simplified Chinese version and a traditional Chinese version, then trained BERT on both versions together.

I did other tests where only the simplified Chinese wiki was used. On the XNLI dataset, the results are not as good as using both simplified and traditional.
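For reference, a minimal sketch of this conversion using OpenCC's command-line tool. The file names are hypothetical (wiki.txt stands for the plain-text output of wikiextractor), and this shows one way to produce the two versions rather than LiweiPeng's exact pipeline (see the follow-up questions below):

# Convert the extracted zhwiki text into simplified and traditional versions,
# then train on the concatenation of both.
opencc -i wiki.txt -o wiki_simplified.txt -c t2s.json
opencc -i wiki.txt -o wiki_traditional.txt -c s2t.json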

@VictorSanh

Thank you @LiweiPeng for your answer!
Just to make sure, which config files (the JSON files) are you passing? t2s.json?
Did you use the Taiwan standard?

@LiweiPeng
Author

t2s.json to convert to simplified Chinese, and s2t.json to convert to traditional Chinese.

@VictorSanh

So you are applying t2s.json to the output of wikiextractor to get the simplified Chinese version, and then reapplying s2t.json to that output to get the traditional Chinese version (i.e. a two-step process)?

@notabigfish


Hi, this method works for me, thanks! But I have a few questions. Did you split the tfrecord data into train and test sets?
Here is what I did after generating the simplified and traditional Chinese text files:

  1. Concatenate the two txt files using "cat simplified.txt traditional.txt > all.txt"
  2. Generate tfrecord files using
python create_pretraining_data.py \
  --input_file=data/download/wikicorpus_zh/all.txt \
  --output_file=data/download/wikicorpus_zh/all.tfrecord \
  --vocab_file=data/download/google_pretrained_weights/chinese_L-12_H-768_A-12/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5

In this way, I could train using all.tfrecord as the training set. However, no test or eval set is generated. May I ask how you split zhwiki into different sets? Thank you!
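The thread does not record an answer to the split question, but as a hedged sketch: create_pretraining_data.py expects one sentence per line with blank lines separating documents, so any split is usually done at document boundaries before generating tfrecords. The awk one-liner below (hypothetical file names, roughly 1% of documents held out) is one such approach, not something prescribed in this thread:

# Hold out roughly 1 in 100 blank-line-separated documents as an eval set.
# awk's paragraph mode (RS="") reads one document per record; ORS keeps a
# blank line between documents in the output files.
awk 'BEGIN { RS=""; ORS="\n\n" }
     NR % 100 == 0 { print > "eval.txt"; next }
     { print > "train.txt" }' all.txt
# Then run create_pretraining_data.py once on train.txt and once on eval.txt.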
