Wiki Chinese dump preprocessing: #lines not matching #447

Closed
LiweiPeng opened this issue Feb 19, 2019 · 7 comments

Comments

@LiweiPeng

The BERT Chinese model was trained on a Wikipedia dump. According to #155, the original BERT model got 25M lines/sentences after pre-processing.

However, I got 12.5M lines (sentences) after pre-processing. Can someone let me know what could be wrong with my steps?

What I did:

@LiweiPeng
Author

I found the reason for my issue: I need to include both the simplified and traditional Chinese versions. Together that comes to 25M lines.
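For reference, a quick way to check the counts behind this issue is to count lines of the preprocessed files directly (file names are hypothetical; one sentence per line is assumed, as in BERT's preprocessing format):

# Count sentences per version and in total; both versions together should land
# near the ~25M figure reported in #155.
wc -l wiki_simplified.txt wiki_traditional.txt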

@VictorSanh


Hey @LiweiPeng,
I am not sure where to get the simplified Chinese version of Wikipedia.
I can't find a language code (like zh) for simplified Chinese; do I need to use an automatic converter?

Victor

@LiweiPeng
Author

@VictorSanh I used the OpenCC tool (https://github.com/BYVoid/OpenCC) to convert all of the Chinese Wikipedia text into a simplified Chinese version and a traditional Chinese version, then trained BERT on both versions together.

I did other tests where only the simplified Chinese wiki was used. On the XNLI dataset, the results are not as good as using both simplified and traditional.
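For reference, a minimal sketch of this conversion using OpenCC's command-line tool. The file names are hypothetical (wiki.txt stands for the plain-text output of wikiextractor), and this shows one way to produce the two versions rather than LiweiPeng's exact pipeline (see the follow-up questions below):

# Convert the extracted zhwiki text into simplified and traditional versions,
# then train on the concatenation of both.
opencc -i wiki.txt -o wiki_simplified.txt -c t2s.json
opencc -i wiki.txt -o wiki_traditional.txt -c s2t.json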

@VictorSanh

Thank you @LiweiPeng for your answer!
Just to make sure, which config files (the JSON files) are you passing? t2s.json?
Did you use the Taiwan standard?

@LiweiPeng
Author

t2s.json to convert to simplified Chinese, and s2t.json to convert to traditional Chinese.

@VictorSanh

So you are applying t2s.json to the output of wikiextractor to get the simplified Chinese version, and then reapplying s2t.json to that output to get the traditional Chinese version (i.e. a two-step process)?

@notabigfish


Hi, this method works for me, thanks! But I have a few questions. Did you split the tfrecord data into train and test sets?
Here is what I did after generating the simplified and traditional Chinese text files:

  1. Concatenate the two txt files using "cat simplified.txt traditional.txt > all.txt"
  2. Generate tfrecord files using
python create_pretraining_data.py \
  --input_file=data/download/wikicorpus_zh/all.txt \
  --output_file=data/download/wikicorpus_zh/all.tfrecord \
  --vocab_file=data/download/google_pretrained_weights/chinese_L-12_H-768_A-12/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5

In this way, I could train using all.tfrecord as the training set. However, no test or eval set is generated. May I ask how you split zhwiki into different sets? Thank you!
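The thread does not record an answer to the split question, but as a hedged sketch: create_pretraining_data.py expects one sentence per line with blank lines separating documents, so any split is usually done at document boundaries before generating tfrecords. The awk one-liner below (hypothetical file names, roughly 1% of documents held out) is one such approach, not something prescribed in this thread:

# Hold out roughly 1 in 100 blank-line-separated documents as an eval set.
# awk's paragraph mode (RS="") reads one document per record; ORS keeps a
# blank line between documents in the output files.
awk 'BEGIN { RS=""; ORS="\n\n" }
     NR % 100 == 0 { print > "eval.txt"; next }
     { print > "train.txt" }' all.txt
# Then run create_pretraining_data.py once on train.txt and once on eval.txt.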
