Wiki Chinese dump preprocessing: #lines not matching #447
I found the reason for my issue: I need to include both the simplified and traditional Chinese versions, which together total roughly 25M lines.
Hey @LiweiPeng, Victor
@VictorSanh I used the opencc tool (https://github.com/BYVoid/OpenCC) to convert all of the Wikipedia Chinese text into a simplified Chinese version and a traditional Chinese version, then trained BERT on both versions together. In other tests where only the simplified Chinese wiki was used, the results on the XNLI dataset were not as good as when using both simplified and traditional.
Thank you @LiweiPeng for your answer!
Use t2s.json to convert to simplified Chinese, and s2t.json to convert to traditional Chinese.
So you are applying t2s.json to the output of wikiextractor to get the simplified Chinese version, and then reapplying s2t.json to the latter to get the traditional Chinese version (i.e., a two-step process)?
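For reference, here is a minimal sketch of one possible reading of the conversion step described above, applying both configs independently to the extracted text. It assumes the OpenCC Python bindings (`pip install opencc`); depending on the opencc version, the configs may need to be spelled `t2s`/`s2t` instead of `t2s.json`/`s2t.json`, and `extracted.txt` is a hypothetical file holding the wikiextractor output.

```python
import opencc

# Hypothetical paths: 'extracted.txt' stands for the concatenated
# wikiextractor output; the two output files hold the converted versions.
t2s = opencc.OpenCC('t2s.json')  # convert to simplified Chinese
s2t = opencc.OpenCC('s2t.json')  # convert to traditional Chinese

with open('extracted.txt', encoding='utf-8') as src, \
        open('zhwiki_simplified.txt', 'w', encoding='utf-8') as simp, \
        open('zhwiki_traditional.txt', 'w', encoding='utf-8') as trad:
    for line in src:
        simp.write(t2s.convert(line))
        trad.write(s2t.convert(line))
```

The two output files can then be concatenated and fed to BERT's pre-processing, which is one way to arrive at roughly double the line count of a single version.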
Hi, this method works for me, thanks! But I have a few questions. Did you split the tfrecord files?
This way, I could train using all.tfrecord as the training set; however, no test or eval set is generated. May I ask how you split zhwiki into different sets? Thank you!
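One common workaround (not an answer given in this thread) is to hold out eval/test data at the document level before generating the tfrecords. A minimal sketch, with hypothetical file names and a hypothetical 98/1/1 split ratio:

```python
import random

random.seed(12345)

# 'zhwiki_preprocessed.txt' is a hypothetical file in which documents are
# separated by blank lines (the format create_pretraining_data.py expects).
with open('zhwiki_preprocessed.txt', encoding='utf-8') as f:
    docs = [d for d in f.read().split('\n\n') if d.strip()]

random.shuffle(docs)
n = len(docs)
splits = {
    'train.txt': docs[:int(0.98 * n)],
    'eval.txt':  docs[int(0.98 * n):int(0.99 * n)],
    'test.txt':  docs[int(0.99 * n):],
}
for name, subset in splits.items():
    with open(name, 'w', encoding='utf-8') as out:
        out.write('\n\n'.join(subset) + '\n')
```

Each split can then be run through create_pretraining_data.py separately to produce its own tfrecord file.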
The BERT Chinese model was trained from a Wikipedia dump. According to #155, the original BERT model got 25M lines (sentences) after pre-processing.
However, I got 12.5M lines (sentences) after pre-processing. Can someone let me know what could be wrong with my steps?
What I did: