Fold3D

Usage

We provide several scripts for pretraining BERT, GPT, CPM, T5, and Turing-NLG in the examples directory.

Data Preprocessing

The training data requires preprocessing. First, place your training data in a loose json format, with one json object per line, each containing a text sample. For example:

{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}

The name of the text field of the json can be changed with the --json-key flag in preprocess_data.py. The other metadata fields are optional and are not used in training.
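As a minimal sketch of producing such a file, the Python snippet below writes one json object per line and reads the text field back. The helper names `write_loose_json` and `read_texts` are hypothetical and are not part of this repository; only the one-object-per-line layout and the configurable text field come from the description above.

```python
import json


def write_loose_json(samples, path, text_key="text"):
    """Write one json object per line (hypothetical helper, not part of the repo).

    Every sample must contain the text field; other fields are optional metadata.
    Returns the number of lines written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            if text_key not in sample:
                raise KeyError(f"sample {count} has no '{text_key}' field")
            # json.dumps escapes embedded newlines, so each sample stays on one line.
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_texts(path, text_key="text"):
    """Read the text field of every non-empty line back into a list."""
    texts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                texts.append(json.loads(line)[text_key])
    return texts
```

If the text lives under a different field, pass the same name as `text_key` here and as --json-key to preprocess_data.py.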

The loose json is then processed into a binary format for training. To convert the json into the mmap, cached index file, or lazy loader format, use preprocess_data.py. Set the --dataset-impl flag to mmap, cached, or lazy, respectively (default is mmap). An example script to prepare data for BERT training is:

python tools/preprocess_data.py \
       --input my-corpus.json \
       --output-prefix my-bert \
       --vocab bert-vocab.txt \
       --dataset-impl mmap \
       --tokenizer-type BertWordPieceLowerCase \
       --split-sentences

The output will be two files named, in this case, my-bert_text_sentence.bin and my-bert_text_sentence.idx. The --data-path specified in later BERT training is the full path and new filename, but without the file extension.

For T5, use the same preprocessing as BERT, perhaps changing the output prefix to:

       --output-prefix my-t5 \

Some minor modifications are required for GPT data preprocessing, namely, the addition of a merge table, an end-of-document token, removal of sentence splitting, and a change to the tokenizer type:

python tools/preprocess_data.py \
       --input my-corpus.json \
       --output-prefix my-gpt2 \
       --vocab gpt2-vocab.json \
       --dataset-impl mmap \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file gpt2-merges.txt \
       --append-eod

Here the output files are named my-gpt2_text_document.bin and my-gpt2_text_document.idx. As before, in GPT training, use the longer name without the extension as --data-path.
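The two examples suggest that the output name is built from the output prefix, the json key, and either "sentence" (when --split-sentences is used) or "document". The sketch below encodes that pattern so the value for --data-path can be derived in a launch script. The function `expected_outputs` is hypothetical, and the naming rule is inferred from the two examples here, not confirmed against preprocess_data.py.

```python
def expected_outputs(output_prefix, json_key="text", split_sentences=False):
    """Guess the preprocessed file names (hypothetical helper).

    Assumes the pattern <prefix>_<json key>_<sentence|document>, inferred
    from the BERT and GPT examples in this README.
    """
    level = "sentence" if split_sentences else "document"
    data_path = f"{output_prefix}_{json_key}_{level}"
    # --data-path takes the name without the extension.
    return data_path, [data_path + ".bin", data_path + ".idx"]


if __name__ == "__main__":
    path, files = expected_outputs("my-gpt2")
    print("--data-path", path)
    print("files:", ", ".join(files))
```

Checking that both returned files exist before launching training gives an early error if the prefix or flags do not match what was used during preprocessing.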

Further command line arguments are described in the source file preprocess_data.py.
