Training GPT-2 transformer language model on your own corpora with sentencepiece tokenization.
This repo contains a PyTorch implementation of GPT-2, which supports multi-GPU training.
It also contains a TensorFlow implementation in
but it is not developed any more. They share the same data preparation scripts.
The TF training command is gpt-2-tf-train; it needs TensorFlow 1.13.
The documentation below is for the PyTorch version.
Python 3.6+ is required, with torch nightly or 1.6.0+. Working in a virtualenv is assumed below. Install an appropriate version of PyTorch first, and then:
pip install -r requirements.txt
python setup.py develop
Instructions are below. See also
for a complete pipeline demo on a small corpus (takes a minute on a CPU).
Corpus format: a directory with top-level
folders. Each top-level folder may contain sub-folders. Inside them,
there must be utf-8 encoded text files with
The commands to train the sentencepiece model and encode the corpus support multiple corpora; the examples below assume they can be listed as data/corpora-*.
Train the sentencepiece model (sp-text.txt can be removed after running). This can consume a large amount of memory; adjust sentencepiece arguments as advised if needed (this is not supported in the
sp-train data/corpora-* sp-text.txt sp-model
Encode corpora, producing numpy files:
sp-encode data/corpora-* sp-model.model data/encoded
gpt-2 run-root data/encoded sp-model.model
run-root will contain model checkpoints and JSON-lines logs, which can be plotted in a Jupyter notebook with json_log_plots.plot("run-root"), with the number of tokens seen on the X axis.
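If you want to inspect the logs without the plotting helper, a JSON-lines file can be read with the standard library alone. The sketch below is a minimal reader, not the repo's own code; the function name read_json_lines is hypothetical, and the log file name and record fields inside run-root are not stated here, so check them in your own run directory.

```python
import json
from pathlib import Path


def read_json_lines(path):
    """Parse a JSON-lines file: one JSON object per non-empty line.

    Hypothetical helper for illustration; it makes no assumption about
    which keys the training logs actually contain.
    """
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

Each returned record is a plain dict, so it can be passed to any plotting or data-frame library you prefer.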
Default hyperparameters correspond to the released "small" GPT-2 model.
When multiple GPUs are available, they would be used for training with the
If the path exists and the --clean key is NOT passed, training is resumed.
Note that all parameters still need to be specified, and model parameters need to match.
Notes on training parameters:
--batch-size is per-GPU, so you don't need to re-tune it when changing the number of GPUs; just use the maximum that fits into memory.
--g-accum-gradients is the global number of gradient accumulations; it must be divisible by the number of GPUs. Effective global batch size is always batch_size * g_accum_gradients.
--lr does not need to be changed when changing --g-accum-gradients, the number of GPUs, or --n-ctx: loss is already scaled appropriately.
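The batch-size arithmetic above can be checked with a few lines of Python. This is a sketch, not code from the repo: the function name is hypothetical, and it assumes the global accumulation count is split evenly across GPUs, which is one reading of the divisibility requirement.

```python
def effective_batch(batch_size, g_accum_gradients, n_gpus):
    """Return (accumulation steps per GPU, effective global batch size).

    Assumption: each GPU performs g_accum_gradients / n_gpus accumulation
    steps, each on batch_size samples.
    """
    if g_accum_gradients % n_gpus != 0:
        raise ValueError("g_accum_gradients must be divisible by n_gpus")
    per_gpu_accum = g_accum_gradients // n_gpus
    # Samples per optimizer step: batch_size * n_gpus * per_gpu_accum,
    # which simplifies to batch_size * g_accum_gradients.
    effective = batch_size * n_gpus * per_gpu_accum
    return per_gpu_accum, effective


print(effective_batch(batch_size=2, g_accum_gradients=8, n_gpus=1))  # (8, 16)
print(effective_batch(batch_size=2, g_accum_gradients=8, n_gpus=4))  # (2, 16)
```

The effective batch size is the same in both calls, which is why neither --batch-size nor --lr needs re-tuning when the number of GPUs changes.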
gpt-2-gen run-root "Artificial intelligence"
run-root is the directory that contains the model checkpoints.
"Artificial intelligence" is the text prefix used as a starting point for generating tokens.
Notes on inference parameters:
--tokens-to-generate: number of tokens to generate, default is 42
--top-k: number of token candidates to generate for each position (beam width), default is 8.
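To illustrate what restricting generation to a fixed number of candidates per position can look like, here is a minimal top-k sampling sketch in plain Python. It is not taken from gpt-2-gen and may differ from what that command does; the function name sample_top_k and the use of a plain list of logits are assumptions made for illustration.

```python
import math
import random


def sample_top_k(logits, k, rng=random):
    """Sample one token id from the k highest-scoring entries of `logits`.

    Illustrative only: keeps the k largest logits, applies a softmax over
    them, and draws a single id from that distribution.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    # Indices of the k largest logits, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits (shifted by the max for stability).
    max_logit = logits[top[0]]
    weights = [math.exp(logits[i] - max_logit) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]
```

With k=1 this reduces to greedy decoding; larger k allows more varied continuations at the cost of occasionally picking less likely tokens.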
License is MIT.
TensorFlow GPT-2 model is taken from https://github.com/openai/gpt-2/blob/master/src/model.py and TensorFlow GPT-2 training code is based on https://github.com/nshepperd/gpt-2/blob/finetuning/train.py
PyTorch port is based on original OpenAI code.
Test Shakespeare corpus under
is from http://shakespeare.mit.edu under public domain.