
simple_elmo_training

Minimal code to train ELMo models in TensorFlow.

Heavily based on https://github.com/allenai/bilm-tf .

Most changes are simplifications and updates to recent versions of TensorFlow 1. See also our repository with simple code for inferring contextualized word vectors from pre-trained ELMo models.

Training

python3 bilm/train_elmo.py --train_prefix $DATA --size $SIZE --vocab_file $VOCAB --save_dir $OUT

where

$DATA is a path to the directory containing two or more (possibly gzipped) plain text files: your training corpus.

$SIZE is the number of word tokens in $DATA (necessary to construct and log batches properly).

$VOCAB is a (possibly gzipped) one-word-per-line vocabulary file to be used for language modeling; it should always contain at least <S>, </S> and <UNK>.

$OUT is a directory where the TensorFlow checkpoints will be saved.
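
For example, a training run might look like this (the paths and the token count below are placeholders, not values from this repository):

python3 bilm/train_elmo.py --train_prefix /data/corpus --size 100000000 --vocab_file /data/vocab.txt.gz --save_dir /data/checkpoints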

Before training, please review the settings in bilm/train_elmo.py. The most important are:

  • batch_size (default 128)
  • n_gpus (default 2; if no GPU, all available CPU cores are used)
  • LSTM dimensionality (default 2048; the original paper used 4096)
  • n_epochs (default 3; optimal value depends on the size of your corpus)
  • n_negative_samples_batch (default 4096; the original paper used 8192)
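
These settings map onto an options dictionary inside bilm/train_elmo.py, in the style of the original bilm-tf training script. The sketch below is illustrative only; the exact keys and defaults may differ in this fork, so check the file itself:

# Illustrative sketch of the options dictionary in bilm/train_elmo.py,
# following the bilm-tf layout; verify the actual keys in this fork.
options = {
    "bidirectional": True,
    "dropout": 0.1,
    "lstm": {
        "dim": 2048,  # LSTM dimensionality (the original paper used 4096)
        "n_layers": 2,
        "projection_dim": 512,
        "use_skip_connections": True,
    },
    "n_epochs": 3,  # optimal value depends on corpus size
    "n_negative_samples_batch": 4096,  # the original paper used 8192
    "batch_size": 128,
    "n_train_tokens": 100000000,  # filled from the --size argument
}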

Converting to HDF5

After training, use the bilm/dump_weights.py script to convert the checkpoints to an HDF5 model.

python3 bilm/dump_weights.py --save_dir $MODEL_DIR --outfile $MODEL_DIR/model.hdf5

To use the saved model for inference, save your vocabulary file in the same directory and change the n_characters value in the options.json file from 261 to 262.
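
That edit can be scripted; the snippet below assumes n_characters sits under the char_cnn key, as in the bilm-tf options format:

import json

# Bump n_characters from 261 to 262 in the saved options.json
# (assumes the bilm-tf layout, where it lives under "char_cnn").
with open("options.json") as f:
    options = json.load(f)

options["char_cnn"]["n_characters"] = 262

with open("options.json", "w") as f:
    json.dump(options, f, indent=2)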

More details at https://github.com/allenai/bilm-tf
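
Once converted, the model can be loaded with the simple_elmo inference package mentioned above. A minimal sketch, assuming its documented ElmoModel interface (check the inference repository for the current API):

from simple_elmo import ElmoModel

# Point load() at the directory holding model.hdf5, options.json
# and the vocabulary file.
model = ElmoModel()
model.load("/data/elmo_model")

# Contextualized vectors for a batch of tokenized sentences.
vectors = model.get_elmo_vectors([["Hello", "world"], ["Another", "sentence"]])
print(vectors.shape)  # (n_sentences, max_sentence_length, dimensionality)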
