Builds a wordpiece (subword) vocabulary compatible with Google Research's BERT
kwonmha Fix gfile attribute error, issue #8 #9 -> tensorflow.gfile.Glob
Latest commit 5078b7c Dec 16, 2019

Vocabulary builder for BERT

A modified, simplified version of code from the tensor2tensor library (and its dependencies), adapted so that its output fits Google Research's open-source BERT project.

Although Google released pre-trained BERT models and training scripts, they did not open-source the code that generates a wordpiece (subword) vocabulary matching the vocab.txt shipped with the released models.
The libraries they suggested using instead were, as they themselves mentioned, not compatible with their BERT implementation.
So I modified code from the tensor2tensor library, which is one of the suggestions Google mentioned, to generate a wordpiece vocabulary.


  • The original SubwordTextEncoder adds "_" at the end of subwords that appear in the first position of a word. I changed it to add "_" at the beginning of subwords that follow other subwords, using the _my_escape_token() function, and later substituted "_" with "##".

  • The generated vocabulary contains every character, both on its own and prefixed with "##". For example, a and ##a.

  • Standard special characters such as !?@~ and the special tokens used by BERT (e.g. [SEP], [CLS], [MASK], [UNK]) are added to the vocabulary.

  • Removed irrelevant classes, commented out unused functions (some of which seem to exist for decoding), and removed the mlperf_log module to make this project independent of the tensor2tensor library.
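
The marker change described in the first bullet can be illustrated with a small sketch. This is not the repository's _my_escape_token() function; the helper names mark_continuations and to_bert_style and the example word split are hypothetical, chosen only to show the idea of prefixing continuation subwords with "_" and then rewriting that prefix as "##".

```python
def mark_continuations(pieces, marker="_"):
    """Prefix every subword except the first one in a word with `marker`."""
    return [p if i == 0 else marker + p for i, p in enumerate(pieces)]


def to_bert_style(marked_pieces, marker="_"):
    """Rewrite the leading `marker` of continuation subwords as '##'.

    Assumption: a word-initial subword never starts with `marker` itself.
    """
    return [
        "##" + p[len(marker):] if p.startswith(marker) else p
        for p in marked_pieces
    ]


# Hypothetical split of one word into subwords.
pieces = ["un", "afford", "able"]
marked = mark_continuations(pieces)
bert_pieces = to_bert_style(marked)

print(marked)       # ['un', '_afford', '_able']
print(bert_pieces)  # ['un', '##afford', '##able']
```

The first subword of a word stays unmarked, which is the convention BERT's vocab.txt uses: only subwords that continue a word carry the "##" prefix.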


This project was developed in the following environment:

  • Python 3.6
  • TensorFlow 1.11

Basic usage

python \
--corpus_filepattern "{corpus_for_vocab}" \
--output_filename {name_of_vocab} \
--min_count {minimum_subtoken_counts}
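
Once a vocabulary file has been written, a quick sanity check against the properties listed above may be useful. The sketch below is an illustration, not part of this project: the helper names load_vocab and check_vocab are hypothetical, and it assumes the output file holds one token per line, as BERT's vocab.txt does. Whether punctuation characters are expected to have "##" counterparts is not stated here, so the check only reports what it finds.

```python
# Special tokens named in this README.
SPECIAL_TOKENS = ["[SEP]", "[CLS]", "[MASK]", "[UNK]"]


def load_vocab(path):
    """Read a vocabulary file, assuming one token per line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


def check_vocab(tokens):
    """Return (missing special tokens, single characters lacking a '##' form)."""
    vocab = set(tokens)
    missing_special = [t for t in SPECIAL_TOKENS if t not in vocab]
    single_chars = sorted(t for t in vocab if len(t) == 1)
    missing_suffix = [c for c in single_chars if "##" + c not in vocab]
    return missing_special, missing_suffix


# In-memory example; with a real file, use check_vocab(load_vocab(path)).
example_tokens = ["[UNK]", "[CLS]", "[SEP]", "a", "##a", "b"]
missing_special, missing_suffix = check_vocab(example_tokens)
print(missing_special)  # ['[MASK]']
print(missing_suffix)   # ['b']
```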