Merge subcorpus-specific wordpiece vocabularies #33

Open
jowagner opened this issue Nov 16, 2020 · 5 comments
Labels: idea (Future work idea), project (Suitable for a student or intern project)

Comments

@jowagner (Collaborator) commented Nov 16, 2020

When training on Irish, English and possibly other languages, Chung et al. (2020), "Improving Multilingual Models with Language-Clustered Vocabularies", suggest creating wordpiece vocabularies for clusters of related languages and then using the union of these vocabularies as the final vocabulary during BERT training and prediction. For us this could mean splitting the data into (1) clearly English-only text, (2) clearly Irish-only text and (3) all other text, training three vocabularies and merging them.
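
To make the (1)-(3) split concrete, here is a minimal sketch using the langid.py package; the file names and the 0.9 confidence threshold are assumptions for illustration only, not part of our pipeline:

# Sketch of a three-way split into clearly-English, clearly-Irish and other text.
from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
identifier.set_languages(['en', 'ga'])  # only decide between English and Irish

files = {
    'en': open('clearly_en.txt', 'w', encoding='utf-8'),
    'ga': open('clearly_ga.txt', 'w', encoding='utf-8'),
    'other': open('other.txt', 'w', encoding='utf-8'),
}

with open('all_text.txt', encoding='utf-8') as corpus:
    for line in corpus:
        lang, prob = identifier.classify(line)
        bucket = lang if prob >= 0.9 else 'other'  # low confidence -> "other"
        files[bucket].write(line)

for f in files.values():
    f.close()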

jowagner added the "idea" label Nov 16, 2020
@jbrry (Owner) commented Nov 25, 2020

Good idea. The script https://github.com/jbrry/wiki-bert-pipeline/blob/858d323e1fa3a63368441d68309d5afb9389d3fe/external_scripts/gather_external_data.py#L17 supports gathering specific corpora and then launching the wiki-bert-pipeline. For (1), we could start out with a corpus 'parallel-en' (which is all of the English side of the parallel text in gdrive) and run it through the wiki-bert-pipeline to generate a wordpiece vocabulary that is English only.

I haven't read the above paper yet, but I wonder how easy it is to merge vocabularies. Is it as simple as merging the three vocab.txt files (which look like the examples below) and removing duplicates?

ga/vocab.txt

[UNK]
[CLS]
[SEP]
[MASK]
a
##ch
##ai
##ea
s
d
...

en/vocab.txt

[UNK]
[CLS]
[SEP]
[MASK]
a
##n
##ex
##ample
...

@jowagner (Collaborator, Author) commented Nov 26, 2020

A way to find out would be to remove all the other intermediate output files generated when building the vocabulary and see whether BERT still trains as usual. If it does, that means it only uses the vocab.txt file as the basis for the vocabulary.

The example suggests that the regular entries do not need to be in any particular order, but I'd guess that the first 4 special entries must stay at the start. This could be done manually with:

head -n 4 ga/vocab.txt > combined/vocab.txt 
tail -q -n +5 ??/vocab.txt | LC_ALL=C sort | uniq >> combined/vocab.txt
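
A Python version of the same merge could look like the sketch below; it is only a sketch, assuming each vocab.txt starts with the same 4 special entries and that the subdirectories follow the ga/, en/ pattern above:

# Merge subcorpus vocabularies: shared special entries first, then the
# union of all remaining wordpieces in first-seen order.
from pathlib import Path

NUM_SPECIAL = 4
specials, merged, seen = None, [], set()
for path in sorted(Path('.').glob('??/vocab.txt')):
    lines = path.read_text(encoding='utf-8').splitlines()
    if specials is None:
        specials = lines[:NUM_SPECIAL]
    elif lines[:NUM_SPECIAL] != specials:
        raise ValueError(f'special entries differ in {path}')
    for token in lines[NUM_SPECIAL:]:
        if token not in seen:
            seen.add(token)
            merged.append(token)

Path('combined').mkdir(exist_ok=True)
Path('combined/vocab.txt').write_text('\n'.join(specials + merged) + '\n',
                                      encoding='utf-8')

Unlike the sort | uniq version, this keeps the first-seen order of the regular wordpieces, which should not matter if the order of regular entries really is irrelevant for training from scratch.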

@alanagiasi (Collaborator) commented

The BERT vocabulary (for bert-base-uncased) is laid out as follows:

1	[PAD]
2	[unused0]
...	...
100	[unused98]
101	[UNK]
102	[CLS]
103	[SEP]
104	[MASK]
105	[unused99]
...	...
999	[unused993]
1000	!
1001	<more single characters, possibly in UTF-8 encoding ascending order>
...		...
1997	the
1998	<more words and subwords, possibly sorted by frequency in descending order>
30522	##~

The vocabulary has 30,522 tokens. The first 999 positions are reserved: mostly [unusedN] slots (up to [unused993]), plus [PAD] at position 1 and [UNK], [CLS], [SEP], [MASK] at positions 101-104.
It appears the vocabulary is then laid out as individual characters (possibly in UTF-8 ascending order), followed by words and subwords (possibly sorted by frequency in descending order).
The unused tokens can be used to add custom tokens; this may be intended for fine-tuning rather than for training from scratch, but I'm not certain.

As I understand it, the vocabulary is used as a dictionary mapping word_string : word_id, e.g. the : 1997. The word_id is subsequently used to index a word embedding lookup table (30522 x 768) to retrieve the embedding (typically each embedding has 768 features).
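
A small sketch of that lookup, assuming a plain vocab.txt with one wordpiece per line and a random matrix standing in for the trained embedding table (IDs here are 0-based line positions, whereas the listing above counts lines from 1):

import numpy as np

# Build the word_string -> word_id mapping from line order.
with open('vocab.txt', encoding='utf-8') as f:
    vocab = {token.rstrip('\n'): idx for idx, token in enumerate(f)}

# Stand-in for the model's trained embedding weights.
embedding_table = np.random.randn(len(vocab), 768).astype(np.float32)

word_id = vocab['the']                # e.g. 1996 with 0-based IDs
embedding = embedding_table[word_id]  # one row of shape (768,)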

@jbrry (Owner) commented Nov 26, 2020

Thanks Alan. Also, FYI Joachim, the example I posted skips lines 1-101 of a vocab file, which as Alan pointed out are [PAD] - [unused99] (though the layout Alan shows is numbered from 1 to 999). I think the vocab file for bert-base-uncased must be different from multilingual BERT, as mBERT only keeps 99 places for unused tokens; e.g. footnote 5 in Chau et al. (2020) mentions:

MBERT's fixed-size vocabulary contains 99 tokens designated as "unused," whose representations were not updated during initial pretraining and can be repurposed for vocabulary augmentation without modifying the pretrained model.

The vocab file I am using for mBERT also keeps only 99 unused slots. Its entries are in the format of one wordpiece token per line:

##ução
##шни
administración
Italiae
##ждения

Perhaps they changed how they write vocab files between bert-base-uncased and multilingual BERT. In any case, I imagine the word keys are hard-coded to token ID values in the model itself, even if that is not how the vocab file is written for multilingual BERT. So in one model 'apple' may be mapped to ID 102, while in a model for another language, e.g. French, 'rouge' may be mapped to ID 102, which means the word-to-ID lookup dictionary would have to be changed to accommodate the different key-value pairs.
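
That matches how vocab files are read: the token ID is simply the (0-based) line position, so the same ID usually points at unrelated wordpieces in two different models. A quick illustrative check (the file names are hypothetical):

def load_vocab(path):
    # ID = line position, exactly as the tokenizer builds its lookup.
    with open(path, encoding='utf-8') as f:
        return {line.rstrip('\n'): idx for idx, line in enumerate(f)}

vocab_en = load_vocab('model_en/vocab.txt')
vocab_fr = load_vocab('model_fr/vocab.txt')

id_to_token_en = {i: t for t, i in vocab_en.items()}
id_to_token_fr = {i: t for t, i in vocab_fr.items()}
# The two strings will usually differ, which is why a pretrained model's
# vocab must not be reordered or re-merged after training.
print(id_to_token_en[102], id_to_token_fr[102])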

@jowagner (Collaborator, Author) commented

Thanks.

Yes, I also think that when you work with an existing model you must not append entries to the vocab file or change the order of existing entries. Vocab files with such changes are only useful for training from scratch. As the footnote quoted above says, the unused entries are there to let people add a few entries during fine-tuning.
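
For completeness, a sketch of what such a fine-tuning-time addition could look like: overwrite an [unusedN] line in a copy of the vocab file so all existing IDs keep their meaning. The slot name, the new token and the paths are only illustrative:

with open('vocab.txt', encoding='utf-8') as f:
    lines = f.read().splitlines()

slot = lines.index('[unused1]')   # pick a free reserved slot
lines[slot] = 'Gaeilge'           # the new wordpiece takes over that ID

with open('vocab_augmented.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(lines) + '\n')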

jowagner changed the title from "Merge language-specific wordpiece vocabularies" to "Merge subcorpus-specific wordpiece vocabularies" May 13, 2021
jowagner added the "project" label Nov 23, 2021