Merge subcorpus-specific wordpiece vocabularies #33

Open
jowagner opened this issue Nov 16, 2020 · 5 comments
Labels: idea (Future work idea), project (Suitable for a student or intern project)

Comments

@jowagner (Collaborator) commented Nov 16, 2020

When training on Irish, English and possibly other languages, Chung et al. (2020), "Improving Multilingual Models with Language-Clustered Vocabularies", suggest creating wordpiece vocabularies for clusters of related languages and then using the union of these vocabularies as the final vocabulary during BERT training and prediction. For us this could mean splitting the data into (1) clearly English-only text, (2) clearly Irish-only text and (3) all other text, training three vocabularies and merging them.
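
To make the (1)-(3) split concrete, here is a minimal sketch using the langid.py package; the file names and the 0.9 confidence threshold are assumptions for illustration only, not part of our pipeline:

# Sketch of a three-way split into clearly-English, clearly-Irish and other text.
from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
identifier.set_languages(['en', 'ga'])  # only decide between English and Irish

files = {
    'en': open('clearly_en.txt', 'w', encoding='utf-8'),
    'ga': open('clearly_ga.txt', 'w', encoding='utf-8'),
    'other': open('other.txt', 'w', encoding='utf-8'),
}

with open('all_text.txt', encoding='utf-8') as corpus:
    for line in corpus:
        lang, prob = identifier.classify(line)
        bucket = lang if prob >= 0.9 else 'other'  # low confidence -> "other"
        files[bucket].write(line)

for f in files.values():
    f.close()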

jowagner added the "idea" label Nov 16, 2020
@jbrry (Owner) commented Nov 25, 2020

Good idea. The script https://github.com/jbrry/wiki-bert-pipeline/blob/858d323e1fa3a63368441d68309d5afb9389d3fe/external_scripts/gather_external_data.py#L17 supports gathering specific corpora and then launching the wiki-bert-pipeline. For (1), we could start out with a corpus 'parallel-en' (which is all of the English side of the parallel text in gdrive) and run it through the wiki-bert-pipeline to generate a wordpiece vocabulary that is English only.

I haven't read the above paper yet, but I wonder how easy it is to merge vocabularies. Is it as simple as merging the three vocab.txt files (which look like the examples below) and removing duplicates?

ga/vocab.txt

[UNK]
[CLS]
[SEP]
[MASK]
a
##ch
##ai
##ea
s
d
...

en/vocab.txt

[UNK]
[CLS]
[SEP]
[MASK]
a
##n
##ex
##ample
...

@jowagner (Collaborator, Author) commented Nov 26, 2020

A way to find out would be to remove all the other intermediate output files generated when building the vocabulary and see whether BERT still trains as usual. If it does, that means it only uses the vocab.txt file as the basis for the vocabulary.

The example suggests that the regular entries do not need to be in any particular order, but I'd guess that the first 4 special entries must stay at the start. This could be done manually with:

head -n 4 ga/vocab.txt > combined/vocab.txt 
tail -q -n +5 ??/vocab.txt | LC_ALL=C sort | uniq >> combined/vocab.txt
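
A Python version of the same merge could look like the sketch below; it is only a sketch, assuming each vocab.txt starts with the same 4 special entries and that the subdirectories follow the ga/, en/ pattern above:

# Merge subcorpus vocabularies: shared special entries first, then the
# union of all remaining wordpieces in first-seen order.
from pathlib import Path

NUM_SPECIAL = 4
specials, merged, seen = None, [], set()
for path in sorted(Path('.').glob('??/vocab.txt')):
    lines = path.read_text(encoding='utf-8').splitlines()
    if specials is None:
        specials = lines[:NUM_SPECIAL]
    elif lines[:NUM_SPECIAL] != specials:
        raise ValueError(f'special entries differ in {path}')
    for token in lines[NUM_SPECIAL:]:
        if token not in seen:
            seen.add(token)
            merged.append(token)

Path('combined').mkdir(exist_ok=True)
Path('combined/vocab.txt').write_text('\n'.join(specials + merged) + '\n',
                                      encoding='utf-8')

Unlike the sort | uniq version, this keeps the first-seen order of the regular wordpieces, which should not matter if the order of regular entries really is irrelevant for training from scratch.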

@alanagiasi (Collaborator) commented

The BERT vocabulary (for bert-base-uncased) is laid out as follows:

1	[PAD]
2	[unused0]
...	...
100	[unused98]
101	[UNK]
102	[CLS]
103	[SEP]
104	[MASK]
105	[unused99]
...	...
999	[unused993]
1000	!
1001	<more single characters, possibly in UTF-8 encoding ascending order>
...		...
1997	the
1998	<more words and subwords, possibly sorted by frequency in descending order>
30522	##~

The vocabulary has 30,522 tokens. The first 999 positions are reserved: mostly [unusedN] slots (up to [unused993]), plus [PAD] at position 1 and [UNK], [CLS], [SEP], [MASK] at positions 101-104.
It appears the vocabulary is then laid out as individual characters (possibly in UTF-8 ascending order), followed by words and subwords (possibly sorted by frequency in descending order).
The unused tokens can be used to add custom tokens; this may be intended for fine-tuning rather than for training from scratch, but I'm not certain.

As I understand it, the vocabulary is used as a dictionary mapping word_string : word_id, e.g. the : 1997. The word_id is subsequently used to index a word embedding lookup table (30522 x 768) to retrieve the embedding (typically each embedding has 768 features).
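
A small sketch of that lookup, assuming a plain vocab.txt with one wordpiece per line and a random matrix standing in for the trained embedding table (IDs here are 0-based line positions, whereas the listing above counts lines from 1):

import numpy as np

# Build the word_string -> word_id mapping from line order.
with open('vocab.txt', encoding='utf-8') as f:
    vocab = {token.rstrip('\n'): idx for idx, token in enumerate(f)}

# Stand-in for the model's trained embedding weights.
embedding_table = np.random.randn(len(vocab), 768).astype(np.float32)

word_id = vocab['the']                # e.g. 1996 with 0-based IDs
embedding = embedding_table[word_id]  # one row of shape (768,)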

@jbrry (Owner) commented Nov 26, 2020

Thanks Alan. Also, FYI Joachim, the example I posted skips lines 1-101 of a vocab file, which as Alan pointed out are [PAD] - [unused99] (though the layout Alan shows is numbered from 1 to 999). I think the vocab file for bert-base-uncased must be different from multilingual BERT, as mBERT only keeps 99 places for unused tokens; e.g. footnote 5 in Chau et al. (2020) mentions:

MBERT's fixed-size vocabulary contains 99 tokens designated as "unused," whose representations were not updated during initial pretraining and can be repurposed for vocabulary augmentation without modifying the pretrained model.

The vocab file I am using for mBERT also keeps only 99 unused slots. Its entries are in the format of one wordpiece token per line:

##ução
##шни
administración
Italiae
##ждения

Perhaps they changed how they write vocab files between bert-base-uncased and multilingual BERT. In any case, I imagine the word keys are hard-coded to token ID values in the model itself, even if that is not how the vocab file is written for multilingual BERT. So in one model 'apple' may be mapped to ID 102, while in a model for another language, e.g. French, 'rouge' may be mapped to ID 102, which means the word-to-ID lookup dictionary would have to be changed to accommodate the different key-value pairs.
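
That matches how vocab files are read: the token ID is simply the (0-based) line position, so the same ID usually points at unrelated wordpieces in two different models. A quick illustrative check (the file names are hypothetical):

def load_vocab(path):
    # ID = line position, exactly as the tokenizer builds its lookup.
    with open(path, encoding='utf-8') as f:
        return {line.rstrip('\n'): idx for idx, line in enumerate(f)}

vocab_en = load_vocab('model_en/vocab.txt')
vocab_fr = load_vocab('model_fr/vocab.txt')

id_to_token_en = {i: t for t, i in vocab_en.items()}
id_to_token_fr = {i: t for t, i in vocab_fr.items()}
# The two strings will usually differ, which is why a pretrained model's
# vocab must not be reordered or re-merged after training.
print(id_to_token_en[102], id_to_token_fr[102])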

@jowagner (Collaborator, Author) commented

Thanks.

Yes, I also think that when you work with an existing model you must not append entries to the vocab file or change the order of existing entries. Vocab files with such changes are only useful for training from scratch. As the footnote quoted above says, the unused entries are there to let people add a few entries during fine-tuning.
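
For completeness, a sketch of what such a fine-tuning-time addition could look like: overwrite an [unusedN] line in a copy of the vocab file so all existing IDs keep their meaning. The slot name, the new token and the paths are only illustrative:

with open('vocab.txt', encoding='utf-8') as f:
    lines = f.read().splitlines()

slot = lines.index('[unused1]')   # pick a free reserved slot
lines[slot] = 'Gaeilge'           # the new wordpiece takes over that ID

with open('vocab_augmented.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(lines) + '\n')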

jowagner changed the title from "Merge language-specific wordpiece vocabularies" to "Merge subcorpus-specific wordpiece vocabularies" May 13, 2021
jowagner added the "project" label Nov 23, 2021