
How to get RoBERTaTokenizer vocab.json and also merge file #1083

Closed
songtaoshi opened this issue Aug 22, 2019 · 12 comments

@songtaoshi

❓ Questions & Help

Hello, I trained RoBERTa on my customized corpus following the fairseq instructions. I am confused about how to generate the RoBERTa vocab.json and merges.txt, because I want to use the pytorch-transformers RobertaTokenizer. I only have a dict.txt in my data.


@LysandreJik
Member

Hi! RoBERTa's tokenizer is based on the GPT-2 tokenizer.

Please note that unless you have completely re-trained RoBERTa from scratch, there is usually no need to change the vocab.json and merges.txt files.

Currently we do not have a built-in way of creating your vocab/merges files, either for GPT-2 or for RoBERTa. I'm describing the process we followed for RoBERTa, in the hope that you will be able to solve your problem by following a similar process.

Encoding a sentence is done according to the following process:

Say you start with this text:

What's up with the tokenizer?

The tokenizer first tokenizes according to the merges file:

['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']

And then, according to the values in the vocab.json, these tokens are then replaced by their indices:

[   'What',     "'s",    'Ġup',  'Ġwith',   'Ġthe', 'Ġtoken',   'izer',      '?']
---- becomes ----
[     2061,      338,      510,      351,      262,    11241,     7509,       30]
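
For illustration, both steps can be reproduced with the GPT-2 tokenizer from the transformers library (a minimal sketch, assuming transformers is installed; the token strings and ids are the ones shown above):

from transformers import GPT2Tokenizer

# tokenize with the GPT-2 merges, then map tokens to ids with the GPT-2 vocab
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokens = gpt2_tokenizer.tokenize("What's up with the tokenizer?")
# ['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']
ids = gpt2_tokenizer.convert_tokens_to_ids(tokens)
# [2061, 338, 510, 351, 262, 11241, 7509, 30]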

The dict.txt file generated when training RoBERTa with fairseq effectively modifies the original GPT-2 vocab.json by shifting the indices.

If you open the dict.txt file you should see values such as (the values shown here are the first values of the native RoBERTa dict.txt):

13 850314647
262 800385005
11 800251374
284 432911125

which are GPT-2 token indices ordered by descending frequency. For the first entry, token 13 in the GPT-2 tokenizer is the token '.': gpt2_tokenizer.encode('.') returns [13].

In order to get the appropriate RoBERTa vocab.json we remapped the original GPT-2 vocab.json with this dict. The first four values are the special tokens:

{"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}

Following those values are the tokens from dict.txt, in order of their dict index. For example:

gpt2_tokenizer.decode(13) -> '.'       # dict index 0 (13 is on the 1st line of dict.txt)
gpt2_tokenizer.decode(262) -> ' the'   # dict index 1 (262 is on the 2nd line of dict.txt)
gpt2_tokenizer.decode(11) -> ','       # dict index 2 (11 is on the 3rd line of dict.txt)
gpt2_tokenizer.decode(284) -> ' to'    # dict index 3 (284 is on the 4th line of dict.txt)

The vocab then becomes:

{"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, ".": 4, "Ġthe": 5, ",": 6, "Ġto": 7}

That's how you create the vocab.json. The merges.txt file is unchanged.
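
A rough sketch of that remapping, assuming a GPT-2 tokenizer and a fairseq-style dict.txt (called dict.txt here) whose first column is the GPT-2 token id; this is only an illustration, not the exact script used by fairseq or pytorch-transformers:

import json
from transformers import GPT2Tokenizer

gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# start with RoBERTa's four leading special tokens
new_vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}

with open('dict.txt', encoding='utf-8') as f:
    for line in f:
        first_column = line.split()[0]
        if not first_column.isdigit():      # skip madeupword/padding entries
            continue
        token = gpt2_tokenizer.convert_ids_to_tokens(int(first_column))  # e.g. 262 -> 'Ġthe'
        new_vocab[token] = len(new_vocab)

# RoBERTa appends <mask> at the very end of the vocab
new_vocab["<mask>"] = len(new_vocab)

with open('vocab.json', 'w', encoding='utf-8') as f:
    json.dump(new_vocab, f, ensure_ascii=False)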

@songtaoshi
Author

@julien-c Thanks for your reply!

Hi, I am pre-training RoBERTa on my own corpus, which consists of numbers:

4758 7647 16712 6299 11255 6068 695 23 19536 7142 7009 9655 10524 4864 7379 17348 7501 17225 14123 13711 7133 11255 21097 3277 6068 695 4190 1269 4526 12266 2161 17597 15274
23 6484 17225 8217 16374 11122 5592 21224 7251 11188 533 9685 11487 4246 19311 19851 8038 15822 9435 15274
1027 1269 14461 4815 12617 14123 3268 3390 8197 19019 16908 20958 15033 16541 19421 19429 7664 17253 4246 11123 1884 15274
5863 17166 21224 13159 2289 11944 8205 17083 13426 21224 17225 17186 14499 6225 16201 400 5635 3219 16498 15274

Each line represents a paragraph.

So I skip the BPE encoding and just binarize my data into the language-model format, using:

TEXT=examples/language_model/wikitext-103
fairseq-preprocess \
    --only-source \
    --trainpref $TEXT/wiki.train.tokens \
    --validpref $TEXT/wiki.valid.tokens \
    --testpref $TEXT/wiki.test.tokens \
    --destdir data-bin/wikitext-103 \
    --workers 20

I think I can construct the vocab.json myself, but since I didn't use BPE I don't have a merges.txt, so I'm wondering if I can just use an empty file to mean no merging.

@stale

stale bot commented Oct 22, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Oct 22, 2019
@stale stale bot closed this as completed Oct 29, 2019
@yuanmingchen


I want to know this too

@Caleb66666

You can get vocab.json and merges.txt from:
https://huggingface.co/transformers/v1.1.0/_modules/pytorch_transformers/tokenization_roberta.html
These files still come from Hugging Face.

@rpoli40

rpoli40 commented Aug 14, 2020

@songtaoshi I have a similar problem. Did you get your issue resolved?

@Xianchao-Wu

For another new language and a totally new dataset, preparing your own merges.txt and vocab.json is definitely necessary.

Check this:
https://towardsdatascience.com/transformers-from-scratch-creating-a-tokenizer-7d7418adb403

It is a step-by-step tutorial on how to use the "oscar" dataset to train your own byte-level BPE tokenizer (which outputs exactly "merges.txt" and "vocab.json").

1. Prepare the data

import os
import datasets
from tqdm.auto import tqdm

# download the Latin portion of OSCAR and write it out as plain-text files,
# 5000 paragraphs per file
dataset = datasets.load_dataset('oscar', 'unshuffled_deduplicated_la')
os.makedirs('./oscar_la', exist_ok=True)

text_data = []
file_count = 0
for sample in tqdm(dataset['train']):
    sample = sample['text'].replace('\n', '')
    text_data.append(sample)
    if len(text_data) == 5000:
        with open(f'./oscar_la/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1

# write whatever is left over
with open(f'./oscar_la/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))

from pathlib import Path
paths = [str(x) for x in Path('./oscar_la').glob('*.txt')]
# e.g. ['oscar_la/text_1.txt', 'oscar_la/text_2.txt', 'oscar_la/text_3.txt', 'oscar_la/text_0.txt']

2. Train the tokenizer

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths, vocab_size=30522, min_frequency=2,
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

3. Save the files

os.makedirs('./oscar_la/blbpe', exist_ok=True)  # the target directory must exist
tokenizer.save_model('./oscar_la/blbpe')
# -> ['./oscar_la/blbpe/vocab.json', './oscar_la/blbpe/merges.txt']
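
As a follow-up sketch (assuming the paths above), the two files can then be loaded directly with the RobertaTokenizer from transformers:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer(vocab_file='./oscar_la/blbpe/vocab.json',
                             merges_file='./oscar_la/blbpe/merges.txt')
print(tokenizer.tokenize('lorem ipsum dolor sit amet'))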

@chirico85

@Xianchao-Wu
Thanks, that helped me a lot!

@XingLuxi

XingLuxi commented Feb 9, 2022 via email

@deephudka005

Can you please give a reference to the code, or explain how we can generate tokens for a given text using the merges.txt file?
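
For reference, a rough sketch of how the ranks in merges.txt can drive the tokenization of a single pre-split word; this is only an illustration and ignores GPT-2's byte-to-unicode mapping and special tokens:

# illustration only: greedily apply the best-ranked merge until none applies
def load_merge_ranks(merges_path):
    ranks = {}
    with open(merges_path, encoding='utf-8') as f:
        for i, line in enumerate(f):
            if line.startswith('#') or not line.strip():
                continue  # skip the version header and blank lines
            a, b = line.split()
            ranks[(a, b)] = i
    return ranks

def bpe(word, ranks):
    symbols = list(word)                      # start from single characters
    while len(symbols) > 1:
        # rank every adjacent pair; a lower rank means it appears earlier in merges.txt
        candidates = [(ranks.get((symbols[i], symbols[i + 1]), float('inf')), i)
                      for i in range(len(symbols) - 1)]
        best_rank, i = min(candidates)
        if best_rank == float('inf'):         # no applicable merge left
            break
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols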

@XingLuxi

XingLuxi commented Oct 18, 2022 via email
