
How to get RoBERTaTokenizer vocab.json and also merge file #1083

Closed
songtaoshi opened this issue Aug 22, 2019 · 12 comments

@songtaoshi

❓ Questions & Help

Hello, I trained RoBERTa on my customized corpus following the fairseq instructions. I am confused about how to generate the RoBERTa vocab.json and merges.txt, because I want to use the pytorch-transformers RobertaTokenizer. I only have a dict.txt in my data.


@LysandreJik
Member

Hi! RoBERTa's tokenizer is based on the GPT-2 tokenizer.

Please note that unless you have completely re-trained RoBERTa from scratch, there is usually no need to change the vocab.json and merges.txt files.

Currently we do not have a built-in way of creating your vocab/merges files, either for GPT-2 or for RoBERTa. I'm describing the process we followed for RoBERTa, in the hope that you will be able to solve your problem by following a similar process.

Encoding a sentence is done according to the following process:

Say you start with this text:

What's up with the tokenizer?

The tokenizer first tokenizes according to the merges file:

['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']

And then, according to the values in the vocab.json, these tokens are then replaced by their indices:

[   'What',     "'s",    'Ġup',  'Ġwith',   'Ġthe', 'Ġtoken',   'izer',      '?']
---- becomes ----
[     2061,      338,      510,      351,      262,    11241,     7509,       30]
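
For illustration, both steps can be reproduced with the GPT-2 tokenizer from the transformers library (a minimal sketch, assuming transformers is installed; the token strings and ids are the ones shown above):

from transformers import GPT2Tokenizer

# tokenize with the GPT-2 merges, then map tokens to ids with the GPT-2 vocab
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokens = gpt2_tokenizer.tokenize("What's up with the tokenizer?")
# ['What', "'s", 'Ġup', 'Ġwith', 'Ġthe', 'Ġtoken', 'izer', '?']
ids = gpt2_tokenizer.convert_tokens_to_ids(tokens)
# [2061, 338, 510, 351, 262, 11241, 7509, 30]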

The dict.txt file generated when training RoBERTa with fairseq effectively modifies the original GPT-2 vocab.json by shifting the indices.

If you open the dict.txt file you should see values such as (the values shown here are the first values of the native RoBERTa dict.txt):

13 850314647
262 800385005
11 800251374
284 432911125

which are GPT-2 token indices ordered by descending frequency. For the first entry, token 13 in the GPT-2 tokenizer is the token '.': gpt2_tokenizer.encode('.') returns [13].

In order to get the appropriate RoBERTa vocab.json we remapped the original GPT-2 vocab.json with this dict. The first four values are the special tokens:

{"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}

Following those values are the tokens from dict.txt, in order of their dict index. For example:

gpt2_tokenizer.decode(13) -> '.'       # dict index 0 (13 is on the 1st line of dict.txt)
gpt2_tokenizer.decode(262) -> ' the'   # dict index 1 (262 is on the 2nd line of dict.txt)
gpt2_tokenizer.decode(11) -> ','       # dict index 2 (11 is on the 3rd line of dict.txt)
gpt2_tokenizer.decode(284) -> ' to'    # dict index 3 (284 is on the 4th line of dict.txt)

The vocab then becomes:

{"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, ".": 4, "Ġthe": 5, ",": 6, "Ġto": 7}

That's how you create the vocab.json. The merges.txt file is unchanged.
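
A rough sketch of that remapping, assuming a GPT-2 tokenizer and a fairseq-style dict.txt (called dict.txt here) whose first column is the GPT-2 token id; this is only an illustration, not the exact script used by fairseq or pytorch-transformers:

import json
from transformers import GPT2Tokenizer

gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# start with RoBERTa's four leading special tokens
new_vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}

with open('dict.txt', encoding='utf-8') as f:
    for line in f:
        first_column = line.split()[0]
        if not first_column.isdigit():      # skip madeupword/padding entries
            continue
        token = gpt2_tokenizer.convert_ids_to_tokens(int(first_column))  # e.g. 262 -> 'Ġthe'
        new_vocab[token] = len(new_vocab)

# RoBERTa appends <mask> at the very end of the vocab
new_vocab["<mask>"] = len(new_vocab)

with open('vocab.json', 'w', encoding='utf-8') as f:
    json.dump(new_vocab, f, ensure_ascii=False)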

@songtaoshi
Author

@julien-c Thanks for your reply!

Hi, I am pre-training RoBERTa on my own corpus, which consists of numbers:

4758 7647 16712 6299 11255 6068 695 23 19536 7142 7009 9655 10524 4864 7379 17348 7501 17225 14123 13711 7133 11255 21097 3277 6068 695 4190 1269 4526 12266 2161 17597 15274
23 6484 17225 8217 16374 11122 5592 21224 7251 11188 533 9685 11487 4246 19311 19851 8038 15822 9435 15274
1027 1269 14461 4815 12617 14123 3268 3390 8197 19019 16908 20958 15033 16541 19421 19429 7664 17253 4246 11123 1884 15274
5863 17166 21224 13159 2289 11944 8205 17083 13426 21224 17225 17186 14499 6225 16201 400 5635 3219 16498 15274

Each line represents a paragraph.

So I skip the BPE encoding and just binarize my data into the language-model format, using:

TEXT=examples/language_model/wikitext-103
fairseq-preprocess \
    --only-source \
    --trainpref $TEXT/wiki.train.tokens \
    --validpref $TEXT/wiki.valid.tokens \
    --testpref $TEXT/wiki.test.tokens \
    --destdir data-bin/wikitext-103 \
    --workers 20

I think I can construct the vocab.json myself, but since I didn't use BPE I don't have a merges.txt, so I'm wondering if I can just use an empty file to mean no merging.

@stale

stale bot commented Oct 22, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Oct 22, 2019
@stale stale bot closed this as completed Oct 29, 2019
@yuanmingchen


I want to know this too

@Caleb66666

You can get vocab.json and merges.txt from:
https://huggingface.co/transformers/v1.1.0/_modules/pytorch_transformers/tokenization_roberta.html
These files still come from Hugging Face.

@rpoli40

rpoli40 commented Aug 14, 2020

@songtaoshi I have a similar problem. Did you get your issue resolved?

@Xianchao-Wu

For another new language and a totally new dataset, preparing your own merges.txt and vocab.json is definitely necessary.

Check this:
https://towardsdatascience.com/transformers-from-scratch-creating-a-tokenizer-7d7418adb403

It is a step-by-step tutorial on how to use the "oscar" dataset to train your own byte-level BPE tokenizer (which outputs exactly "merges.txt" and "vocab.json").

1. Prepare the data

import os
import datasets
from tqdm.auto import tqdm

# download the Latin portion of OSCAR and write it out as plain-text files,
# 5000 paragraphs per file
dataset = datasets.load_dataset('oscar', 'unshuffled_deduplicated_la')
os.makedirs('./oscar_la', exist_ok=True)

text_data = []
file_count = 0
for sample in tqdm(dataset['train']):
    sample = sample['text'].replace('\n', '')
    text_data.append(sample)
    if len(text_data) == 5000:
        with open(f'./oscar_la/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1

# write whatever is left over
with open(f'./oscar_la/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))

from pathlib import Path
paths = [str(x) for x in Path('./oscar_la').glob('*.txt')]
# e.g. ['oscar_la/text_1.txt', 'oscar_la/text_2.txt', 'oscar_la/text_3.txt', 'oscar_la/text_0.txt']

2. Train the tokenizer

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths, vocab_size=30522, min_frequency=2,
                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])

3. Save the files

os.makedirs('./oscar_la/blbpe', exist_ok=True)  # the target directory must exist
tokenizer.save_model('./oscar_la/blbpe')
# -> ['./oscar_la/blbpe/vocab.json', './oscar_la/blbpe/merges.txt']
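
As a follow-up sketch (assuming the paths above), the two files can then be loaded directly with the RobertaTokenizer from transformers:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer(vocab_file='./oscar_la/blbpe/vocab.json',
                             merges_file='./oscar_la/blbpe/merges.txt')
print(tokenizer.tokenize('lorem ipsum dolor sit amet'))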

@chirico85

@Xianchao-Wu
Thanks, that helped me a lot!

@XingLuxi

XingLuxi commented Feb 9, 2022 via email

@deephudka005

Can you please give a reference to the code, or explain how we can generate tokens for a given text using the merges.txt file?
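
For reference, a rough sketch of how the ranks in merges.txt can drive the tokenization of a single pre-split word; this is only an illustration and ignores GPT-2's byte-to-unicode mapping and special tokens:

# illustration only: greedily apply the best-ranked merge until none applies
def load_merge_ranks(merges_path):
    ranks = {}
    with open(merges_path, encoding='utf-8') as f:
        for i, line in enumerate(f):
            if line.startswith('#') or not line.strip():
                continue  # skip the version header and blank lines
            a, b = line.split()
            ranks[(a, b)] = i
    return ranks

def bpe(word, ranks):
    symbols = list(word)                      # start from single characters
    while len(symbols) > 1:
        # rank every adjacent pair; a lower rank means it appears earlier in merges.txt
        candidates = [(ranks.get((symbols[i], symbols[i + 1]), float('inf')), i)
                      for i in range(len(symbols) - 1)]
        best_rank, i = min(candidates)
        if best_rank == float('inf'):         # no applicable merge left
            break
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols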

@XingLuxi

XingLuxi commented Oct 18, 2022 via email
