How to get RoBERTaTokenizer vocab.json and also merges file #1083
Hi! RoBERTa's tokenizer is based on the GPT-2 tokenizer. Please note that unless you have completely re-trained RoBERTa from scratch, there is usually no need to change the vocab/merges files at all. Currently we do not have a built-in way of creating your vocab/merges files, neither for GPT-2 nor for RoBERTa. I'm describing the process we followed for RoBERTa, hoping that you will be able to solve your problem by following a similar process.
Encoding a sentence is done according to the following process. Say you start with some text: the tokenizer first splits it into sub-word tokens according to the merges file, and then converts each token into an index according to the values in the vocab file.
The dict.txt file generated when pre-training RoBERTa with fairseq actually re-maps those indices. If you open the dict.txt file you should see pairs of values (the example values, omitted here, were the first entries of the native RoBERTa dictionary), which are token indices ordered by highest occurrence: the position of each entry in dict.txt becomes that token's new index. In order to get the appropriate RoBERTa vocab, you follow each dict.txt entry back to the corresponding token in the original GPT-2 vocab; the new vocab is then the composition of the two mappings.
That's how you create the RoBERTa vocab and merges files. |
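The re-mapping described above can be sketched as follows. This is only an illustration: the toy vocab and dict.txt entries below are made up, not the real GPT-2 or RoBERTa values.

```python
import json

# Toy stand-in for the GPT-2 vocab.json: token -> original index.
gpt2_vocab = {"<s>": 0, "Hello": 1, "Ġworld": 2, "!": 3}

# Toy stand-in for fairseq's dict.txt: each line is
# "<original-gpt2-index> <count>", ordered by frequency; the line's
# position in the file becomes the token's new index.
dict_txt_lines = ["2 100", "1 80", "3 5"]

# Invert the GPT-2 vocab so we can look tokens up by index.
index_to_token = {idx: tok for tok, idx in gpt2_vocab.items()}

# New RoBERTa-style vocab: token -> position in dict.txt.
roberta_vocab = {}
for new_index, line in enumerate(dict_txt_lines):
    original_index = int(line.split()[0])
    roberta_vocab[index_to_token[original_index]] = new_index

print(json.dumps(roberta_vocab, ensure_ascii=False))
# {"Ġworld": 0, "Hello": 1, "!": 2}
```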
@julien-c Thanks for your reply! I am pre-training RoBERTa on my own corpus, which consists of numbers;
each separated line represents a paragraph. So I skip the BPE encoding and just binarize my data into fairseq's language-modeling format.
I think I can construct the vocab.json by myself, but since I didn't use BPE there is no merges.txt, so I wonder whether I can just use an empty file to mean "no merging". |
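One way to try the empty-merges idea is simply to write the two files by hand. The sketch below is a guess at what that could look like for a digit-only corpus; whether RobertaTokenizer behaves sensibly with a merges file containing no rules is an assumption you would need to verify.

```python
import json

# Hypothetical vocabulary for a corpus of digit tokens:
# standard special tokens first, then the digits 0-9.
vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
vocab.update({str(d): 4 + d for d in range(10)})

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

# merges.txt normally starts with a "#version" header line followed by
# one merge rule per line; with no BPE merges we keep only the header.
with open("merges.txt", "w", encoding="utf-8") as f:
    f.write("#version: 0.2\n")
```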
I want to know this too |
You can get vocab.txt and merges.txt from: |
@songtaoshi I have a similar problem. Did you get your issue resolved? |
For another new language and a totally new dataset, preparing your own merges.txt and vocab.json is for sure necessary. Check this: a step-by-step tutorial on how to use the "oscar" dataset to train your own byte-level BPE tokenizer (which outputs exactly merges.txt and vocab.json):
1. prepare the data
2. train
3. save
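The three steps can be sketched with the Hugging Face `tokenizers` library. The corpus content, file paths, and hyperparameters below are placeholders, not the ones from the linked tutorial.

```python
from tokenizers import ByteLevelBPETokenizer

# 1. prepare the data: one or more plain-text files, one document per
#    line (here a tiny placeholder corpus written on the fly).
with open("my_corpus.txt", "w", encoding="utf-8") as f:
    f.write("hello world\nhello tokenizer\n")

# 2. train a byte-level BPE tokenizer on the corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["my_corpus.txt"],
    vocab_size=5000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# 3. save: writes vocab.json and merges.txt into the given directory.
tokenizer.save_model(".")
```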
|
@Xianchao-Wu |
Can you please give a reference to the code, or explain how we can generate tokens for a given text using the merges.txt file? |
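As a rough illustration (not the actual GPT-2/RoBERTa implementation, which also handles byte-level pre-tokenization and caching), applying merges.txt boils down to repeatedly joining the adjacent symbol pair with the highest-priority merge rule. The toy rules below are made up.

```python
def bpe_tokenize(word, merges):
    """Split `word` into characters, then apply merge rules in priority order.

    `merges` is a list of (left, right) pairs; earlier entries have
    higher priority -- the same ordering merges.txt uses.
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable rule left
        # Merge every occurrence of the best pair.
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Toy merge rules, as they would appear line-by-line in merges.txt.
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_tokenize("lower", toy_merges))  # ['low', 'er']
```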
❓ Questions & Help
Hello, I trained RoBERTa on my customized corpus following the fairseq instructions. I am confused about how to generate the RoBERTa vocab.json and merges.txt, because I want to use the pytorch-transformers RoBERTaTokenizer. I only have a dict.txt in my data folder.