Training protocol for Roberta model with SentencePiece #2

Closed
ksopyla opened this issue Feb 11, 2020 · 6 comments

ksopyla commented Feb 11, 2020

Hi,
I am trying to train a RoBERTa model from scratch with the fairseq library. I want to pretrain the model on Polish text but can't find any good source that explains the details.

There is a README that explains how to do this with BPE and English: https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md

But not everything is obvious to me. First of all, I want to use SentencePiece instead of BPE; you trained your model with SentencePiece, am I correct?
Could you share what the format of the vocab file for fairseq should look like?
My trained vocab has this format:

<unk>	0
▁,	-2.98959
▁.	-3.06552
▁w	-3.656
a	-3.99339
▁i	-4.16481
...

What is the target format? The same as in the original BERT?

[unk]
,
.
w
##a
i
...
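For reference, the dictionary file that fairseq-preprocess consumes (dict.txt) is typically one token per line followed by a space and an integer count; fairseq adds its own <s>, <pad>, </s> and <unk> symbols, so they are not listed. The counts below are placeholder values, not real frequencies:

▁, 100
▁. 100
▁w 100
a 100
▁i 100
...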

My other concern is data preparation: how to preprocess and encode the data (text).

In the tutorial they encode the text with BPE:

mkdir -p gpt2_bpe
wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
for SPLIT in train valid test; do \
    python -m examples.roberta.multiprocessing_bpe_encoder \
        --encoder-json gpt2_bpe/encoder.json \
        --vocab-bpe gpt2_bpe/vocab.bpe \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 60; \
done

So, should I write my own script that encodes my data with SentencePiece tokens,

and then use:

fairseq-preprocess \
    --only-source \
    --srcdict sentencepiece/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.sp \
    --validpref wikitext-103-raw/wiki.valid.sp \
    --testpref wikitext-103-raw/wiki.test.sp \
    --destdir data-bin/wikitext-103 \
    --workers 60

Could you also share some information about your settings?

TOTAL_UPDATES=??    # Total number of training steps
WARMUP_UPDATES=??    # Warmup the learning rate over this many updates
PEAK_LR=0.0005          # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=16          # Increase the batch size 16x
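(For context, in the linked README these variables feed a fairseq-train call roughly like the one below; the exact flags can vary between fairseq releases, so defer to the README itself.)

DATA_DIR=data-bin/wikitext-103   # output directory of fairseq-preprocess above

fairseq-train --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1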

Finally, could you share how many GPUs you used and how long it took to train the model?
Any tips or warnings are welcome.

Thank you in advance. :)

@simonefrancia

OK, I will try to give a brief description.
Yes, we used the SentencePiece tokenizer, and we used it through the command line.
SentencePiece implements two segmentation algorithms, and one of them is BPE, which is also the one used in CamemBERT.
With the command below, you train the SentencePiece tokenizer on a very big corpus of data:

# Train SentencePiece tokenizer on a large corpus
spm_train \
    --input=[raw_text_file] \
    --max_sentence_length=[max length of a sentence you accept] \
    --model_prefix=spm.bpe \
    --vocab_size=[8000, 16000, 32000, etc.] \
    --model_type=bpe \
    --shuffle_input_sentence=true \
    --pad_id=-1 \
    --input_sentence_size=[number of sentences to sample randomly from the input]
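For example, filled in with illustrative values (the corpus file name, vocab size and size limits here are assumptions, not the exact settings used for this model):

spm_train \
    --input=corpus_pl.raw \
    --max_sentence_length=100000 \
    --model_prefix=spm.bpe \
    --vocab_size=32000 \
    --model_type=bpe \
    --shuffle_input_sentence=true \
    --pad_id=-1 \
    --input_sentence_size=10000000

This produces spm.bpe.model and spm.bpe.vocab (one token and its score per line, tab-separated).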

Then you have to encode your data in the format that Fairseq training needs.

# Encode data with the trained SentencePiece model
# --model:         the model produced by spm_train above
# --extra_options: also encode begin-of-sequence and end-of-sequence markers
# --output_format: write the encoded data as SentencePiece token pieces
spm_encode \
    --model=spm.bpe.model \
    --extra_options=bos:eos \
    --output_format=piece \
    < file.raw \
    > file.bpe

At this point you also have a vocabulary file (spm.bpe.vocab) in the tab-separated format you showed above. You have to change the separator from \t (SentencePiece) to a space, because that is the notation expected by fairseq; one way to do the conversion is sketched below.
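A minimal conversion sketch, under a few assumptions: fairseq's --srcdict expects one "<token> <count>" pair per line, the first three lines of spm.bpe.vocab are SentencePiece's own <unk>, <s>, </s> (which fairseq adds by itself, so they are dropped here), and a dummy integer count is acceptable:

# token<TAB>score  ->  token<SPACE>dummy_count, skipping the three special tokens
tail -n +4 spm.bpe.vocab | cut -f1 | sed 's/$/ 100/' > sentencepiece.bpe.vocab

The converted file is what --srcdict points at below.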
Then split your file.bpe into train.bpe, valid.bpe and test.bpe and preprocess your data:

fairseq-preprocess \
    --only-source \
    --srcdict sentencepiece.bpe.vocab \
    --trainpref train.bpe \
    --validpref valid.bpe \
    --testpref test.bpe \
    --destdir $DATA_DIR \
    --workers $N_WORKERS

We chose TOTAL_UPDATES based on page 6 of the RoBERTa paper. WARMUP_UPDATES was 10% of TOTAL_UPDATES, and the total batch size was about 2k, but that depends on the number of GPUs and also on what kind of GPUs you want to use. We had a batch size of 256 per GPU (16 MAX_SENTENCES x 16 UPDATE_FREQ); across 8 GPUs that gives 2048. A worked version of these settings is sketched below.
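Plugging those choices into the variables from the question (a hedged example: the 125k figure is the 2K-batch setting reported in the RoBERTa paper, not necessarily the exact number used here):

TOTAL_UPDATES=125000    # RoBERTa paper: ~125k updates at an effective batch size of ~2k
WARMUP_UPDATES=12500    # 10% of TOTAL_UPDATES
PEAK_LR=0.0005
TOKENS_PER_SAMPLE=512
MAX_POSITIONS=512
MAX_SENTENCES=16        # sequences per GPU per forward pass
UPDATE_FREQ=16          # gradient accumulation steps
# effective batch size = MAX_SENTENCES x UPDATE_FREQ x num_GPUs = 16 x 16 x 8 = 2048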
Thanks

@simonefrancia

Closing the issue, but feel free to open it again.


ksopyla commented Mar 5, 2020

Hi @simonefrancia, I have another question about data preparation. The original fairseq tutorial is based on WikiText-103; a sample is below:

 = Robert Boulter =

 Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed by a starring role in the play Herons written by Simon Stephens , which was performed in 2001 at the Royal Court Theatre . He had a guest role in the television series Judge John Deed in 2002 . 

As you can see, the text was preprocessed: it was tokenized and each token was surrounded by spaces (the periods at the end of each sentence have a space before them).
But when you use BPE this doesn't make much sense, right?
Did you preprocess the text (file.raw in your example) in such a way before training the model?

simonefrancia reopened this Mar 5, 2020

simonefrancia commented Mar 5, 2020

Hi @ksopyla,
in general this preprocessing is not necessary, because during the SentencePiece training phase the algorithm itself learns how to split the text so as to make the best use of the vocabulary size that you chose at the beginning.
That is the power of learned tokenizers: they are not static and rule-based, but are fit to your data, and the more data they see, the better the segmentation they will probably learn.
So I think you can leave the text data in the original format.
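As a quick illustration (hypothetical example; the actual pieces depend on the vocabulary you trained), you can feed raw, untokenized text straight to the trained model:

echo "Robert Boulter is an English film, television and theatre actor." \
    | spm_encode --model=spm.bpe.model --output_format=piece
# produces pieces such as: ▁Robert ▁Boul ter ▁is ▁an ▁English ▁film , ▁television ▁and ▁theatre ▁actor .

There is no need to pre-tokenize or put spaces around punctuation beforehand.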


ksopyla commented Mar 9, 2020

My intuition was exactly as you wrote; thanks for the confirmation.

ksopyla closed this as completed Mar 9, 2020
@diiogofernands

Hi @simonefrancia, I'm training a RoBERTa from scratch following the description above (my dataset is 6 GB). The MLM loss keeps decreasing up to 50k steps, but when I evaluate the model on NER, the F1 score starts to decrease progressively after 16k steps. What could be causing this strange behavior?
