decoder parameter initialization and dictionary during fine-tuning #4
Sorry for the late reply. There are output layers in both pre-training and fine-tuning, but they are different. For different fine-tuning tasks, the simplest way is to use different output layers; for now, we just use linear layers as output layers. It's still open to try other options. For the BPE problem, we only predict one label ("verb") for the first subword "materi@@". We chose this option just to follow BERT and didn't try other options. If you have a better choice, could you also tell us? Please feel free to contact us if this doesn't fully answer your question.
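The first-subword convention described above can be sketched in a few lines of plain Python (illustrative names, not the repo's code): the word-level tag is predicted only at the first subword of each word, mirroring BERT's token-classification setup, and fastBPE marks non-final subwords with a trailing "@@".

```python
IGNORE = "-"  # placeholder for subwords that carry no prediction

def spread_tags(subwords, word_tags):
    """Align word-level tags to a fastBPE subword sequence.

    Each word's tag goes to its first subword; remaining subwords
    of the same word receive the IGNORE placeholder.
    """
    tags = []
    word_idx = 0
    new_word = True  # the next subword starts a new word
    for piece in subwords:
        tags.append(word_tags[word_idx] if new_word else IGNORE)
        if piece.endswith("@@"):   # word continues into the next piece
            new_word = False
        else:                      # word ends here; advance to next tag
            new_word = True
            word_idx += 1
    return tags

# "materialise" is split into "materi@@" + "alise"; only the first
# piece receives the verb tag.
pieces = ["the", "bu@@", "g", "materi@@", "alise"]
print(spread_tags(pieces, ["DET", "NOUN", "VERB"]))
# → ['DET', 'NOUN', '-', 'VERB', '-']
```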
@Eureka6174 So for XLM-R text generation, what exactly is the decoder?
It's Transformer layers in the decoder. The masked self-attention layers are initialized from XLM-R, and the attention from decoder to encoder is randomly initialized.
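That initialization scheme can be sketched as a state-dict merge (a hedged illustration with made-up parameter names, not the repo's actual code): self-attention parameters are overwritten with the pre-trained XLM-R weights, while decoder-to-encoder (cross) attention keys are absent from the encoder checkpoint and therefore keep their random initialization.

```python
import random

def init_decoder(decoder_state, xlmr_state):
    """Overwrite self-attention parameters with XLM-R weights.

    decoder_state / xlmr_state: dicts mapping parameter names to
    values, standing in for framework state_dicts.
    """
    out = dict(decoder_state)          # start from the random init
    for name, value in xlmr_state.items():
        if "self_attn" in name:        # masked self-attention: copy
            out[name] = value
        # "encoder_attn" (cross-attention) parameters never appear in
        # the encoder checkpoint, so they stay randomly initialized.
    return out

random.seed(0)
decoder = {"layers.0.self_attn.q_proj": random.random(),
           "layers.0.encoder_attn.q_proj": random.random()}
xlmr = {"layers.0.self_attn.q_proj": 0.123}

merged = init_decoder(decoder, xlmr)
print(merged["layers.0.self_attn.q_proj"])   # → 0.123 (from XLM-R)
```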
@Eureka6174 BTW, is the pre-training code in this repo as well? I'm only seeing … Thanks!
Thanks for the answer. I just wonder whether it would be better to choose the last subword instead of the first subword as the position of the POS tag. Another question: Q1: If I trained my own model using fairseq-train (for example, a classic NMT Transformer model), can I just declare it as an XLM and run the NER testing code (since they share the same encoder)? Q2: Any example and code for pre-training from scratch? Thank you very much!
We tried the last token, but it got similar results to the first token. Thanks!
Thank you for the timely response! BTW: Thank you very much!
Here is the link: https://github.com/huggingface/transformers/blob/master/src/transformers/models/roberta/convert_roberta_original_pytorch_checkpoint_to_pytorch.py I think you could read the HuggingFace Transformers documentation first. Thanks,
Thanks for the link, but I have modified my question, so let me restate it a bit. Q1: I am sorry, I have been using fairseq a lot and am new to Hugging Face. I will read more of its documentation.
I trained a model using fairseq-train without spm, but I found that I need a "sentencepiece.bpe.model" for the later task. I used a script similar to the fairseq translation example prepare-wmt14en2de.sh (https://github.com/pytorch/fairseq/tree/master/examples/translation), which does not generate a sentencepiece model; I prepared the data and trained the old model with it. The script that does produce a sentencepiece model is prepare-iwslt17-multilingual.sh.
I currently want to re-learn the sentencepiece.bpe.model on the training data from prepare-wmt14en2de.sh. Since I already trained the model without the sentencepiece.bpe.model, I just want to make sure I get exactly the same training data when I reapply the spm learning script to the old data, so that my previously trained model from prepare-wmt14en2de.sh can be coupled with the newly learnt sentencepiece.bpe.model.
However, the two use different code for BPE learning and encoding, and even different BPE continuation markers ("@@" for fastBPE and "▁" for sentencepiece). So how can I create a sentencepiece.bpe.model that can be used together with my old model from fastBPE (that is, a sentencepiece.bpe.model which will generate exactly the same training data as fastBPE)? Thank you very much!
I think your questions are more about fairseq and Huggingface, which are outside my knowledge. My model doesn't have a different structure or a different sentencepiece. Maybe you should raise an issue in their GitHub repos.
Thanks. I want my pre-training replication to be as close to your model as possible, so there won't be a performance loss due to differences in text pre-processing between pre-training and testing; so I want to make sure I follow your pre-processing procedure on the training data. In prepare-wmt14en2de.sh, several scripts are used to tokenize and clean the corpus (see https://github.com/pytorch/fairseq/tree/master/examples/translation), whereas prepare-iwslt17-multilingual.sh skips this pre-processing, since sentencepiece can be applied to raw text. So what is your pre-processing procedure for pre-training? I want to use the same tokenizer and normalization as your model in both pre-training and fine-tuning.
I'm using just raw text for pre-processing. I didn't try the tokenizers you mentioned because they differ across languages. If you would like to have a try, I would appreciate it if you could share your results with us, whether they work or not.
Hi:
Good evening!
I read in your paper that Unicoder uses an XLM/Transformer structure.
I understand that after pre-training you keep the parameters of the Transformer encoder, but I am a little confused about how you deal with the output/decoder part of the model.
I will use POS tagging as an example.
Since the pre-training tasks are masked token recovery, language modeling, etc., the model will have a fairly large output layer.
If we keep the original output layer parameters, the vocabulary size will be much larger than the 17 legitimate POS-tag labels, and the model may produce a token outside the legitimate label set.
Since those POS-tag labels did not appear during pre-training, they will create an OOV problem.
From what I understand, XLM attaches an NMT decoder when the pre-training task is machine translation, and keeps only the encoder when the pre-training task is masked token recovery. In fine-tuning, it attaches a new classifier to the pre-trained encoder for tasks such as XNLI, and re-attaches the original NMT decoder when the evaluation task is machine translation.
So during Unicoder fine-tuning, do I need to cut off the original output/decoder after pre-training and re-initialize an output layer based on the task at hand? Using POS tagging as an example, do I need to connect a new decoder/output layer to the encoder and cut the 50K-word vocabulary down to the 17 labels of the POS-tag task? If so, what is the optimal structure of the decoder/output layer for the POS-tag or NER task? Should I connect the encoder to a complex decoder as in NMT, or is a simple classifier layer connected to the encoder fine?
What happens for other tasks like NER? Do we fine-tune all the tasks in one model, or do we fine-tune a new model for each task?
Should I fix the encoder parameters and just fine-tune the decoder part?
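[For readers landing on this thread: per the maintainer's reply above, the answer is the simple option. A minimal numpy sketch of that setup, with illustrative names and toy sizes rather than the repo's actual code: drop the 50K-vocabulary pre-training output layer and attach a freshly initialized linear classifier sized to the task's label set.]

```python
import numpy as np

HIDDEN = 8       # toy encoder width (XLM-R uses 768 or 1024)
NUM_TAGS = 17    # the POS-tag label set replaces the 50K vocabulary

rng = np.random.default_rng(0)

# New task head: randomly initialized, learned during fine-tuning.
W = rng.normal(scale=0.02, size=(HIDDEN, NUM_TAGS))
b = np.zeros(NUM_TAGS)

def pos_logits(encoder_states):
    """Per-token scores over the 17 tags.

    encoder_states: (seq_len, HIDDEN) outputs of the pre-trained
    encoder, which is kept and typically fine-tuned as well.
    """
    return encoder_states @ W + b

hidden = rng.normal(size=(5, HIDDEN))   # 5 subword positions
print(pos_logits(hidden).shape)          # → (5, 17)
```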
#=======================================================================
A question on the BPE dictionary that splits the original words:
I am a little puzzled about how the BPE dictionary is used for POS tagging in your paper (the Unicoder baseline).
For example, a sentence like:
"although , as you will have seen , the dre@@ aded ' millennium bu@@ g ' failed to materi@@ alise , "
The BPE dictionary would split the word "materialise" into "materi@@" and "alise".
That is fine for the translation task; however, for the POS-tagging task, wouldn't it be better if we could keep the word "materialise" together, so we have one POS tag per word instead of one POS tag per word-piece, or two word-pieces sharing one POS tag?
So do you let "materi@@" and "alise" produce two "verb" labels, or do you let "materi@@" "alise" produce a single "verb" label (in which case the lengths of the input and output would differ)?