decoder parameter initialization and dictionary during fine-tuning #4

Closed
ever4244 opened this issue Mar 18, 2021 · 13 comments

@ever4244

ever4244 commented Mar 18, 2021

Hi:

Good evening!

I read in your paper that Unicoder uses an XLM/Transformer structure.

I understand that after pre-training you keep the parameters of the Transformer encoder, but I am a little confused about how you deal with the output/decoder part of the model.

I will use POS tagging as an example.

Since the pre-training task is masked token recovery, language modeling, etc., the model will have a fairly large output layer.

  1. If we keep the original output layer parameters, the output vocabulary will be much larger than the 17 valid POS-tag labels, and the model may produce a token outside the legal label set.

  2. The POS-tag labels themselves did not appear during pre-training, so they create an OOV problem.

From what I understand, XLM attaches an NMT decoder when the pre-training task is machine translation, and keeps only the encoder when the pre-training task is masked token recovery. During fine-tuning, it attaches a new classifier to the pre-trained encoder for tasks such as XNLI, and re-attaches the original NMT decoder when the evaluation task is machine translation.

So during Unicoder fine-tuning, do I need to cut off the original output/decoder after pre-training and re-initialize an output layer based on the task at hand? Using POS tagging as an example, do I need to connect a new decoder/output layer to the encoder and cut the 50K-word vocabulary down to the 17 labels of the POS-tag task? If so, what is the optimal structure of the decoder/output layer for the POS-tagging or NER task: should I connect the encoder to a complex decoder as in NMT, or is a simple classifier layer on top of the encoder enough?

What happens for other tasks like NER? Do we fine-tune all the tasks in one model, or fine-tune a new model for each task?

Should I freeze the encoder parameters and fine-tune only the decoder part?

#=======================================================================
Question on the BPE dictionary that splits the original words:

I am a little puzzled about how the BPE dictionary is used for POS tagging in your paper (the Unicoder baseline).

For example, in a sentence like:
"although , as you will have seen , the dre@@ aded ' millennium bu@@ g ' failed to materi@@ alise , "
the BPE dictionary splits the word "materialise" into "materi@@" and "alise".

That is fine for the translation task. For the POS-tagging task, however, wouldn't it be better to keep the word "materialise" together, so that we have one POS tag per word, instead of one tag per word piece or two word pieces sharing a single tag?

So do you let "materi@@" and "alise" each produce a "VERB" label, or do you let "materi@@ alise" produce a single "VERB" label (in which case the lengths of input and output would differ)?

Regards!
Wei
@Eureka6174
Contributor

Sorry for the late reply.

There are output layers in both pre-training and fine-tuning, but they are different. For different fine-tuning tasks, the simplest way is to use different output layers. For now, we just use linear layers as output layers; other options are still open to try.
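For illustration, a minimal sketch of such a setup with a HuggingFace-style XLM-R encoder (the class name and checkpoint below are placeholders, not this repo's actual code):

```python
import torch
from transformers import XLMRobertaModel

class PosTagger(torch.nn.Module):
    """Pre-trained encoder plus one task-specific linear output layer."""
    def __init__(self, num_labels=17):          # 17 = size of the UD POS tag set
        super().__init__()
        self.encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
        self.classifier = torch.nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        # One score vector per subword position; another task (e.g. NER) would
        # get its own, separately initialized head on the same encoder.
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.classifier(hidden)           # (batch, seq_len, num_labels)
```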

For the BPE problem, we only produce one VERB label for the first subword "materi@@". We chose this option just to follow BERT and didn't try other options. If you find a better choice, could you let us know?
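A sketch of that first-subword convention with a fast tokenizer (again illustrative, not this repo's code): non-first subwords and special tokens get label -100 so they are ignored by the loss.

```python
from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
words = ["the", "dreaded", "millennium", "bug", "failed", "to", "materialise"]
tags  = [0, 1, 2, 2, 3, 4, 3]        # one (made-up) tag id per word

enc = tokenizer(words, is_split_into_words=True)
labels, prev = [], None
for word_id in enc.word_ids():
    if word_id is None:              # special tokens such as <s> and </s>
        labels.append(-100)          # -100 is ignored by PyTorch's cross-entropy loss
    elif word_id != prev:            # first subword of each word carries the tag
        labels.append(tags[word_id])
    else:                            # remaining subwords (e.g. "alise") are masked out
        labels.append(-100)
    prev = word_id
```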

Please feel free to contact us if this doesn't fully answer your question.

@thomas-happify

@Eureka6174
Hi there!

So for XLM-R text generation, what exactly is the decoder?
Is it also a simple linear layer, rather than another XLM-R initialized as the decoder?

@Eureka6174
Contributor

The decoder consists of Transformer layers. The masked self-attention layers are initialized from XLM-R, and the decoder-to-encoder attention is randomly initialized.
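(For reference, HuggingFace Transformers offers a similar warm-starting pattern, not necessarily what this repo's code does:)

```python
from transformers import EncoderDecoderModel

# Encoder and decoder weights both come from XLM-R; the decoder runs with causal
# (masked) self-attention, and the decoder-to-encoder cross-attention layers are
# newly added, i.e. randomly initialized.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "xlm-roberta-base", "xlm-roberta-base"
)
```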

@thomas-happify

@Eureka6174
Thanks! That makes sense.

BTW, is the pre-training code in this repo as well? I only see generation_from_pretrained_xlmr.py, but I don't think that's the one, right? I'm really interested in how you further pre-trained the model with the decoder.

Thanks!

@ever4244
Author

ever4244 commented Apr 22, 2021

Thanks for the answer. I just wonder whether it would be better to choose the last subword instead of the first subword as the position for the POS tag.

Some other questions:

Q1:
I noticed that a model_type flag is provided during testing to indicate which pre-trained model is used.

If I trained my own model with fairseq-train (for example a classic NMT Transformer model), how do I use it in the POS-tagging and NER testing tasks? It would have different dimensions and layers compared to the pre-trained XLM and BERT models you provide.

Can I just declare it as an XLM and run the NER testing code (since they share the same encoder architecture)?

Q2: Is there any example code for pre-training from scratch?
I am currently trying to use the multilingual translation script in the generation examples to pre-train a model, but your paper uses many different pre-training tasks. I understand that you fine-tune the existing XLM-R models with language modeling; I just wonder whether there is a pre-training example that trains a model from scratch, so that I can change the size and dimensions more flexibly.

Q3:
I see that in the "generation" folder the Unicoder X_dae model is fairseq-based.
In the "understanding" folder, the pre-trained model is Hugging Face Transformers-based.
Can I use them interchangeably? For example, if I trained/fine-tuned a model in the generation folder with fairseq, can I move it to the understanding folder and test it there?
It seems that the fairseq-trained model is saved as xxx.pt, while the Hugging Face Transformers model is saved as pytorch_model.bin and config.json, so I am puzzled about how to use one encoder for both generation and understanding tasks.

Thank you very much!

@Eureka6174
Contributor

We tried the last subword, but it got similar results to the first subword.
Q1: You could use the code from HuggingFace to transform it to HuggingFace format and run POS Tagging and NER.
Q2: Our pre-training scripts are not ready for release yet.
Q3: You need to transform the model with HuggingFace code.

Thanks!

@ever4244
Author

ever4244 commented Apr 23, 2021

> We tried the last subword, but it got similar results to the first subword.
> Q1: You could use the code from HuggingFace to transform it to HuggingFace format and run POS Tagging and NER.
> Q2: Our pre-training scripts are not ready for release yet.
> Q3: You need to transform the model with HuggingFace code.
>
> Thanks!

Thank you for the timely response!
Could you elaborate on Q1, or give me a link for "use the code from HuggingFace to transform it to HuggingFace format"?
Are you referring to this one?
https://github.com/stas00/porting/tree/master/transformers/fairseq-wmt19
I am not sure whether this only works for the standard model structure or whether it can convert models with a different structure. My model may have a different size, dimensions, or even different attention connections; can I use it to convert from xxx.pt to pytorch_model.bin?

BTW:
I trained a model using fairseq-train, but I found that in the generation folder your pre-trained model contains a "sentencepiece.bpe.model". I don't get this file when compiling the data into BPE; I only get "check.pt" and "dict.txt". At which step do you obtain sentencepiece.bpe.model?

Thank you very much!

@Eureka6174
Contributor

Here is the link: https://github.com/huggingface/transformers/blob/master/src/transformers/models/roberta/convert_roberta_original_pytorch_checkpoint_to_pytorch.py

I think you could read the HuggingFace Transformers documentation first.
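Once converted, the resulting folder (pytorch_model.bin plus config.json) should load with the standard API; a sketch with a placeholder path, assuming an XLM-R/RoBERTa-style checkpoint (check the conversion script's argparse options for the exact flags it expects):

```python
from transformers import XLMRobertaModel

# Placeholder path: wherever the conversion script wrote pytorch_model.bin and config.json.
model = XLMRobertaModel.from_pretrained("/path/to/converted_checkpoint")
```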

Thanks,
Yaobo

@ever4244
Author

ever4244 commented Apr 23, 2021

Thanks for the link, but I have modified my question, so let me restate it a bit.

Q1:
I also found a link:
https://github.com/stas00/porting/tree/master/transformers/fairseq-wmt19
But both your link and mine seem to cover conversion of a standard model structure. What if I have a different model structure?
My model may have a different size, dimensions, or even different attention connections; can I use it to convert from xxx.pt to pytorch_model.bin?
My current pre-trained model is a Transformer NMT model with 6 encoder layers and 1 decoder layer. Can I use the RoBERTa converter, given that BERT and NMT share a similar encoder but a different decoder? I suppose there is no universal converter between fairseq and Hugging Face for arbitrary structures?

I am sorry, I have been using fairseq a lot and am new to Hugging Face. I will read more of its documentation.

@ever4244
Author

ever4244 commented Apr 23, 2021

I trained a model using fairseq-train without SentencePiece, but I found that I need a "sentencepiece.bpe.model" for later tasks.

I used a script similar to the fairseq translation example prepare-wmt14en2de.sh, which does not generate a SentencePiece model, and I prepared the data and trained the old model with it.

https://github.com/pytorch/fairseq/tree/master/examples/translation

The one that does produce a SentencePiece model is prepare-iwslt17-multilingual.sh:
python "$SPM_TRAIN" \
    --input=$TRAIN_FILES \
    --model_prefix=$DATA/sentencepiece.bpe \
    --vocab_size=$BPESIZE \
    --character_coverage=1.0 \
    --model_type=bpe
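(For reference, with a recent sentencepiece release the Python API equivalent of that command would be roughly the following; file names and sizes are placeholders.)

```python
import sentencepiece as spm

# Rough Python-API equivalent of the spm_train call above (paths and sizes are placeholders).
spm.SentencePieceTrainer.train(
    input="train.en,train.de",            # comma-separated training files
    model_prefix="sentencepiece.bpe",     # writes sentencepiece.bpe.model / .vocab
    vocab_size=40000,
    character_coverage=1.0,
    model_type="bpe",
)
```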

I currently want to learn a sentencepiece.bpe.model on the training data from prepare-wmt14en2de.sh.

Since I already trained the model without a sentencepiece.bpe.model, I want to make sure I get exactly the same training data when I reapply the SPM learning script to the old data, so that my previously trained model from prepare-wmt14en2de.sh can be coupled with the newly learnt sentencepiece.bpe.model.

However,
prepare-wmt14en2de.sh uses fastBPE's learn_bpe.py:
https://github.com/glample/fastBPE
while prepare-iwslt17-multilingual.sh uses sentencepiece's spm_train.py:
https://github.com/google/sentencepiece

They use different code for BPE learning and encoding, and even different BPE segmentation markers ("@@" for fastBPE and "▁" for sentencepiece). So how can I create a sentencepiece.bpe.model that can be used together with my old fastBPE-based model, i.e. one that generates exactly the same training data as fastBPE?

Thank you very much!

@Eureka6174
Contributor

I think your questions are more about fairseq and Hugging Face, which are outside my knowledge. My model doesn't have a different structure or a different SentencePiece model. Maybe you should raise an issue in their GitHub repos.

@ever4244
Author

ever4244 commented Apr 26, 2021

> I think your questions are more about fairseq and Hugging Face, which are outside my knowledge. My model doesn't have a different structure or a different SentencePiece model. Maybe you should raise an issue in their GitHub repos.

Thanks.
I have some questions on preprocessing as well.

I want my pre-training replication to be as close to your model as possible, so that there is no performance loss due to differences in text pre-processing between pre-training and testing. So I want to make sure that I follow your pre-processing procedure for the training data.

In prepare-wmt14en2de.sh, several Moses scripts are used to tokenize and clean the corpus, for example:
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl

https://github.com/pytorch/fairseq/tree/master/examples/translation
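(For reference, the sacremoses Python package re-implements these Moses Perl scripts; a rough sketch of the normalize and tokenize steps, just to illustrate what the pipeline does, not what this repo uses:)

```python
from sacremoses import MosesPunctNormalizer, MosesTokenizer

# Rough Python equivalent of two of the Perl scripts above (illustration only).
normalizer = MosesPunctNormalizer(lang="en")
tokenizer = MosesTokenizer(lang="en")

line = "Although, as you will have seen, the dreaded 'millennium bug' failed to materialise,"
line = normalizer.normalize(line)                  # normalize-punctuation.perl
tokens = tokenizer.tokenize(line, escape=False)    # tokenizer.perl
# clean-corpus-n.perl (sentence-length filtering) and remove-non-printing-char.perl
# would still be applied separately for a full replication of prepare-wmt14en2de.sh.
print(" ".join(tokens))
```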

prepare-iwslt17-multilingual.sh does not use this pre-processing, since sentencepiece can be applied to raw text.

So what is your pre-processing procedure for pre-training? I want to use the same tokenization and normalization as your model for both pre-training and fine-tuning.

@Eureka6174
Contributor

I just use raw text without pre-processing. I didn't try the tokenizers you mentioned because they differ across languages. If you give them a try, I would appreciate it if you could share your results with us, whether they work or not.
