decoder parameter initialization and dictionary during fine-tuning #4
Sorry for the late reply. There are output layers in both pre-training and fine-tuning, but they are different. For different fine-tuning tasks, the simplest way is to use different output layers; for now, we just use linear layers as output layers. It's still open to try other options. For the BPE problem, we only predict one label ("verb") for the first subword "materi@@". We chose this option just to follow BERT and didn't try other options. If you have a better choice, could you also tell us? Please feel free to contact us if this doesn't fully answer your question.
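The first-subword convention described above can be sketched in a few lines of plain Python (illustrative names, not the repo's code): the word-level tag is predicted only at the first subword of each word, mirroring BERT's token-classification setup, and fastBPE marks non-final subwords with a trailing "@@".

```python
IGNORE = "-"  # placeholder for subwords that carry no prediction

def spread_tags(subwords, word_tags):
    """Align word-level tags to a fastBPE subword sequence.

    Each word's tag goes to its first subword; remaining subwords
    of the same word receive the IGNORE placeholder.
    """
    tags = []
    word_idx = 0
    new_word = True  # the next subword starts a new word
    for piece in subwords:
        tags.append(word_tags[word_idx] if new_word else IGNORE)
        if piece.endswith("@@"):   # word continues into the next piece
            new_word = False
        else:                      # word ends here; advance to next tag
            new_word = True
            word_idx += 1
    return tags

# "materialise" is split into "materi@@" + "alise"; only the first
# piece receives the verb tag.
pieces = ["the", "bu@@", "g", "materi@@", "alise"]
print(spread_tags(pieces, ["DET", "NOUN", "VERB"]))
# → ['DET', 'NOUN', '-', 'VERB', '-']
```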
@Eureka6174 So for XLM-R text generation, what exactly is the decoder?
It's Transformer layers in the decoder. The masked self-attention layers are initialized from XLM-R, and the attention from decoder to encoder is randomly initialized.
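That initialization scheme can be sketched as a state-dict merge (a hedged illustration with made-up parameter names, not the repo's actual code): self-attention parameters are overwritten with the pre-trained XLM-R weights, while decoder-to-encoder (cross) attention keys are absent from the encoder checkpoint and therefore keep their random initialization.

```python
import random

def init_decoder(decoder_state, xlmr_state):
    """Overwrite self-attention parameters with XLM-R weights.

    decoder_state / xlmr_state: dicts mapping parameter names to
    values, standing in for framework state_dicts.
    """
    out = dict(decoder_state)          # start from the random init
    for name, value in xlmr_state.items():
        if "self_attn" in name:        # masked self-attention: copy
            out[name] = value
        # "encoder_attn" (cross-attention) parameters never appear in
        # the encoder checkpoint, so they stay randomly initialized.
    return out

random.seed(0)
decoder = {"layers.0.self_attn.q_proj": random.random(),
           "layers.0.encoder_attn.q_proj": random.random()}
xlmr = {"layers.0.self_attn.q_proj": 0.123}

merged = init_decoder(decoder, xlmr)
print(merged["layers.0.self_attn.q_proj"])   # → 0.123 (from XLM-R)
```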
@Eureka6174 BTW, is the pre-training code in this repo as well? I'm only seeing … Thanks!
Thanks for the answer. I just wonder whether it would be better to choose the last subword instead of the first subword as the position of the POS tag. Another question: Q1: If I trained my own model using fairseq-train (for example, a classic NMT Transformer model), can I just declare it as an XLM and run the NER testing code (since they share the same encoder)? Q2: Any example and code for pre-training from scratch? Thank you very much!
We tried the last token, but it got similar results to the first token. Thanks!
Thank you for the timely response! BTW: Thank you very much!
Here is the link: https://github.com/huggingface/transformers/blob/master/src/transformers/models/roberta/convert_roberta_original_pytorch_checkpoint_to_pytorch.py I think you could read the HuggingFace Transformers documentation first. Thanks,
Thanks for the link, but I have modified my question, so let me restate it a bit. Q1: I am sorry, I have been using fairseq a lot and am new to Hugging Face. I will read more of its documentation.
I trained a model using fairseq-train without spm, but I found that I need a "sentencepiece.bpe.model" for the later task. I used a script similar to the fairseq translation example prepare-wmt14en2de.sh (https://github.com/pytorch/fairseq/tree/master/examples/translation), which does not generate a sentencepiece model; I prepared the data and trained the old model with it. The script that does produce a sentencepiece model is prepare-iwslt17-multilingual.sh.
I currently want to re-learn the sentencepiece.bpe.model on the training data from prepare-wmt14en2de.sh. Since I already trained the model without the sentencepiece.bpe.model, I just want to make sure I get exactly the same training data when I reapply the spm learning script to the old data, so that my previously trained model from prepare-wmt14en2de.sh can be coupled with the newly learnt sentencepiece.bpe.model.
However, the two use different code for BPE learning and encoding, and even different BPE continuation markers ("@@" for fastBPE and "▁" for sentencepiece). So how can I create a sentencepiece.bpe.model that can be used together with my old model from fastBPE (that is, a sentencepiece.bpe.model which will generate exactly the same training data as fastBPE)? Thank you very much!
I think your questions are more about fairseq and Huggingface, which are outside my knowledge. My model doesn't have a different structure or a different sentencepiece. Maybe you should raise an issue in their GitHub repos.
Thanks. I want my pre-training replication to be as close to your model as possible, so there won't be a performance loss due to differences in text pre-processing between pre-training and testing; so I want to make sure I follow your pre-processing procedure on the training data. In prepare-wmt14en2de.sh, several scripts are used to tokenize and clean the corpus (see https://github.com/pytorch/fairseq/tree/master/examples/translation), whereas prepare-iwslt17-multilingual.sh skips this pre-processing, since sentencepiece can be applied to raw text. So what is your pre-processing procedure for pre-training? I want to use the same tokenizer and normalization as your model in both pre-training and fine-tuning.
I'm using just raw text for pre-processing. I didn't try the tokenizers you mentioned because they differ across languages. If you would like to have a try, I would appreciate it if you could share your results with us, whether they work or not.
Hi:
Good evening!
I read in your paper that Unicoder uses an XLM/Transformer structure.
I understand that after pre-training you keep the parameters of the Transformer encoder, but I am a little confused about how you deal with the output/decoder part of the model.
I will use POS tagging as an example.
Since the pre-training tasks are masked token recovery, language modeling, etc., the model will have a fairly large output layer.
If we keep the original output layer parameters, the vocabulary size will be much larger than the 17 legitimate POS-tag labels, and the model may produce a token outside the legitimate label set.
Since those POS-tag labels did not appear during pre-training, they will create an OOV problem.
From what I understand, XLM attaches an NMT decoder when the pre-training task is machine translation, and keeps only the encoder when the pre-training task is masked token recovery. In fine-tuning, it attaches a new classifier to the pre-trained encoder for tasks such as XNLI, and re-attaches the original NMT decoder when the evaluation task is machine translation.
So during Unicoder fine-tuning, do I need to cut off the original output/decoder after pre-training and re-initialize an output layer based on the task at hand? Using POS tagging as an example, do I need to connect a new decoder/output layer to the encoder and cut the 50K-word vocabulary down to the 17 labels of the POS-tag task? If so, what is the optimal structure of the decoder/output layer for the POS-tag or NER task? Should I connect the encoder to a complex decoder as in NMT, or is a simple classifier layer connected to the encoder fine?
What happens for other tasks like NER? Do we fine-tune all the tasks in one model, or do we fine-tune a new model for each task?
Should I fix the encoder parameters and just fine-tune the decoder part?
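[For readers landing on this thread: per the maintainer's reply above, the answer is the simple option. A minimal numpy sketch of that setup, with illustrative names and toy sizes rather than the repo's actual code: drop the 50K-vocabulary pre-training output layer and attach a freshly initialized linear classifier sized to the task's label set.]

```python
import numpy as np

HIDDEN = 8       # toy encoder width (XLM-R uses 768 or 1024)
NUM_TAGS = 17    # the POS-tag label set replaces the 50K vocabulary

rng = np.random.default_rng(0)

# New task head: randomly initialized, learned during fine-tuning.
W = rng.normal(scale=0.02, size=(HIDDEN, NUM_TAGS))
b = np.zeros(NUM_TAGS)

def pos_logits(encoder_states):
    """Per-token scores over the 17 tags.

    encoder_states: (seq_len, HIDDEN) outputs of the pre-trained
    encoder, which is kept and typically fine-tuned as well.
    """
    return encoder_states @ W + b

hidden = rng.normal(size=(5, HIDDEN))   # 5 subword positions
print(pos_logits(hidden).shape)          # → (5, 17)
```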
#=======================================================================
A question on the BPE dictionary that splits the original words:
I am a little puzzled about how the BPE dictionary is used for POS tagging in your paper (the Unicoder baseline).
For example, a sentence like:
"although , as you will have seen , the dre@@ aded ' millennium bu@@ g ' failed to materi@@ alise , "
The BPE dictionary would split the word "materialise" into "materi@@" and "alise".
That is fine for the translation task; however, for the POS-tagging task, wouldn't it be better if we could keep the word "materialise" together, so we have one POS tag per word instead of one POS tag per word-piece, or two word-pieces sharing one POS tag?
So do you let "materi@@" and "alise" produce two "verb" labels, or do you let "materi@@" "alise" produce a single "verb" label (in which case the lengths of the input and output would differ)?