System Info
- adapter-transformers version: 3.0.1
- Platform: Linux-4.15.0-72-generic-ppc64le-with-debian-buster-sid
- Python version: 3.6.11
- PyTorch version (GPU?): 1.5.0 (True)
- Tensorflow version (GPU?): 2.2.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
That said, this question is largely version-independent and general purpose.
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I have been going through the GitHub issues and the history of modeling_bart.py, and I found that from PRs #9134 and #9135 to #9343 the shift_tokens_right function was modified to do a number of things. For most of those changes the PR docs and comments are super helpful! (For instance, the issue of -100 values in the labels being replaced by the PAD token, and other issues explaining that even if the shift leaves two consecutive PAD PAD tokens, it is fine since the loss does not account for them.)
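For illustration, a minimal sketch of the -100 handling with the current library function (the toy token ids here are made up, with 100 standing in for PAD and 0 for the decoder start token):

import torch
from transformers.models.bart.modeling_bart import shift_tokens_right

# toy labels where -100 marks positions the loss should ignore
labels = torch.tensor([[13, 8, 11, 1, -100, -100]])
decoder_input_ids = shift_tokens_right(labels, pad_token_id=100, decoder_start_token_id=0)
print(decoder_input_ids)  # tensor([[  0,  13,   8,  11,   1, 100]]) -- the -100s become PAD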
However, one thing that was changed but never described or mentioned in the issues/docs/comments is that after shifting the input right, the EOS index used to be placed at the beginning of the shifted input, whereas after #9343 the BOS index is placed there instead!
I realize that with BartForConditionalGeneration it is quite rare to pass labels and have the decoder_input_ids generated from them, and that path is mainly for doing MLM. This is where the problem arises: example/reproducible projects and models are divided on which shift_tokens_right function to use when preparing the input for BART fine-tuning and inference. Some projects do a simple:
from transformers.models.bart.modeling_bart import shift_tokens_right
and thereby use the newest, current version of the shifting function:
def shift_tokens_right_NEW(input_ids: torch.Tensor, pad_token_id: int, decoder_start_token_id: int):
    """
    Shift input ids one token to the right.
    """
    shifted_input_ids = input_ids.new_zeros(input_ids.shape)
    shifted_input_ids[:, 1:] = input_ids[:, :-1].clone()
    shifted_input_ids[:, 0] = decoder_start_token_id

    if pad_token_id is None:
        raise ValueError("self.model.config.pad_token_id has to be defined.")
    # replace possible -100 values in labels by `pad_token_id`
    shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)

    return shifted_input_ids
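For context, this is roughly how BartForConditionalGeneration uses it when only labels are passed (a simplified sketch of the logic in forward, not verbatim library code):

# simplified sketch of the relevant logic in BartForConditionalGeneration.forward
if labels is not None and decoder_input_ids is None:
    decoder_input_ids = shift_tokens_right(
        labels, self.config.pad_token_id, self.config.decoder_start_token_id
    )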
But a lot of projects using BART instead define and use the original shifting function:
def shift_tokens_right_OLD(input_ids, pad_token_id):
    """Shift input ids one token to the right, and wrap the last non pad token (usually <eos>)."""
    prev_output_tokens = input_ids.clone()
    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze()
    prev_output_tokens[:, 1:] = input_ids[:, :-1]
    return prev_output_tokens
The difference between them is that the old version shifts right and places the EOS token at the beginning, while the new version places the decoder_start_token_id (BOS) there.
This has been causing some issues while preparing inputs for my project! I just wanted to know: is there a reason why the new code no longer uses EOS? And is one of the two correct and the other incorrect for fine-tuning, or does it make no difference?
I am asking this from the point of view of a seq2seq task, so there is no need for labels and the target decoder_input_ids are prepared directly for training and fine-tuning.
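For concreteness, a rough sketch of what I mean by preparing the targets myself; the checkpoint name and loss wiring are just placeholders, and the two shift functions are the ones defined above:

import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

src = tokenizer(["some source text"], return_tensors="pt", padding=True)
tgt = tokenizer(["some target text"], return_tensors="pt", padding=True)

# old-style preparation: first decoder token is the wrapped EOS
dec_old = shift_tokens_right_OLD(tgt.input_ids, tokenizer.pad_token_id)
# new-style preparation: first decoder token is whatever start id is passed (here BOS)
dec_new = shift_tokens_right_NEW(tgt.input_ids, tokenizer.pad_token_id, tokenizer.bos_token_id)

outputs = model(input_ids=src.input_ids, attention_mask=src.attention_mask,
                decoder_input_ids=dec_new)  # or dec_old -- which one is the question
loss = torch.nn.functional.cross_entropy(
    outputs.logits.view(-1, outputs.logits.size(-1)),
    tgt.input_ids.view(-1),
    ignore_index=tokenizer.pad_token_id,
)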
A sample difference between the old and new functions can be seen from this example:
bos = 0
pad = 100
eos = 1
>>> input_id
tensor([
[ 0, 13, 8, 11, 9, 2, 2, 17, 9, 1],
[ 0, 14, 10, 7, 6, 10, 3, 1, 100, 100],
[ 0, 4, 16, 14, 2, 14, 3, 1, 100, 100],
[ 0, 16, 12, 7, 5, 14, 6, 10, 1, 100],
[ 0, 3, 12, 7, 9, 1, 100, 100, 100, 100]])
# Here it can be assumed that this is the original random target from the tokenizer after padding and EOS/BOS.
# 0 = BOS, 1 = EOS, 100 = PAD. The prepared targets after the old and new shift functions will be:
>>> shift_tokens_right_OLD(input_id,pad)
tensor([
[ 1, 0, 13, 8, 11, 9, 2, 2, 17, 9],
[ 1, 0, 14, 10, 7, 6, 10, 3, 1, 100],
[ 1, 0, 4, 16, 14, 2, 14, 3, 1, 100],
[ 1, 0, 16, 12, 7, 5, 14, 6, 10, 1],
[ 1, 0, 3, 12, 7, 9, 1, 100, 100, 100]])
>>> shift_tokens_right_NEW(input_id,pad,bos)
tensor([
[ 0, 0, 13, 8, 11, 9, 2, 2, 17, 9],
[ 0, 0, 14, 10, 7, 6, 10, 3, 1, 100],
[ 0, 0, 4, 16, 14, 2, 14, 3, 1, 100],
[ 0, 0, 16, 12, 7, 5, 14, 6, 10, 1],
[ 0, 0, 3, 12, 7, 9, 1, 100, 100, 100]])
So the main difference in preparation is the EOS/BOS change at the decoder_input_ids[:, 0] position.
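A quick way to confirm this, continuing the interactive example above:

old = shift_tokens_right_OLD(input_id, pad)
new = shift_tokens_right_NEW(input_id, pad, bos)
assert torch.equal(old[:, 1:], new[:, 1:])  # identical everywhere except the first column
print(old[:, 0])  # tensor([1, 1, 1, 1, 1])  -> EOS
print(new[:, 0])  # tensor([0, 0, 0, 0, 0])  -> BOS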
Expected behavior
I was hoping I could get some guidance on 2 questions:
- Is there a difference, from the point of view of the BART model, between fine-tuning/inference using the EOS index versus the BOS index after shifting?
- As an extension, is one better than the other and should it be preferred?
My main reason for asking is to bring existing projects/models and new fine-tuning code in line with each other, so that they share the same shift and target-preparation approach!