System Info
- adapter-transformers version: 3.0.1
- Platform: Linux-4.15.0-72-generic-ppc64le-with-debian-buster-sid
- Python version: 3.6.11
- PyTorch version (GPU?): 1.5.0 (True)
- Tensorflow version (GPU?): 2.2.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
That said, this question is largely version-independent and general purpose.
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I have been going through the GitHub issues and the history of modeling_bart.py, and I found that from PRs #9134 and #9135 to #9343 the shift_tokens_right function was modified to do a number of things. For most of those changes the PR docs and comments are super helpful! (For instance, the issue of -100 values in the labels being replaced by the PAD token, and other issues explaining that even if the shift leaves two consecutive PAD PAD tokens, it is fine since the loss does not account for them.)
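For illustration, a minimal sketch of the -100 handling with the current library function (the toy token ids here are made up, with 100 standing in for PAD and 0 for the decoder start token):

import torch
from transformers.models.bart.modeling_bart import shift_tokens_right

# toy labels where -100 marks positions the loss should ignore
labels = torch.tensor([[13, 8, 11, 1, -100, -100]])
decoder_input_ids = shift_tokens_right(labels, pad_token_id=100, decoder_start_token_id=0)
print(decoder_input_ids)  # tensor([[  0,  13,   8,  11,   1, 100]]) -- the -100s become PAD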
However, one thing that was changed but never described or mentioned in the issues/docs/comments is that after shifting the input right, the EOS index used to be placed at the beginning of the shifted input, whereas after #9343 the BOS index is placed there instead!
I realize that with BartForConditionalGeneration it is quite rare to pass labels and have the decoder_input_ids generated from them, and that path is mainly for doing MLM. This is where the problem arises: example/reproducible projects and models are divided on which shift_tokens_right function to use when preparing the input for BART fine-tuning and inference. Some projects do a simple:
from transformers.models.bart.modeling_bart import shift_tokens_right
and thereby use the newest, current version of the shifting function:
def shift_tokens_right_NEW(input_ids: torch.Tensor, pad_token_id: int, decoder_start_token_id: int):
    """
    Shift input ids one token to the right.
    """
    shifted_input_ids = input_ids.new_zeros(input_ids.shape)
    shifted_input_ids[:, 1:] = input_ids[:, :-1].clone()
    shifted_input_ids[:, 0] = decoder_start_token_id

    if pad_token_id is None:
        raise ValueError("self.model.config.pad_token_id has to be defined.")
    # replace possible -100 values in labels by `pad_token_id`
    shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)

    return shifted_input_ids
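For context, this is roughly how BartForConditionalGeneration uses it when only labels are passed (a simplified sketch of the logic in forward, not verbatim library code):

# simplified sketch of the relevant logic in BartForConditionalGeneration.forward
if labels is not None and decoder_input_ids is None:
    decoder_input_ids = shift_tokens_right(
        labels, self.config.pad_token_id, self.config.decoder_start_token_id
    )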
But a lot of projects using BART instead define and use the original shifting function:
def shift_tokens_right_OLD(input_ids, pad_token_id):
    """Shift input ids one token to the right, and wrap the last non pad token (usually <eos>)."""
    prev_output_tokens = input_ids.clone()
    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze()
    prev_output_tokens[:, 1:] = input_ids[:, :-1]
    return prev_output_tokens
The difference between them is that the old version shifts right and places the EOS token at the beginning, while the new version places the decoder_start_token_id (BOS) there.
This has been causing some issues while preparing inputs for my project! I just wanted to know: is there a reason why the new code no longer uses EOS? And is one of the two correct and the other incorrect for fine-tuning, or does it make no difference?
I am asking this from the point of view of a seq2seq task, so there is no need for labels and the target decoder_input_ids are prepared directly for training and fine-tuning.
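For concreteness, a rough sketch of what I mean by preparing the targets myself; the checkpoint name and loss wiring are just placeholders, and the two shift functions are the ones defined above:

import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

src = tokenizer(["some source text"], return_tensors="pt", padding=True)
tgt = tokenizer(["some target text"], return_tensors="pt", padding=True)

# old-style preparation: first decoder token is the wrapped EOS
dec_old = shift_tokens_right_OLD(tgt.input_ids, tokenizer.pad_token_id)
# new-style preparation: first decoder token is whatever start id is passed (here BOS)
dec_new = shift_tokens_right_NEW(tgt.input_ids, tokenizer.pad_token_id, tokenizer.bos_token_id)

outputs = model(input_ids=src.input_ids, attention_mask=src.attention_mask,
                decoder_input_ids=dec_new)  # or dec_old -- which one is the question
loss = torch.nn.functional.cross_entropy(
    outputs.logits.view(-1, outputs.logits.size(-1)),
    tgt.input_ids.view(-1),
    ignore_index=tokenizer.pad_token_id,
)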
A sample difference between the old and new functions can be seen from this example:
bos = 0
pad = 100
eos = 1
>>> input_id
tensor([
[ 0, 13, 8, 11, 9, 2, 2, 17, 9, 1],
[ 0, 14, 10, 7, 6, 10, 3, 1, 100, 100],
[ 0, 4, 16, 14, 2, 14, 3, 1, 100, 100],
[ 0, 16, 12, 7, 5, 14, 6, 10, 1, 100],
[ 0, 3, 12, 7, 9, 1, 100, 100, 100, 100]])
# Here it can be assumed that this is the original random target from the tokenizer after padding and EOS/BOS.
# 0 = BOS, 1 = EOS, 100 = PAD. The prepared targets after the old and new shift functions will be:
>>> shift_tokens_right_OLD(input_id,pad)
tensor([
[ 1, 0, 13, 8, 11, 9, 2, 2, 17, 9],
[ 1, 0, 14, 10, 7, 6, 10, 3, 1, 100],
[ 1, 0, 4, 16, 14, 2, 14, 3, 1, 100],
[ 1, 0, 16, 12, 7, 5, 14, 6, 10, 1],
[ 1, 0, 3, 12, 7, 9, 1, 100, 100, 100]])
>>> shift_tokens_right_NEW(input_id,pad,bos)
tensor([
[ 0, 0, 13, 8, 11, 9, 2, 2, 17, 9],
[ 0, 0, 14, 10, 7, 6, 10, 3, 1, 100],
[ 0, 0, 4, 16, 14, 2, 14, 3, 1, 100],
[ 0, 0, 16, 12, 7, 5, 14, 6, 10, 1],
[ 0, 0, 3, 12, 7, 9, 1, 100, 100, 100]])
So the main difference in preparation is the EOS/BOS change at the decoder_input_ids[:, 0] position.
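A quick way to confirm this, continuing the interactive example above:

old = shift_tokens_right_OLD(input_id, pad)
new = shift_tokens_right_NEW(input_id, pad, bos)
assert torch.equal(old[:, 1:], new[:, 1:])  # identical everywhere except the first column
print(old[:, 0])  # tensor([1, 1, 1, 1, 1])  -> EOS
print(new[:, 0])  # tensor([0, 0, 0, 0, 0])  -> BOS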
Expected behavior
I was hoping I could get some guidance on 2 questions:
- Is there a difference, from the point of view of the BART model, between fine-tuning/inference using the EOS index versus the BOS index after shifting?
- As an extension, is one better than the other and should it be preferred?
My main reason for asking is to bring existing projects/models and new fine-tuning code in line with each other, so that they share the same shift and target-preparation approach!