Override _pad in LEDTokenizer to deal with global_attention_mask #15940
Conversation
The documentation is not available anymore as the PR was closed or merged.
Results before/after this PR

import numpy as np
from transformers import LEDTokenizer, LEDTokenizerFast
tokenizer_slow = LEDTokenizer.from_pretrained("allenai/led-base-16384")
tokenizer_fast = LEDTokenizerFast.from_pretrained("allenai/led-base-16384")
text_1 = "I love dogs"
text_2 = "I love dogs and cats"
texts = [text_1, text_2]
model_inputs_slow = tokenizer_slow(texts, max_length=tokenizer_slow.model_max_length, padding=False, truncation=True)
model_inputs_fast = tokenizer_fast(texts, max_length=tokenizer_fast.model_max_length, padding=False, truncation=True)
model_inputs_slow["global_attention_mask"] = [np.zeros_like(input).tolist() for input in model_inputs_slow["input_ids"]]
model_inputs_fast["global_attention_mask"] = [np.zeros_like(input).tolist() for input in model_inputs_fast["input_ids"]]
# put global attention on <s> token
for input in model_inputs_slow["global_attention_mask"][:]:
    input[0] = 1
for input in model_inputs_fast["global_attention_mask"][:]:
    input[0] = 1
print("`model_inputs` without padding (slow tokenizer)")
print(model_inputs_slow)
print("`model_inputs` without padding (fast tokenizer)")
print(model_inputs_fast)
model_inputs_slow = tokenizer_slow.pad(
    model_inputs_slow,
    padding=True,
    max_length=tokenizer_slow.model_max_length,
)
model_inputs_fast = tokenizer_fast.pad(
    model_inputs_fast,
    padding=True,
    max_length=tokenizer_fast.model_max_length,
)
print("=" * 30)
print("`model_inputs` with padding (slow tokenizer)")
print(model_inputs_slow)
print("`model_inputs` with padding (fast tokenizer)")
print(model_inputs_fast)

Outputs before this PR (note that `global_attention_mask` is not padded)

`model_inputs` without padding (slow tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]}
`model_inputs` without padding (fast tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]}
==============================
`model_inputs` with padding (slow tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2, 1, 1], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]}
`model_inputs` with padding (fast tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2, 1, 1], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]}

Outputs with this PR

`model_inputs` without padding (slow tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]}
`model_inputs` without padding (fast tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]}
==============================
`model_inputs` with padding (slow tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2, 1, 1], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0, -1, -1], [1, 0, 0, 0, 0, 0, 0]]}
`model_inputs` with padding (fast tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2, 1, 1], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0, -1, -1], [1, 0, 0, 0, 0, 0, 0]]}
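Since the padded `global_attention_mask` now has the same length as `input_ids`, the batch can also be converted to tensors in the same call. A minimal sketch reusing the variables from the script above (assumes PyTorch is installed; not part of the original reproduction):

# With this PR, requesting tensors no longer fails on a ragged global_attention_mask
batch = tokenizer_slow.pad(
    model_inputs_slow,
    padding=True,
    max_length=tokenizer_slow.model_max_length,
    return_tensors="pt",
)
print(batch["global_attention_mask"].shape)  # torch.Size([2, 7]) for the example above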
if "global_attention_mask" in encoded_inputs: | ||
encoded_inputs["global_attention_mask"] = ( | ||
encoded_inputs["global_attention_mask"] + [-1] * difference | ||
) |
main change here
What about calling the super method to pad everything else but the `global_attention_mask`, and then just add the code for that here?
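A rough sketch of that structure (hypothetical code, not the exact diff in this PR; the subclass name is made up, and only the single-example `List[int]` case is handled):

from typing import Optional

from transformers import LEDTokenizer
from transformers.tokenization_utils_base import PaddingStrategy


class PatchedLEDTokenizer(LEDTokenizer):  # hypothetical subclass, for illustration only
    def _pad(
        self,
        encoded_inputs,
        max_length: Optional[int] = None,
        padding_strategy=PaddingStrategy.DO_NOT_PAD,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
    ) -> dict:
        # Let the superclass pad input_ids / attention_mask / token_type_ids ...
        encoded_inputs = super()._pad(
            encoded_inputs,
            max_length=max_length,
            padding_strategy=padding_strategy,
            pad_to_multiple_of=pad_to_multiple_of,
            return_attention_mask=return_attention_mask,
        )
        # ... then bring global_attention_mask up to the same length, padding with -1
        if "global_attention_mask" in encoded_inputs:
            difference = len(encoded_inputs["input_ids"]) - len(encoded_inputs["global_attention_mask"])
            if difference > 0:
                if self.padding_side == "right":
                    encoded_inputs["global_attention_mask"] = (
                        encoded_inputs["global_attention_mask"] + [-1] * difference
                    )
                else:
                    encoded_inputs["global_attention_mask"] = (
                        [-1] * difference + encoded_inputs["global_attention_mask"]
                    )
        return encoded_inputs

Any fix made to the superclass `_pad` would then be picked up automatically, which is the point of the suggestion.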
if "global_attention_mask" in encoded_inputs: | ||
encoded_inputs["global_attention_mask"] = [-1] * difference + encoded_inputs[ | ||
"global_attention_mask" | ||
] |
main change here
if "global_attention_mask" in encoded_inputs: | ||
encoded_inputs["global_attention_mask"] = ( | ||
encoded_inputs["global_attention_mask"] + [-1] * difference | ||
) |
main change here
if "global_attention_mask" in encoded_inputs: | ||
encoded_inputs["global_attention_mask"] = [-1] * difference + encoded_inputs[ | ||
"global_attention_mask" | ||
] |
main change here
Thanks for working on this!
It's a matter of preference since there is going to be some duplicate code anyway, but I would rather we use the super method to pad everything but the global attention mask. This way, the new pad method will automatically pick up any fix we make to the base `_pad` method.
Also, docstrings aren't necessary, as they will automatically be the same as the superclass :-)
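For what it's worth, Python's own doc tooling already falls back to the parent's docstring when an override leaves it blank; a minimal sketch (class names made up for illustration):

import inspect


class Base:
    def _pad(self):
        """Pad the encoded inputs (base docstring)."""


class Child(Base):
    def _pad(self):  # override without a docstring
        pass


# inspect.getdoc walks the class hierarchy, so the override reports the base docstring
print(inspect.getdoc(Child._pad))  # -> Pad the encoded inputs (base docstring).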
encoded_inputs:
    Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
max_length: maximum length of the returned list and optionally padding length (see below).
    Will truncate by taking into account the special tokens.
padding_strategy: PaddingStrategy to use for padding.

    - PaddingStrategy.LONGEST Pad to the longest sequence in the batch
    - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
    - PaddingStrategy.DO_NOT_PAD: Do not pad
    The tokenizer padding sides are defined in self.padding_side:

        - 'left': pads on the left of the sequences
        - 'right': pads on the right of the sequences
pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
    This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
    >= 7.5 (Volta).
return_attention_mask:
If we add a docstring here, it would be great for it to be in the new format :-)
Otherwise you can just leave it blank since it will pick the docstring of the superclass automatically.
I will remove it.
(But just to get a better idea: does the base tokenizer still have the old format? I ask because I just copied the docstring from `PreTrainedTokenizerBase`.)
Yes, since it's a private method and does not appear in the main doc, no one has spent time updating it. It would be worth doing one day!
if "global_attention_mask" in encoded_inputs: | ||
encoded_inputs["global_attention_mask"] = ( | ||
encoded_inputs["global_attention_mask"] + [-1] * difference | ||
) |
What about calling the super method to pad everything else but the `global_attention_mask`, and then just add the code for that here?
pad_to_multiple_of: Optional[int] = None,
return_attention_mask: Optional[bool] = None,
) -> dict:
    """
Same comment on the docstring.
What about calling the super method
Good idea!
encoded_inputs["token_type_ids"] + [self.pad_token_type_id] * difference | ||
if self.padding_side == "right": | ||
encoded_inputs["global_attention_mask"] = ( | ||
encoded_inputs["global_attention_mask"] + [-1] * difference |
Use `-1` because the `0` in `global_attention_mask` is used to indicate local attention rather than not to attend:

- 0 for local attention (a sliding window attention),

`global_attention_mask` will be used as follows:
transformers/src/transformers/models/led/modeling_led.py
Lines 1643 to 1653 in c1000e7
def _merge_to_attention_mask(self, attention_mask: torch.Tensor, global_attention_mask: torch.Tensor):
    # longformer self attention expects attention mask to have 0 (no attn), 1 (local attn), 2 (global attn)
    # (global_attention_mask + 1) => 1 for local attention, 2 for global attention
    # => final attention_mask => 0 for no attention, 1 for local attention 2 for global attention
    if attention_mask is not None:
        attention_mask = attention_mask * (global_attention_mask + 1)
    else:
        # simply use `global_attention_mask` as `attention_mask`
        # if no `attention_mask` is given
        attention_mask = global_attention_mask + 1
    return attention_mask
Therefore, `-1` will give `0` here for the (merged) `attention_mask`, which makes sense.
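A quick numeric check of that claim (standalone sketch; the values mirror the padded slow-tokenizer example above):

import torch

# attention_mask: 1 for real tokens, 0 for padding
# global_attention_mask: 1 for global attention, 0 for local attention, -1 for padding (this PR)
attention_mask = torch.tensor([1, 1, 1, 1, 1, 0, 0])
global_attention_mask = torch.tensor([1, 0, 0, 0, 0, -1, -1])

# same arithmetic as the _merge_to_attention_mask method quoted above
merged = attention_mask * (global_attention_mask + 1)
print(merged)  # tensor([2, 1, 1, 1, 1, 0, 0]) -> 2 global, 1 local, 0 no attention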
Very nice - looks good to me!
Would wait a bit for @sgugger to check the part regarding
You did it exactly as I had envisioned it, perfect!
What does this PR do?
Fix #14648. This PR allows `LEDTokenizer._pad` to handle `global_attention_mask` if it is provided in `encoded_inputs`.

Without this, `global_attention_mask` won't be padded while the other tensors will be (if users request padding), which causes an error when `return_tensors` is specified.

More context
Basically, this PR copies the method `_pad` defined in
https://github.com/huggingface/transformers/blob/e3645fd2806b1e0b9daec89a72e316b71be8609c/src/transformers/tokenization_utils_base.py#L3159
to tokenization_led.py and tokenization_led_fast.py, and adds the following blocks to deal with `global_attention_mask`:

transformers/src/transformers/models/led/tokenization_led.py
Lines 113 to 116 in df6b574

transformers/src/transformers/models/led/tokenization_led.py
Lines 127 to 130 in df6b574
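For reference, based on the diff excerpts quoted in the review threads above, those blocks are the right-padding and left-padding handling of `global_attention_mask` (shown here roughly, not as an exact copy of the referenced lines):

# right padding
if "global_attention_mask" in encoded_inputs:
    encoded_inputs["global_attention_mask"] = (
        encoded_inputs["global_attention_mask"] + [-1] * difference
    )

# left padding
if "global_attention_mask" in encoded_inputs:
    encoded_inputs["global_attention_mask"] = [-1] * difference + encoded_inputs[
        "global_attention_mask"
    ]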
The effect
See this comment