Override _pad in LEDTokenizer to deal with global_attention_mask #15940
Conversation
The documentation is not available anymore as the PR was closed or merged.
Results before/after this PR

import numpy as np
from transformers import LEDTokenizer, LEDTokenizerFast
tokenizer_slow = LEDTokenizer.from_pretrained("allenai/led-base-16384")
tokenizer_fast = LEDTokenizerFast.from_pretrained("allenai/led-base-16384")
text_1 = "I love dogs"
text_2 = "I love dogs and cats"
texts = [text_1, text_2]
model_inputs_slow = tokenizer_slow(texts, max_length=tokenizer_slow.model_max_length, padding=False, truncation=True)
model_inputs_fast = tokenizer_fast(texts, max_length=tokenizer_fast.model_max_length, padding=False, truncation=True)
model_inputs_slow["global_attention_mask"] = [np.zeros_like(input).tolist() for input in model_inputs_slow["input_ids"]]
model_inputs_fast["global_attention_mask"] = [np.zeros_like(input).tolist() for input in model_inputs_fast["input_ids"]]
# put global attention on <s> token
for input in model_inputs_slow["global_attention_mask"][:]:
    input[0] = 1
for input in model_inputs_fast["global_attention_mask"][:]:
    input[0] = 1
print("`model_inputs` without padding (slow tokenizer)")
print(model_inputs_slow)
print("`model_inputs` without padding (fast tokenizer)")
print(model_inputs_fast)
model_inputs_slow = tokenizer_slow.pad(
    model_inputs_slow,
    padding=True,
    max_length=tokenizer_slow.model_max_length,
)
model_inputs_fast = tokenizer_fast.pad(
    model_inputs_fast,
    padding=True,
    max_length=tokenizer_fast.model_max_length,
)
print("=" * 30)
print("`model_inputs` with padding (slow tokenizer)")
print(model_inputs_slow)
print("`model_inputs` with padding (fast tokenizer)")
print(model_inputs_fast)

Outputs before this PR (note that `global_attention_mask` is not padded)

`model_inputs` without padding (slow tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]}
`model_inputs` without padding (fast tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]}
==============================
`model_inputs` with padding (slow tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2, 1, 1], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]}
`model_inputs` with padding (fast tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2, 1, 1], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]}

Outputs with this PR

`model_inputs` without padding (slow tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]}
`model_inputs` without padding (fast tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]}
==============================
`model_inputs` with padding (slow tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2, 1, 1], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0, -1, -1], [1, 0, 0, 0, 0, 0, 0]]}
`model_inputs` with padding (fast tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2, 1, 1], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0, -1, -1], [1, 0, 0, 0, 0, 0, 0]]}
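Since the padded `global_attention_mask` now has the same length as `input_ids`, the batch can also be converted to tensors in the same call. A minimal sketch reusing the variables from the script above (assumes PyTorch is installed; not part of the original reproduction):

# With this PR, requesting tensors no longer fails on a ragged global_attention_mask
batch = tokenizer_slow.pad(
    model_inputs_slow,
    padding=True,
    max_length=tokenizer_slow.model_max_length,
    return_tensors="pt",
)
print(batch["global_attention_mask"].shape)  # torch.Size([2, 7]) for the example above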
if "global_attention_mask" in encoded_inputs: | ||
encoded_inputs["global_attention_mask"] = ( | ||
encoded_inputs["global_attention_mask"] + [-1] * difference | ||
) |
main change here
What about calling the super method to pad everything else but the `global_attention_mask`, and then just add the code for that here?
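A rough sketch of that structure (hypothetical code, not the exact diff in this PR; the subclass name is made up, and only the single-example `List[int]` case is handled):

from typing import Optional

from transformers import LEDTokenizer
from transformers.tokenization_utils_base import PaddingStrategy


class PatchedLEDTokenizer(LEDTokenizer):  # hypothetical subclass, for illustration only
    def _pad(
        self,
        encoded_inputs,
        max_length: Optional[int] = None,
        padding_strategy=PaddingStrategy.DO_NOT_PAD,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
    ) -> dict:
        # Let the superclass pad input_ids / attention_mask / token_type_ids ...
        encoded_inputs = super()._pad(
            encoded_inputs,
            max_length=max_length,
            padding_strategy=padding_strategy,
            pad_to_multiple_of=pad_to_multiple_of,
            return_attention_mask=return_attention_mask,
        )
        # ... then bring global_attention_mask up to the same length, padding with -1
        if "global_attention_mask" in encoded_inputs:
            difference = len(encoded_inputs["input_ids"]) - len(encoded_inputs["global_attention_mask"])
            if difference > 0:
                if self.padding_side == "right":
                    encoded_inputs["global_attention_mask"] = (
                        encoded_inputs["global_attention_mask"] + [-1] * difference
                    )
                else:
                    encoded_inputs["global_attention_mask"] = (
                        [-1] * difference + encoded_inputs["global_attention_mask"]
                    )
        return encoded_inputs

Any fix made to the superclass `_pad` would then be picked up automatically, which is the point of the suggestion.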
if "global_attention_mask" in encoded_inputs: | ||
encoded_inputs["global_attention_mask"] = [-1] * difference + encoded_inputs[ | ||
"global_attention_mask" | ||
] |
main change here
if "global_attention_mask" in encoded_inputs: | ||
encoded_inputs["global_attention_mask"] = ( | ||
encoded_inputs["global_attention_mask"] + [-1] * difference | ||
) |
main change here
if "global_attention_mask" in encoded_inputs: | ||
encoded_inputs["global_attention_mask"] = [-1] * difference + encoded_inputs[ | ||
"global_attention_mask" | ||
] |
main change here
Thanks for working on this!
It's a matter of preference since there is going to be some duplicate code anyway, but I would rather we use the super method to pad everything but the global attention mask. This way, the new pad method will automatically pick up any fix we make to the base `_pad` method.
Also, docstrings aren't necessary, as they will automatically be the same as the superclass :-)
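For what it's worth, Python's own doc tooling already falls back to the parent's docstring when an override leaves it blank; a minimal sketch (class names made up for illustration):

import inspect


class Base:
    def _pad(self):
        """Pad the encoded inputs (base docstring)."""


class Child(Base):
    def _pad(self):  # override without a docstring
        pass


# inspect.getdoc walks the class hierarchy, so the override reports the base docstring
print(inspect.getdoc(Child._pad))  # -> Pad the encoded inputs (base docstring).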
encoded_inputs:
    Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
max_length: maximum length of the returned list and optionally padding length (see below).
    Will truncate by taking into account the special tokens.
padding_strategy: PaddingStrategy to use for padding.

    - PaddingStrategy.LONGEST Pad to the longest sequence in the batch
    - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
    - PaddingStrategy.DO_NOT_PAD: Do not pad
    The tokenizer padding sides are defined in self.padding_side:

        - 'left': pads on the left of the sequences
        - 'right': pads on the right of the sequences
pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
    This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
    >= 7.5 (Volta).
return_attention_mask:
If we add a docstring here, it would be great for it to be in the new format :-)
Otherwise you can just leave it blank since it will pick the docstring of the superclass automatically.
I will remove it.
(But just to get a better idea: does the base tokenizer still have the old format? I ask because I just copied the docstring from `PreTrainedTokenizerBase`.)
Yes, since it's a private method and does not appear in the main doc, no one has spent time updating it. It would be worth doing one day!
if "global_attention_mask" in encoded_inputs: | ||
encoded_inputs["global_attention_mask"] = ( | ||
encoded_inputs["global_attention_mask"] + [-1] * difference | ||
) |
What about calling the super method to pad everything else but the `global_attention_mask`, and then just add the code for that here?
pad_to_multiple_of: Optional[int] = None,
return_attention_mask: Optional[bool] = None,
) -> dict:
    """
Same comment on the docstring.
What about calling the super method
Good idea!
encoded_inputs["token_type_ids"] + [self.pad_token_type_id] * difference | ||
if self.padding_side == "right": | ||
encoded_inputs["global_attention_mask"] = ( | ||
encoded_inputs["global_attention_mask"] + [-1] * difference |
Use `-1` because the `0` in `global_attention_mask` is used to indicate local attention rather than not to attend:

- 0 for local attention (a sliding window attention),

`global_attention_mask` will be used as follows:
transformers/src/transformers/models/led/modeling_led.py
Lines 1643 to 1653 in c1000e7
def _merge_to_attention_mask(self, attention_mask: torch.Tensor, global_attention_mask: torch.Tensor):
    # longformer self attention expects attention mask to have 0 (no attn), 1 (local attn), 2 (global attn)
    # (global_attention_mask + 1) => 1 for local attention, 2 for global attention
    # => final attention_mask => 0 for no attention, 1 for local attention 2 for global attention
    if attention_mask is not None:
        attention_mask = attention_mask * (global_attention_mask + 1)
    else:
        # simply use `global_attention_mask` as `attention_mask`
        # if no `attention_mask` is given
        attention_mask = global_attention_mask + 1
    return attention_mask
Therefore, `-1` will give `0` here for the (merged) `attention_mask`, which makes sense.
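A quick numeric check of that claim (standalone sketch; the values mirror the padded slow-tokenizer example above):

import torch

# attention_mask: 1 for real tokens, 0 for padding
# global_attention_mask: 1 for global attention, 0 for local attention, -1 for padding (this PR)
attention_mask = torch.tensor([1, 1, 1, 1, 1, 0, 0])
global_attention_mask = torch.tensor([1, 0, 0, 0, 0, -1, -1])

# same arithmetic as the _merge_to_attention_mask method quoted above
merged = attention_mask * (global_attention_mask + 1)
print(merged)  # tensor([2, 1, 1, 1, 1, 0, 0]) -> 2 global, 1 local, 0 no attention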
Very nice - looks good to me!
Would wait a bit for @sgugger to check the part regarding
You did it exactly as I had envisioned it, perfect!
What does this PR do?
Fix #14648. This PR allows `LEDTokenizer._pad` to handle `global_attention_mask` if it is provided in `encoded_inputs`.

Without this, `global_attention_mask` won't be padded while the other tensors will be (if users request padding), which causes an error when `return_tensors` is specified.

More context
Basically, this PR copies the method `_pad` defined in
https://github.com/huggingface/transformers/blob/e3645fd2806b1e0b9daec89a72e316b71be8609c/src/transformers/tokenization_utils_base.py#L3159
to tokenization_led.py and tokenization_led_fast.py, and adds the following blocks to deal with `global_attention_mask`:

transformers/src/transformers/models/led/tokenization_led.py
Lines 113 to 116 in df6b574

transformers/src/transformers/models/led/tokenization_led.py
Lines 127 to 130 in df6b574
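For reference, based on the diff excerpts quoted in the review threads above, those blocks are the right-padding and left-padding handling of `global_attention_mask` (shown here roughly, not as an exact copy of the referenced lines):

# right padding
if "global_attention_mask" in encoded_inputs:
    encoded_inputs["global_attention_mask"] = (
        encoded_inputs["global_attention_mask"] + [-1] * difference
    )

# left padding
if "global_attention_mask" in encoded_inputs:
    encoded_inputs["global_attention_mask"] = [-1] * difference + encoded_inputs[
        "global_attention_mask"
    ]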
The effect
See this comment