Override _pad in LEDTokenizer to deal with global_attention_mask #15940

Merged

Conversation

ydshieh (Collaborator) commented Mar 4, 2022

What does this PR do?

Fix #14648. This PR allows LEDTokenizer._pad to handle global_attention_mask if it is provided in encoded_inputs.

Without this, global_attention_mask is not padded while the other inputs are (when users request padding), which causes the following error if return_tensors is specified:

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
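
For illustration, a minimal snippet of the kind that triggers this (a hedged sketch; the checkpoint and texts are just examples, and the full reproduction script appears further down in this thread):

from transformers import LEDTokenizer

tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
enc = tokenizer(["short text", "a slightly longer text"])

# put global attention on the <s> token of each sequence
enc["global_attention_mask"] = [[1] + [0] * (len(ids) - 1) for ids in enc["input_ids"]]

# Before this PR, `pad` left `global_attention_mask` ragged, so the tensor
# conversion below raised the ValueError quoted above; with this PR it succeeds.
batch = tokenizer.pad(enc, padding=True, return_tensors="pt")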

More context

Basically, this PR copies the _pad method defined in https://github.com/huggingface/transformers/blob/e3645fd2806b1e0b9daec89a72e316b71be8609c/src/transformers/tokenization_utils_base.py#L3159 into tokenization_led.py and tokenization_led_fast.py, and adds the following blocks to handle global_attention_mask (the first for padding_side == "right", the second for "left"):

if "global_attention_mask" in encoded_inputs:
encoded_inputs["global_attention_mask"] = (
encoded_inputs["global_attention_mask"] + [-1] * difference
)

if "global_attention_mask" in encoded_inputs:
encoded_inputs["global_attention_mask"] = [-1] * difference + encoded_inputs[
"global_attention_mask"
]

The effect

See the before/after comparison in the comment below.

ydshieh marked this pull request as draft March 4, 2022 15:54

ydshieh force-pushed the fix_led_tokenizer_for_global_attention_mask branch from fca418b to 8fa617e on March 13, 2022 13:15

ydshieh (Collaborator, Author) commented Mar 13, 2022

Results before/after this PR

import numpy as np
from transformers import LEDTokenizer, LEDTokenizerFast

tokenizer_slow = LEDTokenizer.from_pretrained("allenai/led-base-16384")
tokenizer_fast = LEDTokenizerFast.from_pretrained("allenai/led-base-16384")

text_1 = "I love dogs"
text_2 = "I love dogs and cats"
texts = [text_1, text_2]

model_inputs_slow = tokenizer_slow(texts, max_length=tokenizer_slow.model_max_length, padding=False, truncation=True)
model_inputs_fast = tokenizer_fast(texts, max_length=tokenizer_fast.model_max_length, padding=False, truncation=True)

model_inputs_slow["global_attention_mask"] = [np.zeros_like(input).tolist() for input in model_inputs_slow["input_ids"]]
model_inputs_fast["global_attention_mask"] = [np.zeros_like(input).tolist() for input in model_inputs_fast["input_ids"]]

# put global attention on <s> token
for input in model_inputs_slow["global_attention_mask"][:]:
    input[0] = 1
for input in model_inputs_fast["global_attention_mask"][:]:
    input[0] = 1

print("`model_inputs` without padding (slow tokenizer)")
print(model_inputs_slow)
print("`model_inputs` without padding (fast tokenizer)")
print(model_inputs_fast)

model_inputs_slow = tokenizer_slow.pad(
    model_inputs_slow,
    padding=True,
    max_length=tokenizer_slow.model_max_length,
)

model_inputs_fast = tokenizer_fast.pad(
    model_inputs_fast,
    padding=True,
    max_length=tokenizer_fast.model_max_length,
)

print("=" * 30)
print("`model_inputs` with padding (slow tokenizer)")
print(model_inputs_slow)
print("`model_inputs` with padding (fast tokenizer)")
print(model_inputs_fast)

Outputs before this PR

(note that global_attention_mask is not padded)

`model_inputs` without padding (slow tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]}

`model_inputs` without padding (fast tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]}

==============================

`model_inputs` with padding (slow tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2, 1, 1], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]}

`model_inputs` with padding (fast tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2, 1, 1], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]}

Outputs with this PR

`model_inputs` without padding (slow tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]}

`model_inputs` without padding (fast tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0]]}

==============================

`model_inputs` with padding (slow tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2, 1, 1], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0, -1, -1], [1, 0, 0, 0, 0, 0, 0]]}

`model_inputs` with padding (fast tokenizer)
{'input_ids': [[0, 100, 657, 3678, 2, 1, 1], [0, 100, 657, 3678, 8, 10017, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1]], 'global_attention_mask': [[1, 0, 0, 0, 0, -1, -1], [1, 0, 0, 0, 0, 0, 0]]}
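
With this PR, the padded inputs can also be converted to tensors directly, which is the call that used to fail. A small follow-up check (illustrative only, not part of the original script; it reuses tokenizer_slow and model_inputs_slow from the script above and requires PyTorch to be installed):

batch = tokenizer_slow.pad(model_inputs_slow, padding=True, return_tensors="pt")
print(batch["global_attention_mask"].shape)  # torch.Size([2, 7]) for this toy example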

ydshieh changed the title from "[WIP] Override _pad in LEDTokenizer to deal with global_attention_mask" to "Override _pad in LEDTokenizer to deal with global_attention_mask" on Mar 14, 2022
ydshieh marked this pull request as ready for review March 14, 2022 13:12
if "global_attention_mask" in encoded_inputs:
encoded_inputs["global_attention_mask"] = (
encoded_inputs["global_attention_mask"] + [-1] * difference
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main change here

sgugger (Collaborator): What about calling the super method to pad everything else but the global_attention_mask and then just add the code for that here?

if "global_attention_mask" in encoded_inputs:
encoded_inputs["global_attention_mask"] = [-1] * difference + encoded_inputs[
"global_attention_mask"
]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main change here

if "global_attention_mask" in encoded_inputs:
encoded_inputs["global_attention_mask"] = (
encoded_inputs["global_attention_mask"] + [-1] * difference
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main change here

if "global_attention_mask" in encoded_inputs:
encoded_inputs["global_attention_mask"] = [-1] * difference + encoded_inputs[
"global_attention_mask"
]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main change here

sgugger (Collaborator) left a comment:
Thanks for working on this!

It's a matter of preference since there is going to be some duplicate code anyway, but I would rather we use the super method to pad everything but the global attention mask. This way the new _pad method will automatically get any fix we make to the base _pad method.

Also, docstrings aren't necessary, as they will automatically be the same as the superclass's :-)

Comment on lines 69 to 85

encoded_inputs:
    Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
max_length: maximum length of the returned list and optionally padding length (see below).
    Will truncate by taking into account the special tokens.
padding_strategy: PaddingStrategy to use for padding.

    - PaddingStrategy.LONGEST Pad to the longest sequence in the batch
    - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
    - PaddingStrategy.DO_NOT_PAD: Do not pad
    The tokenizer padding sides are defined in self.padding_side:

        - 'left': pads on the left of the sequences
        - 'right': pads on the right of the sequences
pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
    This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
    >= 7.5 (Volta).
return_attention_mask:

sgugger (Collaborator): If we add a docstring here, it would be great for it to be in the new format :-)
Otherwise you can just leave it blank since it will pick up the docstring of the superclass automatically.

ydshieh (Collaborator, Author): I will remove it.

(But to get a better idea: the base tokenizer still has the old format, right? Because I just copied the docstring from PreTrainedTokenizerBase.)

sgugger (Collaborator): Yes, since it's a private method and does not appear in the main doc, no one has spent time updating it. It would be worth doing one day!

if "global_attention_mask" in encoded_inputs:
encoded_inputs["global_attention_mask"] = (
encoded_inputs["global_attention_mask"] + [-1] * difference
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about calling the super method to pad everything else but the global_attention_mask and then just add the code for that here?

    pad_to_multiple_of: Optional[int] = None,
    return_attention_mask: Optional[bool] = None,
) -> dict:
    """

sgugger (Collaborator): Same comment on the docstring.

ydshieh (Collaborator, Author), replying to "What about calling the super method": Good idea!

encoded_inputs["token_type_ids"] + [self.pad_token_type_id] * difference
if self.padding_side == "right":
encoded_inputs["global_attention_mask"] = (
encoded_inputs["global_attention_mask"] + [-1] * difference

ydshieh (Collaborator, Author): Use -1 because 0 in global_attention_mask is used to indicate local attention rather than "do not attend":

- 0 for local attention (a sliding window attention),

global_attention_mask will be used as follows:

def _merge_to_attention_mask(self, attention_mask: torch.Tensor, global_attention_mask: torch.Tensor):
    # longformer self attention expects attention mask to have 0 (no attn), 1 (local attn), 2 (global attn)
    # (global_attention_mask + 1) => 1 for local attention, 2 for global attention
    # => final attention_mask => 0 for no attention, 1 for local attention, 2 for global attention
    if attention_mask is not None:
        attention_mask = attention_mask * (global_attention_mask + 1)
    else:
        # simply use `global_attention_mask` as `attention_mask`
        # if no `attention_mask` is given
        attention_mask = global_attention_mask + 1
    return attention_mask

Therefore, -1 gives 0 here for the (merged) attention_mask, which makes sense.
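
To make the arithmetic concrete, here is a small illustrative check (hypothetical toy values, not part of the PR):

import torch

# one padded batch entry: the last two positions are padding
attention_mask = torch.tensor([1, 1, 1, 1, 1, 0, 0])
global_attention_mask = torch.tensor([1, 0, 0, 0, 0, -1, -1])

# same merge as in _merge_to_attention_mask above
merged = attention_mask * (global_attention_mask + 1)
print(merged)  # tensor([2, 1, 1, 1, 1, 0, 0]): global attn, 4x local attn, 2x no attn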

ydshieh (Collaborator, Author) commented Mar 14, 2022

  • Call super()._pad, then deal with global_attention_mask (cc @sgugger); a sketch of the resulting override follows below.
  • Add a comment about using -1 instead of 0.
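
A minimal sketch of what this super()-based override could look like (simplified and illustrative only; the class name is hypothetical, and the merged code in tokenization_led.py is the authoritative version):

from typing import Optional

from transformers import LEDTokenizer
from transformers.tokenization_utils_base import PaddingStrategy


class PaddedLEDTokenizer(LEDTokenizer):  # hypothetical name, for illustration only
    def _pad(
        self,
        encoded_inputs,
        max_length: Optional[int] = None,
        padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
    ) -> dict:
        # Let the base class pad input_ids, attention_mask, token_type_ids, ...
        encoded_inputs = super()._pad(
            encoded_inputs,
            max_length=max_length,
            padding_strategy=padding_strategy,
            pad_to_multiple_of=pad_to_multiple_of,
            return_attention_mask=return_attention_mask,
        )
        if "global_attention_mask" in encoded_inputs:
            # input_ids were already padded above, so measure the difference against them
            required_input = encoded_inputs[self.model_input_names[0]]
            difference = len(required_input) - len(encoded_inputs["global_attention_mask"])
            if difference > 0:
                if self.padding_side == "right":
                    encoded_inputs["global_attention_mask"] = (
                        encoded_inputs["global_attention_mask"] + [-1] * difference
                    )
                else:
                    encoded_inputs["global_attention_mask"] = (
                        [-1] * difference + encoded_inputs["global_attention_mask"]
                    )
        return encoded_inputs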

patrickvonplaten (Contributor) left a comment:
Very nice - looks good to me!

ydshieh (Collaborator, Author) commented Mar 17, 2022

I will wait a bit for @sgugger to check the part regarding calling the super method before merging.

sgugger (Collaborator) left a comment:
You did it exactly as I had envisioned it, perfect!

ydshieh merged commit a627196 into huggingface:master Mar 18, 2022
ydshieh deleted the fix_led_tokenizer_for_global_attention_mask branch March 18, 2022 12:30
FrancescoSaverioZuppichini pushed a commit that referenced this pull request Mar 24, 2022
Override _pad in LEDTokenizer to deal with global_attention_mask (#15940)

* Override _pad in LEDTokenizer

* Override _pad in LEDTokenizerFast

* add Copied from

* calling the super method

* add comment about -1

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

Successfully merging this pull request may close these issues.

LEDTokenizer doesn't pad global_attention_mask
4 participants