
Correct docstring. #8845

Merged 1 commit into huggingface:master from Fraser-Greenlee:patch-1 on Nov 30, 2020
Conversation


@Fraser-Greenlee (Contributor) commented Nov 30, 2020

Related issue: #8837

What does this PR do?

Updates the `PreTrainedTokenizerBase.pad` docstring so that the documented default for the `padding` argument matches the actual default in the signature.

Current docstring:

```
padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`False`):
    Select a strategy to pad the returned sequences (according to the model's padding side and padding
```

Actual signature (note `padding` defaults to `True`):

```python
def pad(
    self,
    encoded_inputs: Union[
        BatchEncoding,
        List[BatchEncoding],
        Dict[str, EncodedInput],
        Dict[str, List[EncodedInput]],
        List[Dict[str, EncodedInput]],
    ],
    padding: Union[bool, str, PaddingStrategy] = True,
    max_length: Optional[int] = None,
    pad_to_multiple_of: Optional[int] = None,
    return_attention_mask: Optional[bool] = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    verbose: bool = True,
) -> BatchEncoding:
    """
    Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence
    length in the batch.

    Padding side (left/right) and padding token ids are defined at the tokenizer level (with
    ``self.padding_side``, ``self.pad_token_id`` and ``self.pad_token_type_id``).

    .. note::

        If the ``encoded_inputs`` passed are dictionaries of numpy arrays, PyTorch tensors or TensorFlow
        tensors, the result will use the same type unless you provide a different tensor type with
        ``return_tensors``. In the case of PyTorch tensors, you will lose the specific device of your
        tensors however.

    Args:
        encoded_inputs (:class:`~transformers.BatchEncoding`, list of :class:`~transformers.BatchEncoding`, :obj:`Dict[str, List[int]]`, :obj:`Dict[str, List[List[int]]]` or :obj:`List[Dict[str, List[int]]]`):
            Tokenized inputs. Can represent one input (:class:`~transformers.BatchEncoding` or
            :obj:`Dict[str, List[int]]`) or a batch of tokenized inputs (list of
            :class:`~transformers.BatchEncoding`, :obj:`Dict[str, List[List[int]]]` or
            :obj:`List[Dict[str, List[int]]]`) so you can use this method during preprocessing as well as
            in a PyTorch Dataloader collate function.

            Instead of :obj:`List[int]` you can have tensors (numpy arrays, PyTorch tensors or TensorFlow
            tensors), see the note above for the return type.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`False`):
            Select a strategy to pad the returned sequences (according to the model's padding side and
            padding index) among:
```
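To make the mismatch concrete, here is a minimal sketch (the `bert-base-uncased` checkpoint is an arbitrary illustration, not part of this PR): calling `pad` without passing `padding` still pads the batch, because the signature defaults to `True` rather than the documented `False`.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Two sequences of different lengths, deliberately encoded without padding.
encoded = tokenizer(["short", "a somewhat longer sequence"], padding=False)

# No `padding` argument: per the docstring this should default to False,
# but the actual signature defaults to True, so both sequences come back
# padded to the length of the longest one.
padded = tokenizer.pad(encoded)
print([len(ids) for ids in padded["input_ids"]])  # two equal lengths
```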

Fixes #8837 (issue)

I'm also curious why this method defaults to `padding=True`. Other methods (`prepare_for_model`, `encode`, `__call__`, `encode_plus`, `batch_encode_plus`) default to `padding=False`.

This default means `DataCollatorForLanguageModeling` pads the input examples, so it can't simply be swapped for the default collator in the example script without breaking the attention mask:

```python
batch = self.tokenizer.pad(examples, return_tensors="pt")
```
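For context, a hedged sketch of that collate-function use (the `collate_fn` wrapper and checkpoint below are illustrative, not code from the repo): `pad` both aligns the variable-length examples and builds the matching `attention_mask`, which PyTorch's default collator, expecting equal-length tensors to stack, would not do.

```python
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def collate_fn(examples):
    # `examples` is a list of single encoded inputs; pad() relies on its
    # padding=True default to align them into one rectangular batch.
    return tokenizer.pad(examples, return_tensors="pt")

dataset = [tokenizer(text) for text in ["short", "a somewhat longer sequence"]]
loader = DataLoader(dataset, batch_size=2, collate_fn=collate_fn)
batch = next(iter(loader))
print(batch["input_ids"].shape, batch["attention_mask"].shape)
```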

@mfuntowicz
@LysandreJik
@sgugger

@sgugger (Collaborator) left a comment:


Thanks for fixing!

@sgugger sgugger merged commit cc983cd into huggingface:master Nov 30, 2020
LysandreJik pushed a commit that referenced this pull request Nov 30, 2020
stas00 pushed a commit to stas00/transformers that referenced this pull request Dec 5, 2020
@Fraser-Greenlee Fraser-Greenlee deleted the patch-1 branch December 11, 2020 09:17
Development

Successfully merging this pull request may close these issues: Inconsistent PreTrainedTokenizerBase.pad argument default value & docstring (#8837)