
Correct docstring. #8845

Merged 1 commit into huggingface:master from Fraser-Greenlee:patch-1 on Nov 30, 2020
Conversation


@Fraser-Greenlee (Contributor) commented Nov 30, 2020

Related issue: #8837

What does this PR do?

Updates the `PreTrainedTokenizerBase.pad` docstring so that the documented default for the `padding` argument matches the actual default in the signature.

Current docstring:

```
padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`False`):
    Select a strategy to pad the returned sequences (according to the model's padding side and padding
```

Actual signature (note `padding` defaults to `True`):

```python
def pad(
    self,
    encoded_inputs: Union[
        BatchEncoding,
        List[BatchEncoding],
        Dict[str, EncodedInput],
        Dict[str, List[EncodedInput]],
        List[Dict[str, EncodedInput]],
    ],
    padding: Union[bool, str, PaddingStrategy] = True,
    max_length: Optional[int] = None,
    pad_to_multiple_of: Optional[int] = None,
    return_attention_mask: Optional[bool] = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    verbose: bool = True,
) -> BatchEncoding:
    """
    Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence
    length in the batch.

    Padding side (left/right) and padding token ids are defined at the tokenizer level (with
    ``self.padding_side``, ``self.pad_token_id`` and ``self.pad_token_type_id``).

    .. note::

        If the ``encoded_inputs`` passed are dictionaries of numpy arrays, PyTorch tensors or TensorFlow
        tensors, the result will use the same type unless you provide a different tensor type with
        ``return_tensors``. In the case of PyTorch tensors, you will lose the specific device of your
        tensors however.

    Args:
        encoded_inputs (:class:`~transformers.BatchEncoding`, list of :class:`~transformers.BatchEncoding`, :obj:`Dict[str, List[int]]`, :obj:`Dict[str, List[List[int]]]` or :obj:`List[Dict[str, List[int]]]`):
            Tokenized inputs. Can represent one input (:class:`~transformers.BatchEncoding` or
            :obj:`Dict[str, List[int]]`) or a batch of tokenized inputs (list of
            :class:`~transformers.BatchEncoding`, :obj:`Dict[str, List[List[int]]]` or
            :obj:`List[Dict[str, List[int]]]`) so you can use this method during preprocessing as well as
            in a PyTorch Dataloader collate function.

            Instead of :obj:`List[int]` you can have tensors (numpy arrays, PyTorch tensors or TensorFlow
            tensors), see the note above for the return type.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`False`):
            Select a strategy to pad the returned sequences (according to the model's padding side and
            padding index) among:
```
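To make the mismatch concrete, here is a minimal sketch (the `bert-base-uncased` checkpoint is an arbitrary illustration, not part of this PR): calling `pad` without passing `padding` still pads the batch, because the signature defaults to `True` rather than the documented `False`.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Two sequences of different lengths, deliberately encoded without padding.
encoded = tokenizer(["short", "a somewhat longer sequence"], padding=False)

# No `padding` argument: per the docstring this should default to False,
# but the actual signature defaults to True, so both sequences come back
# padded to the length of the longest one.
padded = tokenizer.pad(encoded)
print([len(ids) for ids in padded["input_ids"]])  # two equal lengths
```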

Fixes #8837 (issue)

I'm also curious why this method defaults to `padding=True`. Other methods (`prepare_for_model`, `encode`, `__call__`, `encode_plus`, `batch_encode_plus`) default to `padding=False`.

This default means `DataCollatorForLanguageModeling` pads the input examples, so it can't simply be swapped for the default collator in the example script without breaking the attention mask:

```python
batch = self.tokenizer.pad(examples, return_tensors="pt")
```
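For context, a hedged sketch of that collate-function use (the `collate_fn` wrapper and checkpoint below are illustrative, not code from the repo): `pad` both aligns the variable-length examples and builds the matching `attention_mask`, which PyTorch's default collator, expecting equal-length tensors to stack, would not do.

```python
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def collate_fn(examples):
    # `examples` is a list of single encoded inputs; pad() relies on its
    # padding=True default to align them into one rectangular batch.
    return tokenizer.pad(examples, return_tensors="pt")

dataset = [tokenizer(text) for text in ["short", "a somewhat longer sequence"]]
loader = DataLoader(dataset, batch_size=2, collate_fn=collate_fn)
batch = next(iter(loader))
print(batch["input_ids"].shape, batch["attention_mask"].shape)
```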

@mfuntowicz
@LysandreJik
@sgugger

@sgugger (Collaborator) left a comment:


Thanks for fixing!

@sgugger sgugger merged commit cc983cd into huggingface:master Nov 30, 2020
LysandreJik pushed a commit that referenced this pull request Nov 30, 2020
stas00 pushed a commit to stas00/transformers that referenced this pull request Dec 5, 2020
@Fraser-Greenlee Fraser-Greenlee deleted the patch-1 branch December 11, 2020 09:17
Development

Successfully merging this pull request may close these issues: Inconsistent PreTrainedTokenizerBase.pad argument default value & docstring (#8837)