
[BartTokenizer] add prepare s2s batch #6212

Merged: 11 commits merged into huggingface:master on Aug 17, 2020

Conversation

patil-suraj (Contributor)

This PR adds a prepare_seq2seq_batch method to BartTokenizer, as per the proposal in #6080.

@sshleifer
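For reference, here is a minimal usage sketch of the new method (illustrative only, not taken from the PR diff; the argument names follow the docstring reviewed below):

    from transformers import BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

    # Source/target pairs for a seq2seq task such as summarization.
    src_texts = ["The quick brown fox jumps over the lazy dog."]
    tgt_texts = ["A fox jumps over a dog."]

    # Tokenizes both sides in one call and returns a BatchEncoding with the
    # encoder inputs and, since tgt_texts is given, the decoder inputs as well.
    batch = tokenizer.prepare_seq2seq_batch(
        src_texts,
        tgt_texts=tgt_texts,
        max_length=1024,
        max_target_length=1024,
        padding="longest",
        return_tensors="pt",
    )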

@sshleifer (Contributor) left a comment

LGTM, nice tests!

return_tensors: str = "None",
**kwargs,
) -> BatchEncoding:
"""Prepare a batch that can be passed directly to an instance of BartModel.
@sgugger (Collaborator) commented on Aug 3, 2020

Suggested change:
- """Prepare a batch that can be passed directly to an instance of BartModel.
+ """
+ Prepare a batch that can be passed directly to an instance of :class:`~transformers.BartModel`.

(nit)

@sgugger (Collaborator) left a comment

Thanks for the PR!

Lots of nits on the docs: in general, if the argument you are documenting is passed along to another method, don't hesitate to copy-paste the docstring from that method. And when documenting an argument, don't use abbreviations and write full sentences :-)

maximum length for the source text which defers to the config value of 1024 for facebook/bart*
max_target_length (:obj:`int`, `optional`):
maximum length for the target text which defers to the config value of 1024 for facebook/bart*
padding (:obj:`str`, `optional`, defaults to "longest"):
Review comment (Collaborator):

This can be bool, string or PaddingStrategy I believe? See documentation of PreTrainedTokenizerBase.__call__:

            padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`False`):
                Activates and controls padding. Accepts the following values:

                * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a
                  single sequence if provided).
                * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
                  maximum acceptable input length for the model if that argument is not provided.
                * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
                  different lengths).

maximum length for the target text which defers to the config value of 1024 for facebook/bart*
padding (:obj:`str`, `optional`, defaults to "longest"):
strategy for padding `input_ids` and `decoder_input_ids`. Should be "max_length" or "longest".
return_tensors (:obj:`str`, `optional`):
Review comment (Collaborator):

This can be string or TensorType (same as above, just copy from PreTrainedTokenizerBase.__call__):

            return_tensors (:obj:`str` or :class:`~transformers.tokenization_utils_base.TensorType`, `optional`):
                If set, will return tensors instead of list of python integers. Acceptable values are:

                * :obj:`'tf'`: Return TensorFlow :obj:`tf.constant` objects.
                * :obj:`'pt'`: Return PyTorch :obj:`torch.Tensor` objects.
                * :obj:`'np'`: Return Numpy :obj:`np.ndarray` objects.
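For illustration (assuming torch and numpy are installed; not part of the PR), the same call returns different container types depending on return_tensors:

    import torch
    from transformers import BartTokenizer

    tok = BartTokenizer.from_pretrained("facebook/bart-large")

    as_lists = tok(["hello world"])                        # plain Python lists
    as_torch = tok(["hello world"], return_tensors="pt")   # torch.Tensor values
    as_numpy = tok(["hello world"], return_tensors="np")   # np.ndarray values

    assert isinstance(as_torch["input_ids"], torch.Tensor)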

return_tensors (:obj:`str`, `optional`):
Can be set to ‘tf’, ‘pt’ or ‘np’ to return respectively TensorFlow `tf.constant`, PyTorch `torch.Tensor` or Numpy :obj:`np.ndarray` instead of a list of python integers.
**kwargs:
passed to self.__call__
Review comment (Collaborator):

Suggested change:
- passed to self.__call__
+ Additional keyword arguments passed along to :obj:`self.__call__`.

"""Prepare a batch that can be passed directly to an instance of BartModel.
Args:
src_texts (:obj:`List[str]`):
list of src texts
Review comment (Collaborator):

Suggested change:
- list of src texts
+ List of input texts.

src_texts (:obj:`List[str]`):
list of src texts
tgt_texts (:obj:`List[str]`, `optional`):
list of tgt texts
Review comment (Collaborator):

Suggested change:
- list of tgt texts
+ List of target texts.

tgt_texts (:obj:`List[str]`, `optional`):
list of tgt texts
max_length (:obj:`int`, `optional`):
maximum length for the source text which defers to the config value of 1024 for facebook/bart*
Review comment (Collaborator):

Suggested change:
- maximum length for the source text which defers to the config value of 1024 for facebook/bart*
+ Maximum length for the source texts. If not provided, this will use the predefined model maximum length.

Don't mention a specific model here since several could be used.

max_length (:obj:`int`, `optional`):
maximum length for the source text which defers to the config value of 1024 for facebook/bart*
max_target_length (:obj:`int`, `optional`):
maximum length for the target text which defers to the config value of 1024 for facebook/bart*
Review comment (Collaborator):

Suggested change:
- maximum length for the target text which defers to the config value of 1024 for facebook/bart*
+ Maximum length for the target texts. If not provided, this will use the predefined model maximum length.

@patil-suraj (Contributor, Author)

Thanks @sgugger for these helpful suggestions! I will keep these in mind for future PRs.

@patil-suraj (Contributor, Author)

@sgugger, can you help me with the build_doc failure? Thanks!

@sgugger (Collaborator) commented on Aug 3, 2020

Fixed, you needed to have the beginning of the docstrings on a new line for sphinx to understand the indentation.
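For anyone hitting the same build_doc error, a rough before/after of what that fix looks like (illustrative snippet, not the exact PR diff):

    # Sphinx-unfriendly: the summary shares a line with the opening quotes, so the
    # indentation of the following docstring lines is ambiguous to the parser.
    def prepare_seq2seq_batch(self, src_texts, tgt_texts=None):
        """Prepare a batch that can be passed directly to an instance of BartModel.
        Args:
            src_texts (:obj:`List[str]`): List of input texts.
        """

    # Sphinx-friendly: the docstring body starts on a new line (and lists are
    # separated from the preceding text by a blank line).
    def prepare_seq2seq_batch(self, src_texts, tgt_texts=None):
        """
        Prepare a batch that can be passed directly to an instance of :class:`~transformers.BartModel`.

        Args:
            src_texts (:obj:`List[str]`):
                List of input texts.
        """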

codecov bot commented on Aug 3, 2020

Codecov Report

Merging #6212 into master will increase coverage by 0.05%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6212      +/-   ##
==========================================
+ Coverage   78.29%   78.35%   +0.05%     
==========================================
  Files         146      146              
  Lines       26607    26619      +12     
==========================================
+ Hits        20832    20856      +24     
+ Misses       5775     5763      -12     
Impacted Files Coverage Δ
src/transformers/tokenization_bart.py 96.38% <100.00%> (+0.61%) ⬆️
src/transformers/modeling_tf_mobilebert.py 23.38% <0.00%> (-73.39%) ⬇️
src/transformers/data/processors/utils.py 27.63% <0.00%> (+1.31%) ⬆️
src/transformers/tokenization_xlnet.py 90.09% <0.00%> (+1.80%) ⬆️
src/transformers/generation_tf_utils.py 86.46% <0.00%> (+2.25%) ⬆️
src/transformers/tokenization_roberta.py 98.63% <0.00%> (+2.73%) ⬆️
src/transformers/training_args.py 81.00% <0.00%> (+14.00%) ⬆️
src/transformers/data/processors/glue.py 49.09% <0.00%> (+17.09%) ⬆️
src/transformers/modeling_tf_gpt2.py 95.32% <0.00%> (+23.67%) ⬆️
src/transformers/trainer.py 39.14% <0.00%> (+24.04%) ⬆️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8edfaaa...b05ee8f. Read the comment docs.

@patil-suraj (Contributor, Author)

> Fixed, you needed to have the beginning of the docstrings on a new line for sphinx to understand the indentation.

Thanks @sgugger!

@LysandreJik (Member) left a comment

This looks useful! It would be nice to upstream it so that other sequence to sequence models may make use of it. Also, you added it to BartTokenizer and not the fast tokenizer, is there a reason for that?

If we consider this to be a conversion to an s2s task, I think this would be better suited in an s2s processor like we have for squad_convert_examples_to_features or glue_convert.... I don't see any reason for having it linked to BART in particular.

Pinging @thomwolf

@patil-suraj (Contributor, Author)

Hi @LysandreJik

> and not the fast tokenizer, is there a reason for that?

No, I just forgot to add that.

Upstreaming will be useful, but we will need to handle a few cases differently for each seq2seq model, i.e. in the case of T5 we need to manually add the decoder_start_token_id since T5 doesn't have a bos token, and `eos` also needs to be added manually. In the case of mBART, it needs the language code as a prefix token, etc. Also, AFAIK a lot of people seem to be unfamiliar with the processors API.
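To make the point concrete, here is a purely hypothetical sketch (names and logic are illustrative, not library code) of the kind of per-model branching an upstream helper would need:

    def encode_target(tokenizer, tgt_text, model_type, decoder_start_token_id=None):
        # Hypothetical helper, only to illustrate why seq2seq batching is model specific.
        if model_type == "t5":
            # T5 has no bos token: the decoder start id comes from the model config,
            # and eos has to be appended explicitly.
            ids = tokenizer(tgt_text, add_special_tokens=False)["input_ids"]
            return [decoder_start_token_id] + ids + [tokenizer.eos_token_id]
        if model_type == "mbart":
            # mBART expects the target language code as an extra prefix token
            # ("ro_RO" is just an example language code).
            lang_code_id = tokenizer.convert_tokens_to_ids("ro_RO")
            return [lang_code_id] + tokenizer(tgt_text, add_special_tokens=False)["input_ids"]
        # BART: the tokenizer already wraps the text with bos ... eos.
        return tokenizer(tgt_text)["input_ids"]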

@patil-suraj (Contributor, Author)

Hi @sshleifer, @LysandreJik, any update?

* :obj:`'np'`: Return Numpy :obj:`np.ndarray` objects.
**kwargs:
Additional keyword arguments passed along to :obj:`self.__call__`.
Returns:
Review comment (Contributor):

There is a new docstring on master in tokenization_utils_base.py that you may want to (a) reuse or (b) modify.

@patil-suraj (Contributor, Author)

@sshleifer updated the docs.

Comment on lines 63 to 66
src_texts: (:obj:`list`):
list of documents to summarize or source language texts
tgt_texts: (:obj:`list`, `optional`):
list of tgt language texts or summaries.
Review comment (Collaborator):

The type annotations here were better before. The docstrings should not have abbreviations (and should start with a capital letter and end with a full stop, nit).

@patil-suraj (Contributor, Author) commented on Aug 13, 2020

Aah, I blindly copy-pasted, will make the changes. Also, can you tell me where the doc error is coming from?

Review comment (Collaborator):

You're missing new lines before your lists, I'd say.

@LysandreJik (Member) left a comment

LGTM as-is (after the doc building test fixes), but we really should add the same method on the Fast tokenizer. Having parity on both tokenizers is one of our goals.

@patil-suraj (Contributor, Author)

@LysandreJik I will add this for the fast tokenizer too once this PR is merged.

@LysandreJik (Member)

Sounds good!

@patil-suraj (Contributor, Author) commented on Aug 14, 2020

@LysandreJik, the doc error is fixed; not sure if the current failure is related to this PR.

@sshleifer sshleifer self-assigned this Aug 17, 2020
@sshleifer sshleifer merged commit 2a77813 into huggingface:master Aug 17, 2020
@patil-suraj patil-suraj deleted the bart-tok-s2s-batch branch August 19, 2020 08:01
Zigur pushed a commit to Zigur/transformers that referenced this pull request Oct 26, 2020
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020