Add token type ids to CodeGenTokenizer #29265
Conversation
This PR is WIP because I will check whether a documentation update is needed and whether the CodeGenTokenizer output includes token_type_ids.
@ArthurZucker @younesbelkada Thank you!
Thanks! You might have to update the test_tokenizer_integration as it will now return token_type_ids. Let's make sure slow tokenizer tests pass as well!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thank you for taking the time to review the pull request! It seems that …
Force-pushed from 6d5e117 to dc036c1.
@ArthurZucker
Will review tomorrow, sorry!
LGTM, just a small nit.
Force-pushed from d42b791 to c39d626.
I've addressed your comment and also added documentation for create_token_type_ids_from_sequences.
Currently with the state of this PR:

```python
In [1]: from transformers import AutoTokenizer

In [2]: tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

In [3]: tokenizer("Hey how are you")
Out[3]: {'input_ids': [10814, 703, 389, 345], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
```
This is not backward compatible; return_token_type_ids should be set to False.
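For illustration (this snippet is not from the PR), here is the kind of hypothetical downstream code that a new default key would silently break, which is why the default matters:

```python
# Hypothetical consumer code written against the old two-key encoding.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
enc = tokenizer("Hey how are you")

# Fine while the encoding has exactly two keys, but this raises ValueError
# the moment token_type_ids is returned by default as a third key.
input_ids, attention_mask = enc.values()
```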
```diff
- self.tokenizer_integration_test_util(expected_encoding, "Salesforce/codegen-350M-mono", padding=False)
+ self.tokenizer_integration_test_util(expected_encoding, "Salesforce/codegen-350M-mono", padding=False, return_token_type_ids=True)
```
For BC we should make sure that return_token_type_ids is set to False by default.
I've updated this test as you suggested.
Force-pushed from c39d626 to c472a57.
@ArthurZucker

```python
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
>>> tokenizer("Hey how are you")
{'input_ids': [10814, 703, 389, 345], 'attention_mask': [1, 1, 1, 1]}
>>> tokenizer("Hey how are you", return_token_type_ids=True)
{'input_ids': [10814, 703, 389, 345], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
```

Could you review again?
We have to change the default return_token_type_ids, not in the test (it's alright to keep it) but most importantly in the init of both tokenizers: set self.return_token_type_ids = return_token_type_ids and put return_token_type_ids=False in the args!
```diff
- # @slow
+ @slow
```
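For context, @slow is the transformers test decorator that skips a test unless slow tests are enabled via the RUN_SLOW environment variable, which is appropriate here since the integration test downloads a pretrained tokenizer. A rough sketch of what the real decorator in transformers.testing_utils does:

```python
import os
import unittest

def slow(test_case):
    # Skip the decorated test unless RUN_SLOW is set in the environment,
    # loosely mirroring transformers.testing_utils.slow.
    return unittest.skipUnless(os.getenv("RUN_SLOW", False), "test is slow")(test_case)
```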
Sorry I forgot to uncomment it. I've fixed it!
Force-pushed from c472a57 to 876566a.
@ArthurZucker

```python
class CodeGenTokenizer(PreTrainedTokenizer):
    def __init__(
        self,
        ...,
        return_token_type_ids=False,
        **kwargs,
    ):
        self.return_token_type_ids = return_token_type_ids
```

When looking at other tokenizers in the repository, I found that none of them have return_token_type_ids as an argument in the init function and set return_token_type_ids as a class variable like self.return_token_type_ids = return_token_type_ids. Would it be problematic from the perspective of library consistency to make CodeGenTokenizer the only one with such a specification?

Please let me know if there is any misunderstanding on my part.
Could you take a look at this comment? Your feedback is greatly appreciated.
@st81 Just to let you know, @ArthurZucker is off for a week.
Thank you for letting me know about that! I'll wait for him to return.
I think the current changes are fine to make the default return_token_type_ids False.
Hey, regarding:

> When looking at other tokenizers in the repository, I found that none of them have return_token_type_ids as an argument in the init function, and set return_token_type_ids as a class variable of the tokenizer like self.return_token_type_ids = return_token_type_ids. Would it be problematic from the perspective of library consistency to make CodeGenTokenizer the only one with such a specification?

It is not problematic, no, as this is just a means to make sure we respect backward compatibility. Let's set it to False for this model only.
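A minimal sketch of the agreed shape (the surrounding init details are assumptions for illustration, not the merged code):

```python
from transformers import PreTrainedTokenizer

class CodeGenTokenizer(PreTrainedTokenizer):
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(self, vocab_file, merges_file, return_token_type_ids=False, **kwargs):
        # False by default so existing callers keep getting only
        # input_ids and attention_mask; passing True opts in.
        self.return_token_type_ids = return_token_type_ids
        # The base encode path only emits keys listed in model_input_names.
        if self.return_token_type_ids:
            self.model_input_names.append("token_type_ids")
        super().__init__(return_token_type_ids=return_token_type_ids, **kwargs)
```

The two-line gate on model_input_names is the same snippet quoted in the review below.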
Force-pushed from 876566a to d3cb4a7.
Force-pushed from d3cb4a7 to ce599f7.
I've made changes based on your comment. Could you take a look?
LGTM otherwise
```python
if self.return_token_type_ids:
    self.model_input_names.append("token_type_ids")
```
If we always add it, does it make a difference? It's my first time seeing this fix, so I'm a bit surprised that we have to do that.
Yes. This actually controls whether the tokenizer returns token_type_ids or not here. The tokenizer doesn't return token_type_ids without this line.
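A quick way to see that mechanism, assuming the base tokenizer's default of only returning keys listed in model_input_names:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
print(tokenizer.model_input_names)          # ['input_ids', 'attention_mask']
print(tokenizer("Hey how are you").keys())  # no token_type_ids

# Appending the key is what makes the encode path emit it by default.
tokenizer.model_input_names.append("token_type_ids")
print(tokenizer("Hey how are you").keys())  # now includes token_type_ids
```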
Sorry, you are right! It's not really well made. I'll approve, but it would be nice if we could fix self.model_input_names based on the value of return_token_type_ids. For now only the reverse is true.
```diff
  pad_token=None,
  add_prefix_space=False,
  add_bos_token=False,
+ return_token_type_ids=False,
```
that is the expected change
Thanks
Sorry for the delay, LGTM
Thank you so much for reviewing!
* Add create token type ids to CodeGenTokenizer
* Fix inconsistent length of token type ids
* Format source codes
* Fix inconsistent order of methods
* Update docstring
* add test_tokenizer_integration test
* Format source codes
* Add `copied from` comment to CodeGenTokenizerFast
* Add doc of create_token_type_ids_from_sequences
* Make return_token_type_ids False by default
* Make test_tokenizer_integration as slow test
* Add return_token_type_ids to tokenizer init arg
* Add test for tokenizer's init return_token_type_ids
* Format source codes
What does this PR do?
Fixes #28098
Who can review?
@ArthurZucker @younesbelkada