Conversation


@st81 st81 commented Feb 24, 2024

What does this PR do?

Fixes #28098

Who can review?

@ArthurZucker @younesbelkada

@st81
Copy link
Contributor Author

st81 commented Feb 24, 2024

This PR is WIP because I still need to check whether a documentation update is needed and whether the CodeGenTokenizer output includes token_type_ids.

@st81 st81 changed the title from "[WIP] Add token type ids to CodeGenTokenizer" to "Add token type ids to CodeGenTokenizer" on Feb 25, 2024

st81 commented Feb 25, 2024

@ArthurZucker @younesbelkada
I've made the necessary changes to address the issue. Could someone please review the pull request at your earliest convenience? Your feedback is greatly appreciated.

Thank you!

@ArthurZucker ArthurZucker (Collaborator) left a comment

Thanks! You might have to update the test_tokenizer_integration as it will now return token_type_ids. Let's make sure slow tokenizer tests pass as well!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


st81 commented Feb 28, 2024

Thanks! You might have to update the test_tokenizer_integration as it will now return token_type_ids. Let's make sure slow tokenizer tests pass as well!

@ArthurZucker

Thank you for taking the time to review the pull request! It seems that test_tokenizer_integration is missing from tests/models/codegen/test_tokenization_codegen.py. Would you like me to add test_tokenizer_integration to tests/models/codegen/test_tokenization_codegen.py, similar to the other tokenizer tests?

@st81 st81 force-pushed the add_token_type_ids_to_code_gen_tokenizer branch from 6d5e117 to dc036c1 on February 28, 2024 10:17

st81 commented Feb 28, 2024

Thanks! You might have to update the test_tokenizer_integration as it will now return token_type_ids. Let's make sure slow tokenizer tests pass as well!

@ArthurZucker
It seems that test_tokenizer_integration is missing from tests/models/codegen/test_tokenization_codegen.py. So I’ve added test_tokenizer_integration. Could you review again?
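
For context, the kind of behaviour such an integration test pins down can also be checked by hand; here is a minimal illustrative snippet (not the test itself, and it needs network access to fetch the tokenizer files):

from transformers import CodeGenTokenizer, CodeGenTokenizerFast

# The slow (Python) and fast (Rust-backed) tokenizers should keep agreeing
# on the reference checkpoint once token_type_ids support lands in both.
slow_tok = CodeGenTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
fast_tok = CodeGenTokenizerFast.from_pretrained("Salesforce/codegen-350M-mono")

text = "def hello_world():"
assert slow_tok(text)["input_ids"] == fast_tok(text)["input_ids"]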

@st81 st81 requested a review from ArthurZucker February 28, 2024 11:20
@ArthurZucker (Collaborator)

Will review tomorrow, sorry!

@ArthurZucker ArthurZucker (Collaborator) left a comment

LGTM just a small nit

@st81 st81 force-pushed the add_token_type_ids_to_code_gen_tokenizer branch from d42b791 to c39d626 on March 5, 2024 13:05

st81 commented Mar 5, 2024

I've addressed your comment and also added documentation for CodeGenTokenizer.create_token_type_ids_from_sequences ! Could you please review it again?
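
For readers unfamiliar with the method: create_token_type_ids_from_sequences builds a segment-id list with the same length as the (concatenated) input ids. Below is a generic sketch of the conventional pattern used across the library, written against a hypothetical stand-in class rather than the exact code merged here:

from typing import List, Optional

class _DemoTokenizer:
    # Hypothetical stand-in; CodeGen defines no cls/sep tokens, so both stay None.
    cls_token_id: Optional[int] = None
    sep_token_id: Optional[int] = None

    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        sep = [self.sep_token_id] if self.sep_token_id is not None else []
        cls = [self.cls_token_id] if self.cls_token_id is not None else []
        if token_ids_1 is None:
            return len(cls + token_ids_0 + sep) * [0]  # single sequence: all segment 0
        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]

print(_DemoTokenizer().create_token_type_ids_from_sequences([10814, 703], [389, 345]))  # [0, 0, 1, 1]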

@st81 st81 requested a review from ArthurZucker March 5, 2024 13:25
@ArthurZucker ArthurZucker (Collaborator) left a comment

Currently with the state of this PR:

In [1]:  from transformers import AutoTokenizer

In [2]: tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

In [3]:  tokenizer("Hey how are you")
Out[3]: {'input_ids': [10814, 703, 389, 345], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}

This is not backward compatible; return_token_type_ids should be set to False.

Collaborator

Suggested change:
- self.tokenizer_integration_test_util(expected_encoding, "Salesforce/codegen-350M-mono", padding=False)
+ self.tokenizer_integration_test_util(expected_encoding, "Salesforce/codegen-350M-mono", padding=False, return_token_type_ids=True)

For BC (backward compatibility) we should make sure that return_token_type_ids is set to False by default.

Contributor Author

I've updated this test as you suggested.

@st81 st81 force-pushed the add_token_type_ids_to_code_gen_tokenizer branch from c39d626 to c472a57 on March 6, 2024 12:45

st81 commented Mar 6, 2024

Currently with the state of this PR:

In [1]:  from transformers import AutoTokenizer

In [2]: tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

In [3]:  tokenizer("Hey how are you")
Out[3]: {'input_ids': [10814, 703, 389, 345], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}

This is not backward compatible; return_token_type_ids should be set to False.

@ArthurZucker
I've updated return_token_type_ids to be False by default.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
>>> tokenizer("Hey how are you")
{'input_ids': [10814, 703, 389, 345], 'attention_mask': [1, 1, 1, 1]}
>>> tokenizer("Hey how are you", return_token_type_ids=True)
{'input_ids': [10814, 703, 389, 345], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}

Could you review again?

@st81 st81 requested a review from ArthurZucker March 6, 2024 13:10
@ArthurZucker ArthurZucker (Collaborator) left a comment

We have to change the default return_token_type_ids not in the test (it's alright to keep it) but most importantly in the init of both tokenizers, set it to self.return_token_type_ids = return_token_type_ids and return_token_type_ids=False in the args!

Collaborator

Suggested change:
- # @slow
+ @slow

Contributor Author

Sorry I forgot to uncomment it. I've fixed it!

@st81 st81 force-pushed the add_token_type_ids_to_code_gen_tokenizer branch from c472a57 to 876566a on March 7, 2024 14:02

st81 commented Mar 7, 2024

We have to change the default return_token_type_ids not in the test (it's alright to keep it) but most importantly in the init of both tokenizers, set it to self.return_token_type_ids = return_token_type_ids and return_token_type_ids=False in the args!

@ArthurZucker
Do you mean that we should make changes like below?

class CodeGenTokenizer(PreTrainedTokenizer):
    def __init__(
        self,
        ...,
        return_token_type_ids=False,
        **kwargs,
    ):
        self.return_token_type_ids = return_token_type_ids

When looking at other tokenizers in the repository, I found that none of them take return_token_type_ids as an argument in the __init__ function or store it on the tokenizer like self.return_token_type_ids = return_token_type_ids. Would it be problematic, from the perspective of library consistency, to make CodeGenTokenizer the only one with such a specification?

Please let me know if there is any misunderstanding on my part.


st81 commented Mar 11, 2024

@ArthurZucker

Could you take a look at this comment? Your feedback is greatly appreciated.

@st81 st81 requested a review from ArthurZucker March 11, 2024 12:06
@amyeroberts (Contributor)

@st81 Just to let you know @ArthurZucker is off for a week


st81 commented Mar 12, 2024

@st81 Just to let you know @ArthurZucker is off for a week

Thank you for letting me know about that! I'll wait for him to return.


st81 commented Mar 20, 2024

@ArthurZucker

We have to change the default return_token_type_ids not in the test (it's alright to keep it) but most importantly in the init of both tokenizers, set it to self.return_token_type_ids = return_token_type_ids and return_token_type_ids=False in the args!

I think the current changes are enough to make the default return_token_type_ids False, because it looks like the default value of return_token_type_ids is controlled by whether self.model_input_names contains "token_type_ids" or not. What are your thoughts on this?
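
For context, a simplified sketch (not a verbatim copy) of how the base class resolves that default, mirroring the check in transformers' tokenization_utils_base.py; the helper name below is made up for illustration:

def resolve_return_token_type_ids(return_token_type_ids, model_input_names):
    if return_token_type_ids is None:
        # By default, token_type_ids are only returned when the model expects them.
        return "token_type_ids" in model_input_names
    return return_token_type_ids

print(resolve_return_token_type_ids(None, ["input_ids", "attention_mask"]))                    # False
print(resolve_return_token_type_ids(None, ["input_ids", "token_type_ids", "attention_mask"]))  # True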

@ArthurZucker ArthurZucker (Collaborator) left a comment

Hey, regarding:

When looking at other tokenizers in the repository, I found that none of them have return_token_type_ids as an argument in the init function, and set return_token_type_ids as a class variable of the tokenizer like self.return_token_type_ids = return_token_type_ids. Would it be problematic from the perspective of library consistency to make CodeGenTokenizer the only one with such a specification?

It is not problematic, no, as this is just a means to make sure we respect backward compatibility.

Let's set it to False for this model only.

@st81 st81 force-pushed the add_token_type_ids_to_code_gen_tokenizer branch from 876566a to d3cb4a7 on April 1, 2024 11:31
@st81 st81 force-pushed the add_token_type_ids_to_code_gen_tokenizer branch from d3cb4a7 to ce599f7 on April 2, 2024 10:12

st81 commented Apr 2, 2024

We have to change the default return_token_type_ids not in the test (it's alright to keep it) but most importantly in the init of both tokenizers, set it to self.return_token_type_ids = return_token_type_ids and return_token_type_ids=False in the args!

I've made changes based on your comment. Could you take a look?

@st81 st81 requested a review from ArthurZucker April 2, 2024 10:59
@ArthurZucker ArthurZucker (Collaborator) left a comment

LGTM otherwise

Comment on lines +164 to +165
if self.return_token_type_ids:
    self.model_input_names.append("token_type_ids")
Collaborator

If we always add it, does it make a difference? It's my first time seeing this fix, so I am a bit surprised that we have to do that.

Contributor Author

Yes. This actually controls whether the tokenizer returns token_type_ids or not, since the default is derived from model_input_names. The tokenizer doesn't return token_type_ids without this line.
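
A usage sketch of that behaviour, assuming the return_token_type_ids init argument added in this PR (the snippet needs network access to fetch the tokenizer files):

from transformers import AutoTokenizer

# Default: backward compatible, no token_type_ids in the output.
tok = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
print(list(tok("Hey how are you").keys()))  # ['input_ids', 'attention_mask']

# Opting in at init time appends "token_type_ids" to model_input_names,
# so it is returned without passing return_token_type_ids on every call.
tok = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono", return_token_type_ids=True)
print(list(tok("Hey how are you").keys()))  # ['input_ids', 'token_type_ids', 'attention_mask']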

Collaborator

Sorry, you are right! It's not really well made. I'll approve, but it would be nice if we could fix self.model_input_names based on the value of return_token_type_ids. For now only the reverse is true.

pad_token=None,
add_prefix_space=False,
add_bos_token=False,
return_token_type_ids=False,
Collaborator

That is the expected change.

@st81 st81 requested a review from ArthurZucker April 3, 2024 23:26
@ArthurZucker ArthurZucker (Collaborator) left a comment

Thanks

@ArthurZucker (Collaborator)

Sorry for the delay, LGTM

@ArthurZucker ArthurZucker merged commit 8d6b509 into huggingface:main on Apr 17, 2024

st81 commented Apr 22, 2024

Thank you so much for reviewing!

ydshieh pushed a commit that referenced this pull request on Apr 23, 2024
* Add create token type ids to CodeGenTokenizer

* Fix inconsistent length of token type ids

* Format source codes

* Fix inconsistent order of methods

* Update docstring

* add test_tokenizer_integration test

* Format source codes

* Add `copied from` comment to CodeGenTokenizerFast

* Add doc of create_token_type_ids_from_sequences

* Make return_token_type_ids False by default

* Make test_tokenizer_integration as slow test

* Add return_token_type_ids to tokenizer init arg

* Add test for tokenizer's init return_token_type_ids

* Format source codes
Successfully merging this pull request may close these issues.

Create create_token_type_ids_from_sequences for CodeGenTokenizer