Add token type ids to CodeGenTokenizer #29265
Conversation
This PR is WIP because I will check whether a documentation update is needed and whether the CodeGenTokenizer output includes token_type_ids.
@ArthurZucker @younesbelkada Thank you!
Thanks! You might have to update the test_tokenizer_integration as it will now return token_type_ids. Let's make sure slow tokenizer tests pass as well!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thank you for taking the time to review the pull request! It seems that …
Force-pushed from 6d5e117 to dc036c1.
@ArthurZucker
Will review tomorrow, sorry!
LGTM, just a small nit.
Force-pushed from d42b791 to c39d626.
I've addressed your comment and also added documentation for create_token_type_ids_from_sequences.
Currently with the state of this PR:

```python
In [1]: from transformers import AutoTokenizer

In [2]: tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

In [3]: tokenizer("Hey how are you")
Out[3]: {'input_ids': [10814, 703, 389, 345], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
```
This is not backward compatible; return_token_type_ids should be set to False.
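For illustration (this snippet is not from the PR), here is the kind of hypothetical downstream code that a new default key would silently break, which is why the default matters:

```python
# Hypothetical consumer code written against the old two-key encoding.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
enc = tokenizer("Hey how are you")

# Fine while the encoding has exactly two keys, but this raises ValueError
# the moment token_type_ids is returned by default as a third key.
input_ids, attention_mask = enc.values()
```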
```diff
- self.tokenizer_integration_test_util(expected_encoding, "Salesforce/codegen-350M-mono", padding=False)
+ self.tokenizer_integration_test_util(expected_encoding, "Salesforce/codegen-350M-mono", padding=False, return_token_type_ids=True)
```
For BC we should make sure that return_token_type_ids is set to False by default.
I've updated this test as you suggested.
Force-pushed from c39d626 to c472a57.
@ArthurZucker

```python
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
>>> tokenizer("Hey how are you")
{'input_ids': [10814, 703, 389, 345], 'attention_mask': [1, 1, 1, 1]}
>>> tokenizer("Hey how are you", return_token_type_ids=True)
{'input_ids': [10814, 703, 389, 345], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
```

Could you review again?
We have to change the default return_token_type_ids, not in the test (it's alright to keep it) but most importantly in the init of both tokenizers: set self.return_token_type_ids = return_token_type_ids and put return_token_type_ids=False in the args!
```diff
- # @slow
+ @slow
```
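For context, @slow is the transformers test decorator that skips a test unless slow tests are enabled via the RUN_SLOW environment variable, which is appropriate here since the integration test downloads a pretrained tokenizer. A rough sketch of what the real decorator in transformers.testing_utils does:

```python
import os
import unittest

def slow(test_case):
    # Skip the decorated test unless RUN_SLOW is set in the environment,
    # loosely mirroring transformers.testing_utils.slow.
    return unittest.skipUnless(os.getenv("RUN_SLOW", False), "test is slow")(test_case)
```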
Sorry I forgot to uncomment it. I've fixed it!
Force-pushed from c472a57 to 876566a.
@ArthurZucker

```python
class CodeGenTokenizer(PreTrainedTokenizer):
    def __init__(
        self,
        ...,
        return_token_type_ids=False,
        **kwargs,
    ):
        self.return_token_type_ids = return_token_type_ids
```

When looking at other tokenizers in the repository, I found that none of them have return_token_type_ids as an argument in the init function and set return_token_type_ids as a class variable like self.return_token_type_ids = return_token_type_ids. Would it be problematic from the perspective of library consistency to make CodeGenTokenizer the only one with such a specification?

Please let me know if there is any misunderstanding on my part.
Could you take a look at this comment? Your feedback is greatly appreciated.
@st81 Just to let you know, @ArthurZucker is off for a week.
Thank you for letting me know about that! I'll wait for him to return.
I think the current changes are fine to make the default return_token_type_ids False.
Hey, regarding:

> When looking at other tokenizers in the repository, I found that none of them have return_token_type_ids as an argument in the init function, and set return_token_type_ids as a class variable of the tokenizer like self.return_token_type_ids = return_token_type_ids. Would it be problematic from the perspective of library consistency to make CodeGenTokenizer the only one with such a specification?

It is not problematic, no, as this is just a means to make sure we respect backward compatibility. Let's set it to False for this model only.
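A minimal sketch of the agreed shape (the surrounding init details are assumptions for illustration, not the merged code):

```python
from transformers import PreTrainedTokenizer

class CodeGenTokenizer(PreTrainedTokenizer):
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(self, vocab_file, merges_file, return_token_type_ids=False, **kwargs):
        # False by default so existing callers keep getting only
        # input_ids and attention_mask; passing True opts in.
        self.return_token_type_ids = return_token_type_ids
        # The base encode path only emits keys listed in model_input_names.
        if self.return_token_type_ids:
            self.model_input_names.append("token_type_ids")
        super().__init__(return_token_type_ids=return_token_type_ids, **kwargs)
```

The two-line gate on model_input_names is the same snippet quoted in the review below.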
Force-pushed from 876566a to d3cb4a7.
Force-pushed from d3cb4a7 to ce599f7.
I've made changes based on your comment. Could you take a look?
LGTM otherwise
```python
if self.return_token_type_ids:
    self.model_input_names.append("token_type_ids")
```
If we always add it, does it make a difference? It's my first time seeing this fix, so I'm a bit surprised that we have to do that.
Yes. This actually controls whether the tokenizer returns token_type_ids or not here. The tokenizer doesn't return token_type_ids without this line.
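A quick way to see that mechanism, assuming the base tokenizer's default of only returning keys listed in model_input_names:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
print(tokenizer.model_input_names)          # ['input_ids', 'attention_mask']
print(tokenizer("Hey how are you").keys())  # no token_type_ids

# Appending the key is what makes the encode path emit it by default.
tokenizer.model_input_names.append("token_type_ids")
print(tokenizer("Hey how are you").keys())  # now includes token_type_ids
```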
Sorry, you are right! It's not really well made. I'll approve, but it would be nice if we could fix self.model_input_names based on the value of return_token_type_ids. For now only the reverse is true.
```diff
  pad_token=None,
  add_prefix_space=False,
  add_bos_token=False,
+ return_token_type_ids=False,
```
that is the expected change
Thanks
Sorry for the delay, LGTM
Thank you so much for reviewing!
* Add create token type ids to CodeGenTokenizer
* Fix inconsistent length of token type ids
* Format source codes
* Fix inconsistent order of methods
* Update docstring
* add test_tokenizer_integration test
* Format source codes
* Add `copied from` comment to CodeGenTokenizerFast
* Add doc of create_token_type_ids_from_sequences
* Make return_token_type_ids False by default
* Make test_tokenizer_integration as slow test
* Add return_token_type_ids to tokenizer init arg
* Add test for tokenizer's init return_token_type_ids
* Format source codes
What does this PR do?
Fixes #28098
Who can review?
@ArthurZucker @younesbelkada