Making `ConvBert Tokenizer` independent from `bert Tokenizer` #19347

IMvision12 · 2022-10-05T12:56:18Z

What does this PR do?

Fixes #19303

Added BertTokenizer class in tokenization_convbert.py and BertTokenizerFast in tokenization_convbert_fast.py

HuggingFaceDocBuilderDev · 2022-10-05T13:18:35Z

The documentation is not available anymore as the PR was closed or merged.

sgugger

Thanks for working on this! Left a couple of pointers on how to finish the work :-)

src/transformers/models/convbert/tokenization_convbert.py

src/transformers/models/convbert/tokenization_convbert_fast.py

IMvision12 · 2022-10-05T14:02:36Z

Done! @sgugger, do I need to do the same for convbert_fast?

sgugger

Thanks for iterating! We're on the right path.
Now we just need to make sure the util scanning the copies is happy (i.e., the copies should match the original). I've left comments in that direction, and you can check locally whether the util is happy by running `make repo-consistency`.

src/transformers/models/convbert/tokenization_convbert.py

src/transformers/models/convbert/tokenization_convbert_fast.py

IMvision12 · 2022-10-05T15:58:27Z

@sgugger Thanks for the quick feedback.
For tokenization_convbert.py I have changed the comment to `# Copied from transformers.models.bert.tokenization_bert.BertTokenizer with ConvBertTokenizer->BertTokenizer`

and for tokenization_convbert_fast.py I have changed the comment to `# Copied from transformers.models.bert.tokenization_bert_fast.BertTokenizerFast with ConvBertTokenizerFast->ConvBertTokenizer`

For convbert_fast, `make repo-consistency` gives an error: `- src/transformers\models\convbert\tokenization_convbert_fast.py: copy does not match models.bert.tokenization_bert_fast.BertTokenizerFast at line 55`

sgugger · 2022-10-05T16:00:16Z

You'll need to add broader patterns than just the full name of the tokenizer, as BERT is used in the docstrings, for instance. To see what the copy util wants to modify, you can run `make fix-copies` locally :-)
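To illustrate why a narrow pattern fails, here is a minimal sketch of how `old->new` replacement patterns in a `# Copied from ... with ...` comment could be applied to the original and compared against the copy. This is a simplified stand-in, not the repository's actual copy-checking util; the helper names `parse_patterns` and `apply_patterns` and the three-line class bodies are hypothetical.

```python
def parse_patterns(comment):
    """Extract (old, new) pairs from the text after ' with ' in the comment."""
    _, _, tail = comment.partition(" with ")
    pairs = []
    for chunk in tail.split(","):
        old, sep, new = chunk.strip().partition("->")
        if sep:  # skip chunks that are not of the form old->new
            pairs.append((old.strip(), new.strip()))
    return pairs


def apply_patterns(source, pairs):
    """Apply each case-sensitive replacement to the original source, in order."""
    for old, new in pairs:
        source = source.replace(old, new)
    return source


# Hypothetical, heavily shortened stand-ins for the real classes.
original = '''class BertTokenizerFast:
    """Construct a "fast" BERT tokenizer."""
    slow_tokenizer_class = BertTokenizer
'''

copy = '''class ConvBertTokenizerFast:
    """Construct a "fast" ConvBERT tokenizer."""
    slow_tokenizer_class = ConvBertTokenizer
'''

prefix = "# Copied from transformers.models.bert.tokenization_bert_fast.BertTokenizerFast with "
narrow = prefix + "BertTokenizerFast->ConvBertTokenizerFast"
broad = prefix + "Bert->ConvBert, BERT->ConvBERT"

# The narrow pattern leaves "BERT" in the docstring and "BertTokenizer" untouched.
narrow_ok = apply_patterns(original, parse_patterns(narrow)) == copy
# The broader patterns also cover the docstring and the slow tokenizer name.
broad_ok = apply_patterns(original, parse_patterns(broad)) == copy
```

In this toy case only the broader patterns reproduce the copy, which matches the point above: the full class name alone does not account for "BERT" in docstrings or for `slow_tokenizer_class`.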

IMvision12 · 2022-10-05T16:04:08Z

After running `make fix-copies`, it changes the line to `slow_tokenizer_class = BertTokenizer`, but it should be `ConvBertTokenizer` in tokenization_convbert_fast.py.

sgugger · 2022-10-05T16:06:29Z

Yes, that's why I made the suggestions above.

IMvision12 · 2022-10-05T16:20:47Z

So, do I need to copy the BertTokenizer and BertTokenizerFast classes into tokenization_convbert_fast.py? After adding those classes, it passes all tests.

sgugger · 2022-10-05T16:24:31Z

Just accept the suggestions above.

IMvision12 · 2022-10-05T16:45:31Z

Done, @sgugger. Are there any more changes?

src/transformers/models/convbert/tokenization_convbert.py

sgugger

I'm not following what you're doing in the fast tokenizer file. The goal is to have ConvBertTokenizerFast stop depending on BertTokenizerFast by copying the code from that file, not to put the code of BertTokenizer there.
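A rough sketch of the layout being asked for, under stated assumptions: the two stub base classes stand in for the library's `PreTrainedTokenizer` and `PreTrainedTokenizerFast` so the snippet runs without `transformers` installed, the class bodies are elided, both classes are shown together although they live in separate files, and the replacement patterns in the comments are illustrative rather than the exact ones merged.

```python
# Stand-ins for the library base classes (assumption: the real classes are
# imported from transformers in the actual files).
class PreTrainedTokenizer:
    pass


class PreTrainedTokenizerFast:
    pass


# tokenization_convbert.py (slow tokenizer): a copy of BertTokenizer's code.
# Copied from transformers.models.bert.tokenization_bert.BertTokenizer with Bert->ConvBert, BERT->ConvBERT
class ConvBertTokenizer(PreTrainedTokenizer):
    pass  # body copied from BertTokenizer, with the patterns applied


# tokenization_convbert_fast.py (fast tokenizer): a copy of BertTokenizerFast's
# code, not of BertTokenizer's.
# Copied from transformers.models.bert.tokenization_bert_fast.BertTokenizerFast with Bert->ConvBert, BERT->ConvBERT
class ConvBertTokenizerFast(PreTrainedTokenizerFast):
    # Points at the ConvBERT slow tokenizer rather than BertTokenizer.
    slow_tokenizer_class = ConvBertTokenizer


# Neither class has a Bert* class anywhere in its inheritance chain.
base_names = [cls.__name__ for cls in ConvBertTokenizerFast.__mro__]
```

The point of the layout is that each ConvBERT class inherits from the generic base class directly, so removing or changing the BERT tokenizers would not affect it.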

sgugger

Thanks a lot for iterating! This is almost ready to merge, just a couple of last nits.

src/transformers/models/convbert/tokenization_convbert_fast.py

src/transformers/models/convbert/tokenization_convbert.py

sgugger

Thanks a lot for iterating on this PR! Looks perfect now.

IMvision12 added 2 commits October 5, 2022 18:20

ConvBert

a987743

added comment

0ef3a79

IMvision12 added 2 commits October 5, 2022 18:56

Updated

c1e53f0

Final_updates

eafbdf3

sgugger reviewed Oct 5, 2022

IMvision12 added 4 commits October 5, 2022 19:17

Update tokenization_convbert.py

1b7c928

Update tokenization_convbert_fast.py

df26e25

Update tokenization_convbert.py

65e6a9a

Update tokenization_convbert.py

48885be

IMvision12 added 2 commits October 5, 2022 19:41

Update tokenization_convbert_fast.py

3b20999

Update tokenization_convbert.py

c6bdb7e

IMvision12 requested a review from sgugger October 5, 2022 15:07

IMvision12 added 2 commits October 5, 2022 21:09

Update tokenization_convbert_fast.py

9eb67c6

Merge branch 'ConvBert' of github.com:IMvision12/transformers into Co…

b87bccb

…nvBert

sgugger reviewed Oct 5, 2022

src/transformers/models/convbert/tokenization_convbert.py

src/transformers/models/convbert/tokenization_convbert_fast.py (outdated)

IMvision12 added 2 commits October 5, 2022 21:56

Updates

6b3550e

Updates

0aa7867

IMvision12 requested a review from sgugger October 5, 2022 16:38

sgugger reviewed Oct 5, 2022

src/transformers/models/convbert/tokenization_convbert.py (outdated)

sgugger reviewed Oct 5, 2022

Updated

77ca7d2

IMvision12 requested a review from sgugger October 5, 2022 19:44

sgugger approved these changes Oct 5, 2022

src/transformers/models/convbert/tokenization_convbert_fast.py (outdated)

src/transformers/models/convbert/tokenization_convbert.py (outdated)

Final Updates

5521ba8

IMvision12 requested a review from sgugger October 6, 2022 01:34

sgugger approved these changes Oct 7, 2022

sgugger merged commit 7e348aa into huggingface:main Oct 7, 2022

IMvision12 deleted the ConvBert branch October 7, 2022 12:08
