Making ConvBert Tokenizer independent from bert Tokenizer
#19347
sgugger
left a comment
Thanks for working on this! Left a couple of pointers on how to finish the work :-)
Done! @sgugger, do I need to do the same for convbert_fast?
sgugger
left a comment
Thanks for iterating! We're on the right path.
Now we just need to make sure the util scanning the copies is happy (i.e., the copies should match the original). I've left comments in that direction, and you can check locally whether the util is happy by running make repo-consistency.
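To illustrate what "the copies should match the original" means in practice, here is a minimal sketch of such a check. It is not the repository's actual utility: the function names `apply_replacements` and `copy_is_consistent` are hypothetical, and it assumes the copy marker carries comma-separated `Old->New` replacement patterns that are applied to the original before comparing.

```python
def apply_replacements(source: str, patterns: str) -> str:
    """Apply comma-separated `Old->New` patterns to `source` (hypothetical helper)."""
    for pattern in patterns.split(","):
        old, new = (part.strip() for part in pattern.split("->"))
        source = source.replace(old, new)
    return source


def copy_is_consistent(original: str, copy: str, patterns: str) -> bool:
    """Return True if `copy` equals `original` once the patterns are applied."""
    return apply_replacements(original, patterns) == copy


# Toy stand-ins for the original and the copied class source (not the real files).
original = 'class BertTokenizer:\n    """Construct a BERT tokenizer."""\n'
copy = 'class ConvBertTokenizer:\n    """Construct a ConvBERT tokenizer."""\n'

# Replacing only the full class name leaves "BERT" in the docstring untouched.
narrow = "BertTokenizer->ConvBertTokenizer"
# A broader pattern set also covers the upper-case name used in docstrings.
broad = "BertTokenizer->ConvBertTokenizer, BERT->ConvBERT"

print(copy_is_consistent(original, copy, narrow))
print(copy_is_consistent(original, copy, broad))
```

In this toy setup the narrow pattern fails because the docstring still says BERT, while the broader pattern set makes the copy match.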
@sgugger Thanks for the quick feedback. For tokenization_convbert_fast.py, I have changed the comment for convbert_fast.
You'll need to add broader patterns than just the full name of the tokenizer, since BERT is also used in the docstrings, for instance. To see what the copy util wants to modify, you can run
After running
Yes, that's why I made the suggestions above.
So, do I need to copy
Just accept the suggestions above.
Done, @sgugger. Are there any more changes?
sgugger
left a comment
I'm not following what you're doing in the fast tokenizer file. The goal is to have ConvBertTokenizerFast stop depending on BertTokenizerFast by copying the code from that file, not to put the code of BertTokenizer there.
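As a rough picture of that goal, the sketch below uses toy stand-in classes rather than the real library code. The token ids, the simplified method body, and the exact wording of the copy marker are assumptions made for illustration only.

```python
class PreTrainedTokenizerFast:
    """Toy stand-in for the shared base class (not the real implementation)."""

    cls_token_id = 101  # illustrative values only
    sep_token_id = 102


class BertTokenizerFast(PreTrainedTokenizerFast):
    """Toy stand-in for the original fast tokenizer."""

    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        # Wrap the first sequence as [CLS] A [SEP], then append B [SEP] if given.
        output = [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
        if token_ids_1 is not None:
            output += token_ids_1 + [self.sep_token_id]
        return output


# Copied from BertTokenizerFast with Bert->ConvBert (marker format assumed)
class ConvBertTokenizerFast(PreTrainedTokenizerFast):
    """Inherits from the shared base directly; the method body is copied."""

    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        # Same logic as above, duplicated so there is no dependency on BertTokenizerFast.
        output = [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
        if token_ids_1 is not None:
            output += token_ids_1 + [self.sep_token_id]
        return output


tokenizer = ConvBertTokenizerFast()
print(tokenizer.build_inputs_with_special_tokens([7, 8], [9]))
print(BertTokenizerFast in ConvBertTokenizerFast.__mro__)
```

The point of the duplication is that the copied class keeps the same behaviour while no longer having the original class anywhere in its inheritance chain.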
sgugger
left a comment
Thanks a lot for iterating! This is almost ready to merge, just a couple of last nits.
sgugger
left a comment
Thanks a lot for iterating on this PR! Looks perfect now.
What does this PR do?
Fixes #19303
Added BertTokenizer class in tokenization_convbert.py and BertTokenizerFast in tokenization_convbert_fast.py