Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative to globals() #8667

Merged
merged 4 commits into from
Nov 20, 2020
Merged

Alternative to globals() #8667

merged 4 commits into from
Nov 20, 2020

Conversation

sgugger
Copy link
Collaborator

@sgugger sgugger commented Nov 19, 2020

What does this PR do?

This PR does two things:

  • remove some tokenizer classes that were used in a dictionary before being erased which was super weird
  • add a function that goes from tokenizer class name to tokenizer class to avoid using globals

Comment on lines -188 to -189
(RobertaConfig, (BertweetTokenizer, None)),
(RobertaConfig, (PhobertTokenizer, None)),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those were not in the values() of this dict since they are replaced in the next line.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm good point.
I feel like in the end this issues comes from the fact that we should have a tokenizer config instead of piggy-backing on the model config. What. do you think @LysandreJik @julien-c ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should. This is an issue I'll try to solve next week.

@@ -195,7 +193,6 @@
(LayoutLMConfig, (LayoutLMTokenizer, LayoutLMTokenizerFast)),
(DPRConfig, (DPRQuestionEncoderTokenizer, DPRQuestionEncoderTokenizerFast)),
(SqueezeBertConfig, (SqueezeBertTokenizer, SqueezeBertTokenizerFast)),
(BertConfig, (HerbertTokenizer, HerbertTokenizerFast)),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those were not in the values() of this dict since they are replaced in the next line.

SLOW_TOKENIZER_MAPPING = {
k: (v[0] if v[0] is not None else v[1])
for k, v in TOKENIZER_MAPPING.items()
if (v[0] is not None or v[1] is not None)
}
SLOW_TOKENIZER_MAPPING[HerbertTokenizer] = HerbertTokenizerFast
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was not in TOKENIZER_MAPPING.items() so not set.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As seen offline, probably unnecessary.

SLOW_TOKENIZER_MAPPING[HerbertTokenizer] = HerbertTokenizerFast


def tokenizer_class_from_name(class_name: str):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function can also be transformed to a dict if we prefer.

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

SLOW_TOKENIZER_MAPPING = {
k: (v[0] if v[0] is not None else v[1])
for k, v in TOKENIZER_MAPPING.items()
if (v[0] is not None or v[1] is not None)
}
SLOW_TOKENIZER_MAPPING[HerbertTokenizer] = HerbertTokenizerFast
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As seen offline, probably unnecessary.

@julien-c julien-c merged commit 8090b0f into improve-bert-japanese Nov 20, 2020
@julien-c julien-c deleted the classname_to_class branch November 20, 2020 02:26
julien-c added a commit that referenced this pull request Nov 23, 2020
* Make ci fail

* Try to make tests actually run?

* CI finally failing?

* Fix CI

* Revert "Fix CI"

This reverts commit ca7923b.

* Ooops wrong one

* one more try

* Ok ok let's move this elsewhere

* Alternative to globals() (#8667)

* Alternative to globals()

* Error is raised later so return None

* Sentencepiece not installed make some tokenizers None

* Apply Lysandre wisdom

* Slightly clearer comment?

cc @sgugger

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants