Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make MLuke tokenizer tests slow #14690

Merged
merged 1 commit into from
Dec 8, 2021
Merged

Make MLuke tokenizer tests slow #14690

merged 1 commit into from
Dec 8, 2021

Conversation

sgugger
Copy link
Collaborator

@sgugger sgugger commented Dec 8, 2021

What does this PR do?

To avoid the timeouts in the test suites, this PR makes all mLuke tokenizer tests slow until we figure out a way to speed them up.

@sgugger sgugger merged commit e621932 into master Dec 8, 2021
@sgugger sgugger deleted the mluke_slow branch December 8, 2021 20:59
@LysandreJik
Copy link
Member

cc @NielsRogge

@ryokan0123
Copy link
Contributor

Hi, I suspect that the slowness is from reading the large entity_vocab.json in MLukeTokenizer.__init__(), which has 418 MB (studio-ousia/mluke-base).
When I tried on my local PC, only reading the entity vocab file in __init__() took around 20s.

If this is the problem, we can reduce the file size.
The entity vocabulary has 120M entries, but now we naively represent the same entities from different languages separate entries, like

{
	“en:Japan”: 13,
	“de:Japan”: 13,
	'ko:일본’: 13,
	“ja:日本”: 13,
	...
}

We could instead represent the entities with Wikidata QID (e.g., Q17 for Japan), which would significantly reduce the file size. Using QID may look cryptic at first, but now we think this may be the right way to handle entities in multilingual settings.
Do you have any suggestions on this?

@sgugger
Copy link
Collaborator Author

sgugger commented Dec 9, 2021

For tests we don't need a real file (apart from the integration test). All other tokenizers are tested on a vocab of 10-15 tokens, so ideally building something that just have the strict minimum to test would be great.

@ryokan0123
Copy link
Contributor

Sure, I will work on it.

Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants