Support additional dictionaries for BERT Japanese tokenizers #6515

Merged (3 commits) on Aug 17, 2020

singletongue
Contributor

This PR adds support for additional dictionaries in the BERT Japanese tokenizers.

Specifically, it adds support for the unidic_lite and unidic dictionaries.
Like ipadic, both are pip-installable and compatible with the fugashi package introduced in #6086 by @polm.

(We also plan to release newly pre-trained BERT models that use these dictionaries.)
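
As a rough illustration of the dictionary selection described above, the sketch below maps a dictionary name to its installed package and builds a MeCab option string from it. It is not the code from this PR: `build_mecab_option`, `SUPPORTED_DICTIONARIES`, and the `import_module` parameter are hypothetical names chosen for the example. The only assumptions about the real packages are that each dictionary package exposes a `DICDIR` attribute and ships a `mecabrc` file in that directory.

```python
import importlib
import os
from types import SimpleNamespace

# Dictionary names mentioned in the PR description (hypothetical constant name).
SUPPORTED_DICTIONARIES = ("ipadic", "unidic_lite", "unidic")


def build_mecab_option(dic_name, import_module=importlib.import_module):
    """Return a MeCab option string that points at the chosen dictionary.

    `import_module` is injectable so the sketch can be exercised without
    any dictionary package installed.
    """
    if dic_name not in SUPPORTED_DICTIONARIES:
        raise ValueError(
            f"unknown dictionary {dic_name!r}; expected one of {SUPPORTED_DICTIONARIES}"
        )
    try:
        # Import lazily so that only the requested dictionary must be installed.
        module = import_module(dic_name)
    except ImportError as error:
        raise ImportError(f"dictionary package {dic_name!r} is not installed") from error

    # Assumption: the package exposes DICDIR and keeps a mecabrc file inside it.
    dic_dir = module.DICDIR
    mecabrc = os.path.join(dic_dir, "mecabrc")
    return f'-d "{dic_dir}" -r "{mecabrc}"'


# Demo with a stand-in module, so this runs without any dictionary installed.
fake_import = lambda name: SimpleNamespace(DICDIR=os.path.join("dicts", name))
print(build_mecab_option("unidic_lite", import_module=fake_import))
```

With fugashi and one of the dictionaries installed, the returned string could presumably be passed to `fugashi.GenericTagger` to create a tagger backed by that dictionary. Note that the full unidic package may also need its dictionary data downloaded separately after installation.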

Contributor

@JetRunner left a comment

This PR looks great to me! Since @polm has already reacted with 👍, I think it's good to go.
The tests failed, though, so could you have a look?

@singletongue
Contributor Author

Thank you, @JetRunner and @polm.

I've fixed the test-related issues and it should be OK now.

@JetRunner merged commit 48c6c61 into huggingface:master on Aug 17, 2020
@singletongue deleted the update_bert_japanese_tokenizers branch on September 3, 2020 03:16
Zigur pushed a commit to Zigur/transformers that referenced this pull request Oct 26, 2020
…face#6515)

* Update BERT Japanese tokenizers

* Update CircleCI config to download unidic

* Specify to use the latest dictionary packages
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020