Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace mecab-python3 with fugashi for Japanese tokenization #6086

Merged
merged 6 commits into from
Jul 31, 2020

Conversation

polm
Copy link
Contributor

@polm polm commented Jul 28, 2020

This replaces mecab-python3 with fugashi for Japanese tokenization. I am
the maintainer of both projects.

Both projects are MeCab wrappers, so the underlying C++ code is the
same. fugashi is the newer wrapper and doesn't use SWIG, so for basic
use of the MeCab API it's easier to use.

This code insures the use of a version of ipadic installed via pip,
which should make versioning and tracking down issues easier.

fugashi has wheels for Windows, OSX, and Linux, which will help with
issues with installing old versions of mecab-python3 on Windows.
Compared to mecab-python3, because fugashi doesn't use SWIG, it doesn't
require a C++ runtime to be installed on Windows.

In adding this change I removed some code dealing with cursor,
token_start, and token_end variables. These variables didn't seem to
be used for anything, it is unclear to me why they were there.

I ran the tests and they passed. For reference, since I had trouble figuring it out, this is needed to run the tests:

RUN_CUSTOM_TOKENIZERS=yes RUN_SLOW=1 pytest -rs tests/test_tokenization_bert_japanese.py

This is a followup to #5375 .

It's not in this PR, but because installing MeCab separately is not required, it might be a good idea to have the docs changed to point directly to fugashi instead of MeCab.

polm added 2 commits July 28, 2020 16:03
This replaces mecab-python3 with fugashi for Japanese tokenization. I am
the maintainer of both projects.

Both projects are MeCab wrappers, so the underlying C++ code is the
same. fugashi is the newer wrapper and doesn't use SWIG, so for basic
use of the MeCab API it's easier to use.

This code insures the use of a version of ipadic installed via pip,
which should make versioning and tracking down issues easier.

fugashi has wheels for Windows, OSX, and Linux, which will help with
issues with installing old versions of mecab-python3 on Windows.
Compared to mecab-python3, because fugashi doesn't use SWIG, it doesn't
require a C++ runtime to be installed on Windows.

In adding this change I removed some code dealing with `cursor`,
`token_start`, and `token_end` variables. These variables didn't seem to
be used for anything, it is unclear to me why they were there.

I ran the tests and they passed, though I couldn't figure out how to run
the slow tests (`--runslow` gave an error) and didn't try testing with
Tensorflow.
@polm polm changed the title Fix japanese Replace mecab-python3 with fugashi for Japanese tokenization Jul 28, 2020
Forgot to delete this...
@codecov
Copy link

codecov bot commented Jul 28, 2020

Codecov Report

Merging #6086 into master will increase coverage by 1.14%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6086      +/-   ##
==========================================
+ Coverage   78.35%   79.50%   +1.14%     
==========================================
  Files         146      146              
  Lines       26454    26450       -4     
==========================================
+ Hits        20729    21030     +301     
+ Misses       5725     5420     -305     
Impacted Files Coverage Δ
src/transformers/tokenization_bert_japanese.py 32.05% <0.00%> (+1.56%) ⬆️
src/transformers/tokenization_bart.py 60.56% <0.00%> (-35.22%) ⬇️
src/transformers/modeling_tf_gpt2.py 65.42% <0.00%> (-29.91%) ⬇️
src/transformers/generation_tf_utils.py 83.45% <0.00%> (-2.76%) ⬇️
src/transformers/modeling_tf_mobilebert.py 96.77% <0.00%> (+73.38%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 91cb954...3f97763. Read the comment docs.

Copy link
Contributor

@JetRunner JetRunner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. As long as the tokenization results are the same (it should according to your note).
Maybe @LysandreJik wants to make sure the dependency is good?

Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as well and since the tests pass, the tokenization looks the same. Thanks a lot for tackling this!

Only place the mecab github is linked is in pretrained_models.rst from a quick search. Maybe we can add the change of link to this PR to make it perfectly clean? I can force-push on your branch if you prefer me to do it.

@polm
Copy link
Contributor Author

polm commented Jul 29, 2020

My pleasure. It'd be great if you could edit the docs 👍 I think I gave edit permissions so you should be able to push normally.

setup.py Outdated
@@ -97,7 +97,7 @@
"isort @ git+git://github.com/timothycrosley/isort.git@e63ae06ec7d70b06df9e528357650281a3d3ec22#egg=isort",
"flake8",
]
extras["dev"] = extras["testing"] + extras["quality"] + ["mecab-python3<1", "scikit-learn", "tensorflow<=2.2", "torch"]
extras["dev"] = extras["testing"] + extras["quality"] + ["fugashi>=1.0,<2.0", "ipadic", "scikit-learn", "tensorflow<=2.2", "torch"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: the deps here are not with the same version tags as the deps up there.

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, LGTM! Thanks for taking care of it @polm

@LysandreJik LysandreJik merged commit cf3cf30 into huggingface:master Jul 31, 2020
@BramVanroy
Copy link
Collaborator

Awesome, this has been a small but sore thorn in some issues posted here. Thanks a lot @polm!

@polm
Copy link
Contributor Author

polm commented Jul 31, 2020

If you ever have any other issues with it please feel free to tag me any time.

Mehrad0711 pushed a commit to Mehrad0711/transformers that referenced this pull request Aug 3, 2020
…face#6086)

* Replace mecab-python3 with fugashi

This replaces mecab-python3 with fugashi for Japanese tokenization. I am
the maintainer of both projects.

Both projects are MeCab wrappers, so the underlying C++ code is the
same. fugashi is the newer wrapper and doesn't use SWIG, so for basic
use of the MeCab API it's easier to use.

This code insures the use of a version of ipadic installed via pip,
which should make versioning and tracking down issues easier.

fugashi has wheels for Windows, OSX, and Linux, which will help with
issues with installing old versions of mecab-python3 on Windows.
Compared to mecab-python3, because fugashi doesn't use SWIG, it doesn't
require a C++ runtime to be installed on Windows.

In adding this change I removed some code dealing with `cursor`,
`token_start`, and `token_end` variables. These variables didn't seem to
be used for anything, it is unclear to me why they were there.

I ran the tests and they passed, though I couldn't figure out how to run
the slow tests (`--runslow` gave an error) and didn't try testing with
Tensorflow.

* Style fix

* Remove unused variable

Forgot to delete this...

* Adapt doc with install instructions

* Fix typo

Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants