Skip to content

Fix Japanese handling#107

Merged
ozancaglayan merged 4 commits intomjpost:masterfrom
polm:japanese-fix
Jul 29, 2020
Merged

Fix Japanese handling#107
ozancaglayan merged 4 commits intomjpost:masterfrom
polm:japanese-fix

Conversation

@polm
Copy link
Copy Markdown
Contributor

@polm polm commented Jul 25, 2020

This changes the Japanese tokenizer to use versions of mecab-python3 1.0
or greater. This means the package will work on Windows.

However, since the Japanese tokenizer pulls in heavy dependencies and
isn't necessary unless you're dealing with Japanese, I moved it to
optional dependencies. You can install sacrebleu with Japanese support
like below:

pip install sacrebleu[ja]

That will install mecab-python3 and ipadic.

This also includes basic tests to check that the tokenization is as
exepcted for IPAdic.

You may need to make changes to your automated testing setup to install the Japanese dependencies.

@polm
Copy link
Copy Markdown
Contributor Author

polm commented Jul 25, 2020

I made the necessary changes to Travis and it looks like the tests are passing now.

polm added 2 commits July 26, 2020 01:48
This changes the Japanese tokenizer to use versions of mecab-python3 1.0
or greater. This means the package will work on Windows.

However, since the Japanese tokenizer pulls in heavy dependencies and
isn't necessary unless you're dealing with Japanese, I moved it to
optional dependencies. You can install sacrebleu with Japanese support
like below:

    pip install sacrebleu[ja]

That will install mecab-python3 and ipadic.

This also includes basic tests to check that the tokenization is as
exepcted for IPAdic.
Copy link
Copy Markdown
Collaborator

@ozancaglayan ozancaglayan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this nice patch! I made a small comment. So pip install sacrebleu[ja] does a full install with mecab, right?

self.tagger = MeCab.Tagger(ipadic.MECAB_ARGS + " -Owakati")

# make sure the dictionary is IPA
# sacreBLEU is only compatible with 0.996.5 for now
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove these comments related to 0.996.5 then?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can you rebase your branch as master moved forward by 1 commit in the meanwhile?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ozancaglayan I think you can merge master into this polm:japanese-fix branch yourself, e.g. by clicking on the "Update branch" button on GitHub. @mjpost prefers to squash all commits in the PR, so in the end it does not matter if you rebase the PR or merge in master.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the comments and updated the branch.

@polm
Copy link
Copy Markdown
Contributor Author

polm commented Jul 29, 2020

So pip install sacrebleu[ja] does a full install with mecab, right?

That's correct.

@ozancaglayan ozancaglayan merged commit 6e663d6 into mjpost:master Jul 29, 2020
@ozancaglayan
Copy link
Copy Markdown
Collaborator

Thanks for the PR. I will test this tomorrow a bit further and then @mjpost could do a release to fix Windows installation issue.

@ozancaglayan
Copy link
Copy Markdown
Collaborator

@polm why does tagger.version() still returns 0.996? Shouldn't this reflect MeCab' version?

@polm
Copy link
Copy Markdown
Contributor Author

polm commented Jul 30, 2020

That's the version of the underlying C++ library, which is not the same as the version number used for mecab-python3, the Python wrapper.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants