Fix Japanese handling #107
I made the necessary changes to Travis and it looks like the tests are passing now.
This changes the Japanese tokenizer to use versions of mecab-python3 1.0
or greater. This means the package will work on Windows.
However, since the Japanese tokenizer pulls in heavy dependencies and
isn't necessary unless you're dealing with Japanese, I moved it to
optional dependencies. You can install sacrebleu with Japanese support
like below:
pip install sacrebleu[ja]
That will install mecab-python3 and ipadic.
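Because the tokenizer is now an optional extra, calling code has to cope with MeCab being absent. Below is a minimal sketch of that pattern. The helper name `tokenize_ja` is hypothetical and is not sacrebleu's API; the `MeCab.Tagger(ipadic.MECAB_ARGS + " -Owakati")` call is the one shown in the diff under review.

```python
def tokenize_ja(text):
    """Split Japanese text into space-separated tokens with MeCab + IPAdic.

    Hypothetical helper for illustration only; not part of sacrebleu's API.
    """
    try:
        # Imported lazily so everything else keeps working without the extra.
        import MeCab
        import ipadic
    except ImportError as exc:
        raise RuntimeError(
            "Japanese support is not installed; run: pip install sacrebleu[ja]"
        ) from exc

    # -Owakati asks MeCab for whitespace-separated surface forms only.
    # A real implementation would build the tagger once and reuse it.
    tagger = MeCab.Tagger(ipadic.MECAB_ARGS + " -Owakati")
    return tagger.parse(text).strip()
```

Without the `[ja]` extra the call fails with an error that names the install command, rather than with a bare `ImportError` at import time.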
This also includes basic tests to check that the tokenization is as
expected for IPAdic.
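A rough sketch of what such a check could look like follows. It is not the test added in this PR. It assumes the standard-library `unittest` runner and checks only a weak property, that segmentation preserves the characters, rather than an exact token sequence. The class name and the `run_suite` helper are made up for illustration.

```python
import importlib.util
import unittest

# True only when both optional packages are importable.
HAVE_JA = all(
    importlib.util.find_spec(name) is not None for name in ("MeCab", "ipadic")
)


@unittest.skipUnless(HAVE_JA, "requires: pip install sacrebleu[ja]")
class JapaneseTokenizationTest(unittest.TestCase):
    def test_wakati_round_trip(self):
        import MeCab
        import ipadic

        text = "すもももももももものうち"
        tagger = MeCab.Tagger(ipadic.MECAB_ARGS + " -Owakati")
        tokens = tagger.parse(text).split()

        # Segmentation should only insert boundaries, never alter characters.
        self.assertEqual("".join(tokens), text)
        self.assertGreater(len(tokens), 1)


def run_suite():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        JapaneseTokenizationTest
    )
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Skipping when the extra is missing keeps the suite green on machines without MeCab. A CI job that is meant to cover Japanese would still need to install `sacrebleu[ja]` so the test actually runs.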
ozancaglayan left a comment:
Thank you for this nice patch! I made a small comment. So pip install sacrebleu[ja] does a full install with mecab, right?
self.tagger = MeCab.Tagger(ipadic.MECAB_ARGS + " -Owakati")

# make sure the dictionary is IPA
# sacreBLEU is only compatible with 0.996.5 for now
Can you remove these comments related to 0.996.5 then?
Also, can you rebase your branch? master has moved forward by one commit in the meantime.
@ozancaglayan I think you can merge master into this polm:japanese-fix branch yourself, e.g. by clicking on the "Update branch" button on GitHub. @mjpost prefers to squash all commits in the PR, so in the end it does not matter if you rebase the PR or merge in master.
I removed the comments and updated the branch.
That's correct.
Thanks for the PR. I will test this a bit further tomorrow, and then @mjpost could do a release to fix the Windows installation issue.
@polm why does
That's the version of the underlying C++ library, which is not the same as the version number used for |
You may need to make changes to your automated testing setup to install the Japanese dependencies.