Backslashes in the BERT tokenizer #25

watanabe2362 · 2020-09-23T00:31:06Z

While using BERT under Anaconda on Windows 10, I ran into the error below. It looks like the tokenizer backend has been switched to fugashi.

from sentence_transformers import SentenceTransformer
from sentence_transformers import models
transformer = models.BERT('cl-tohoku/bert-base-japanese-whole-word-masking')
Running the code above gave:

------------------- ERROR DETAILS ------------------------
arguments: [b'fugashi', b'-C', b'-d', b'C:UsersnwAnaconda3envsPyTorchCUDA10_1libsite-packagesipadicdicdir', b'-r', b'C:UsersnwAnaconda3envsPyTorchCUDA10_1libsite-packagesipadicdicdirmecabrc']
error message: param.cpp(69) [ifs] no such file or directory: C:UsersnwAnaconda3envsPyTorchCUDA10_1libsite-packagesipadicdicdirmecabrc
----------------------------------------------------------
RuntimeError: Failed initializing MeCab

That is the error I got.

So, at line 252 of tokenization_bert_japanese.py in the transformers package, I added a replace call:
mecabrc = os.path.join(dic_dir, "mecabrc")
mecab_option = "-d {} -r {} ".format(dic_dir, mecabrc) + mecab_option
mecab_option = mecab_option.replace('\\', '/')
With the replace added, the problem seems to be resolved.
Is this the right way to fix it?


polm · 2020-09-23T02:42:40Z

Thank you for the report.

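The cause appears to be shlex: any \ that is not quoted gets dropped. Leaving the paths unquoted caused other errors as well, so a fix PR has already been merged in transformers, but it has not been released yet.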
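A minimal sketch of the behavior described above, assuming the option string is tokenized with Python's `shlex.split` in its default POSIX mode. The path is a made-up example, not the reporter's actual directory.

```python
import shlex

# Hypothetical Windows-style dictionary path, for illustration only.
dic_dir = r"C:\Users\example\ipadic\dicdir"

# Unquoted: in POSIX mode each backslash is treated as an escape character,
# so it is consumed and only the character after it is kept.
unquoted_args = shlex.split("-d {}".format(dic_dir))

print(unquoted_args)  # ['-d', 'C:Usersexampleipadicdicdir']
```

The resulting token has the same shape as the path in the error message: every separator is gone, so the file cannot be found.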

huggingface/transformers#7142

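If the change above gets it running, that is fine too, but the fix made in the PR is shown below. With it, paths that contain spaces are also handled correctly.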

mecab_option = '-d "{}" -r "{}" '.format(dic_dir, mecabrc) + mecab_option
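To illustrate why the quoted form is more robust, here is a small sketch comparing the two format strings, again assuming the options are split with `shlex.split` in POSIX mode. The install path is hypothetical, and `ntpath` is used only so the sketch behaves the same on any platform.

```python
import ntpath  # Windows path rules on any platform
import shlex

# Hypothetical install location containing a space, for illustration only.
dic_dir = r"C:\Program Files\ipadic\dicdir"
mecabrc = ntpath.join(dic_dir, "mecabrc")

# Original format string: paths are not quoted.
unquoted = "-d {} -r {}".format(dic_dir, mecabrc)
# Format string from the PR: each path is wrapped in double quotes.
quoted = '-d "{}" -r "{}"'.format(dic_dir, mecabrc)

unquoted_args = shlex.split(unquoted)
quoted_args = shlex.split(quoted)

# Unquoted: backslashes are dropped and each path is split at the space.
print(unquoted_args)
# Quoted: backslashes and the space survive, giving exactly four arguments.
print(quoted_args)
```

Replacing backslashes with forward slashes avoids the dropped separators, but a path containing a space would still be split into two arguments unless it is quoted.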

polm · 2020-09-24T14:12:28Z

The fix PR has already been merged, so I'll close this for now.

polm closed this as completed Sep 24, 2020
