StopWord does not take effect when TokenMeCab and TokenFilterStopWord are combined with TokenFilterNFKC100 with options #399

Closed
masudakz opened this issue Mar 25, 2021 · 6 comments · Fixed by groonga/groonga#1175

@masudakz

Running the attached test.txt does not produce the result expected from the TokenFilterStopWord tutorial.

DROP TABLE IF EXISTS memos, terms;
CREATE TABLE terms (
  term VARCHAR(64) NOT NULL PRIMARY KEY,
  is_stop_word BOOL NOT NULL
) Engine=Mroonga
COMMENT='tokenizer "TokenMeCab", normalizer "NormalizerNFKC100",
 token_filters "TokenFilterStopWord,TokenFilterNFKC100(''unify_kana'', true)"'
DEFAULT CHARSET=utf8;
CREATE TABLE `memos` (
  `id` int(11) NOT NULL,
  `content` text NOT NULL,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `content` (`content`) COMMENT 'table "terms"'
 ) ENGINE=Mroonga DEFAULT CHARSET=utf8;
INSERT INTO terms VALUES ("AND", true);
INSERT INTO memos VALUES (1, "Hello"),(2, "Hello and Good-bye"),(3, "Good-bye");
SELECT * FROM memos WHERE MATCH (content) AGAINST ('+"Hello and"' IN BOOLEAN MODE);
>source test.txt
...
+----+--------------------+
| id | content            |
+----+--------------------+
|  2 | Hello and Good-bye |
+----+--------------------+
1 row in set (0.034 sec)

The expected result is:

+----+--------------------+
| id | content            |
+----+--------------------+
|  1 | Hello              |
|  2 | Hello and Good-bye |
+----+--------------------+

With test0.txt, which drops the options from TokenFilterNFKC100, the result is as expected:
token_filters "TokenFilterStopWord,TokenFilterNFKC100"

test.txt
test0.txt

@masudakz
Author

Here is my environment information:

Server version: 10.4.17-MariaDB-log MariaDB Server
> show variables like 'mroonga_version';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| mroonga_version | 10.10 |
+-----------------+-------+

@komainu8
Member

The problem is in the part of Groonga that parses the "TokenFilterStopWord,TokenFilterNFKC100(''unify_kana'', true)" string it receives from Mroonga.

token_filters "TokenFilterStopWord, TokenFilterNFKC130(''unify_kana'', true)"'

#<expr
  vars:{
    $1:#<record:pat:terms id:(no value)>
  },
  codes:{
    0:<push n_args:1, flags:0, modify:0, value:#<proc:token-filter TokenFilterStopWord arguments:[]>>,
    1:<push n_args:1, flags:0, modify:3, value:#<proc:token-filter TokenFilterNFKC130 arguments:[]>>,
    2:<push n_args:1, flags:0, modify:0, value:"unify_kana">,
    3:<push n_args:1, flags:0, modify:0, value:true>,
    4:<call n_args:3, flags:0, modify:0, value:(NULL)>,
    5:<comma n_args:2, flags:0, modify:0, value:(NULL)>
  }

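Inside Groonga, "TokenFilterStopWord,TokenFilterNFKC100(''unify_kana'', true)" is parsed as shown above. However, as the code below shows, the current implementation assumes that call always appears as the last code. Here call is followed by comma, so the expression is treated as an invalid argument (GRN_INVALID_ARGUMENT) and the token filters are not registered on the Groonga table.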

  for (; codes < codes_end; codes++) {
    switch (codes[0].op) {
    case GRN_OP_PUSH :
      break;
    case GRN_OP_CALL :
      if (codes + 1 != codes_end) {
        return GRN_FALSE;
      }
      break;
    case GRN_OP_COMMA :
      break;
    default :
      return GRN_FALSE;
    }
  }
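
To make the failure concrete, here is a small standalone sketch of the check above applied to the dumped code sequence. It is not Groonga code: `op_t`, `call_must_be_last`, `call_may_precede_comma`, and the sample arrays are hypothetical names introduced for illustration, and the relaxed variant is only one conceivable loosening of the check, not necessarily what groonga/groonga#1175 does. The sequence for the option-less case is an assumption, since only the with-options dump is shown in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the GRN_OP_* opcodes; only the ops
 * that appear in the dump above are modeled. */
typedef enum { OP_PUSH, OP_CALL, OP_COMMA, OP_OTHER } op_t;

/* Same logic as the quoted loop: a call is accepted only when it is
 * the very last code in the expression. */
static bool call_must_be_last(const op_t *codes, size_t n)
{
  for (size_t i = 0; i < n; i++) {
    switch (codes[i]) {
    case OP_PUSH:
    case OP_COMMA:
      break;
    case OP_CALL:
      if (i + 1 != n) {
        return false;
      }
      break;
    default:
      return false;
    }
  }
  return true;
}

/* Illustrative relaxation (an assumption, not the actual patch):
 * a call may also be followed directly by a comma. */
static bool call_may_precede_comma(const op_t *codes, size_t n)
{
  for (size_t i = 0; i < n; i++) {
    switch (codes[i]) {
    case OP_PUSH:
    case OP_COMMA:
      break;
    case OP_CALL:
      if (i + 1 != n && codes[i + 1] != OP_COMMA) {
        return false;
      }
      break;
    default:
      return false;
    }
  }
  return true;
}

/* Codes 0-5 from the dump: push, push, push, push, call, comma. */
static const op_t with_options[] = {
  OP_PUSH, OP_PUSH, OP_PUSH, OP_PUSH, OP_CALL, OP_COMMA
};

/* Assumed shape for "TokenFilterStopWord,TokenFilterNFKC100":
 * two pushes joined by a comma, with no call at all. */
static const op_t without_options[] = { OP_PUSH, OP_PUSH, OP_COMMA };
```

With these definitions, `call_must_be_last(with_options, 6)` returns false because the call at index 4 is followed by a comma, which matches the rejection described above. `call_must_be_last(without_options, 3)` returns true, consistent with test0.txt working.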

@komainu8
Member

I fixed the Groonga code and it now works in my local environment. I am checking in CI that nothing else is affected.
groonga/groonga#1175

komainu8 added a commit to komainu8/groonga that referenced this issue Apr 14, 2021
GitHub: fix mroonga/mroonga#399.

Reported by MASUDA Kazuhiro. Thanks!!!
@komainu8
Member

komainu8 commented Apr 14, 2021

CI passed, so I opened a PR.
groonga/groonga#1175

komainu8 added a commit to komainu8/groonga that referenced this issue Apr 15, 2021
GitHub: fix mroonga/mroonga#399.

Reported by MASUDA Kazuhiro. Thanks!!!
kou added a commit to groonga/groonga that referenced this issue Apr 20, 2021
GitHub: fix mroonga/mroonga#399.

Reported by MASUDA Kazuhiro. Thanks!!!

Co-authored-by: Sutou Kouhei <kou@clear-code.com>
@komainu8
Member

@masudakz This is fixed.
Please use the next Groonga release (scheduled for 2021/4/29)!

@masudakz
Author

Thank you for the fix.
