ii: remove redudant normalize on search as much as possible #1421

komainu8 · 2022-09-28T06:35:33Z

Some normalizers, such as NormalizerTable, aren't idempotent, so we should not normalize the input
multiple times.
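A minimal sketch of why non-idempotence matters, assuming a hypothetical single-character replacement table (`toy_rule`, `toy_rules`, and `toy_normalize` are made up for illustration and are not Groonga's NormalizerTable implementation). Because the output of one rule is the input of another, applying the normalizer twice gives a different result than applying it once:

```c
#include <stddef.h>

/* Hypothetical single-character replacement rule; not Groonga's API. */
typedef struct {
  char from;
  char to;
} toy_rule;

/* The output of the first rule ('b') is the input of the second rule. */
static const toy_rule toy_rules[] = {
  {'a', 'b'},
  {'b', 'c'},
};

/* Replace each input character at most once, left to right.
 * The result is truncated if output is too small. */
static void
toy_normalize(const char *input, char *output, size_t output_size)
{
  size_t i;
  if (output_size == 0) {
    return;
  }
  for (i = 0; input[i] != '\0' && i + 1 < output_size; i++) {
    char c = input[i];
    size_t r;
    for (r = 0; r < sizeof(toy_rules) / sizeof(toy_rules[0]); r++) {
      if (toy_rules[r].from == input[i]) {
        c = toy_rules[r].to;
        break;
      }
    }
    output[i] = c;
  }
  output[i] = '\0';
}
```

With these toy rules, normalizing "ab" once yields "bc", and normalizing that result again yields "cc", so any code path that normalizes twice looks up a different key than a path that normalizes once.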

The current token_info_open()-related code normalizes the input multiple times. This change avoids
multiple normalizations as much as possible. If a tokenizer enables report_source_location, we can
always avoid multiple normalizations. If a tokenizer doesn't enable report_source_location, we may
still normalize the input multiple times.

See the added tests for problem cases.

komainu8 · 2022-09-28T22:03:02Z

@kou @HashidaTKS Could you review this PR?

kou · 2022-09-28T23:22:43Z

lib/db.c

@@ -3748,6 +3748,34 @@ grn_table_add_by_key(grn_ctx *ctx, grn_obj *table, grn_obj *key, int *added)
  return id;
 }

+grn_id
+grn_table_get_without_normalize(grn_ctx *ctx,


I don't want to add this. If we add this, we will also want to add grn_table_cursor_open_without_normalize().

I think that this is an ii problem. ii should not re-get a token from its token ID; it should use the token's source instead. We may be able to use grn_token_get_source_offset() and grn_token_get_source_length() for this. (We could add grn_token_get_source() for convenience.)
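The idea of reading a token's source from its offset and length can be sketched as follows. This is only an illustration under assumptions: `toy_token` and `toy_token_copy_source()` are hypothetical stand-ins, not Groonga types or functions, and the two fields merely mirror what grn_token_get_source_offset() and grn_token_get_source_length() are named for. Which text the offsets refer to inside ii is not stated in this thread, so the sketch simply calls it `input`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a token's source location. */
typedef struct {
  size_t source_offset; /* byte offset of the token in the input */
  size_t source_length; /* byte length of the token in the input */
} toy_token;

/* Copy the bytes that the token came from into buffer, without looking
 * the token up again by ID. Returns 0 on success, -1 if the location is
 * out of range or the buffer is too small. */
static int
toy_token_copy_source(const char *input,
                      size_t input_length,
                      const toy_token *token,
                      char *buffer,
                      size_t buffer_size)
{
  if (token->source_offset > input_length) {
    return -1;
  }
  if (token->source_length > input_length - token->source_offset) {
    return -1;
  }
  if (token->source_length >= buffer_size) {
    return -1;
  }
  memcpy(buffer, input + token->source_offset, token->source_length);
  buffer[token->source_length] = '\0';
  return 0;
}
```

For the input "bcd" and a token at offset 1 with length 2, the copied source is "cd". Since the bytes are sliced directly from text that has already been processed, no second normalization pass is involved.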

Currently, a search keyword is normalized twice. As a result, depending on the NormalizerTable settings, the keyword is normalized in an unintended way. For example, if we define NormalizerTable as below, "BCD" is processed like this: "BCD" -> normalize -> "bcd" -> tokenize -> "bc"/"cd"/"d" -> normalize -> "bk"/"cd"/"d".

load --table ColumnNormalizations
[
{"target_column": "c", "normalized": "k"},
{"target_column": "ＣＤ", "normalized": "cd"}
]

Therefore, the search keyword "BCD" does not hit "ABCD", while the search keyword "CD" does.
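The second half of that pipeline ("bcd" -> tokenize -> normalize again) can be reproduced with a small simulation. This is a sketch under assumptions, not Groonga code: `demo_rule`, `demo_normalize()`, and `demo_bigram_at()` are hypothetical, the table is simplified to ASCII targets, and rules are assumed to be tried longest target first.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical string replacement rule; not Groonga's NormalizerTable API. */
typedef struct {
  const char *target;
  const char *normalized;
} demo_rule;

/* Ordered longest target first, so "cd" is tried before "c". */
static const demo_rule demo_rules[] = {
  {"cd", "cd"},
  {"c", "k"},
};

/* Normalize input into output; stops early if output is too small. */
static void
demo_normalize(const char *input, char *output, size_t output_size)
{
  size_t in = 0;
  size_t out = 0;
  if (output_size == 0) {
    return;
  }
  while (input[in] != '\0') {
    const char *replacement = input + in; /* default: copy one byte as is */
    size_t replacement_length = 1;
    size_t consumed = 1;
    size_t r;
    for (r = 0; r < sizeof(demo_rules) / sizeof(demo_rules[0]); r++) {
      size_t target_length = strlen(demo_rules[r].target);
      if (strncmp(input + in, demo_rules[r].target, target_length) == 0) {
        replacement = demo_rules[r].normalized;
        replacement_length = strlen(replacement);
        consumed = target_length;
        break;
      }
    }
    if (out + replacement_length >= output_size) {
      break;
    }
    memcpy(output + out, replacement, replacement_length);
    out += replacement_length;
    in += consumed;
  }
  output[out] = '\0';
}

/* Copy the bigram starting at position (or the shorter tail) into token.
 * position must not exceed strlen(text). */
static void
demo_bigram_at(const char *text, size_t position, char token[3])
{
  size_t length = strlen(text + position);
  if (length > 2) {
    length = 2;
  }
  memcpy(token, text + position, length);
  token[length] = '\0';
}
```

Normalizing the whole keyword "bcd" leaves it unchanged because the "cd" rule matches first. Once the keyword is split into "bc", "cd", and "d", the token "bc" no longer contains "cd", so normalizing it again applies the "c" rule and produces "bk". That changed token is what makes the lookup miss.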

komainu8 marked this pull request as ready for review September 28, 2022 22:01

kou reviewed Sep 28, 2022


kou force-pushed the remove-redundant-normalize branch from f7b9883 to 17069d0 Compare September 30, 2022 05:50

komainu8 and others added 3 commits September 30, 2022 15:19

Fix typos

5bab0f1

Don't normalize multiple times as much as possible

6e078c9

kou force-pushed the remove-redundant-normalize branch from 17069d0 to 6e078c9 Compare September 30, 2022 06:19

kou changed the title ~~Remove redudant normalize~~ ii: remove redudant normalize on search as much as possible Sep 30, 2022

komainu8 merged commit 864d863 into master Sep 30, 2022

komainu8 deleted the remove-redundant-normalize branch September 30, 2022 12:03

github-actions bot mentioned this pull request May 22, 2023

nginx: update bundles version 1.23.4 komainu8/groonga#1

Closed
