[WIP] Near search: modify interval calculation #1519
Conversation
I think this is not a good approach to adjusting `n_token_infos`, but I don't have a better idea...
```diff
@@ -10217,7 +10231,7 @@ token_info_build_near_phrase(grn_ctx *ctx,
   }

   uint32_t n_before = data->n_token_infos;
-  rc = token_info_build_phrase(ctx, data, phrase, phrase_len);
+  rc = token_info_build_phrase(ctx, data, phrase, phrase_len, GRN_TOKENIZE_ADD);
```
We should not use `GRN_TOKENIZE_ADD` in search. It may add new tokens.
Hmm... I see.
The current problem is that when the tokenizer is an N-gram type, `n_token_infos` doesn't seem to have the expected value.
The `interval` for the near phrase series is calculated as: `interval = (interval between the tops of the two words) - (n_token_infos of the left word) + 1`.
For example, consider a case where the target is `abcdefg` and we execute `*NP "abc ef"`:

- interval between the tops of the two words: the interval between `a` and `e` = 4
- `n_token_infos` of the left word: the token count of `abc` = the number of tokens in `ab bc` (Bi-gram with `GRN_TOKENIZE_GET`) = 2

As a result, `interval` is 3.
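To make the arithmetic concrete, here is a minimal, self-contained sketch of the formula applied to this example; the variable names (`top_gap`, `n_token_infos`) are made up for illustration:

```c
#include <stdio.h>

int
main(void)
{
  /* Target `abcdefg`, query `*NP "abc ef"`. */
  int top_gap = 4;       /* interval between the tops `a` and `e` */
  int n_token_infos = 2; /* tokens of `abc` with Bi-gram + GRN_TOKENIZE_GET:
                            `ab`, `bc` */
  int interval = top_gap - n_token_infos + 1;
  printf("interval = %d\n", interval); /* prints 3 */
  /* If `abc` were counted as 3 tokens (`ab bc cd`, see below),
     the result would be 4 - 3 + 1 = 2. */
  return 0;
}
```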
But... `abcdefg` is tokenized to `ab bc cd de ef fg` by Bi-gram. I think `abc` should be considered as also having the token `cd`, i.e. 3 tokens (`ab bc cd`), for the `interval` calculation, because `cd` contains part of `abc` (the `c`). However, `abc` is tokenized to just `ab bc` by Bi-gram with `GRN_TOKENIZE_GET`.
The reason I use `GRN_TOKENIZE_ADD` is that when it is specified, `abc` is tokenized to `ab bc c`, and the number of tokens matches the expected value. It's just that, as a result, it doesn't seem to be semantically correct...
Is there a way to tokenize `abc` to `ab bc c` with Bi-gram and `GRN_TOKENIZE_GET`?
I would like to get suffix tokens shorter than the N-gram's N (e.g. N = 2 for Bi-gram, N = 3 for Tri-gram).
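To show the difference I mean, here is a plain-C simulation of how I understand the two modes behave on a Bi-gram tokenizer (this is only an illustration, not the actual TokenBigram implementation):

```c
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

/* Simulated Bi-gram tokenization. With add_suffix = false (GET-like
   behavior) only full N-grams are emitted; with add_suffix = true
   (ADD-like behavior) the trailing tokens shorter than N are emitted
   as well. */
static void
bigram_tokenize(const char *text, bool add_suffix)
{
  const size_t n = 2; /* Bi-gram */
  size_t len = strlen(text);
  for (size_t i = 0; i < len; i++) {
    size_t remaining = len - i;
    if (remaining >= n) {
      /* full N-gram starting at i; consecutive tokens overlap: ab, bc, ... */
      printf("%.*s ", (int)n, text + i);
    } else if (add_suffix) {
      /* ADD-like: also emit the shorter trailing token, e.g. `c` */
      printf("%.*s ", (int)remaining, text + i);
    }
  }
  printf("\n");
}

int
main(void)
{
  bigram_tokenize("abc", false); /* ab bc   (GET-like) */
  bigram_tokenize("abc", true);  /* ab bc c (ADD-like) */
  return 0;
}
```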
Or, do you have any other good idea for calculating `interval`?
I can't think of any good ideas...
We should not use `c` for search because it's redundant: it causes extra needless searches and has a performance penalty.
`grn_token_have_overlap()` may help us...
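Something like the following, perhaps (a rough, untested sketch of the direction I mean; the helper name `count_phrase_tokens` and this exact cursor usage are my assumptions, not code from this PR):

```c
#include <groonga.h>

/* Rough sketch: count the tokens of a phrase with GRN_TOKENIZE_GET and
 * use grn_token_have_overlap() to detect overlapping N-gram tokens,
 * instead of searching a redundant short token like `c`. */
static uint32_t
count_phrase_tokens(grn_ctx *ctx, grn_obj *lexicon,
                    const char *phrase, size_t phrase_len)
{
  uint32_t n_tokens = 0;
  grn_token_cursor *cursor =
    grn_token_cursor_open(ctx, lexicon, phrase, phrase_len,
                          GRN_TOKENIZE_GET, 0);
  if (!cursor) {
    return 0;
  }
  while (grn_token_cursor_next(ctx, cursor) != GRN_ID_NIL) {
    grn_token *token = grn_token_cursor_get_token(ctx, cursor);
    if (grn_token_have_overlap(ctx, token)) {
      /* Overlapping tokens (`ab`/`bc`) each cover one extra source
       * character; that coverage could be folded into the interval
       * calculation here instead of emitting `c`. */
    }
    n_tokens++;
  }
  grn_token_cursor_close(ctx, cursor);
  return n_tokens;
}
```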
WIP