[WIP] Near search: modify interval calculation #1519
Conversation
I think this is not a good approach to adjusting `n_token_infos`, but I don't have a better idea...
```diff
@@ -10217,7 +10231,7 @@ token_info_build_near_phrase(grn_ctx *ctx,
   }

   uint32_t n_before = data->n_token_infos;
-  rc = token_info_build_phrase(ctx, data, phrase, phrase_len);
+  rc = token_info_build_phrase(ctx, data, phrase, phrase_len, GRN_TOKENIZE_ADD);
```
We should not use `GRN_TOKENIZE_ADD` in search. It may add new tokens.
Hmm... I see.
The current problem is that when the tokenizer is an N-gram type, `n_token_infos` doesn't seem to have the expected value.
The `interval` for the near phrase series is calculated as: `interval = (interval between the tops of the two words) - (n_token_infos of the left word) + 1`.
For example, consider a case where the target is `abcdefg` and we execute `*NP "abc ef"`:

- interval between the tops of the two words: the interval between `a` and `e` = 4
- `n_token_infos` of the left word: the token count of `abc` = the number of tokens in `ab bc` (Bi-gram with `GRN_TOKENIZE_GET`) = 2

As a result, `interval` is 3.
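To make the arithmetic concrete, here is a minimal, self-contained sketch of the formula applied to this example; the variable names (`top_gap`, `n_token_infos`) are made up for illustration:

```c
#include <stdio.h>

int
main(void)
{
  /* Target `abcdefg`, query `*NP "abc ef"`. */
  int top_gap = 4;       /* interval between the tops `a` and `e` */
  int n_token_infos = 2; /* tokens of `abc` with Bi-gram + GRN_TOKENIZE_GET:
                            `ab`, `bc` */
  int interval = top_gap - n_token_infos + 1;
  printf("interval = %d\n", interval); /* prints 3 */
  /* If `abc` were counted as 3 tokens (`ab bc cd`, see below),
     the result would be 4 - 3 + 1 = 2. */
  return 0;
}
```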
But... `abcdefg` is tokenized to `ab bc cd de ef fg` by Bi-gram. I think `abc` should be considered as also having the token `cd`, i.e. 3 tokens (`ab bc cd`), for the `interval` calculation, because `cd` contains part of `abc` (the `c`). However, `abc` is tokenized to just `ab bc` by Bi-gram with `GRN_TOKENIZE_GET`.
The reason I use `GRN_TOKENIZE_ADD` is that when it is specified, `abc` is tokenized to `ab bc c`, and the number of tokens matches the expected value. It's just that, as a result, it doesn't seem to be semantically correct...
Is there a way to tokenize `abc` to `ab bc c` with Bi-gram and `GRN_TOKENIZE_GET`?
I would like to get suffix tokens shorter than the N-gram's N (e.g. N = 2 for Bi-gram, N = 3 for Tri-gram).
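To show the difference I mean, here is a plain-C simulation of how I understand the two modes behave on a Bi-gram tokenizer (this is only an illustration, not the actual TokenBigram implementation):

```c
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

/* Simulated Bi-gram tokenization. With add_suffix = false (GET-like
   behavior) only full N-grams are emitted; with add_suffix = true
   (ADD-like behavior) the trailing tokens shorter than N are emitted
   as well. */
static void
bigram_tokenize(const char *text, bool add_suffix)
{
  const size_t n = 2; /* Bi-gram */
  size_t len = strlen(text);
  for (size_t i = 0; i < len; i++) {
    size_t remaining = len - i;
    if (remaining >= n) {
      /* full N-gram starting at i; consecutive tokens overlap: ab, bc, ... */
      printf("%.*s ", (int)n, text + i);
    } else if (add_suffix) {
      /* ADD-like: also emit the shorter trailing token, e.g. `c` */
      printf("%.*s ", (int)remaining, text + i);
    }
  }
  printf("\n");
}

int
main(void)
{
  bigram_tokenize("abc", false); /* ab bc   (GET-like) */
  bigram_tokenize("abc", true);  /* ab bc c (ADD-like) */
  return 0;
}
```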
Or, do you have any other good idea for calculating `interval`?
I can't think of any good ideas...
We should not use `c` for search because it's redundant: it causes extra needless searches and has a performance penalty.
`grn_token_have_overlap()` may help us...
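Something like the following, perhaps (a rough, untested sketch of the direction I mean; the helper name `count_phrase_tokens` and this exact cursor usage are my assumptions, not code from this PR):

```c
#include <groonga.h>

/* Rough sketch: count the tokens of a phrase with GRN_TOKENIZE_GET and
 * use grn_token_have_overlap() to detect overlapping N-gram tokens,
 * instead of searching a redundant short token like `c`. */
static uint32_t
count_phrase_tokens(grn_ctx *ctx, grn_obj *lexicon,
                    const char *phrase, size_t phrase_len)
{
  uint32_t n_tokens = 0;
  grn_token_cursor *cursor =
    grn_token_cursor_open(ctx, lexicon, phrase, phrase_len,
                          GRN_TOKENIZE_GET, 0);
  if (!cursor) {
    return 0;
  }
  while (grn_token_cursor_next(ctx, cursor) != GRN_ID_NIL) {
    grn_token *token = grn_token_cursor_get_token(ctx, cursor);
    if (grn_token_have_overlap(ctx, token)) {
      /* Overlapping tokens (`ab`/`bc`) each cover one extra source
       * character; that coverage could be folded into the interval
       * calculation here instead of emitting `c`. */
    }
    n_tokens++;
  }
  grn_token_cursor_close(ctx, cursor);
  return n_tokens;
}
```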
WIP