[WIP] Near search: modify interval calculation #1519

Draft · wants to merge 2 commits into main
Conversation

HashidaTKS (Contributor):
WIP

HashidaTKS marked this pull request as draft on February 9, 2023.
HashidaTKS (Contributor, Author):

I don't think adjusting n_token_infos this way is a good approach, but I haven't come up with a better one...

@@ -10217,7 +10231,7 @@ token_info_build_near_phrase(grn_ctx *ctx,
   }

   uint32_t n_before = data->n_token_infos;
-  rc = token_info_build_phrase(ctx, data, phrase, phrase_len);
+  rc = token_info_build_phrase(ctx, data, phrase, phrase_len, GRN_TOKENIZE_ADD);
Member:

We should not use GRN_TOKENIZE_ADD in search. It may add new tokens.

HashidaTKS (Contributor, Author):

Hmm... I see.

The current problem is that when the tokenizer is an N-gram tokenizer, n_token_infos does not seem to be the expected value.

The interval for a near-phrase series is calculated as:

interval = (interval between the tops of the two words) - (n_token_infos of the left word) + 1

For example, consider a target abcdefg and the query * NP "abc ef":

- interval between the tops of the two words: interval between a and e = 4
- n_token_infos of the left word: token count of abc = 2 (ab and bc under Bi-gram with GRN_TOKENIZE_GET)

As a result, interval is 3.

But abcdefg is tokenized to ab bc cd de ef fg by Bi-gram.
I think abc should be treated as also covering the token cd, i.e. as having 3 tokens (ab bc cd) for the interval calculation, because cd contains part of abc (the c). However, abc is tokenized to only ab bc under Bi-gram with GRN_TOKENIZE_GET.
The reason I used GRN_TOKENIZE_ADD is that with it, abc is tokenized to ab bc c and the token count matches the expected value. As you point out, though, the result doesn't seem semantically correct...

Is there a way to tokenize abc to ab bc c with Bi-gram and GRN_TOKENIZE_GET?
I would like to get suffix tokens shorter than the N-gram's N (e.g. N = 2 for Bi-gram, N = 3 for Tri-gram).

HashidaTKS (Contributor, Author):

Or, do you have any other idea for calculating the interval?
I can't think of a good one...

Member:

We should not use c for search because it's redundant: it causes extra, needless searching and has a performance penalty.

grn_token_have_overlap() may help us...

@HashidaTKS HashidaTKS changed the title [WIP] Near search: enable to specify min interval [WIP] Near search: modify interval calculation Feb 10, 2023