Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

near search: add min_interval #1520

Merged
merged 52 commits into from Feb 27, 2023
Merged

near search: add min_interval #1520

merged 52 commits into from Feb 27, 2023

Conversation

HashidaTKS
Copy link
Contributor

@HashidaTKS HashidaTKS commented Feb 10, 2023

Add an argument to specify min_interval for the near search family.
The intervals of phrases(words) should be greater or equals to min_interval.

The default value of min_interval is INT32_MIN(-2147483648).

New syntax:

*N${MAX_INTERVAL},${MAX_TOKEN_INTERVAL_1}|${MAX_TOKEN_INTERVAL_2}|...,${MIN_INTERVAL} "word1 word2 ..."
*NP${MAX_INTERVAL},${ADDITIONAL_LAST_INTERVAL},${MAX_PHRASE_INTERVAL_1}|${MAX_PHRASE_INTERVAL_2}|...,${MIN_INTERVAL} "phrase1 phrase2 ..."
*NPP${MAX_INTERVAL},${ADDITIONAL_LAST_INTERVAL},${MAX_PHRASE_INTERVAL_1}|${MAX_PHRASE_INTERVAL_2}|...,${MIN_INTERVAL} "(phrase1-1 phrase1-2 ...) (phrase2-1 phrase2-2 ...) ..."
*ONP${MAX_INTERVAL},${ADDITIONAL_LAST_INTERVAL},${MAX_PHRASE_INTERVAL_1}|${MAX_PHRASE_INTERVAL_2}|...,${MIN_INTERVAL} "phrase1 phrase2 ..."
*ONPP${MAX_INTERVAL},${ADDITIONAL_LAST_INTERVAL},${MAX_PHRASE_INTERVAL_1}|${MAX_PHRASE_INTERVAL_2}|...,${MIN_INTERVAL} "(phrase1-1 phrase1-2 ...) (phrase2-1 phrase2-2 ...) ..."

We should also specify ${MAX_INTERVAL}, ${ADDITIONAL_LAST_INTERVAL}, ${MAX_PHRASE_INTERVAL_1}( exclude *N) in order to specify ${MIN_INTERVAL} because it is added as the 4th argument.

Example:


table_create Entries TABLE_NO_KEY
[[0,0.0,0.0],true]
column_create Entries content COLUMN_SCALAR Text
[[0,0.0,0.0],true]
table_create Terms TABLE_PAT_KEY ShortText   --default_tokenizer 'TokenNgram("unify_alphabet", false,                                   "unify_digit", false)'   --normalizer NormalizerNFKC121
[[0,0.0,0.0],true]
column_create Terms entries_content COLUMN_INDEX|WITH_POSITION Entries content
[[0,0.0,0.0],true]
load --table Entries
[
{"content": "abc123456789def"}
]
[[0,0.0,0.0],1]
select Entries   --filter 'content *NP-1,0,,12"abc def"'   --output_columns '_score, content'
[[0,0.0,0.0],[[[0],[["_score","Int32"],["content","Text"]]]]]
select Entries   --filter 'content *NP-1,0,,11"abc def"'   --output_columns '_score, content'
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        1
      ],
      [
        [
          "_score",
          "Int32"
        ],
        [
          "content",
          "Text"
        ]
      ],
      [
        1,
        "abc123456789def"
      ]
    ]
  ]
]

In the case above, the interval between abc and def of abc123456789def is 11.
So *NP-1,0,,11 matches and *NP-1,0,,12 doesn't match.

@HashidaTKS
Copy link
Contributor Author

Currently, some tests are failing.
As far as I confirmed, data.max_interval and this.max_interval are correctly the input value , -1 for example, in

def apply(data)
but after that, in
sis[i] = si;
, si->max_interval is always 0...

@HashidaTKS
Copy link
Contributor Author

HashidaTKS commented Feb 13, 2023

This behaviour is caused by my changes, but I can't understand what of my changes does actually affect as I didn't touch max_interval...

@HashidaTKS
Copy link
Contributor Author

HashidaTKS commented Feb 13, 2023

Hmm... I have idiotically misspelled the function name...

static mrb_value
mrb_grn_scan_info_set_min_interval(mrb_state *mrb, mrb_value self)
{
  scan_info *si;
  mrb_int min_interval;

  mrb_get_args(mrb, "i", &min_interval);
  si = DATA_PTR(self);
  grn_scan_info_set_max_interval(si, (int)min_interval); ***should be grn_scan_info_set_min_interval
  return self;
}

@HashidaTKS
Copy link
Contributor Author

@kou

Would you please review this?

@HashidaTKS HashidaTKS changed the title [WIP] near search: add min_interval near search: add min_interval Feb 22, 2023
@HashidaTKS HashidaTKS marked this pull request as ready for review February 22, 2023 06:07
{"content": "abc123456789def"}
]
[[0,0.0,0.0],1]
select Entries --filter 'content *NP-1,0,12,11"abc def"' --output_columns '_score, content'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, when max_elements_intervals specified, the interval between abc and def for abc123456789def is regarded as:

11 when checking min_interval.
12 when checking max interval specified by max_element_intervals.

This is because when checking max_interval and min_interval, the interval is calculated as 11 (

groonga/lib/ii.c

Line 12031 in b50a988

interval -= (data->token_info->n_tokens_in_phrase - 1);
)
And when checking max_element_intervals and min_interval, the interval is calculated as 12. (

groonga/lib/ii.c

Line 11766 in b50a988

int32_t interval = pos - previous_pos;
)
The diff of these intervals come from we don't subtract tokens in a phrase when calculating interval for checking max_element_intervals and min_interval.

I think this is a bug and I plan to work to fix this #1519.

lib/expr.c Show resolved Hide resolved
lib/expr.c Outdated Show resolved Hide resolved
lib/expr.c Outdated Show resolved Hide resolved
Comment on lines 407 to 411
int min_interval;
GRN_INT32_POP(&efsi->min_interval_stack, min_interval);
grn_expr_append_const_int(efsi->ctx, efsi->e, min_interval,
GRN_OP_PUSH, 1);
n_args++;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct?
Does this mean if and only if max_element_intervals is specified, min_interval must be specified too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention is:

  • Users must also specify max_element_intervals in order to specify min_interval.
    • They can omit the actual value of max_element_intervals like *NP-1,0,,10 "hoge hoge"
  • Users can omit min_interval.
  • if this is out of if (max_element_intervals), min_interval could be the 3rd argument, but I couldn't get any good idea to decide that the 3rd argument is min_interval or max_element_intervals.
    • e.g. *NP-1,0,3 "hoge hoge", we can't decide that users intend 3 is min_interval or max_element_intervals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

min_interval must be specified too?

As the comment above, we can omit min_interval in the current implementation.

For example, in the current implementation, if we specify the filter or query as below, min_interval is the default value INT32_MIN.

*NP-1,0,2|1 "a 2 c"

(I confirmed that with a debugger.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will modify this as term of readability.
Please wait a while...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We always a value to efsi->min_interval_stack but this pop conditionally.
Does this work with --filter 'X *N "A B" && X *N,1,2,3 "A B"? Does the second *N use DEFAULT_MIN_INTERVAL not 3?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I confirmed, it works fine. The second *N uses 3.
I have added tests for those cases.

lib/ii.c Outdated Show resolved Hide resolved
lib/mrb/scripts/scan_info_data.rb Outdated Show resolved Hide resolved
lib/mrb/scripts/scan_info_data.rb Outdated Show resolved Hide resolved
@HashidaTKS
Copy link
Contributor Author

@kou

Would you re-review this?

@HashidaTKS
Copy link
Contributor Author

Hmm, I noticed that we have no test case for using this with ADDITIONAL_LAST_INTERVAL.
I will add them.

@HashidaTKS
Copy link
Contributor Author

Hmm, I noticed that we have no test case for using this with ADDITIONAL_LAST_INTERVAL.
I will add them.

Added.

@kou
Copy link
Member

kou commented Feb 27, 2023

We always a value to efsi->min_interval_stack but this pop conditionally.
Does this work with --filter 'X *N "A B" && X *N,1,2,3 "A B"? Does the second *N use DEFAULT_MIN_INTERVAL not 3?

As far as I confirmed, it works fine. The second *N uses 3. I am adding tests for those cases...

私もgdbで確認しましたが、値は正しいものを使いますがefsi->min_interval_stackにはINT32_MINが残りました。

POPは必ずしませんか?

diff --git a/lib/grn_ecmascript.lemon b/lib/grn_ecmascript.lemon
index 33f2feccd..e8baadacc 100644
--- a/lib/grn_ecmascript.lemon
+++ b/lib/grn_ecmascript.lemon
@@ -399,6 +399,8 @@ relational_expression ::= relational_expression NEAR shift_expression. {
     grn_obj *max_element_intervals;
     GRN_PTR_POP(&efsi->max_element_intervals_stack,
                 max_element_intervals);
+    int min_interval;
+    GRN_INT32_POP(&efsi->min_interval_stack, min_interval);
     if (!max_element_intervals) {
       break;
     }
@@ -406,8 +408,6 @@ relational_expression ::= relational_expression NEAR shift_expression. {
                         GRN_OP_PUSH, 1);
     grn_expr_take_obj(efsi->ctx, efsi->e, max_element_intervals);
     n_args++;
-    int min_interval;
-    GRN_INT32_POP(&efsi->min_interval_stack, min_interval);
     grn_expr_append_const_int(efsi->ctx, efsi->e, min_interval,
                               GRN_OP_PUSH, 1);
     n_args++;

@kou
Copy link
Member

kou commented Feb 27, 2023

あと、この変更の問題ではないんですが、 grn_ecmascript.lemonquery_element ::= RELATIVE_OP query_element.にもquery_element ::= IDENTIFIER RELATIVE_OP query_element.でやっているようなPOPがないとスタックにゴミが溜まっていってしまっていました。

Always pop `min_interval`
@HashidaTKS
Copy link
Contributor Author

私もgdbで確認しましたが、値は正しいものを使いますがefsi->min_interval_stackにはINT32_MINが残りました。

POPは必ずしませんか?

なるほど、確認不足でした。。。
ご指摘の通り修正し、手元で近傍検索系の全てのテストが通ることを確認しました。

@HashidaTKS
Copy link
Contributor Author

あと、この変更の問題ではないんですが、 grn_ecmascript.lemonquery_element ::= RELATIVE_OP query_element.にもquery_element ::= IDENTIFIER RELATIVE_OP query_element.でやっているようなPOPがないとスタックにゴミが溜まっていってしまっていました。

こちらについては、issue #1528 に起票しましたので、別のタスクとして実施させてください。

grn_expr_append_const_int(efsi->ctx, efsi->e, min_interval,
GRN_OP_PUSH, 1);
n_args++;
} while(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} while(0);
} while (false);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

修正しました。

lib/grn_ii.h Outdated
@@ -200,6 +202,7 @@ struct _grn_select_optarg {
int additional_last_interval;
grn_obj *query_options;
grn_obj *max_element_intervals;
int *min_interval;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grn_select_optargは公開インターフェイスじゃないのでintでいい気がします。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

intに直しました。

lib/ii.c Outdated
Comment on lines 11729 to 11731
return (interval <=
(data->max_interval + data->additional_last_interval) &&
interval_without_last_token <= data->max_interval);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

私はカッコが残っていたほうが好みですが、カッコを取りたいなら中途半端に取らないでもっと取りませんか?

Suggested change
return (interval <=
(data->max_interval + data->additional_last_interval) &&
interval_without_last_token <= data->max_interval);
return interval <= (data->max_interval + data->additional_last_interval) &&
interval_without_last_token <= data->max_interval;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ここにmin_intervalを入れていたときに追加した変更で、現在は不要な変更なので、ここの変更自体をリバーとしました。

lib/proc.c Outdated
@@ -4350,6 +4350,7 @@ selector_in_values(grn_ctx *ctx, grn_obj *table, grn_obj *index,
search_options.similarity_threshold = 0;
search_options.max_interval = 0;
search_options.additional_last_interval = 0;
search_options.min_interval = NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grn_search_optargでのメンバーの定義順と同じ順番にしませんか?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

修正しました。

lib/table_selector.c Outdated Show resolved Hide resolved
{"content": "abc123456789def"}
]
[[0,0.0,0.0],1]
select Entries --filter 'content *NP12,1,,10"abc def$"' --output_columns '_score, content'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10じゃなくて11がボーダーじゃないですか?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

おっしゃるとおりですので、11に直しました。
(色々試しているときに10を入れてそのままになっていました)

@HashidaTKS
Copy link
Contributor Author

@kou

度々のご確認ありがとうございます。ご指摘の事項に対応しました。

@kou kou merged commit c6107af into master Feb 27, 2023
@kou kou deleted the add-min-interval branch February 27, 2023 07:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants