ordered_near_phrase: fix max interval calculation with multiple tokens in a phrase #1527

HashidaTKS · 2023-02-21T16:22:28Z

When max_element_intervals is specified and a specified word contains two or more tokens, the interval is miscalculated.

How to reproduce:

table_create Entries TABLE_NO_KEY
[[0,0.0,0.0],true]
column_create Entries content COLUMN_SCALAR Text
[[0,0.0,0.0],true]
table_create Terms TABLE_PAT_KEY ShortText --default_tokenizer 'TokenNgram("unify_alphabet", false, "unify_digit", false)' --normalizer NormalizerNFKC121
[[0,0.0,0.0],true]
column_create Terms entries_content COLUMN_INDEX|WITH_POSITION Entries content
[[0,0.0,0.0],true]
load --table Entries
[
{"content": "abcXYZdef"},
{"content": "abebcdXYZdef"},
{"content": "abcdef"},
{"content": "defXYZabc"},
{"content": "XYZabc"},
{"content": "abc123456789def"},
{"content": "abc12345678def"},
{"content": "abc1de2def"}
]
[[0,0.0,0.0],8]
select Entries --filter 'content *ONP-1,0,3 "abc def"' --output_columns '_score, content'
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        5
      ],
      [
        [
          "_score",
          "Int32"
        ],
        [
          "content",
          "Text"
        ]
      ],
      [
        1,
        "abcXYZdef"
      ],
      [
        1,
        "abcdef"
      ],
      [
        1,
        "abc123456789def"
      ],
      [
        1,
        "abc12345678def"
      ],
      [
        1,
        "abc1de2def"
      ]
    ]
  ]
]

The expected result of *ONP-1,0,3 "abc def" is only "abcdef", but the query above matches every record that contains abc followed by def.
This is because the interval is regarded as 0.

For example, suppose the target is abcXYZdef and the query is *ONP-1,0,3 "abc def":

data->token_infos[0]: ab (pos = 0, offset = 0)
data->token_infos[1]: bc (pos = 0, offset = 1)
data->token_infos[2]: de (pos = 6, offset = 0)
n: 2 (clamped because the condition n > n_max_element_intervals + 1 is satisfied: 3 > 1 + 1)

As a result, the only interval calculated is data->token_infos[1].pos - data->token_infos[0].pos = 0.
This is why abcXYZdef is matched.
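
The walkthrough above can be modeled with a small C sketch. It is not the code in lib/ii.c: `sketch_token_info` and `sketch_old_check` are hypothetical names, and the loop is reconstructed only from the description in this PR (adjacent token_infos are compared, and n is clamped to n_max_element_intervals + 1).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the token_info entries listed above. */
typedef struct {
  const char *token;
  int32_t pos;    /* position reported for the token (see the listing above) */
  int32_t offset; /* offset of the token inside its phrase */
} sketch_token_info;

/* Models the old behavior described in this PR: the interval is taken
 * between adjacent token_infos, not between phrases. */
static bool
sketch_old_check(const sketch_token_info *infos,
                 uint32_t n,
                 const int32_t *max_intervals,
                 uint32_t n_max_intervals)
{
  if (n > n_max_intervals + 1) {
    n = n_max_intervals + 1; /* 3 token_infos are clamped to 2 here */
  }
  for (uint32_t i = 1; i < n; i++) {
    int32_t max_interval = max_intervals[i - 1];
    int32_t interval = infos[i].pos - infos[i - 1].pos;
    if (max_interval >= 0 && interval > max_interval) {
      return false;
    }
  }
  return true;
}
```

With the three token_infos for abcXYZdef (ab, bc, de) and a single limit of 3, the loop only ever compares ab with bc, sees an interval of 0, and accepts the record even though de is 6 positions away from ab.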

When `max_element_intervals` was specified and a specified word contained two or more tokens, the interval was miscalculated.

HashidaTKS · 2023-02-21T23:59:08Z

The interval of abc and def for abcdef should be 0, but currently it's calculated as 3.
However, that is out of scope of this PR.
I plan to resolve it with #1519.

HashidaTKS · 2023-02-21T23:59:22Z

@kou @komainu8

Would you review this?

kou · 2023-02-22T01:18:44Z

lib/ii.c

-    for (i = 1; i < n; i++) {
+    uint32_t n_checked_elements = 0;
+
+    for (i = 1; i < n && n_checked_elements < n_max_element_intervals; i++) {


We should compare the first token in the i-1th phrase and the first token in the ith phrase.
Does this change do it?

How about this?

} else if (data->mode == GRN_OP_ORDERED_NEAR_PHRASE) {
  uint32_t i_token_info;
  uint32_t n_token_infos = data->n_token_infos;
  uint32_t i_phrase;
  uint32_t n_phrases = data->n_phrases;
  if (n_phrases > n_max_element_intervals + 1) {
    n_phrases = n_max_element_intervals + 1;
  }
  uint32_t previous_phrase_id = data->token_infos[0]->phrase_id;
  int32_t previous_pos = data->token_infos[0]->pos;
  for (i_token_info = 1, i_phrase = 0;
       i_token_info < n_token_infos && i_phrase < n_phrases;
       i_token_info++) {
    if (data->token_infos[i_token_info]->phrase_id == previous_phrase_id) {
      continue;
    }
    int32_t pos = data->token_infos[i_token_info]->pos;
    int32_t max_element_interval =
      GRN_INT32_VALUE_AT(data->max_element_intervals, i_phrase);
    if (max_element_interval >= 0 &&
        (pos - previous_pos) > max_element_interval) {
      return false;
    }
    previous_pos = pos;
    i_phrase++;
  }
  return true;

Sounds great!
I didn't notice that we have phrase_id.

Ah, the code is buggy. n_max_element_intervals + 1 should be n_max_element_intervals.

I have added a little modification about n_phrases, and it works fine!
Thanks!
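
For readers without the Groonga tree at hand, the phrase_id idea discussed in this thread can be tried in isolation with the sketch below. It is a standalone approximation, not the merged code: `sketch_phrase_token` and `sketch_phrase_check` are hypothetical names, a plain array replaces GRN_INT32_VALUE_AT, the loop is bounded by the number of intervals as corrected above, and updating previous_phrase_id when moving to the next phrase is an assumption of this sketch that the thread does not confirm.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified token entry: only the fields the check needs. */
typedef struct {
  uint32_t phrase_id;
  int32_t pos;
} sketch_phrase_token;

/* Compares the first token of each phrase with the first token of the
 * previous phrase; tokens inside the same phrase are skipped. */
static bool
sketch_phrase_check(const sketch_phrase_token *tokens,
                    uint32_t n_tokens,
                    const int32_t *max_intervals,
                    uint32_t n_max_intervals)
{
  if (n_tokens == 0) {
    return true;
  }
  uint32_t i_interval = 0;
  uint32_t previous_phrase_id = tokens[0].phrase_id;
  int32_t previous_pos = tokens[0].pos;
  for (uint32_t i = 1; i < n_tokens && i_interval < n_max_intervals; i++) {
    if (tokens[i].phrase_id == previous_phrase_id) {
      continue; /* still inside the same phrase */
    }
    int32_t pos = tokens[i].pos;
    int32_t max_interval = max_intervals[i_interval];
    /* A negative limit means "no limit" for this interval. */
    if (max_interval >= 0 && (pos - previous_pos) > max_interval) {
      return false;
    }
    previous_phrase_id = tokens[i].phrase_id; /* assumption of this sketch */
    previous_pos = pos;
    i_interval++;
  }
  return true;
}
```

Given the tokens ab (phrase 0, pos 0), bc (phrase 0, pos 0), and de (phrase 1, pos 6) with a limit of 3, this check rejects abcXYZdef. With de at pos 3, as in abcdef, it accepts the record.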

kou · 2023-02-22T01:56:30Z

test/command/suite/select/filter/ordered_near_phrase/max_element_intervals/ngram.test

+{"content": "abcXYZdef"},
+{"content": "abebcdXYZdef"},
+{"content": "abcdef"},
+{"content": "defXYZabc"},
+{"content": "XYZabc"},
+{"content": "abc123456789def"},
+{"content": "abc12345678def"},
+{"content": "abc1de2def"}


Can we use more meaningful test data?
For example, if we focus on a max interval of 3, we should use boundary data, such as records with intervals of 2, 3, and 4.
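
One way to pick such boundary records is to compute the interval each candidate string would have. The helper below is a hypothetical sketch, not part of Groonga: it assumes ASCII text and that a phrase's position equals the character index of its first character, which is how pos = 6 arises for de in the abcXYZdef example earlier in this PR.

```c
#include <stdint.h>
#include <string.h>

/* Returns the character distance between the first occurrence of `first`
 * and the next occurrence of `second` after it, or -1 when either phrase
 * is missing or they appear in the wrong order. */
static int32_t
sketch_interval(const char *text, const char *first, const char *second)
{
  const char *first_at = strstr(text, first);
  if (!first_at) {
    return -1;
  }
  const char *second_at = strstr(first_at + strlen(first), second);
  if (!second_at) {
    return -1;
  }
  return (int32_t)(second_at - first_at);
}
```

Under this model, abc1def, abc12def, and abc123def have intervals of 4, 5, and 6, so they straddle a limit of 5 from both sides.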

HashidaTKS · 2023-02-22T02:43:52Z

@kou

Thank you for your comments!
I have addressed your comments, would you please re-review this?

kou · 2023-02-22T02:48:30Z

test/command/suite/select/filter/ordered_near_phrase/max_element_intervals/ngram.expected

+{"content": "abc123def"}
+]
+[[0,0.0,0.0],3]
+select Entries --filter 'content *ONP-1,0,5 "abc def"' --output_columns '_score, content'


Could you also add a separate test for the n_phrases > n_max_element_intervals case?

This case satisfies n_phrases > n_max_element_intervals because n_phrases is 2 (abc and def), n_max_element_intervals is 1 (5 of *ONP-1,0,5).

Did you mean n_phrases <= n_max_element_intervals case?

Hmm, I feel like n_phrases > n_max_element_intervals + 1 is correct in implementation.

Added tests for n_phrases > n_max_element_intervals + 1 and n_phrases == n_max_element_intervals + 1 and n_phrases < n_max_element_intervals + 1.

HashidaTKS · 2023-02-22T03:20:51Z

@kou

Thank you.
I have addressed your comment.

kou · 2023-02-22T04:51:19Z

lib/ii.c

+    uint32_t i_phrase;
+    uint32_t n_phrases = data->n_phrases;
+    if (n_phrases > n_max_element_intervals + 1) {
+      n_phrases = n_max_element_intervals;
+    }


Ah, we should improve the variable names. How about this?

Suggested change

-    uint32_t i_phrase;
-    uint32_t n_phrases = data->n_phrases;
-    if (n_phrases > n_max_element_intervals + 1) {
-      n_phrases = n_max_element_intervals;
-    }
+    uint32_t i_interval;
+    uint32_t n_intervals = data->n_phrases - 1;
+    if (n_intervals > n_max_element_intervals) {
+      n_intervals = n_max_element_intervals;
+    }

I adopted your suggestion.

Could you push it?

Sorry, I pushed it.
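
The effect of counting intervals rather than phrases can be checked with a small standalone sketch. `sketch_n_intervals` is a hypothetical helper that mirrors the suggested lines; the guard for n_phrases == 0 is an addition of this sketch to avoid unsigned wrap-around and is not taken from the PR.

```c
#include <stdint.h>

/* Number of intervals to check: one fewer than the number of phrases,
 * capped by how many max element intervals the query specified. */
static uint32_t
sketch_n_intervals(uint32_t n_phrases, uint32_t n_max_element_intervals)
{
  uint32_t n_intervals = n_phrases > 0 ? n_phrases - 1 : 0;
  if (n_intervals > n_max_element_intervals) {
    n_intervals = n_max_element_intervals;
  }
  return n_intervals;
}
```

This covers the three cases tested earlier in the review: with 3 phrases and 1 limit (n_phrases > n_max_element_intervals + 1) one interval is checked, with 2 phrases and 1 limit (equal) one interval is checked, and with 2 phrases and 2 limits (fewer phrases) still only one interval is checked because only one gap exists.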

kou · 2023-02-22T04:52:47Z

...mmand/suite/select/filter/ordered_near_phrase/max_element_intervals/multi_token/all.expected

@@ -0,0 +1,62 @@
+table_create Entries TABLE_NO_KEY


Could you use multi_tokens?

kou · 2023-02-22T04:54:04Z

...mmand/suite/select/filter/ordered_near_phrase/max_element_intervals/multi_token/all.expected

+{"content": "abc123def123ghi"}
+]
+[[0,0.0,0.0],9]
+select Entries --filter 'content *ONP-1,0,5|5 "abc def ghi"' --output_columns '_score, content'


Could you use different values for the max element intervals to check that the right value is really used for each interval? For example, 5|6?

Changed to use 5|6 (and fixed the test cases).
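
The reason distinct limits make a stronger test can be shown with a short sketch. `sketch_within_limits` is a hypothetical helper that works on phrase start positions directly. It assumes, as in the examples in this PR, that a negative limit means no limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Checks each gap between consecutive phrase start positions against the
 * limit with the same index. Gaps without a limit are not checked. */
static bool
sketch_within_limits(const int32_t *phrase_positions,
                     uint32_t n_phrases,
                     const int32_t *max_intervals,
                     uint32_t n_max_intervals)
{
  for (uint32_t i = 0; i + 1 < n_phrases && i < n_max_intervals; i++) {
    int32_t interval = phrase_positions[i + 1] - phrase_positions[i];
    if (max_intervals[i] >= 0 && interval > max_intervals[i]) {
      return false;
    }
  }
  return true;
}
```

A record whose phrases start at positions 0, 5, and 11 has gaps of 5 and 6. It passes with 5|6 but fails with 6|5, so a test with distinct values would notice if an implementation applied the limits in the wrong order. With 5|5 the same record fails either way, and the mix-up would go undetected.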

HashidaTKS · 2023-02-22T05:16:47Z

@kou

Thanks, I have addressed your comments.

HashidaTKS added 2 commits February 22, 2023 00:59

*ONP: Fix max_element_intervals to work correctly

87418c5

When `max_element_intervals` was specified and a specified word contained two or more tokens, the interval was miscalculated.

Use interval

fa84262

kou reviewed Feb 22, 2023

HashidaTKS added 2 commits February 22, 2023 11:38

Apply the feedback suggestion

fd46e06

Fix tests

f22582b

kou reviewed Feb 22, 2023

HashidaTKS added 2 commits February 22, 2023 12:05

Fix if condition

cd61be1

Modify test

bc659e7

Remove diff for debug

42ccc1c

kou reviewed Feb 22, 2023

HashidaTKS added 3 commits February 22, 2023 14:06

Rename multi_token to multi_tokens

828b99b

Rename i_phrase to i_interval

96b8c8e

Modify test cases

387efa2

kou changed the title ~~ordered_near_phrase: fix max_element_intervals to work correctly~~ ordered_near_phrase: fix max interval calculation with multiple tokens in a phrase Feb 22, 2023

kou merged commit b4ce1e9 into master Feb 22, 2023

kou deleted the fix-ordered-near-phrase-bug branch February 22, 2023 05:32

github-actions bot mentioned this pull request May 22, 2023

nginx: update bundles version 1.23.4 komainu8/groonga#1

Closed
