Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ordered_near_phrase: fix max interval caluculation with multiple tokens in a phrase #1527

Merged
merged 10 commits into from
Feb 22, 2023

Conversation

HashidaTKS
Copy link
Contributor

@HashidaTKS HashidaTKS commented Feb 21, 2023

When max_element_intervalals is specified and a specified word containes two or more tokens in it, the interval is miscalculated.

How to reproduce.

table_create Entries TABLE_NO_KEY
[[0,0.0,0.0],true]
column_create Entries content COLUMN_SCALAR Text
[[0,0.0,0.0],true]
table_create Terms TABLE_PAT_KEY ShortText   --default_tokenizer 'TokenNgram("unify_alphabet", false,                                   "unify_digit", false)'   --normalizer NormalizerNFKC121
[[0,0.0,0.0],true]
column_create Terms entries_content COLUMN_INDEX|WITH_POSITION Entries content
[[0,0.0,0.0],true]
load --table Entries
[
{"content": "abcXYZdef"},
{"content": "abebcdXYZdef"},
{"content": "abcdef"},
{"content": "defXYZabc"},
{"content": "XYZabc"},
{"content": "abc123456789def"},
{"content": "abc12345678def"},
{"content": "abc1de2def"}
]
[[0,0.0,0.0],8]
select Entries --filter 'content *ONP-1,0,3 "abc def"' --output_columns '_score, content'
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        5
      ],
      [
        [
          "_score",
          "Int32"
        ],
        [
          "content",
          "Text"
        ]
      ],
      [
        1,
        "abcXYZdef"
      ],
      [
        1,
        "abcdef"
      ],
      [
        1,
        "abc123456789def"
      ],
      [
        1,
        "abc12345678def"
      ],
      [
        1,
        "abc1de2def"
      ]
    ]
  ]
]

The expected result of *ONP-1,0,3 "abc def" is only "abcdef", but the query above hits all records that contains abc and def.
This is because interval is regarded as 0.

For example, in case of target is abcXYZdef and query is *ONP-1,0,3 "abc def".

data->token_infos[0]: ab (pos = 0, offset = 0)
data->token_infos[1]: bc (pos = 0, offset = 1)
data->token_infos[2]: de (pos = 6, offset = 0)
n: 2 (because if (n > n_max_element_intervals + 1) is satisfied.)

As a result, the interval is calculated as only data->token_infos[1].pos - data->token_infos[0].pos = 0.
This is because abcXYZdef is matched.

When `max_element_intervalals` was  specified and
a specified word contained two or more tokens in it,
interval was miscalculated.
@HashidaTKS
Copy link
Contributor Author

The interval of abc and def for abcdef should be 0, but currently it's calculated as 3.
However, that is out of scope of this PR.
I plan to work to resolve it with #1519.

@HashidaTKS
Copy link
Contributor Author

@kou @komainu8

Would you review this?

lib/ii.c Outdated
for (i = 1; i < n; i++) {
uint32_t n_checked_elements = 0;

for (i = 1; i < n && n_checked_elements < n_max_element_intervals; i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should compare the first token in the i-1th phrase and the first token in the ith phrase.
Does this change do it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this?

  } else if (data->mode == GRN_OP_ORDERED_NEAR_PHRASE) {
    uint32_t i_token_info;
    uint32_t n_token_infos = data->n_token_infos;
    uint32_t i_phrase;
    uint32_t n_phrases = data->n_phrases;
    if (n_phrases > n_max_element_intervals + 1) {
      n_phrases = n_max_element_intervals + 1;
    }
    uint32_t previous_phrase_id = data->token_infos[0]->phrase_id;
    int32_t previous_pos = data->token_infos[0]->pos;
    for (i_token_info = 1, i_phrase = 0;
         i_token_info < n_token_infos && i_phrase < n_phrases;
         i_token_info++) {
      if (data->token_infos[i_token_info]->phrase_id == previous_phrase_id) {
        continue;
      }
      int32_t pos = data->token_infos[i_token_info]->pos;
      int32_t max_element_interval =
        GRN_INT32_VALUE_AT(data->max_element_intervals, i_phrase);
      if (max_element_interval >= 0 &&
          (pos - previous_pos) > max_element_interval) {
        return false;
      }
      previous_pos = pos;
      i_phrase++;
    }
    return true;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great!
I didn't noticed that we have phrase_id.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the code is buggy. n_max_element_intervals + 1 should be n_max_element_intervals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a little modification about n_phrases, and it works fine!
Thanks!

Comment on lines 12 to 19
{"content": "abcXYZdef"},
{"content": "abebcdXYZdef"},
{"content": "abcdef"},
{"content": "defXYZabc"},
{"content": "XYZabc"},
{"content": "abc123456789def"},
{"content": "abc12345678def"},
{"content": "abc1de2def"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use more meaningful test data?
For example, if we focus on the max interval 3, we should use border data such as interval 2, 3 and 4 data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@HashidaTKS
Copy link
Contributor Author

@kou

Thank you for your comments!
I have addressed your comments, would you please re-review this?

{"content": "abc123def"}
]
[[0,0.0,0.0],3]
select Entries --filter 'content *ONP-1,0,5 "abc def"' --output_columns '_score, content'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add a separated test for n_phrases > n_max_element_intervals case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case satisfies n_phrases > n_max_element_intervals because n_phrases is 2 (abc and def), n_max_element_intervals is 1 (5 of *ONP-1,0,5).

Did you mean n_phrases <= n_max_element_intervals case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I feel like n_phrases > n_max_element_intervals + 1 is correct in implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added tests for n_phrases > n_max_element_intervals + 1 and n_phrases == n_max_element_intervals + 1 and n_phrases < n_max_element_intervals + 1.

@HashidaTKS
Copy link
Contributor Author

@kou

Thank you.
I have addressed your comment.

lib/ii.c Outdated
Comment on lines 11771 to 11775
uint32_t i_phrase;
uint32_t n_phrases = data->n_phrases;
if (n_phrases > n_max_element_intervals + 1) {
n_phrases = n_max_element_intervals;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we should improve variable name. How about this?

Suggested change
uint32_t i_phrase;
uint32_t n_phrases = data->n_phrases;
if (n_phrases > n_max_element_intervals + 1) {
n_phrases = n_max_element_intervals;
}
uint32_t i_interval;
uint32_t n_intervals = data->n_phrases - 1;
if (n_intervals > n_max_element_intervals) {
n_intervals = n_max_element_intervals;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I adopted your suggestion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you push it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I pushed it.

@@ -0,0 +1,62 @@
table_create Entries TABLE_NO_KEY
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use multi_tokens?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed.

{"content": "abc123def123ghi"}
]
[[0,0.0,0.0],9]
select Entries --filter 'content *ONP-1,0,5|5 "abc def ghi"' --output_columns '_score, content'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use different values for max element intervals to check suitable value is really use? For example, 5|6?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to use 5|6 (and fix test cases)

@HashidaTKS
Copy link
Contributor Author

@kou

Thanks, I have addressed your comments.

@kou kou changed the title ordered_near_phrase: fix max_element_intervals to work correctly ordered_near_phrase: fix max interval caluculation with multiple tokens in a phrase Feb 22, 2023
@kou kou merged commit b4ce1e9 into master Feb 22, 2023
@kou kou deleted the fix-ordered-near-phrase-bug branch February 22, 2023 05:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants