Update nbHits count with filtered documents #849
Conversation
@MarinPostma, Updated for placeholder search also. Let me know if I have missed anything.

Hey @balajisivaraman, sorry, I need time to look into this; I'll review this PR asap.

@MarinPostma, No worries! Thanks.
meilisearch-core/src/bucket_sort.rs
Outdated
```rust
    Some(key) => buf_distinct.register(key),
    None => buf_distinct.register_without_key(),
};

if !distinct_accepted && !contains_key {
```
why do you need to check `contains_key`?
Ah, now I see this is a mistake. When I added the `filter_accepted` check, I needed `contains_key` because the document IDs were repeated in the groups. If I didn't do the `contains_key` check for the filter, the overall `nbHits` came out as 0 since `filtered_count` was too high. But as you pointed out, I don't think I need it for `distinct_accepted`; I will remove it.
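The double-counting pitfall described above can be sketched in isolation. This is a hypothetical, simplified model (the names are illustrative, not the actual `bucket_sort` code): document IDs may repeat across sub-groups, so the filter result is memoized per ID and the counter is bumped only the first time a rejected ID is seen.

```rust
use std::collections::HashMap;

// Simplified model: count rejected documents at most once per document id,
// even when the same id shows up in several sub-groups.
fn count_filtered(doc_ids: &[u32], filter: impl Fn(u32) -> bool) -> usize {
    let mut filter_map: HashMap<u32, bool> = HashMap::new();
    let mut filtered_count = 0;
    for &id in doc_ids {
        // the closure runs only on the first occurrence of `id`
        filter_map.entry(id).or_insert_with(|| {
            let accepted = filter(id);
            if !accepted {
                filtered_count += 1;
            }
            accepted
        });
    }
    filtered_count
}

fn main() {
    // id 2 is rejected and appears twice, but is counted only once
    assert_eq!(count_filtered(&[1, 2, 2, 3], |id| id != 2), 1);
}
```

Without the memoization, the repeated id 2 would bump the counter twice and inflate `filtered_count`.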
meilisearch-core/src/bucket_sort.rs
Outdated
```rust
let contains_key = key_cache.contains_key(&document.id);
let entry = key_cache.entry(document.id);
```
Suggested change:

```rust
let entry = key_cache.entry(document.id);
let contains_key = entry.is_some();
```
force-pushed from 5693087 to b49ab80
Hi Guys, anything I can contribute to this issue to get it merged? Thanks
Hello! Very sorry for taking this long to review this PR. It still needs a few small changes, but it is a good job nonetheless. Thanks a lot for your contribution!
meilisearch-core/src/bucket_sort.rs
Outdated
```rust
} else if !contains_key {
    filtered_count += 1;
```
Suggested change (drop the `else if` branch):

```rust
}
```
meilisearch-core/src/bucket_sort.rs
Outdated
```diff
@@ -331,10 +333,16 @@
     let entry = key_cache.entry(document.id);
     let key = entry.or_insert_with(|| (distinct)(document.id).map(Rc::new));

-    match key.clone() {
+    let distinct_accepted = match key.clone() {
```
Suggested change:

```rust
for document in group.iter() {
    let filter_accepted = match &filter {
        Some(filter) => {
            let entry = filter_map.entry(document.id);
            let accepted = *entry.or_insert_with(|| (filter)(document.id));
            if !accepted {
                filtered_count += 1;
            }
            accepted
        }
        None => true,
    };
```
I would suggest this instead; I think it is more straightforward.
@MarinPostma, I finally got around to this. Just for clarification, this diff should be applied in place of lines 322 to 333, right? And I should just get rid of the `distinct_accepted` logic in the `if filter_accepted` block. However, when I tried this originally (and again now after your suggestion), the test `search_with_filter` gives 0 as the final count, presumably because it filters out everything in the sample set. When I tried it now, I also got an 'attempt to subtract with overflow' panic. Am I missing something?
You're right, I was a little too fast on this one. I have tried the following and it seems to be working:
```rust
for group in group.binary_group_by_mut(|a, b| criterion.eq(&ctx, a, b)) {
    // we must compute the real distinguished len of this sub-group
    for document in group.iter() {
        let filter_accepted = match &filter {
            Some(filter) => {
                let entry = filter_map.entry(document.id);
                *entry.or_insert_with(|| {
                    let accepted = (filter)(document.id);
                    // we only want to count it out the first time we see it
                    if !accepted {
                        filtered_count += 1;
                    }
                    accepted
                })
            }
            None => true,
        };

        if filter_accepted {
            let entry = key_cache.entry(document.id);
            let mut seen = true;
            let key = entry.or_insert_with(|| {
                seen = false;
                (distinct)(document.id).map(Rc::new)
            });

            let distinct = match key.clone() {
                Some(key) => buf_distinct.register(key),
                None => buf_distinct.register_without_key(),
            };

            // we only want to count the document if it is the first time we
            // see it and it wasn't accepted by distinct
            if !seen && !distinct {
                filtered_count += 1;
            }
        }

        // the requested range end is reached: stop computing distinct
        if buf_distinct.len() >= range.end {
            break;
        }
    }

    documents_seen += group.len();
    groups.push(group);

    // if this sub-group does not overlap with the requested range
    // we must update the distinct map and its start index
    if buf_distinct.len() < range.start {
        buf_distinct.transfert_to_internal();
        distinct_raw_offset = documents_seen;
    }

    // we have sorted enough documents if the last document sorted is after
    // the end of the requested range; we can continue to the next criterion
    if buf_distinct.len() >= range.end {
        continue 'criteria;
    }
}
```
these are the tests I have been trying it on:
```rust
#[actix_rt::test]
async fn test_filter_nb_hits_search_normal() {
    let mut server = common::Server::with_uid("test");
    let body = json!({
        "uid": "test",
        "primaryKey": "id",
    });
    server.create_index(body).await;
    let documents = json!([
        {
            "id": 1,
            "content": "a",
            "color": "green",
            "size": 1,
        },
        {
            "id": 2,
            "content": "a",
            "color": "green",
            "size": 2,
        },
        {
            "id": 3,
            "content": "a",
            "color": "blue",
            "size": 3,
        },
    ]);
    server.add_or_update_multiple_documents(documents).await;

    let (response, _) = server.search_post(json!({"q": "a"})).await;
    assert_eq!(response["nbHits"], 3);

    let (response, _) = server.search_post(json!({"q": "a", "filters": "size = 1"})).await;
    assert_eq!(response["nbHits"], 1);

    server.update_distinct_attribute(json!("color")).await;

    let (response, _) = server.search_post(json!({"q": "a"})).await;
    assert_eq!(response["nbHits"], 2);

    let (response, _) = server.search_post(json!({"q": "a", "filters": "size < 3"})).await;
    println!("result: {}", response);
    assert_eq!(response["nbHits"], 1);
}
```
the reason for the subtraction overflow is that we are counting some items more than once (as you've probably figured); we really want to count them only the first time we see them :)
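A minimal illustration of why the panic shows up (the function name and numbers here are illustrative, not from the codebase): if a rejected document is counted twice, `filtered_count` can exceed the number of documents seen, and subtracting it on a `usize` underflows, which panics in debug builds as "attempt to subtract with overflow".

```rust
// nbHits is conceptually `documents_seen - filtered_count`; checked_sub makes
// the underflow visible as None instead of a panic
fn nb_hits(documents_seen: usize, filtered_count: usize) -> Option<usize> {
    documents_seen.checked_sub(filtered_count)
}

fn main() {
    // a double-counted rejection pushes filtered_count past documents_seen:
    // a plain `3 - 5` here would be the "attempt to subtract with overflow" panic
    assert_eq!(nb_hits(3, 5), None);
    assert_eq!(nb_hits(5, 2), Some(3));
}
```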
we basically have to try the same tests on placeholder search now
```rust
let is_filtered = (filter)(**item);
if is_filtered {
    filtered_count += 1;
}
is_filtered
```
Suggested change:

```rust
let accepted = (filter)(**item);
if !accepted {
    filtered_count += 1;
}
accepted
```
@MarinPostma, Thanks for the comments. Is it okay if I get to this in the next couple of days and close it out by the weekend?

Yes, take your time! Thanks for your help 🙂

We should test the behavior of #1039 to know if this PR fixes it.
force-pushed from b49ab80 to 75e22fc
@MarinPostma, Done. Thanks for the help on this one, I had trouble figuring some things out. I tested the …
Codecov Report
```
@@            Coverage Diff             @@
##           master     #849      +/-   ##
==========================================
+ Coverage   76.53%   76.96%   +0.42%
==========================================
  Files         104      104
  Lines       12127    12179      +52
==========================================
+ Hits         9282     9374      +92
+ Misses       2845     2805      -40
```
Continue to review full report at Codecov.
bors try

try Build succeeded:
looks good to me, thank you :)
bors r+

Build succeeded:

Thanks so much for merging this, and the support on this one from @MarinPostma.

Thank you! And sorry for taking so long :)
Testing the new v0.17.0 release as we speak (great work), but it seems the patch does not work if there is a `limit`. For example, with 3 documents in my index "events" (2 of them with a …
hello @theo-lubert, this is perfectly normal: we filter elements lazily, so since you only ask for one, and it is not filtered out, the others are still counted.
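The lazy behaviour can be sketched like this (hypothetical names, not the actual Meilisearch code): once `limit` accepted documents are found, the remaining candidates are never run through the filter, so they are never subtracted from the count.

```rust
// Simplified model of lazy filtering under a limit: nbHits is computed as
// total documents minus the rejections actually observed before stopping.
fn approximate_nb_hits(docs: &[u32], filter: impl Fn(u32) -> bool, limit: usize) -> usize {
    let mut filtered_count = 0;
    let mut returned = 0;
    for &doc in docs {
        if returned >= limit {
            break; // later documents are never examined, hence never counted out
        }
        if filter(doc) {
            returned += 1;
        } else {
            filtered_count += 1;
        }
    }
    docs.len() - filtered_count
}

fn main() {
    // with limit = 1 the first document already satisfies the request,
    // so the two rejected documents are still included in the count
    assert_eq!(approximate_nb_hits(&[1, 2, 3], |d| d == 1, 1), 3);
    // with a limit larger than the index, the count is exact
    assert_eq!(approximate_nb_hits(&[1, 2, 3], |d| d == 1, 10), 1);
}
```

This is why nbHits under a small `limit` is only an approximation of the true filtered total.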
Thanks @MarinPostma, so as I understand it, this is not a good fit for (filtered) pagination then. Any plans on that? I don't find any related feature in the roadmap; should I create a new issue?
hi! is there any news on this front? having filters show the correct nbHits would be amazing!
Hello @Vincent56, the product team is currently working on the expected behavior we want for |
With my filters, nbHits is 76+, but I only get 15 hits.
Closes #764
Closes #1039

After discussing with @MarinPostma on Slack, this is my first attempt at implementing this for the basic flow that will go through `bucket_sort_with_distinct`. A few thoughts here:

- I considered computing the filtered count directly as `filter_map.values().filter(|&&v| !v).count()`. In a few cases, this was the same as what I have now implemented, but I realised I couldn't do something similar for `distinct`. So for consistency, I have implemented both in a similar fashion.
- I added the `contains_key` check to ensure we're not counting the same document ID twice.

@MarinPostma also mentioned that this will be an approximation since the sort is lazy. In the test example that I've updated, the actual filtered count will be just 19 (for `male` records), but due to the `limit` in play, it returns 32 (filtering out 11 records overall).

Please let me know if this is the kind of fix we are looking for, and I can implement it in the placeholder search also.
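For reference, the alternative counting expression mentioned in the first bullet can be exercised on its own; the map contents here are made up for illustration:

```rust
use std::collections::HashMap;

// derive the filtered count from the memoized filter map instead of keeping
// an incremental counter alongside it
fn filtered_count(filter_map: &HashMap<u32, bool>) -> usize {
    filter_map.values().filter(|&&v| !v).count()
}

fn main() {
    // documents 2 and 3 were rejected by the filter
    let filter_map: HashMap<u32, bool> =
        [(1, true), (2, false), (3, false)].into_iter().collect();
    assert_eq!(filtered_count(&filter_map), 2);
}
```

Both approaches agree as long as each document's filter result is recorded exactly once in the map.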