
Update nbHits count with filtered documents #849

Merged: 2 commits from balajisivaraman:wip_764 into meilisearch:master, Nov 26, 2020

Conversation

@balajisivaraman (Contributor) commented Jul 11, 2020:

Closes #764
Closes #1039

After discussing with @MarinPostma on Slack, this is my first attempt at implementing this for the basic flow that will go through bucket_sort_with_distinct.

A few thoughts here:

  • For getting the count of filtered documents alone, I originally thought of using filter_map.values().filter(|&&v| !v).count(). In a few cases this gave the same result as what I have now implemented, but I realised I couldn't do something similar for distinct, so for consistency I have implemented both in a similar fashion.
  • I also needed the contains_key check to ensure we're not counting the same document ID twice.

@MarinPostma also mentioned that this will be an approximation since the sort is lazy. In the test example that I've updated, the actual filtered count will be just 19 (for male records), but due to the limit in play, it returns 32 (filtering out 11 records overall).

Please let me know if this is the kind of fix we are looking for, and I can implement it in the placeholder search also.
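
To illustrate the two counting approaches from the first bullet above, here is a minimal standalone sketch (the ids, the map contents, and the stand-in filter are invented for illustration; this is not the actual bucket_sort code). Counting rejections incrementally while filling the filter cache gives the same number as counting over the cache afterwards, but only the incremental form extends naturally to the distinct case:

use std::collections::HashMap;

fn main() {
    // filter_map caches, per document id, whether the filter accepted it
    let mut filter_map: HashMap<u64, bool> = HashMap::new();
    let filter = |id: u64| id % 2 == 0; // stand-in filter: keep even ids

    // incremental style: bump the counter while filling the cache
    let mut filtered_count = 0usize;
    for id in 1u64..=5 {
        let accepted = *filter_map.entry(id).or_insert_with(|| filter(id));
        if !accepted {
            filtered_count += 1;
        }
    }

    // post-hoc style: derive the same number from the cache afterwards
    let post_hoc = filter_map.values().filter(|&&v| !v).count();
    assert_eq!(filtered_count, post_hoc); // both report ids 1, 3 and 5 as filtered out
    println!("filtered out: {}", filtered_count);
}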

@balajisivaraman (Contributor, Author):

@MarinPostma, Updated for placeholder search also. Let me know if I have missed anything.

@MarinPostma (Contributor):

Hey @balajisivaraman, sorry, I need some time to look into this; I'll review this PR ASAP.

@balajisivaraman (Contributor, Author):

@MarinPostma, No worries! Thanks.

Some(key) => buf_distinct.register(key),
None => buf_distinct.register_without_key(),
};

if !distinct_accepted && !contains_key {
@MarinPostma (Contributor):

Why do you need to check contains_key?

@balajisivaraman (Contributor, Author):

Ah, now I see this is a mistake. When I added the filter_accepted check, I needed contains_key because the document IDs are repeated across the groups; if I didn't do the contains_key check for the filter, the overall nbHits came out as 0 because filtered_count was too high. I think I don't need it for distinct_accepted, like you pointed out, so I will remove it.
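
To make the double-counting issue concrete, a small self-contained sketch (the groups, ids, and filter below are invented; this is not the engine's code). The same document id can appear in several groups, so counting a rejection per occurrence over-counts, while a first-time check counts each rejected id once:

use std::collections::HashMap;

fn main() {
    // the same document id can show up in several criterion groups
    let groups: Vec<Vec<u64>> = vec![vec![1, 2], vec![2, 3], vec![3, 3]];
    let filter = |id: u64| id != 3; // stand-in filter rejecting id 3

    let mut filter_map: HashMap<u64, bool> = HashMap::new();
    let mut per_occurrence = 0usize; // counts every rejected occurrence
    let mut per_id = 0usize;         // counts each rejected id only once

    for group in &groups {
        for &id in group {
            // the contains_key check plays the role described above:
            // it tells us whether this id was already counted
            let first_time = !filter_map.contains_key(&id);
            let accepted = *filter_map.entry(id).or_insert_with(|| filter(id));
            if !accepted {
                per_occurrence += 1;
                if first_time {
                    per_id += 1;
                }
            }
        }
    }

    assert_eq!(per_occurrence, 3); // id 3 appears three times
    assert_eq!(per_id, 1);         // but it is only one filtered document
}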

Comment on lines 333 to 339
let contains_key = key_cache.contains_key(&document.id);
let entry = key_cache.entry(document.id);
@MarinPostma (Contributor):
Suggested change
-let contains_key = key_cache.contains_key(&document.id);
-let entry = key_cache.entry(document.id);
+let entry = key_cache.entry(document.id);
+let contains_key = entry.is_some();

meilisearch-core/src/query_builder.rs (outdated review comment, resolved)
@neevany commented Oct 5, 2020:

Hi guys, is there anything I can contribute to this issue to get it merged? Thanks

@ManyTheFish added this to the 11/20 milestone Oct 5, 2020
@MarinPostma (Contributor) left a review comment:

Hello! Very sorry for taking so long to review this PR. It still needs a few small changes, but it is good work nonetheless. Thanks a lot for your contribution!

Comment on lines 344 to 345
} else if !contains_key {
filtered_count += 1;
@MarinPostma (Contributor):

Suggested change
} else if !contains_key {
filtered_count += 1;
}

@@ -331,10 +333,16 @@ where
 let entry = key_cache.entry(document.id);
 let key = entry.or_insert_with(|| (distinct)(document.id).map(Rc::new));

-match key.clone() {
+let distinct_accepted = match key.clone() {
@MarinPostma (Contributor):

Suggested change
-let distinct_accepted = match key.clone() {
+for document in group.iter() {
+    let filter_accepted = match &filter {
+        Some(filter) => {
+            let entry = filter_map.entry(document.id);
+            let accepted = *entry.or_insert_with(|| (filter)(document.id));
+            if !accepted {
+                filtered_count += 1;
+            }
+            accepted
+        }
+        None => true,
+    };

I would suggest this instead, this is more straightforward I think

@balajisivaraman (Contributor, Author):

@MarinPostma, I finally got around to this. Just for clarification, this diff should be applied in place of lines 322 to 333, right? And I should just get rid of the distinct_accepted logic in the if filter_accepted block. However, when I tried this originally (and again now after your suggestion), for the test search_with_filter I get 0 as the final count, presumably because it filters out everything in the sample set. When I tried it now, I also got an 'attempt to subtract with overflow' panic. Am I missing something?

@MarinPostma (Contributor):

You're right, I was a little too fast on that one. I have tried this out and it seems to be working:

            for group in group.binary_group_by_mut(|a, b| criterion.eq(&ctx, a, b)) {
                // we must compute the real distinguished len of this sub-group
                for document in group.iter() {
                    let filter_accepted = match &filter {
                        Some(filter) => {
                            let entry = filter_map.entry(document.id);
                            *entry.or_insert_with(|| { 
                                let accepted = (filter)(document.id);
                                // we only want to count it out the first time we see it
                                if !accepted {
                                    filtered_count += 1;
                                }
                                accepted
                            })
                        }
                        None => true,
                    };

                    if filter_accepted {
                        let entry = key_cache.entry(document.id);
                        let mut seen = true;
                        let key = entry.or_insert_with(|| {
                            seen = false;
                            (distinct)(document.id).map(Rc::new)
                        });

                        let distinct = match key.clone() {
                            Some(key) => buf_distinct.register(key),
                            None => buf_distinct.register_without_key(),
                        };

                        // we only want to count the document if it is the first time we see it and
                        // if it wasn't accepted by distinct
                        if !seen && !distinct {
                            filtered_count += 1;
                        }
                    } 
                    // the requested range end is reached: stop computing distinct
                    if buf_distinct.len() >= range.end {
                        break;
                    }
                }

                documents_seen += group.len();
                groups.push(group);

                // if this sub-group does not overlap with the requested range
                // we must update the distinct map and its start index
                if buf_distinct.len() < range.start {
                    buf_distinct.transfert_to_internal();
                    distinct_raw_offset = documents_seen;
                }

                // we have sorted enough documents if the last document sorted is after
                // the end of the requested range, we can continue to the next criterion
                if buf_distinct.len() >= range.end {
                    continue 'criteria;
                }
            }

these are the tests I have been trying it on:

#[actix_rt::test]
async fn test_filter_nb_hits_search_normal() {
    let mut server = common::Server::with_uid("test");

    let body = json!({
        "uid": "test",
        "primaryKey": "id",
    });

    server.create_index(body).await;
    let documents = json!([
        {
            "id": 1,
            "content": "a",
            "color": "green",
            "size": 1,
        },
        {
            "id": 2,
            "content": "a",
            "color": "green",
            "size": 2,
        },
        {
            "id": 3,
            "content": "a",
            "color": "blue",
            "size": 3,
        },
    ]);

    server.add_or_update_multiple_documents(documents).await;
    let (response, _) = server.search_post(json!({"q": "a"})).await;
    assert_eq!(response["nbHits"], 3);

    let (response, _) = server.search_post(json!({"q": "a", "filters": "size = 1"})).await;
    assert_eq!(response["nbHits"], 1);

    server.update_distinct_attribute(json!("color")).await;

    let (response, _) = server.search_post(json!({"q": "a"})).await;
    assert_eq!(response["nbHits"], 2);

    let (response, _) = server.search_post(json!({"q": "a", "filters": "size < 3"})).await;
    println!("result: {}", response);
    assert_eq!(response["nbHits"], 1);
}

@MarinPostma (Contributor):

The reason for the 'attempt to subtract with overflow' is that we are counting some items more than once (as you've probably figured); we really want to count them only the first time we see them :)
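
As a tiny illustration of that failure mode, with invented numbers and a plain subtraction standing in for however nbHits is actually derived in the codebase:

fn main() {
    // invented numbers: 5 candidate documents, but a duplicated id was
    // rejected more than once, so filtered_count grew past the candidate count
    let candidates: usize = 5;
    let filtered_count: usize = 7;

    // a plain `candidates - filtered_count` would panic in debug builds with
    // "attempt to subtract with overflow"; checked_sub surfaces the problem instead
    match candidates.checked_sub(filtered_count) {
        Some(nb_hits) => println!("nbHits = {}", nb_hits),
        None => println!("filtered_count exceeded the candidates: something was counted twice"),
    }
}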

@MarinPostma (Contributor):

we basically have to try the same tests on placeholder search now
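
For reference, a hypothetical adaptation of the earlier test to the placeholder path. It reuses only the helpers that appear in the test snippet above and assumes that an empty q goes through placeholder search; whether the real placeholder_search.rs tests look like this is an assumption:

#[actix_rt::test]
async fn test_filter_nb_hits_placeholder_search() {
    let mut server = common::Server::with_uid("test");

    server.create_index(json!({ "uid": "test", "primaryKey": "id" })).await;
    server.add_or_update_multiple_documents(json!([
        { "id": 1, "color": "green", "size": 1 },
        { "id": 2, "color": "green", "size": 2 },
        { "id": 3, "color": "blue",  "size": 3 },
    ])).await;

    // empty query: assumed to exercise the placeholder search path
    let (response, _) = server.search_post(json!({ "q": "" })).await;
    assert_eq!(response["nbHits"], 3);

    let (response, _) = server.search_post(json!({ "q": "", "filters": "size = 1" })).await;
    assert_eq!(response["nbHits"], 1);

    server.update_distinct_attribute(json!("color")).await;

    // documents 1 and 2 pass the filter but share the same color, so distinct keeps one
    let (response, _) = server.search_post(json!({ "q": "", "filters": "size < 3" })).await;
    assert_eq!(response["nbHits"], 1);
}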

Comment on lines 235 to 239
let is_filtered = (filter)(**item);
if is_filtered {
filtered_count += 1;
}
is_filtered
@MarinPostma (Contributor):

Suggested change
-let is_filtered = (filter)(**item);
-if is_filtered {
-    filtered_count += 1;
-}
-is_filtered
+let accepted = (filter)(**item);
+if !accepted {
+    filtered_count += 1;
+}
+accepted

@balajisivaraman (Contributor, Author):

@MarinPostma, Thanks for the comments. Is it okay if I get to this in the next couple of days and close it out by the weekend?

@MarinPostma (Contributor):

Yes, take your time! Thanks for your help 🙂

@ManyTheFish (Member):

We should test the behavior of #1039 to know whether this PR fixes it.

@MarinPostma removed this from the 11/20 milestone Oct 27, 2020
@MarinPostma removed the RFR label Oct 30, 2020
@balajisivaraman (Contributor, Author):

@MarinPostma, Done. Thanks for the help on this one, I had trouble figuring some things out. I tested the placeholder search test with the fix removed, and it failed correctly. So I think we look okay here.

@balajisivaraman marked this pull request as ready for review November 19, 2020 15:37
@codecov bot commented Nov 19, 2020:

Codecov Report

Merging #849 (75e22fc) into master (ef6b56d) will increase coverage by 0.42%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #849      +/-   ##
==========================================
+ Coverage   76.53%   76.96%   +0.42%     
==========================================
  Files         104      104              
  Lines       12127    12179      +52     
==========================================
+ Hits         9282     9374      +92     
+ Misses       2845     2805      -40     
Impacted Files Coverage Δ
meilisearch-core/src/bucket_sort.rs 82.48% <100.00%> (+1.09%) ⬆️
meilisearch-core/src/query_builder.rs 98.23% <100.00%> (+1.89%) ⬆️
meilisearch-http/tests/placeholder_search.rs 98.90% <100.00%> (+0.10%) ⬆️
meilisearch-http/tests/search.rs 99.40% <100.00%> (+0.03%) ⬆️
meilisearch-core/src/number.rs 31.78% <0.00%> (ø)
meilisearch-http/tests/common.rs 90.86% <0.00%> (+0.02%) ⬆️
meilisearch-core/src/filters/mod.rs 85.71% <0.00%> (+1.19%) ⬆️
meilisearch-core/src/store/mod.rs 66.87% <0.00%> (+1.91%) ⬆️
... and 2 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5a67862...75e22fc.

@MarinPostma (Contributor):

bors try

bors bot added a commit that referenced this pull request Nov 24, 2020
@bors bot commented Nov 24, 2020:

try

Build succeeded:

@MarinPostma (Contributor) left a review comment:

looks good to me, thank you :)

@qdequele added this to the 12/2020 milestone Nov 26, 2020
@MarinPostma (Contributor):

bors r+

@MarinPostma removed the RFR label Nov 26, 2020
@bors bot commented Nov 26, 2020:

Build succeeded:

@bors bot merged commit f564a9c into meilisearch:master Nov 26, 2020
@balajisivaraman deleted the wip_764 branch November 26, 2020 10:16
@balajisivaraman (Contributor, Author):

Thanks so much for merging this, and for the support on this one, @MarinPostma.

@MarinPostma (Contributor):

Thank you! And sorry for taking so long :)

@theo-lubert:

Testing the new v0.17.0 release as we speak (great work), but it seems the patch does not work if there is a limit.

For example, with 3 documents in my index "events" (2 of them with a date in the past):

limit: 25 and date >= ${Date.now} gives you nbHits: 1 (correct)
but
limit: 1 and date >= ${Date.now} gives you nbHits: 3 (incorrect)

@MarinPostma (Contributor) commented Dec 9, 2020:

Hello @theo-lubert, this is expected: we filter elements lazily, so since you only ask for one document and it is not filtered out, the others are never evaluated and are still counted.
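
A small standalone sketch of that lazy behaviour with plain Rust iterators (not the engine's code): once the requested number of hits is reached, the remaining documents are never run through the filter, so they are never recorded as filtered out.

fn main() {
    // three "documents": true means the document matches the date filter
    let docs = [true, false, false];
    let limit = 1;

    let mut filter_calls = 0;
    let hits = docs
        .iter()
        .filter(|&&matches_filter| {
            filter_calls += 1;
            matches_filter
        })
        .take(limit)
        .count();

    // with limit = 1 the first document already satisfies the request, so the
    // filter never runs on the remaining two; they are never recorded as
    // filtered out, and a count derived from that bookkeeping overshoots
    println!("hits = {}, filter evaluated on {} of {} documents", hits, filter_calls, docs.len());
}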

@theo-lubert:

Thanks @MarinPostma, so as I understand it, this is not a good fit for (filtered) pagination then. Any plans on that? I don't see any related feature in the roadmap; should I create a new issue?

@curquiza mentioned this pull request Dec 9, 2020
@Vincent56:

Hi! Is there any news on this front? Having filters show the correct nbHits would be amazing!

@curquiza (Member) commented May 11, 2021:

Hello @Vincent56, the product team is currently working on the expected behavior we want for nbHits. We'll keep you informed 🙂
It might help us if you shared your use case for nbHits with us.

@celilsahin commented May 28, 2021:

> Hello @Vincent56, the product team is currently working on the expected behavior we want for nbHits. We'll keep you informed 🙂
> It might help us if you shared your use case for nbHits with us.

My filters:
{ "q":"", "limit":10, "offset":0, "filters": "city=test AND confirm!=DELETE AND authorId=T8jFa4" }
{ "q":"", "limit":10, "offset":0, "filters": "city=test AND confirm!=DELETE" }

nbHits comes back as 76+, but the actual hits are 15.


Successfully merging this pull request may close these issues.

Question: Combine facets with filter
Total result count (nbHits) not updated for filters
10 participants