Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort by distance can now optimize top-hits collection #112

Merged
merged 5 commits into from Apr 6, 2021

Conversation

iverase
Copy link
Collaborator

@iverase iverase commented Mar 29, 2021

Since apache/lucene-solr#1856, sort by distance can optimise top-hits and skip documents. This commit removes the test that expects no documents are skipped and it is making the benchmark to fail.

closes #111

// IndexSearcher can never optimize top-hits collection in that case,
// se we should get accurate hit counts
throw new AssertionError();
}
totHits += hits.totalHits.value;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should do something about totHits too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, total hits just tell us how many docs we have visited and we use it to compute the number of documents per second. Not sure exactly what we can do.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can do like we do for nearest neigbors: in the final message we replace totHits with the number that really matters, e.g. in this case searcher.count(q)?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad to see we are tripping this AssertionError! It means Lucene was indeed able to optimize this case thanks to LUCENE-9449, yay!

Yeah I don't think just removing the AssertionError check is enough, because the hits.totalHits will typically be a (often big) undercount of how many possible hits were "considered". Though, it is a fair count of how many hits it did in fact check?

Using searcher.count(q) seems good, except, we need this call to not be included in the benchmark runtime. I think what we need is a loop outside of the tStart - tEnd runtime, that sums up the searcher.count(q) for all queries tested. Or, maybe, we need to somehow disable this opto, so we can again test the raw performance of the query, forcing it to visit all matches? Seems not natural though.

Maybe we can add one ITER, and use that last iteration to do the counting, while skipping the timing measurement for it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a list to collect the queries in case of doDistanceSort. Then before computing the result, we do the actual count using the collected queries. wdyt?

}

@Override
public PointsReader fieldsReader(SegmentReadState readState) throws IOException {
return new Lucene86PointsReader(readState);
return new Lucene90PointsReader(readState);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We move the Points format to Lucene90

Copy link
Collaborator

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach looks good to me.

@@ -100,6 +100,8 @@
import org.apache.lucene.document.Field;


import javax.management.Query;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woops?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDE playing tricks :) remove it

@jpountz jpountz merged commit 0ca7eee into mikemccand:master Apr 6, 2021
Copy link
Owner

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks great, thanks @iverase and @jpountz!

@mikemccand
Copy link
Owner

OK and the chart has a couple nice data points finally! https://home.apache.org/~mikemccand/geobench.html#search-sort

Showing a nice jump in "effective" M hits/sec!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"Box Filter, Sort by Distance" benchmark failing since 26-08-2020
3 participants