New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sort by distance can now optimize top-hits collection #112
Conversation
// IndexSearcher can never optimize top-hits collection in that case, | ||
// se we should get accurate hit counts | ||
throw new AssertionError(); | ||
} | ||
totHits += hits.totalHits.value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should do something about totHits
too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, total hits just tell us how many docs we have visited and we use it to compute the number of documents per second. Not sure exactly what we can do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can do like we do for nearest neigbors: in the final message we replace totHits with the number that really matters, e.g. in this case searcher.count(q)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glad to see we are tripping this AssertionError
! It means Lucene was indeed able to optimize this case thanks to LUCENE-9449, yay!
Yeah I don't think just removing the AssertionError
check is enough, because the hits.totalHits
will typically be a (often big) undercount of how many possible hits were "considered". Though, it is a fair count of how many hits it did in fact check?
Using searcher.count(q)
seems good, except, we need this call to not be included in the benchmark runtime. I think what we need is a loop outside of the tStart - tEnd
runtime, that sums up the searcher.count(q)
for all queries tested. Or, maybe, we need to somehow disable this opto, so we can again test the raw performance of the query, forcing it to visit all matches? Seems not natural though.
Maybe we can add one ITER
, and use that last iteration to do the counting, while skipping the timing measurement for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a list to collect the queries in case of doDistanceSort
. Then before computing the result, we do the actual count using the collected queries. wdyt?
} | ||
|
||
@Override | ||
public PointsReader fieldsReader(SegmentReadState readState) throws IOException { | ||
return new Lucene86PointsReader(readState); | ||
return new Lucene90PointsReader(readState); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We move the Points format to Lucene90
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach looks good to me.
@@ -100,6 +100,8 @@ | |||
import org.apache.lucene.document.Field; | |||
|
|||
|
|||
import javax.management.Query; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woops?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IDE playing tricks :) remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK and the chart has a couple nice data points finally! https://home.apache.org/~mikemccand/geobench.html#search-sort Showing a nice jump in "effective" M hits/sec! |
Since apache/lucene-solr#1856, sort by distance can optimise top-hits and skip documents. This commit removes the test that expects no documents are skipped and it is making the benchmark to fail.
closes #111