Sort by distance can now optimize top-hits collection #112

iverase · 2021-03-29T13:37:42Z

Since apache/lucene-solr#1856, sorting by distance can optimize top-hits collection and skip documents. This commit removes the check that expected no documents to be skipped, which was causing the benchmark to fail.

closes #111

jpountz · 2021-03-29T14:07:46Z

src/main/perf/IndexAndSearchOpenStreetMaps.java

-                    	  // IndexSearcher can never optimize top-hits collection in that case,
-                    	  // se we should get accurate hit counts
-                    	  throw new AssertionError();
-                      }
                      totHits += hits.totalHits.value;


We should do something about totHits too?

Yeah, total hits just tells us how many docs we have visited, and we use it to compute the number of documents per second. I'm not sure exactly what we can do.

Maybe we can do what we do for nearest neighbors: in the final message, we replace totHits with the number that really matters, e.g. in this case searcher.count(q)?

I'm glad to see we are tripping this AssertionError! It means Lucene was indeed able to optimize this case thanks to LUCENE-9449, yay!

Yeah, I don't think just removing the AssertionError check is enough, because hits.totalHits will typically be an (often big) undercount of how many possible hits were "considered". Though it is a fair count of how many hits it did in fact check?

Using searcher.count(q) seems good, except we need this call to not be included in the benchmark runtime. I think what we need is a loop outside of the tStart - tEnd timed region that sums up searcher.count(q) for all queries tested. Or maybe we need to somehow disable this optimization, so we can again test the raw performance of the query by forcing it to visit all matches? That seems unnatural, though.

Maybe we can add one ITER, and use that last iteration to do the counting, while skipping the timing measurement for it.

I added a list to collect the queries in case of doDistanceSort. Then before computing the result, we do the actual count using the collected queries. wdyt?
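The approach described above (collect the queries during the timed loop, then compute the real counts once timing has stopped) could look roughly like the sketch below. This is not the code from this PR: the class `EffectiveHitCount`, its `run` method, and the `Result` type are hypothetical, and the two function parameters stand in for the Lucene calls so that the example runs without Lucene on the classpath. In the benchmark, `timedSearch` would correspond to reading `hits.totalHits.value` from the sorted search, and `exactCount` to `searcher.count(q)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToLongFunction;

/** Sketch only: all names are hypothetical, not taken from IndexAndSearchOpenStreetMaps.java. */
final class EffectiveHitCount {

  /** Outcome of one run: timed-phase duration plus both hit totals. */
  static final class Result {
    final long elapsedNanos;   // duration of the timed search loop only
    final long reportedHits;   // sum of the (possibly undercounted) totals seen while timing
    final long effectiveHits;  // sum of exact counts, computed after timing stopped

    Result(long elapsedNanos, long reportedHits, long effectiveHits) {
      this.elapsedNanos = elapsedNanos;
      this.reportedHits = reportedHits;
      this.effectiveHits = effectiveHits;
    }
  }

  /**
   * Runs every query through timedSearch inside the timed region, remembers the
   * queries, and only afterwards asks exactCount for the real number of matches.
   */
  static <Q> Result run(List<Q> queries,
                        ToLongFunction<Q> timedSearch,
                        ToLongFunction<Q> exactCount) {
    List<Q> executed = new ArrayList<>();
    long reported = 0;

    long tStart = System.nanoTime();
    for (Q q : queries) {
      reported += timedSearch.applyAsLong(q);
      executed.add(q); // cheap bookkeeping; the expensive count happens later
    }
    long tEnd = System.nanoTime();

    // Outside the timed region: exact counts do not affect the measured runtime.
    long effective = 0;
    for (Q q : executed) {
      effective += exactCount.applyAsLong(q);
    }
    return new Result(tEnd - tStart, reported, effective);
  }

  public static void main(String[] args) {
    List<String> queries = Arrays.asList("a", "bb", "ccc");
    // Made-up stand-ins: the "exact" count depends on the query string, and the
    // timed search reports at most 10 hits, mimicking a search that skipped documents.
    ToLongFunction<String> exact = q -> q.length() * 100L;
    ToLongFunction<String> timed = q -> Math.min(exact.applyAsLong(q), 10L);

    Result r = run(queries, timed, exact);
    System.out.println("reported=" + r.reportedHits + " effective=" + r.effectiveHits);
  }
}
```

The hits-per-second figure would then be derived from `effectiveHits` and `elapsedNanos`, so the extra counting pass changes what is reported without changing what is timed.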

iverase · 2021-04-06T08:47:36Z

src/main/perf/IndexAndSearchOpenStreetMaps.java

            }

            @Override
            public PointsReader fieldsReader(SegmentReadState readState) throws IOException {
-              return new Lucene86PointsReader(readState);
+              return new Lucene90PointsReader(readState);


We moved the points format to Lucene90.

jpountz

This approach looks good to me.

jpountz · 2021-04-06T08:51:10Z

src/main/perf/IndexAndSearchOpenStreetMaps.java

@@ -100,6 +100,8 @@
 import org.apache.lucene.document.Field;


+import javax.management.Query;


IDE playing tricks :) remove it

mikemccand

Change looks great, thanks @iverase and @jpountz!

mikemccand · 2021-04-06T19:40:57Z

OK and the chart has a couple nice data points finally! https://home.apache.org/~mikemccand/geobench.html#search-sort

Showing a nice jump in "effective" M hits/sec!

iverase added 2 commits March 29, 2021 15:33

Remove unnecessary check

cdd7a3e

Merge branch 'master' into fixBoxSort

a112667

jpountz reviewed Mar 29, 2021

iverase added 2 commits April 6, 2021 10:35

fix totHits

ae521b0

typo

bfa9c87

iverase commented Apr 6, 2021

jpountz approved these changes Apr 6, 2021

wrong import

9b0cfc6

jpountz merged commit 0ca7eee into mikemccand:master Apr 6, 2021

mikemccand reviewed Apr 6, 2021


