Switch Lucene DocValues codec to Mode.BEST_SPEED #86

gautamworah96 · 2020-11-17T20:55:09Z

We have noticed a big drop in faceting throughput for all Day of Year facets. One possible explanation is the change that added two modes to the Lucene codec: BEST_SPEED and BEST_COMPRESSION.

It is possible that using the BEST_COMPRESSION mode caused the drop in faceting throughput. This issue is to confirm whether FACET_FIELD_DV_FORMAT_DEFAULT='Lucene80' uses BEST_COMPRESSION by default and, if so, whether changing it to BEST_SPEED restores the previous throughput.


gautamworah96 · 2020-11-17T23:57:12Z

The default mode for Lucene's Lucene80DocValuesFormat() constructor is BEST_SPEED, as returned here, which means there is a very high possibility that the drop was due to this change. I will re-run benchmarks on my change to confirm whether that commit caused the regression.

jpountz · 2020-11-18T15:57:56Z

@gautamworah96 We had seen an increase in performance for these tasks when introducing compression, because it seems that the compressed format performs better for linear scans and worse for selective queries. So it's not unlikely that your change did slow down these tasks.
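
A rough way to picture this trade-off is to count block decompressions. The sketch below assumes values are stored in fixed-size compressed blocks and that reading any value means decompressing its whole block. The block size of 32, the class name, and the method name are hypothetical choices for illustration, not Lucene's API or its actual on-disk layout.

```java
import java.util.stream.IntStream;

class BlockAccessSketch {
    // Hypothetical block size, chosen only for illustration.
    static final int BLOCK_SIZE = 32;

    /** Counts block decompressions needed to read the given doc IDs in increasing order. */
    public static int countBlockDecompressions(int[] sortedDocIds) {
        int decompressions = 0;
        int currentBlock = -1; // no block has been decoded yet
        for (int docId : sortedDocIds) {
            int block = docId / BLOCK_SIZE;
            if (block != currentBlock) { // entering a new block forces a decompression
                decompressions++;
                currentBlock = block;
            }
        }
        return decompressions;
    }

    public static void main(String[] args) {
        int numDocs = 3200;
        // Linear scan: every document is visited.
        int[] allDocs = IntStream.range(0, numDocs).toArray();
        // Selective query: a single hit in every block (every 32nd document).
        int[] sparseDocs = IntStream.range(0, numDocs).filter(d -> d % BLOCK_SIZE == 0).toArray();

        System.out.printf("scan: %d docs, %d decompressions%n",
            allDocs.length, countBlockDecompressions(allDocs));
        System.out.printf("selective: %d docs, %d decompressions%n",
            sparseDocs.length, countBlockDecompressions(sparseDocs));
    }
}
```

Under these assumptions the scan reads 3200 values for 100 decompressions, while the selective pass pays the same 100 decompressions to read only 100 values, so its cost per value read is far higher.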

gautamworah96 · 2020-11-19T08:04:47Z

Results from luceneutil benchmarks comparing LUCENE-9450 Use BinaryDocValues in the taxonomy writer #1733 (commit 3f8f84f9b063277e9017221bfc5e80fb901fc1ce) with mainline commit 06877b2c6e47bc481a79d7bedd8ea4fb099f1b4c (which contains the changes for enabling compression)

                    Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
            OrHighNotMed      481.89      (4.6%)      462.06      (6.0%)   -4.1% ( -14% -    6%) 0.015
       HighTermMonthSort       44.99     (14.5%)       43.94     (11.0%)   -2.3% ( -24% -   27%) 0.567
              TermDTSort       17.80     (10.6%)       17.56      (9.3%)   -1.3% ( -19% -   20%) 0.676
                 Prefix3       49.40      (4.2%)       48.79      (4.3%)   -1.2% (  -9% -    7%) 0.358
           OrNotHighHigh      451.33      (4.8%)      446.17      (6.6%)   -1.1% ( -11% -   10%) 0.532
               MedPhrase       32.96      (2.9%)       32.68      (3.2%)   -0.8% (  -6% -    5%) 0.395
            OrNotHighMed      420.25      (4.4%)      416.95      (4.4%)   -0.8% (  -9% -    8%) 0.574
               LowPhrase       47.90      (3.1%)       47.54      (3.9%)   -0.7% (  -7% -    6%) 0.502
    HighIntervalsOrdered        8.29      (2.7%)        8.23      (2.6%)   -0.6% (  -5% -    4%) 0.444
   HighTermDayOfYearSort       27.30     (13.9%)       27.12     (10.2%)   -0.6% ( -21% -   27%) 0.870
               OrHighLow      372.74      (4.2%)      370.68      (4.8%)   -0.6% (  -9% -    8%) 0.698
            OrNotHighLow      337.47      (4.2%)      335.75      (3.5%)   -0.5% (  -7% -    7%) 0.678
                Wildcard       27.52      (3.5%)       27.39      (3.4%)   -0.5% (  -7% -    6%) 0.653
                 Respell       37.86      (2.4%)       37.68      (2.4%)   -0.5% (  -5% -    4%) 0.543
   BrowseMonthSSDVFacets        2.95      (0.5%)        2.94      (0.5%)   -0.3% (  -1% -    0%) 0.055
                PKLookup      130.30      (2.0%)      129.98      (2.1%)   -0.2% (  -4% -    3%) 0.702
   BrowseMonthTaxoFacets        0.76      (6.5%)        0.76      (6.3%)   -0.2% ( -12% -   13%) 0.913
         MedSloppyPhrase        2.52      (5.0%)        2.51      (5.2%)   -0.2% (  -9% -   10%) 0.918
              AndHighMed       62.40      (4.4%)       62.30      (3.6%)   -0.2% (  -7% -    8%) 0.900
                 MedTerm      809.18      (5.3%)      807.88      (7.1%)   -0.2% ( -11% -   12%) 0.936
              HighPhrase      182.49      (6.9%)      182.29      (6.9%)   -0.1% ( -12% -   14%) 0.959
                  IntNRQ       17.63     (22.3%)       17.62     (22.3%)   -0.1% ( -36% -   57%) 0.994
BrowseDayOfYearSSDVFacets        2.76      (0.9%)        2.76      (0.8%)   -0.1% (  -1% -    1%) 0.837
    BrowseDateTaxoFacets        0.72      (6.4%)        0.72      (6.3%)   -0.0% ( -11% -   13%) 0.980
                  Fuzzy1       52.90      (6.8%)       52.88      (7.8%)   -0.0% ( -13% -   15%) 0.987
        HighSloppyPhrase       11.12      (5.0%)       11.12      (5.0%)   -0.0% (  -9% -   10%) 0.991
    HighTermTitleBDVSort       24.85     (14.6%)       24.85     (14.6%)    0.0% ( -25% -   34%) 0.998
BrowseDayOfYearTaxoFacets        0.72      (6.4%)        0.72      (6.3%)    0.0% ( -11% -   13%) 0.985
                  Fuzzy2       43.74      (8.3%)       43.77      (6.2%)    0.1% ( -13% -   15%) 0.977
            OrHighNotLow      524.18      (6.2%)      524.87      (7.6%)    0.1% ( -12% -   14%) 0.953
         LowSloppyPhrase        4.07      (3.6%)        4.09      (3.8%)    0.3% (  -6% -    8%) 0.801
            HighSpanNear        4.77      (2.4%)        4.79      (2.2%)    0.3% (  -4% -    5%) 0.640
             AndHighHigh       64.34      (4.2%)       64.64      (4.4%)    0.5% (  -7% -    9%) 0.732
               OrHighMed       24.49      (2.1%)       24.63      (1.7%)    0.6% (  -3% -    4%) 0.368
             LowSpanNear        6.52      (2.7%)        6.56      (2.2%)    0.7% (  -4% -    5%) 0.395
             MedSpanNear        2.73      (2.0%)        2.75      (2.0%)    0.7% (  -3% -    4%) 0.240
              AndHighLow      387.05      (4.1%)      390.05      (5.2%)    0.8% (  -8% -   10%) 0.601
              OrHighHigh       14.39      (1.9%)       14.51      (1.9%)    0.8% (  -2% -    4%) 0.163
                 LowTerm      800.22      (5.1%)      807.69      (6.2%)    0.9% (  -9% -   12%) 0.601
           OrHighNotHigh      397.07      (5.7%)      401.22      (5.0%)    1.0% (  -9% -   12%) 0.537
                HighTerm      767.23      (5.4%)      779.87      (7.7%)    1.6% ( -10% -   15%) 0.434
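
To read the table, it can help to recompute the Pct diff column from the two QPS columns and to look at the p-value next to it. The sketch below is one way to do that for a single row. The class and method names are hypothetical, the parsing assumes the whitespace-separated layout shown above, and the 0.05 cutoff is only the conventional significance threshold, not something luceneutil prescribes.

```java
class BenchRowSketch {
    /** Relative QPS change in percent; positive means the modified version is faster. */
    public static double pctDiff(double baselineQps, double modifiedQps) {
        return 100.0 * (modifiedQps - baselineQps) / baselineQps;
    }

    /** Parses one result row and returns {baseline QPS, modified QPS, p-value}. */
    public static double[] parseRow(String row) {
        String[] tokens = row.trim().split("\\s+");
        double baseline = Double.parseDouble(tokens[1]);               // second column
        double modified = Double.parseDouble(tokens[3]);               // fourth column
        double pValue = Double.parseDouble(tokens[tokens.length - 1]); // last column
        return new double[] {baseline, modified, pValue};
    }

    public static void main(String[] args) {
        String[] rows = {
            "OrHighNotMed      481.89      (4.6%)      462.06      (6.0%)   -4.1% ( -14% -    6%) 0.015",
            "BrowseDayOfYearTaxoFacets        0.72      (6.4%)        0.72      (6.3%)    0.0% ( -11% -   13%) 0.985"
        };
        for (String row : rows) {
            double[] r = parseRow(row);
            // Assumed cutoff: treat p < 0.05 as a significant difference.
            boolean significant = r[2] < 0.05;
            System.out.printf("%.1f%% change, p=%.3f, significant=%b%n",
                pctDiff(r[0], r[1]), r[2], significant);
        }
    }
}
```

For the OrHighNotMed row this reproduces the reported -4.1%. With the assumed 0.05 cutoff, none of the Browse*Facets rows in the table count as a significant change.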

mikemccand · 2021-01-09T14:40:01Z

I think all we need to do for this issue is re-enable BINARY doc values compression for the taxonomy index in the nightly benchmarks?

Also, in my $day job (Amazon's customer-facing product search) we found that enabling compression for the taxonomy index was performance neutral and shrank the taxonomy index, even though we count far fewer hits than the "pure browse" tests here, which count facets for every document in the index.

Net/net, compression for Lucene's facets seems to be a good thing!

gautamworah96 · 2021-01-12T07:49:28Z

I am running two benchmarks now: one with changes in luceneutil and one with the mainline luceneutil.

luceneutil diff:

diff --git a/src/main/perf/Indexer.java b/src/main/perf/Indexer.java
index 65be6ec..0c8e156 100644
--- a/src/main/perf/Indexer.java
+++ b/src/main/perf/Indexer.java
@@ -37,6 +37,7 @@ import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.codecs.Codec;
 import org.apache.lucene.codecs.DocValuesFormat;
 import org.apache.lucene.codecs.PostingsFormat;
+import org.apache.lucene.codecs.lucene80.Lucene80DocValuesFormat;
 import org.apache.lucene.codecs.lucene90.Lucene90Codec;
 import org.apache.lucene.facet.FacetsConfig;
 import org.apache.lucene.facet.taxonomy.TaxonomyWriter;
@@ -405,7 +406,7 @@ public final class Indexer {
                                         idFieldPostingsFormat : defaultPostingsFormat);
         }
 
-        private final DocValuesFormat facetsDVFormat = DocValuesFormat.forName(facetDVFormatName);
+        private final DocValuesFormat facetsDVFormat = new Lucene80DocValuesFormat(Lucene80DocValuesFormat.Mode.BEST_COMPRESSION);
         //private final DocValuesFormat lucene42DVFormat = DocValuesFormat.forName("Lucene42");
         //private final DocValuesFormat diskDVFormat = DocValuesFormat.forName("Disk");
 //        private final DocValuesFormat lucene45DVFormat = DocValuesFormat.forName("Lucene45");

Edit: benchmark results before and after this change show no difference. I'll try passing the Lucene80DocValuesFormat.Mode.BEST_COMPRESSION mode directly to the Lucene90Codec constructor here.

gautamworah96 · 2021-01-13T19:13:43Z

A git diff that ignores the idFieldPostingsFormat and facetsDVFormat passed in via the command-line args:

diff --git a/src/main/perf/Indexer.java b/src/main/perf/Indexer.java
index 65be6ec..beca090 100644
--- a/src/main/perf/Indexer.java
+++ b/src/main/perf/Indexer.java
@@ -423,7 +423,7 @@ public final class Indexer {
         }
       };
 
-    iwc.setCodec(codec);
+    iwc.setCodec(new Lucene90Codec(Lucene90Codec.Mode.BEST_COMPRESSION));
 
     System.out.println("IW config=" + iwc);

gives the following results:

Task	Mainline	Updated codec
BrowseMonthSSDVFacets	2.94	2.94
BrowseDayOfYearSSDVFacets	2.76	2.77
BrowseDayOfYearTaxoFacets	0.71	1.22
BrowseDateTaxoFacets	0.71	1.22
BrowseMonthTaxoFacets	0.74	1.4
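
To put the table in relative terms, the QPS ratio of the updated codec to mainline can be computed per task. This is a minimal sketch with a hypothetical class and method name; the numbers are copied from the table above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SpeedupSketch {
    /** Ratio of updated QPS to mainline QPS; values above 1.0 mean the updated codec is faster. */
    public static double speedup(double mainlineQps, double updatedQps) {
        return updatedQps / mainlineQps;
    }

    public static void main(String[] args) {
        // QPS values copied from the results table: {mainline, updated codec}.
        Map<String, double[]> results = new LinkedHashMap<>();
        results.put("BrowseMonthSSDVFacets", new double[] {2.94, 2.94});
        results.put("BrowseDayOfYearSSDVFacets", new double[] {2.76, 2.77});
        results.put("BrowseDayOfYearTaxoFacets", new double[] {0.71, 1.22});
        results.put("BrowseDateTaxoFacets", new double[] {0.71, 1.22});
        results.put("BrowseMonthTaxoFacets", new double[] {0.74, 1.4});

        for (Map.Entry<String, double[]> entry : results.entrySet()) {
            double[] qps = entry.getValue();
            System.out.printf("%-26s %.2fx%n", entry.getKey(), speedup(qps[0], qps[1]));
        }
    }
}
```

On these numbers the three taxonomy tasks come out at roughly 1.7x to 1.9x, while the two SSDV tasks stay at about 1.0x.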

The next step is to just remove the getDocValuesFormatForField override function and set the default mode to BEST_COMPRESSION. The command line facetsDVFormat param will also have to be removed (since it is redundant now).

mikemccand · 2021-01-15T12:18:04Z

> gives the following results:

Whoa, great! This confirms that enabling compression substantially speeds up pure browse faceting use cases based on the taxonomy index. SSDV facets are not impacted, which is expected because they do not even use the taxonomy index.

> The command line facetsDVFormat param will also have to be removed (since it is redundant now).

OK.

mikemccand · 2021-01-15T12:18:14Z

Thanks for digging @gautamworah96!

gautamworah96 · 2021-01-25T17:37:52Z

PR for this issue was merged

mikemccand · 2021-01-26T14:19:55Z

This gave an impressive bump in QPS in nightly benchmarks, e.g. pure browse facets for last-modified year/month/day. Looks like ~30% speedup, much more than the previous ~5% speedup. Confused :)

mikemccand · 2021-01-26T14:21:50Z

Hmm, and facet performance for a simple TermQuery plummeted, ~60% drop!?

I will add an annotation.

gautamworah96 closed this as completed Jan 25, 2021
