Switch Lucene DocValues codec to Mode.BEST_SPEED #86

Closed
gautamworah96 opened this issue Nov 17, 2020 · 11 comments

@gautamworah96
Contributor

We have noticed a large drop in faceting throughput for all Day of Year facets. One possible explanation is the change that added two modes to the Lucene codec: BEST_SPEED and BEST_COMPRESSION.

It is possible that using BEST_COMPRESSION mode caused the drop. This issue is to confirm whether FACET_FIELD_DV_FORMAT_DEFAULT='Lucene80' uses BEST_COMPRESSION by default and, if so, whether changing it to BEST_SPEED restores the previous throughput.

@gautamworah96
Contributor Author

gautamworah96 commented Nov 17, 2020

The default mode for Lucene's Lucene80DocValuesFormat() constructor is BEST_SPEED, as returned here, which means there is a very high possibility that the drop was due to this change. I will re-run the benchmarks on my change to confirm whether that was the commit that caused the regression.

@jpountz
Collaborator

jpountz commented Nov 18, 2020

@gautamworah96 We saw a performance increase for these tasks when compression was introduced, because the compressed format seems to perform better for linear scans and worse for selective queries. So it's not unlikely that your change did not slow down these tasks.
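One plausible reading of the linear-scan versus selective-query remark is that block-compressed values must be decoded a whole block at a time, so a scan amortizes each decode over many documents while a selective query pays for a full block per hit. The sketch below is a toy model of that amortization only. The class `BlockScanModel`, the block size of 32 documents, and the stride values are assumptions made for illustration, not Lucene code, and the model ignores I/O and cache effects, so on its own it cannot show that compression beats the uncompressed format for scans.

```java
/** Toy model: values live in fixed-size blocks that must be decoded as a whole. */
final class BlockScanModel {
    private final int docsPerBlock;
    private int cachedBlock = -1;    // index of the block that is currently decoded
    private long blocksDecoded = 0;  // number of block decodes performed so far

    BlockScanModel(int docsPerBlock) {
        this.docsPerBlock = docsPerBlock;
    }

    /** Simulates reading one document's value, decoding its block if it is not cached. */
    void read(int docId) {
        int block = docId / docsPerBlock;
        if (block != cachedBlock) {
            cachedBlock = block;
            blocksDecoded++;
        }
    }

    long blocksDecoded() {
        return blocksDecoded;
    }

    /** Average number of block decodes per read when visiting every stride-th document. */
    static double blocksPerRead(int numDocs, int docsPerBlock, int stride) {
        BlockScanModel model = new BlockScanModel(docsPerBlock);
        int reads = 0;
        for (int doc = 0; doc < numDocs; doc += stride) {
            model.read(doc);
            reads++;
        }
        return (double) model.blocksDecoded() / reads;
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000;
        int docsPerBlock = 32; // assumed block size, chosen for illustration
        // stride 1 = visit every document (pure browse); stride 1000 = sparse hits
        System.out.println("linear scan:     " + blocksPerRead(numDocs, docsPerBlock, 1));
        System.out.println("selective query: " + blocksPerRead(numDocs, docsPerBlock, 1000));
    }
}
```

In this model the scan decodes one block per 32 reads, while the sparse access pattern decodes one block for every read.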

@gautamworah96
Contributor Author

gautamworah96 commented Nov 19, 2020

Results from luceneutil benchmarks comparing "LUCENE-9450 Use BinaryDocValues in the taxonomy writer" #1733 (commit 3f8f84f9b063277e9017221bfc5e80fb901fc1ce) with mainline commit 06877b2c6e47bc481a79d7bedd8ea4fb099f1b4c (which contains the changes for enabling compression):

                     Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
            OrHighNotMed      481.89      (4.6%)      462.06      (6.0%)   -4.1% ( -14% -    6%) 0.015
       HighTermMonthSort       44.99     (14.5%)       43.94     (11.0%)   -2.3% ( -24% -   27%) 0.567
              TermDTSort       17.80     (10.6%)       17.56      (9.3%)   -1.3% ( -19% -   20%) 0.676
                 Prefix3       49.40      (4.2%)       48.79      (4.3%)   -1.2% (  -9% -    7%) 0.358
           OrNotHighHigh      451.33      (4.8%)      446.17      (6.6%)   -1.1% ( -11% -   10%) 0.532
               MedPhrase       32.96      (2.9%)       32.68      (3.2%)   -0.8% (  -6% -    5%) 0.395
            OrNotHighMed      420.25      (4.4%)      416.95      (4.4%)   -0.8% (  -9% -    8%) 0.574
               LowPhrase       47.90      (3.1%)       47.54      (3.9%)   -0.7% (  -7% -    6%) 0.502
    HighIntervalsOrdered        8.29      (2.7%)        8.23      (2.6%)   -0.6% (  -5% -    4%) 0.444
   HighTermDayOfYearSort       27.30     (13.9%)       27.12     (10.2%)   -0.6% ( -21% -   27%) 0.870
               OrHighLow      372.74      (4.2%)      370.68      (4.8%)   -0.6% (  -9% -    8%) 0.698
            OrNotHighLow      337.47      (4.2%)      335.75      (3.5%)   -0.5% (  -7% -    7%) 0.678
                Wildcard       27.52      (3.5%)       27.39      (3.4%)   -0.5% (  -7% -    6%) 0.653
                 Respell       37.86      (2.4%)       37.68      (2.4%)   -0.5% (  -5% -    4%) 0.543
   BrowseMonthSSDVFacets        2.95      (0.5%)        2.94      (0.5%)   -0.3% (  -1% -    0%) 0.055
                PKLookup      130.30      (2.0%)      129.98      (2.1%)   -0.2% (  -4% -    3%) 0.702
   BrowseMonthTaxoFacets        0.76      (6.5%)        0.76      (6.3%)   -0.2% ( -12% -   13%) 0.913
         MedSloppyPhrase        2.52      (5.0%)        2.51      (5.2%)   -0.2% (  -9% -   10%) 0.918
              AndHighMed       62.40      (4.4%)       62.30      (3.6%)   -0.2% (  -7% -    8%) 0.900
                 MedTerm      809.18      (5.3%)      807.88      (7.1%)   -0.2% ( -11% -   12%) 0.936
              HighPhrase      182.49      (6.9%)      182.29      (6.9%)   -0.1% ( -12% -   14%) 0.959
                  IntNRQ       17.63     (22.3%)       17.62     (22.3%)   -0.1% ( -36% -   57%) 0.994
BrowseDayOfYearSSDVFacets        2.76      (0.9%)        2.76      (0.8%)   -0.1% (  -1% -    1%) 0.837
    BrowseDateTaxoFacets        0.72      (6.4%)        0.72      (6.3%)   -0.0% ( -11% -   13%) 0.980
                  Fuzzy1       52.90      (6.8%)       52.88      (7.8%)   -0.0% ( -13% -   15%) 0.987
        HighSloppyPhrase       11.12      (5.0%)       11.12      (5.0%)   -0.0% (  -9% -   10%) 0.991
    HighTermTitleBDVSort       24.85     (14.6%)       24.85     (14.6%)    0.0% ( -25% -   34%) 0.998
BrowseDayOfYearTaxoFacets        0.72      (6.4%)        0.72      (6.3%)    0.0% ( -11% -   13%) 0.985
                  Fuzzy2       43.74      (8.3%)       43.77      (6.2%)    0.1% ( -13% -   15%) 0.977
            OrHighNotLow      524.18      (6.2%)      524.87      (7.6%)    0.1% ( -12% -   14%) 0.953
         LowSloppyPhrase        4.07      (3.6%)        4.09      (3.8%)    0.3% (  -6% -    8%) 0.801
            HighSpanNear        4.77      (2.4%)        4.79      (2.2%)    0.3% (  -4% -    5%) 0.640
             AndHighHigh       64.34      (4.2%)       64.64      (4.4%)    0.5% (  -7% -    9%) 0.732
               OrHighMed       24.49      (2.1%)       24.63      (1.7%)    0.6% (  -3% -    4%) 0.368
             LowSpanNear        6.52      (2.7%)        6.56      (2.2%)    0.7% (  -4% -    5%) 0.395
             MedSpanNear        2.73      (2.0%)        2.75      (2.0%)    0.7% (  -3% -    4%) 0.240
              AndHighLow      387.05      (4.1%)      390.05      (5.2%)    0.8% (  -8% -   10%) 0.601
              OrHighHigh       14.39      (1.9%)       14.51      (1.9%)    0.8% (  -2% -    4%) 0.163
                 LowTerm      800.22      (5.1%)      807.69      (6.2%)    0.9% (  -9% -   12%) 0.601
           OrHighNotHigh      397.07      (5.7%)      401.22      (5.0%)    1.0% (  -9% -   12%) 0.537
                HighTerm      767.23      (5.4%)      779.87      (7.7%)    1.6% ( -10% -   15%) 0.434
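As a reading aid for the table: the Pct diff column matches the relative change in QPS of the modified version against the baseline, and the p-value column indicates whether that change is distinguishable from run-to-run noise. The sketch below recomputes both for a few rows copied from the table. The `BenchRow` class and the 0.05 threshold are assumptions for illustration and are not taken from the thread or from luceneutil. Because the printed QPS values are rounded, recomputed percentages may differ slightly from the table.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** One row of the results table: task name, QPS for both builds, and the p-value. */
final class BenchRow {
    final String task;
    final double baselineQps;
    final double candidateQps;
    final double pValue;

    BenchRow(String task, double baselineQps, double candidateQps, double pValue) {
        this.task = task;
        this.baselineQps = baselineQps;
        this.candidateQps = candidateQps;
        this.pValue = pValue;
    }

    /** Relative change of the candidate versus the baseline, in percent. */
    double pctDiff() {
        return 100.0 * (candidateQps - baselineQps) / baselineQps;
    }

    /** True when the p-value falls below the chosen threshold. */
    boolean isSignificant(double alpha) {
        return pValue < alpha;
    }

    public static void main(String[] args) {
        double alpha = 0.05; // assumed threshold, not stated in the thread
        List<BenchRow> rows = Arrays.asList(
            new BenchRow("OrHighNotMed", 481.89, 462.06, 0.015),
            new BenchRow("BrowseMonthSSDVFacets", 2.95, 2.94, 0.055),
            new BenchRow("BrowseDayOfYearTaxoFacets", 0.72, 0.72, 0.985),
            new BenchRow("HighTerm", 767.23, 779.87, 0.434));
        for (BenchRow r : rows) {
            System.out.printf(Locale.ROOT, "%-26s %+5.1f%%  %s%n",
                r.task, r.pctDiff(), r.isSignificant(alpha) ? "significant" : "within noise");
        }
    }
}
```

Under that assumed threshold, only OrHighNotMed among these rows counts as a significant change; none of the facet tasks do.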

@mikemccand
Owner

I think all we need to do for this issue is re-enable BINARY doc values compression for taxonomy index in the nightly benchmarks?

Also, in my $day job (Amazon's customer-facing product search) we found that enabling compression for the taxonomy index was performance neutral and shrank the taxonomy index, even though we count far fewer hits than the "pure browse" tests here, which count facets for every document in the index.

Net/net, compression for Lucene's facets seems to be a good thing!

@gautamworah96
Contributor Author

gautamworah96 commented Jan 12, 2021

I am running two benchmarks now: one with changes in luceneutil and one with the mainline luceneutil.

luceneutil diff:

diff --git a/src/main/perf/Indexer.java b/src/main/perf/Indexer.java
index 65be6ec..0c8e156 100644
--- a/src/main/perf/Indexer.java
+++ b/src/main/perf/Indexer.java
@@ -37,6 +37,7 @@ import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.codecs.Codec;
 import org.apache.lucene.codecs.DocValuesFormat;
 import org.apache.lucene.codecs.PostingsFormat;
+import org.apache.lucene.codecs.lucene80.Lucene80DocValuesFormat;
 import org.apache.lucene.codecs.lucene90.Lucene90Codec;
 import org.apache.lucene.facet.FacetsConfig;
 import org.apache.lucene.facet.taxonomy.TaxonomyWriter;
@@ -405,7 +406,7 @@ public final class Indexer {
                                         idFieldPostingsFormat : defaultPostingsFormat);
         }
 
-        private final DocValuesFormat facetsDVFormat = DocValuesFormat.forName(facetDVFormatName);
+        private final DocValuesFormat facetsDVFormat = new Lucene80DocValuesFormat(Lucene80DocValuesFormat.Mode.BEST_COMPRESSION);
         //private final DocValuesFormat lucene42DVFormat = DocValuesFormat.forName("Lucene42");
         //private final DocValuesFormat diskDVFormat = DocValuesFormat.forName("Disk");
 //        private final DocValuesFormat lucene45DVFormat = DocValuesFormat.forName("Lucene45");

Edit: benchmark results before and after this change show no difference. Next I'll try passing the Lucene80DocValuesFormat.Mode.BEST_COMPRESSION mode directly into the Lucene90Codec constructor.

@gautamworah96
Contributor Author

Git diff that disregards the idFieldPostingsFormat and facetsDVFormat passed in from the command-line args:

diff --git a/src/main/perf/Indexer.java b/src/main/perf/Indexer.java
index 65be6ec..beca090 100644
--- a/src/main/perf/Indexer.java
+++ b/src/main/perf/Indexer.java
@@ -423,7 +423,7 @@ public final class Indexer {
         }
       };
 
-    iwc.setCodec(codec);
+    iwc.setCodec(new Lucene90Codec(Lucene90Codec.Mode.BEST_COMPRESSION));
 
     System.out.println("IW config=" + iwc);
 

gives the following results:

Task                       Mainline  Updated codec
BrowseMonthSSDVFacets          2.94           2.94
BrowseDayOfYearSSDVFacets      2.76           2.77
BrowseDayOfYearTaxoFacets      0.71           1.22
BrowseDateTaxoFacets           0.71           1.22
BrowseMonthTaxoFacets          0.74           1.4
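To put the table in proportion, the ratio of the updated-codec figure to the mainline figure can be worked out per task. The sketch below does that with the values copied from the table. The `TaxoSpeedup` class is a hypothetical helper written for this illustration and is not part of luceneutil.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class TaxoSpeedup {
    /** Ratio of the updated-codec figure to the mainline figure. */
    static double speedup(double mainline, double updated) {
        return updated / mainline;
    }

    public static void main(String[] args) {
        // {mainline, updated codec} pairs copied from the table above
        Map<String, double[]> rows = new LinkedHashMap<>();
        rows.put("BrowseMonthSSDVFacets", new double[] {2.94, 2.94});
        rows.put("BrowseDayOfYearSSDVFacets", new double[] {2.76, 2.77});
        rows.put("BrowseDayOfYearTaxoFacets", new double[] {0.71, 1.22});
        rows.put("BrowseDateTaxoFacets", new double[] {0.71, 1.22});
        rows.put("BrowseMonthTaxoFacets", new double[] {0.74, 1.4});

        for (Map.Entry<String, double[]> e : rows.entrySet()) {
            double[] v = e.getValue();
            System.out.printf(Locale.ROOT, "%-26s %.2fx%n", e.getKey(), speedup(v[0], v[1]));
        }
    }
}
```

The three taxonomy tasks come out at roughly 1.7x to 1.9x, while the two SSDV tasks stay at about 1.0x.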

The next step is to just remove the getDocValuesFormatForField override function and set the default mode to BEST_COMPRESSION. The command line facetsDVFormat param will also have to be removed (since it is redundant now).

@mikemccand
Owner

> gives the following results:

Whoa, great! This confirms that enabling compression substantially speeds up pure browse faceting use cases based on the taxonomy index. SSDV facets are not impacted, which is expected because they do not even use the taxonomy index.

> The command line facetsDVFormat param will also have to be removed (since it is redundant now).

OK.

@mikemccand
Owner

Thanks for digging @gautamworah96!

@gautamworah96
Contributor Author

The PR for this issue was merged.

@mikemccand
Owner

This gave an impressive bump in QPS in nightly benchmarks, e.g. pure browse facets for last-modified year/month/day. Looks like ~30% speedup, much more than the previous ~5% speedup. Confused :)

@mikemccand
Owner

Hmm, and facet performance for a simple TermQuery plummeted, ~60% drop!?

I will add an annotation.
