baseline and modified version showing different benchmark result even if the codebase is same #142

Open
praveennish opened this issue Oct 11, 2021 · 6 comments

@praveennish

Hi @mikemccand,

I have cloned the Lucene 9 code into both the baseline and candidate folders (so the codebase is 100% the same).
I see performance differences after running this command:

python3 src/python/localrun.py -source wikimedium10k

The first output table shows:

                        Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff  p-value
                     Prefix3      407.18      (0.0%)      314.24      (0.0%)  -22.8% ( -22% -  -22%) 1.000
                 LowSpanNear     1073.76      (0.0%)      852.06      (0.0%)  -20.6% ( -20% -  -20%) 1.000
             MedSloppyPhrase     1140.22      (0.0%)      927.42      (0.0%)  -18.7% ( -18% -  -18%) 1.000
                   MedPhrase      964.51      (0.0%)      848.50      (0.0%)  -12.0% ( -12% -  -12%) 1.000
        HighIntervalsOrdered     1002.98      (0.0%)      884.65      (0.0%)  -11.8% ( -11% -  -11%) 1.000
           HighTermMonthSort     4017.92      (0.0%)     3660.73      (0.0%)   -8.9% (  -8% -   -8%) 1.000
                     Respell      512.33      (0.0%)      467.72      (0.0%)   -8.7% (  -8% -   -8%) 1.000
                HighSpanNear      893.76      (0.0%)      821.69      (0.0%)   -8.1% (  -8% -   -8%) 1.000
                      IntNRQ     1828.06      (0.0%)     1682.03      (0.0%)   -8.0% (  -7% -   -7%) 1.000
                    HighTerm     5614.10      (0.0%)     5200.05      (0.0%)   -7.4% (  -7% -   -7%) 1.000
       BrowseMonthTaxoFacets     4142.06      (0.0%)     3870.82      (0.0%)   -6.5% (  -6% -   -6%) 1.000
       HighTermDayOfYearSort     3782.61      (0.0%)     3538.93      (0.0%)   -6.4% (  -6% -   -6%) 1.000
   BrowseDayOfYearSSDVFacets     2665.19      (0.0%)     2514.64      (0.0%)   -5.6% (  -5% -   -5%) 1.000
                     LowTerm     6806.33      (0.0%)     6460.07      (0.0%)   -5.1% (  -5% -   -5%) 1.000
            HighSloppyPhrase      886.16      (0.0%)      845.10      (0.0%)   -4.6% (  -4% -   -4%) 1.000
                   OrHighMed      898.26      (0.0%)      858.97      (0.0%)   -4.4% (  -4% -   -4%) 1.000
                   LowPhrase      988.79      (0.0%)      947.64      (0.0%)   -4.2% (  -4% -   -4%) 1.000
                   OrHighLow     1171.10      (0.0%)     1124.50      (0.0%)   -4.0% (  -3% -   -3%) 1.000
        BrowseDateTaxoFacets     3796.98      (0.0%)     3648.76      (0.0%)   -3.9% (  -3% -   -3%) 1.000
                    PKLookup      326.99      (0.0%)      315.53      (0.0%)   -3.5% (  -3% -   -3%) 1.000
       BrowseMonthSSDVFacets     3212.18      (0.0%)     3110.22      (0.0%)   -3.2% (  -3% -   -3%) 1.000
                  AndHighLow     2763.74      (0.0%)     2691.30      (0.0%)   -2.6% (  -2% -   -2%) 1.000
                 MedSpanNear      634.86      (0.0%)      624.48      (0.0%)   -1.6% (  -1% -   -1%) 1.000
                    Wildcard      581.94      (0.0%)      572.55      (0.0%)   -1.6% (  -1% -   -1%) 1.000
                  HighPhrase      729.77      (0.0%)      720.61      (0.0%)   -1.3% (  -1% -   -1%) 1.000
   BrowseDayOfYearTaxoFacets     3111.47      (0.0%)     3073.01      (0.0%)   -1.2% (  -1% -   -1%) 1.000
                  OrHighHigh      430.85      (0.0%)      426.77      (0.0%)   -0.9% (   0% -    0%) 1.000
                 AndHighHigh     1029.49      (0.0%)     1028.71      (0.0%)   -0.1% (   0% -    0%) 1.000
             LowSloppyPhrase     1351.24      (0.0%)     1365.14      (0.0%)    1.0% (   1% -    1%) 1.000
                      Fuzzy2       70.31      (0.0%)       71.83      (0.0%)    2.2% (   2% -    2%) 1.000
                      Fuzzy1      324.58      (0.0%)      338.44      (0.0%)    4.3% (   4% -    4%) 1.000
         LowIntervalsOrdered     1721.13      (0.0%)     1807.65      (0.0%)    5.0% (   5% -    5%) 1.000
                     MedTerm     5749.70      (0.0%)     6042.57      (0.0%)    5.1% (   5% -    5%) 1.000
         MedIntervalsOrdered     1291.17      (0.0%)     1382.36      (0.0%)    7.1% (   7% -    7%) 1.000
                  AndHighMed     1322.11      (0.0%)     1575.31      (0.0%)   19.2% (  19% -   19%) 1.000

My expectation was that identical code would perform the same way, but there are noticeable deviations. Can you please explain this? Is this the right way to run the benchmark?
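
One thing the table above hints at: every StdDev is 0.0% and every Pct diff range collapses to a single value, which is what a single measurement per side would produce (that is an assumption read off the table, not something stated here). As a rough, hypothetical illustration of why that matters, the sketch below simulates A/A runs in which both sides draw QPS from the same noisy distribution. The 10% noise level, the 1000 QPS mean, and the function name `aa_pct_diffs` are made up for the example and say nothing about luceneutil's internals.

```python
import random
import statistics


def aa_pct_diffs(samples_per_side, trials=2000, mean_qps=1000.0,
                 noise_pct=10.0, seed=0):
    """Simulate A/A comparisons: both sides sample the same distribution.

    Returns the percent difference between the two side means, one value
    per simulated trial. All parameters are hypothetical.
    """
    rng = random.Random(seed)
    sigma = mean_qps * noise_pct / 100.0
    diffs = []
    for _ in range(trials):
        # "baseline" and "candidate" are identical code, so identical distributions
        base = statistics.fmean(
            rng.gauss(mean_qps, sigma) for _ in range(samples_per_side))
        cand = statistics.fmean(
            rng.gauss(mean_qps, sigma) for _ in range(samples_per_side))
        diffs.append(100.0 * (cand - base) / base)
    return diffs


if __name__ == "__main__":
    for n in (1, 20):
        spread = statistics.pstdev(aa_pct_diffs(n))
        print(f"{n:2d} sample(s) per side: typical A/A diff about +/-{spread:.1f}%")
```

Under these assumptions, the spread of the A/A difference shrinks as more samples go into each side, so a single sample per side can show double-digit differences between identical code.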

Thanks!

@msokolov
Collaborator

msokolov commented Oct 12, 2021 via email

@praveennish
Author

Thanks for your input! Is there a wiki or any documentation that specifies which fields to look at when comparing the performance of baseline vs. modified?

@praveennish
Author

Hi @msokolov

I am still observing differences in the stats of baseline vs. modified, even though both are the same Lucene 9 code.

These are my parameters:

corpus - wikimediumall
taskRepeatCount = 20
jvmCount = 64

These are the values from the 63rd iteration:

                        Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff  p-value
         LowIntervalsOrdered       18.02     (18.8%)       16.76     (18.2%)   -7.0% ( -37% -   37%) 0.034
                  OrHighHigh        3.89     (21.6%)        3.63     (20.6%)   -6.6% ( -40% -   45%) 0.081
         MedIntervalsOrdered       32.85     (21.9%)       30.82     (19.0%)   -6.2% ( -38% -   44%) 0.091
       BrowseMonthTaxoFacets        0.65     (12.6%)        0.61     (11.9%)   -5.1% ( -26% -   22%) 0.019
                     Respell       21.47     (16.4%)       20.48     (14.3%)   -4.6% ( -30% -   31%) 0.092
                     Prefix3       25.37     (19.6%)       24.49     (19.1%)   -3.5% ( -35% -   43%) 0.310
       BrowseMonthSSDVFacets        2.49     (24.8%)        2.41     (27.1%)   -3.5% ( -44% -   64%) 0.454
                   LowPhrase       21.06     (18.1%)       20.39     (16.3%)   -3.2% ( -31% -   38%) 0.298
            HighSloppyPhrase        6.87     (19.8%)        6.69     (19.7%)   -2.6% ( -35% -   45%) 0.468
               OrHighNotHigh      258.43     (13.4%)      252.31     (12.3%)   -2.4% ( -24% -   26%) 0.301
                OrHighNotLow      381.47     (14.0%)      372.64      (9.7%)   -2.3% ( -22% -   24%) 0.280
    AndHighHighDayTaxoFacets        4.06     (15.8%)        3.97     (14.8%)   -2.2% ( -28% -   33%) 0.412
                    PKLookup       55.24     (10.8%)       54.02      (8.9%)   -2.2% ( -19% -   19%) 0.211
                  TermDTSort       40.58     (29.2%)       39.80     (26.4%)   -1.9% ( -44% -   75%) 0.699
                      Fuzzy1       33.16      (8.6%)       32.63      (7.8%)   -1.6% ( -16% -   16%) 0.276
           HighTermMonthSort       16.28     (28.9%)       16.03     (29.1%)   -1.5% ( -46% -   79%) 0.773
     AndHighMedDayTaxoFacets        5.57     (14.2%)        5.49     (14.8%)   -1.3% ( -26% -   32%) 0.602
             MedSloppyPhrase        1.74     (17.9%)        1.72     (17.8%)   -1.2% ( -31% -   42%) 0.713
               OrNotHighHigh      326.82     (11.5%)      323.05      (9.8%)   -1.2% ( -20% -   22%) 0.545
                HighSpanNear        5.84     (21.5%)        5.78     (22.5%)   -1.0% ( -37% -   54%) 0.789
                   OrHighMed        9.65     (22.9%)        9.56     (22.1%)   -1.0% ( -37% -   56%) 0.809
        MedTermDayTaxoFacets        1.52     (22.0%)        1.50     (20.8%)   -0.9% ( -35% -   53%) 0.808
      OrHighMedDayTaxoFacets        2.15     (21.9%)        2.13     (22.0%)   -0.8% ( -36% -   55%) 0.833
                    Wildcard       39.49     (23.2%)       39.22     (22.1%)   -0.7% ( -37% -   58%) 0.866
                     MedTerm      468.66     (15.5%)      465.63     (14.4%)   -0.6% ( -26% -   34%) 0.808
                  HighPhrase       47.44     (18.7%)       47.36     (16.1%)   -0.2% ( -29% -   42%) 0.954
                      Fuzzy2       32.58     (10.8%)       32.54      (9.9%)   -0.1% ( -18% -   23%) 0.948
                   OrHighLow      112.38     (15.1%)      112.64     (15.8%)    0.2% ( -26% -   36%) 0.935
                  AndHighLow      225.77     (16.1%)      226.74     (14.8%)    0.4% ( -26% -   37%) 0.876
                 AndHighHigh        7.80     (17.7%)        7.84     (16.1%)    0.5% ( -28% -   41%) 0.869
             LowSloppyPhrase       16.10     (18.2%)       16.19     (16.8%)    0.6% ( -29% -   43%) 0.848
                  AndHighMed       40.40     (18.6%)       40.65     (16.4%)    0.6% ( -28% -   43%) 0.845
        BrowseDateTaxoFacets        0.58     (11.5%)        0.59     (11.9%)    0.7% ( -20% -   27%) 0.745
                      IntNRQ       12.90     (17.6%)       12.99     (18.6%)    0.7% ( -30% -   44%) 0.819
        HighTermTitleBDVSort       14.81     (29.6%)       14.93     (28.9%)    0.9% ( -44% -   84%) 0.869
                OrNotHighMed      319.00      (9.5%)      322.15      (9.6%)    1.0% ( -16% -   22%) 0.563
                     LowTerm      667.97      (7.3%)      674.60      (6.5%)    1.0% ( -11% -   15%) 0.419
                    HighTerm      534.32     (13.9%)      541.61     (15.9%)    1.4% ( -24% -   36%) 0.609
                   MedPhrase        8.58     (18.2%)        8.72     (19.1%)    1.6% ( -30% -   47%) 0.627
   BrowseDayOfYearTaxoFacets        0.59     (10.7%)        0.60     (11.0%)    1.8% ( -17% -   26%) 0.353
                OrHighNotMed      341.92     (11.7%)      350.59     (13.4%)    2.5% ( -20% -   31%) 0.257
                OrNotHighLow      257.81     (10.8%)      264.55     (10.3%)    2.6% ( -16% -   26%) 0.165
       HighTermDayOfYearSort       10.83     (31.5%)       11.15     (32.5%)    2.9% ( -46% -   97%) 0.613
                 MedSpanNear       13.81     (14.7%)       14.22     (17.4%)    3.0% ( -25% -   41%) 0.293
        HighIntervalsOrdered        1.29     (19.7%)        1.33     (18.8%)    3.4% ( -29% -   52%) 0.329
                 LowSpanNear        3.91     (18.9%)        4.06     (17.2%)    3.9% ( -27% -   49%) 0.231
   BrowseDayOfYearSSDVFacets        2.11     (26.4%)        2.21     (28.6%)    4.6% ( -39% -   80%) 0.348

And these are the values from the 64th iteration:

                        Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff  p-value
         LowIntervalsOrdered       17.98     (18.9%)       16.72     (18.2%)   -7.0% ( -37% -   37%) 0.033
                  OrHighHigh        3.88     (21.6%)        3.63     (20.5%)   -6.5% ( -39% -   45%) 0.081
         MedIntervalsOrdered       32.84     (21.7%)       30.83     (18.9%)   -6.1% ( -38% -   44%) 0.088
                     Respell       21.49     (16.3%)       20.51     (14.2%)   -4.6% ( -30% -   30%) 0.091
       BrowseMonthTaxoFacets        0.64     (13.0%)        0.62     (12.1%)   -4.4% ( -26% -   23%) 0.046
       BrowseMonthSSDVFacets        2.50     (24.6%)        2.39     (27.1%)   -4.2% ( -44% -   63%) 0.355
               OrHighNotHigh      259.60     (13.7%)      251.79     (12.3%)   -3.0% ( -25% -   26%) 0.190
                OrHighNotLow      382.71     (14.1%)      371.91      (9.7%)   -2.8% ( -23% -   24%) 0.187
                   LowPhrase       21.04     (18.0%)       20.47     (16.5%)   -2.7% ( -31% -   38%) 0.375
            HighSloppyPhrase        6.85     (19.8%)        6.67     (19.8%)   -2.6% ( -35% -   46%) 0.459
                     Prefix3       25.30     (19.6%)       24.65     (19.7%)   -2.6% ( -34% -   45%) 0.460
    AndHighHighDayTaxoFacets        4.06     (15.7%)        3.97     (14.7%)   -2.2% ( -28% -   33%) 0.408
                    PKLookup       55.19     (10.8%)       53.99      (8.9%)   -2.2% ( -19% -   19%) 0.213
                  TermDTSort       40.41     (29.3%)       39.58     (26.6%)   -2.0% ( -44% -   76%) 0.681
     AndHighMedDayTaxoFacets        5.59     (14.3%)        5.48     (14.7%)   -1.8% ( -26% -   31%) 0.472
                      Fuzzy1       33.21      (8.6%)       32.61      (7.7%)   -1.8% ( -16% -   15%) 0.210
           HighTermMonthSort       16.21     (29.0%)       15.95     (29.3%)   -1.6% ( -46% -   80%) 0.758
                HighSpanNear        5.84     (21.3%)        5.75     (22.6%)   -1.5% ( -37% -   53%) 0.698
             MedSloppyPhrase        1.74     (17.8%)        1.72     (17.9%)   -1.5% ( -31% -   41%) 0.639
               OrNotHighHigh      326.70     (11.4%)      322.82      (9.7%)   -1.2% ( -20% -   22%) 0.527
                   OrHighMed        9.63     (22.8%)        9.53     (22.0%)   -1.0% ( -37% -   56%) 0.801
      OrHighMedDayTaxoFacets        2.14     (21.9%)        2.12     (22.1%)   -1.0% ( -36% -   55%) 0.804
        MedTermDayTaxoFacets        1.51     (22.0%)        1.50     (20.7%)   -0.5% ( -35% -   54%) 0.886
                    Wildcard       39.26     (23.7%)       39.08     (22.2%)   -0.5% ( -37% -   59%) 0.907
                     MedTerm      467.59     (15.5%)      465.43     (14.3%)   -0.5% ( -26% -   34%) 0.862
                      Fuzzy2       32.53     (10.7%)       32.45     (10.0%)   -0.2% ( -18% -   23%) 0.893
                  HighPhrase       47.37     (18.6%)       47.29     (16.0%)   -0.2% ( -29% -   42%) 0.957
                   OrHighLow      112.24     (15.1%)      112.47     (15.7%)    0.2% ( -26% -   36%) 0.941
                  AndHighMed       40.34     (18.5%)       40.53     (16.5%)    0.5% ( -29% -   43%) 0.879
             LowSloppyPhrase       16.06     (18.1%)       16.14     (16.8%)    0.5% ( -29% -   43%) 0.871
        HighTermTitleBDVSort       14.78     (29.5%)       14.88     (28.8%)    0.7% ( -44% -   83%) 0.888
                      IntNRQ       12.87     (17.6%)       12.96     (18.6%)    0.7% ( -30% -   44%) 0.820
        BrowseDateTaxoFacets        0.58     (11.4%)        0.59     (11.9%)    0.8% ( -20% -   27%) 0.704
                   MedPhrase        8.63     (18.4%)        8.70     (19.0%)    0.8% ( -30% -   46%) 0.804
                OrNotHighMed      318.81      (9.5%)      321.48      (9.7%)    0.8% ( -16% -   22%) 0.620
                  AndHighLow      225.43     (16.1%)      227.47     (14.9%)    0.9% ( -25% -   37%) 0.741
                 AndHighHigh        7.79     (17.7%)        7.86     (16.1%)    0.9% ( -27% -   42%) 0.755
                     LowTerm      667.09      (7.3%)      674.22      (6.5%)    1.1% ( -11% -   16%) 0.380
                    HighTerm      533.13     (14.0%)      540.24     (15.9%)    1.3% ( -25% -   36%) 0.615
       HighTermDayOfYearSort       10.90     (31.4%)       11.08     (32.4%)    1.7% ( -47% -   95%) 0.765
   BrowseDayOfYearTaxoFacets        0.59     (10.7%)        0.60     (11.1%)    2.2% ( -17% -   26%) 0.256
                OrNotHighLow      257.38     (10.8%)      263.97     (10.4%)    2.6% ( -16% -   26%) 0.173
                OrHighNotMed      341.35     (11.7%)      350.53     (13.3%)    2.7% ( -19% -   31%) 0.224
        HighIntervalsOrdered        1.29     (20.0%)        1.33     (18.5%)    2.9% ( -29% -   51%) 0.393
                 MedSpanNear       13.78     (14.7%)       14.19     (17.4%)    2.9% ( -25% -   41%) 0.302
                 LowSpanNear        3.93     (19.0%)        4.08     (17.3%)    3.8% ( -27% -   49%) 0.238
   BrowseDayOfYearSSDVFacets        2.12     (26.2%)        2.20     (28.4%)    3.9% ( -40% -   79%) 0.422

Could you please explain why this is happening and what conclusion we can draw from it?
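
One hedged way to read the p-value column: in the 64th-iteration table, 2 of the 47 tasks have p < 0.05, and with 47 tasks tested at that threshold roughly 47 × 0.05 ≈ 2.35 would be expected to cross it by chance alone in an A/A comparison (assuming the per-task tests behave approximately independently, which is a simplification). The sketch below counts such rows in a pasted table. The helper name `significant_tasks` and the regular expression are assumptions for this illustration, not part of luceneutil.

```python
import re

# One result row: task name first, p-value (d.ddd) as the last token.
ROW = re.compile(r"^\s*(?P<task>\w+)\s+.*\s(?P<p>\d\.\d{3})\s*$")


def significant_tasks(table_text, alpha=0.05):
    """Return (row_count, [(task, p), ...] with p < alpha) for a pasted table."""
    total = 0
    hits = []
    for line in table_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header line or blank line
        total += 1
        p = float(m.group("p"))
        if p < alpha:
            hits.append((m.group("task"), p))
    return total, hits


# A few rows copied from the 64th-iteration table above.
SAMPLE = """\
         LowIntervalsOrdered       17.98     (18.9%)       16.72     (18.2%)   -7.0% ( -37% -   37%) 0.033
                  OrHighHigh        3.88     (21.6%)        3.63     (20.5%)   -6.5% ( -39% -   45%) 0.081
       BrowseMonthTaxoFacets        0.64     (13.0%)        0.62     (12.1%)   -4.4% ( -26% -   23%) 0.046
                  HighPhrase       47.37     (18.6%)       47.29     (16.0%)   -0.2% ( -29% -   42%) 0.957
"""

if __name__ == "__main__":
    total, hits = significant_tasks(SAMPLE)
    print(f"{len(hits)} of {total} rows have p < 0.05: {hits}")
    print(f"expected by chance at alpha=0.05: about {0.05 * total:.2f}")
```

Running the same function over the full table gives the count to compare against the chance expectation for that number of tasks.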

@mikemccand
Owner

These are surprisingly/depressingly noisy results. Are you sure these are A/A, i.e. exactly the same git clone of Lucene 9 being compared to itself? Which JVM / Java command-line flags are you passing? How much RAM does the box have? Is it enough to keep the whole index hot?

@praveennish
Author

I am very sorry for the late reply, @mikemccand!

I wanted to retest today, but after the latest pull I am getting a FileNotFoundException for this file in the data folder:
enwiki-20120502-lines-1k-fixed-utf8-with-random-label.txt

Where can I download this file, please?

@original-brownbear
Contributor

I think this is partly explained by #307
