Baseline and modified version show different benchmark results even though the codebase is the same #142
Comments
There is no statistical difference here. The final column, p-value, tells
you the probability that the difference you are observing is due to random
chance. It's one.
You can see the absolute values of these random differences shrink by
running larger test samples: more iterations, larger indexes, more queries.
What your A/A test shows you is the magnitude of noise on the system given
your sample size.
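Roughly speaking, that p-value comes from comparing the per-task QPS samples of the two runs. The snippet below is only an illustration with made-up numbers and scipy (not necessarily the exact statistic luceneutil computes), but it shows how heavily overlapping samples yield a large p-value:

# Illustration only: two invented sets of per-iteration QPS measurements
# for the same task from an A/A run. The values overlap heavily, so any
# difference in their means is attributable to noise.
from scipy import stats

baseline_qps = [407.2, 398.5, 412.9, 401.4, 409.8]
modified_qps = [405.1, 399.7, 410.3, 396.6, 411.0]

# Welch's two-sample t-test: a large p-value (far above 0.05) means the
# observed gap is consistent with random chance rather than a real change.
t_stat, p_value = stats.ttest_ind(baseline_qps, modified_qps, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")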
On Mon, Oct 11, 2021, 5:17 AM praveennish wrote:
Hi @mikemccand,
I have cloned the Lucene 9 code into both the baseline and candidate folders (so the codebase is 100% the same).
I saw that there are performance differences after running the command:
python3 src/python/localrun.py -source wikimedium10k
The first output table shows:
Task    QPS baseline    StdDev    QPS my_modified_version    StdDev    Pct diff    p-value
Prefix3 407.18 (0.0%) 314.24 (0.0%) -22.8% ( -22% - -22%) 1.000
LowSpanNear 1073.76 (0.0%) 852.06 (0.0%) -20.6% ( -20% - -20%) 1.000
MedSloppyPhrase 1140.22 (0.0%) 927.42 (0.0%) -18.7% ( -18% - -18%) 1.000
MedPhrase 964.51 (0.0%) 848.50 (0.0%) -12.0% ( -12% - -12%) 1.000
HighIntervalsOrdered 1002.98 (0.0%) 884.65 (0.0%) -11.8% ( -11% - -11%) 1.000
HighTermMonthSort 4017.92 (0.0%) 3660.73 (0.0%) -8.9% ( -8% - -8%) 1.000
Respell 512.33 (0.0%) 467.72 (0.0%) -8.7% ( -8% - -8%) 1.000
HighSpanNear 893.76 (0.0%) 821.69 (0.0%) -8.1% ( -8% - -8%) 1.000
IntNRQ 1828.06 (0.0%) 1682.03 (0.0%) -8.0% ( -7% - -7%) 1.000
HighTerm 5614.10 (0.0%) 5200.05 (0.0%) -7.4% ( -7% - -7%) 1.000
BrowseMonthTaxoFacets 4142.06 (0.0%) 3870.82 (0.0%) -6.5% ( -6% - -6%) 1.000
HighTermDayOfYearSort 3782.61 (0.0%) 3538.93 (0.0%) -6.4% ( -6% - -6%) 1.000
BrowseDayOfYearSSDVFacets 2665.19 (0.0%) 2514.64 (0.0%) -5.6% ( -5% - -5%) 1.000
LowTerm 6806.33 (0.0%) 6460.07 (0.0%) -5.1% ( -5% - -5%) 1.000
HighSloppyPhrase 886.16 (0.0%) 845.10 (0.0%) -4.6% ( -4% - -4%) 1.000
OrHighMed 898.26 (0.0%) 858.97 (0.0%) -4.4% ( -4% - -4%) 1.000
LowPhrase 988.79 (0.0%) 947.64 (0.0%) -4.2% ( -4% - -4%) 1.000
OrHighLow 1171.10 (0.0%) 1124.50 (0.0%) -4.0% ( -3% - -3%) 1.000
BrowseDateTaxoFacets 3796.98 (0.0%) 3648.76 (0.0%) -3.9% ( -3% - -3%) 1.000
PKLookup 326.99 (0.0%) 315.53 (0.0%) -3.5% ( -3% - -3%) 1.000
BrowseMonthSSDVFacets 3212.18 (0.0%) 3110.22 (0.0%) -3.2% ( -3% - -3%) 1.000
AndHighLow 2763.74 (0.0%) 2691.30 (0.0%) -2.6% ( -2% - -2%) 1.000
MedSpanNear 634.86 (0.0%) 624.48 (0.0%) -1.6% ( -1% - -1%) 1.000
Wildcard 581.94 (0.0%) 572.55 (0.0%) -1.6% ( -1% - -1%) 1.000
HighPhrase 729.77 (0.0%) 720.61 (0.0%) -1.3% ( -1% - -1%) 1.000
BrowseDayOfYearTaxoFacets 3111.47 (0.0%) 3073.01 (0.0%) -1.2% ( -1% - -1%) 1.000
OrHighHigh 430.85 (0.0%) 426.77 (0.0%) -0.9% ( 0% - 0%) 1.000
AndHighHigh 1029.49 (0.0%) 1028.71 (0.0%) -0.1% ( 0% - 0%) 1.000
LowSloppyPhrase 1351.24 (0.0%) 1365.14 (0.0%) 1.0% ( 1% - 1%) 1.000
Fuzzy2 70.31 (0.0%) 71.83 (0.0%) 2.2% ( 2% - 2%) 1.000
Fuzzy1 324.58 (0.0%) 338.44 (0.0%) 4.3% ( 4% - 4%) 1.000
LowIntervalsOrdered 1721.13 (0.0%) 1807.65 (0.0%) 5.0% ( 5% - 5%) 1.000
MedTerm 5749.70 (0.0%) 6042.57 (0.0%) 5.1% ( 5% - 5%) 1.000
MedIntervalsOrdered 1291.17 (0.0%) 1382.36 (0.0%) 7.1% ( 7% - 7%) 1.000
AndHighMed 1322.11 (0.0%) 1575.31 (0.0%) 19.2% ( 19% - 19%) 1.000
My expectation was that the same code would perform the same way in both runs, but you can notice deviations. Can you please explain it? Is this the right way to run the benchmark?
Thanks!
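For reference, the Pct diff column in the table above is just the relative change of the modified QPS against the baseline QPS; assuming that definition, the Prefix3 row works out like this:

# Prefix3 row from the table above: baseline vs. modified QPS.
baseline_qps = 407.18
modified_qps = 314.24

# Relative change of modified vs. baseline, as a percentage.
pct_diff = (modified_qps - baseline_qps) / baseline_qps * 100
print(f"{pct_diff:.1f}%")  # prints -22.8%, matching the Pct diff column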
Thanks for your input! Is there a wiki or any documentation that specifies which fields to look at when measuring the performance of baseline vs. modified?
Hi @msokolov, I am still observing differences in the stats of baseline vs. modified even though they are the same Lucene 9 code. Following are my parameters: corpus - wikimediumall. This is the 63rd iteration value
and this is the 64th iteration.
Kindly educate me on why this is happening and what conclusion we can draw from it?
These are surprisingly/depressingly noisy results. Are you sure they are A/A? Exactly the same?
I am very sorry @mikemccand for the late reply! I wanted to retest today, but after the latest pull I am getting a FileNotFoundException for this file. Where can I download this file, please?
I think this is partly explained by #307 |
Hi @mikemccand,
I have cloned the Lucene 9 code into both the baseline and candidate folders (so the codebase is 100% the same).
I saw that there are performance differences after running the command:
python3 src/python/localrun.py -source wikimedium10k
The first output table shows:
My expectation was that the same code would perform the same way in both runs, but you can notice deviations. Can you please explain it? Is this the right way to run the benchmark?
Thanks!
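As a rough sketch of the "larger samples" advice above (assuming a luceneutil checkout with the wikimediumall corpus already set up, and using the same localrun.py script as the command above), repeating the identical A/A run a few times shows how much the per-task numbers move purely from run-to-run noise:

# Sketch: repeat the same A/A benchmark a few times and eyeball the spread.
# Assumes it is run from the luceneutil checkout, like the command above.
import subprocess

RUNS = 3  # more runs, larger indexes, and more queries shrink the noise floor

for i in range(RUNS):
    print(f"=== A/A run {i + 1} of {RUNS} ===")
    subprocess.run(
        ["python3", "src/python/localrun.py", "-source", "wikimediumall"],
        check=True,
    )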