
Start using mach perftest VIEW for regression detection #162

Closed
mcomella opened this issue Sep 17, 2020 · 29 comments


@mcomella
Contributor

We've hit diminishing returns in our validation investigation (#141 (comment)) and we believe perftest is ready to be used.

Along with this effort, we should consider including this data in the existing FNPRMS dashboards.

@mcomella mcomella self-assigned this Sep 17, 2020
@mcomella mcomella added this to Needs prioritization in Performance, front-end roadmap via automation Sep 17, 2020
@mcomella mcomella moved this from Needs prioritization to In progress in Performance, front-end roadmap Sep 17, 2020
@mcomella
Contributor Author

mcomella commented Sep 18, 2020

I think the per-commit data is too noisy to identify a specific commit that caused a regression. The following graphs use Pixel 2 data; I generated them using sparky's script with these changes (waiting to merge into master).

Here's FNPRMS over the past 30 days, 25 replicates per data point:
[image]

There is a distinct trend upwards.

Here's perftest's per-commit data over roughly the same period, 14 replicates per data point:
[image]

You can't really see any trends, and the data seems to vary regularly between 1300ms and 1360ms (not good enough to catch the small regressions we've been seeing, ~10-40ms). To be fair, it's unclear if the perftest methodology has been changing during that period. (Note: the drop at the end might be from us disabling conditioned profiles.)

Here's a recent week:
[image]


Here's a graph where, for each day, we take the median per-commit run:
[image]

The graph seems less noisy but is missing the gradual trend we see in FNPRMS. This is concerning: when perftest runs are viewed in isolation, they seem pretty similar to FNPRMS, but when viewed in aggregate, the trends we see in FNPRMS are missing. What could be the cause?

  • for per-commit graphs, could the difference between 14 and 25 replicates per data point make such a big difference in noise (from 20-60ms down to <= 10ms)?
  • for the 1-median-per-day data, since the number of commits per day varies, some of the results may be outliers
  • perftest measures to navStart instead of a full page load, but theoretically that should reduce noise

Possible next steps:

  • increase the replicate count to 25: this is probably the easiest way to make progress and compare
  • the amount of noise in aggregate might change now that we've disabled conditioned profiles (this could be the answer, or it could complicate further analysis because it breaks historical data)
  • wait until this is sheriffable (waiting on surfacing per-commit data in treeherder + alerting?) and potentially have someone else figure out how to make this data more usable

@acreskeyMoz Do you have any ideas about what we should do next? To be explicit, I'm trying to get perftest to a place where we can use it to detect regressions per commit. Theoretically, this script should let us do that, but there seems to be too much noise.
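
For reference, the "median per-commit run per day" aggregation is roughly the following (a sketch only, not sparky's actual script; the CSV layout and column names here are hypothetical):

    import pandas as pd

    # Hypothetical export of the per-commit perftest results:
    # one row per replicate with columns date, commit, duration_ms.
    df = pd.read_csv("perftest_per_commit.csv", parse_dates=["date"])

    # Collapse each commit's replicates to a median, then take the median
    # commit value per calendar day (one data point per day).
    per_commit = df.groupby(["date", "commit"])["duration_ms"].median()
    per_day = per_commit.groupby(level="date").median()
    print(per_day)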

@mcomella
Contributor Author

To be explicit: viewed from this perspective, perftest is not yet a good substitute for FNPRMS as a nightly measurement system, never mind per-commit regression detection.

@gmierz

gmierz commented Sep 19, 2020

One thing to note here is that these comparisons are between a small pool of devices (FNPRMS) versus a larger pool of devices (perftest). You might want to isolate a time series for each device in perftest and see if either of them, or some combination of them (if FNPRMS uses more than one), matches up with FNPRMS. It's possible that the regressions detected in FNPRMS are biased because your sample/test-subject size is too small.
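
Splitting that out could look something like this (a sketch with hypothetical column names; in practice the device would have to be pulled from the task/worker metadata):

    import pandas as pd

    # Hypothetical export: one row per replicate with date, device, duration_ms.
    df = pd.read_csv("perftest_per_commit.csv", parse_dates=["date"])

    # Build one daily-median time series per device so each device's trend
    # can be compared to FNPRMS on its own.
    per_device = (
        df.groupby(["device", "date"])["duration_ms"]
        .median()
        .unstack("device")
    )
    print(per_device.head())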

@acreskeyMoz
Collaborator

This is my suggestion:

We take the raw results from these runs:

• perftest in CI with 25 iterations
• perftest in CI with 14 iterations
• FNPRMS run with 25 iterations
• (maybe?) FNPRMS run with 14 iterations

And for each we calculate:
• Std Dev
• Margin of Error

And compare.
Maybe n=30 is the sweet spot for these tests?
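
Concretely, the calculation could look something like this (a sketch using a t-based 95% interval over made-up replicate times, not necessarily the exact formula we'll settle on):

    import math
    import statistics
    from scipy import stats

    def summarize(replicates, confidence=0.95):
        """Mean, sample std dev, and t-based margin of error for one run."""
        n = len(replicates)
        mean = statistics.mean(replicates)
        sd = statistics.stdev(replicates)                    # ddof = 1
        t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
        moe = t_crit * sd / math.sqrt(n)                     # half-width of the CI
        return mean, sd, moe

    # Made-up cold-start times (ms) standing in for one run's replicates:
    run = [1320, 1345, 1330, 1362, 1318, 1341, 1329, 1355, 1337, 1348, 1326, 1333, 1350, 1322]
    mean, sd, moe = summarize(run)
    print(f"n={len(run)}  mean={mean:.1f} ms  sd={sd:.1f} ms  95% MoE = +/-{moe:.1f} ms")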

Perhaps we'll also want to compare variance. @gmierz was using a Levene test when looking at the perf tuning impact on noise.

I'm happy to do this unless someone else wants to.
I do have a couple of days work already scheduled ahead of me though.

@gmierz

gmierz commented Sep 21, 2020

Levene's test should definitely be used here to compare variance between multiple data sources with potentially differing means. It will tell you whether there is a statistically significant difference in the variances, but only whether they are equal or not. You can then use the %-change in std. dev./errors to determine which one is larger if there is a difference.
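
For example (made-up samples; scipy's implementation defaults to the more robust median-centered Brown-Forsythe variant):

    import statistics
    from scipy import stats

    # Made-up replicate timings (ms) from the two sources being compared:
    fnprms = [1210, 1225, 1198, 1240, 1215, 1232, 1207, 1221, 1236, 1218]
    perftest = [1335, 1362, 1318, 1350, 1329, 1344, 1371, 1322, 1356, 1340]

    # Levene's test: the null hypothesis is that the variances are equal.
    stat, p = stats.levene(fnprms, perftest)
    print(f"Levene W = {stat:.2f}, p = {p:.3f}")

    if p < 0.05:
        # The test only says the variances differ; the %-change in std dev
        # shows which source is noisier and by how much.
        sd_a, sd_b = statistics.stdev(fnprms), statistics.stdev(perftest)
        print(f"std dev change: {100 * (sd_b - sd_a) / sd_a:+.1f}%")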

I still think that the difference in devices should be looked into, and we might be better off using pystones for these timings rather than raw milliseconds.

@mcomella
Contributor Author

> I'm happy to do this unless someone else wants to.
> I do have a couple of days work already scheduled ahead of me though.

@acreskeyMoz If you have the cycles for it, I think it'd be ideal! :) I don't have any statistics background, nor do I have much familiarity with how our CI infra is set up, so I think it'd be time-consuming for me and I'd need to bother you or sparky frequently anyway.

To be explicit: ideally, we could get this ASAP but I think it's fine to wait a few more days especially because I'll be implementing other regression prevention mechanisms while I'm waiting for your results.

I'll assume that you're going to do it but let me know if that's not the case. Thanks!

@acreskeyMoz
Collaborator

I'm starting on this now -- I expect to have the initial summary for perftest by tomorrow.

I'm also including a variation where the multi-commit Fenix build is replaced with the Nightly.

@mcomella Can you provide me with individual results from FNPRMS runs on G5 and Pixel 2?
Ideally from a few runs.

@mcomella
Contributor Author

> @mcomella Can you provide me with individual results from FNPRMS runs on G5 and Pixel 2?
> Ideally from a few runs.

Clarified the requirements async:

  • 6-10 times total
  • Each run uses 25 iterations
  • Whichever Nightly build is fine
  • Just share the final durations, not the logs

I'm getting P2 numbers and MarcLeclair is getting G5.

@acreskeyMoz
Collaborator

When I have all the data I'll summarize this, but FYI I'm collecting raw numbers and statistics on them here:
perftest and FNPRMS variance

@mcomella
Contributor Author

@acreskeyMoz
Collaborator

acreskeyMoz commented Sep 23, 2020

I've imported all the results (thank you Michael and Marc and CI).
So far this is what we're looking at:

Test         Iterations   +/- ms (95% Margin of Error, average from 10 runs)
perftest-g5      14          35.86
perftest-g5      25          21.59
perftest-p2      14          39.82
perftest-p2      25          22.75
fnprms-p2        25          19.11
fnprms-g5        25          27.90

@gmierz

gmierz commented Sep 23, 2020

@acreskeyMoz is the difference in the mean/median between FNPRMS and perftest statistically insignificant or significant? If it's significant, we should be reporting % differences (error/mean) rather than raw ms values.

@acreskeyMoz
Collaborator

@gmierz In #162 (comment) I'm just showing the calculated margin of error for each of those tests.
We can't directly compare FNPRMS and perftest means/medians because they test a different spot on the navigation timeline.

@gmierz

gmierz commented Sep 23, 2020

Right, but the size of the margin of error can depend on the mean/median: a larger mean can produce larger deviations than a smaller mean would.

For instance, suppose one case has a mean of 100 with a margin of error of 50 while another has a mean of 200 with a margin of error of 50. Looking only at the margin of error, without considering the mean, makes them look the same when they're not: the normalized error is 50% in one case (small mean) and 25% in the other (large mean).
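
As a quick calculation with those numbers:

    # Relative (normalized) margin of error = error / mean.
    for mean, moe in [(100, 50), (200, 50)]:
        print(f"mean={mean}, MoE=+/-{moe} -> {100 * moe / mean:.0f}% relative error")
    # Same raw MoE, but it is 50% of the small mean and only 25% of the large one.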

@acreskeyMoz
Collaborator

> Right, but the size of the margin of error can depend on the mean/median: a larger mean can produce larger deviations than a smaller mean would.
>
> For instance, suppose one case has a mean of 100 with a margin of error of 50 while another has a mean of 200 with a margin of error of 50. Looking only at the margin of error, without considering the mean, makes them look the same when they're not: the normalized error is 50% in one case (small mean) and 25% in the other (large mean).

Ah yes, point taken.
We should not directly compare the FNPRMS and perftest margins of error. (Their means differ by ~10% on the G5 and ~20% on the P2.)

But we can see that increasing perftest iterations to 25 (at least on G5, so far) gives us a margin of error that I believe is closer to what @mcomella is looking for.

@gmierz

gmierz commented Sep 23, 2020

Yea I agree, perftest-g5 vs fnprms-g5 is looking promising! :)

@acreskeyMoz
Collaborator

Added perftest-p2 results to #162 (comment)

@acreskeyMoz
Collaborator

acreskeyMoz commented Sep 24, 2020

Some simple plots of the first two runs for each device and test. (25 iterations for all tests)
From here
[image: g5]

[image: p2]

@mcomella
Contributor Author

@acreskeyMoz Another thing we can potentially do to reduce noise in perftest is to add delays. In my initial analysis when adding delays #141 (comment), I was using conditioned profiles, so there was a delay built in as the profiles were copied over; this delay doesn't exist now that conditioned profiles are disabled, and its absence could be contributing noise.

@acreskeyMoz
Collaborator

> @acreskeyMoz Another thing we can potentially do to reduce noise in perftest is to add delays. In my initial analysis when adding delays #141 (comment), I was using conditioned profiles, so there was a delay built in as the profiles were copied over; this delay doesn't exist now that conditioned profiles are disabled, and its absence could be contributing noise.

Ah yes, that's a good and simple change.

We can discuss today, but so far these look like good improvements:

  1. Add a short delay between tests (browsertime configuration)
  2. Increase perftest iteration count from 14 to 25

@acreskeyMoz
Collaborator

We decided against prematurely adding delays without data to show they yield improvements:

Increase iterations to 25:
https://bugzilla.mozilla.org/show_bug.cgi?id=1667238

@mcomella
Contributor Author

The iteration change landed three days ago: the next action item for this bug is to look at the results (via the script and perhaps Treeherder) to see 1) whether we can see trends similar to FNPRMS or 2) whether the trends in FNPRMS don't accurately match what we might expect from users.

@mcomella
Contributor Author

The script doesn't return results after 9/15 – I'll need to check with sparky about what's going on.

@mcomella
Contributor Author

Sparky mentioned that the Active Data system (an Elasticsearch database used by some scripts) is not ingesting data correctly. I filed: https://bugzilla.mozilla.org/show_bug.cgi?id=1668327

To get fenix per-commit regression detection working, I think my options are to:

  1. Get this fixed somehow
  2. Wait for the perf sheriffing work to complete

However, I'm not sure the interface in option 2 will allow us to validate that this is a good replacement for FNPRMS (i.e. the steps in #162 (comment)).

@gmierz

gmierz commented Sep 30, 2020

@mcomella for the script that relied on Active Data, I'm going to try to port that visualization to Grafana through some work Tarek has been doing.

@mcomella mcomella moved this from In progress to Waiting in Performance, front-end roadmap Oct 1, 2020
@acreskeyMoz
Collaborator

> The iteration change landed three days ago: the next action item for this bug is to look at the results (via the script and perhaps Treeherder) to see 1) whether we can see trends similar to FNPRMS or 2) whether the trends in FNPRMS don't accurately match what we might expect from users.

If we want to compare the FNPRMS trends more closely to perftest, one option is to push jobs to try where the perftest view test measures the time to loadEventEnd:

https://searchfox.org/mozilla-central/source/testing/performance/perftest_android_view.js#14-16

  const processLaunchToNavStart =
    browserScripts.pageinfo.navigationStartTime + browserScripts.timings.loadEventEnd -
    browserScripts.browser.processStartTime;

@mcomella
Contributor Author

mcomella commented Oct 8, 2020

Status update: Tarek is working on inserting perftest data into Grafana. Once that's complete, we should be able to download the data directly from Grafana.

If that takes too long, Sparky has a workaround for the Active Data issue; it doesn't sound like Active Data is coming back.

@mcomella mcomella changed the title from "Figure out how to start using mach perftest VIEW for regression detection" to "Start using mach perftest VIEW for regression detection" Feb 9, 2021
@mcomella
Contributor Author

After talking with pslawless, we're waiting for more cycles (perhaps from the perf test team) to complete this effort.

@mcomella
Contributor Author

This bug isn't super actionable anymore and we're waiting on cycles from the perftest team to move forward with this. Let's reopen or open a new one when we have those cycles.

Performance, front-end roadmap automation moved this from Waiting to Done Jul 12, 2021