
Start using mach perftest VIEW for regression detection #162

Closed
mcomella opened this issue Sep 17, 2020 · 29 comments


@mcomella
Contributor

We've hit diminishing returns in our validation investigation (#141 (comment)) and we believe perftest is ready to be used.

Along with this effort, we should consider including this data in the existing FNPRMS dashboards.

@mcomella mcomella self-assigned this Sep 17, 2020
@mcomella mcomella added this to Needs prioritization in Performance, front-end roadmap via automation Sep 17, 2020
@mcomella mcomella moved this from Needs prioritization to In progress in Performance, front-end roadmap Sep 17, 2020
@mcomella
Contributor Author

mcomella commented Sep 18, 2020

I think the per-commit data is too noisy to identify a specific commit that caused a regression. The following graphs use Pixel 2 data; I generated them using sparky's script with these changes (waiting to merge into master).

Here's FNPRMS over the past 30 days, 25 replicates per data point:
[image]

There is a distinct trend upwards.

Here's perftest's per-commit data over roughly the same period, 14 replicates per data point:
[image]

You can't really see any trends, and the data seems to vary regularly between 1300ms and 1360ms (not good enough to catch the small regressions we've been seeing, ~10-40ms). To be fair, it's unclear if the perftest methodology has been changing during that period. (Note: the drop at the end might be from us disabling conditioned profiles.)

Here's a recent week:
[image]


Here's a graph where, for each day, we take the median per-commit run:
[image]

The graph seems less noisy but is missing the gradual trend we see in FNPRMS. This is concerning: when perftest runs are viewed in isolation, they seem pretty similar to FNPRMS, but when viewed in aggregate, the trends we see in FNPRMS are missing. What could be the cause?

  • for per-commit graphs, could the difference between 14 and 25 replicates per data point make such a big difference in noise (from 20-60ms down to <= 10ms)?
  • for the 1-median-per-day data, since the number of commits per day varies, some of the results may be outliers
  • perftest measures to navStart instead of a full page load, but theoretically that should reduce noise

Possible next steps:

  • increase the replicate count to 25: this is probably the easiest way to make progress and compare
  • the amount of noise in aggregate might change now that we've disabled conditioned profiles (this could be the answer, or it could complicate further analysis because it breaks historical data)
  • wait until this is sheriffable (waiting on surfacing per-commit data in treeherder + alerting?) and potentially have someone else figure out how to make this data more usable

@acreskeyMoz Do you have any ideas about what we should do next? To be explicit, I'm trying to get perftest to a place where we can use it to detect regressions per commit. Theoretically, this script should let us do that, but there seems to be too much noise.
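
For reference, the "median per-commit run per day" aggregation is roughly the following (a sketch only, not sparky's actual script; the CSV layout and column names here are hypothetical):

    import pandas as pd

    # Hypothetical export of the per-commit perftest results:
    # one row per replicate with columns date, commit, duration_ms.
    df = pd.read_csv("perftest_per_commit.csv", parse_dates=["date"])

    # Collapse each commit's replicates to a median, then take the median
    # commit value per calendar day (one data point per day).
    per_commit = df.groupby(["date", "commit"])["duration_ms"].median()
    per_day = per_commit.groupby(level="date").median()
    print(per_day)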

@mcomella
Contributor Author

To be explicit: viewed from this perspective, perftest is not yet a good substitute for FNPRMS as a nightly measurement system, never mind per-commit regression detection.

@gmierz

gmierz commented Sep 19, 2020

One thing to note here is that these comparisons are between a small pool of devices (FNPRMS) versus a larger pool of devices (perftest). You might want to isolate a time series for each device in perftest and see if either of them, or some combination of them (if FNPRMS uses more than one), matches up with FNPRMS. It's possible that the regressions detected in FNPRMS are biased because your sample/test-subject size is too small.
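
Splitting that out could look something like this (a sketch with hypothetical column names; in practice the device would have to be pulled from the task/worker metadata):

    import pandas as pd

    # Hypothetical export: one row per replicate with date, device, duration_ms.
    df = pd.read_csv("perftest_per_commit.csv", parse_dates=["date"])

    # Build one daily-median time series per device so each device's trend
    # can be compared to FNPRMS on its own.
    per_device = (
        df.groupby(["device", "date"])["duration_ms"]
        .median()
        .unstack("device")
    )
    print(per_device.head())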

@acreskeyMoz
Collaborator

This is my suggestion:

We take the raw results from these runs:

• perftest in CI with 25 iterations
• perftest in CI with 14 iterations
• FNPRMS run with 25 iterations
• (maybe?) FNPRMS run with 14 iterations

And for each we calculate:
• Std Dev
• Margin of Error

And compare.
Maybe n=30 is the sweet spot for these tests?
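
Concretely, the calculation could look something like this (a sketch using a t-based 95% interval over made-up replicate times, not necessarily the exact formula we'll settle on):

    import math
    import statistics
    from scipy import stats

    def summarize(replicates, confidence=0.95):
        """Mean, sample std dev, and t-based margin of error for one run."""
        n = len(replicates)
        mean = statistics.mean(replicates)
        sd = statistics.stdev(replicates)                    # ddof = 1
        t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
        moe = t_crit * sd / math.sqrt(n)                     # half-width of the CI
        return mean, sd, moe

    # Made-up cold-start times (ms) standing in for one run's replicates:
    run = [1320, 1345, 1330, 1362, 1318, 1341, 1329, 1355, 1337, 1348, 1326, 1333, 1350, 1322]
    mean, sd, moe = summarize(run)
    print(f"n={len(run)}  mean={mean:.1f} ms  sd={sd:.1f} ms  95% MoE = +/-{moe:.1f} ms")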

Perhaps we'll also want to compare variance. @gmierz was using a Levene test when looking at the perf tuning impact on noise.

I'm happy to do this unless someone else wants to.
I do have a couple of days work already scheduled ahead of me though.

@gmierz

gmierz commented Sep 21, 2020

Levene's test should definitely be used here to compare variance between multiple data sources with potentially differing means. It will tell you whether there is a statistically significant difference in the variances, but only whether they are equal or not. You can then use the %-change in std. dev./errors to determine which one is larger if there is a difference.
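
For example (made-up samples; scipy's implementation defaults to the more robust median-centered Brown-Forsythe variant):

    import statistics
    from scipy import stats

    # Made-up replicate timings (ms) from the two sources being compared:
    fnprms = [1210, 1225, 1198, 1240, 1215, 1232, 1207, 1221, 1236, 1218]
    perftest = [1335, 1362, 1318, 1350, 1329, 1344, 1371, 1322, 1356, 1340]

    # Levene's test: the null hypothesis is that the variances are equal.
    stat, p = stats.levene(fnprms, perftest)
    print(f"Levene W = {stat:.2f}, p = {p:.3f}")

    if p < 0.05:
        # The test only says the variances differ; the %-change in std dev
        # shows which source is noisier and by how much.
        sd_a, sd_b = statistics.stdev(fnprms), statistics.stdev(perftest)
        print(f"std dev change: {100 * (sd_b - sd_a) / sd_a:+.1f}%")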

I still think that the difference in devices should be looked into, and we might be better off using pystones for these timings rather than raw milliseconds.

@mcomella
Contributor Author

> I'm happy to do this unless someone else wants to.
> I do have a couple of days work already scheduled ahead of me though.

@acreskeyMoz If you have the cycles for it, I think it'd be ideal! :) I don't have any statistics background, nor do I have much familiarity with how our CI infra is set up, so I think it'd be time-consuming for me and I'd need to bother you or sparky frequently anyway.

To be explicit: ideally, we could get this ASAP but I think it's fine to wait a few more days especially because I'll be implementing other regression prevention mechanisms while I'm waiting for your results.

I'll assume that you're going to do it but let me know if that's not the case. Thanks!

@acreskeyMoz
Collaborator

I'm starting on this now -- I expect to have the initial summary for perftest by tomorrow.

I'm also including a variation where the multi-commit Fenix build is replaced with the Nightly.

@mcomella Can you provide me with individual results from FNPRMS runs on G5 and Pixel 2?
Ideally from a few runs.

@mcomella
Contributor Author

> @mcomella Can you provide me with individual results from FNPRMS runs on G5 and Pixel 2?
> Ideally from a few runs.

Clarified the requirements async:

  • 6-10 times total
  • Each run uses 25 iterations
  • Whichever Nightly build is fine
  • Just share the final durations, not the logs

I'm getting P2 numbers and MarcLeclair is getting G5.

@acreskeyMoz
Collaborator

When I have all the data I'll summarize this, but FYI I'm collecting raw numbers and statistics on them here:
perftest and FNPRMS variance

@mcomella
Contributor Author

@acreskeyMoz
Collaborator

acreskeyMoz commented Sep 23, 2020

I've imported all the results (thank you Michael and Marc and CI).
So far this is what we're looking at:

Test         Iterations   +/- ms (95% Margin of Error, average from 10 runs)
perftest-g5      14          35.86
perftest-g5      25          21.59
perftest-p2      14          39.82
perftest-p2      25          22.75
fnprms-p2        25          19.11
fnprms-g5        25          27.90

@gmierz

gmierz commented Sep 23, 2020

@acreskeyMoz is the difference in the mean/median between FNPRMS and perftest statistically insignificant or significant? If it's significant, we should be reporting % differences (error/mean) rather than raw ms values.

@acreskeyMoz
Collaborator

@gmierz In #162 (comment) I'm just showing the calculated margin of error for each of those tests.
We can't directly compare FNPRMS and perftest means/medians because they test a different spot on the navigation timeline.

@gmierz

gmierz commented Sep 23, 2020

Right, but the size of the margin of error can depend on the mean/median: a larger mean can produce larger deviations than a smaller mean would.

For instance, suppose one case has a mean of 100 with a margin of error of 50 while another has a mean of 200 with a margin of error of 50. Looking only at the margin of error, without considering the mean, makes them look the same when they're not: the normalized error is 50% in one case (small mean) and 25% in the other (large mean).
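
As a quick calculation with those numbers:

    # Relative (normalized) margin of error = error / mean.
    for mean, moe in [(100, 50), (200, 50)]:
        print(f"mean={mean}, MoE=+/-{moe} -> {100 * moe / mean:.0f}% relative error")
    # Same raw MoE, but it is 50% of the small mean and only 25% of the large one.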

@acreskeyMoz
Collaborator

> Right, but the size of the margin of error can depend on the mean/median: a larger mean can produce larger deviations than a smaller mean would.
>
> For instance, suppose one case has a mean of 100 with a margin of error of 50 while another has a mean of 200 with a margin of error of 50. Looking only at the margin of error, without considering the mean, makes them look the same when they're not: the normalized error is 50% in one case (small mean) and 25% in the other (large mean).

Ah yes, point taken.
We should not directly compare the FNPRMS and perftest margins of error. (Their means differ by ~10% on the G5 and ~20% on the P2.)

But we can see that increasing perftest iterations to 25 (at least on G5, so far) gives us a margin of error that I believe is closer to what @mcomella is looking for.

@gmierz

gmierz commented Sep 23, 2020

Yea I agree, perftest-g5 vs fnprms-g5 is looking promising! :)

@acreskeyMoz
Collaborator

Added perftest-p2 results to #162 (comment)

@acreskeyMoz
Collaborator

acreskeyMoz commented Sep 24, 2020

Some simple plots of the first two runs for each device and test. (25 iterations for all tests)
From here
[image: g5]

[image: p2]

@mcomella
Contributor Author

@acreskeyMoz Another thing we can potentially do to reduce noise in perftest is to add delays. In my initial analysis when adding delays #141 (comment), I was using conditioned profiles, so there was a delay built in as the profiles were copied over; this delay doesn't exist now that conditioned profiles are disabled, and its absence could be contributing noise.

@acreskeyMoz
Collaborator

> @acreskeyMoz Another thing we can potentially do to reduce noise in perftest is to add delays. In my initial analysis when adding delays #141 (comment), I was using conditioned profiles, so there was a delay built in as the profiles were copied over; this delay doesn't exist now that conditioned profiles are disabled, and its absence could be contributing noise.

Ah yes, that's a good and simple change.

We can discuss today, but so far these look like good improvements:

  1. Add a short delay between tests (browsertime configuration)
  2. Increase perftest iteration count from 14 to 25

@acreskeyMoz
Collaborator

We decided against prematurely adding delays without data to show they yield improvements:

Increase iterations to 25:
https://bugzilla.mozilla.org/show_bug.cgi?id=1667238

@mcomella
Contributor Author

The iteration change landed three days ago: the next action item for this bug is to look at the results (via the script and perhaps Treeherder) to see 1) whether we can see trends similar to FNPRMS or 2) whether the trends in FNPRMS don't accurately match what we might expect from users.

@mcomella
Contributor Author

The script doesn't return results after 9/15 – I'll need to check with sparky about what's going on.

@mcomella
Contributor Author

Sparky mentioned that the Active Data system (an Elasticsearch database used by some scripts) is not ingesting data correctly. I filed: https://bugzilla.mozilla.org/show_bug.cgi?id=1668327

To get fenix per-commit regression detection working, I think my options are to:

  1. Get this fixed somehow
  2. Wait for the perf sheriffing work to complete

However, I'm not sure the interface in option 2 will allow us to validate that this is a good replacement for FNPRMS (i.e. the steps in #162 (comment)).

@gmierz

gmierz commented Sep 30, 2020

@mcomella for the script that relied on Active Data, I'm going to try to port that visualization to Grafana through some work Tarek has been doing.

@mcomella mcomella moved this from In progress to Waiting in Performance, front-end roadmap Oct 1, 2020
@acreskeyMoz
Collaborator

> The iteration change landed three days ago: the next action item for this bug is to look at the results (via the script and perhaps Treeherder) to see 1) whether we can see trends similar to FNPRMS or 2) whether the trends in FNPRMS don't accurately match what we might expect from users.

If we want to compare the FNPRMS trends more closely to perftest, one option is to push jobs to try where the perftest view test measures the time to loadEventEnd:

https://searchfox.org/mozilla-central/source/testing/performance/perftest_android_view.js#14-16

  const processLaunchToNavStart =
    browserScripts.pageinfo.navigationStartTime + browserScripts.timings.loadEventEnd -
    browserScripts.browser.processStartTime;

@mcomella
Contributor Author

mcomella commented Oct 8, 2020

Status update: Tarek is working on inserting perftest data into Grafana. Once that's complete, we should be able to download the data directly from Grafana.

If that takes too long, Sparky has a workaround for the Active Data issue; it doesn't sound like Active Data is coming back.

@mcomella mcomella changed the title from "Figure out how to start using mach perftest VIEW for regression detection" to "Start using mach perftest VIEW for regression detection" Feb 9, 2021
@mcomella
Contributor Author

After talking with pslawless, we're waiting for more cycles (perhaps from the perf test team) to complete this effort.

@mcomella
Contributor Author

This bug isn't super actionable anymore and we're waiting on cycles from the perftest team to move forward with this. Let's reopen or open a new one when we have those cycles.

Performance, front-end roadmap automation moved this from Waiting to Done Jul 12, 2021