This repository has been archived by the owner on Feb 20, 2023. It is now read-only.

Task continuity: ~800ms regression on 50% of COLD MAIN first frame start ups (5/28 regression with bimodal behavior) #25545

Closed
mcomella opened this issue Jun 6, 2022 · 20 comments

@mcomella
Contributor

mcomella commented Jun 6, 2022

See #22144 for bimodal results in COLD VIEW, which is very likely unrelated.

For our COLD MAIN first frame start up test, our replicates are usually scattered evenly (I verified this behavior on the Nightlies 4/29, 5/15, 5/17, 5/18, 5/19, 5/21, and 5/25-5/27). For example, here is the Nightly 5/27 performance test:
[graph: Nightly 5/27 replicates, scattered evenly]

However, Nightly 5/28 introduces bimodal results:
[graph: Nightly 5/28 replicates, bimodal]

I verified:

  • This bimodal behavior continues on 5/29.
  • From 5/30-6/1, we don't see bimodal behavior; instead, we see a major regression such that all replicates are much longer.
  • We return to bimodal behavior from 6/2-6/5, which is where our test results end.

Bimodal results are undesirable because:

  • Users are experiencing fast start ups roughly half the time and slower start ups the other half. Why not figure out how to give users the fast start up all the time?
  • It can cause false-positive regressions, making it difficult to identify true regressions: if we get a low result roughly 50% of the time and a high result the other 50%, as in this bug, the reported median will regularly jump between the two modes (see the sketch below).
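
To make the second point concrete, here is a minimal Kotlin sketch (the mode values are rough placeholders based on the replicates reported later in this issue) of how the run-level median of ~50/50 bimodal replicates flips between the two modes from one test run to the next:

```kotlin
import kotlin.random.Random

// Each simulated run draws 25 replicates that land in a "fast" or "slow" mode
// with equal probability; the run's reported median then lands on whichever
// mode happened to win the coin flip for that run.
fun median(values: List<Long>): Long = values.sorted()[values.size / 2]

fun main() {
    repeat(5) { runIndex ->
        val replicates = List(25) {
            if (Random.nextBoolean()) 1670L + Random.nextLong(60) // fast mode
            else 2450L + Random.nextLong(60)                      // slow mode
        }
        println("run $runIndex: median = ${median(replicates)} ms")
    }
}
```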


@mcomella mcomella added the performance Possible performance wins label Jun 6, 2022
@github-actions github-actions bot added the needs:triage Issue needs triage label Jun 6, 2022
@mcomella
Contributor Author

mcomella commented Jun 6, 2022

One way to debug this would be to take multiple profiles (5?), find one of each behavior (since it occurs roughly 50% of the time, it shouldn't take too long assuming running the profiler doesn't change the behavior), and compare them to see what's different.

Another method would be to install the 5/27 nightly and the 5/28 nightly and see if anything is visually different on start up to the homescreen (e.g. maybe this regression was introduced by a new view/feature being added to the homescreen).

@mcomella mcomella changed the title Bimodal results in COLD MAIN first frame start up time (5/28 regression) ~800ms regression on 50% of COLD MAIN first frame start ups (5/28 regression due to bimodal behavior) Jun 8, 2022
@mcomella mcomella changed the title ~800ms regression on 50% of COLD MAIN first frame start ups (5/28 regression due to bimodal behavior) ~800ms regression on 50% of COLD MAIN first frame start ups (5/28 regression with bimodal behavior) Jun 8, 2022
@mcomella
Contributor Author

mcomella commented Jun 8, 2022

Triage: since such a large regression occurs roughly half the time, this may be very bad for the user experience – setting P1.

@Amejia481 Amejia481 self-assigned this Jun 8, 2022
@cpeterso

cpeterso commented Jun 8, 2022

MatthewTighe added a commit to MatthewTighe/fenix that referenced this issue Jun 9, 2022
@MatthewTighe
Contributor

Strong suspicion that this was introduced by the new task continuity feature that adds a synced tab to the homepage. There's quite a lot of I/O that happens to fetch the tab. I'm surprised it's hitting the main thread that hard, but we decided it would be best to just disable the feature through a feature flag to confirm our suspicions. Assuming we're right, I'll work on correcting the underlying issue when I return next week.
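
For context, here is a minimal sketch of the kind of feature-flag kill switch being described; the flag name and call site are hypothetical, not the actual Fenix FeatureFlags entry:

```kotlin
// Hypothetical compile-time kill switch: while this is false, the synced-tab
// card is never built, so none of its IO can land on the startup path.
object FeatureFlags {
    const val taskContinuityEnabled = false
}

// Hypothetical call site on the homescreen: only wire up the synced-tab card
// (and the IO behind it) when the flag is on.
fun maybeShowSyncedTab(showSyncedTabCard: () -> Unit) {
    if (FeatureFlags.taskContinuityEnabled) {
        showSyncedTabCard()
    }
}
```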

@Amejia481
Contributor

Amejia481 commented Jun 9, 2022

As @MatthewTighe mentioned, @csadilek, @agi, he, and I did some research: we tested the commit range from c9240e20ed0afeb516323f0222bea59db6487afd to e1c94881f48ec0d5a694b4c778188cba121f326c, which is the range we found between the TC builds for the 2022.05.27 nightly APK and the 2022.05.28 nightly APK.


We found that on commit 07d4a8599d102b7708c01616b1b5992f514b8d46 we were able to see the bimodal behavior and on commit 72a2ee688f4fd43a8a9f8b19f254e65bd63d6060 the pattern was not present.

At the moment, the commit we think is causing the pattern is only on Nightly. We backed it out temporarily in #25571 until we can do more investigation. @mcomella, it would be great if you could help us confirm our findings.


Data: from running `python3 measure_start_up.py -c=25 nightly cold_main_first_frame file.text`


First run of commit 07d4a8599d102b7708c01616b1b5992f514b8d46 with 25 iterations

1722
2392
2385
2458
1675
2376
2435
1609
1678
1673
1673
1676
1671
2405
1668
1689
1704
1682
1673
1751
1665
1659
1703
1759
1689

Second run of commit 07d4a8599d102b7708c01616b1b5992f514b8d46 with 25 iterations.

1665
1666
1663
1651
2866
1670
1668
1653
1664
2134
1686
1673
1698
1675
1693
1673
1681
1699
1620
1672
1709
2072
1698
1695
2705

First run of commit 72a2ee688f4fd43a8a9f8b19f254e65bd63d6060 with 25 iterations.

1706
1648
1905
1676
1657
1673
1703
1680
1707
1725
1684
1690
1720
1676
1678
1650
1674
1679
1663
1710
1700
1714
1664
1662
1697

Second run of commit 72a2ee688f4fd43a8a9f8b19f254e65bd63d6060 with 25 iterations.


1694
1678
1698
1685
1658
1696
1711
1669
1721
1685
1870
2590
2694
1678
1692
1670
1677
1679
1713
1692
1660
1700
1743
1689
1666

@mcomella
Contributor Author

@mcomella, it would be great if you could help us confirm our findings.

Thanks for investigating. I'll run the backfill script on Monday and verify if the bimodal behavior was removed.

Since the regression is so large, another way to verify this is the root cause is to take start up profiles of the affected Nightly. If we can get a profile where the behavior is reproduced, it'll probably be pretty obvious where the regression occurred and we can see exactly what caused it.

@amedyne amedyne removed the needs:triage Issue needs triage label Jun 13, 2022
@mcomella
Contributor Author

mcomella commented Jun 13, 2022

The regression fix merged midday on June 9th, and based on the graphs (see the section below) it appears the bimodal behavior was addressed on 6/9! 🙌 However, there appears to be a new regression of ~56ms from the lower bimodal value. If we then widen the analysis range to include regression #25253 (since that regression overlapped with the bimodal behavior, which makes results more confusing to analyze, and we didn't identify any other regressions in this period), we see an even larger regression of ~72ms:

date                                duration
5/9 (before the 333ms regression)   1.464s
6/8 (the lower bimodal value)       1.480s
6/9 (bimodal regression addressed)  1.536s

When you remove bimodal behavior in the ideal case, the longer-running group of replicates disappears and your median drops (since the longer-running group is no longer pulling your median upwards). However, it's also possible that a regression is expected when the bimodal behavior is removed: e.g. the formerly longer-running group of replicates is still slower than the lower group, just not by as much.

There are a few points that give evidence for the former case, i.e. that this is a true regression and one that we can address (see the sketch below):

  • The minimum replicate increased from 1428 on 6/8 to 1489 on 6/9; we would expect it to stay about the same in the latter case.
  • In the ideal case, you'd expect the post-fix value to be near the median of the lower group of replicates (i.e. roughly the 25th percentile of all replicates) while bimodal behavior was present. The 25th percentile on 6/8 is 1448.52ms, which is close to the 5/9 median (with so few data points, you wouldn't expect this to be very accurate), so that seems to be the case.
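
A minimal Kotlin sketch of these sanity checks, intended to be run against the raw replicate lists attached to this issue; the percentile below uses simple linear interpolation, so it may differ slightly from the quoted 1448.52ms figure depending on the interpolation used in the original analysis:

```kotlin
// Percentile with linear interpolation between the two nearest ranks.
fun percentile(values: List<Long>, p: Double): Double {
    val sorted = values.sorted()
    val rank = p / 100.0 * (sorted.size - 1)
    val lo = rank.toInt()
    val hi = minOf(lo + 1, sorted.size - 1)
    return sorted[lo] + (sorted[hi] - sorted[lo]) * (rank - lo)
}

// Prints the three numbers used in the argument above: the minimum replicate,
// the 25th percentile (roughly the median of the lower mode when results are
// bimodal), and the overall median.
fun summarize(label: String, replicates: List<Long>) {
    val p25 = percentile(replicates, 25.0)
    val median = percentile(replicates, 50.0)
    println("$label: min=${replicates.minOrNull()} ms, p25=%.2f ms, median=%.2f ms".format(p25, median))
}
```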

So let's try to find the cause of this new regression. For clarity, perhaps this should be done in a new regression bug. I would suggest we bisect on 6/9 to identify if a new regression was introduced, unrelated to the backout.

If someone wants to do further analysis, I've attached the analysis from this week and 5/9: 25545-bimodal.zip

Graphs

For evidence of the removal of bimodal behavior, here is the 6/8 graph (bimodal expected):
[graph: 6/8 replicates, bimodal]

And 6/9 (bimodal not expected):
[graph: 6/9 replicates, not bimodal]

I confirmed the graphs from 6/10-6/12 also do not show bimodal behavior.

@mcomella
Contributor Author

However, there appears to be a new regression of ~56ms from the lower bimodal value.

So let's try to find the cause of this new regression. For clarity, perhaps this should be done in a new regression bug.

I noticed this regression appears to affect both MAIN and VIEW, so I filed a new bug: #25607. We should probably continue the investigation there.

@MatthewTighe
Contributor

MatthewTighe commented Jun 13, 2022

A couple things to address here:

  • The larger regression was likely caused by the instantiation of the FxaAccountManager, as this kicks off quite a lot of work. AFAIK, this instantiation was previously done later than it is in the offending commit. Still investigating for certainty, but it's the obvious candidate.
  • My guess for the cause of the smaller regression is that the feature's visibility was also switched from being determined by a simple boolean property to an accessor that relies on a disk read (see the sketch after this list). This was so it could be hidden behind a secret debug setting if we decided we wanted to turn it off externally but still test it internally.
  • If my hunch about the smaller regression is correct, then it is already on the v102 release branch and currently in beta.
  • In #25571 ("For #25545: turn off task continuity in nightly") above, instead of reverting the offending commit, the entire feature was flagged off. This was intended to give us insight into whether the feature itself or the changes around the feature flag were the underlying cause. IMO your findings indicate that both changes were impactful, though the smaller one obviously less so.
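
Regarding the second bullet, here is a hypothetical before/after sketch of that kind of visibility switch; the property and preference names are illustrative, not the real Fenix Settings code:

```kotlin
import android.content.Context

class Settings(private val context: Context) {
    // Before: a plain boolean, free to read on the startup path.
    val showSyncedTabBefore: Boolean = true

    // After: backed by SharedPreferences so it can be toggled from a secret
    // debug setting; the first read can trigger disk IO (loading the prefs
    // file) on whatever thread evaluates it.
    val showSyncedTabAfter: Boolean
        get() = context
            .getSharedPreferences("fenix_preferences", Context.MODE_PRIVATE)
            .getBoolean("pref_key_show_synced_tab", true)
}
```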

I don't have much context into perf targets yet. Is the smaller regression still large enough that it should be backed out of beta? I think optimally we'd like to keep that change in so that we have greater control over the feature's visibility as we move into testing and external availability. That said, I think it should be possible to remove it from the 102 release cycle if we'd like to.

I'll add a subset of this info to #25607 as well

@MatthewTighe
Contributor

MatthewTighe commented Jun 13, 2022

After reading through #25607, I actually don't think any of my points about the smaller regression above are correct. Wanted to leave them for posterity, though.

It looks like the smaller regression is also affecting Focus, per mozilla-mobile/focus-android#7216, which shouldn't be affected at all by any task continuity changes since sync is not incorporated into Focus. Given that, I think it's more likely that an A-C bump on 6/9 caused the smaller regression.

@Amejia481
Contributor

Amejia481 commented Jun 14, 2022

Thanks for all the details @MatthewTighe. I think we should try to figure out the source of the issue, and then we can make a decision as a team on how we are going to approach it :)

@mcomella
Contributor Author

The larger regression was likely caused by the instantiation of the FxaAccountManager, as this kicks off quite a lot of work. AFAIK, this instantiation was previously done later than it is in the offending commit. Still investigating for certainty, but it's the obvious candidate.

@MatthewTighe A simple way to confirm where FxaAccountManager was initialized is to take a profile of start up and use the search field to filter methods until you find it. Since you're not taking performance timings, the simplest way is to use the Android Studio profiler (assuming Fxa is initialized in debug builds). I find the AS profiler interface difficult to use, so you can save the trace file and load it into the Firefox Profiler instead by going to https://profiler.firefox.com/ and clicking "Load profile from file".

If you need a profile with accurate timing, I can share how to use the Firefox Profiler to measure start up. :)

@MatthewTighe
Contributor

MatthewTighe commented Jun 14, 2022

Thanks for the tip! I've been using the Firefox profiler on a nightly build to try to get as close as possible to the production APK, though I haven't set it up yet to include the secrets we inject through CI. I've only seen one run so far (I've only done a few, as I'm experimenting with the tool) that includes the suspect call:

[profiler screenshot showing the suspect call]

This is on a high-end device, so I'm not expecting to see as large a regression in any case. My next step is to see if I can replicate that method duration on the commit before the one introducing the feature.

Is there a good way to inspect the performance data in the profiler just up to the first drawn frame, like a marker or something similar?

E: This looks to be it. Is that right?
[profiler screenshot showing a first-frame marker]

@mcomella
Contributor Author

Is there a good way to inspect the performance data in the profiler just up to the first drawn frame, like a marker or something similar?

E: This looks to be it. Is that right?

@MatthewTighe Yep, that's the first frame. You can right-click on the marker to zoom in and fit the viewing area to the first frame.

I've only seen one run so far... that includes the suspect call:

As you probably know, you won't see the method if it doesn't get sampled. However, if it's worth the effort, you can add your own markers (even temporarily) to the profiler to track the method – markers get tracked every time even if the method they're called from doesn't get sampled.
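
For reference, a sketch of what a temporary marker helper could look like, assuming the android-components Profiler interface that Fenix exposes through components.core.engine.profiler (treat the exact overloads as an assumption and check them against the A-C version in tree):

```kotlin
import mozilla.components.concept.base.profiler.Profiler

// Wraps a block in a profiler marker. Unlike sampling, the marker is recorded
// every time the block runs, as long as the Gecko profiler is active.
fun <T> Profiler?.withMarker(markerName: String, block: () -> T): T {
    // getProfilerTime() returns null when the profiler isn't running; in that
    // case we just run the block without recording anything.
    val start = this?.getProfilerTime()
    val result = block()
    if (this != null && start != null) {
        addMarker(markerName, start)
    }
    return result
}

// Hypothetical call site:
// components.core.engine.profiler.withMarker("FxaAccountManager init") {
//     initializeFxaAccountManager()
// }
```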

@Amejia481 Amejia481 changed the title ~800ms regression on 50% of COLD MAIN first frame start ups (5/28 regression with bimodal behavior) Task continuity: ~800ms regression on 50% of COLD MAIN first frame start ups (5/28 regression with bimodal behavior) Jun 15, 2022
@MatthewTighe
Contributor

MatthewTighe commented Jun 15, 2022

Using the profiling tools, I had the following findings:

  • the original hypothesis that initialization of `FxaAccountManager` was the cause of the regression seems to be incorrect. That code is hit both before and after the offending commit.
  • one obvious difference between the good and bad commits was the initialization of `SyncedTabsStorage`. In one profile, that call stack alone accounted for 53ms on a high-end device.

I implemented a fix for that second issue, but I'm having some difficulty profiling it on the devices I have available: I'm not able to recreate even the original bimodal behavior. Would @mcomella or @Amejia481 have access to a device they could use to double-check whether this fix actually resolves the issue? If you have other advice on how I could verify, I'm happy to go that route as well.
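
For reference, a sketch of the general shape such a fix could take: defer construction so startup wiring doesn't pay the SyncedTabsStorage cost before the first frame. The class and holder names below are placeholders, not the actual Fenix change:

```kotlin
// Placeholder standing in for the real mozilla.components SyncedTabsStorage,
// which is expensive to construct.
class SyncedTabsStorage

class BackgroundServices {
    // `by lazy` defers construction until the first real access (e.g. when the
    // synced-tab card actually requests data) instead of paying for it while
    // wiring up the homescreen before the first frame.
    val syncedTabsStorage: SyncedTabsStorage by lazy { SyncedTabsStorage() }
}
```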

APKs are apparently too large to share inline on Github. What's a good alternative?

@Amejia481
Contributor

Would @mcomella or @Amejia481 have access to a device they could use to double-check whether this fix actually resolves the issue?

I can give you a hand; let's coordinate when you're online :)

@mcomella
Contributor Author

I worked with Amejia481 and MatthewTighe to identify the root cause. We took profiles of the 6/6 build (i.e. it has this regression but #25253 was reverted):

  • bimodal fast case: https://share.firefox.dev/3b04kxn
  • bimodal slow case: https://share.firefox.dev/3QoNxEp

We believe the root cause is that the patch moves the Pocket content to load before the first frame rather than after (as it had been): we see that HomeActivity.onLayout takes significantly longer in the slow case and that it executes Pocket layout methods that we don't see in the fast case.

Since this patch is unrelated to Pocket, it likely causes the issue by making Pocket be scheduled before the first frame more frequently than it was before, delaying the draw.

While this patch may not delay visual completeness, an 800ms delay to first frame seems like an unacceptable hit to perceived performance, in my opinion.

For the next step, we came up with a hack solution: force Pocket to always load after the first frame so we can re-enable and test the task continuity work. This should prevent the first frame from slowing down due to Pocket.
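
As a rough illustration of that hack (Fenix has its own startup and visual-completeness queueing, so this is only the general pattern, with a stand-in loadPocketStories callback):

```kotlin
import android.view.View
import android.view.ViewTreeObserver

// Defers the given work until after the first frame: the pre-draw listener
// fires as the first frame is about to draw, and post() schedules the work
// after that traversal completes instead of inside it.
fun View.runAfterFirstFrame(loadPocketStories: () -> Unit) {
    viewTreeObserver.addOnPreDrawListener(object : ViewTreeObserver.OnPreDrawListener {
        override fun onPreDraw(): Boolean {
            viewTreeObserver.removeOnPreDrawListener(this)
            post { loadPocketStories() }
            return true // let the first frame draw without waiting on Pocket
        }
    })
}
```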

However, it doesn't address the root problems: we're loading content on the homescreen that we're not going to show (since parts of Pocket are below the fold but we often load all of it), and Pocket seems unreasonably slow to load. That said, we can address those more comprehensive issues in #21854 and #22755. Note that in these performance observations, Pocket may be slower than in Play Store builds because of #25259.


Furthermore, when we hit the bimodal slow case we see a bug where the first frame loads with the homescreen scrolled down from the top, showing Pocket content, rather than scrolled to the top as we expect (this is already filed as #22939).

I confirmed I no longer see this in the 6/9 nightly, after the regressing change was disabled.

@Amejia481
Contributor

Furthermore, when we hit the bimodal slow case we see a bug where the first frame loads with the homescreen scrolled down from the top, showing Pocket content, rather than scrolled to the top as we expect (this is already filed as #22939).

@rocketsroger, we are seeing an issue similar to the one that was addressed in #22402; maybe you could help us get better context.

@rocketsroger
Contributor

@rocketsroger, we are seeing an issue similar to the one that was addressed in #22402; maybe you could help us get better context.

Doesn't look like it's related to #22402. Mine was a scroll position fix rather than a performance fix.

@rocketsroger
Contributor

rocketsroger commented Jun 22, 2022


@MatthewTighe explained the issue to me. I'll try to help if I can.
