This repository has been archived by the owner on Feb 20, 2023. It is now read-only.

Task continuity: ~800ms regression on 50% of COLD MAIN first frame start ups (5/28 regression with bimodal behavior) #25545

Closed
mcomella opened this issue Jun 6, 2022 · 20 comments

@mcomella
Contributor

mcomella commented Jun 6, 2022

See #22144 for bimodal results in COLD VIEW, which is very likely unrelated.

For our COLD MAIN first frame start up test, our replicates are usually scattered evenly (I verified this behavior on the Nightlies 4/29, 5/15, 5/17, 5/18, 5/19, 5/21, and 5/25-5/27). For example, here is the Nightly 5/27 performance test:
[graph: Nightly 5/27 replicates, scattered evenly]

However, Nightly 5/28 introduces bimodal results:
[graph: Nightly 5/28 replicates, bimodal]

I verified:

  • This bimodal behavior continues on 5/29.
  • From 5/30-6/1, we don't see bimodal behavior; instead, we see a major regression such that all replicates are much longer.
  • We return to bimodal behavior from 6/2-6/5, which is where our test results end.

Bimodal results are undesirable because:

  • Users are experiencing fast start ups roughly half the time and slower start ups the other half. Why not figure out how to give users the fast start up all the time?
  • It can cause false-positive regressions, making it difficult to identify true regressions: if we get a low result roughly 50% of the time and a high result the other 50%, as in this bug, the reported median will regularly jump between the two modes (see the sketch below).
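
To make the second point concrete, here is a minimal Kotlin sketch (the mode values are rough placeholders based on the replicates reported later in this issue) of how the run-level median of ~50/50 bimodal replicates flips between the two modes from one test run to the next:

```kotlin
import kotlin.random.Random

// Each simulated run draws 25 replicates that land in a "fast" or "slow" mode
// with equal probability; the run's reported median then lands on whichever
// mode happened to win the coin flip for that run.
fun median(values: List<Long>): Long = values.sorted()[values.size / 2]

fun main() {
    repeat(5) { runIndex ->
        val replicates = List(25) {
            if (Random.nextBoolean()) 1670L + Random.nextLong(60) // fast mode
            else 2450L + Random.nextLong(60)                      // slow mode
        }
        println("run $runIndex: median = ${median(replicates)} ms")
    }
}
```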


@mcomella mcomella added the performance Possible performance wins label Jun 6, 2022
@github-actions github-actions bot added the needs:triage Issue needs triage label Jun 6, 2022
@mcomella
Contributor Author

mcomella commented Jun 6, 2022

One way to debug this would be to take multiple profiles (5?), find one of each behavior (since it occurs roughly 50% of the time, it shouldn't take too long assuming running the profiler doesn't change the behavior), and compare them to see what's different.

Another method would be to install the 5/27 nightly and the 5/28 nightly and see if anything is visually different on start up to the homescreen (e.g. maybe this regression was introduced by a new view/feature being added to the homescreen).

@mcomella mcomella changed the title Bimodal results in COLD MAIN first frame start up time (5/28 regression) ~800ms regression on 50% of COLD MAIN first frame start ups (5/28 regression due to bimodal behavior) Jun 8, 2022
@mcomella mcomella changed the title ~800ms regression on 50% of COLD MAIN first frame start ups (5/28 regression due to bimodal behavior) ~800ms regression on 50% of COLD MAIN first frame start ups (5/28 regression with bimodal behavior) Jun 8, 2022
@mcomella
Contributor Author

mcomella commented Jun 8, 2022

Triage: since such a large regression occurs roughly half the time, this may be very bad for the user experience – setting P1.

@Amejia481 Amejia481 self-assigned this Jun 8, 2022
@cpeterso

cpeterso commented Jun 8, 2022

MatthewTighe added a commit to MatthewTighe/fenix that referenced this issue Jun 9, 2022
@MatthewTighe
Contributor

Strong suspicion that this was introduced by the new task continuity feature that adds a synced tab to the homepage. There's quite a lot of I/O that happens to fetch the tab. I'm surprised it's hitting the main thread that hard, but we decided it would be best to just disable the feature through a feature flag to confirm our suspicions. Assuming we're right, I'll work on correcting the underlying issue when I return next week.
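
For context, here is a minimal sketch of the kind of feature-flag kill switch being described; the flag name and call site are hypothetical, not the actual Fenix FeatureFlags entry:

```kotlin
// Hypothetical compile-time kill switch: while this is false, the synced-tab
// card is never built, so none of its IO can land on the startup path.
object FeatureFlags {
    const val taskContinuityEnabled = false
}

// Hypothetical call site on the homescreen: only wire up the synced-tab card
// (and the IO behind it) when the flag is on.
fun maybeShowSyncedTab(showSyncedTabCard: () -> Unit) {
    if (FeatureFlags.taskContinuityEnabled) {
        showSyncedTabCard()
    }
}
```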

@Amejia481
Contributor

Amejia481 commented Jun 9, 2022

As @MatthewTighe mentioned, @csadilek, @agi, he, and I did some research: we tested the commit range from c9240e20ed0afeb516323f0222bea59db6487afd to e1c94881f48ec0d5a694b4c778188cba121f326c, which is the range we found between the TC builds for the 2022.05.27 nightly APK and the 2022.05.28 nightly APK.


We found that on commit 07d4a8599d102b7708c01616b1b5992f514b8d46 we were able to see the bimodal behavior and on commit 72a2ee688f4fd43a8a9f8b19f254e65bd63d6060 the pattern was not present.

At the moment, the commit we think is causing the pattern is only on Nightly. We backed it out temporarily in #25571 until we can do more investigation. @mcomella, it would be great if you could help us confirm our findings.


Data: from running `python3 measure_start_up.py -c=25 nightly cold_main_first_frame file.text`


First run of commit 07d4a8599d102b7708c01616b1b5992f514b8d46 with 25 iterations

1722
2392
2385
2458
1675
2376
2435
1609
1678
1673
1673
1676
1671
2405
1668
1689
1704
1682
1673
1751
1665
1659
1703
1759
1689

Second run of commit 07d4a8599d102b7708c01616b1b5992f514b8d46 with 25 iterations.

1665
1666
1663
1651
2866
1670
1668
1653
1664
2134
1686
1673
1698
1675
1693
1673
1681
1699
1620
1672
1709
2072
1698
1695
2705

First run of commit 72a2ee688f4fd43a8a9f8b19f254e65bd63d6060 with 25 iterations.

1706
1648
1905
1676
1657
1673
1703
1680
1707
1725
1684
1690
1720
1676
1678
1650
1674
1679
1663
1710
1700
1714
1664
1662
1697

Second run of commit 72a2ee688f4fd43a8a9f8b19f254e65bd63d6060 with 25 iterations.


1694
1678
1698
1685
1658
1696
1711
1669
1721
1685
1870
2590
2694
1678
1692
1670
1677
1679
1713
1692
1660
1700
1743
1689
1666

@mcomella
Contributor Author

@mcomella, it would be great if you could help us confirm our findings.

Thanks for investigating. I'll run the backfill script on Monday and verify if the bimodal behavior was removed.

Since the regression is so large, another way to verify this is the root cause is to take start up profiles of the affected Nightly. If we can get a profile where the behavior is reproduced, it'll probably be pretty obvious where the regression occurred and we can see exactly what caused it.

@amedyne amedyne removed the needs:triage Issue needs triage label Jun 13, 2022
@mcomella
Contributor Author

mcomella commented Jun 13, 2022

The regression fix merged midday on June 9th, and based on the graphs (see the section below) it appears the bimodal behavior was addressed on 6/9! 🙌 However, there appears to be a new regression of ~56ms from the lower bimodal value. If we then widen the analysis range to include regression #25253 (since that regression overlapped with the bimodal behavior, which makes results more confusing to analyze, and we didn't identify any other regressions in this period), we see an even larger regression of ~72ms:

date                                duration
5/9 (before the 333ms regression)   1.464s
6/8 (the lower bimodal value)       1.480s
6/9 (bimodal regression addressed)  1.536s

When you remove bimodal behavior in the ideal case, the longer-running group of replicates disappears and your median drops (since the longer-running group is no longer pulling your median upwards). However, it's also possible that a regression is expected when the bimodal behavior is removed: e.g. the formerly longer-running group of replicates is still slower than the lower group, just not by as much.

There are a few points that give evidence for the former case, i.e. that this is a true regression and one that we can address (see the sketch below):

  • The minimum replicate increased from 1428 on 6/8 to 1489 on 6/9; we would expect it to stay about the same in the latter case.
  • In the ideal case, you'd expect the post-fix value to be near the median of the lower group of replicates (i.e. roughly the 25th percentile of all replicates) while bimodal behavior was present. The 25th percentile on 6/8 is 1448.52ms, which is close to the 5/9 median (with so few data points, you wouldn't expect this to be very accurate), so that seems to be the case.
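
A minimal Kotlin sketch of these sanity checks, intended to be run against the raw replicate lists attached to this issue; the percentile below uses simple linear interpolation, so it may differ slightly from the quoted 1448.52ms figure depending on the interpolation used in the original analysis:

```kotlin
// Percentile with linear interpolation between the two nearest ranks.
fun percentile(values: List<Long>, p: Double): Double {
    val sorted = values.sorted()
    val rank = p / 100.0 * (sorted.size - 1)
    val lo = rank.toInt()
    val hi = minOf(lo + 1, sorted.size - 1)
    return sorted[lo] + (sorted[hi] - sorted[lo]) * (rank - lo)
}

// Prints the three numbers used in the argument above: the minimum replicate,
// the 25th percentile (roughly the median of the lower mode when results are
// bimodal), and the overall median.
fun summarize(label: String, replicates: List<Long>) {
    val p25 = percentile(replicates, 25.0)
    val median = percentile(replicates, 50.0)
    println("$label: min=${replicates.minOrNull()} ms, p25=%.2f ms, median=%.2f ms".format(p25, median))
}
```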

So let's try to find the cause of this new regression. For clarity, perhaps this should be done in a new regression bug. I would suggest we bisect on 6/9 to identify if a new regression was introduced, unrelated to the backout.

If someone wants to do further analysis, I've attached the analysis from this week and 5/9: 25545-bimodal.zip

Graphs

For evidence of the removal of bimodal behavior, here is the 6/8 graph (bimodal expected):
[graph: 6/8 replicates, bimodal]

And 6/9 (bimodal not expected):
[graph: 6/9 replicates, not bimodal]

I confirmed the graphs from 6/10-6/12 also do not show bimodal behavior.

@mcomella
Contributor Author

However, there appears to be a new regression of ~56ms from the lower bimodal value.

So let's try to find the cause of this new regression. For clarity, perhaps this should be done in a new regression bug.

I noticed this regression appears to affect both MAIN and VIEW, so I filed a new bug: #25607. We should probably continue the investigation there.

@MatthewTighe
Contributor

MatthewTighe commented Jun 13, 2022

A couple things to address here:

  • The larger regression was likely caused by the instantiation of the FxaAccountManager, as this kicks off quite a lot of work. AFAIK, this instantiation was previously done later than it is in the offending commit. Still investigating for certainty, but it's the obvious candidate.
  • My guess for the cause of the smaller regression is that the feature's visibility was also switched from being determined by a simple boolean property to an accessor that relies on a disk read (see the sketch after this list). This was so it could be hidden behind a secret debug setting if we decided we wanted to turn it off externally but still test it internally.
  • If my hunch about the smaller regression is correct, then it is already on the v102 release branch and currently in beta.
  • In #25571 ("For #25545: turn off task continuity in nightly") above, instead of reverting the offending commit, the entire feature was flagged off. This was intended to give us insight into whether the feature itself or the changes around the feature flag were the underlying cause. IMO your findings indicate that both changes were impactful, though the smaller one obviously less so.
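
Regarding the second bullet, here is a hypothetical before/after sketch of that kind of visibility switch; the property and preference names are illustrative, not the real Fenix Settings code:

```kotlin
import android.content.Context

class Settings(private val context: Context) {
    // Before: a plain boolean, free to read on the startup path.
    val showSyncedTabBefore: Boolean = true

    // After: backed by SharedPreferences so it can be toggled from a secret
    // debug setting; the first read can trigger disk IO (loading the prefs
    // file) on whatever thread evaluates it.
    val showSyncedTabAfter: Boolean
        get() = context
            .getSharedPreferences("fenix_preferences", Context.MODE_PRIVATE)
            .getBoolean("pref_key_show_synced_tab", true)
}
```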

I don't have much context into perf targets yet. Is the smaller regression still large enough that it should be backed out of beta? I think optimally we'd like to keep that change in so that we have greater control over the feature's visibility as we move into testing and external availability. That said, I think it should be possible to remove it from the 102 release cycle if we'd like to.

I'll add a subset of this info to #25607 as well

@MatthewTighe
Contributor

MatthewTighe commented Jun 13, 2022

After reading through #25607, I actually don't think any of my points about the smaller regression above are correct. Wanted to leave them for posterity, though.

It looks like the smaller regression is also affecting Focus, per mozilla-mobile/focus-android#7216, which shouldn't be affected at all by any task continuity changes since sync is not incorporated into Focus. Given that, I think it's more likely that an A-C bump on 6/9 caused the smaller regression.

@Amejia481
Contributor

Amejia481 commented Jun 14, 2022

Thanks for all the details @MatthewTighe. I think we should try to figure out the source of the issue, and then we can make a decision as a team on how we are going to approach it :)

@mcomella
Contributor Author

The larger regression was likely caused by the instantiation of the FxaAccountManager, as this kicks off quite a lot of work. AFAIK, this instantiation was previously done later than it is in the offending commit. Still investigating for certainty, but it's the obvious candidate.

@MatthewTighe A simple way to confirm where FxaAccountManager was initialized is to take a profile of start up and use the search field to filter methods until you find it. Since you're not taking performance timings, the simplest way is to use the Android Studio profiler (assuming Fxa is initialized in debug builds). I find the AS profiler interface difficult to use, so you can save the trace file and load it into the Firefox Profiler instead by going to https://profiler.firefox.com/ and clicking "Load profile from file".

If you need a profile with accurate timing, I can share how to use the Firefox Profiler to measure start up. :)

@MatthewTighe
Contributor

MatthewTighe commented Jun 14, 2022

Thanks for the tip! I've been using the Firefox profiler on a nightly build to try to get as close as possible to the production APK, though I haven't set it up yet to include the secrets we inject through CI. I've only seen one run so far (I've only done a few, as I'm experimenting with the tool) that includes the suspect call:

[profiler screenshot showing the suspect call]

This is on a high-end device, so I'm not expecting to see as large a regression in any case. My next step is to see if I can replicate that method duration on the commit before the one introducing the feature.

Is there a good way to inspect the performance data in the profiler just up to the first drawn frame, like a marker or something similar?

E: This looks to be it. Is that right?
[profiler screenshot showing a first-frame marker]

@mcomella
Contributor Author

Is there a good way to inspect the performance data in the profiler just up to the first drawn frame, like a marker or something similar?

E: This looks to be it. Is that right?

@MatthewTighe Yep, that's the first frame. You can right-click on the marker to zoom in and fit the viewing area to the first frame.

I've only seen one run so far... that includes the suspect call:

As you probably know, you won't see the method if it doesn't get sampled. However, if it's worth the effort, you can add your own markers (even temporarily) to the profiler to track the method – markers get tracked every time even if the method they're called from doesn't get sampled.
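
For reference, a sketch of what a temporary marker helper could look like, assuming the android-components Profiler interface that Fenix exposes through components.core.engine.profiler (treat the exact overloads as an assumption and check them against the A-C version in tree):

```kotlin
import mozilla.components.concept.base.profiler.Profiler

// Wraps a block in a profiler marker. Unlike sampling, the marker is recorded
// every time the block runs, as long as the Gecko profiler is active.
fun <T> Profiler?.withMarker(markerName: String, block: () -> T): T {
    // getProfilerTime() returns null when the profiler isn't running; in that
    // case we just run the block without recording anything.
    val start = this?.getProfilerTime()
    val result = block()
    if (this != null && start != null) {
        addMarker(markerName, start)
    }
    return result
}

// Hypothetical call site:
// components.core.engine.profiler.withMarker("FxaAccountManager init") {
//     initializeFxaAccountManager()
// }
```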

@Amejia481 Amejia481 changed the title ~800ms regression on 50% of COLD MAIN first frame start ups (5/28 regression with bimodal behavior) Task continuity: ~800ms regression on 50% of COLD MAIN first frame start ups (5/28 regression with bimodal behavior) Jun 15, 2022
@MatthewTighe
Contributor

MatthewTighe commented Jun 15, 2022

Using the profiling tools, I had the following findings:

  • the original hypothesis that initialization of `FxaAccountManager` was the cause of the regression seems to be incorrect. That code is hit both before and after the offending commit.
  • one obvious difference between the good and bad commits was the initialization of `SyncedTabsStorage`. In one profile, that call stack alone accounted for 53ms on a high-end device.

I implemented a fix for that second issue, but I'm having some difficulty profiling it on the devices I have available: I'm not able to recreate even the original bimodal behavior. Would @mcomella or @Amejia481 have access to a device they could use to double-check whether this fix actually resolves the issue? If you have other advice on how I could verify, I'm happy to go that route as well.
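
For reference, a sketch of the general shape such a fix could take: defer construction so startup wiring doesn't pay the SyncedTabsStorage cost before the first frame. The class and holder names below are placeholders, not the actual Fenix change:

```kotlin
// Placeholder standing in for the real mozilla.components SyncedTabsStorage,
// which is expensive to construct.
class SyncedTabsStorage

class BackgroundServices {
    // `by lazy` defers construction until the first real access (e.g. when the
    // synced-tab card actually requests data) instead of paying for it while
    // wiring up the homescreen before the first frame.
    val syncedTabsStorage: SyncedTabsStorage by lazy { SyncedTabsStorage() }
}
```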

APKs are apparently too large to share inline on Github. What's a good alternative?

@Amejia481
Contributor

Would @mcomella or @Amejia481 have access to a device they could use to double-check whether this fix actually resolves the issue?

I can give you a hand; let's coordinate when you're online :)

@mcomella
Contributor Author

I worked with Amejia481 and MatthewTighe to identify the root cause. We took profiles of the 6/6 build (i.e. it has this regression but #25253 was reverted):

  • bimodal fast case: https://share.firefox.dev/3b04kxn
  • bimodal slow case: https://share.firefox.dev/3QoNxEp

We believe the root cause is that the patch moves the Pocket content to load before the first frame rather than after (as it had been): we see that HomeActivity.onLayout takes significantly longer in the slow case and that it executes Pocket layout methods that we don't see in the fast case.

Since this patch is unrelated to Pocket, it likely causes the issue by making Pocket be scheduled before the first frame more frequently than it was before, delaying the draw.

While this patch may not delay visual completeness, an 800ms delay to first frame seems like an unacceptable hit to perceived performance, in my opinion.

For the next step, we came up with a hack solution: force Pocket to always load after the first frame so we can re-enable and test the task continuity work. This should prevent the first frame from slowing down due to Pocket.
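
As a rough illustration of that hack (Fenix has its own startup and visual-completeness queueing, so this is only the general pattern, with a stand-in loadPocketStories callback):

```kotlin
import android.view.View
import android.view.ViewTreeObserver

// Defers the given work until after the first frame: the pre-draw listener
// fires as the first frame is about to draw, and post() schedules the work
// after that traversal completes instead of inside it.
fun View.runAfterFirstFrame(loadPocketStories: () -> Unit) {
    viewTreeObserver.addOnPreDrawListener(object : ViewTreeObserver.OnPreDrawListener {
        override fun onPreDraw(): Boolean {
            viewTreeObserver.removeOnPreDrawListener(this)
            post { loadPocketStories() }
            return true // let the first frame draw without waiting on Pocket
        }
    })
}
```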

However, it doesn't address the root problems: we're loading content on the homescreen that we're not going to show (since parts of Pocket are below the fold but we often load all of it), and Pocket seems unreasonably slow to load. That said, we can address those more comprehensive issues in #21854 and #22755. Note that in these performance observations, Pocket may be slower than in Play Store builds because of #25259.


Furthermore, when we hit the bimodal slow case we see a bug where the first frame loads with the homescreen scrolled down from the top, showing Pocket content, rather than scrolled to the top as we expect (this is already filed as #22939).

I confirmed I no longer see this in the 6/9 nightly, after the regressing change was disabled.

@Amejia481
Contributor

Furthermore, when we hit the bimodal slow case we see a bug where the first frame loads with the homescreen scrolled down from the top, showing Pocket content, rather than scrolled to the top as we expect (this is already filed as #22939).

@rocketsroger, we are seeing an issue similar to the one that was addressed in #22402; maybe you could help us get better context.

@rocketsroger
Contributor

@rocketsroger, we are seeing an issue similar to the one that was addressed in #22402; maybe you could help us get better context.

Doesn't look like it's related to #22402. Mine was a scroll position fix rather than a performance fix.

@rocketsroger
Contributor

rocketsroger commented Jun 22, 2022


@MatthewTighe explained the issue to me. I'll try to help if I can.
