Task continuity: ~800ms regression on 50% of COLD MAIN first frame start ups (5/28 regression with bimodal behavior) #25545
One way to debug this would be to take multiple profiles (5?), find one of each behavior (since it occurs roughly 50% of the time, it shouldn't take too long, assuming running the profiler doesn't change the behavior), and compare them to see what's different. Another method would be to install the 5/27 nightly and the 5/28 nightly and see if anything is visually different on start up to the homescreen (e.g. maybe this regression was introduced by a new view/feature being added to the homescreen).
Triage: since such a large regression occurs roughly half the time, this may be very bad for the user experience – setting P1.
I think this is the mozilla-central change log between 5/27 and 5/28:
Strong suspicion that this was introduced by the new task continuity feature that adds a synced tab to the homepage. There's quite a lot of IO that happens to fetch the tab. I'm surprised it's hitting the main thread that hard, but we decided it would be best to just disable the feature through a feature flag to ensure our suspicions are correct. Assuming we were right, I'll work on correcting the underlying issue when I return next week.
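For illustration, a minimal sketch of the kind of flag gating described above; the flag name and call site are hypothetical placeholders, not Fenix's actual feature-flag plumbing.

```kotlin
// Hypothetical sketch only: flag name and call site are placeholders, not Fenix's real code.
object FeatureFlags {
    // Flip to true once the start-up impact of fetching the synced tab is understood.
    const val taskContinuityEnabled = false
}

// Call site that gates the synced-tab fetch/display behind the flag.
fun maybeShowSyncedTab(fetchAndShowSyncedTab: () -> Unit) {
    if (FeatureFlags.taskContinuityEnabled) {
        fetchAndShowSyncedTab()
    }
}
```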
As @MatthewTighe mentioned, @csadilek, @agi, he, and I did some research where we tested the commit range from c9240e20ed0afeb516323f0222bea59db6487afd to e1c94881f48ec0d5a694b4c778188cba121f326c. We found that on commit 07d4a8599d102b7708c01616b1b5992f514b8d46 we were able to see the bimodal behavior, and on commit 72a2ee688f4fd43a8a9f8b19f254e65bd63d6060 the pattern was not present. At the moment, the commit we think is causing the pattern is only on Nightly (#25571); we backed it out temporarily until we can do more investigation. @mcomella, it would be great if you could help us confirm our findings. (A rough sketch of how this kind of replicate data can be collected follows the data below.)

Data:
First run of commit 07d4a8599d102b7708c01616b1b5992f514b8d46 with 25 iterations.
Second run of commit 07d4a8599d102b7708c01616b1b5992f514b8d46 with 25 iterations.
First run of commit 72a2ee688f4fd43a8a9f8b19f254e65bd63d6060 with 25 iterations.
Second run of commit 72a2ee688f4fd43a8a9f8b19f254e65bd63d6060 with 25 iterations.
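As a rough sketch of collecting cold start replicates like the runs above, assuming `adb` is on PATH and that the package/activity names below match the installed build (both are assumptions); this is not the team's actual perf tooling, and `am start -W`'s TotalTime is only a rough proxy for the first frame metric used in these tests.

```kotlin
import java.util.concurrent.TimeUnit

// Runs an adb command and returns its combined output.
fun adb(vararg args: String): String {
    val proc = ProcessBuilder("adb", *args).redirectErrorStream(true).start()
    val output = proc.inputStream.bufferedReader().readText()
    proc.waitFor(60, TimeUnit.SECONDS)
    return output
}

fun main() {
    // Assumptions: the build under test is installed as org.mozilla.fenix and HomeActivity
    // is what a COLD MAIN start launches; adjust both for the APK you actually installed.
    val pkg = "org.mozilla.fenix"
    val component = "$pkg/.HomeActivity"
    val replicates = (1..25).map {
        adb("shell", "am", "force-stop", pkg)   // kill the process to force a cold start
        Thread.sleep(1_000)                     // give the device a moment to settle
        val out = adb("shell", "am", "start", "-W", "-n", component)
        // TotalTime is a rough proxy; the real tests measure time to first frame.
        Regex("TotalTime: (\\d+)").find(out)?.groupValues?.get(1)?.toLong() ?: -1L
    }
    println("replicates (ms): $replicates")
}
```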
Thanks for investigating. I'll run the backfill script on Monday and verify whether the bimodal behavior was removed. Since the regression is so large, another thing we can do to verify this is the root cause is to take start up profiles of the affected Nightly. If we can get a profile where the behavior is reproduced, it'll probably be pretty obvious where the regression occurred and we can see what exactly caused it.
The regression fix merged mid-day on June 9th, and based on the graphs (see section below) it appears the bimodal behavior was addressed on 6/9! 🙌 However, there appears to be a new regression of ~56ms from the lower bimodal value. If we then widen the analysis range to include regression #25253 (since that regression overlapped with the bimodal behavior, which makes the results more confusing to analyze, and we didn't identify any other regressions in this period), we see an even larger regression of ~72ms:
In the ideal case, when you remove the bimodal behavior, the longer-running group of replicates disappears and your median drops (since the longer-running group is no longer pulling your median upwards). However, it's also possible that removing the bimodal behavior leaves an expected regression: e.g. the formerly longer-running group of replicates is still longer-running than the lower group, just not by as much. There are a few points that give evidence for the former case, i.e. that this is a true regression and one that we can address (a toy example of the median effect is sketched after this comment):
So let's try to find the cause of this new regression. For clarity, perhaps this should be done in a new regression bug. I would suggest we bisect on 6/9 to identify if a new regression was introduced, unrelated to the backout. If someone wants to do further analysis, I've attached the analysis from this week and 5/9: 25545-bimodal.zip

Graphs: For evidence of the removal of bimodal behavior, here is the 6/8 graph (bimodal expected): And 6/9 (bimodal not expected): I confirmed the graphs from 6/10-6/12 also do not show bimodal behavior.
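To make the median reasoning above concrete, here is a toy example with made-up numbers (not measured data): with bimodal replicates the upper group can capture the median, and once that group disappears the median falls back near the lower value, so even a small uniform regression on top of it becomes visible.

```kotlin
// Toy numbers only – not measurements – to show how removing the upper mode moves the median.
fun median(xs: List<Long>): Long = xs.sorted()[xs.size / 2]  // fine for odd-sized lists

fun main() {
    val bimodal = List(12) { 900L } + List(13) { 1700L }  // slow group holds the middle slot
    val fixed = List(25) { 956L }                         // slow group gone, ~56ms uniform regression
    println(median(bimodal))  // 1700: median pulled up to the longer-running group
    println(median(fixed))    // 956: median drops, but sits above the old ~900 lower value
}
```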
I noticed this regression appears to affect both MAIN and VIEW so I filed a new bug: #25607. We should probably continue the investigation there.
A couple things to address here:
I don't have much context on perf targets yet. Is the smaller regression still large enough that it should be backed out of beta? I think optimally we'd like to keep that change in so that we have greater control over the feature's visibility as we move into testing and external availability. That said, I think it should be possible to remove it from the 102 release cycle if we'd like to. I'll add a subset of this info to #25607 as well.
After reading through #25607, I actually don't think any of my points about the smaller regression above are correct. I wanted to leave them for posterity, though. It looks like the smaller regression is also affecting Focus, per mozilla-mobile/focus-android#7216, which shouldn't be affected at all by any task continuity changes since sync is not incorporated into Focus. Given that, I think it's more likely that an A-C bump during 6/9 caused the smaller regression.
Thanks for all the details @MatthewTighe. I think we should try to figure out the source of the issue, and then we can, as a team, make a decision on how to approach it :)
@MatthewTighe A simple way to confirm where
If you need a profile with accurate timing, I can share how to use the Firefox Profiler to measure start up. :)
Thanks for the tip! I've been using the Firefox profiler on a nightly build to try and get as close as possible to the production APK, though I haven't set it up yet to include the secrets we inject through CI. I've only seen one run so far (I've only done a few, as I'm experimenting with the tool) that includes the suspect call: This is on a high-end device, so I'm not expecting to see as large a regression in any case. My next step is to see if I can replicate that method duration on the commit before the one introducing the feature. Is there a good way to inspect the performance data in the profiler just up to the first drawn frame, like a marker or something similar?
@MatthewTighe Yep, that's the first frame. You can right-click on the marker to zoom in and fit the viewing area to the first frame.
As you probably know, you won't see the method if it doesn't get sampled. However, if it's worth the effort, you can add your own markers (even temporarily) to the profiler to track the method – markers get tracked every time even if the method they're called from doesn't get sampled.
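As a sketch of that suggestion, something like the following could wrap the suspect call with a marker via the android-components Profiler interface; the `loadSyncedTab` lambda, the way the profiler handle is obtained, and the exact overload used are assumptions for illustration, so check the concept-base Profiler API before copying.

```kotlin
import mozilla.components.concept.base.profiler.Profiler

// Temporary instrumentation: records a marker spanning the suspect call, even on runs
// where the sampler never catches the method itself.
fun loadSyncedTabWithMarker(profiler: Profiler?, loadSyncedTab: () -> Unit) {
    val start = profiler?.getProfilerTime()  // null when the Gecko profiler isn't running
    loadSyncedTab()
    profiler?.addMarker("loadSyncedTab", start, null, "task continuity synced tab fetch")
}
```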
Using the profiling tools, I had the following findings:
I implemented a fix for that second issue, but am having some difficulty profiling it on the devices I have available. I'm not even able to recreate the original bimodal behavior. Would @mcomella or @Amejia481 have access to a device they could use to double-check whether this fix actually resolves the issue? If you have other advice on how I could verify, I'm happy to go that route as well. APKs are apparently too large to share inline on GitHub. What's a good alternative?
I could give you a hand, let's coordinate when you are online :)
I worked with Amejia481 and MatthewTighe to identify the root cause. We took profiles of the 6/6 build (i.e. it has this regression but #25253 was reverted):
We believe the root cause is that the patch moves the Pocket content to load before the first frame rather than after (as it had been), based on what we see in the profiles. Since this patch is unrelated to Pocket, it's likely causing the issue by scheduling Pocket before the first frame more frequently than before, delaying the draw. While this patch may not delay visual completeness, an 800ms delay to first frame seems like an unacceptable hit to perceived performance, in my opinion.

For the next step, we came up with a hack solution: force Pocket to always load after the first frame so we can re-enable and test the task continuity work (a rough sketch is below). This should prevent the first frame from slowing down due to Pocket. However, it doesn't address the root problem: we're loading content on the homescreen that we're not going to show (parts of Pocket are below the fold but we often load all of it), and Pocket seems unreasonably slow to load. We can address those more comprehensive issues in #21854 and #22755. Note that in these performance observations, Pocket may be slower than in Play store builds because of #25259.

Furthermore, when we hit the bimodal slow case, we see a bug where the first frame loads with the homescreen scrolled down from the top, showing Pocket content (this is already filed as #22939), rather than scrolled to the top as we expect. I confirmed I no longer see this in the 6/9 nightly, after the regressing bug was disabled.
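A minimal sketch of that hack, assuming a `loadPocketStories()` entry point and a handle on the home screen's root view (both hypothetical names): the idea is just to queue the Pocket work behind the first draw rather than in front of it.

```kotlin
import android.view.View
import androidx.core.view.doOnPreDraw

// Defers the (hypothetical) Pocket load until after the home screen's first frame:
// doOnPreDraw runs once just before that frame, and posting from it queues the work
// behind the draw so the frame itself isn't blocked.
fun deferPocketUntilAfterFirstFrame(homeView: View, loadPocketStories: () -> Unit) {
    homeView.doOnPreDraw {
        homeView.post { loadPocketStories() }
    }
}
```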
@rocketsroger we are seeing an issue similar to the one that was addressed in #22402; maybe you could help give us more context.
Doesn't look like it's related to #22402. Mine was a scroll position fix rather than a performance fix.
@MatthewTighe explained the issue to me. I'll try to help if I can.
see #22144 for bimodal results in COLD VIEW, which is very likely unrelated
For our COLD MAIN first frame start up test, our replicates are usually scattered evenly (I verified this behavior on the Nightlies 4/29, 5/15, 5/17, 5/18, 5/19, 5/21, and 5/25-5/27). For example, here is the Nightly 5/27 performance test:
However, Nightly 5/28 introduces bimodal results:
I verified:
Bimodal results are undesirable because: