Skip to content
This repository has been archived by the owner on Feb 20, 2023. It is now read-only.

[Bug] On slow network, main thread blocked after clicking url bar from homescreen on cold startup due to MLS #9935

Closed
mcomella opened this issue Apr 14, 2020 · 28 comments
Assignees
Labels
🐞 bug Crashes, Something isn't working, .. E5 Estimation Point: about 5 days eng:qa:not-needed Added by QA to issues that cannot be tested performance Possible performance wins

Comments

@mcomella
Copy link
Contributor

mcomella commented Apr 14, 2020

Steps to reproduce

  • Build geckoBetaFennecBeta or geckoBetaForPerformanceTest
  • Start app (cold start)
  • Click url bar

Expected behavior

Quick to open

Actual behavior

Long pause. In forPerformanceTest with the profiler going (so with much overhead), it takes > 10 seconds.

Device information

  • Android device: P2
  • Fenix version: master 152642d

I took a profile and it looks like FenixSearchEngineProvider.installedSearchEngines is blocking the main thread for a long time:
Screen Shot 2020-04-14 at 15 00 43

I looked at blame and nothing changed in Fenix so this is likely an issue in a-c.


Comment form @MarcLeclair : This issue occurs because Fenix blocks the UI thread when calling LocationServices in FenixSearchEngineProvider.kt in method installedSearchEngine(). The best way to see this is to put LocationServiceS.dummy() in the else branch. Since the app built locally doesn't have any token, the app makes empty http request that just hangs the app.

FYI it does occur on the first click as the most obvious but it will happen on any subsequent calls, it just seems faster ( no idea why, didn't look much further into it).

┆Issue is synchronized with this Jira Task

@mcomella mcomella added 🐞 bug Crashes, Something isn't working, .. performance Possible performance wins labels Apr 14, 2020
@mcomella mcomella added this to Needs prioritization in Performance, front-end roadmap via automation Apr 14, 2020
@github-actions github-actions bot added the needs:triage Issue needs triage label Apr 14, 2020
@mcomella
Copy link
Contributor Author

@csadilek Do you know what might have changed in a-c to have caused this?

@csadilek
Copy link
Contributor

@mcomella looks like the cause is this: d89fbd7

Basically, the lack of an API key for MLS. I've verified that switching to the dummy location provider for those builds fixes it.

@mcomella
Copy link
Contributor Author

I think the possible solutions are:

  1. Use dummy MLS if API key is not available (i.e. based not on build variant but build inputs)
  2. Use a production version of MLS without an API key

I'd rather we do 2. so we can mimic the performance characteristics of production builds (and developers are closer to running what production builds do in their debug builds) but I'm not sure how feasible that is.

@mcomella mcomella self-assigned this Apr 15, 2020
@mcomella
Copy link
Contributor Author

For context, MLS is the Mozilla Location Service which is used to determine roughly where the user is so we can provide locale-specific search engines for them.


During triage we realized this could be indicative of an underlying performance problem: if a production version of MLS without an API key pauses the app for 10+ seconds, can the production version with the API key also create a critical performance issue?

@pocmo Can you explain to me why this timeout occurs if we don't have an API key? Do you think it's possible that this is indicative of an underlying performance problem we need to address? fwiw, csadilek thought it could be a perf problem but not a P1 perf problem.

Assigning self to remember to identify the priority.

@pocmo
Copy link
Contributor

pocmo commented Apr 22, 2020

@mcomella looks like the cause is this: d89fbd7

This is not the cause. It is trying to put even more bandaid on this thing. :)

@pocmo Can you explain to me why this timeout occurs if we don't have an API key? Do you think it's possible that this is indicative of an underlying performance problem we need to address? fwiw, csadilek thought it could be a perf problem but not a P1 perf problem.

There are definitely multiple weird things going on in Fenix and I didn't have the time to look into them. Fenix has some wrappers around the AC search code and those seem to change the behavior in a way that I do not understand yet.

  • AC's SearchEngineManager does only try once per process lifetime to determine the location and only if it was never done before and is not cached. In Fenix those seemed to happen more often, sidesteping our implementation.
  • Everything accessing SearchEngineManager needs to use the suspending methods asynchronously since we may need to do at least one disk read per process (or network request per install). If the app freezes then this does not seem to be the case (or the wrapper does something else synchronously)
  • Our MLS implementation in AC currently uses a 10 second connect and read timeout. That is probably a bit long as well and we could consider making that shorter - although that is not the root problem here.

So yeah, it does make sense to use the dummy implementation in builds that don't have an API key. No need to ping the server for nothing. But at the same time the app should not freeze if this request takes long and as of now it can take at least 10+ seconds with those timeouts.

@pocmo
Copy link
Contributor

pocmo commented Apr 22, 2020

Our MLS implementation in AC currently uses a 10 second connect and read timeout. That is probably a bit long as well and we could consider making that shorter - although that is not the root problem here.

FWIW, those are exactly the same timeouts as Fennec uses:
https://searchfox.org/mozilla-esr68/source/mobile/android/base/java/org/mozilla/gecko/search/SearchEngineManager.java#399

@mcomella mcomella moved this from Needs prioritization to Backlog (prioritized) in Performance, front-end roadmap Apr 22, 2020
@mcomella mcomella removed their assignment Apr 22, 2020
@liuche
Copy link
Contributor

liuche commented Apr 23, 2020

We're blocking the main thread when fetching geolocation for search engines. This is even worse when there is no MLS key (like with the two build flavors that mcomella has mentioned). Sebastian and mcomella have good suggestions in their comments above, and these would be good perf bugs to try profiling (for before/after).

Might be useful to have some SearchFragment knowledge

  1. don't block if no MLS key, fetch the key async
  2. Show empty list of engines until they load

@mcomella mcomella changed the title [Bug] Long pause after clicking url bar from homescreen on cold startup in fennecBeta/forPerformanceTest builds [Bug] Long pause after clicking url bar from homescreen on cold startup in fennecBeta/forPerformanceTest builds due to MLS May 6, 2020
@liuche liuche added this to Prioritized Backlog in Fenix Sprint Kanban May 7, 2020
@mcomella mcomella moved this from Backlog (prioritized) to Top 10 Inter-Team bugs in Performance, front-end roadmap May 13, 2020
@liuche liuche added the needs:UX-feedback Needs UX Feedback label May 14, 2020
@liuche
Copy link
Contributor

liuche commented May 14, 2020

Question for UX, while we're waiting for search engines to load, we're planning on just showing an empty list.

@liuche liuche changed the title [Bug] Long pause after clicking url bar from homescreen on cold startup in fennecBeta/forPerformanceTest builds due to MLS [Bug] On slow network, main thread blocked after clicking url bar from homescreen on cold startup due to MLS May 14, 2020
@liuche liuche added the size M label May 21, 2020
@liuche liuche unassigned boek May 22, 2020
@liuche liuche added the E3 Estimation Point: average, 2 - 3 days label May 22, 2020
@vesta0 vesta0 moved this from Prioritized Backlog to In Design in Fenix Sprint Kanban May 29, 2020
@vesta0 vesta0 moved this from In Design to Prioritized UX Backlog in Fenix Sprint Kanban May 29, 2020
@mcomella
Copy link
Contributor Author

@sblatz @boek I checked out the latest master 673507d and I still see the issue in profiles: https://share.firefox.dev/2ZjPRDd

It looks like getDefaultEngine calls installedSearchEngines and we actually call getDefaultEngine twice.

@mcomella
Copy link
Contributor Author

I investigated the issue I discovered. It appears the following call stack happens:

installedSearchEngines
- installedSearchEngineIdentifiers
 - localeAwareInstalledEnginesKey
  - localizationProvider.determineRegion
   - RegionSearchLocalizationProvider.determineRegion
    - MozillaLocationService.fetchRegion

When the region has not been successfully fetched, we try to fetch it from the network. The connection timeout is 10s and the read timeout is 10s so, in the worst case, we'd block for 20s in a single call. If the region has been successfully fetched, we fetch it from SharedPrefs. This method gets called multiple times on SearchFragment startup so it's possible to block for longer than 20s if we're unable to connect to the server. However, if we ever connect to the server successfully once, we cache the value and this should never happen again because we can always read from SharedPrefs.

I think we need still need to address this issue: even if it's unlikely to happen and only then likely to happen only once, the potential blocking for multiple seconds for the first user action on first run leaves a poor first impression and I'd prefer to avoid that. Furthermore, in the event that the MLS server is down, every new user would experience this.

Notes on reproducing:

  • I experienced this in my local builds because I don't have an MLS key so presumably I contact the server, receive an error response, and close the connection each time this method is called on SearchFragment startup.
  • I didn't experience this in production builds on my Pixel 2 because I have a fast connection, a fast device, and was able to contact the server successfully so I repeatedly used the cached value
    • I also tried with airplane mode but there was no delay: I assume the HTTP lib knows I'm not connected to the internet and returns an error immediately
  • I experienced this in production builds on an Android emulator with simulated network latency. As expected by theory, this delay (maybe 10s anecdotally? it's based on network connection) occurred once and never occurred again, even after killing the app

@liuche
Copy link
Contributor

liuche commented Jun 25, 2020

Could this be causing this crash if we're loading without search engines? #11906

@mcomella
Copy link
Contributor Author

I believe this will be resolved with the PR #11974 (review)

boek added a commit to boek/fenix that referenced this issue Jun 26, 2020
boek added a commit to boek/fenix that referenced this issue Jun 26, 2020
boek added a commit to boek/fenix that referenced this issue Jun 27, 2020
boek added a commit to boek/fenix that referenced this issue Jun 27, 2020
boek added a commit to boek/fenix that referenced this issue Jun 27, 2020
boek added a commit to boek/fenix that referenced this issue Jun 27, 2020
boek added a commit to boek/fenix that referenced this issue Jun 27, 2020
@liuche liuche removed the needs:UX-feedback Needs UX Feedback label Jun 27, 2020
boek added a commit that referenced this issue Jun 27, 2020
@liuche liuche mentioned this issue Jun 27, 2020
12 tasks
@sv-ohorvath
Copy link
Contributor

From #11974 (review) this was verified from the perf perspective. @boek Does this need manual QA as well?

@sv-ohorvath sv-ohorvath added eng:qa:not-needed Added by QA to issues that cannot be tested and removed eng:qa:needed QA Needed labels Jun 29, 2020
@sblatz sblatz closed this as completed Jul 13, 2020
Fenix Sprint Kanban automation moved this from Ready for QA to Sprint 20.11 Done Jul 13, 2020
Performance, front-end roadmap automation moved this from Top 10 Inter-Team bugs to Done Jul 13, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
🐞 bug Crashes, Something isn't working, .. E5 Estimation Point: about 5 days eng:qa:not-needed Added by QA to issues that cannot be tested performance Possible performance wins
Projects
Fenix Sprint Kanban
  
Sprint 20.11 Done
Development

No branches or pull requests

10 participants