[Bug] On slow network, main thread blocked after clicking url bar from homescreen on cold startup due to MLS #9935

mcomella · 2020-04-14T22:09:14Z

Steps to reproduce

Build geckoBetaFennecBeta or geckoBetaForPerformanceTest
Start app (cold start)
Click url bar

Expected behavior

Quick to open

Actual behavior

Long pause. In forPerformanceTest with the profiler going (so with much overhead), it takes > 10 seconds.

Device information

Android device: P2
Fenix version: master 152642d

I took a profile and it looks like FenixSearchEngineProvider.installedSearchEngines is blocking the main thread for a long time:

I looked at blame and nothing changed in Fenix so this is likely an issue in a-c.

Comment from @MarcLeclair: This issue occurs because Fenix blocks the UI thread when calling the location service in FenixSearchEngineProvider.kt, in the method installedSearchEngines(). The easiest way to see this is to put LocationService.dummy() in the else branch. Since a locally built app doesn't have a token, it makes an empty HTTP request that just hangs the app.

FYI, it is most obvious on the first click, but it also happens on any subsequent call; it just seems faster (no idea why, I didn't look much further into it).



mcomella · 2020-04-14T22:11:44Z

@csadilek Do you know what might have changed in a-c to have caused this?

csadilek · 2020-04-15T00:13:00Z

@mcomella looks like the cause is this: d89fbd7

Basically, the lack of an API key for MLS. I've verified that switching to the dummy location provider for those builds fixes it.

mcomella · 2020-04-15T00:35:05Z

I think the possible solutions are:

1. Use dummy MLS if API key is not available (i.e. based not on build variant but build inputs)
2. Use a production version of MLS without an API key

I'd rather we do 2. so we can mimic the performance characteristics of production builds (and developers are closer to running what production builds do in their debug builds) but I'm not sure how feasible that is.
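Option 1 above could look roughly like the following. This is a minimal sketch, not the a-c or Fenix code: `RegionProvider`, `DummyRegionProvider`, `NetworkRegionProvider`, and `regionProviderFor` are hypothetical names introduced only for illustration. The point is that the choice depends on a build input (whether a key was supplied) rather than on the build variant.

```kotlin
// Hypothetical stand-ins for illustration; these are not the a-c types.
interface RegionProvider {
    fun fetchRegion(): String?
}

// Never touches the network; always reports "no region known".
object DummyRegionProvider : RegionProvider {
    override fun fetchRegion(): String? = null
}

// Placeholder for a provider that would query MLS with the given key.
class NetworkRegionProvider(val apiKey: String) : RegionProvider {
    override fun fetchRegion(): String? {
        // A real implementation would issue the HTTP request here.
        throw UnsupportedOperationException("network access omitted in this sketch")
    }
}

// Decide on the build input (is a key present?) rather than the build variant.
fun regionProviderFor(apiKey: String?): RegionProvider =
    if (apiKey == null || apiKey.isBlank()) DummyRegionProvider
    else NetworkRegionProvider(apiKey)
```

With this shape, a local build without a key never pings the server, while any build that is given a key behaves like production.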

mcomella · 2020-04-16T00:32:46Z

For context, MLS is the Mozilla Location Service which is used to determine roughly where the user is so we can provide locale-specific search engines for them.

During triage we realized this could be indicative of an underlying performance problem: if a production version of MLS without an API key pauses the app for 10+ seconds, can the production version with the API key also create a critical performance issue?

@pocmo Can you explain to me why this timeout occurs if we don't have an API key? Do you think it's possible that this is indicative of an underlying performance problem we need to address? fwiw, csadilek thought it could be a perf problem but not a P1 perf problem.

Assigning self to remember to identify the priority.

pocmo · 2020-04-22T08:30:47Z

> @mcomella looks like the cause is this: d89fbd7

This is not the cause. It is trying to put even more bandaid on this thing. :)

> @pocmo Can you explain to me why this timeout occurs if we don't have an API key? Do you think it's possible that this is indicative of an underlying performance problem we need to address? fwiw, csadilek thought it could be a perf problem but not a P1 perf problem.

There are definitely multiple weird things going on in Fenix and I didn't have the time to look into them. Fenix has some wrappers around the AC search code and those seem to change the behavior in a way that I do not understand yet.

AC's SearchEngineManager only tries once per process lifetime to determine the location, and only if it was never done before and is not cached. In Fenix those lookups seemed to happen more often, sidestepping our implementation.
Everything accessing SearchEngineManager needs to use the suspending methods asynchronously since we may need to do at least one disk read per process (or network request per install). If the app freezes then this does not seem to be the case (or the wrapper does something else synchronously).
Our MLS implementation in AC currently uses a 10 second connect and read timeout. That is probably a bit long as well and we could consider making that shorter - although that is not the root problem here.

So yeah, it does make sense to use the dummy implementation in builds that don't have an API key. No need to ping the server for nothing. But at the same time the app should not freeze if this request takes long and as of now it can take at least 10+ seconds with those timeouts.
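The "once per process, and only if not cached" behaviour described above can be sketched as follows. This is an illustration under assumptions, not the actual SearchEngineManager: the class name `OncePerProcessRegion` and its three function parameters are hypothetical, and the caller is still expected to invoke `region()` off the main thread because both the cache read and the fetch may be slow.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical helper, not the a-c SearchEngineManager.
class OncePerProcessRegion(
    private val readCache: () -> String?,        // e.g. a disk read
    private val writeCache: (String) -> Unit,    // persist a successful result
    private val fetchFromNetwork: () -> String?  // may be slow; returns null on failure
) {
    // Flips to true on the first network attempt in this process.
    private val attempted = AtomicBoolean(false)

    fun region(): String? {
        // A cached value always wins and never triggers a request.
        readCache()?.let { return it }
        // Only the first uncached caller in this process attempts the network.
        if (!attempted.compareAndSet(false, true)) return null
        return fetchFromNetwork()?.also(writeCache)
    }
}
```

A wrapper that bypasses a guard like this and calls the fetch directly would repeat the slow request on every lookup, which matches the symptom described in this thread.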

pocmo · 2020-04-22T13:00:47Z

> Our MLS implementation in AC currently uses a 10 second connect and read timeout. That is probably a bit long as well and we could consider making that shorter - although that is not the root problem here.

FWIW, those are exactly the same timeouts as Fennec uses:
https://searchfox.org/mozilla-esr68/source/mobile/android/base/java/org/mozilla/gecko/search/SearchEngineManager.java#399
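For reference, connect and read timeouts on a plain `HttpURLConnection` are set as shown below. This is a generic sketch, not the a-c MLS client: the 3 second values and the `openWithTimeouts` helper are assumptions chosen only to illustrate "shorter than 10 seconds"; the thread does not propose specific numbers.

```kotlin
import java.net.HttpURLConnection
import java.net.URI

// Hypothetical values; the thread only says 10 s is "probably a bit long".
const val CONNECT_TIMEOUT_MS = 3_000
const val READ_TIMEOUT_MS = 3_000

fun openWithTimeouts(url: String): HttpURLConnection {
    // openConnection() does not touch the network; that happens on connect()
    // or when the response is first read.
    val connection = URI.create(url).toURL().openConnection() as HttpURLConnection
    connection.connectTimeout = CONNECT_TIMEOUT_MS
    connection.readTimeout = READ_TIMEOUT_MS
    return connection
}
```

Shorter timeouts only bound how long a blocking call can stall; they do not remove the need to keep the request off the main thread.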

liuche · 2020-04-23T23:51:29Z

We're blocking the main thread when fetching geolocation for search engines. This is even worse when there is no MLS key (like with the two build flavors that mcomella has mentioned). Sebastian and mcomella have good suggestions in their comments above, and these would be good perf bugs to try profiling (for before/after).

Might be useful to have some SearchFragment knowledge

don't block if no MLS key, fetch the key async
Show empty list of engines until they load
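The "show an empty list until the engines load" idea can be sketched with plain JVM concurrency primitives. All names here (`EngineListState`, `loadEngines`, `onLoaded`) are hypothetical and not taken from SearchFragment; on Android the `onLoaded` callback would additionally need to be dispatched to the main thread before touching views.

```kotlin
import java.util.concurrent.CompletableFuture
import java.util.concurrent.Executor

// Hypothetical state holder; names are illustrative only.
class EngineListState(
    loadEngines: () -> List<String>,      // potentially slow (disk or network)
    background: Executor,                 // where the slow work runs
    onLoaded: (List<String>) -> Unit      // invoked on the background thread here
) {
    // Visible immediately, before loading finishes.
    @Volatile
    var engines: List<String> = emptyList()
        private set

    // Completes once the loaded list has been published.
    val pending: CompletableFuture<Void> =
        CompletableFuture.supplyAsync({ loadEngines() }, background)
            .thenAccept { loaded ->
                engines = loaded
                onLoaded(loaded)
            }
}
```

The constructor returns right away, so the caller can render the empty list and refresh when `onLoaded` fires.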

liuche · 2020-05-14T18:23:17Z

Question for UX: while we're waiting for search engines to load, we're planning on just showing an empty list.

mcomella · 2020-06-24T16:53:40Z

@sblatz @boek I checked out the latest master 673507d and I still see the issue in profiles: https://share.firefox.dev/2ZjPRDd

It looks like getDefaultEngine calls installedSearchEngines and we actually call getDefaultEngine twice.
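One way to avoid repeating the expensive lookup when getDefaultEngine is called twice is to compute the installed list once and reuse it. The sketch below is an assumption-laden illustration, not the real FenixSearchEngineProvider: `CachingEngineProvider` and `loadInstalled` are hypothetical, and it assumes the list does not change during the lifetime of the object.

```kotlin
// Hypothetical provider; the real FenixSearchEngineProvider is more involved.
class CachingEngineProvider(private val loadInstalled: () -> List<String>) {
    // Computed on first access, then reused by every later caller.
    private val installed: List<String> by lazy(loadInstalled)

    fun installedSearchEngines(): List<String> = installed

    // Picks the first installed engine as a stand-in for "the default".
    fun getDefaultEngine(): String? = installedSearchEngines().firstOrNull()
}
```

Note that `lazy` still blocks whichever caller triggers the first load; it only removes the repeated work, so the first access should still happen off the main thread.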

mcomella · 2020-06-25T18:40:52Z

I investigated the issue I discovered. It appears the following call stack happens:

installedSearchEngines
- installedSearchEngineIdentifiers
 - localeAwareInstalledEnginesKey
  - localizationProvider.determineRegion
   - RegionSearchLocalizationProvider.determineRegion
    - MozillaLocationService.fetchRegion

When the region has not been successfully fetched, we try to fetch it from the network. The connection timeout is 10s and the read timeout is 10s so, in the worst case, we'd block for 20s in a single call. If the region has been successfully fetched, we fetch it from SharedPrefs. This method gets called multiple times on SearchFragment startup so it's possible to block for longer than 20s if we're unable to connect to the server. However, if we ever connect to the server successfully once, we cache the value and this should never happen again because we can always read from SharedPrefs.
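The worst case described above is simple arithmetic, sketched here for concreteness. The helper `worstCaseBlockMs` is hypothetical, and it assumes each uncached call can spend the full connect timeout followed by the full read timeout before failing.

```kotlin
// Upper bound on main-thread blocking, assuming every uncached call
// exhausts both the connect timeout and the read timeout.
fun worstCaseBlockMs(connectTimeoutMs: Long, readTimeoutMs: Long, uncachedCalls: Int): Long =
    (connectTimeoutMs + readTimeoutMs) * uncachedCalls
```

With the 10 s connect and 10 s read timeouts mentioned in this thread, one call gives `worstCaseBlockMs(10_000, 10_000, 1)`, which is 20 000 ms. If the method were called, say, three times during SearchFragment startup without a cached value (a made-up count for illustration), the bound would be 60 000 ms.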

I think we still need to address this issue: even if it's unlikely to happen, and even then likely to happen only once, potentially blocking for multiple seconds on the first user action on first run leaves a poor first impression, and I'd prefer to avoid that. Furthermore, in the event that the MLS server is down, every new user would experience this.

Notes on reproducing:

I experienced this in my local builds because I don't have an MLS key so presumably I contact the server, receive an error response, and close the connection each time this method is called on SearchFragment startup.
I didn't experience this in production builds on my Pixel 2 because I have a fast connection, a fast device, and was able to contact the server successfully so I repeatedly used the cached value
- I also tried with airplane mode but there was no delay: I assume the HTTP lib knows I'm not connected to the internet and returns an error immediately
I experienced this in production builds on an Android emulator with simulated network latency. As expected by theory, this delay (maybe 10s anecdotally? it's based on network connection) occurred once and never occurred again, even after killing the app

liuche · 2020-06-25T21:10:26Z

Could this be causing this crash if we're loading without search engines? #11906

mcomella · 2020-06-26T16:56:02Z

I believe this will be resolved with the PR #11974 (review)

sv-ohorvath · 2020-06-29T12:05:41Z

From #11974 (review) this was verified from the perf perspective. @boek Does this need manual QA as well?

mcomella added 🐞 bug Crashes, Something isn't working, .. performance Possible performance wins labels Apr 14, 2020

mcomella added this to Needs prioritization in Performance, front-end roadmap via automation Apr 14, 2020

github-actions bot added the needs:triage Issue needs triage label Apr 14, 2020

mcomella self-assigned this Apr 15, 2020

mcomella moved this from Needs prioritization to Backlog (prioritized) in Performance, front-end roadmap Apr 22, 2020

mcomella removed their assignment Apr 22, 2020

mcomella added the needs:group-triage label Apr 22, 2020

eliserichards assigned boek Apr 29, 2020

eliserichards removed needs:group-triage needs:triage Issue needs triage labels Apr 29, 2020

mcomella changed the title ~~[Bug] Long pause after clicking url bar from homescreen on cold startup in fennecBeta/forPerformanceTest builds~~ [Bug] Long pause after clicking url bar from homescreen on cold startup in fennecBeta/forPerformanceTest builds due to MLS May 6, 2020

liuche added this to Prioritized Backlog in Fenix Sprint Kanban May 7, 2020

mcomella moved this from Backlog (prioritized) to Top 10 Inter-Team bugs in Performance, front-end roadmap May 13, 2020

liuche added the needs:UX-feedback Needs UX Feedback label May 14, 2020

liuche changed the title ~~[Bug] Long pause after clicking url bar from homescreen on cold startup in fennecBeta/forPerformanceTest builds due to MLS~~ [Bug] On slow network, main thread blocked after clicking url bar from homescreen on cold startup due to MLS May 14, 2020

liuche added the size M label May 21, 2020

liuche unassigned boek May 22, 2020

liuche added the E3 Estimation Point: average, 2 - 3 days label May 22, 2020

vesta0 moved this from Prioritized Backlog to In Design in Fenix Sprint Kanban May 29, 2020

vesta0 moved this from In Design to Prioritized UX Backlog in Fenix Sprint Kanban May 29, 2020

ekager mentioned this issue Jun 2, 2020

[Bug] App Freezes with Could not fetch region from location service in MozillaLocationService #10793

Closed

mcomella mentioned this issue Jun 24, 2020

[Bug] Search engine logo in homescreen toolbar takes a long time to show up on start #11911

Closed

sblatz moved this from In Progress to Ready for QA in Fenix Sprint Kanban Jun 25, 2020

boek added a commit to boek/fenix that referenced this issue Jun 25, 2020

For mozilla-mobile#9935 - Fallback region selection on first load

6ad2676

boek added a commit to boek/fenix that referenced this issue Jun 25, 2020

For mozilla-mobile#9935 - Fallback region selection on first load

ef05df3

liuche mentioned this issue Jun 25, 2020

[Bug] java.lang.IndexOutOfBoundsException: at java.util.ArrayList.get(ArrayList.java) #11906

Closed

boek added a commit to boek/fenix that referenced this issue Jun 25, 2020

For mozilla-mobile#9935 - Use the searchengine deferred

12d2c7f

liuche mentioned this issue Jun 26, 2020

NoSuchElementExceptionkotlin.collections.ArraysKt in first fatal List is empty #11985

Closed

boek added a commit to boek/fenix that referenced this issue Jun 26, 2020

For mozilla-mobile#9935 - Fallback region selection on first load

478d875

boek added a commit to boek/fenix that referenced this issue Jun 26, 2020

For mozilla-mobile#9935 - Use the searchengine deferred

8c97c84

boek added a commit to boek/fenix that referenced this issue Jun 27, 2020

For mozilla-mobile#9935 - Use the searchengine deferred

71df92f

boek added a commit to boek/fenix that referenced this issue Jun 27, 2020

For mozilla-mobile#9935 - Fallback region selection on first load

65c8e98

boek added a commit to boek/fenix that referenced this issue Jun 27, 2020

For mozilla-mobile#9935 - Use the searchengine deferred

d9c9d2c

boek added a commit to boek/fenix that referenced this issue Jun 27, 2020

For mozilla-mobile#9935 - Use the searchengine deferred

8e6355d

boek added a commit to boek/fenix that referenced this issue Jun 27, 2020

For mozilla-mobile#9935 - Use the searchengine deferred

f993937

liuche removed the needs:UX-feedback Needs UX Feedback label Jun 27, 2020

boek added a commit that referenced this issue Jun 27, 2020

For #9935 - Fallback region selection on first load

40977a9

boek added a commit that referenced this issue Jun 27, 2020

For #9935 - Use the searchengine deferred

b1a8c0f

liuche mentioned this issue Jun 27, 2020

Releng 78.0.1/79.0.0 #12048

Closed

12 tasks

sv-ohorvath added eng:qa:not-needed Added by QA to issues that cannot be tested and removed eng:qa:needed QA Needed labels Jun 29, 2020

Mugurell mentioned this issue Jul 3, 2020

[Bug]Wrong default search engine displayed on startup #11875

Closed

sblatz closed this as completed Jul 13, 2020

Fenix Sprint Kanban automation moved this from Ready for QA to Sprint 20.11 Done Jul 13, 2020

Performance, front-end roadmap automation moved this from Top 10 Inter-Team bugs to Done Jul 13, 2020

liuche mentioned this issue Aug 30, 2020

[Bug] First search from urlbar very slow, 2 minutes #14500

Closed
