Skip to content
This repository has been archived by the owner on Feb 20, 2023. It is now read-only.

[Bug][Experiment]: The TCP mobile experiment doesn't enable the feature #26041

Closed
SoftVision-CosminMuntean opened this issue Jul 15, 2022 · 17 comments · Fixed by #26228, #26427, fork-house/fenix#12 or nathanmkaya/fenix#108
Labels
🐞 bug Crashes, Something isn't working, .. eng:qa:verified QA Verified Nimbus qa-triaged Issues triaged by qa S1 Blocks development/testing, may impact more than 25% of users, causes data loss, potential chemspill
Milestone

Comments

@SoftVision-CosminMuntean
Copy link

SoftVision-CosminMuntean commented Jul 15, 2022

Steps to reproduce

  1. Open the latest Nightly build downloaded from Google Play Store.
  2. Tap the "..." menu button and tap the "Settings" option.
  3. Tap the "About Firefox Nightly" option and tap 5 times on the Nightly logo to enter in debug mode.
  4. Return to the "Settings" menu and tab the "Secret Settings" option.
  5. Enable the "Use Nimbus Preview Collection" option.
  6. Restart the application (at least twice) in order to have the Preview collection.
  7. Enter again in debug mode (see step 3).
  8. Return to the "Settings" menu and tab the "Nimbus Experiments" option.
  9. Tap the "Total Cookie Protection Experiment v103" experiment.
  10. Tap the "tcp-on" branch and restart the browser.
  11. Navigate to the TCP test page and observe column 3 and 4.

Expected behaviour

  • Cookies and localStorage are "PARTITIONED" on both 3 and 4 columns.

Actual behaviour

  • Cookies and localStorage are "UNRESTRICTED" on both 3 and 4 columns.

Device name

Google Pixel 2XL, Samsung s9

Android version

Android 11, Android 10

Firefox release type

Firefox Nightly

Firefox version

103.0a1 and 104.0a4

Device logs

No response

Additional information

┆Issue is synchronized with this Jira Task

@SoftVision-CosminMuntean SoftVision-CosminMuntean added needs:triage Issue needs triage 🐞 bug Crashes, Something isn't working, .. labels Jul 15, 2022
@Mugurell
Copy link
Contributor

I am not sure but is there a chance for this behavior to be related to the one described in #26040?

They are related but while the behavior from #26040 is expected the Nimbus experiment should override the local settings whatever they may be.

@SoftVision-LorandJanos
Copy link

I was able to reproduce this issue on the latest Nightly 104.0a1 (2022-07-18) build.
Device used: Google Pixel 4 (Android 12).

@SoftVision-LorandJanos SoftVision-LorandJanos added S3 Blocks non-critical functionality and a work around exists qa-triaged Issues triaged by qa Nimbus labels Jul 18, 2022
@Amejia481
Copy link
Contributor

@SoftVision-CosminMuntean thanks for testing,
I noticed that on the STR we mentioned we are testing on nightly, but on the Recipe's JSON, we are targeting the release channel, it could the problem is that while you were testing you were not eligible to be enrolled on the experiment.

image

@Amejia481
Copy link
Contributor

Could be try to test on the same channel where we are launching the experiment?

An additional targeting version from the Recipe experiment page:

Full targeting expression (app_version|versionCompare('103.*') <= 0) && (app_version|versionCompare('103.!') >= 0)

@SoftVision-CosminMuntean
Copy link
Author

@Amejia481 thanks for looking over this issue.

I have also verified with Fenix 103.1.0 and the feature is still not enabled by the experiment. I can enroll in the experiment in the treatment branch, but the feature is not enabled.
We have also created a recipe using the Stage environment and naturally enrolled in the experiment, but the feature is still not enabled.
So it seems that the problem is not that we cannot enroll in the experiment, the problem is that after we enroll in the experiment, in the treatment branch, the TCP feature is not enabled.

However, just to be sure, the test page mentioned in Step 11 is a valid way to verify that we have TCP enabled?

@Amejia481
Copy link
Contributor

Amejia481 commented Jul 19, 2022

Thanks @SoftVision-CosminMuntean, for double checking, I'll continue investigating, I'll asking the nimbus team for help as I'm no sure why every time I query the nimbus API, about the TCP experiment I get the same value independently of the branch that I'm enrolled into.

However, just to be sure, the test page mentioned in Step 11 is a valid way to verify that we have TCP enabled?

Yes, it's good way. Additionally, you could verify by going to about:config and searching for network.cookie.cookieBehavior and it should be set to 5.

@SoftVision-CiprianMuresan SoftVision-CiprianMuresan added S1 Blocks development/testing, may impact more than 25% of users, causes data loss, potential chemspill and removed S3 Blocks non-critical functionality and a work around exists labels Jul 19, 2022
@SoftVision-CiprianMuresan

Marking this as S1 as it blocks further experiment testing.

@SoftVision-LorandJanos SoftVision-LorandJanos added S1 Blocks development/testing, may impact more than 25% of users, causes data loss, potential chemspill and removed S1 Blocks development/testing, may impact more than 25% of users, causes data loss, potential chemspill labels Jul 19, 2022
@Amejia481
Copy link
Contributor

Thanks @SoftVision-CiprianMuresan, @jhugman and I we are investigating!

@jonalmeida
Copy link
Contributor

jonalmeida commented Jul 28, 2022

Copying over James' JIRA comment here so I can reference it (Fenix JIRA is only a one-way sync):

Progress/investigation:

  • the lazy caching of the feature value() which can be removed from settings.totalCookieProtection
  • the value coming out of the value() is always the default value for engine-settings (no other feature). The Variables implementation that the generated code (FML) gets is always NullVariables, which appeared to be null coming out of the SDK.
  • Closer inspection reveals that the SDK available to the feature is null!
  • This indicates that the call to get the engine-settings configuration object from FxNimbus i.e. FxNimbus.features.engineSettings.value() is likely happening before FxNimbus.initialize { components.analytics.experiments }
  • ConceptFetchHttpUploader requires the engine, which is needed by Glean which is initialized before Nimbus; i.e. we can’t experiment with concept-fetch
  • Side note: concept-fetch is also needed for Nimbus initialization, so when we disable Glean, we get a deadlock. I don’t know why we get a deadlock not a stack overflow.
  • The FxNimbus.initialize() method takes a closure to allow features to lazily initialize the Nimbus SDK on demand when the feature first needs it. This getSdk property is passed to FeatureHolders on first use.
  • For features that are used before the Nimbus SDK is constructed, (i.e. engine-settings feature only) the feature holder never has access to the SDK.

Short term unblocking fix:

  • use a deprecated way of attaching the SDK to the generated FxNimbus code.
    • FxNimbus.api = nimbus

Longer term:

  • consider ways of updating the SDK or the getSdk closure after features have been first used, i.e. make the initialize method proactively go update all FeatureHolders with new SDK objects or getSdk closures.

Plan:

  1. Jonathan Almeida will land the short term fix,
  2. Jonathan Almeida and James Hugman to discuss best path/best practice for updating the early-use features with the new value of getSdk
  3. Nimbus team to implement longer term fix recommended in step 2.

To clarify this part:

  • ConceptFetchHttpUploader requires the engine, which is needed by Glean which is initialized before Nimbus; i.e. we can’t experiment with concept-fetch
  • Side note: concept-fetch is also needed for Nimbus initialization, so when we disable Glean, we get a deadlock. I don’t know why we get a deadlock not a stack overflow.

There are many parts of the app that require the engine to be initialized ASAP. Even if we are able to disable Glean, the engine will be initialized by the next system that needs (e.g. which is probably the profiler). So concept-fetch isn't the cause of the dependency issue, but the first part of the engine that is immediately needed.

Short term unblocking fix:

  • use a deprecated way of attaching the SDK to the generated FxNimbus code.
    • FxNimbus.api = nimbus

With the workaround that is mentioned above, we are then required to the tell the engine to re-evaluate it's settings for any feature we want to experiment with (in this case, we re-evaluate the tracking protection settings).

jonalmeida added a commit to jonalmeida/fenix that referenced this issue Jul 28, 2022
…s ready.

There are three issues here that we have uncovered while investigating
this bug:
 1. Settings.kt has a lazy block around `enabledTotalCookieProtection`
    which ends up caching the first result it evaluates.
 2. The `FeatureHolder` within FxNimbus has a lazy block around it
    which also ends up caching the first result it evaluates.
 3. Nimbus is not ready to return a result for an engine experiment
    when we need it early on in the dependency tree initialization.

There are multiple systems that require engine to be initialized for
them to work (e.g. Glean, Profiler, concept-fetch). In our TCP,
experiment, we need to apply these engine settings during the engine
initialization. So when we try and evaluate Nimbus that early on, it
has not had time to initialize itself correctly or even use the
engine's concept-fetch client to return the correct experiment result.
This bug is made worse because of the first two caching bugs where we
are always holding onto a cached value of the wrong result.

Our temporary solution is to:
 1. Remove the `lazy` around `Settings.enabledTotalCookieProtection`.
 2. Set the `FxNimbus.api` value right after we are done initializing
    `FxNimbus` and `NimbusApi` so that all future queries to FxNimbus
    will be made against a real instance of `NimbusApi`. This is a
    short-term fix for the `FeatureHolder` caching bug.
 3. Set a new TrackingProtectionPolicy that will evaluate Nimbus now
    that it is in the correct state when receive the
    `NimbusInterface.Observer.onUpdatesApplied`.

Co-authored-by: jhugman
@github-actions github-actions bot added the eng:reopen-for-qa Reopens and tags the issue for QA needed when the issue is merged label Jul 28, 2022
jonalmeida added a commit to jonalmeida/fenix that referenced this issue Aug 2, 2022
…s ready.

There are three issues here that we have uncovered while investigating
this bug:
 1. Settings.kt has a lazy block around `enabledTotalCookieProtection`
    which ends up caching the first result it evaluates.
 2. The `FeatureHolder` within FxNimbus caches the incorrectly
    evaluated value and returns this value hence forth.
 3. Nimbus is not ready to return a result for an engine experiment
    when we need it early on in the dependency tree initialization.

There are multiple systems that require engine to be initialized for
them to work (e.g. Glean, Profiler, concept-fetch). In our TCP,
experiment, we need to apply these engine settings during the engine
initialization. So when we try and evaluate Nimbus that early on, it
has not had time to initialize itself correctly or even use the
engine's concept-fetch client to return the correct experiment result.
This bug is made worse because of the first two caching bugs where we
are always holding onto a cached value of the wrong result.

Our temporary solution is to:
 1. Remove the `lazy` around `Settings.enabledTotalCookieProtection`.
 2. Set the `FxNimbus.api` value right after we are done initializing
    `FxNimbus` and `NimbusApi` so that all future queries to FxNimbus
    will be made against a real instance of `NimbusApi`. This is a
    short-term fix for the `FeatureHolder` caching bug.
 3. Set a new TrackingProtectionPolicy that will evaluate Nimbus now
    that it is in the correct state when receive the
    `NimbusInterface.Observer.onUpdatesApplied`.

Co-authored-by: jhugman <jhugman@users.noreply.github.com>
jonalmeida added a commit to jonalmeida/fenix that referenced this issue Aug 3, 2022
…he engine

There are three issues here that we have uncovered while investigating
this bug:
 1. Settings.kt has a lazy block around `enabledTotalCookieProtection`
    which ends up caching the first result it evaluates.
 2. Nimbus is not ready to return a result for an engine experiment
    when we need it early on in the dependency tree initialization.
 3. The `FeatureHolder` within FxNimbus caches the incorrectly
    evaluated value and returns this value hence forth.

There are multiple systems that require the engine to be initialized
for them to work (e.g. Glean, Profiler, concept-fetch). In our TCP,
experiment, we need to apply these engine settings during the engine
initialization. So when we try and evaluate Nimbus that early on, it
has not had time to initialize itself correctly or even use the
engine's concept-fetch client to return the correct experiment result.
This bug is made worse because of the two other caching bugs where we
are always holding onto a cached value of the wrong result.

During investigation, we have identified that the BrowserStore was
eagerly initialized (which initialized the engine as well via the
`EngineMiddleware`) by trying to sending startup metrics from values in
the store.

By moving Nimbus to be initialized as early on as possible, we're able
to guarantee now that our queries to get the feature branch for any
engine experiment will provide us with the correct values from Nimbus.
It's important to note that moving these calls to gather startup
metrics are still happening as soon as possible, and are now more
likely to be accurate in the future, if we collect metrics from the
engine settings that the user has.

In a debug build, we were also seeing a circular dependencies because of
the StrictModeManager initializing the engine when trying to access the
profiler - this call was also removed because it was originally added a
preventative measure that is no longer required.

In this change, we have also moved Glean initialization and wallpaper
fetching into the `setupInMainProcessOnly` method to keep the original
consistency of what is being initialized in the main process and when.
This change does not affect the order of initialization in any way,
however the inconsistency added a small amount of confusion during
investigation, which is worth cleaning up in this patch as well.

Co-authored-by: jhugman <jhugman@users.noreply.github.com>
Co-authored-by: Christian Sadilek <christian.sadilek@gmail.com>
jonalmeida added a commit to jonalmeida/fenix that referenced this issue Aug 3, 2022
…he engine

There are three issues here that we have uncovered while investigating
this bug:
 1. Settings.kt has a lazy block around `enabledTotalCookieProtection`
    which ends up caching the first result it evaluates.
 2. Nimbus is not ready to return a result for an engine experiment
    when we need it early on in the dependency tree initialization.
 3. The `FeatureHolder` within FxNimbus caches the incorrectly
    evaluated value and returns this value hence forth.

There are multiple systems that require the engine to be initialized
for them to work (e.g. Glean, Profiler, concept-fetch). In our TCP,
experiment, we need to apply these engine settings during the engine
initialization. So when we try and evaluate Nimbus that early on, it
has not had time to initialize itself correctly or even use the
engine's concept-fetch client to return the correct experiment result.
This bug is made worse because of the two other caching bugs where we
are always holding onto a cached value of the wrong result.

During investigation, we have identified that the BrowserStore was
eagerly initialized (which initialized the engine as well via the
`EngineMiddleware`) by trying to sending startup metrics from values in
the store.

By moving Nimbus to be initialized as early on as possible, we're able
to guarantee now that our queries to get the feature branch for any
engine experiment will provide us with the correct values from Nimbus.
It's important to note that moving these calls to gather startup
metrics are still happening as soon as possible, and are now more
likely to be accurate in the future, if we collect metrics from the
engine settings that the user has.

In a debug build, we were also seeing a circular dependencies because of
the StrictModeManager initializing the engine when trying to access the
profiler - this call was also removed because it was originally added a
preventative measure that is no longer required.

In this change, we have also moved Glean initialization and wallpaper
fetching into the `setupInMainProcessOnly` method to keep the original
consistency of what is being initialized in the main process and when.
This change does not affect the order of initialization in any way,
however the inconsistency added a small amount of confusion during
investigation, which is worth cleaning up in this patch as well.

Co-authored-by: jhugman <jhugman@users.noreply.github.com>
Co-authored-by: Christian Sadilek <christian.sadilek@gmail.com>
jonalmeida added a commit to jonalmeida/fenix that referenced this issue Aug 4, 2022
…bus SDK is initialized

There are three issues here that we have uncovered while investigating
this bug:

 1. Settings.kt has a lazy block around `enabledTotalCookieProtection`
    which ends up caching the first result it evaluates.
 3. The `FeatureHolder` within FxNimbus caches the incorrectly
    evaluated value and returns this value hence forth.
 4. Nimbus is not ready to return a result for an engine experiment
    when we need it early on in the dependency tree initialization.

There are multiple systems that require engine to be initialized for
 them to work (e.g. Glean, Profiler, concept-fetch). In our TCP,
 experiment, we need to apply these engine settings during the engine
 initialization. So when we try and evaluate Nimbus that early on, it
 has not had time to initialize itself correctly or even use the
 engine's concept-fetch client to return the correct experiment result.
 This bug is made worse because of the first two caching bugs where we
 are always holding onto a cached value of the wrong result.

Our temporary solution is to:

 1. Remove the `lazy` around `Settings.enabledTotalCookieProtection`.
 2. Set the `FxNimbus.api` value right after we are done initializing
    `FxNimbus` and `NimbusApi` so that all future queries to FxNimbus
    will be made against a real instance of `NimbusApi`. This is a
    short-term fix for the `FeatureHolder` caching bug.
 3. Set a new TrackingProtectionPolicy that will evaluate Nimbus now
    that it is in the correct state when receive the
    `NimbusInterface.Observer.onUpdatesApplied`.

Co-authored-by: jhugman <jhugman@users.noreply.github.com>
Co-authored-by: Christian Sadilek <christian.sadilek@gmail.com>
jonalmeida added a commit to jonalmeida/android-components that referenced this issue Aug 8, 2022
As part of mozilla-mobile/fenix#26041, we need to ensure we have at
least version 93.6.0 of the plugin downloaded. The reason we are not
using a newer version, even though they are available, is to reduce the
complexity of things that go into the next AC release in order to get
the Fenix experiment launched.
mergify bot pushed a commit to mozilla-mobile/android-components that referenced this issue Aug 8, 2022
As part of mozilla-mobile/fenix#26041, we need to ensure we have at
least version 93.6.0 of the plugin downloaded. The reason we are not
using a newer version, even though they are available, is to reduce the
complexity of things that go into the next AC release in order to get
the Fenix experiment launched.
@github-actions github-actions bot added this to the 105 milestone Aug 9, 2022
@github-actions github-actions bot reopened this Aug 9, 2022
@github-actions github-actions bot added eng:qa:needed QA Needed and removed eng:reopen-for-qa Reopens and tags the issue for QA needed when the issue is merged labels Aug 9, 2022
@jonalmeida
Copy link
Contributor

@SoftVision-CosminMuntean could you retest this experiment using the latest nightly 105 build (available in a few hours from when this comment was written)?

@SoftVision-CosminMuntean
Copy link
Author

@jonalmeida We have retested this issue on the latest Nightly 105.0a1 build and we are successfully enrolled in the experiment and also the TCP feature is enabled if enrolling in the "tcp-on" branch.

We have verified this on Google 2 XL with Android 11, Samsung S9 with Android 11, and Motorola Moto G(10) with Android 11.

jonalmeida added a commit to mozilla-mobile/android-components that referenced this issue Aug 10, 2022
As part of mozilla-mobile/fenix#26041, we need to ensure we have at
least version 93.6.0 of the plugin downloaded. The reason we are not
using a newer version, even though they are available, is to reduce the
complexity of things that go into the next AC release in order to get
the Fenix experiment launched.

(cherry picked from commit 95c0400)
@jonalmeida
Copy link
Contributor

Thank you!

mergify bot pushed a commit that referenced this issue Aug 11, 2022
…tialized

There are three issues here that we have uncovered while investigating
this bug:

 1. Settings.kt has a lazy block around `enabledTotalCookieProtection`
    which ends up caching the first result it evaluates.
 3. The `FeatureHolder` within FxNimbus caches the incorrectly
    evaluated value and returns this value hence forth.
 4. Nimbus is not ready to return a result for an engine experiment
    when we need it early on in the dependency tree initialization.

There are multiple systems that require engine to be initialized for
 them to work (e.g. Glean, Profiler, concept-fetch). In our TCP,
 experiment, we need to apply these engine settings during the engine
 initialization. So when we try and evaluate Nimbus that early on, it
 has not had time to initialize itself correctly or even use the
 engine's concept-fetch client to return the correct experiment result.
 This bug is made worse because of the first two caching bugs where we
 are always holding onto a cached value of the wrong result.

Our temporary solution is to:

 1. Remove the `lazy` around `Settings.enabledTotalCookieProtection`.
 2. Set the `FxNimbus.api` value right after we are done initializing
    `FxNimbus` and `NimbusApi` so that all future queries to FxNimbus
    will be made against a real instance of `NimbusApi`. This is a
    short-term fix for the `FeatureHolder` caching bug.
 3. Set a new TrackingProtectionPolicy that will evaluate Nimbus now
    that it is in the correct state when receive the
    `NimbusInterface.Observer.onUpdatesApplied`.

Co-authored-by: jhugman <jhugman@users.noreply.github.com>
Co-authored-by: Christian Sadilek <christian.sadilek@gmail.com>
(cherry picked from commit f03ee91)
jonalmeida added a commit to jonalmeida/fenix that referenced this issue Aug 11, 2022
This workaround was temporary and is not needed with the Nimbus groovy
plugin updates in Android Components.
@github-actions github-actions bot added the eng:reopen-for-qa Reopens and tags the issue for QA needed when the issue is merged label Aug 11, 2022
jonalmeida added a commit that referenced this issue Aug 11, 2022
This workaround was temporary and is not needed with the Nimbus groovy
plugin updates in Android Components.
@github-actions github-actions bot reopened this Aug 11, 2022
@github-actions github-actions bot removed the eng:reopen-for-qa Reopens and tags the issue for QA needed when the issue is merged label Aug 11, 2022
@jonalmeida
Copy link
Contributor

Apologiese @SoftVision-CosminMuntean, we had to land one more patch for this fix. Could you please verify that the experiment is still working in nightly?

jonalmeida added a commit that referenced this issue Aug 11, 2022
…tialized

There are three issues here that we have uncovered while investigating
this bug:

 1. Settings.kt has a lazy block around `enabledTotalCookieProtection`
    which ends up caching the first result it evaluates.
 3. The `FeatureHolder` within FxNimbus caches the incorrectly
    evaluated value and returns this value hence forth.
 4. Nimbus is not ready to return a result for an engine experiment
    when we need it early on in the dependency tree initialization.

There are multiple systems that require engine to be initialized for
 them to work (e.g. Glean, Profiler, concept-fetch). In our TCP,
 experiment, we need to apply these engine settings during the engine
 initialization. So when we try and evaluate Nimbus that early on, it
 has not had time to initialize itself correctly or even use the
 engine's concept-fetch client to return the correct experiment result.
 This bug is made worse because of the first two caching bugs where we
 are always holding onto a cached value of the wrong result.

Our temporary solution is to:

 1. Remove the `lazy` around `Settings.enabledTotalCookieProtection`.
 2. Set the `FxNimbus.api` value right after we are done initializing
    `FxNimbus` and `NimbusApi` so that all future queries to FxNimbus
    will be made against a real instance of `NimbusApi`. This is a
    short-term fix for the `FeatureHolder` caching bug.
 3. Set a new TrackingProtectionPolicy that will evaluate Nimbus now
    that it is in the correct state when receive the
    `NimbusInterface.Observer.onUpdatesApplied`.

Co-authored-by: jhugman <jhugman@users.noreply.github.com>
Co-authored-by: Christian Sadilek <christian.sadilek@gmail.com>
(cherry picked from commit f03ee91)
jonalmeida added a commit that referenced this issue Aug 11, 2022
This workaround was temporary and is not needed with the Nimbus groovy
plugin updates in Android Components.
@SoftVision-CosminMuntean
Copy link
Author

@jonalmeida no problem! We have retested the experiment on the latest Nightly 105.0a1 build and also on the new Beta 104.0b6.
Since the production recipe recipe for this experiment is moved to the "Draft" state we have cloned the recipe on the Stage environment in order to re-test the experiment.

  • Everything seems to work as expected: we are successfully enrolled in the experiment and also the TCP feature is enabled when enrolling in the "tcp-on" branch.

We have verified this on Google 2 XL with Android 11 and Samsung S9 with Android 11.

@jonalmeida
Copy link
Contributor

@cpeterso @athomasmoz FYI: we've landed the patches in Nightly and 104. After the next 104 release, we can start the experiment.

@cpeterso
Copy link

@cpeterso @athomasmoz FYI: we've landed the patches in Nightly and 104. After the next 104 release, we can start the experiment.

Awesome. 104 will be released next week on 2022-08-23. We can try to launch the TCP experiment soon after.

@gabrielluong gabrielluong removed the needs:triage Issue needs triage label Aug 19, 2022
JohanLorenzo pushed a commit to mozilla-mobile/firefox-android that referenced this issue Oct 31, 2022
As part of mozilla-mobile/fenix#26041, we need to ensure we have at
least version 93.6.0 of the plugin downloaded. The reason we are not
using a newer version, even though they are available, is to reduce the
complexity of things that go into the next AC release in order to get
the Fenix experiment launched.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
🐞 bug Crashes, Something isn't working, .. eng:qa:verified QA Verified Nimbus qa-triaged Issues triaged by qa S1 Blocks development/testing, may impact more than 25% of users, causes data loss, potential chemspill
Projects
None yet
9 participants