
Add explicit opt-in for logins component telemetry.#2325

Closed
rfk wants to merge 2 commits into logins-metrics-v1 from logins-metrics-v1-flag

Conversation

@rfk (Contributor) commented Dec 4, 2019

For our logins-component metrics gathering to be palatable to consumers, it needs to be off by default; otherwise we risk a "metrics surprise!" for consuming apps that upgrade without carefully reading the list of breaking changes.

This adds an explicit function that consumers can call to opt in to metrics gathering, in a way that means they're opting in only to the current set of metrics, not to all metrics that we might ever add in future. I'm not thrilled about the API, but it's simple and I think it has a low probability of being used incorrectly.

Ref mozilla-mobile/android-components#5223 for discussion about this for integration into android-components. Note that this is targeting the active PR #2225, since I don't want to merge that into master until it's ready to integrate upstream.

I hope that we can eventually replace this with some functionality/tooling provided by Glean rather than having to roll it ourselves, but that depends on a lot of figuring-out-the-right-thing-in-general that we don't want to block on here. Ref https://bugzilla.mozilla.org/show_bug.cgi?id=1598247 for some early discussion on that front.

Pull Request checklist

  • Quality: This PR builds and tests run cleanly
    • cargo test --all produces no test failures
    • cargo clippy --all --all-targets --all-features runs without emitting any warnings
    • cargo fmt does not produce any changes to the code
    • ./gradlew ktlint detekt runs without emitting any warnings
    • swiftformat --swiftversion 4 megazords components/*/ios && swiftlint runs without emitting any warnings or producing changes
    • Note: For changes that need extra cross-platform testing, consider adding [ci full] to the PR title.
  • Tests: This PR includes thorough tests or an explanation of why it does not
  • Changelog: This PR includes a changelog entry in CHANGES_UNRELEASED.md or an explanation of why it does not need one
    • Any breaking changes to Swift or Kotlin binding APIs are noted explicitly
  • Dependencies: This PR follows our [dependency management guidelines](https://github.com/mozilla/application-services/blob/master/docs/dependency-management.md)
    • Any new dependencies are accompanied by a summary of the due diligence applied in selecting them.

rfk added 2 commits December 3, 2019 16:19
We don't want to surprise consuming applications with extra metrics
that they didn't expect, so hide the logins component telemetry behind
an explicit enablement call.
@rfk rfk requested review from grigoryk and linabutler December 4, 2019 03:03
        throw RuntimeException("mozilla.appservices.logins is on metrics version $LATEST_METRICS_VERSION, not $version")
    }
    enabledMetricsVersion = version
}
rfk (Contributor, Author):

This may be extremely unidiomatic Kotlin, I'm happy to take suggestions for improvement.

@thomcc (Contributor) Dec 4, 2019:

I think something like this would be more idiomatic/less error prone:

object LoginsTelemetry {
    enum class Version(val number: Int) {
        // Possibly better to use `null` for disabled and have a
        // separate `disable()` function...
        Disabled(0),
        V1(1),
    }
    @Synchronized
    fun enable(version: Version) {
        // check `version.value` against the known max version (see below)
        // update a private var, ...
    }
}

It would have the calling code looking like LoginsTelemetry.enable(LoginsTelemetry.Version.V1) or something (prefixing this with Logins avoids imports being annoying for the user when we add PlacesTelemetry or whatever in the future). This would prevent the user from specifying a version higher than what we support.

We'd still probably want to check that the arg to enable is no higher than our known max version, to handle the case where the calling code was compiled against a newer version of the logins package than the one it gets at runtime (the compilation classpath differing from the runtime classpath, aka Gradle misconfiguration weirdness), but this would prevent accidents.

EDIT: actually maybe ignore much of this, I hadn't seen that glean team dislikes metrics versioning when I wrote it.
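Fleshed out, the suggested API might look like the following. This is a hypothetical completion of the sketch above, not code from the PR; the names (`LoginsTelemetry`, `Version`, `enable`, `enabledVersion`) follow the suggestion, and the body of `enable` just implements the comments in the sketch:

```kotlin
// Hypothetical completion of the enum-based API suggested above;
// not the actual PR code.
object LoginsTelemetry {
    enum class Version(val number: Int) {
        // As noted above, `null` plus a separate disable() would also work.
        Disabled(0),
        V1(1),
    }

    // Highest metrics version this build of the component knows about.
    private const val LATEST_METRICS_VERSION = 1

    @Volatile
    var enabledVersion: Version = Version.Disabled
        private set

    @Synchronized
    fun enable(version: Version) {
        // Still guard against the compile-time vs runtime classpath
        // mismatch described above.
        check(version.number <= LATEST_METRICS_VERSION) {
            "mozilla.appservices.logins supports metrics up to version $LATEST_METRICS_VERSION, not ${version.number}"
        }
        enabledVersion = version
    }
}
```

Calling code would then read LoginsTelemetry.enable(LoginsTelemetry.Version.V1); since the enum only contains known values, passing an unsupported version becomes much harder by construction.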


- Added invalid character checks from Desktop to `LoginsStorage.ensureValid` and introduced `INVALID_LOGIN_ILLEGAL_FIELD_VALUE` error. ([#2262](https://github.com/mozilla/application-services/pull/2262))
- The Android bindings can now collect some basic performance and quality metrics via Glean.
Collection is off by default, but consuming apps that submit telemetry via glean are encouraged
Contributor:

Is this required by some specific product you're currently targeting?

rfk (Contributor, Author):

It's driven by the push-back on integration into a-c as a breaking change here: mozilla-mobile/android-components#5223

/**
 * This component can emit metrics via Glean, but the consuming application
 * needs to explicitly opt in. It does so by calling this function to tell
 * us what version of our metrics it wants to opt in to.
Contributor:

I think this makes things much more complicated when analysing the data. See this Fenix issue. From that message:

For example, when correlating measurements that may or may not have been collected, will we have a way of telling whether something isn't there because it didn't happen or because the user opted out

Is there a specific, concrete use case you're trying to address with this that is currently open? Or is this attempting to resolve the problem in advance?

Metrics versioning adds another layer of confusion to the analysis, which I'd strongly discourage doing. The tables that will contain the ingested data will contain all of these fields, no matter the version.

cc @fbertsch @mdboom

Contributor:

For additional clarity: in its current form, the solution also breaks documentation, since the docs would need to mention which version each metric belongs to.

I would encourage all of us to properly discuss this and make it a Q1 objective.

rfk (Contributor, Author):

These are good points, thanks!

It's also possible that the way I've named and described this makes it sound more fragile than it is, so just to be clear, here's how I imagined this being used in practice:

  • Apps that are already using the logins component (Lockwise, soon to be Fenix depending on whether our metrics land before or after their initial logins integration) can upgrade to newer versions of the logins component without getting any surprise new metrics that may not have been data-reviewed etc.
  • Probably with some prodding from our team, those apps do a data-review for the new metrics and update their app to call logins.enableTelemetry(1) at startup. As of the next release of the app, they start emitting those metrics.
    • Importantly, they do not pref individual collection of these metrics on or off at runtime; every version of the app as of this release includes the metrics in their telemetry, just like if it had added some new metrics into its own metrics.yaml file.
  • If we change the way metrics work in the logins component in the future, we go through the above cycle again, but this time the new version of the app needs to change to call logins.enableTelemetry(2) at startup.

Using named constants instead of integers might make it a bit clearer, and maybe "version" is not the best name. But that's the idea, a kind of application-level once-and-done opt-in for "yes we're ready for these metrics to start flowing now" rather than a runtime user-level opt-in-or-out scenario.
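The once-and-done flow described above can be sketched as follows. This is a minimal self-contained sketch, not the actual PR code: `recordUnlockTime` and its `println` are made-up stand-ins for the component's real Glean recording calls, and the PR declares the constants as `private const val`:

```kotlin
// Sketch of the application-level opt-in described above.
val LATEST_METRICS_VERSION = 1

@Volatile
var enabledMetricsVersion = 0
    private set

// The once-and-done opt-in: the app calls this at startup after it has
// completed data review for the v1 metrics.
fun enableTelemetry(version: Int) {
    if (version > LATEST_METRICS_VERSION) {
        throw RuntimeException("mozilla.appservices.logins is on metrics version $LATEST_METRICS_VERSION, not $version")
    }
    enabledMetricsVersion = version
}

// Inside the component, every recording call is gated on the opt-in, so no
// metrics flow until the app has explicitly enabled them.
fun recordUnlockTime(ms: Long) {
    if (enabledMetricsVersion >= 1) {
        println("would accumulate unlock time: $ms ms")
    }
}
```

An app release that has been data-reviewed then makes the single enableTelemetry(1) call during startup; there is no runtime, per-user toggling of individual metrics.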

Metrics versioning adds another layer of confusion to the analysis, which I'd strongly
discourage doing. The tables that will contain the ingested data will contain all of
these fields, no matter the version.

Does the usage described above change the possibility for confusion here? In my head it doesn't seem any different to adding new metrics directly into a new version of an app, but I'm sure there's plenty of nuance involved that I don't really understand well.

I would encourage all of us to properly discuss this and make it a Q1 objective.

👍 certainly happy to meet and discuss. I don't have a good sense of your team's schedule, would you be happy to set something up with the right people and timezones @Dexterp37? Ideally we can involve @grigoryk but I also feel like I have a good understanding of his requirements and could represent them if necessary.

rfk (Contributor, Author):

every version of the app as of this release includes the metrics in their telemetry,
just like if it had added some new metrics into its own metrics.yaml file.

I guess one difference here is, you could have a situation where:

  • Logins Component vX has landed some metrics, declaring them in its yaml file.
  • Lockwise App vY has updated to Logins Component vX, but not enabled the new metrics.
  • Transitively scanning the metrics.yaml files for Lockwise App vY and its dependencies makes it look like the app should be emitting logins-component metrics, but it isn't.

Which I guess in turn, is what leads to the "The tables that will contain the ingested data will contain all of these fields" situation.

Contributor:

These are good points, thanks!

Thank you for your patience on this :) I've also contributed a more general discussion here.

It's also possible that the way I've named and described this makes it sound more fragile than it is, so just to be clear, here's how I imagined this being used in practice:

* Apps that are already using the logins component (Lockwise, soon to be Fenix depending on whether our metrics land before or after their initial logins integration) can upgrade to newer versions of the logins component without getting any surprise new metrics that may not have been data-reviewed etc.

That's a good point. I think that this should be called out in the changelog in Android Components and in here. People updating should be aware and should be consulting the changelog anyway.

* Probably with some prodding from our team, those apps do a data-review for the new metrics and update their app to call `logins.enableTelemetry(1)` at startup. As of the next release of the app, they start emitting those metrics.
  
  * Importantly, they _**do not**_ pref individual collection of these metrics on or off at runtime; every version of the app as of this release includes the metrics in their telemetry, just like if it had added some new metrics into its own `metrics.yaml` file.

They are required to do the data review for adding the Glean component. Yes, this might need some help from your team to make sure that happens, given that we currently have no build-time way to enforce this. We're trying to address it, but it's a bit complicated. See this bug.

* If we change the way metrics works in the logins component in the future, we go through the above cycle again, but this time to new version of the app needs to change to call `logins.enableTelemetry(2)` at startup.

I think that should not be required, unless you merge some concerning metric in your component. And, if you do that, we have much bigger problems :-)

In order to transition this from "I think" to "that's our policy" we reached out to Data Stewardship to hopefully encode these practices.

Other components are currently working this way and it would be really hard for them to work differently than that. One prime example could be engine-gecko-* components: the metrics coming from that component live in mozilla-central (see here for how it works :D ).

Gecko metrics seem to have a high churn rate, with metrics being removed/added quite frequently. I'm not sure products would want to file a data-review on a daily basis, or maybe twice a day, every time a new GeckoView nightly comes out.

Using named constants instead of integers might make it a bit clearer, and maybe "version" is not the best name. But that's the idea, a kind of application-level once-and-done opt-in for "yes we're ready for these metrics to start flowing now" rather than a runtime user-level opt-in-or-out scenario.

Metrics versioning adds another layer of confusion to the analysis, which I'd strongly
discourage doing. The tables that will contain the ingested data will contain all of
these fields, no matter the version.

Does the usage described above change the possibility for confusion here? In my head it doesn't seem any different to adding new metrics directly into a new version of an app, but I'm sure there's plenty of nuance involved that I don't really understand well.

I'm afraid not :( This has to do with the way Glean "automagically" builds tables containing your data. The probe-scraper fetches the metrics.yaml for all the dependencies of Mozilla apps, then assembles a schema for storing the data. It knows about dependencies from this file (which reminds me we need to add logins there, before you can see any data :) ).

I would encourage all of us to properly discuss this and make it a Q1 objective.

👍 certainly happy to meet and discuss. I don't have a good sense of your team's schedule, would you be happy to set something up with the right people and timezones @Dexterp37? Ideally we can involve @grigoryk but I also feel like I have a good understanding of his requirements and could represent them if necessary.

This feedback is important to us, we'll make time for it. On our end, we can at least commit to discussing and designing a solution. I think @grigoryk and the representative for the product who asked for this requirement should both attend :)

Which I guess in turn, is what leads to the "The tables that will contain the ingested data will contain all of these fields" situation.

Practically speaking, since the table would have all columns, how would you tell if the data is missing because no collection took place vs the data is missing because the component disabled telemetry? It's complicated, unless we report the active status of telemetry for that component. That would not solve the problem, it would only push this problem to analysts who would need to know one more wrinkle to work with your data :)

rfk (Contributor, Author):

It does, thank you!

Contributor:

@rfk, given @chutten's comment, does this PR still make sense?

rfk (Contributor, Author):

It does not; I'm going to go ahead and close it down. Thanks @Dexterp37 @chutten!

Contributor:

@chutten is it fair to think of this as a trade-off between simplicity and accuracy of the data review process?

Let's take this example, for instance:

From your example, Lockwise upgrading to a new Sync version is fine from Data Stewardship's POV because the new Sync version's data collections will have been reviewed for collection from the Lockwise population. (And if they haven't, then that's something Data Stewardship will help amend. No biggie, just let us know if you notice it and we'll file some bugs to fix it up.)

It's not clear to me how this is supposed to actually work in practice. Components can and will expand their telemetry collection. Products will upgrade to newer versions of components, or include new components they haven't used before, possibly without filling out data review forms (since developers may not notice a changelog notice, or even look for one). All of this is likely to create a situation in which we're collecting more telemetry than we intend to, for a larger audience than we intend to, without actually realizing that this is happening and without a process in place to reliably prevent it.

I understand that for updates to metrics, from a developer's point of view, a separate review process for each update could be quite onerous if this happens a lot. It will place a higher burden on data stewards as well. But that seems like a fairly unique case; it's hard to imagine that most components will have churny metrics.

I think one way to phrase my problem with the current situation is that we're over-relying on humans and process, and under-relying on tooling/API design. It's currently completely possible to land telemetry in a new product, that's completely out-of-context for that product, without knowing this happened. Is my assumption true here, or am I missing something that would prevent this from happening?

Another way to look at this: just because telemetry XY is fine to collect within Product A, it's not necessarily the case that it's fine to collect for Product B. Privacy implications/promises of these two products may be quite different!

(slightly offtopic, but this also ties into the larger strategy around components - they exist, in part, to allow us to ship and maintain a variety of different products, supposedly a larger number than we currently have. This hypothetical future of many products, many devs, etc, amplifies the problem.)

rfk (Contributor, Author):

I think one way to phrase my problem with the current situation is that we're
over-relying on humans and process, and under-relying on tooling/API design.

The impression I've come away with is basically: we can and should improve the tooling for this over time (once, in Glean) but the current process won't fail in sufficiently catastrophic ways that we need to block on improved tooling in the meantime.

### Telemetry

This component supports collecting quality and performance telemetry using the [Glean SDK](https://mozilla.github.io/glean/),
but this collection is disabled by default. Applications that send telemetry via Glean may opt in to gathering these metrics by
Contributor:

This would be an exception to the way other components work, I think we really need a general solution to this problem, and I'm willing to prioritize this.

@thomcc (Contributor) left a comment:

This made me notice a few issues I missed during my review of the metrics, but I'll file a follow-up for those.

I think this is mostly fine; you could keep the current API, though I have a couple of nits. I've also suggested what (IMO) is a better/more idiomatic API, although take that with a grain of salt since I don't write Kotlin that much these days.


private const val LATEST_METRICS_VERSION = 1
private const val METRICS_V1 = 1
private var enabledMetricsVersion = 0
Contributor:

This should be a @Volatile, and you should consider synchronizing on something (at the moment synchronizing wouldn't matter, but it would be required to prevent stuff like TOCTOU errors if enableTelemetry grows to look at enabledMetricsVersion too).
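Applied to the vars above, the suggestion looks roughly like this (a sketch; the PR declares the constants as `const`, dropped here so the snippet stands alone):

```kotlin
// Thread-safety sketch using the names from the diff. @Volatile makes a
// write to enabledMetricsVersion immediately visible to metrics-recording
// calls on other threads; @Synchronized makes the check-then-write atomic,
// which becomes necessary (the TOCTOU case mentioned above) as soon as
// enableTelemetry also reads enabledMetricsVersion.
val LATEST_METRICS_VERSION = 1

@Volatile
var enabledMetricsVersion = 0
    private set

@Synchronized
fun enableTelemetry(version: Int) {
    if (version > LATEST_METRICS_VERSION) {
        throw RuntimeException("mozilla.appservices.logins is on metrics version $LATEST_METRICS_VERSION, not $version")
    }
    enabledMetricsVersion = version
}
```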


@Suppress("TooGenericExceptionThrown")
fun enableTelemetry(version: Int) {
    if (version > LATEST_METRICS_VERSION) {
Contributor:

Using [check](https://kotlinlang.org/api/latest/jvm/stdlib/kotlin/check.html) might be better here, and would avoid the @Suppress (we don't tend to use check elsewhere, but that's only because I only just learned about it by reading A-C code, lol).
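With check, the function above might read as follows (a sketch: check throws IllegalStateException, which is still a RuntimeException subclass, so existing catch blocks keep working):

```kotlin
// The same guard written with kotlin.check, avoiding both the explicit
// RuntimeException and the @Suppress("TooGenericExceptionThrown") annotation.
val LATEST_METRICS_VERSION = 1

var enabledMetricsVersion = 0
    private set

fun enableTelemetry(version: Int) {
    check(version <= LATEST_METRICS_VERSION) {
        "mozilla.appservices.logins is on metrics version $LATEST_METRICS_VERSION, not $version"
    }
    enabledMetricsVersion = version
}
```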


@linabutler (Contributor):

I'll hold off on reviewing, since there's still discussion to be had, but to summarize why we want this (for my own benefit, if no one else's 😁):

  1. Adding new metrics to a Rust component requires two data reviews: one for a-s, for the change to its metrics.yaml, and one for the consuming app, like Fenix, Lockwise, etc., when it bumps the version of a-s it depends on for the new metrics.
  2. We're concerned that consumers might not ask for that second data review, or, indeed, even be aware of metrics changes in a-s.
  3. It's not realistic to expect consumers to peruse our changelogs for every release, especially if they're bumping the a-s version for other reasons (like a new API or component).
  4. This problem is magnified when a consumer depends on a Rust component through another abstraction layer, like a-c. This means we need to call out new metrics as a breaking change in our changelog, then copy-paste that "breaking change" warning into the a-c changelog, and then expect consumers to read the changelog for a-c.
  5. The a-c team pushed back on (5) in Update to latest app-services, including metrics from logins component mozilla-mobile/android-components#5223, because this approach doesn't scale. It also means version bumps have to be done separately; otherwise, if a-c bumps the a-s version that introduces new APIs (3) and new metrics, they'd need to get data review for metrics that might be entirely unrelated to what they're developing.
  6. Without the second data review, a downstream consumer might end up emitting new metrics that it didn't expect—even though the metrics had data review when they were added to the library.

Versioning metrics is a way for the consumer to say "yes, I've gotten the second data review, and understand we'll be emitting new metrics after bumping the library version."

@rfk (Contributor, Author) commented Dec 10, 2019

Closing this as it sounds like it's going to cause more problems than it solves, and the problems it solves are ones that we don't intend to actually have. Thanks for an excellent discussion all!

@rfk rfk closed this Dec 10, 2019
@rfk rfk deleted the logins-metrics-v1-flag branch June 7, 2021 10:01

6 participants