Add explicit opt-in for logins component telemetry.#2325
Add explicit opt-in for logins component telemetry.#2325rfk wants to merge 2 commits intologins-metrics-v1from
Conversation
We don't want to surprise consuming applications with extra metrics that they didn't expect, so hide the logins component telemetry behind an explicit enablement call.
| throw RuntimeException("mozilla.appservices.logins is on metrics version $LATEST_METRICS_VERSION, not $version") | ||
| } | ||
| enabledMetricsVersion = version | ||
| } |
There was a problem hiding this comment.
This may be extremely unidiomatic Kotlin, I'm happy to take suggestions for improvement.
There was a problem hiding this comment.
I think something like this would be more idiomatic/less error prone:
object LoginsTelemetry {
enum class Version(val number: Int) {
// Possibly better to use `null` for disabled and have a
// separate `disable()` function...
Disabled(0),
V1(1),
}
@Synchronized
fun enable(version: Version) {
// check `version.value` against the known max version (see below)
// update a private var, ...
}
}It would have the calling code looking like LoginsTelemetry.enable(LoginsTelemetry.Version.V1) or something (Prefixing this with logins avoids import being annoying for the user when we add PlacesTelemetry or whatever in the future). This would prevent the user from specifying a version higher than what we support.
We'd still probably want to check that the arg to enable has a version lower than our known max, to prevent the case where the calling code was compiled against a newer version of the logins package than it got at runtime, which can happen due to the compilation classpath being different from the runtime classpath, and having a newer version on the compilation classpath than runtime (aka gradle misconfiguration weirdness), but this would prevent accidents.
EDIT: actually maybe ignore much of this, I hadn't seen that glean team dislikes metrics versioning when I wrote it.
|
|
||
| - Added invalid character checks from Desktop to `LoginsStorage.ensureValid` and introduced `INVALID_LOGIN_ILLEGAL_FIELD_VALUE` error. ([#2262](https://github.com/mozilla/application-services/pull/2262)) | ||
| - The Android bindings can now collect some basic performance and quality metrics via Glean. | ||
| Collection is off by default, but consuming apps that submit telemetry via glean are encouraged |
There was a problem hiding this comment.
Is this required by some specific product you're currently targetting?
There was a problem hiding this comment.
It's driven by the push-back on integration into a-c as a breaking change here: mozilla-mobile/android-components#5223
| /** | ||
| * This component can emit metrics via Glean, but the consuming application | ||
| * needs to explicitly opt-in. It does so by calling this function to tell | ||
| * us what version of our metrics it wants to opt in to. |
There was a problem hiding this comment.
I think this makes things much more complicated when analysing the data. See this Fenix issue. From that message:
For example, when correlating measurements that may or may not have been collected, will we have a way of telling whether something isn't there because it didn't happen or because the user opted out
Is there a specific, concrete use case you're trying to address with this that is currently open? Or is this attempting to resolve the problem in advance?
Metrics versioning adds another layer of confusion to the analysis, which I'd strongly discourage doing. The tables that will contain the ingested data will contain all of these fields, no matter the version.
There was a problem hiding this comment.
For additional clarity, in the current form, the solution also breaks documentation, since versions should be mentioned for metrics.
I would encourage all of us to properly discuss this and make it a Q1 objective.
There was a problem hiding this comment.
These are good points, thanks!
It's also possible that the way I've named and described this makes it sound more fragile than it is, so just to be clear, here's how I imagined this being used in practice:
- Apps that are already using the logins component (Lockwise, soon to be Fenix depending on whether our metrics land before or after their initial logins integration) can upgrade to newer versions of the logins component without getting any surprise new metrics that may not have been data-reviewed etc.
- Probably with some prodding from our team, those apps do a data-review for the new metrics and update their app to call
logins.enableTelemetry(1)at startup. As of the next release of the app, they start emitting those metrics.- Importantly, they do not pref individual collection of these metrics on or off at runtime; every version of the app as of this release includes the metrics in their telemetry, just like if it had added some new metrics into its own
metrics.yamlfile.
- Importantly, they do not pref individual collection of these metrics on or off at runtime; every version of the app as of this release includes the metrics in their telemetry, just like if it had added some new metrics into its own
- If we change the way metrics works in the logins component in the future, we go through the above cycle again, but this time to new version of the app needs to change to call
logins.enableTelemetry(2)at startup.
Using named constants instead of integers might make it a bit clearer, and maybe "version" is not the best name. But that's the idea, a kind of application-level once-and-done opt-in for "yes we're ready for these metrics to start flowing now" rather than a runtime user-level opt-in-or-out scenario.
Metrics versioning adds another layer of confusion to the analysis, which I'd strongly
discourage doing. The tables that will contain the ingested data will contain all of
these fields, no matter the version.
Does the usage described above change the possibility for confusion here? In my head it doesn't seem any different to adding new metrics directly into a new version of an app, but I'm sure there's plenty of nuance involved that I don't really understand well.
I would encourage all of us to properly discuss this and make it a Q1 objective.
👍 certainly happy to meet and discuss. I don't have a good sense of your team's schedule, would you be happy to set something up with the right people and timezones @Dexterp37? Ideally we can involve @grigoryk but I also feel like I have a good understanding of his requirements and could represent them if necessary.
There was a problem hiding this comment.
every version of the app as of this release includes the metrics in their telemetry,
just like if it had added some new metrics into its own metrics.yaml file.
I guess one difference here is, you could have a situation where:
- Logins Component vX has landed some metrics, declaring them in its yaml file.
- Lockwise App vY has updated to Logins Component vX, but not enabled the new metrics.
- Transitively scanning the
metrics.yamlfiles for Lockwise App vY and its dependencies makes it look like the app should be emitting logins-component metrics, but it isn't.
Which I guess in turn, is what leads to the "The tables that will contain the ingested data will contain all of these fields" situation.
There was a problem hiding this comment.
These are good points, thanks!
Thank you for your patience on this :) I've also contributed with a more general discussion here.
It's also possible that the way I've named and described this makes it sound more fragile than it is, so just to be clear, here's how I imagined this being used in practice:
* Apps that are already using the logins component (Lockwise, soon to be Fenix depending on whether our metrics land before or after their initial logins integration) can upgrade to newer versions of the logins component without getting any surprise new metrics that may not have been data-reviewed etc.
That's a good point. I think that this should be called out in the changelog in Android Components and in here. People updating should be aware and should be consulting the changelog anyway.
* Probably with some prodding from our team, those apps do a data-review for the new metrics and update their app to call `logins.enableTelemetry(1)` at startup. As of the next release of the app, they start emitting those metrics. * Importantly, they _**do not**_ pref individual collection of these metrics on or off at runtime; every version of the app as of this release includes the metrics in their telemetry, just like if it had added some new metrics into its own `metrics.yaml` file.
They are required to do the data review for adding the Glean component. Yes, this might need some help by your team to make sure that happens, given that we currently have no build-time way to enforce this. We're trying to address it, but it's a bit complicated. See this bug.
* If we change the way metrics works in the logins component in the future, we go through the above cycle again, but this time to new version of the app needs to change to call `logins.enableTelemetry(2)` at startup.
I think that should not be required, unless you merge some concerning metric in your component. And, if you do that, we have much bigger problems :-)
In order to transition this from "I think" to "that's our policy" we reached out to Data Stewardship to hopefully encode these practices.
Other components are currently working this way and it would be really hard for them to work differently than that. One prime example could be engine-gecko-* components: the metrics coming from that component live in mozilla-central (see here for how it works :D ).
Gecko metrics seem to have an high churn rate, with metrics being removed/added quite frequently. I'm not sure products would want to file a data-review on a daily basis, or maybe twice a day, once every new GeckoView nightly comes out.
Using named constants instead of integers might make it a bit clearer, and maybe "version" is not the best name. But that's the idea, a kind of application-level once-and-done opt-in for "yes we're ready for these metrics to start flowing now" rather than a runtime user-level opt-in-or-out scenario.
Metrics versioning adds another layer of confusion to the analysis, which I'd strongly
discourage doing. The tables that will contain the ingested data will contain all of
these fields, no matter the version.Does the usage described above change the possibility for confusion here? In my head it doesn't seem any different to adding new metrics directly into a new version of an app, but I'm sure there's plenty of nuance involved that I don't really understand well.
I'm afraid not :( This has to do with the way Glean "automagically" builds tables containing your data. The probe-scraper fetches the metrics.yaml for all the dependencies of Mozilla apps, then assembles a schema for storing data in. It nows about dependencies from this file (which reminds me we need to add logins there, before you can see any data :) ).
I would encourage all of us to properly discuss this and make it a Q1 objective.
👍 certainly happy to meet and discuss. I don't have a good sense of your team's schedule, would you be happy to set something up with the right people and timezones @Dexterp37? Ideally we can involve @grigoryk but I also feel like I have a good understanding of his requirements and could represent them if necessary.
This feedback is important to us, we'll make time for it. On our end, we can at least commit to discussing and designing a solution. I think @grigoryk and the representative for the product who asked for this requirement should both attend :)
Which I guess in turn, is what leads to the "The tables that will contain the ingested data will contain all of these fields" situation.
Practically speaking, since the table would have all columns, how would you tell if the data is missing because no collection took place vs the data is missing because the component disabled telemetry? It's complicated, unless we report the active status of telemetry for that component. That would not solve the problem, it would only push this problem to analysts who would need to know one more wrinkle to work with your data :)
There was a problem hiding this comment.
It does not; I'm going to go head and close it down. Thanks @Dexterp37 @chutten!
There was a problem hiding this comment.
@chutten is it fair to think of this as a trade-off between simplicity and accuracy of the data review process?
Let's take this example, for instance:
From your example, Lockwise upgrading to a new Sync version is fine from Data Stewardship's POV because the new Sync version's data collections will have been reviewed for collection from the Lockwise population. (And if they haven't, then that's Data Stewardship will help amend that. No biggie, just let us know if you notice it and we'll file some bugs to fix it up.)
It's not clear to me how this is supposed to actually work in practice. Components can and will expand their telemetry collection. Products will upgrade to newer versions of components, or include new components they haven't used before, possibly without filling out data review forms (since developers may not notice a changelog notice, or even look for them, etc). All of this is likely to create a situation in which we're collecting more telemetry than we intend to, for a larger audience than we intend to, without actually realizing that this is happening, or a process in place to reliably prevent this from happening.
I understand that for updates to metrics, from a developer's point of view, a separate review process for each update could be quite onerous, if this happens a lot. It will also place a higher burden on data stewards, as well. But, that seems like a fairly unique case; it's hard to imagine that most components will have churny metrics.
I think one way to phrase my problem with the current situation is that we're over-relying on humans and process, and under-relying on tooling/API design. It's currently completely possible to land telemetry in a new product, that's completely out-of-context for that product, without knowing this happened. Is my assumption true here, or am I missing something that would prevent this from happening?
Another way to look at this, just because telemetry XY is fine to collect within Product A, it's not necessarily the case that it's fine to collect for Product B. Privacy implications/promises of these two products may be quite different!
(slightly offtopic, but this also ties into the larger strategy around components - they exist, in part, to allow us to ship and maintain a variety of different products, supposedly a larger number than we currently have. This hypothetical future of many products, many devs, etc, amplifies the problem.)
There was a problem hiding this comment.
I think one way to phrase my problem with the current situation is that we're
over-relying on humans and process, and under-relying on tooling/API design.
The impression I've come away with is basically: we can and should improve the tooling for this over time (once, in Glean) but the current process won't fail in sufficiently catastrophic ways that we need to block on improved tooling in the meantime.
| ### Telemetry | ||
|
|
||
| This component supports collecting quality and performance telemetry using the [Glean SDK](https://mozilla.github.io/glean/), | ||
| but this collection is disabled by default. Applications that send telemetry via Glean may opt-in to gathering these metrics by |
There was a problem hiding this comment.
This would be an exception to the way other components work, I think we really need a general solution to this problem, and I'm willing to prioritize this.
thomcc
left a comment
There was a problem hiding this comment.
This made me notice a few issues I missed during my review of the metrics, But I'll file a follow up for those.
I think this is mostly fine, you could keep the current API, but have a couple nits, but I've also suggested what (IMO) is a better/more idiomatic API. Although, take it with a grain of salt since I don't write kotlin that much now.
|
|
||
| private const val LATEST_METRICS_VERSION = 1 | ||
| private const val METRICS_V1 = 1 | ||
| private var enabledMetricsVersion = 0 |
There was a problem hiding this comment.
This should be a @Volatile, and you should consider synchronizing on something (at the moment synchronizing wouldn't matter, but it would be required to prevent stuff like TOCTOU errors if enableTelemetry grows to look at enabledMetricsVersion too).
|
|
||
| @Suppress("TooGenericExceptionThrown") | ||
| fun enableTelemetry(version: Int) { | ||
| if (version > LATEST_METRICS_VERSION) { |
There was a problem hiding this comment.
Using (check)[https://kotlinlang.org/api/latest/jvm/stdlib/kotlin/check.html] might be better here, and avoid the @Suppress (we don't tend to use check elsewhere, but that's only because I only just learned about it by reading A-C code, lol).
| throw RuntimeException("mozilla.appservices.logins is on metrics version $LATEST_METRICS_VERSION, not $version") | ||
| } | ||
| enabledMetricsVersion = version | ||
| } |
There was a problem hiding this comment.
I think something like this would be more idiomatic/less error prone:
object LoginsTelemetry {
enum class Version(val number: Int) {
// Possibly better to use `null` for disabled and have a
// separate `disable()` function...
Disabled(0),
V1(1),
}
@Synchronized
fun enable(version: Version) {
// check `version.value` against the known max version (see below)
// update a private var, ...
}
}It would have the calling code looking like LoginsTelemetry.enable(LoginsTelemetry.Version.V1) or something (Prefixing this with logins avoids import being annoying for the user when we add PlacesTelemetry or whatever in the future). This would prevent the user from specifying a version higher than what we support.
We'd still probably want to check that the arg to enable has a version lower than our known max, to prevent the case where the calling code was compiled against a newer version of the logins package than it got at runtime, which can happen due to the compilation classpath being different from the runtime classpath, and having a newer version on the compilation classpath than runtime (aka gradle misconfiguration weirdness), but this would prevent accidents.
EDIT: actually maybe ignore much of this, I hadn't seen that glean team dislikes metrics versioning when I wrote it.
|
I'll hold off on reviewing, since there's still discussion to be had, but, to summarize why we want this (for my own benefit, if no one else's 😁):
Versioning metrics is a way for the consumer to say "yes, I've gotten the second data review, and understand we'll be emitting new metrics after bumping the library version." |
9a60c75 to
28dce90
Compare
|
Closing this as it sounds like it's going to cause more problems than it solves, and the problems it solves are ones that we don't intend to actually have. Thanks for an excellent discussion all! |
For our logins-component metrics gathering to be palatable to consumers, it needs to be off by default; otherwise we risk a "metrics surprise!" for consuming apps that upgrade without carefully reading the list of breaking changes.
This adds an explicit function that consumers can call to opt-in to metrics gathering, in a way that only means they're opting in to the current set of metrics, not all metrics that we might ever add in future. I'm not thrilled about the API, but it's simple and I think it has a low probability of being used incorrectly.
Ref mozilla-mobile/android-components#5223 for discussion about this for integration into android-components. Note that this is targeting the active PR #2225, since I don't want to merge that into master until it's ready to integrate upstream.
I hope that we can eventually replace this with some functionality/tooling provided by Glean rather than having to roll it ourselves, but that depends on a lot of figuring-out-the-right-thing-in-general that we don't want to block on here. Ref https://bugzilla.mozilla.org/show_bug.cgi?id=1598247 for some early discussion on that front.
Pull Request checklist
cargo test --allproduces no test failurescargo clippy --all --all-targets --all-featuresruns without emitting any warningscargo fmtdoes not produce any changes to the code./gradlew ktlint detektruns without emitting any warningsswiftformat --swiftversion 4 megazords components/*/ios && swiftlintruns without emitting any warnings or producing changes[ci full]to the PR title.Any breaking changes to Swift or Kotlin binding APIs are noted explicitlyDependencies: This PR follows our [dependency management guidelines](https://github.com/mozilla/application-services/blob/master/docs/dependency-management.md)Any new dependencies are accompanied by a summary of the due dilligence applied in selecting them.