Rolling PR for telemetry #1112
Conversation
This looks great and goes a lot further than I expected :) I've a few small queries, but wouldn't mind looking at it again and at the glean integration, but nice work!
```diff
@@ -1,4 +1,4 @@
-libraryVersion: 0.27.1
+libraryVersion: 0.27.1-lina5
```
mhammond
May 8, 2019
Member
I assume this should be reverted ;)
```rust
let mut reuploads = Vec::new();
let mut stmt = self.db.prepare(&format!(
    "WITH
     ranks(rank, at, guid, action) AS (
```
mhammond
May 8, 2019
Member
/me googles sqlite rank.
/me goes "Nice!"
IIUC, this will rely on a single timestamp for all entries written here, which I believe you do correctly in store.rs, but I wonder if a comment there indicating how important the single timestamp is is worthwhile?
linacambridge
May 10, 2019
Author
Contributor
It definitely is. I'd also like to test this out for large logs—what happens if we have 10k events? 100k?
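For readers puzzling over the single-timestamp point: the query groups log entries into syncs by their write timestamp, so every row written by one sync must share one value. A minimal sketch of the idea (Python's `sqlite3` for brevity, with a portable subquery standing in for the real query's ranking; the table and column names here are made up, not the actual schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE log (at INTEGER, guid TEXT, action TEXT)")

# Each sync writes all of its entries with ONE timestamp, captured once
# up front. If every row fetched its own clock value instead, entries
# from a single sync could straddle two timestamps and be split across
# ranks, corrupting the "last N syncs" grouping.
for sync_num in range(3):
    now = 1000 + sync_num  # the single per-sync timestamp
    db.executemany(
        "INSERT INTO log VALUES (?, ?, ?)",
        [(now, "guid%d" % i, "reupload") for i in range(2)],
    )

# Rank syncs newest-first by counting distinct timestamps at or after
# each row's own; all rows written by one sync land in the same rank.
rows = db.execute(
    """SELECT (SELECT COUNT(DISTINCT b.at) FROM log b WHERE b.at >= a.at)
              AS rank, a.at, a.guid
       FROM log a
       ORDER BY rank, a.guid"""
).fetchall()
```

With per-sync timestamps, `rows` cleanly pairs up into three ranks of two entries each; interleaved timestamps would scatter one sync's entries across ranks.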
```diff
@@ -0,0 +1,248 @@
+/* This Source Code Form is subject to the terms of the Mozilla Public
```
mhammond
May 8, 2019
Member
how do you picture the glean integration working? I guess I was kinda assuming that the glean integration part might just do the string -> json -> glean in the same module and avoid exposing this to the app layer, but that's probably because I don't understand how that integration will shape up.
linacambridge
May 11, 2019
Author
Contributor
IIUC, the integration needs to happen at the app layer, that's part of the problem. 😕 Fenix defines the schema for the pings (YAML files that are code-generated into Kotlin bindings), so the ping needs to get passed through Android Components. a-c can't directly translate this into Glean, because it provides Glean as a "service", not a "feature", and it doesn't define the actual metric names.
Edit: I don't think that last part is true, Glean defines its own metrics.yaml and pings.yaml files. Maybe we can push the engine ping definitions into a-c?
We could have Android Components pass JSON strings around, and have Fenix deserialize them, but I don't think they'd be too keen on that. Their wrappers around our APIs (Connection::syncHistory and Connection::syncBookmarks) would need to return strings, we'd need to pass those opaque strings up through the different layers to Fenix, and Fenix would need to know that it's a JSON telemetry ping that it should unpack and send.
So the way this works is:

- We deserialize the ping from Rust into a `mozilla.appservices.support.SyncTelemetryPing`.
- Android Components passes that up through all its layers to a new `onStoreSynced()` observer, which takes the ping as the argument. (There's some additional serialization we have to do here, since it also runs the sync in a worker. See mozilla-mobile/android-components#2971.)
- Fenix unpacks the `SyncTelemetryPing`, and calls the Glean submission methods like this.

@grigoryk, does that make sense to you? Are there ways we can simplify?
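The handoff described above amounts to: Rust serializes the ping to JSON, the bindings parse it into a typed structure, and the app layer maps that onto Glean metrics. A rough sketch of the parsing step (Python for brevity; the field names are hypothetical illustrations, not the real ping schema):

```python
import json
from dataclasses import dataclass, field

@dataclass
class SyncTelemetryPing:
    # Hypothetical fields for illustration only; the real ping carries
    # per-engine records, failure reasons, validation data, and so on.
    version: int
    uid: str
    events: list = field(default_factory=list)

    @classmethod
    def from_json(cls, raw: str) -> "SyncTelemetryPing":
        # Parse the JSON string handed over the FFI boundary into a
        # typed structure the app layer can pass up to Glean.
        obj = json.loads(raw)
        return cls(obj["version"], obj["uid"], obj.get("events", []))

raw = '{"version": 1, "uid": "abc123", "events": []}'
ping = SyncTelemetryPing.from_json(raw)
```

The Kotlin equivalent would live in the bindings so that only a typed `SyncTelemetryPing`, never a raw JSON string, crosses the a-c/Fenix boundary.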
linacambridge
May 11, 2019
Author
Contributor
(Also, I'm planning to add a comment to this file explaining how this all works, and why! 😄 )
Did a quick skim (I'm assuming this is still not ready for full review?), only real comment is about us needing to free the strings returned from syncing on iOS to avoid leaking a decently large JSON string on each sync.
...t/android/src/main/java/mozilla/appservices/support/SyncTelemetryPing.kt
```rust
/// Fetches a list of items that have been reuploaded with new structure for
/// the last 5 syncs. This is captured in logs and telemetry, and suggests
/// that we might be in a sync loop with another client (bug 1530145).
fn fetch_consecutive_reuploads(&self) -> Result<Vec<ConsecutiveReupload>> {
```
linacambridge
May 30, 2019
Author
Contributor
I'm going to punt on this for later. There's probably a better way to collect this data, anyway—for example, tracking the number of bookmarks with BookmarkValidityState::Reupload and BookmarkValidityState::Replace in validation telemetry, along with the other problems.
By themselves, consecutive uploads for the same record aren't bad—they're a symptom of other problems that we can already track—and validation and structure change data seems sufficient for now. Also, this query is pretty big, complicated, scans the entire moz_bookmarks table and the WITH views for each sync, and doesn't capture everything...an invalid remote item that's unchanged locally will have a Local merge state, not RemoteWithNewStructure.
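The alternative mentioned above, counting validity states in validation telemetry instead of running the big reupload query, would amount to something like this (a Python sketch; the state names mirror Dogear's `BookmarkValidityState` variants, and the input list is invented for illustration):

```python
from collections import Counter

# Hypothetical per-item validity outcomes from one merge, mirroring
# BookmarkValidityState::{Valid, Reupload, Replace}.
states = ["Valid", "Valid", "Reupload", "Replace", "Reupload"]

# Validation telemetry would report one count per problem kind, a
# cheap tally over merge results, instead of scanning the entire
# moz_bookmarks table and the WITH views on every sync.
counts = Counter(states)
```

Per-sync counts like these still surface a client stuck reuploading, without the cost or the blind spots of the consecutive-reupload query.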
Awesome Lina, great work!
```diff
@@ -116,6 +119,18 @@ internal inline fun <U> rustCall(syncOn: Any, callback: (RustError.ByReference)
     }
 }
 
+@Suppress("TooGenericExceptionThrown")
```
mhammond
May 31, 2019
Member
It's unfortunate this is duplicated from push - do you see any scope to share this in a followup?
thomcc
May 31, 2019
Contributor
Can't prior to #1103, since places_destroy_string != push_destroy_string. After that we can find a place for shared dtors and such, potentially.
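The pattern under discussion: each crate exports its own string destructor (`places_destroy_string` vs. `push_destroy_string`), so the glue that copies out and frees the returned JSON can't yet be shared. The shape of that glue, sketched abstractly in Python (the `destroy` callback stands in for the per-crate destructor, and the string argument stands in for an FFI pointer):

```python
def consume_rust_string(ptr_to_str, destroy):
    """Copy a string out of an FFI-owned buffer, then free it.

    `ptr_to_str` and `destroy` stand in for the FFI pointer and the
    per-crate *_destroy_string function; without the destroy call in
    the finally block, each sync would leak a decently large JSON
    string, which is exactly the iOS concern raised above.
    """
    try:
        return str(ptr_to_str)
    finally:
        destroy(ptr_to_str)

freed = []
result = consume_rust_string('{"version": 1}', freed.append)
```

Once a shared component exists (post-#1103), this wrapper could take the destructor as a parameter exactly like the sketch does, instead of being duplicated per crate.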
```diff
@@ -254,6 +254,8 @@ pub enum SyncFailure {
 
+pub fn sync_failure_from_error(e: &Error) -> SyncFailure {
+    SyncFailure::Unexpected {
+        // TODO: Distinguish between error types, truncate, and anonymize
```
mhammond
May 31, 2019
Member
We should have a followup to do this - hopefully the SyncStatus/SyncResult stuff we landed recently makes that a little easier.
(although the taskcluster failure looks like a real issue)
This commit reverts the protobuf schema, in favor of passing the serialized ping directly. On iOS, we can pass this ping directly through to the existing ping sender. On Android, we'll need to unpack the ping into a Kotlin structure, and pass that to Fenix via Android Components. Fenix can then assemble a Glean ping from the Kotlin `SyncTelemetryPing`.
This commit reports Dogear's problem counts, along with potential sync loops caused by reuploading new structure for the same record in five consecutive syncs, in validation results.
* Move `SyncTelemetryPing.kt` into a new `sync15` library. This library will be consumed directly by a-c, and exposed to Fenix.
* Free the ping JSON string in Swift.
* Pass `Engine`, not `EngineIncoming` and `Validation`, to `Store::apply_incoming`.
* Implement `SyncTelemetryPing::toJSON()` for serializing pings to JSON on the Kotlin side. This will be used to pass the ping payload in an `androidx.work.Data` wrapper in a-c.
Merged f6ca45a into master.
Depends on mozilla/dogear#41.
Pull Request checklist

- `cargo test --all` produces no test failures
- `cargo clippy --all --all-targets --all-features` runs without emitting any warnings
- `cargo fmt` does not produce any changes to the code
- `./gradlew ktlint detekt` runs without emitting any warnings
- `swiftformat --swiftversion 4 megazords components/*/ios && swiftlint` runs without emitting any warnings or producing changes
- `[ci full]` added to the PR title