First draft of data source & metrics for Firefox install rate monitoring #50
Looking good. The only thing that needs some consideration is how clients can be identified.
I'm not sure why the CI isn't running for you. It might have something to do with the switch to requiring Github SSO: https://support.circleci.com/hc/en-us/articles/360043002793-Troubleshooting-CircleCI-Access-After-Enabling-Github-SSO
I pushed the config with the changes I suggested here and the validation is succeeding: #51
Since the start date is set to November, the dashboard will not be available before then.
After November it will be available at: https://mozilla.cloud.looker.com/dashboards/operational_monitoring::firefox_install_demo
```
    WHERE branch != "other"
)
"""
submission_date_column = "submission_date"
```
OpMon assumes by default that the metrics are computed on a set of clients. The aggregations specified as part of the metric definitions are on a per-client basis. It looks like the install table doesn't have a `client_id` field. I don't know if `document_id` can be used here, or maybe some other field that can identify clients?
```suggestion
submission_date_column = "submission_date"
client_id_column = "document_id"
```
Unfortunately not :(. There is no way to track clients over time in this table. All pings are entirely separate from one another. (`document_id` is just a UUID that is different with each submission, regardless of the client.)
```sql
(
WITH success_rate_per_branch AS (
    SELECT
        DATE(submission_timestamp) AS submission_date,
```
```suggestion
        DATE(submission_timestamp) AS submission_date,
        document_id,
```
```sql
    AND installer_type = 'stub'
)
SELECT
    submission_date,
```
```suggestion
    submission_date,
    document_id,
```
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
Looks like this was the case - thanks for the pointer!
This reproduces the linked STMO query. The full config is here: https://github.com/mozilla/opmon-config/pull/51/files
I ran a backfill of the last 3 days and the dashboard seems to reproduce the expected numbers: https://mozilla.cloud.looker.com/dashboards/operational_monitoring::firefox_install_demo
The percentile filter can be ignored on the dashboard since there are no percentile statistics. SUM represents the total number of submissions that succeeded for each OS. Mean is the average, and should be pretty much the same as on the Redash graph.
OpMon does internally expect a per-client analysis; however, in this case we don't have clients and are only interested in the individual submissions and their average success. Since `document_id` is unique for each submission, it can be used instead of `client_id` in this case. An aggregation still needs to be defined; here `LOGICAL_OR(succeeded)` will just return the value of `succeeded` (since `document_id` is unique), which then gets cast to either 0 or 100. The average (and sum) are computed as part of the statistical aggregation step.
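To make the mechanics concrete, here is a small sketch (not the actual OpMon pipeline, and with made-up data) of why a per-document `LOGICAL_OR(succeeded)` is effectively the identity when `document_id` is unique, and how averaging the resulting 0/100 values yields the success rate. SQLite has no `LOGICAL_OR`, so `MAX` over a 0/1 column stands in for it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE installs (document_id TEXT, succeeded INTEGER)")
# Hypothetical pings: each document_id appears exactly once.
conn.executemany(
    "INSERT INTO installs VALUES (?, ?)",
    [("doc-1", 1), ("doc-2", 1), ("doc-3", 0), ("doc-4", 1)],
)

row = conn.execute(
    """
    -- Phase 1: per-document metric value; MAX == LOGICAL_OR for 0/1,
    -- and with unique document_id it just returns each row's value.
    WITH per_document AS (
        SELECT document_id, MAX(succeeded) * 100 AS install_success_rate
        FROM installs
        GROUP BY document_id
    )
    -- Phase 2: statistical aggregation of the per-document values.
    SELECT AVG(install_success_rate), SUM(install_success_rate) / 100
    FROM per_document
    """
).fetchone()

print(row)  # (75.0, 3): 3 of 4 submissions succeeded
```

The mean of the 0/100 values is the success rate as a percentage, and the sum (divided back by 100) recovers the count of successful submissions.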
```sql
STRUCT ( -- this is the structure opmon expects
    [
        STRUCT (
            "firefox-install-success-rate" AS key, -- dummy experiment/rollout slug to make opmon happy
```
The slug needs to match the config file name:
```suggestion
            "firefox-install-demo" AS key, -- dummy experiment/rollout slug to make opmon happy
```
```toml
[metrics.install_success_rate]
data_source = "firefox_installs"
select_expression = "round(avg(if(succeeded, 1, 0)) * 100, 1)"
```
```suggestion
select_expression = "IF(LOGICAL_OR(succeeded), 1, 0) * 100"
```
Thank you so much for your help in pushing this along!
Agreed, it looks like we get the same data as long as the STMO query goes at least 4 days back (I'm guessing it's cutting off data at the wrong time of day if I only query 3 days back).
Is this because there's no stable field across multiple submissions (like `client_id`)?
Got it. Seeing this makes me realize that the sum as it is now is not really that useful. It probably would be useful to have a simple SUM for all platforms. I'm guessing that this means a second data source and metrics block (without the os_version group by)?
Cool! I'm not certain I fully understand this part yet though -- is that expression what is feeding the mean?

If you'd like, I'm happy to close this ticket out for now in favour of #51. I'm hoping we can get that merged soon (I'd like to follow up with a couple of tweaks - most importantly one to test out alerting). Thank you again for your help so far!
Yes, that should work, if it is actually more useful to have the total over all submissions.
OpMon executes two phases of computations on the data: first a per-client aggregation that collapses each client's pings into a single metric value, then a statistical aggregation (mean, sum, percentiles, etc.) of those values across clients.

So what the metric configurations specify is how multiple client pings should be aggregated, and how the metric values of all the clients should be aggregated. This table doesn't differentiate between clients, but the same concept is applicable on a per-document instead of per-client basis here.

I'll merge #51 now
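The two phases described above can be sketched in plain Python (assumed behavior for illustration, not OpMon's actual code, and with made-up ping data):

```python
from statistics import mean

# Hypothetical pings, one tuple per row: (client_id, succeeded).
# A real client can submit several pings; here "c1" submits twice.
pings = [("c1", True), ("c1", False), ("c2", True), ("c3", False)]

# Phase 1: per-client aggregation -- LOGICAL_OR across a client's pings.
per_client = {}
for client_id, succeeded in pings:
    per_client[client_id] = per_client.get(client_id, False) or succeeded

# Phase 2: statistical aggregation across the per-client 0/100 values.
values = [100 if v else 0 for v in per_client.values()]
stats = {"mean": mean(values), "sum": sum(values)}
print(stats)  # mean is about 66.7, sum is 200
```

With a unique key like `document_id`, phase 1 sees exactly one "ping" per key, so the `or` is a no-op and each row passes straight through to the statistics step.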
Ah, I think I get it now. So OpMon needs something to aggregate on, but in a case where we either can't or don't want to aggregate by client, some sort of unique value will satisfy the requirement (and effectively result in each row being treated as an individual "client"). In this case, we have `document_id`.

Thank you again for your help!
@amirmozla and I put this together as a first draft to try to migrate monitoring of Firefox installation success rate into OpMon. For now, we're mostly trying to migrate this STMO query. Once we see the data and useful graphs, then we can think about monitoring/alerting.