This repository has been archived by the owner on Jan 25, 2023. It is now read-only.

First draft of data source & metrics for Firefox install rate monitoring #50

Closed
wants to merge 2 commits

Conversation

bhearsum
Collaborator

@bhearsum bhearsum commented Oct 4, 2022

@amirmozla and I put this together as a first draft to try to migrate monitoring of the Firefox installation success rate into OpMon. For now, we're mostly trying to migrate this STMO query. Once we see the data and some useful graphs, we can think about monitoring/alerting.

@scholtzan scholtzan self-requested a review October 7, 2022 19:24
Collaborator

@scholtzan scholtzan left a comment


Looking good. The only thing that needs some consideration is how clients can be identified.

I'm not sure why the CI isn't running for you. It might have something to do with the switch to requiring GitHub SSO: https://support.circleci.com/hc/en-us/articles/360043002793-Troubleshooting-CircleCI-Access-After-Enabling-Github-SSO

I pushed the config with the changes I suggested here and the validation is succeeding: #51

Since the start date is set to November, the dashboard will not be available before then.
After November it will be available at: https://mozilla.cloud.looker.com/dashboards/operational_monitoring::firefox_install_demo

firefox-install-demo.toml (two outdated review threads, resolved)
WHERE branch != "other"
)
"""
submission_date_column = "submission_date"
Collaborator


OpMon assumes by default that the metrics are computed on a set of clients. The aggregation specified as part of the metric definitions is on a per-client basis. It looks like the install table doesn't have a client_id field. I don't know if document_id can be used here, or maybe some other field that can identify clients?

Suggested change
submission_date_column = "submission_date"
submission_date_column = "submission_date"
client_id_column = "document_id"

Collaborator Author


Unfortunately not :(. There is no way to track clients over time in this table. All pings are entirely separate from one another. (document_id is just a uuid that is different with each submission, regardless of the client.)

(
WITH success_rate_per_branch AS (
SELECT
DATE(submission_timestamp) AS submission_date,
Collaborator


Suggested change
DATE(submission_timestamp) AS submission_date,
DATE(submission_timestamp) AS submission_date,
document_id,

AND installer_type = 'stub'
)
SELECT
submission_date,
Collaborator


Suggested change
submission_date,
submission_date,
document_id,

Co-authored-by: Anna Scholtz <anna@scholtzan.net>
@bhearsum bhearsum closed this Oct 11, 2022
@bhearsum bhearsum reopened this Oct 11, 2022
@bhearsum
Collaborator Author

I'm not sure why the CI isn't running for you. It might have something to do with the switch to requiring GitHub SSO: https://support.circleci.com/hc/en-us/articles/360043002793-Troubleshooting-CircleCI-Access-After-Enabling-Github-SSO

Looks like this was the case - thanks for the pointer!

Collaborator

@scholtzan scholtzan left a comment


This reproduces the linked STMO query. The full config is here: https://github.com/mozilla/opmon-config/pull/51/files

I ran a backfill of the last 3 days and the dashboard seems to reproduce the expected numbers: https://mozilla.cloud.looker.com/dashboards/operational_monitoring::firefox_install_demo

The percentile filter can be ignored on the dashboard since there are no percentile statistics. SUM represents the total submissions that succeeded for each OS. Mean is the average, and should be pretty much the same as on the Redash graph.

OpMon does internally expect a per-client analysis; however, in this case we don't have clients and are only interested in the individual submissions and their average success rate. Since document_id is unique for each submission, it can be used instead of client_id here. An aggregation still needs to be defined; in this case LOGICAL_OR(succeeded) will just return the value of succeeded (since document_id is unique), which then gets cast to either 0 or 100. The average (and sum) are computed as part of the statistical aggregation step.
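A quick Python sketch of that flow (illustrative only, not OpMon code — OpMon does this in SQL, and the ping rows here are made up):

```python
# Hypothetical install pings; each row is one submission with a unique
# document_id, so "per-client" grouping degenerates to one row per group.
pings = [
    {"document_id": "a", "succeeded": True},
    {"document_id": "b", "succeeded": False},
    {"document_id": "c", "succeeded": True},
    {"document_id": "d", "succeeded": True},
]

# Phase 1: per-"client" metric. With document_id as the client column,
# LOGICAL_OR(succeeded) over a single-row group is just succeeded,
# cast to 0 or 100.
per_doc = {p["document_id"]: (100 if p["succeeded"] else 0) for p in pings}

# Phase 2: statistics over the population of documents.
total = sum(per_doc.values())       # the SUM statistic
average = total / len(per_doc)      # the mean statistic = success rate in %
print(total, average)               # 300 75.0
```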

STRUCT ( -- this is the structure opmon expects
[
STRUCT (
"firefox-install-success-rate" AS key, -- dummy experiment/rollout slug to make opmon happy
Collaborator


the slug needs to match the config file name

Suggested change
"firefox-install-success-rate" AS key, -- dummy experiment/rollout slug to make opmon happy
"firefox-install-demo" AS key, -- dummy experiment/rollout slug to make opmon happy


[metrics.install_success_rate]
data_source = "firefox_installs"
select_expression = "round(avg(if(succeeded, 1, 0)) * 100, 1)"
Collaborator


Suggested change
select_expression = "round(avg(if(succeeded, 1, 0)) * 100, 1)"
select_expression = "IF(LOGICAL_OR(succeeded), 1, 0) * 100"

@bhearsum
Collaborator Author

Thank you so much for your help in pushing this along!

This reproduces the linked STMO query. The full config is here: https://github.com/mozilla/opmon-config/pull/51/files

I ran a backfill of the last 3 days and the dashboard seems to reproduce the expected numbers: https://mozilla.cloud.looker.com/dashboards/operational_monitoring::firefox_install_demo

Agreed, it looks like we get the same data as long as the STMO query goes at least 4 days back (I'm guessing it's cutting off data at the wrong time of day if I only query 3 days back).

The percentile filter can be ignored on the dashboard since there are no percentile statistics.

Is this because there's no stable field across multiple submissions (like client_id), or something else?

SUM represents the total submissions that succeeded for each OS. Mean is the average, and should be pretty much the same as on the Redash graph.

Got it. Seeing this makes me realize that the sum as it is now is not really that useful. It would probably be more useful to have a simple SUM across all platforms. I'm guessing that this means a second data source and metrics block (without the os_version group by)?

OpMon does internally expect a per-client analysis; however, in this case we don't have clients and are only interested in the individual submissions and their average success rate. Since document_id is unique for each submission, it can be used instead of client_id here. An aggregation still needs to be defined; in this case LOGICAL_OR(succeeded) will just return the value of succeeded (since document_id is unique), which then gets cast to either 0 or 100. The average (and sum) are computed as part of the statistical aggregation step.

Cool! I'm not certain I fully understand this part yet though -- is that expression what feeds the y columns? E.g., is it applied to each row the query returns?

If you'd like, I'm happy to close this ticket out for now in favour of #51. I'm hoping we can get that merged soon (I'd like to follow up with a couple of tweaks - most importantly one to test out alerting).

Thank you again for your help so far!

@scholtzan
Collaborator

scholtzan commented Oct 11, 2022

The percentile filter can be ignored on the dashboard since there are no percentile statistics.

Is this because there's no stable field across multiple submissions (like client_id), or something else?

No, this is because no percentiles are being computed for this dashboard (percentiles are another available statistical computation; here only avg and sum are used), but the dashboard generation logic hasn't been updated to remove the filter when percentiles aren't actually used. So it's currently a UI bug on our side that will be fixed soon.

Got it. Seeing this makes me realize that the sum as it is now is not really that useful. It would probably be more useful to have a simple SUM across all platforms. I'm guessing that this means a second data source and metrics block (without the os_version group by)?

Yes, that should work, if it is actually more useful to have the total over all submissions.

Cool! I'm not certain I fully understand this part yet though -- is that expression what feeds the y columns? E.g., is it applied to each row the query returns?

OpMon executes two phases of computations on the data:

  • the first phase is to compute metrics on a per-client basis. What that means is that clients could send multiple pings (for example, multiple main pings) on a single day. This phase aggregates the values of all the pings a client sends into a single value per client. Imagine we are interested in the search_counts which are sent as part of the main ping. To get the total number of searches by client, we'd want to sum up the search_counts for each client from each ping that was sent. This is the aggregation that is specified as part of the metric select_expression.
  • the second phase is to compute statistics on the client population. The result is essentially a single value that is representative of the client population. For example, if we were interested in how many searches a client does on average daily, then the statistical computation mean would need to be used.

So what the metric configurations specify is how multiple client pings should be aggregated, and how the metric values of all the clients should be aggregated.

This table doesn't differentiate between clients, but the same concept applies on a per-document instead of per-client basis here.
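A small Python sketch of those two phases (illustrative only, not OpMon code; the client IDs and search counts are made up):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (client_id, search_counts) pairs; a client may send
# several main pings per day.
pings = [
    ("client-1", 2), ("client-1", 3),   # client-1 sent two pings
    ("client-2", 5),
    ("client-3", 0), ("client-3", 1),
]

# Phase 1: per-client metric -- sum search_counts across each client's
# pings (this is what the metric select_expression specifies).
per_client = defaultdict(int)
for client_id, search_counts in pings:
    per_client[client_id] += search_counts

# Phase 2: statistic over the client population -- the "mean" statistic.
avg_searches = mean(per_client.values())
print(dict(per_client))   # {'client-1': 5, 'client-2': 5, 'client-3': 1}
print(avg_searches)       # ≈ 3.67
```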

I'll merge #51 now

@bhearsum
Collaborator Author

Cool! I'm not certain I fully understand this part yet though -- is that expression what feeds the y columns? E.g., is it applied to each row the query returns?

OpMon executes two phases of computations on the data:

* the first phase is to compute metrics on a per-client basis. What that means is that clients could send multiple pings (for example, multiple `main` pings) on a single day. This phase aggregates the values of all the pings a client sends into a single value per client. Imagine we are interested in the `search_counts` which are sent as part of the `main` ping. To get the total number of searches by client, we'd want to sum up the `search_counts` for each client from each ping that was sent. This is the aggregation that is specified as part of the `metric` `select_expression`.

* the second phase is to compute statistics on the client population. The result is essentially a single value that is representative of the client population. For example, if we were interested in how many searches a client does on average daily, then the statistical computation `mean` would need to be used.

So what the metric configurations specify is how multiple client pings should be aggregated, and how the metric values of all the clients should be aggregated.

This table doesn't differentiate between clients, but the same concept applies on a per-document instead of per-client basis here.

Ah, I think I get it now. So OpMon needs something to aggregate on, but in a case where we either can't or don't want to aggregate by client, some sort of unique value will satisfy the requirement (and effectively results in each row being treated as an individual "client"). In this case, we have document_id to do this (but if we were dealing with a table that had no such unique field, we might have to do something different).

Thank you again for your help!
