This repository has been archived by the owner on Jan 25, 2023. It is now read-only.

First draft of data source & metrics for Firefox install rate monitoring #50

Closed
wants to merge 2 commits

Conversation

bhearsum
Collaborator

@bhearsum bhearsum commented Oct 4, 2022

@amirmozla and I put this together as a first draft to try to migrate monitoring of the Firefox installation success rate into OpMon. For now, we're mostly trying to migrate this STMO query. Once we see the data and some useful graphs, we can think about monitoring/alerting.

@scholtzan scholtzan self-requested a review October 7, 2022 19:24
Collaborator

@scholtzan scholtzan left a comment


Looking good. The only thing that needs some consideration is how clients can be identified.

I'm not sure why the CI isn't running for you. It might have something to do with the switch to requiring GitHub SSO: https://support.circleci.com/hc/en-us/articles/360043002793-Troubleshooting-CircleCI-Access-After-Enabling-Github-SSO

I pushed the config with the changes I suggested here and the validation is succeeding: #51

Since the start date is set to November, the dashboard will not be available before then.
After November it will be available at: https://mozilla.cloud.looker.com/dashboards/operational_monitoring::firefox_install_demo

firefox-install-demo.toml (two outdated review threads, resolved)
WHERE branch != "other"
)
"""
submission_date_column = "submission_date"
Collaborator


OpMon assumes by default that the metrics are computed on a set of clients. The aggregation specified as part of the metric definitions is on a per-client basis. It looks like the install table doesn't have a client_id field. I don't know if document_id can be used here, or maybe some other field that can identify clients?

Suggested change
submission_date_column = "submission_date"
submission_date_column = "submission_date"
client_id_column = "document_id"

Collaborator Author


Unfortunately not :(. There is no way to track clients over time in this table. All pings are entirely separate from one another. (document_id is just a uuid that is different with each submission, regardless of the client.)

(
WITH success_rate_per_branch AS (
SELECT
DATE(submission_timestamp) AS submission_date,
Collaborator


Suggested change
DATE(submission_timestamp) AS submission_date,
DATE(submission_timestamp) AS submission_date,
document_id,

AND installer_type = 'stub'
)
SELECT
submission_date,
Collaborator


Suggested change
submission_date,
submission_date,
document_id,

Co-authored-by: Anna Scholtz <anna@scholtzan.net>
@bhearsum bhearsum closed this Oct 11, 2022
@bhearsum bhearsum reopened this Oct 11, 2022
@bhearsum
Collaborator Author

I'm not sure why the CI isn't running for you. It might have something to do with the switch to requiring GitHub SSO: https://support.circleci.com/hc/en-us/articles/360043002793-Troubleshooting-CircleCI-Access-After-Enabling-Github-SSO

Looks like this was the case - thanks for the pointer!

Collaborator

@scholtzan scholtzan left a comment


This reproduces the linked STMO query. The full config is here: https://github.com/mozilla/opmon-config/pull/51/files

I ran a backfill of the last 3 days and the dashboard seems to reproduce the expected numbers: https://mozilla.cloud.looker.com/dashboards/operational_monitoring::firefox_install_demo

The percentile filter can be ignored on the dashboard since there are no percentile statistics. SUM represents the total submissions that succeeded for each OS. Mean is the average, and should be pretty much the same as on the Redash graph.

OpMon does internally expect a per-client analysis; however, in this case we don't have clients and are only interested in the individual submissions and their average success rate. Since document_id is unique for each submission, it can be used instead of client_id here. An aggregation still needs to be defined; in this case LOGICAL_OR(succeeded) will just return the value of succeeded (since document_id is unique), which then gets cast to either 0 or 100. The average (and sum) are computed as part of the statistical aggregation step.
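A quick Python sketch of that flow (illustrative only, not OpMon code — OpMon does this in SQL, and the ping rows here are made up):

```python
# Hypothetical install pings; each row is one submission with a unique
# document_id, so "per-client" grouping degenerates to one row per group.
pings = [
    {"document_id": "a", "succeeded": True},
    {"document_id": "b", "succeeded": False},
    {"document_id": "c", "succeeded": True},
    {"document_id": "d", "succeeded": True},
]

# Phase 1: per-"client" metric. With document_id as the client column,
# LOGICAL_OR(succeeded) over a single-row group is just succeeded,
# cast to 0 or 100.
per_doc = {p["document_id"]: (100 if p["succeeded"] else 0) for p in pings}

# Phase 2: statistics over the population of documents.
total = sum(per_doc.values())       # the SUM statistic
average = total / len(per_doc)      # the mean statistic = success rate in %
print(total, average)               # 300 75.0
```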

STRUCT ( -- this is the structure opmon expects
[
STRUCT (
"firefox-install-success-rate" AS key, -- dummy experiment/rollout slug to make opmon happy
Collaborator


the slug needs to match the config file name

Suggested change
"firefox-install-success-rate" AS key, -- dummy experiment/rollout slug to make opmon happy
"firefox-install-demo" AS key, -- dummy experiment/rollout slug to make opmon happy


[metrics.install_success_rate]
data_source = "firefox_installs"
select_expression = "round(avg(if(succeeded, 1, 0)) * 100, 1)"
Collaborator


Suggested change
select_expression = "round(avg(if(succeeded, 1, 0)) * 100, 1)"
select_expression = "IF(LOGICAL_OR(succeeded), 1, 0) * 100"

@bhearsum
Collaborator Author

Thank you so much for your help in pushing this along!

This reproduces the linked STMO query. The full config is here: https://github.com/mozilla/opmon-config/pull/51/files

I ran a backfill of the last 3 days and the dashboard seems to reproduce the expected numbers: https://mozilla.cloud.looker.com/dashboards/operational_monitoring::firefox_install_demo

Agreed, it looks like we get the same data as long as the STMO query goes at least 4 days back (I'm guessing it's cutting off data at the wrong time of day if I only query 3 days back).

The percentile filter can be ignored on the dashboard since there are no percentile statistics.

Is this because there's no stable field across multiple submissions (like client_id), or something else?

SUM represents the total submissions that succeeded for each OS. Mean is the average, and should be pretty much the same as on the Redash graph.

Got it. Seeing this makes me realize that the sum as it is now is not really that useful. It would probably be more useful to have a simple SUM across all platforms. I'm guessing that this means a second data source and metrics block (without the os_version group by)?

OpMon does internally expect a per-client analysis; however, in this case we don't have clients and are only interested in the individual submissions and their average success rate. Since document_id is unique for each submission, it can be used instead of client_id here. An aggregation still needs to be defined; in this case LOGICAL_OR(succeeded) will just return the value of succeeded (since document_id is unique), which then gets cast to either 0 or 100. The average (and sum) are computed as part of the statistical aggregation step.

Cool! I'm not certain I fully understand this part yet though -- is that expression what feeds the y columns? E.g., is it applied to each row the query returns?

If you'd like, I'm happy to close this ticket out for now in favour of #51. I'm hoping we can get that merged soon (I'd like to follow up with a couple of tweaks - most importantly one to test out alerting).

Thank you again for your help so far!

@scholtzan
Collaborator

scholtzan commented Oct 11, 2022

The percentile filter can be ignored on the dashboard since there are no percentile statistics.

Is this because there's no stable field across multiple submissions (like client_id), or something else?

No, this is because no percentiles are being computed for this dashboard (percentiles are another available statistical computation; here only avg and sum are used), but the dashboard generation logic hasn't been updated to remove the filter when percentiles aren't actually used. So it's currently a UI bug on our side that will be fixed soon.

Got it. Seeing this makes me realize that the sum as it is now is not really that useful. It would probably be more useful to have a simple SUM across all platforms. I'm guessing that this means a second data source and metrics block (without the os_version group by)?

Yes, that should work, if it is actually more useful to have the total over all submissions.

Cool! I'm not certain I fully understand this part yet though -- is that expression what feeds the y columns? E.g., is it applied to each row the query returns?

OpMon executes two phases of computations on the data:

  • the first phase is to compute metrics on a per-client basis. What that means is that clients could send multiple pings (for example, multiple main pings) on a single day. This phase aggregates the values of all the pings a client sends into a single value per client. Imagine we are interested in the search_counts which are sent as part of the main ping. To get the total number of searches by client, we'd want to sum up the search_counts for each client from each ping that was sent. This is the aggregation that is specified as part of the metric select_expression.
  • the second phase is to compute statistics on the client population. The result is essentially a single value that is representative of the client population. For example, if we were interested in how many searches a client does on average daily, then the statistical computation mean would need to be used.

So what the metric configurations specify is how multiple client pings should be aggregated, and how the metric values of all the clients should be aggregated.

This table doesn't differentiate between clients, but the same concept applies on a per-document instead of per-client basis here.
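A small Python sketch of those two phases (illustrative only, not OpMon code; the client IDs and search counts are made up):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (client_id, search_counts) pairs; a client may send
# several main pings per day.
pings = [
    ("client-1", 2), ("client-1", 3),   # client-1 sent two pings
    ("client-2", 5),
    ("client-3", 0), ("client-3", 1),
]

# Phase 1: per-client metric -- sum search_counts across each client's
# pings (this is what the metric select_expression specifies).
per_client = defaultdict(int)
for client_id, search_counts in pings:
    per_client[client_id] += search_counts

# Phase 2: statistic over the client population -- the "mean" statistic.
avg_searches = mean(per_client.values())
print(dict(per_client))   # {'client-1': 5, 'client-2': 5, 'client-3': 1}
print(avg_searches)       # ≈ 3.67
```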

I'll merge #51 now

@bhearsum
Collaborator Author

Cool! I'm not certain I fully understand this part yet though -- is that expression what feeds the y columns? E.g., is it applied to each row the query returns?

OpMon executes two phases of computations on the data:

* the first phase is to compute metrics on a per-client basis. What that means is that clients could send multiple pings (for example, multiple `main` pings) on a single day. This phase aggregates the values of all the pings a client sends into a single value per client. Imagine we are interested in the `search_counts` which are sent as part of the `main` ping. To get the total number of searches by client, we'd want to sum up the `search_counts` for each client from each ping that was sent. This is the aggregation that is specified as part of the `metric` `select_expression`.

* the second phase is to compute statistics on the client population. The result is essentially a single value that is representative of the client population. For example, if we were interested in how many searches a client does on average daily, then the statistical computation `mean` would need to be used.

So what the metric configurations specify is how multiple client pings should be aggregated, and how the metric values of all the clients should be aggregated.

This table doesn't differentiate between clients, but the same concept applies on a per-document instead of per-client basis here.

Ah, I think I get it now. So OpMon needs something to aggregate on, but in a case where we either can't or don't want to aggregate by client, some sort of unique value will satisfy the requirement (and effectively results in each row being treated as an individual "client"). In this case, we have document_id to do this (but if we were dealing with a table that had no such unique field, we might have to do something different).

Thank you again for your help!
