New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1774621: Add OLM CSV metrics #253
Conversation
I'm aware that OLM has been granted extra budget for time-series, but could you elaborate anyways what the maximum cardinality of this metric is per cluster? |
Good question. The metric introduced in this pr consists of 4 labels: Labels being introducedNameThis label tracks the name of the CSV (operator) installed. Many operator authors include the version appended to the operator name, so this could change with each install/upgrade of an operator. VersionThis label tracks the version of the CSV (operator) installed. This value will change with each upgrade / install of an operator. PhaseThis label tracks the phases of a CSV as OLM reconciles it. Here is a list of possible phases:
ReasonThis label tracks why the CSV is in the given phase. A list of Reasons includes:
The potential maximum is quite high, but at this moment it appears that roughly 15 time series are generated when installing a CSV. The average cluster right now has about 3 operators installed - so we can expect roughly 50 time series per cluster. |
If I'm counting correctly, this could potentially produce up to 450 time-series per CSV, even if not active all at once. This seems a little excessive. Are there not some states that we are more interested in than others that we could roll up? |
docs/data-collection.md
Outdated
@@ -97,6 +97,9 @@ For the OpenShift 4 Developer Preview we will be sending back these exact attrib | |||
// subscription_sync_total is the number of times an OLM operator | |||
// Subscription has been synced, labelled by name and installed csv | |||
'{__name__="subscription_sync_total"}', | |||
// csv_sync_total is the number of times an OLM operator | |||
// CSV has been synced, labelled by name, version, phase, and reason | |||
'{__name__="csv_sync_total"}', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the total is the current number of times in a row the version/phase/reason has been seen?
Ie if I am (“foo”, “1.0”, “deployed”, “”) 10, and suddenly it fails, will it be (“foo”, “1.0”, “failed”, “reason”) 1?
The description needs to be a bit better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, it's the current number of syncs for the csv by name, labelled with the other properites
so your example would be:
- (“foo”, “1.0”, “deployed”, “”) 10
- (“foo”, “1.0”, “failed”, “reason”) 11
Can’t we only have one series per operator version live at any time? Or did I misread that. |
We could do that, but I think I'm missing what that would buy us. Wouldn't the "old" timeseries still be stored in the aggregator for querying, unless we actually do a |
Once the old series is dead (5m) we can stop tracking it. After all, if you restart we’d drop all those other states as well. We need some way to control full cardinality - this is definitely important to get right. The bit of info we dont need is the rate of syncs (or it has less value than the others). Can we reasonably limit the cardinality of this to 2 series at any time per (name,version) - have one modeled like up (where it’s 1 if we are in the succeeded phase), and one which reports 1 if the phase is anything other than success? The current flow is the simplest modeling, but we need to strongly control OLM cardinality somehow. |
That's reasonable and is something we can do, but may be tight to make it for Friday. Also - this will reduce our total time series but ultimately still leaves the cardinality under user control. I know you know this, but figured it's worth a reminder because you mentioned "strongly controlling" cardinality. |
@smarterclayton @brancz The suggested changes were implemented in operator-framework/operator-lifecycle-manager#1099 |
/retest |
06005b7
to
0826320
Compare
0826320
to
10746a8
Compare
@awgreene: This pull request references Bugzilla bug 1774621, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/bugzilla refresh |
@awgreene: This pull request references Bugzilla bug 1774621, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: awgreene, brancz The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@awgreene: All pull requests linked via external trackers have merged. Bugzilla bug 1774621 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
No description provided.