Add rules for cluster CPU-hours and Instance-hours #418

Conversation

kahowell
Contributor

These are intended to help simplify PromQL used by subscription watch,
as well as make it more visible to others.

Additionally, the encapsulation allows us to tweak the definition of
CPU-hours or instance-hours if better underlying metrics are made
available.

Note this is untested; I'm happy to assist with testing, but I lack context/fixtures.

@openshift-ci openshift-ci bot requested review from bwplotka and sthaha July 28, 2022 20:48
@openshift-ci
Contributor

openshift-ci bot commented Jul 28, 2022

Hi @kahowell. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 28, 2022
// max(...) by (_id) is used to ensure a single datapoint per cluster ID
record: 'cluster:usage:workload:capacity_physical_cpu_hours',
expr: |||
max(sum_over_time(cluster:usage:workload:capacity_physical_cpu_cores:max:5m[1h:5m]) / count_over_time(vector(1)[1h:5m])) by (_id)
Contributor


In the current form, the expression would return no data because the right-hand side has no _id label to match against the left-hand side (vector() turns the scalar into a single-element vector with no labels).
count_over_time(vector(1)[1h:5m]) is always going to return 12 anyway. Hence it could be replaced by 12 (if it's really what you wanted).
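For illustration (a minimal sketch): a vector/vector division only pairs series with identical label sets, so

sum_over_time(cluster:usage:workload:capacity_physical_cpu_cores:max:5m[1h:5m]) # series carry _id and other labels
  / count_over_time(vector(1)[1h:5m])                                           # a single series with no labels

finds no matching pairs and returns nothing, whereas dividing by a scalar (the literal 12, or scalar(count_over_time(vector(1)[1h:5m]))) applies to every series on the left-hand side.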

Contributor Author


Ah, I was missing a scalar call. I PoC'd this change and then forgot to include scalar when I transcribed it.

Contributor Author


As to the 12 thing, the reason I thought it might be wise to use scalar(count_over_time(vector(1)[1h:5m])) instead was that I noticed it returning different answers depending on the step and timestamp passed. For example:

[screenshot]

In practice, we noticed that when using step=3600 and aligned to the top of the hour, we always get a value of `13`:
[screenshot]

I figured this had something to do with how Prometheus does sampling (actually, I'd love a more specific/accurate explanation, if you have one). Thus it seemed unsafe to simply use 13, since, if I understand correctly, the recording rule doesn't necessarily run at the top of the hour. If you believe 13 is fine (or 12), though, I am more than happy to hardcode.
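My rough mental model (which may be off): the [1h:5m] subquery evaluates the inner expression at timestamps aligned to the 5m resolution, and a 1h window covers 12 such steps, so it contains 13 aligned timestamps when the query time itself lands on the 5m grid and only 12 otherwise. For example:

query @ 13:00:00 -> samples at 12:00, 12:05, ..., 12:55, 13:00 = 13
query @ 13:02:30 -> samples at 12:05, 12:10, ..., 12:55, 13:00 = 12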

Contributor


Hmm, this looks like a Thanos artifact. With a vanilla Prometheus, I can't reproduce...

[screenshot]

Contributor Author


@simonpasquier given the above, how should we proceed?

  • hardcode 12?
  • hardcode 13?
  • use scalar(count_over_time(vector(1)[1h:5m]))?

Contributor


scalar(count_over_time(vector(1)[1h:5m])) is probably going to return the "correct" result but I'd be eager to hear from @bwplotka if it's something that he's aware of.
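For reference, with the scalar() form the rule would read something like this (untested sketch):

record: 'cluster:usage:workload:capacity_physical_cpu_hours',
expr: |||
  max(sum_over_time(cluster:usage:workload:capacity_physical_cpu_cores:max:5m[1h:5m]) / scalar(count_over_time(vector(1)[1h:5m]))) by (_id)
|||,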


@bwplotka following up on the question that went your way on August 2. Does this look ok to you?

jsonnet/telemeter/rules.libsonnet (review thread resolved, outdated)
@barnabycourt

/assign @bwplotka per the comment earlier in the PR

@openshift-ci
Contributor

openshift-ci bot commented Aug 5, 2022

@barnabycourt: GitHub didn't allow me to assign the following users: in, PR, per, the, comment, earlier.

Note that only openshift members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @bwplotka per the comment earlier in the PR

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@accorvin

@anishasthana I think you have a lot of experience with rules like these that we're using in RHODS - can you take a look at this too and give your opinion on whether it could satisfy RHODS use cases?

@anishasthana
Member

After talking to Jeff and Kevin, I think these rules would satisfy RHODS use cases.

Contributor

@bwplotka bwplotka left a comment


LGTM!

jsonnet/telemeter/rules.libsonnet (2 review threads resolved, outdated)
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 9, 2022
@moadz
Contributor

moadz commented Sep 9, 2022

LGTM

@openshift-ci
Contributor

openshift-ci bot commented Sep 9, 2022

@moadz: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
@simonpasquier
Contributor

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 12, 2022
@barnabycourt

/retest

@simonpasquier
Contributor

/test e2e-aws-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Sep 15, 2022

@kahowell: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@anishasthana
Member

Looks like we still need an lgtm on the PR.

@douglascamata
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 28, 2022
@openshift-ci
Contributor

openshift-ci bot commented Sep 28, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bwplotka, douglascamata, kahowell, moadz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
