
Switch from watermarking to counting time in bands #109066

Closed
wants to merge 1 commit

Conversation

MikeSpreitzer
Member

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR replaces the use of watermarking with counting the time spent in bands of utilization, and also increases the sampling period by a factor of 10. The goal is to reduce both the runtime CPU spent on these metrics and the volume of metrics produced.

This is hoped to partially address #108272
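To make the idea concrete, here is a minimal, self-contained sketch (not the code in this PR) of counting time in utilization bands; every type and function name below is invented for illustration.

package main

import (
	"fmt"
	"time"
)

// bandAccumulator tracks how much wall time the utilization ratio spends in each
// band; a band is identified by its lower bound.
type bandAccumulator struct {
	lowerBounds []float64       // lower bound of each band, ascending
	spent       []time.Duration // wall time accumulated per band
	lastChange  time.Time       // when the current band was last charged
	current     int             // index of the band we are currently in
}

func newBandAccumulator(lowerBounds []float64, now time.Time) *bandAccumulator {
	return &bandAccumulator{
		lowerBounds: lowerBounds,
		spent:       make([]time.Duration, len(lowerBounds)),
		lastChange:  now,
	}
}

// observe is called when the utilization ratio changes (or on a periodic tick):
// it charges the elapsed time to the band we were in, then switches bands.
func (b *bandAccumulator) observe(now time.Time, ratio float64) {
	b.spent[b.current] += now.Sub(b.lastChange)
	b.lastChange = now
	b.current = 0
	for i, lb := range b.lowerBounds {
		if ratio >= lb {
			b.current = i
		}
	}
}

func main() {
	start := time.Now()
	acc := newBandAccumulator([]float64{0, 0.9, 1}, start)
	acc.observe(start.Add(3*time.Second), 0.95) // 3s charged to the band starting at 0
	acc.observe(start.Add(5*time.Second), 1.0)  // 2s charged to the band starting at 0.9
	acc.observe(start.Add(6*time.Second), 0.2)  // 1s charged to the fully saturated band
	for i, lb := range acc.lowerBounds {
		fmt.Printf("time with utilization >= %v (up to the next bound): %v\n", lb, acc.spent[i])
	}
}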

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

TBD

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


Also multiply sampling period by 10.

To reduce work on these metrics.
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 28, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MikeSpreitzer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 28, 2022
@MikeSpreitzer
Member Author

/cc @wojtek-t
/cc @tkashem

[]string{priorityLevel},
)
// PriorityLevelConcurrencyObserverPairGenerator creates pairs that observe concurrency for priority levels
PriorityLevelConcurrencyObserverPairGenerator = NewSampleAndWaterMarkHistogramsPairGenerator(clock.RealClock{}, time.Millisecond,
PriorityLevelConcurrencyObserverPairGenerator = NewSampleAndCountHistogramsPairGenerator(clock.RealClock{}, time.Millisecond*10,
Member

NewSampleAndWaterMarkHistograms seems to be no longer used anywhere.
Can we remove it?

Member Author

Yes.

)

const (
labelNameLB = "lb"
Member

What is LB supposed to be? For me the first association with LB is "load-balancer", which clearly isn't the case here.
Can we be more explicit?

Member Author

"lower bound".
I was thinking that if a histogram can use "le" for "Less than or Equal", this could use "lb" for "Lower Bound".

Member

Let's call it "lower_bound"
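For illustration, a band counter keyed by an explicit lower-bound label could look roughly like the sketch below, written against plain prometheus/client_golang rather than the component-base wrappers used in-tree; the metric and label names are invented and are not the ones this PR registers.

package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// bandSeconds accumulates wall time per priority level and utilization band,
// with the band named by its lower bound.
var bandSeconds = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "demo_priority_level_utilization_band_seconds_total",
		Help: "Wall time spent with utilization in the band starting at lower_bound.",
	},
	[]string{"priority_level", "lower_bound"},
)

func main() {
	prometheus.MustRegister(bandSeconds)
	// Charge 250ms of wall time to the band whose lower bound is 0.9.
	bandSeconds.WithLabelValues("workload-low", "0.9").Add((250 * time.Millisecond).Seconds())
	fmt.Println("recorded 250ms in band 0.9")
}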

}
}

type SampleAndCountObserverGenerator struct {
Member

Shouldn't SampleAndCountObserverGenerator just be an interface? (and sampleAndCountObserverGenerator just its implementation)?
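A minimal sketch of that split, with only the interface exported and the concrete type unexported; the method set shown here is invented for illustration and is not the package's actual API.

package metrics

import "time"

// SampleAndCountObserverGenerator hands out observers that sample a ratio and
// accumulate time-in-band counts for a given set of label values.
type SampleAndCountObserverGenerator interface {
	Generate(initialNumerator, initialDenominator float64, labelValues []string) RatioObserver
}

// RatioObserver stands in for whatever observer type the generator returns.
type RatioObserver interface {
	Observe(numerator float64)
	SetDenominator(denominator float64)
}

// sampleAndCountObserverGenerator is the unexported implementation behind the
// exported interface.
type sampleAndCountObserverGenerator struct {
	samplePeriod time.Duration
	// metric vectors, clock, etc. elided in this sketch
}

var _ SampleAndCountObserverGenerator = &sampleAndCountObserverGenerator{}

func (g *sampleAndCountObserverGenerator) Generate(num, den float64, labelValues []string) RatioObserver {
	// Construction of the concrete observer is elided here.
	return nil
}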

labelNameLB = "lb"
)

// NewSampleAndCountHistogramsGenerator makes a new one
Member

nit: please update function name to match the function name below

when, whenInt, acc, wellOrdered := func() (time.Time, int64, sampleAndCountAccumulator, bool) {
saw.Lock()
defer saw.Unlock()
// Moved these variables here to tiptoe around https://github.com/golang/go/issues/43570 for #97685
Member

The mentioned bug seems to be closed.
Can we verify whether we still need it?

ConstLabels: map[string]string{phase: "executing"},
StabilityLevel: compbasemetrics.ALPHA,
},
[]float64{0.9, 1},
Member

I was thinking about it, and how about 0.5, 0.9 and 0.99?

Member Author

Can you explain your thinking?

The virtue of 1 is that it tells us how much time was spent completely saturated. For a priority level with a concurrency limit of 100 or more, that is very different from --- and, I think, more interesting than --- the amount of time with at least 99% utilized.

Maybe 0.9 is pretty boring; it is really unlikely that the utilization would spend a lot of time in [0.9, 1) without that showing up in the samples.

Member

I'm very confused by this: you're replacing a histogram with a CounterVec with explicit buckets? Why not just reduce the buckets in the existing histogram? The only difference between a CounterVec and a histogram is the aggregate metrics you get with a histogram (you get two additional summary metrics).

Member

The reason is that histogram/summary compute quantiles from the reported observations, which is not precisely what we care about: what we want is to say what percentage of the "real time" we were X% saturated.

@MikeSpreitzer - the value of "1" was kind of special as long as each request occupied 1 seat. Now it's no longer that special, because we may not be able to consume more requests even with occupancy less than 1.
Now - as a cluster operator I would like to be able to use those metrics not just to signal that we're out of capacity, but also to be able to tune them, e.g. in response to organic growth of the load. I don't have a very strong preference about the exact numbers, but 0.5 is a kind-of-useful value, and 0.9 and 0.99 are stop-gaps.
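To illustrate the point about wanting fractions of real time rather than quantiles: given two scrapes of hypothetical per-band cumulative-seconds counters, the share of wall time spent in a band over the scrape interval is just a ratio of deltas, with no quantile estimation involved. The names and numbers below are made up.

package main

import "fmt"

// fractionInBand returns the share of the scrape interval spent in the band with
// the given lower-bound label. prev and curr map label value -> cumulative seconds
// as reported at the two scrape times.
func fractionInBand(prev, curr map[string]float64, intervalSeconds float64, lowerBound string) float64 {
	return (curr[lowerBound] - prev[lowerBound]) / intervalSeconds
}

func main() {
	prev := map[string]float64{"0.9": 100, "1": 40}
	curr := map[string]float64{"0.9": 130, "1": 58}
	// Over a 60s interval: 50% of the time in [0.9, 1), 30% fully saturated,
	// so 80% of the time at or above 90% utilization.
	fmt.Println(fractionInBand(prev, curr, 60, "0.9")) // 0.5
	fmt.Println(fractionInBand(prev, curr, 60, "1"))   // 0.3
}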

Name: "priority_level_seat_count_watermarks",
Help: "Watermarks of the number of seats occupied for any stage of execution (but only initial stage for WATCHes)",
Buckets: []float64{0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1},
Name: "priority_level_seat_count_band_secs",
Member

I guess we probably shouldn't be changing the type of the already exposed metric...
We should deprecate the historical one and introduce a new one instead.

@dgrisonnet @logicalhan - for thoughts

Member

Yes, your intuition that we should deprecate the old one and introduce a new one is mostly correct, but since you are renaming this metric, you're effectively deleting the old alpha metric and creating a brand-new one. Since there is a memory-usage issue with the old one, I'm actually okay with this approach, but we should definitely note this in the release notes, since anyone ingesting the old metric is just going to stop receiving data.

Member Author

I am not sure I understand. We are replacing the watermark histograms with a few counters; there is no doubt in my mind about changing the type.

klog.Errorf("Time went backwards from %s to %s for labelValues=%#+v", lastSetS, whenS, saw.labelValues)
}
for acc.lastSetInt < whenInt {
saw.samples.WithLabelValues(saw.labelValues...).Observe(acc.ratio)
Member

I'm still struggling with this one a bit, namely: what does this metric really give us?

Once we have the counter, we know how much time we actually spent in each of the predefined buckets. So I don't really see how I would be supposed to use this histogram in addition to the counter above.

Member Author

The sample histograms have a complete set of buckets, while the band counters focus on just the extremely high values.

Member

But the bigger the sample period we take, the less usable that is (as it may be completely inaccurate). And additionally, we're not really solving the core problem, because we're still reporting a bunch of metrics here.

Also - I know that it gives us a complete set of buckets - but how will I use it? What do I get from knowing that I spent 20% of the time in the 0.2 bucket instead of the 0.3 bucket?
I guess my point is: if we add 0.5 to our counter (and maybe one more small value like 0.1 or so), I don't know how I would ever want to use this metric.

if wellOrdered {
bucket := findBucket(saw.countBuckets, saw.ratio)
if saw.lastBucket >= 0 {
saw.counts.WithLabelValues(saw.countLabelValues[saw.lastBucket]...).Add(dt.Seconds())
Member

One problem with this is that we report the time only for a given bucket.
This seems fine as long as we name the bucket in a clear way, e.g. not just the end of the bucket but rather the whole bucket, something like "0.9-1.0" (or something like that).
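A small sketch of that naming suggestion: derive one label value per band from the list of boundaries so the label names the whole band rather than only one end of it. The helper below is hypothetical.

package main

import (
	"fmt"
	"strconv"
)

// bandLabels turns ascending band boundaries into labels that name the whole
// band; the last band is open-ended.
func bandLabels(bounds []float64) []string {
	labels := make([]string, 0, len(bounds))
	for i, lo := range bounds {
		if i+1 < len(bounds) {
			labels = append(labels, strconv.FormatFloat(lo, 'g', -1, 64)+"-"+strconv.FormatFloat(bounds[i+1], 'g', -1, 64))
		} else {
			labels = append(labels, strconv.FormatFloat(lo, 'g', -1, 64)+"+")
		}
	}
	return labels
}

func main() {
	fmt.Println(bandLabels([]float64{0, 0.9, 1})) // [0-0.9 0.9-1 1+]
}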

Member

Is .9 representative of the 90th percentile?

Why not use a summary metric if that's the case?

Member

See my response above - we don't want quantiles from the observations.

@k8s-ci-robot
Contributor

@MikeSpreitzer: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-unit e1a84c0 link true /test pull-kubernetes-unit
pull-kubernetes-integration e1a84c0 link true /test pull-kubernetes-integration
pull-kubernetes-e2e-kind-ipv6 e1a84c0 link true /test pull-kubernetes-e2e-kind-ipv6
pull-kubernetes-e2e-kind e1a84c0 link true /test pull-kubernetes-e2e-kind
pull-kubernetes-e2e-gce-ubuntu-containerd e1a84c0 link true /test pull-kubernetes-e2e-gce-ubuntu-containerd
pull-kubernetes-e2e-gce-100-performance e1a84c0 link true /test pull-kubernetes-e2e-gce-100-performance

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@MikeSpreitzer
Member Author

Several of the comments are rooted in the replacement of a complete histogram with counters covering just a few bands of possible values. I also find this dissatisfying.

Remember that a histogram is just a collection of counters that follow a certain pattern. I wondered if I could get the behavior I want (a histogram with Add instead of Observe) by manipulating actual counters whose names, labels, and semantics follow the same pattern. I was stopped by the following thoughts.

A scrape has these # TYPE lines as well, and they would be different. Maybe it would not actually matter to someone applying histogram_quantile in PromQL?

Utilization <= 1 is much less interesting than utilization >= 1. However, if we focus on the complement of utilization, namely unused or spare capacity, a bucket for spare <= 0 tells us the same thing as utilization >= 1.

I was not enthused about replicating the logic that keeps a member in a Vec for each combination of labels in use --- efficiently. Actually, this is not a blocker; the sample-and-watermark histograms file is already keeping an object per label combination in use.

In a histogram, one observation causes an increment in several counters. Maybe that is not a prohibitive cost? Or maybe I could synthesize that by attacking this at a lower level that allows me to do the sums at gather time rather than Add time.
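As a sketch of that last thought (not code from this PR): plain counters can follow a histogram's cumulative-bucket pattern while supporting Add, at the cost of one Add fanning out to several counters and the scrape advertising them with "# TYPE ... counter" rather than as a histogram. All names below are invented.

package main

import (
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// utilizationSecondsBucket mimics histogram bucket counters: each one counts the
// seconds spent at a utilization ratio <= its "le" upper bound.
var utilizationSecondsBucket = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "demo_utilization_seconds_bucket",
		Help: "Seconds spent at a utilization ratio <= the bucket's upper bound.",
	},
	[]string{"le"},
)

// addToBuckets charges dt to every bucket that contains ratio, mirroring how a
// histogram's Observe increments every bucket whose upper bound is >= the value.
func addToBuckets(upperBounds []float64, ratio float64, dt time.Duration) {
	for _, ub := range upperBounds {
		if ratio <= ub {
			utilizationSecondsBucket.WithLabelValues(strconv.FormatFloat(ub, 'g', -1, 64)).Add(dt.Seconds())
		}
	}
}

func main() {
	prometheus.MustRegister(utilizationSecondsBucket)
	// 3 seconds at 95% utilization lands in the 0.999999 and 1 buckets but not 0.5 or 0.9.
	addToBuckets([]float64{0.5, 0.9, 0.999999, 1}, 0.95, 3*time.Second)
}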

@MikeSpreitzer
Member Author

MikeSpreitzer commented Mar 28, 2022

Actually, for utilization, both 0 and 1 are interesting values to distinguish from all others. So reversing the polarity only changes which one of them is not easy to distinguish. But there is another simple hack. Using utilization buckets closed on the top end (as in histograms today), have a bucket boundary at 0.999999 as well as at 1.

The boundary at 1 is not even needed for a normal histogram, because the implicit +inf bucket will cover it.

@MikeSpreitzer
Member Author

On second thought, the way to represent accumulated time is obvious --- because the accumulator is not necessarily a float64. With the pattern that @beorn7 showed, the accumulator can be anything and is converted to a float64 at Collect time. So we can simply use a time.Duration as the accumulator.
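A sketch of that pattern, assuming a hand-written Collector against plain prometheus/client_golang: the accumulator stays a time.Duration and is converted to float64 seconds only when the registry gathers. The names and locking granularity here are illustrative only.

package main

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// durationCounter accumulates wall time per band and exposes it as counters.
type durationCounter struct {
	mu    sync.Mutex
	desc  *prometheus.Desc
	spent map[string]time.Duration // accumulated wall time per lower_bound label value
}

func newDurationCounter() *durationCounter {
	return &durationCounter{
		desc: prometheus.NewDesc(
			"demo_utilization_band_seconds_total",
			"Wall time spent in each utilization band.",
			[]string{"lower_bound"}, nil,
		),
		spent: map[string]time.Duration{},
	}
}

// Add charges dt to the given band without any float conversion.
func (c *durationCounter) Add(band string, dt time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.spent[band] += dt
}

func (c *durationCounter) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

// Collect converts each accumulated Duration to seconds only at gather time.
func (c *durationCounter) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for band, d := range c.spent {
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.CounterValue, d.Seconds(), band)
	}
}

func main() {
	dc := newDurationCounter()
	prometheus.MustRegister(dc)
	dc.Add("0.9", 1500*time.Millisecond)
}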


@cici37
Contributor

cici37 commented Mar 29, 2022

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 29, 2022
@logicalhan
Member

/triage accepted
/assign @logicalhan @dgrisonnet @CatherineF-dev

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 24, 2022
@k8s-ci-robot
Contributor

@MikeSpreitzer: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@MikeSpreitzer
Member Author

This PR is moot; we are taking a more fundamental whack at the problem in the PR series including #110104.
