breakdown PodSchedulingDuration by number of attempts #92650
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ahg-g. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
pkg/scheduler/scheduler.go
Outdated
metrics.PodSchedulingDuration.Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp))

// We break down the pod scheduling duration by attempts (capped to a limit).
attempts := podInfo.Attempts
@logicalhan do we really need to cap? can I just use attempts as is?
while I don't think the number of attempts will have a wide range of values, leaving it unbounded is certainly not a good idea.
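To make the cardinality concern concrete, here is a minimal standalone sketch of the capping idea under discussion (the helper name `cappedAttempts` is hypothetical; the cap of 50 mirrors the diff below, not a recommendation):

```go
package main

import (
	"fmt"
	"strconv"
)

// cappedAttempts maps an attempt count to a metric label value,
// clamping at a fixed cap so the label can only take a bounded
// number of distinct values, keeping metric cardinality bounded.
func cappedAttempts(n int) string {
	if n > 50 {
		n = 50
	}
	return strconv.Itoa(n)
}

func main() {
	fmt.Println(cappedAttempts(3))   // "3"
	fmt.Println(cappedAttempts(120)) // "50"
}
```

Without the clamp, each new attempt count would create a fresh label value (and thus a fresh time series), which is why leaving it unbounded is a bad idea.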
pkg/scheduler/scheduler.go
Outdated
if attempts > 50 {
	attempts = 50
}
metrics.PodSchedulingDuration.WithLabelValues(strconv.Itoa(attempts)).Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp))
Could you send a parent PR that changes the variable name InitialAttemptTimestamp to QueuingTimestamp or something like that?
We should do that in a follow up PR to allow for patching this PR back if needed.
How about explicitly expressing that internally we have already "bucketed" the attempts? (We can adjust the buckets, e.g. [5, 10), [10, 20), [20, +inf).)
metrics.PodSchedulingDuration.WithLabelValues(displayAttempts(podInfo.Attempts)).Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp))
func displayAttempts(n int) string {
	if n < 5 {
		return strconv.Itoa(n)
	} else if n < 20 {
		return "[5, 20)"
	}
	return "[20, +inf)"
}
The most important information here is actually distinguishing pods scheduled on the first attempt, so perhaps we can change the label to first_attempt, with a value of true or false.
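For comparison, the boolean alternative suggested above could look like this (a sketch; the helper name `firstAttemptLabel` is hypothetical and not from the PR):

```go
package main

import "fmt"

// firstAttemptLabel reduces the attempt count to a two-valued label:
// "true" if the pod was scheduled on its first attempt, "false" otherwise.
// This keeps cardinality minimal at the cost of discarding the exact count.
func firstAttemptLabel(attempts int) string {
	if attempts <= 1 {
		return "true"
	}
	return "false"
}

func main() {
	fmt.Println(firstAttemptLabel(1)) // "true"
	fmt.Println(firstAttemptLabel(4)) // "false"
}
```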
I'm inclined to keep the current version as it exposes more info (maybe not that useful for now) while still allowing us to infer whether it's the first attempt or not.
sg, I increased the limit to 15; I think that's a small enough cardinality and will likely cover most cases. We should also increase the limit for PodSchedulingAttempts, I think 5 is too low.
Shouldn't we display 15 as [15, +inf) or something similar, so that users know it's not literally 15?
changed it to "15+", likely clear enough, sg?
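The labeling the thread converged on (cap at 15, top bucket rendered as "15+") could be sketched as follows; the helper name `attemptsLabel` is illustrative, not the PR's actual code:

```go
package main

import (
	"fmt"
	"strconv"
)

// attemptsLabel maps an attempt count to a metric label value.
// Counts below the cap are rendered verbatim; everything at or above
// the cap is rendered as "15+" so users know it is a bucket, not a
// literal count.
func attemptsLabel(n int) string {
	const maxAttempts = 15
	if n >= maxAttempts {
		return strconv.Itoa(maxAttempts) + "+"
	}
	return strconv.Itoa(n)
}

func main() {
	fmt.Println(attemptsLabel(1))  // "1"
	fmt.Println(attemptsLabel(15)) // "15+"
	fmt.Println(attemptsLabel(40)) // "15+"
}
```

This keeps the label set to at most 16 distinct values while still exposing exact counts for the common low-attempt cases.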
SG.
/retest
/priority important-soon
Force-pushed from 2552052 to fa18572
/assign @Huang-Wei Aldo is on vacation for the next couple of days.
LGTM generally. Just one comment.
Force-pushed from ce6f48b to b7476e0
/lgtm
/lgtm
sorry, updated the comment, can you lgtm again :)
/retest
@ahg-g: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest
Review the full test history for this PR. Silence the bot with an
What type of PR is this?
/kind feature
What this PR does / why we need it:
Breakdown PodSchedulingDuration by number of attempts. This is useful when monitoring the scheduler to understand how scheduling attempts impact latency.
Special notes for your reviewer:
Does this PR introduce a user-facing change?: