breakdown PodSchedulingDuration by number of attempts #92650

Merged: 1 commit into kubernetes:master from the ahg-attempts branch on Jul 2, 2020

ahg-g
Member

@ahg-g ahg-g commented Jun 30, 2020

What type of PR is this?

/kind feature

What this PR does / why we need it:

Break down PodSchedulingDuration by number of attempts. This is useful when monitoring the scheduler to understand how the number of scheduling attempts affects latency.

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Add attempts label to scheduler's PodSchedulingDuration metric.
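
To make the idea concrete, here is a minimal sketch of what "breaking down by attempts" means. It uses a toy in-memory map as a stand-in for a labeled histogram; it is not the scheduler's metrics API, and the type and method names (`durationsByAttempts`, `observe`, `mean`) as well as the sample values are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"time"
)

// durationsByAttempts is a toy stand-in for a labeled histogram: it groups
// observed scheduling durations (in seconds) by the "attempts" label value.
type durationsByAttempts map[string][]float64

// observe records one pod's end-to-end scheduling duration under the label
// derived from its attempt count.
func (d durationsByAttempts) observe(attempts int, elapsed time.Duration) {
	label := strconv.Itoa(attempts)
	d[label] = append(d[label], elapsed.Seconds())
}

// mean returns the average observed duration for a label, or 0 if there
// are no observations for it.
func (d durationsByAttempts) mean(label string) float64 {
	obs := d[label]
	if len(obs) == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range obs {
		sum += v
	}
	return sum / float64(len(obs))
}

func main() {
	m := durationsByAttempts{}
	// Illustrative values only, not real measurements.
	m.observe(1, 200*time.Millisecond)
	m.observe(1, 400*time.Millisecond)
	m.observe(3, 12*time.Second)

	// Print labels in sorted order so the output is stable.
	labels := make([]string, 0, len(m))
	for l := range m {
		labels = append(labels, l)
	}
	sort.Strings(labels)
	for _, l := range labels {
		fmt.Printf("attempts=%s mean=%.2fs\n", l, m.mean(l))
	}
}
```

With a per-label view like this, pods scheduled on the first attempt can be separated from pods that needed retries when reading latency.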

@k8s-ci-robot added the release-note, size/S, kind/feature, cncf-cla: yes, needs-sig, and needs-priority labels on Jun 30, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved, sig/instrumentation, and sig/scheduling labels and removed the needs-sig label on Jun 30, 2020
metrics.PodSchedulingDuration.Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp))

// We break down the pod scheduling duration by attempts (capped to a limit).
attempts := podInfo.Attempts
Member Author

@logicalhan do we really need to cap? can I just use attempts as is?

Member Author

while I don't think the number of attempts will have a wide range of values, leaving it unbounded is certainly not a good idea.
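
A small sketch of the cardinality concern: every distinct label value becomes its own time series, so an unbounded attempts label can grow without limit. The helper names (`cappedLabel`, `distinctLabels`) and the simulated attempt counts are assumptions for illustration; the cap of 50 mirrors the draft in this PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// cappedLabel is a hypothetical helper: it converts an attempt count to a
// label value, clamping anything above limit to limit.
func cappedLabel(attempts, limit int) string {
	if attempts > limit {
		attempts = limit
	}
	return strconv.Itoa(attempts)
}

// distinctLabels counts how many different label values a set of attempt
// counts produces; each distinct value would be a separate time series.
func distinctLabels(attempts []int, label func(int) string) int {
	seen := make(map[string]struct{})
	for _, a := range attempts {
		seen[label(a)] = struct{}{}
	}
	return len(seen)
}

func main() {
	// Simulated attempt counts 1..1000 (illustrative data, not measurements).
	attempts := make([]int, 0, 1000)
	for i := 1; i <= 1000; i++ {
		attempts = append(attempts, i)
	}
	unbounded := distinctLabels(attempts, strconv.Itoa)
	capped := distinctLabels(attempts, func(a int) string { return cappedLabel(a, 50) })
	fmt.Println("unbounded series:", unbounded)
	fmt.Println("capped series:", capped)
}
```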

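if attempts > 50 {
	attempts = 50
}
// strconv.Itoa formats the count as decimal text; string(attempts) would yield the rune with that code point.
metrics.PodSchedulingDuration.WithLabelValues(strconv.Itoa(attempts)).Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp))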
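
One detail worth checking when building a label value from an integer in Go: converting an int with `string(n)` interprets n as a Unicode code point instead of formatting it as a decimal number, so `strconv.Itoa` is the usual choice. A minimal sketch of the difference (the function names `runeLabel` and `decimalLabel` are made up for illustration):

```go
package main

import (
	"fmt"
	"strconv"
)

// runeLabel shows the pitfall: string(rune(n)), which is what string(n)
// does for an int, yields the character whose code point is n.
func runeLabel(n int) string {
	return string(rune(n))
}

// decimalLabel formats n as base-10 text, which is what a metric label
// value is expected to contain.
func decimalLabel(n int) string {
	return strconv.Itoa(n)
}

func main() {
	fmt.Printf("%q\n", runeLabel(50))    // "2" (code point 50 is the digit '2')
	fmt.Printf("%q\n", decimalLabel(50)) // "50"
}
```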
Member

Could you send a parent PR that changes the variable name InitialAttemptTimestamp to QueueuingTimestamp or something like that?

Member Author

We should do that in a follow up PR to allow for patching this PR back if needed.

Member

How about explicitly expressing that internally we have already "bucketed" the attempts (we can adjust the buckets, e.g. [5, 10), [10, 20), [20, +inf)):

metrics.PodSchedulingDuration.WithLabelValues(displayAttempts(podInfo.Attempts)).Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp))

func displayAttempts(n int) string {
	if n < 5 {
		return strconv.Itoa(n)
	} else if n < 20 {
		return "[5, 20)"
	}
	return "[20, +inf)"
}
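
For reference, the finer-grained buckets mentioned in the comment, [5, 10), [10, 20), and [20, +inf), could be sketched as follows. `bucketAttempts` is a hypothetical name and the boundaries are only the ones floated in this thread, not what was merged.

```go
package main

import (
	"fmt"
	"strconv"
)

// bucketAttempts is a hypothetical variant of displayAttempts: small counts
// are reported exactly, larger counts are folded into half-open ranges so
// the label has a fixed, small set of possible values.
func bucketAttempts(n int) string {
	switch {
	case n < 5:
		return strconv.Itoa(n)
	case n < 10:
		return "[5, 10)"
	case n < 20:
		return "[10, 20)"
	default:
		return "[20, +inf)"
	}
}

func main() {
	for _, n := range []int{1, 4, 5, 9, 10, 19, 20, 100} {
		fmt.Printf("%d -> %s\n", n, bucketAttempts(n))
	}
}
```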

Member Author

The most important information from this is actually distinguishing pods scheduled from the first attempt, so perhaps we can change the label to first_attempt, and the value is true or false.

Member

I'm inclined to the current version as it exposes more info (maybe not that useful for now), while also able to infer whether it's the first attempt or not.
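
A minimal sketch of the two options discussed here, showing why a numeric attempts label still lets you tell first-attempt pods apart. Both function names are hypothetical, and the sketch assumes the attempt count includes the successful try, so a pod scheduled on its first try has attempts == 1.

```go
package main

import (
	"fmt"
	"strconv"
)

// firstAttemptLabel sketches the proposed first_attempt label. It assumes
// attempts counts every scheduling try, so a pod scheduled on its first
// try has attempts == 1.
func firstAttemptLabel(attempts int) string {
	return strconv.FormatBool(attempts == 1)
}

// isFirstAttempt recovers the same information from a numeric attempts
// label, which is why the numeric label carries at least as much
// information as the boolean one.
func isFirstAttempt(attemptsLabel string) bool {
	return attemptsLabel == "1"
}

func main() {
	for _, n := range []int{1, 2, 7} {
		numeric := strconv.Itoa(n)
		fmt.Printf("attempts=%s first_attempt=%s inferred=%t\n",
			numeric, firstAttemptLabel(n), isFirstAttempt(numeric))
	}
}
```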

Member Author

sg, I increased the limit to 15, I think it is small enough cardinality, and will likely cover most cases. We should increase the limit for PodSchedulingAttempts, I think 5 is too low.

Member

Shouldn't we display 15 as [15, +inf) or something similar? so that users would know it's not literally 15.

Member Author

changed it to "15+", likely clear enough, sg?
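
A sketch of what the "15+" labeling described in this thread could look like. The function and constant names (`attemptsLabel`, `maxAttemptsLabel`) are assumptions for illustration and may not match the merged code; only the limit of 15 and the "15+" value come from the discussion.

```go
package main

import (
	"fmt"
	"strconv"
)

// maxAttemptsLabel is the limit discussed in this thread; counts at or
// above it are folded into a single open-ended label value.
const maxAttemptsLabel = 15

// attemptsLabel returns the exact count for small values and "15+" once
// the count reaches the limit, so readers can see the value is a floor
// and not literally 15.
func attemptsLabel(attempts int) string {
	if attempts >= maxAttemptsLabel {
		return strconv.Itoa(maxAttemptsLabel) + "+"
	}
	return strconv.Itoa(attempts)
}

func main() {
	for _, n := range []int{1, 14, 15, 200} {
		fmt.Printf("%d -> %s\n", n, attemptsLabel(n))
	}
}
```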

Member

SG.

@ahg-g
Member Author

ahg-g commented Jun 30, 2020

/retest

@ahg-g
Member Author

ahg-g commented Jun 30, 2020

/priority important-soon

@k8s-ci-robot added the priority/important-soon label and removed the needs-priority label on Jun 30, 2020
@ahg-g force-pushed the ahg-attempts branch 2 times, most recently from 2552052 to fa18572, on July 1, 2020 16:34
@ahg-g
Member Author

ahg-g commented Jul 1, 2020

/assign @Huang-Wei

Aldo is on vacation for the next couple of days.

Member

@Huang-Wei Huang-Wei left a comment

LGTM generally. Just one comment.

@ahg-g force-pushed the ahg-attempts branch 3 times, most recently from ce6f48b to b7476e0, on July 1, 2020 21:32
@Huang-Wei
Member

/lgtm

@k8s-ci-robot added the lgtm label on Jul 1, 2020
@k8s-ci-robot removed the lgtm label on Jul 1, 2020
@Huang-Wei
Member

/lgtm

@ahg-g
Member Author

ahg-g commented Jul 1, 2020

/lgtm

sorry, updated the comment, can you lgtm again :)

@k8s-ci-robot added the lgtm label on Jul 1, 2020
@Huang-Wei
Member

/retest

@k8s-ci-robot
Contributor

k8s-ci-robot commented Jul 2, 2020

@ahg-g: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-e2e-kind-ipv6 | fa185728bf6fe7bdc269ce2de953385e89aa3e9f | link | /test pull-kubernetes-e2e-kind-ipv6 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot merged commit 15a9430 into kubernetes:master on Jul 2, 2020
@k8s-ci-robot added this to the v1.19 milestone on Jul 2, 2020
@ahg-g deleted the ahg-attempts branch on October 25, 2021 14:37
Labels: approved, cncf-cla: yes, kind/feature, lgtm, priority/important-soon, release-note, sig/instrumentation, sig/scheduling, size/S
5 participants