breakdown PodSchedulingDuration by number of attempts #92650

ahg-g · 2020-06-30T15:06:15Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Breakdown PodSchedulingDuration by number of attempts, this is useful when monitoring the scheduler to understand how scheduling attempts impact latency.

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Add attempts label to scheduler's PodSchedulingDuration metric.

k8s-ci-robot · 2020-06-30T15:07:27Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/scheduler/OWNERS~~ [ahg-g]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ahg-g · 2020-06-30T15:11:56Z

pkg/scheduler/scheduler.go

-			metrics.PodSchedulingDuration.Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp))
+
+			// We breakdown the pod scheduling duration by attempts (capped to a limit).
+			attempts := podInfo.Attempts


@logicalhan do we really need to cap? can I just use attempts as is?

while I don't think the number of attempts will have a wide range of values, leaving it unbounded is certainly not a good idea.

alculquicondor · 2020-06-30T17:26:23Z

pkg/scheduler/scheduler.go

+			if attempts > 50 {
+				attempts = 50
+			}
+			metrics.PodSchedulingDuration.WithLabelValues(string(attempts)).Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp))


Could you send a parent PR that changes the variable name InitialAttemptTimestamp to QueueuingTimestamp or something like that?

We should do that in a follow up PR to allow for patching this PR back if needed.

How about explicitly expressing that internally we have already "bucked" the attempts (we can adjust the buckets like [5,10), [10, 20), [20, +inf):

metrics.PodSchedulingDuration.WithLabelValues(displayAttempts(podInfo.Attempts)).Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp)) func displayAttempts(n int) string { if n < 5 { return string(n) } else if n < 20 { return "[5, 20)" } return "[20, +inf)" }

The most important information from this is actually distinguishing pods scheduled from the first attempt, so perhaps we can change the label to first_attempt, and the value is true or false.

I'm inclined to the current version as it exposes more info (maybe not that useful for now), while also able to infer whether it's the first attempt or not.

sg, I increased the limit to 15, I think it is small enough cardinality, and will likely cover most cases. We should increase the limit for PodSchedulingAttempts, I think 5 is too low.

Shouldn't we display 15 as [15, +inf) or something similar? so that users would know it's not literally 15.

changed it to "15+", likely clear enough, sg?

ahg-g · 2020-06-30T19:43:56Z

/retest

ahg-g · 2020-06-30T19:44:17Z

/priority important-soon

ahg-g · 2020-07-01T16:34:44Z

/assign @Huang-Wei

Aldo is on vacation for the next couple of days.

Huang-Wei

LGTM generally. Just one comment.

Huang-Wei · 2020-07-01T19:13:04Z

pkg/scheduler/scheduler.go

+			if attempts > 50 {
+				attempts = 50
+			}
+			metrics.PodSchedulingDuration.WithLabelValues(string(attempts)).Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp))


How about explicitly expressing that internally we have already "bucked" the attempts (we can adjust the buckets like [5,10), [10, 20), [20, +inf):

metrics.PodSchedulingDuration.WithLabelValues(displayAttempts(podInfo.Attempts)).Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp)) func displayAttempts(n int) string { if n < 5 { return string(n) } else if n < 20 { return "[5, 20)" } return "[20, +inf)" }

Huang-Wei · 2020-07-01T21:34:01Z

/lgtm

Huang-Wei · 2020-07-01T21:35:49Z

/lgtm

ahg-g · 2020-07-01T21:35:50Z

/lgtm

sorry, updated the comment, can you lgtm again :)

Huang-Wei · 2020-07-02T00:10:22Z

/retest

k8s-ci-robot · 2020-07-02T00:37:30Z

@ahg-g: The following test failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
pull-kubernetes-e2e-kind-ipv6	fa185728bf6fe7bdc269ce2de953385e89aa3e9f	link	`/test pull-kubernetes-e2e-kind-ipv6`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

fejta-bot · 2020-07-02T04:52:33Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

k8s-ci-robot requested review from alculquicondor and hex108 June 30, 2020 15:08

ahg-g commented Jun 30, 2020

View reviewed changes

alculquicondor reviewed Jun 30, 2020

View reviewed changes

k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 30, 2020

ahg-g force-pushed the ahg-attempts branch 2 times, most recently from 2552052 to fa18572 Compare July 1, 2020 16:34

k8s-ci-robot assigned Huang-Wei Jul 1, 2020

Huang-Wei reviewed Jul 1, 2020

View reviewed changes

ahg-g force-pushed the ahg-attempts branch 3 times, most recently from ce6f48b to b7476e0 Compare July 1, 2020 21:32

breakdown PodSchedulingDuration by number of attempts

d1ea49b

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 1, 2020

ahg-g force-pushed the ahg-attempts branch from b7476e0 to d1ea49b Compare July 1, 2020 21:34

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 1, 2020

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 1, 2020

k8s-ci-robot merged commit 15a9430 into kubernetes:master Jul 2, 2020

k8s-ci-robot added this to the v1.19 milestone Jul 2, 2020

github-actions bot mentioned this pull request Jul 8, 2020

Week Ending July 5, 2020 dev-obs/actus#190

Open

ahg-g deleted the ahg-attempts branch October 25, 2021 14:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

breakdown PodSchedulingDuration by number of attempts #92650

breakdown PodSchedulingDuration by number of attempts #92650

ahg-g commented Jun 30, 2020

k8s-ci-robot commented Jun 30, 2020

ahg-g Jun 30, 2020

ahg-g Jul 1, 2020

alculquicondor Jun 30, 2020

ahg-g Jun 30, 2020

Huang-Wei Jul 1, 2020

ahg-g Jul 1, 2020

Huang-Wei Jul 1, 2020

ahg-g Jul 1, 2020

Huang-Wei Jul 1, 2020

ahg-g Jul 1, 2020

Huang-Wei Jul 1, 2020

ahg-g commented Jun 30, 2020

ahg-g commented Jun 30, 2020

ahg-g commented Jul 1, 2020

Huang-Wei left a comment

Huang-Wei Jul 1, 2020

Huang-Wei commented Jul 1, 2020

Huang-Wei commented Jul 1, 2020

ahg-g commented Jul 1, 2020

Huang-Wei commented Jul 2, 2020

k8s-ci-robot commented Jul 2, 2020 •

edited

fejta-bot commented Jul 2, 2020

breakdown PodSchedulingDuration by number of attempts #92650

breakdown PodSchedulingDuration by number of attempts #92650

Conversation

ahg-g commented Jun 30, 2020

k8s-ci-robot commented Jun 30, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ahg-g commented Jun 30, 2020

ahg-g commented Jun 30, 2020

ahg-g commented Jul 1, 2020

Huang-Wei left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Huang-Wei commented Jul 1, 2020

Huang-Wei commented Jul 1, 2020

ahg-g commented Jul 1, 2020

Huang-Wei commented Jul 2, 2020

k8s-ci-robot commented Jul 2, 2020 • edited

fejta-bot commented Jul 2, 2020

k8s-ci-robot commented Jul 2, 2020 •

edited