Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix some scheduler metrics(pending_pods and schedule_attempts_total) are not recorded. #87692

Merged

Conversation

everpeace
Copy link
Contributor

What type of PR is this?

/kind bug
/sig scheduling

What this PR does / why we need it:

fix two scheduler metrics (pending_pods and schedule_attempts_total) are not recorded.

Which issue(s) this PR fixes:

Fixes #87690

Special notes for your reviewer:

scheduler/metrics.{PodScheduleSuccesses, PodScheduleFailures, PodScheduleErrors} and scheduler/metrics.PendingPodsRecorders(ActivePods, UnschedulablePods, BackoffPods) seems to too early to be initialized.

All the k8s.io/component-base/metrics's metrics primitives are lazy metric. So, calling some_metric.With(labels) before registration returns noop metrics.

So I made mainly two modifications.

  • move scheduler/metrics.Registry() before crerating scheduler queue
  • delayed PodScheduleSuccesses, PodScheduleFailures, PodScheduleErrors metrics' initialization

Does this PR introduce a user-facing change?:

fixed two scheduler metrics (pending_pods and schedule_attempts_total) not being recorded

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

none.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 30, 2020
@everpeace everpeace changed the title Fix scheduler queue metrics(pending_pods and schedule_attempts_total) are not recorded. Fix some scheduler metrics(pending_pods and schedule_attempts_total) are not recorded. Jan 30, 2020
@everpeace everpeace force-pushed the fix-scheduler-queue-metrics branch 2 times, most recently from 05ed996 to 26b7452 Compare January 30, 2020 15:43
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 30, 2020
@everpeace
Copy link
Contributor Author

/retest

Copy link
Member

@ahg-g ahg-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @liu-cong

We should back port this to 1.17

@@ -291,6 +291,8 @@ func New(client clientset.Interface,
nodeInfoSnapshot: snapshot,
}

metrics.Register()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why moving this to here makes a difference?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, because it has to be done before initializing the scheduling_queue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is the reason. I wanted to find a more essential & safe way to Register metric instances, though.

@@ -262,6 +265,9 @@ func Register() {
legacyregistry.MustRegister(metric)
}
volumeschedulingmetrics.RegisterVolumeSchedulingMetrics()
PodScheduleSuccesses = scheduleAttempts.With(metrics.Labels{"result": "scheduled"})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of doing this, perhaps we should follow the pendingPods metric approach, define functions that return the metric (see lines 275-292)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I guess we access those handlers directly in the scheduler, ok I guess this is fine.

@ahg-g
Copy link
Member

ahg-g commented Jan 30, 2020

Looking at v1.16 and v1.15, registering the metrics has always been done after creating the queue, I wonder why this is an issue now.

@liu-cong
Copy link
Contributor

We should back port this to 1.17

Isn't this a bug since the beginning so we need to backport to all applicable versions?

@ahg-g
Copy link
Member

ahg-g commented Jan 30, 2020

We should back port this to 1.17

Isn't this a bug since the beginning so we need to backport to all applicable versions?

Yes we do, I realized that this is a bug in 1.16 and 1.15 after I posted this.

@ahg-g
Copy link
Member

ahg-g commented Jan 30, 2020

I think commit 8da448d is the main culprit, it switched the scheduler from Prometheus to k8s wrapper in 1.16, but didn't fix the registration.

Ok, so this fix must be backported to 1.17 and 1.16 (just backport the call to Register to happen earlier since the attempts metric doesn't exist in 1.16)

@ahg-g
Copy link
Member

ahg-g commented Jan 30, 2020

/priority critical-urgent

@k8s-ci-robot
Copy link
Contributor

@ahg-g: The label(s) priority/critical-urget cannot be applied, because the repository doesn't have them

In response to this:

/priority critical-urget

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 30, 2020
@ahg-g
Copy link
Member

ahg-g commented Jan 30, 2020

/retest
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 30, 2020
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 30, 2020
@ahg-g
Copy link
Member

ahg-g commented Jan 30, 2020

/retest

@everpeace
Copy link
Contributor Author

Thanks for the quick review!!

/retest

@everpeace
Copy link
Contributor Author

everpeace commented Jan 31, 2020

hmm, although I'm not familiar with e2e test, I believe that my change would not affect e2e. I'm not sure what I could do other than /retest .

@everpeace
Copy link
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 31, 2020
@haosdent
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 31, 2020
@haosdent
Copy link
Member

/retest

Copy link
Contributor

@draveness draveness left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, please squash the commit.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 31, 2020
because metric initializations are too early. This causes actual metric
instance become no-op.

modification made in thie commit to make sure actual metric instance won't be no-op metrics:

- re-initialize scheduler/metrics.PodSchedule{Successes, Failure, Errors} after metric creation
- scheduler/metrics.Register() should be called before initializing SchedulingQueue,
@everpeace
Copy link
Contributor Author

everpeace commented Jan 31, 2020

squashed.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 31, 2020
@everpeace
Copy link
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 31, 2020
@haosdent
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 31, 2020
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 7f0ea14 into kubernetes:master Jan 31, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Jan 31, 2020
k8s-ci-robot added a commit that referenced this pull request Feb 4, 2020
…-upstream-release-1.16

Automated cherry pick of #87692: Fix pending_pods, schedule_attempts_total was not recorded
k8s-ci-robot added a commit that referenced this pull request Feb 4, 2020
…-upstream-release-1.17

Automated cherry pick of #87692: Fix pending_pods, schedule_attempts_total was not recorded
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

some scheduler metrics(pending_pods, schedule_attempts_total) was not recorded.
8 participants