Revert "Surface more information about plugin scores in scheduler" #99914

Huang-Wei · 2021-03-07T03:57:22Z

What type of PR is this?

/kind regression
/sig scheduling

What this PR does / why we need it:

This PR reverts #99411.

Tried locally that change v(4) to v(5) works, but that doesn't fix the root cause - users who customized logging level to 5 or greater can still encounter this issue. So I proposed to revert #99411 and dig out why it regresses.

Which issue(s) this PR fixes:

Fixes #99913

NONE

This reverts commit d09a841.

k8s-ci-robot · 2021-03-07T03:57:29Z

@Huang-Wei: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Huang-Wei · 2021-03-07T03:57:55Z

/priority critical-urgent

/assign @damemi
/cc @ahg-g @alculquicondor

k8s-ci-robot · 2021-03-07T03:58:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/scheduler/OWNERS~~ [Huang-Wei]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ahg-g · 2021-03-07T04:04:10Z

can you explain the regression?

ahg-g · 2021-03-07T04:04:36Z

but that doesn't fix the root cause - users who customized logging level to 5 or greater can still encounter this issue.

Which issue?

ahg-g · 2021-03-07T04:38:08Z

oh, you are referring to #99913

Well, the benchmark should be running with the default v=2 logging level. I am guessing that increasing the logging level to 3 or 4 is done sometimes temporarily.

That being said, I don't quite like the way #99411 was implemented, it seems a bit ad hoc.

Huang-Wei · 2021-03-07T06:29:49Z

oh, you are referring to #99913

Yes, I referred to #99913. Sorry, there was a typo.

Well, the benchmark should be running with the default v=2 logging level. I am guessing that increasing the logging level to 3 or 4 is done sometimes temporarily.

Here is the magic:

kubernetes/test/integration/framework/master_utils.go

Lines 131 to 136 in 90851a0

    
           // Ensure we log at least level 4 
        
           v := flag.Lookup("v").Value 
        
           level, _ := strconv.Atoi(v.String()) 
        
           if level < 4 { 
        
           	v.Set("4") 
        
           }

But using which logging level is not the key point - #99411 is beneficial only when the desired logging level is enabled, but when that level is enabled, performance downgrades.

damemi

The performance drop is unexpected, though I would like to continue working on this to find a more efficient way to surface information like this.

I think we made some decisions in the original PR that have affected this, such as duplicating the calculation loop in order to separate out the logging information into its own function.

Due to that, we are essentially doing much of this work twice (which would explain the ~50% drop in throughput). I think that some reasonable level of efficiency drop is acceptable, because like @ahg-g said this is not a default logging level. This much is not acceptable.

Thanks for catching this and opening the reversion @Huang-Wei. I'll work to get a new PR with performance testing. Any other suggestions for how to improve this are welcome.
/lgtm

Huang-Wei · 2021-03-07T21:27:04Z

Thanks for catching this and opening the reversion @Huang-Wei. I'll work to get a new PR with performance testing. Any other suggestions for how to improve this are welcome.

Thanks @damemi ! As code freeze is around the corner, reverting may be the most viable option to unblock other PRs.

And it's easy to reproduce the perf drop, reverting can buy us more time to find out the root cause.

Huang-Wei · 2021-03-07T21:27:08Z

/retest

alculquicondor · 2021-03-08T14:51:52Z

I don't think the performance drop is unexpected. But I don't think it's related to the calculation. It's probably about IO. We have 5000 nodes in those tests. That's a lot of writes.

I just didn't think that tests would run at level 4 :(

Revert "Surface more information about plugin scores in scheduler"

8f5416c

This reverts commit d09a841.

k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 7, 2021

k8s-ci-robot assigned damemi Mar 7, 2021

k8s-ci-robot requested review from ahg-g and alculquicondor March 7, 2021 03:57

k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 7, 2021

k8s-ci-robot requested review from adtac and wgliang March 7, 2021 03:58

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 7, 2021

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 7, 2021

damemi reviewed Mar 7, 2021

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 7, 2021

k8s-ci-robot merged commit 9a9a9b0 into kubernetes:master Mar 7, 2021

k8s-ci-robot added this to the v1.21 milestone Mar 7, 2021

Huang-Wei deleted the throughput-regression branch March 8, 2021 02:29

damemi mentioned this pull request Jun 21, 2021

More debug logs to check scheduler spread algorithm #101820

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revert "Surface more information about plugin scores in scheduler" #99914

Revert "Surface more information about plugin scores in scheduler" #99914

Huang-Wei commented Mar 7, 2021 •

edited

k8s-ci-robot commented Mar 7, 2021

Huang-Wei commented Mar 7, 2021

k8s-ci-robot commented Mar 7, 2021

ahg-g commented Mar 7, 2021

ahg-g commented Mar 7, 2021

ahg-g commented Mar 7, 2021

Huang-Wei commented Mar 7, 2021

damemi left a comment

Huang-Wei commented Mar 7, 2021

Huang-Wei commented Mar 7, 2021

alculquicondor commented Mar 8, 2021

Revert "Surface more information about plugin scores in scheduler" #99914

Revert "Surface more information about plugin scores in scheduler" #99914

Conversation

Huang-Wei commented Mar 7, 2021 • edited

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

k8s-ci-robot commented Mar 7, 2021

Huang-Wei commented Mar 7, 2021

k8s-ci-robot commented Mar 7, 2021

ahg-g commented Mar 7, 2021

ahg-g commented Mar 7, 2021

ahg-g commented Mar 7, 2021

Huang-Wei commented Mar 7, 2021

damemi left a comment

Choose a reason for hiding this comment

Huang-Wei commented Mar 7, 2021

Huang-Wei commented Mar 7, 2021

alculquicondor commented Mar 8, 2021

Huang-Wei commented Mar 7, 2021 •

edited