
Revert "Faster scheduler" #78782

Merged

Conversation

@ahg-g (Contributor) commented Jun 7, 2019

Reverts #77509

There is the possibility of a race condition: the cached allNodes slice could change while we are iterating through it trying to find a node that fits.

Release note: NONE
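
To make the failure mode concrete, here is a minimal, self-contained sketch of that kind of race (illustrative only; the names and the writer loop are not the scheduler's actual code): one goroutine ranges over a shared slice while another appends to it without synchronization, which `go run -race` reports as a data race. In the scheduler, the writer corresponds to the cache updating allNodes as nodes come and go, and the reader to findNodesThatFit walking the cached slice.

```go
package main

import "sync"

func main() {
	// Hypothetical stand-in for the cached allNodes slice.
	nodeNames := []string{"node-1", "node-2", "node-3"}

	var wg sync.WaitGroup
	wg.Add(2)

	// Writer: simulates the cache adding nodes while scheduling is in flight.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			nodeNames = append(nodeNames, "node-new")
		}
	}()

	// Reader: simulates findNodesThatFit iterating over the cached slice.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			for range nodeNames {
			}
		}
	}()

	wg.Wait()
}
```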
@Huang-Wei (Member) commented Jun 7, 2019

I recalled this is a valid perf improvement. Instead of reverting the PR, can we just fix #78776 in a separate PR?

@liggitt (Member) commented Jun 7, 2019

/milestone v1.15

@k8s-ci-robot added this to the v1.15 milestone Jun 7, 2019

@liggitt (Member) commented Jun 7, 2019

Given the timeframe, I think it makes sense to revert now and reintroduce in 1.16 with a reworked implementation

@ahg-g (Contributor, Author) commented Jun 7, 2019

> I recalled this is a valid perf improvement. Instead of reverting the PR, can we just fix #78776 in a separate PR?

It is, but not significant compared to the "complexity" it introduces. The fix may not be straightforward, so it is probably safer to revert and unblock 1.15.

@liggitt (Member) commented Jun 7, 2019

/lgtm
/approve

@liggitt (Member) commented Jun 7, 2019

/kind bug
/priority critical-urgent

@k8s-ci-robot (Contributor) commented Jun 7, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit a1c5019 into kubernetes:master Jun 7, 2019

23 checks passed

cla/linuxfoundation: ahg-g authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-conformance-image-test: Skipped.
pull-kubernetes-cross: Skipped.
pull-kubernetes-dependencies: Job succeeded.
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-csi-serial: Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gce-iscsi: Skipped.
pull-kubernetes-e2e-gce-iscsi-serial: Skipped.
pull-kubernetes-e2e-gce-storage-slow: Skipped.
pull-kubernetes-godeps: Skipped.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-local-e2e: Skipped.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-node-e2e-containerd: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
pull-publishing-bot-validate: Skipped.
tide: In merge pool.
@misterikkit (Contributor) commented Jun 7, 2019

> I recalled this is a valid perf improvement. Instead of reverting the PR, can we just fix #78776 in a separate PR?

> It is, but not significant compared to the "complexity" it introduces. The fix may not be straightforward, so it is probably safer to revert and unblock 1.15.

It's also easy to "revert revert" and make sure to fix the bug in the new PR. cough

@@ -462,8 +461,8 @@ func (g *genericScheduler) findNodesThatFit(pod *v1.Pod, nodes []*v1.Node) ([]*v
 	if len(g.predicates) == 0 {
 		filtered = nodes
 	} else {
-		allNodes := g.cache.NodeTree().AllNodes()
-		numNodesToFind := g.numFeasibleNodesToFind(int32(len(allNodes)))
+		allNodes := int32(g.cache.NodeTree().NumNodes())

@tedyu (Contributor) commented Jun 7, 2019

I assume holding the lock over the course of checkNode execution would defeat the purpose of the original improvement.

@@ -167,7 +167,6 @@ type genericScheduler struct {
 	pvcLister         corelisters.PersistentVolumeClaimLister
 	pdbLister         algorithm.PDBLister
 	disablePreemption bool
-	lastIndex         int

@tedyu (Contributor) commented Jun 8, 2019

I moved the declaration of lastIndex into genericScheduler#findNodesThatFit.
I also let AllNodes make a copy of nt.allNodes:

// AllNodes returns a copy of the cached node names, so callers can iterate
// over the result without holding the NodeTree lock.
func (nt *NodeTree) AllNodes() []string {
	nt.mu.RLock()
	defer nt.mu.RUnlock()
	var nodes []string
	nodes = append(nodes, nt.allNodes...)
	return nodes
}

I wonder if the above is enough to handle the race condition.
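
For what it's worth, a self-contained sketch of the copy-under-RLock idea above (the NodeTree here is a simplified stand-in, not the scheduler's actual type): because the snapshot is taken while the read lock is held, the race detector no longer reports a conflict between an iterating reader and a concurrent writer. The reader may still act on a momentarily stale list, but that is a separate question from the data race itself.

```go
package main

import "sync"

// NodeTree is a simplified stand-in for the scheduler's node cache.
type NodeTree struct {
	mu       sync.RWMutex
	allNodes []string
}

// AllNodes returns a copy of the cached node list taken under the read lock,
// so callers can iterate the copy without holding the lock.
func (nt *NodeTree) AllNodes() []string {
	nt.mu.RLock()
	defer nt.mu.RUnlock()
	return append([]string(nil), nt.allNodes...)
}

// AddNode simulates the cache being updated concurrently.
func (nt *NodeTree) AddNode(name string) {
	nt.mu.Lock()
	defer nt.mu.Unlock()
	nt.allNodes = append(nt.allNodes, name)
}

func main() {
	nt := &NodeTree{allNodes: []string{"node-1", "node-2"}}

	var wg sync.WaitGroup
	wg.Add(2)

	go func() { // writer
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			nt.AddNode("node-new")
		}
	}()

	go func() { // reader: iterates a snapshot, never the live slice
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			for range nt.AllNodes() {
			}
		}
	}()

	wg.Wait()
}
```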

@tedyu (Contributor) commented Jun 8, 2019

Created #78806 to get test suite results.
