
Optimize scheduler cache snapshot to improve scheduling throughput #74041

Merged: 2 commits merged into kubernetes:master, Feb 21, 2019

Conversation

@bsalamat (Contributor) commented Feb 13, 2019

What type of PR is this?
/kind feature

What this PR does / why we need it:
The scheduler snapshots its cache at the beginning of each scheduling cycle so that it has a consistent view of the cluster during the cycle. CPU profiling results show that the scheduler spends a significant percentage of each scheduling cycle on snapshotting its cache.
This PR is an algorithmic optimization that dramatically improves the performance of the cache snapshot logic.
The algorithm keeps the cache entries in a doubly linked list where the most recently updated entry is at the head and the rest of the entries are ordered by recency of update. Each entry carries a global generation number (which existed before this PR). The snapshotting logic traverses the doubly linked list from the head and updates only those entries in the cache snapshot that have changed since the last snapshot was taken.
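The generation-based incremental snapshot described above can be sketched as follows. This is a simplified, hypothetical illustration, not the actual Kubernetes code: names such as `nodeInfoListItem`, `update`, and `updateSnapshot` are illustrative, and node deletions are ignored for brevity.

```go
package main

import "fmt"

// nodeInfoListItem holds per-node data in a doubly linked list ordered
// by recency of update (head = most recently updated).
type nodeInfoListItem struct {
	name       string
	generation int64
	next, prev *nodeInfoListItem
}

type cache struct {
	headNode   *nodeInfoListItem
	nodes      map[string]*nodeInfoListItem
	generation int64 // global, monotonically increasing generation counter
}

func newCache() *cache { return &cache{nodes: map[string]*nodeInfoListItem{}} }

// remove unlinks n from the doubly linked list.
func (c *cache) remove(n *nodeInfoListItem) {
	if n.prev != nil {
		n.prev.next = n.next
	} else {
		c.headNode = n.next
	}
	if n.next != nil {
		n.next.prev = n.prev
	}
	n.prev, n.next = nil, nil
}

// update bumps the node's generation and moves it to the head of the list,
// so the list stays ordered by recency of update.
func (c *cache) update(name string) {
	c.generation++
	n, ok := c.nodes[name]
	if !ok {
		n = &nodeInfoListItem{name: name}
		c.nodes[name] = n
	} else {
		c.remove(n)
	}
	n.generation = c.generation
	n.next = c.headNode
	if c.headNode != nil {
		c.headNode.prev = n
	}
	c.headNode = n
}

type snapshot struct {
	nodes      map[string]int64 // node name -> generation captured
	generation int64            // cache generation at the last snapshot
}

// updateSnapshot walks from the head and stops at the first entry that is
// not newer than the snapshot: every entry past it is unchanged, so the
// cost is proportional to the number of updated nodes, not cluster size.
func (c *cache) updateSnapshot(s *snapshot) int {
	copied := 0
	for n := c.headNode; n != nil; n = n.next {
		if n.generation <= s.generation {
			break // older entries were already captured
		}
		s.nodes[n.name] = n.generation
		copied++
	}
	s.generation = c.generation
	return copied
}

func main() {
	c := newCache()
	for _, n := range []string{"node-a", "node-b", "node-c"} {
		c.update(n)
	}
	s := &snapshot{nodes: map[string]int64{}}
	fmt.Println(c.updateSnapshot(s)) // first snapshot copies all 3 entries
	c.update("node-b")
	fmt.Println(c.updateSnapshot(s)) // only node-b changed, so it copies 1
}
```

The key property is that a full O(N) copy over all nodes is replaced by a traversal that terminates as soon as it reaches entries already captured by the previous snapshot.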

Our benchmarks show that the latency of scheduling a pod improved by over 20%, from 8.5ms to 6.7ms, in a 5000-node cluster when scheduling 10,000 pods:

BEFORE:
pkg: k8s.io/kubernetes/test/integration/scheduler_perf
BenchmarkScheduling/5000Nodes/1000Pods-12         	   10000	   8516461 ns/op
PASS
ok  	k8s.io/kubernetes/test/integration/scheduler_perf	107.756s

----------------------------------------
AFTER:

pkg: k8s.io/kubernetes/test/integration/scheduler_perf
BenchmarkScheduling/5000Nodes/1000Pods-12         	   10000	   6740178 ns/op
PASS
ok  	k8s.io/kubernetes/test/integration/scheduler_perf	89.193s

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Optimize scheduler cache snapshot algorithm to improve scheduling throughput.

/sig scheduling

@k8s-ci-robot (Contributor) commented Feb 13, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bsalamat bsalamat requested a review from wgliang Feb 13, 2019

@k8s-ci-robot k8s-ci-robot requested a review from resouer Feb 13, 2019

@bsalamat bsalamat force-pushed the bsalamat:optimize_snapshot branch from dcd47ef to a1ac298 Feb 14, 2019

@wgliang (Member) commented Feb 14, 2019

LGTM
Thanks Bobby.

@bsalamat bsalamat force-pushed the bsalamat:optimize_snapshot branch from a1ac298 to 7e935ef Feb 14, 2019

@misterikkit (Contributor) left a comment

haven't reviewed tests yet

// NodeInfoSnapshot is a snapshot of cache NodeInfo. The scheduler takes a
// snapshot at the beginning of each scheduling cycle and uses it for its
// operations in that cycle.
type NodeInfoSnapshot struct {

@misterikkit (Contributor), Feb 14, 2019

Is it reasonable for us to have two Snapshot types in this package? I don't think I've ever seen what the older one is used for.

@bsalamat (Author, Contributor), Feb 15, 2019

The older one takes a snapshot of the cache for the "cache comparer". It is invoked when SIGUSR2 is sent to the scheduler.

@misterikkit (Contributor), Feb 16, 2019

I see. Well, you've written good comments. Hopefully that is enough to prevent confusion.

Resolved: pkg/scheduler/core/generic_scheduler.go
Resolved: pkg/scheduler/internal/cache/cache.go
nodeTree: newNodeTree(nil),
assumedPods: make(map[string]bool),
podStates: make(map[string]*podState),
imageStates: make(map[string]*imageState),
}
}

// Snapshot takes a snapshot of the current schedulerinternalcache. The method has performance impact,
// and should be only used in non-critical path.
// newNodeInfoListItem initializes a new NodeInfo.

@misterikkit (Contributor), Feb 14, 2019

s/NodeInfo/nodeInfoListItem/

@bsalamat (Author, Contributor), Feb 15, 2019

done

Resolved: pkg/scheduler/internal/cache/cache.go
}

// Snapshot takes a snapshot of the current scheduler cache. This is used for
// debugging purposes only and shouldn't be confused with SnapshotNodInfo

@misterikkit (Contributor), Feb 14, 2019

typo in SnapshotNodInfo.

@bsalamat (Author, Contributor), Feb 15, 2019

done

// Snapshot takes a snapshot of the current scheduler cache. This is used for
// debugging purposes only and shouldn't be confused with SnapshotNodInfo
// function.
// This method has performance impact, and should be only used in non-critical path.

@misterikkit (Contributor), Feb 14, 2019

s/has performance impact/is expensive/

@bsalamat (Author, Contributor), Feb 15, 2019

done

// Transient scheduler info is reset here.
info.TransientInfo.ResetTransientSchedulerInfo()
node.info.TransientInfo.ResetTransientSchedulerInfo()

@misterikkit (Contributor), Feb 14, 2019

I know you're not changing this logic, but I don't like that it is here and I will look into it later.

@bsalamat (Author, Contributor), Feb 20, 2019

Yes, we should revisit this and move it out of here.

@bsalamat bsalamat force-pushed the bsalamat:optimize_snapshot branch from 7e935ef to 66a3c06 Feb 15, 2019

@bsalamat (Author, Contributor) left a comment

Thanks, @misterikkit, for your review. I addressed your comments. Given that GitHub now shows force-push diffs, I have already rebased my commits.


@misterikkit (Contributor) left a comment

Comments are mostly nits that can be ignored.

// beginning of every scheduling cycle.
// This function tracks generation number of NodeInfo and updates only the
// entries of an existing snapshot that have changed after the snapshot was taken.
func (cache *schedulerCache) SnapshotNodeInfo(nodeSnapshot *NodeInfoSnapshot) error {

@misterikkit (Contributor), Feb 16, 2019

I feel like we should keep "update" in the function name here.

@bsalamat (Author, Contributor), Feb 20, 2019

done


if !reflect.DeepEqual(snapshot.Nodes, cache.nodes) {
t.Fatalf("expect \n%+v; got \n%+v", cache.nodes, snapshot.Nodes)
if len(snapshot.Nodes) != len(cache.nodes) {
t.Fatalf("Unequal number of nodes in the cache and its snapshot. expeted: %v, got: %v", len(cache.nodes), len(snapshot.Nodes))

@misterikkit (Contributor), Feb 16, 2019

seems odd that this test is using Fatalf instead of Errorf, but we don't need to fix it in this PR.

@bsalamat (Author, Contributor), Feb 20, 2019

I changed them.

ObjectMeta: metav1.ObjectMeta{
Name: fmt.Sprintf("test-pod%v", i),
Namespace: "test-ns",
UID: types.UID(fmt.Sprintf("test-puid%v", i)),

@misterikkit (Contributor), Feb 16, 2019

Is UID needed in tests? Is it nice-to-have in tests?

@bsalamat (Author, Contributor), Feb 20, 2019

Our scheduling queue uses pod UID. So, some of our tests need to have pod UID. The queue is not engaged in this test though, but having UID doesn't hurt.

updatedPods = append(updatedPods, updatedPod)
}

type operation struct {

@misterikkit (Contributor), Feb 16, 2019

I like what you did here. I think you could take it further by defining helper funcs like,

addNode := func(i int) func() {
  return func() {
    cache.AddNode(nodes[i])
  }
}

Specifically, the string lookup for operator does not spark joy. 😉

@bsalamat (Author, Contributor), Feb 20, 2019

Good suggestion. Thanks!

}

if len(test.expected) != len(cache.nodes) {
t.Errorf("Test: %v, unexpected number of nodes. Expected: %v, got: %v", test.name, len(test.expected), len(cache.nodes))

@misterikkit (Contributor), Feb 16, 2019

Don't log test name. Instead, use t.Run(test.name, ...)

@bsalamat (Author, Contributor), Feb 20, 2019

done

i++
}

// If no snapshot was taken in operations, take one now for comparison.

@misterikkit (Contributor), Feb 16, 2019

If a snapshot was taken during operations, but it was not the last operation, I think the assertions below will fail.

Come to think of it, we aren't asserting anything about snapshots taken in the middle of the operations. Is that fine? (I think it's probably fine.) ((is it though?))

@bsalamat (Author, Contributor), Feb 20, 2019

My goal from doing so was to ensure that a "rolling update" of the snapshot works fine. There are also test cases to check that one snapshot taken in the whole process also works well.

@bsalamat (Author, Contributor), Feb 20, 2019

I just added some code to compare the snapshot with the cache every time the snapshot is updated. This will ensure that those midway snapshots are also correct.

t.Errorf("Test: %v, unexpected number of nodes. Expected: %v, got: %v", test.name, len(test.expected), len(cache.nodes))
}
var i int
for node := cache.headNode; node != nil; node = node.next {

@misterikkit (Contributor), Feb 16, 2019

It may be obvious from reading the code, but could we add a comment to this and the following for loop that one is checking internal cache state and the other is checking snapshot data?

@bsalamat (Author, Contributor), Feb 20, 2019

done

@bsalamat (Author, Contributor) left a comment

Thanks, @misterikkit! Addressed the comments. PTAL.


@bsalamat bsalamat force-pushed the bsalamat:optimize_snapshot branch 3 times, most recently from f8e1d48 to cbdcd63 Feb 20, 2019

@misterikkit (Contributor) left a comment

/lgtm

index int
var cache *schedulerCache
var snapshot NodeInfoSnapshot
addNode := func(i int) func() {

@misterikkit (Contributor), Feb 20, 2019

Oh, optionally you could write this as addNode := func(i int) operation {

@bsalamat (Author, Contributor), Feb 20, 2019

Done. Thanks for your review!

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 20, 2019

@bsalamat bsalamat force-pushed the bsalamat:optimize_snapshot branch from cbdcd63 to 337cb70 Feb 20, 2019

@k8s-ci-robot k8s-ci-robot removed the lgtm label Feb 20, 2019

@misterikkit (Contributor) commented Feb 20, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 20, 2019

@bsalamat (Author, Contributor) commented Feb 21, 2019

Looks like kubemark-e2e-gce-big has not passed in the past several runs. I'll cross my fingers and send a retest.

/retest

@k8s-ci-robot k8s-ci-robot merged commit 50328e0 into kubernetes:master Feb 21, 2019

16 checks passed

cla/linuxfoundation: bsalamat authorized
pull-kubernetes-bazel-build: Job succeeded
pull-kubernetes-bazel-test: Job succeeded
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce: Job succeeded
pull-kubernetes-e2e-gce-100-performance: Job succeeded
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded
pull-kubernetes-godeps: Skipped
pull-kubernetes-integration: Job succeeded
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded
pull-kubernetes-local-e2e: Skipped
pull-kubernetes-node-e2e: Job succeeded
pull-kubernetes-typecheck: Job succeeded
pull-kubernetes-verify: Job succeeded
pull-publishing-bot-validate: Skipped
tide: In merge pool

@bsalamat bsalamat deleted the bsalamat:optimize_snapshot branch Feb 21, 2019

@bsalamat (Author, Contributor) commented Feb 21, 2019

Our scalability results after this PR are in and available on our perf dashboard.
I am glad to announce that average scheduling throughput has reached 100 pods/s in a 5000-node cluster. This is an important milestone for improving the scalability of the Kubernetes scheduler.

Choose "gce-5000Nodes > Scheduler > SchedulingThroughput" in the perf dashboard to see the results. The latest at the time of this comment is run 314.
Our scalability tests are rate-limited at 100 pods/s, so the scheduler might achieve even higher throughput if the rate limit were lifted.

@wgliang (Member) commented Feb 22, 2019

It looks great, but I don't know if you noticed that our binding time (Perc99/Perc90) has increased a lot. Choose "gce-5000Nodes > Scheduler > SchedulingLatency > binding" in the perf dashboard.

@bsalamat (Author, Contributor) commented Feb 23, 2019

> It looks great, but I don't know if you noticed that our binding time (Perc99/Perc90) has increased a lot. Choose "gce-5000Nodes > Scheduler > SchedulingLatency > binding" in the perf dashboard.

Thanks for noticing that. Average binding time has not changed, but the 90th percentile has increased a lot. I think I know the reason: we have a rate limit of 100 qps to the API server. The scheduler's throughput now exceeds that limit and can schedule more than 100 pods per second. As a result, pods wait longer in the binding stage, and we see an increase in binding latency.
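A rough way to see why the tail grows: when pods are produced faster than the binding rate limit, a backlog builds up, and the pods at the back of the queue wait longest, which shows up in the high percentiles. The toy calculation below is illustrative arithmetic only, not scheduler code; the rates (110 pods/s against a 100 qps limit) are assumed for the example.

```go
package main

import "fmt"

// queueDelays returns the queueing delay (in seconds) of each of n pods
// when pods arrive at rate `in` per second but are bound at most `out`
// per second. With in > out and the first pod arriving at t=0, the
// binder is continuously busy, so pod i starts binding at i/out.
func queueDelays(n int, in, out float64) []float64 {
	delays := make([]float64, n)
	for i := 0; i < n; i++ {
		arrive := float64(i) / in
		start := float64(i) / out // rate-limited service start time
		if start < arrive {
			start = arrive // a pod cannot start before it arrives
		}
		delays[i] = start - arrive
	}
	return delays
}

func main() {
	// A ~10-second burst at 110 pods/s against a 100 qps binding limit.
	d := queueDelays(1100, 110, 100)
	fmt.Printf("p50 wait: %.2fs\n", d[len(d)/2])      // p50 wait: 0.50s
	fmt.Printf("p99 wait: %.2fs\n", d[len(d)*99/100]) // p99 wait: 0.99s
}
```

Even a modest 10% overshoot of the limit makes the tail latency grow linearly with the length of the burst, which is consistent with the 90th/99th percentile rising while short-lived or early bindings stay fast.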

@wgliang (Member) commented Feb 24, 2019

Sounds reasonable. Thank you for your reply. 👍
