scheduler: internal data structure optimization #81068
Conversation
/hold
I stopped the review because I think this implementation doesn't work. Specifically, it's not enough to rely on 2 critical paths when we can have "almost" arbitrary removal/addition of pods (see the comments for details).
We have 2 options:
- Implement a priority queue (maybe we can reuse some already implemented),
- Do a linear search for the minimum each time we add a node. This may be enough, because preemption is not a common case, but we should back it up with performance tests.
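The linear-search option is simple to sketch. A hypothetical illustration (the function name and map shape are assumptions, not taken from the PR):

```go
package main

import "fmt"

// findMinTopologyValue scans all topology values once and returns the one
// with the fewest matching pods. O(n) per call, which may be acceptable
// if preemption is rare.
func findMinTopologyValue(matchNum map[string]int32) (string, int32) {
	first := true
	var minVal string
	var minNum int32
	for v, n := range matchNum {
		if first || n < minNum {
			minVal, minNum = v, n
			first = false
		}
	}
	return minVal, minNum
}

func main() {
	counts := map[string]int32{"zone-a": 4, "zone-b": 2, "zone-c": 5}
	v, n := findMinTopologyValue(counts)
	fmt.Println(v, n) // zone-b 2
}
```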
Additionally, some memory can be saved if we store a map from node to topology keys (instead of pod to topologyKeys).
topologyPairsPodSpreadMap := &topologyPairsPodSpreadMap{
	// topologyKeyToMinPodsMap will be initialized with proper size later.
	topologyPairsMaps: newTopologyPairsMaps(),
	// TODO(Huang-Wei): It might be possible to use "make(map[topologyPair]*int32)".
We should consider implementing a well defined data structure in our utilities library. Atomic map counters is a very common use case for implementing scheduling algorithms. I'll give it a thought.
// is definitely greater than "paths[1].matchNum".
if v == paths[0].topologyValue {
	paths[0].matchNum = curMatch
	// paths[0].matchNum has increased by 1,
In this case, at least 2 keys had the same count, say (4, 4). After adding, we get (5, 4), so we need to swap to (4, 5). This works if those were the only 2 keys with that count. If we had (4, 4, 4), then after adding, the state should be (4, 4, 5).
`[2]criticalPath` in the preemption case doesn't always hold the minimum two paths. We only need to guarantee that paths[0] always holds the minimum matchNum, and that's also what our predicate algorithm relies on - comparing against the minMatch.
So in your case, it's fine that at some point it becomes [4, 5] (one of the original 4s gets lost). At any time we must keep the minMatch tracked correctly.
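A self-contained sketch of this invariant (hypothetical helper and values; the real update logic lives in the PR):

```go
package main

import "fmt"

type criticalPath struct {
	topologyValue string
	matchNum      int32
}

// addAndSwap simulates adding a pod to topology value tpVal and restoring
// the "slot 0 is minimal" invariant by swapping if needed. It returns the
// adjusted slots and the true global minimum for comparison.
func addAndSwap(paths [2]criticalPath, all map[string]int32, tpVal string) ([2]criticalPath, int32) {
	all[tpVal]++
	if paths[0].topologyValue == tpVal {
		paths[0].matchNum = all[tpVal]
	} else if paths[1].topologyValue == tpVal {
		paths[1].matchNum = all[tpVal]
	}
	if paths[0].matchNum > paths[1].matchNum {
		paths[0], paths[1] = paths[1], paths[0]
	}
	min := int32(1<<31 - 1)
	for _, n := range all {
		if n < min {
			min = n
		}
	}
	return paths, min
}

func main() {
	// Three zones all at 4 matching pods, but only two slots are tracked.
	all := map[string]int32{"zone-a": 4, "zone-b": 4, "zone-c": 4}
	paths := [2]criticalPath{{"zone-a", 4}, {"zone-b", 4}}

	// Add a pod to zone-a: (4, 4) becomes (5, 4), so the slots swap.
	paths, min := addAndSwap(paths, all, "zone-a")

	// zone-c (still at 4) is no longer tracked, yet the invariant holds:
	// paths[0].matchNum equals the true global minimum.
	fmt.Println(paths[0].matchNum == min) // true
}
```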
TestPreemptWithPermitPlugin flake +1: #81238
/retest
The latest (3rd) commit mostly involves renaming changes; it doesn't change the core logic/algorithm of this PR.
/retest
As discussed, let's leave the 2 critical paths implementation, documenting its caveats (in PR description and code comments).
@@ -337,21 +347,16 @@ func newTopologyPairsMaps() *topologyPairsMaps {

func (m *topologyPairsMaps) addTopologyPair(pair topologyPair, pod *v1.Pod) {
	podFullName := schedutil.GetPodFullName(pod)
	m.addTopologyPairWithoutPods(pair)
	if m.topologyPairToPods[pair] == nil {
Is this just reverting the initial changes to affinity structures?
Correct :)
// We don't record all critical paths here.
// If there is only one criticalPath, criticalPaths[0] always holds the minimum matchNum;
// If there are multiple criticalPaths, keep arbitrary 2 of them holding the minimum matchNum.
tpKeyToCriticalPaths map[string][2]*criticalPath
There are 3 implementations for updating this array. Let's make it a type:
type criticalPaths [2]*struct{ topologyValue string; matchNum int32 }
with only one function: func (*criticalPaths) update(topologyValue string, matchNum int32)
and a constructor.
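A runnable sketch of what such a type might look like, following this suggestion (the field names, the value-type slots, and the MaxInt32 sentinel are assumptions, not the PR's final code):

```go
package main

import "fmt"

// criticalPath tracks one topology value and its matching-pod count.
type criticalPath struct {
	topologyValue string
	matchNum      int32
}

// criticalPaths[0] always holds the minimum matchNum; criticalPaths[1] is
// greater or equal, but not necessarily the second minimum.
type criticalPaths [2]criticalPath

func newCriticalPaths() *criticalPaths {
	// math.MaxInt32 sentinels mean "unset".
	return &criticalPaths{{matchNum: 1<<31 - 1}, {matchNum: 1<<31 - 1}}
}

// update records a new matchNum for tpVal, keeping paths[0] minimal.
func (p *criticalPaths) update(tpVal string, num int32) {
	// Is tpVal already one of the two tracked values?
	i := -1
	if tpVal == p[0].topologyValue {
		i = 0
	} else if tpVal == p[1].topologyValue {
		i = 1
	}
	if i >= 0 {
		// In-place update, then swap if slot 0 is no longer minimal.
		p[i].matchNum = num
		if p[0].matchNum > p[1].matchNum {
			p[0], p[1] = p[1], p[0]
		}
	} else if num < p[0].matchNum {
		// tpVal becomes the new minimum; the old minimum shifts to slot 1.
		p[1] = p[0]
		p[0] = criticalPath{topologyValue: tpVal, matchNum: num}
	} else if num < p[1].matchNum {
		p[1] = criticalPath{topologyValue: tpVal, matchNum: num}
	}
}

func main() {
	p := newCriticalPaths()
	p.update("zone-a", 4)
	p.update("zone-b", 4)
	p.update("zone-a", 5) // zone-a grows; zone-b becomes the minimum
	fmt.Println(p[0].topologyValue, p[0].matchNum) // zone-b 4
}
```

Using a value array (rather than pointers) also makes copying the struct for per-node metadata a plain assignment.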
Also, we are choosing int32 here, whereas we are using int64 for the priorities
To make them consistent, I'd like to change int64 in priorities to int32, any preference?
Since this is counting pods, that should be fine.
}

// sortCriticalPaths only serves testing purposes.
func (c *podSpreadCache) sortCriticalPaths() {
can we move this to metadata_test.go?
/retest
/retest
	tpPairToMatchNum: make(map[topologyPair]int32),
}
for tpKey, paths := range c.tpKeyToCriticalPaths {
-	copyPaths := [2]*criticalPath{
+	copyPaths := criticalPaths{
nit: criticalPaths{paths[0], paths[1]}
Neat! It's an array, so we can simply put its value (as a copy) here.
@@ -1635,7 +1635,7 @@ func TestPodSpreadCache_removePod(t *testing.T) {
	} else {
		deletedPod = tt.deletedPod
	}
-	podSpreadCache.removePod(deletedPod, tt.preemptor, tt.nodes[tt.nodeIdx])
+	podSpreadCache.updatePod(deletedPod, tt.preemptor, tt.nodes[tt.nodeIdx], -1)
Not a big fan of the magic numbers. Maybe just add addPod and removePod that call updatePod? But this is a nit.
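The suggested wrappers are a thin layer over updatePod. A hypothetical sketch (the struct is reduced to a single count map for illustration; the real signatures take a pod, preemptor, and node):

```go
package main

import "fmt"

// podSpreadCache keeps a per-topology-value match count; only the field
// relevant to this sketch is shown.
type podSpreadCache struct {
	matchNum map[string]int32
}

// updatePod adjusts the count for a pod's topology value by delta (+1 or -1).
func (c *podSpreadCache) updatePod(tpVal string, delta int32) {
	c.matchNum[tpVal] += delta
}

// addPod and removePod hide the raw delta from callers, as the review suggests.
func (c *podSpreadCache) addPod(tpVal string)    { c.updatePod(tpVal, 1) }
func (c *podSpreadCache) removePod(tpVal string) { c.updatePod(tpVal, -1) }

func main() {
	c := &podSpreadCache{matchNum: map[string]int32{}}
	c.addPod("zone-a")
	c.addPod("zone-a")
	c.removePod("zone-a")
	fmt.Println(c.matchNum["zone-a"]) // 1
}
```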
Updated.
/lgtm
/hold cancel Comments addressed, and commits (except for the benchmark one) have been squashed. @alculquicondor PTAL.
/hold @ahg-g wants to do a quick review.
/lgtm
/retest
Thanks! This is clearly more efficient and easier to reason about compared to the original implementation.
type podSpreadCache struct {
	// We don't record all critical paths here.
	// If there is only one criticalPath, criticalPaths[0] always holds the minimum matchNum;
	// If there are multiple criticalPaths, keep arbitrary 2 of them holding the minimum matchNum.
This comment is a little confusing, to me it reads as if we are keeping two values only when we have more than two with the same minimum matchNum, which is not the case. I would re-phrase this whole three-line comment as: we keep track of the two topology values with the minimum matchNum.
That sounds too short... How about:
// We record 2 critical paths instead of all critical paths here.
// criticalPaths[0].matchNum always holds the minimum matching number.
// criticalPaths[1].matchNum is always greater or equal to criticalPaths[0].matchNum,
// but it's not guaranteed to be the 2nd minimum match number.
} else {
	// `tpVal` doesn't exist
	if num < paths[0].matchNum {
		// update paths[1] with paths[1]
with paths[0]
Done.
j := (i + 1) % 2 // the other index
// adjust i and j to make sure paths[0] always holds the minimum match num.
if (i < j && paths[i].matchNum > paths[j].matchNum) ||
	(i > j && paths[i].matchNum < paths[j].matchNum) {
can this condition be simplified as: if (paths[0].matchNum > paths[1].matchNum)
It looks good.
Another (fancy but maybe hard to read) way I thought of was: i < j == paths[i].matchNum > paths[j].matchNum.
It looks much better now :)
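To see why the two forms agree, note that whichever slot i was just updated, both branches reduce to comparing slot 0 against slot 1. A small sketch (hypothetical, with matchNum reduced to a bare int32):

```go
package main

import "fmt"

// origCondition is the index-based form from the diff: i is the slot just
// updated, j the other slot.
func origCondition(paths [2]int32, i int) bool {
	j := (i + 1) % 2
	return (i < j && paths[i] > paths[j]) || (i > j && paths[i] < paths[j])
}

// simpleCondition is the reviewer's simplification: slot 0 must stay minimal,
// so the only observable violation is paths[0] > paths[1].
func simpleCondition(paths [2]int32) bool {
	return paths[0] > paths[1]
}

func main() {
	// The two forms agree for every slot index and ordering.
	cases := [][2]int32{{5, 4}, {4, 5}, {4, 4}}
	for _, c := range cases {
		for i := 0; i < 2; i++ {
			fmt.Println(origCondition(c, i) == simpleCondition(c)) // true
		}
	}
}
```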
// CAVEAT: the reason that `[2]criticalPath` can work is based on the implementation of the current
// preemption algorithm - we only preempt pods on the same node, instead of pods on multiple nodes.
// If we plan to turn to a more complex algorithm like "arbitrary pods on multiple nodes", this
// structure needs to be revisited.
we need to also clarify that this works also because each node is evaluated on a separate copy of the metadata.
SG.
@ahg-g Comments addressed. PTAL.
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ahg-g, Huang-Wei. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
- go test k8s.io/kubernetes/pkg/scheduler/algorithm/predicates -benchmem -run=^$ -bench .
- Rename 'topologyPairsPodSpreadMap' to 'podSpreadCache'
- New struct `criticalPaths`
- Add unified method `*criticalPaths.update()` for:
  - regular update
  - addPod in preemption case
  - removePod in preemption case
@ahg-g @alculquicondor Commits squashed. Please help re-apply /lgtm. Thanks for the effort reviewing this!
/hold cancel
/lgtm
/retest
/retest
What type of PR is this?
/kind feature
What this PR does / why we need it:
Optimize the data structure of `topologySpreadMap` (renamed to `podSpreadCache`) to reduce memory and runtime overhead, and also boost the performance of predicates and preemption of EvenPodsSpread. Here is the new structure:
CAVEAT: the reason this PR (in particular the `[2]criticalPath` structure) can work is based on the implementation of the current preemption algorithm:
Fact 1: we only preempt pods on the same node, instead of pods on multiple nodes.
Fact 2: each node is evaluated on a separate copy of the metadata during its preemption cycle.
If we plan to turn to a more complex algorithm like "arbitrary pods on multiple nodes", this enhancement needs to be revisited.
Which issue(s) this PR fixes:
Part of #77284. Resolve comment1 and comment2.
Special notes for your reviewer:
The performance looks good (~6X in time, ~4X in space). Let's focus the review on functionality.
Before
After
Does this PR introduce a user-facing change?:
/sig scheduling
/assign @ahg-g @alculquicondor
cc/ @bsalamat @liu-cong @draveness @krmayankk