
feat: use named array instead of array in normalizing score #80901

Merged

Conversation

@draveness
Member

commented Aug 2, 2019

/kind feature
/sig scheduling
/priority important-soon
/assign @bsalamat @ahg-g

What this PR does / why we need it:

I was refactoring the interpod-affinity priority with the scheduling framework and found it quite hard to connect the node name passed to the Score plugin with the corresponding entry of the NodeScoreList passed to NormalizeScore, since the list is only addressable by index.

// nodeName is available here
func (ipa *InterPodAffinityPlugin) Score(pc *framework.PluginContext, pod *v1.Pod, nodeName string) (int, *framework.Status) {

// only a positional index into scores is available here
func (ipa *InterPodAffinityPlugin) NormalizeScore(pc *framework.PluginContext, pod *v1.Pod, scores framework.NodeScoreList) *framework.Status {

The priority functions use a HostPriority, which contains both the node name and the score; it would be better to provide the node name information in the normalizing-score phase as well.

We have two options for optimising this:

  1. Use a nodeName -> score map, which this PR tries to introduce
  2. Use an array of HostPriority-like structs, e.g.:

type NodeScore struct {
	Name  string
	Score int
}

I recall that the API convention documents state that Kubernetes prefers named arrays to maps. I could go with the second option if that is the convention we must follow here, but for now I prefer the first. What do you think? @bsalamat @ahg-g

Which issue(s) this PR fixes:

ref: #80667

Does this PR introduce a user-facing change?:

Use a named array instead of a score array in normalizing-score phase.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@draveness

Member Author

commented Aug 2, 2019

/retest

@alculquicondor

Member

commented Aug 2, 2019

/hold
Changes to the framework interfaces are considered user facing. Can we open an issue for this to start a discussion?

@alculquicondor

Member

commented Aug 2, 2019

/cc @liu-cong
who has been looking into normalize score interface as well

@k8s-ci-robot

Contributor

commented Aug 2, 2019

@alculquicondor: GitHub didn't allow me to request PR reviews from the following users: liu-cong.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @liu-cong
who has been looking into normalize score interface as well

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@draveness

Member Author

commented Aug 2, 2019

Changes to the framework interfaces are considered user facing. Can we open an issue for this to start a discussion?

I sent this PR to start a discussion about the normalizing-score phase, since we ran into problems when transforming the original priority functions into the scheduling framework. Feel free to leave comments and opinions in this PR.

@liu-cong

Contributor

commented Aug 2, 2019

/assign

@liu-cong

Contributor

commented Aug 2, 2019

This does involve an interface change; please document it under "Does this PR introduce a user-facing change?".

This seems to be a reasonable change. I cannot think of any objections to it besides more memory usage. I will wait for opinions from others.

@alculquicondor

Member

commented Aug 2, 2019

It's not clear to me why the change is needed. Normalization should be totally independent of the node information.

@draveness

Member Author

commented Aug 3, 2019

It's not clear to me why the change is needed. Normalization should be totally independent of the node information.

Ideally, normalization would be totally independent of node information, and many of the priority functions indeed do not use the node name in HostPriority. However, we have some real cases that need the node name and other information during the normalizing phase.

Currently, the reduce phase of a priority function is similar to the normalizing phase in the scheduling framework. The reduce functions accept a list of node names and scores as an argument, but the normalize-score function currently only accepts a list of scores, which lacks the node name information.

If you look at the priority functions with a reduce phase, some of them use the node name to get more information; e.g., SelectorSpread uses the node name to get the zone ID.

And we need to provide node names in the normalizing phase for the following reasons:

  1. It keeps us consistent with the original reduce priority functions (HostPriority); otherwise we can't migrate some of them to the scheduling framework, e.g. SelectorSpread, InterPodAffinity, and EvenPodsSpreading
  2. It is good for debugging purposes, e.g.:

klog.Infof("%v -> %v: EvenPodsSpreadPriority, Score: (%d)", pod.Name, nodeName, *score)
klog.Infof("%v -> %v: InterPodAffinityPriority, Score: (%d)", pod.Name, node.Name, int(fScore))
klog.Infof("%v -> %v: SelectorSpreadPriority, Score: (%d)", pod.Name, result[i].Host, int(fScore))

The InterPodAffinity and EvenPodsSpread priorities use a deprecated pattern and are quite different from the other priority functions. In the WIP PR #80898, I tried to refactor them with the scheduling framework and ran into this problem. Even if we could find another approach to refactor them without the node name, SelectorSpread would still need the node name to get the zone ID.

Based on the reasons above, I'd like to add the node name to the normalizing-score phase.

cc/ @bsalamat @ahg-g

@ahg-g
Member

left a comment

Can you elaborate why you are recommending a map over a "named array"?

score, status := pl.Score(pc, pod, nodes[index].Name)
if !status.IsSuccess() {
	errCh.SendErrorWithCancel(fmt.Errorf(status.Message()), cancel)
	return
}
-pluginToNodeScoreMap[pl.Name()][index] = score
+pluginToNodeScoreMap[pl.Name()][nodeName] = score

@ahg-g

ahg-g Aug 6, 2019

Member

I suspect you need to synchronize access when writing to the map (a pre-allocated slice doesn't need this), since this practically adds a new key-value pair to it.

@draveness

Member Author

commented Aug 6, 2019

Can you elaborate why you are recommending a map over a "named array"?

I don't have a preference. What do you think of these two options?

@ahg-g

Member

commented Aug 6, 2019

Can you elaborate why you are recommending a map over a "named array"?

I don't have a preference. What do you think of these two options?

I think it is better to follow the sig-arch recommendation.

@draveness

Member Author

commented Aug 6, 2019

I think it is better to follow the sig-arch recommendation.

I'll make the change.

@draveness draveness force-pushed the draveness:feature/use-map-instead-of-array branch from b152920 to 68dc5da Aug 6, 2019

@alculquicondor

Member

commented Aug 7, 2019

/lgtm

@Huang-Wei
Member

left a comment

@draveness I may be missing something, but my impression is that this PR still doesn't resolve the core issue of the deprecated priorities, which is that the nodes slice here contains only the nodes that passed the Predicates. So the Score() function still doesn't have a handle on the info of all nodes?

	// Run the Score plugins.
	scoresMap, scoreStatus := framework.RunScorePlugins(pluginContext, pod, nodes)

(The nodes come from one of the parameters of PrioritizeNodes())

-for _, scoreList := range scoresMap {
+for _, nodeScoreList := range scoresMap {
	for i := range nodes {
		result[i].Score += scoreList[i]

@Huang-Wei

Huang-Wei Aug 7, 2019

Member

Can this be merged into the previous for i := range nodes loop, after L806?

@draveness

draveness Aug 8, 2019

Author Member

Good catch. Done.

@draveness draveness force-pushed the draveness:feature/use-map-instead-of-array branch from 58dda26 to aa5f9fd Aug 8, 2019

@k8s-ci-robot k8s-ci-robot removed the lgtm label Aug 8, 2019

@draveness

Member Author

commented Aug 8, 2019

@draveness I may miss something, but my impression is that this PR still doesn't resolve the core issue of the deprecated priorities - which is: the nodes slice here are the ones that passed Predicates. So the Score() function still doesn't have the handle of all nodes info?

We could get the nodes from frameworkHandle.NodeInfoSnapshot(). I could put up an implementation refactoring interpod-affinity for review after this gets merged; if it does not work well, I'll propose adding nodes to the Score function. And even then, using a named array would still make the API easier to use. Does this make sense to you? @Huang-Wei

/lgtm

All being said, I am not opposed to having a named array (map) to keep score of nodes. I think it will be easier to use.

/hold cancel

I'll cancel the hold

@Huang-Wei

Member

commented Aug 8, 2019

We could get nodes from the frameworkHandle.NodeInfoSnapshot()

Looks good.

I could give an implementation of refactoring interpod-affinity for review after this gets merged.

Thanks. Let's see if it works.

@Huang-Wei

Member

commented Aug 8, 2019

/lgtm

Will leave /approve to @bsalamat or @ahg-g.

@draveness

Member Author

commented Aug 8, 2019

/retest

@draveness

Member Author

commented Aug 8, 2019

Friendly ping @bsalamat and @ahg-g for approval. Thanks!

@@ -31,10 +31,16 @@ import (
 type Code int

 // NodeScoreList declares a list of nodes and their scores.
-type NodeScoreList []int
+type NodeScoreList []NodeScore

@bsalamat

bsalamat Aug 9, 2019

Member

Wouldn't it be better to use a map from node names to node score (map[string]int) instead of NodeScoreList? With the current data structure, finding the score of a node requires iterating over a NodeScoreList.

@draveness

draveness Aug 9, 2019

Author Member

Wouldn't it be better to use a map from node names to node score (map[string]int) instead of NodeScoreList? With the current data structure, finding the score of a node requires iterating over a NodeScoreList.

I don't have a preference here

I recalled that in the API convention documents, kubernetes prefer the named array to map. I could go with the second option if it is the convention we must follow here, but I prefer the first for now. What do you think of this?

@bsalamat

bsalamat Aug 9, 2019

Member

We have to think carefully about this question: "Do we need to find the score of a particular node in our code?" If the answer is no, we should definitely go with the current List instead of a map. Lists are faster to access than maps if we don't need to search for anything.

@draveness

draveness Aug 9, 2019

Author Member

We have to think carefully about this question: "Do we need to find the score of a particular node in our code?"

+1

If the answer is no, we should definitely go with the current List instead of a map. Lists are faster to access compared to maps if we don't need to search for something.

For the implementations of InterPodAffinity and EvenPodSpread, we don't need to find the score of a particular node in NormalizeScore; we only use the node name from the named array (or map) to look up data in the plugin's internal data structure.

@ahg-g

ahg-g Aug 9, 2019

Member

Unless we have a strong reason to use map (which I doubt we will), my recommendation is to use a list.

 for i := range nodes {
-	result[i].Score += scoreList[i]
+	for j := range scoresMap {
+		result[i].Score += scoresMap[j][i].Score

@ahg-g

ahg-g Aug 9, 2019

Member

I suspect that the original code will be faster because it has better cache locality when reading scores from the scoresMap.

@draveness

draveness Aug 12, 2019

Author Member

We need more evidence on this; until then, I prefer to keep it this way.

@ahg-g

ahg-g Aug 13, 2019

Member

In theory sequential (or strided) reads are the fastest form of memory access patterns because of prefetching and cache locality.

The other thing in this nested loop that could make it slower is that it performs n*m map lookups (n = number of nodes, m = number of priority functions/score plugins), while if we make the node loop the inner one and range over scoresMap directly, the number of scoresMap lookups will be zero. So it might be even faster (and easier to read) to flip the whole thing like this:

for i := range nodes {
		result = append(result, schedulerapi.HostPriority{Host: nodes[i].Name, Score: 0})
}

for j, priorityList := range priorityConfigs {
		for i := range nodes {
			result[i].Score += priorityList[i].Score * priorityConfigs[j].Weight
		}
}

for _, scoreList := range scoresMap {
		for i := range nodes {
			result[i].Score += scoreList[i].Score
		}
}

We need more evidence on this, before that I prefer to keep it this way.
The theory says this change makes things slower, so if you don't have evidence to the contrary, then we shouldn't make the change, especially since it is not related to the purpose of the PR ("using named array in normalize scores").

@draveness

draveness Aug 13, 2019

Author Member

In theory sequential (or strided) reads are the fastest form of memory access patterns because of prefetching and cache locality.

I know the theory of cache locality; here are the benchmark numbers:

package main

import "testing"

func locality(result []int, scoresMap map[int][]int, anotherScoresMap map[int][]int) {
	for _, scores := range scoresMap {
		for i := range scores {
			result[i] = result[i] + scores[i]
		}
	}

	for _, scores := range anotherScoresMap {
		for i := range scores {
			result[i] = result[i] + scores[i]
		}
	}
}

func nonlocality(result []int, scoresMap map[int][]int, anotherScoresMap map[int][]int) {
	for i := range result {
		for j := range scoresMap {
			result[i] = result[i] + scoresMap[j][i]
		}

		for j := range anotherScoresMap {
			result[i] = result[i] + anotherScoresMap[j][i]
		}
	}
}

func BenchmarkLocality(b *testing.B) {
	result := make([]int, 5000)
	scores := make(map[int][]int, 10)
	anotherScores := make(map[int][]int, 10)

	for i := 0; i < 10; i++ {
		scores[i] = make([]int, 5000)
		anotherScores[i] = make([]int, 5000)
		for j := 0; j < 5000; j++ {
			scores[i][j] = j
			anotherScores[i][j] = j
		}
	}

	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		locality(result, scores, anotherScores)
	}
}

func BenchmarkNonLocality(b *testing.B) {
	result := make([]int, 5000)
	scores := make(map[int][]int, 10)
	anotherScores := make(map[int][]int, 10)

	for i := 0; i < 10; i++ {
		scores[i] = make([]int, 5000)
		anotherScores[i] = make([]int, 5000)
		for j := 0; j < 5000; j++ {
			scores[i][j] = j
			anotherScores[i][j] = j
		}
	}

	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		nonlocality(result, scores, anotherScores)
	}
}

$ go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/golang
BenchmarkLocality-8      	   30000	     44607 ns/op
BenchmarkNonLocality-8   	   30000	     44245 ns/op
PASS

According to the benchmarks, I prefer to keep it this way.

@ahg-g

ahg-g Aug 13, 2019

Member

Strange. Either golang's compiler is that good, or new architectures have such powerful memory controllers. I would like to do the same test in C++ to see if I get different results.

Do you have a theory why they are the same?

@alculquicondor

alculquicondor Aug 13, 2019

Member

The thing is that the number of priorities (j values) is very small, and so is the number of nodes. We are only talking about ~200kb, which can fit in the caches pretty easily.

A minor note on the posted benchmark: it should be map[string][]int, but again, we are only doing lookups among 10 values.

On the other hand, I don't understand the motivation to do the change. I'm not going to hold the review.

@ahg-g

ahg-g Aug 13, 2019

Member

I re-ran the test with the sizes set two orders of magnitude bigger; no difference.

The cache issue is not only about size but also about what you move into the cache (how much of each cache line fetched from memory is actually used), so it is about effective memory bandwidth.

@@ -31,10 +31,16 @@ import (
 type Code int

 // NodeScoreList declares a list of nodes and their scores.
-type NodeScoreList []int
+type NodeScoreList []NodeScore

@ahg-g

ahg-g Aug 9, 2019

Member

Unless we have a strong reason to use map (which I doubt we will), my recommendation is to use a list.

@draveness

Member Author

commented Aug 12, 2019

Hi @bsalamat, do you have any thoughts on this?

@ahg-g

Member

commented Aug 13, 2019

/lgtm

@draveness

Member Author

commented Aug 13, 2019

/lgtm

Hi @ahg-g, can I get approval on this, or do I have to bother Bobby?

@ahg-g

Member

commented Aug 14, 2019

/approve

@k8s-ci-robot

Contributor

commented Aug 14, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, draveness

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@draveness

Member Author

commented Aug 14, 2019

@draveness

Member Author

commented Aug 14, 2019

/retest

@k8s-ci-robot k8s-ci-robot merged commit 1906650 into kubernetes:master Aug 14, 2019

22 of 23 checks passed

pull-kubernetes-kubemark-e2e-gce-big: Job triggered
cla/linuxfoundation: draveness authorized
pull-kubernetes-bazel-build: Job succeeded
pull-kubernetes-bazel-test: Job succeeded
pull-kubernetes-conformance-image-test: Skipped
pull-kubernetes-cross: Skipped
pull-kubernetes-dependencies: Job succeeded
pull-kubernetes-e2e-gce: Job succeeded
pull-kubernetes-e2e-gce-100-performance: Job succeeded
pull-kubernetes-e2e-gce-csi-serial: Skipped
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded
pull-kubernetes-e2e-gce-iscsi: Skipped
pull-kubernetes-e2e-gce-iscsi-serial: Skipped
pull-kubernetes-e2e-gce-storage-slow: Skipped
pull-kubernetes-godeps: Skipped
pull-kubernetes-integration: Job succeeded
pull-kubernetes-local-e2e: Skipped
pull-kubernetes-node-e2e: Job succeeded
pull-kubernetes-node-e2e-containerd: Job succeeded
pull-kubernetes-typecheck: Job succeeded
pull-kubernetes-verify: Job succeeded
pull-publishing-bot-validate: Skipped
tide: In merge pool

@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Aug 14, 2019

@draveness draveness deleted the draveness:feature/use-map-instead-of-array branch Aug 14, 2019
