Ubernetes: When scheduling, spread between zones if labeled #17915

Merged 3 commits on Dec 17, 2015
Changes from all commits
docs/devel/scheduler.md (2 changes: 1 addition & 1 deletion)
@@ -47,7 +47,7 @@ will filter out nodes that don't have at least that much resources available (co
as the capacity of the node minus the sum of the resource requests of the containers that
are already running on the node). Second, it applies a set of "priority functions"
that rank the nodes that weren't filtered out by the predicate check. For example,
it tries to spread Pods across nodes while at the same time favoring the least-loaded
it tries to spread Pods across nodes and zones while at the same time favoring the least-loaded
nodes (where "load" here is sum of the resource requests of the containers running on the node,
divided by the node's capacity).
Finally, the node with the highest priority is chosen
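
As a rough illustration of the "load" fraction described in the passage above, here is a minimal standalone sketch (made-up numbers, only CPU shown; this is not the scheduler's actual code):

package main

import "fmt"

// load is the "load" described above: the sum of the resource requests of the
// containers already running on the node, divided by the node's capacity.
func load(requestedMilliCPU, capacityMilliCPU int64) float64 {
    return float64(requestedMilliCPU) / float64(capacityMilliCPU)
}

func main() {
    // A node with 1000m of 4000m CPU requested is 25% loaded; all else being
    // equal, the priority functions favor it over a node that is 75% loaded.
    fmt.Printf("node A load: %.2f\n", load(1000, 4000)) // 0.25
    fmt.Printf("node B load: %.2f\n", load(3000, 4000)) // 0.75
}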
docs/devel/scheduler_algorithm.md (2 changes: 1 addition & 1 deletion)
@@ -61,7 +61,7 @@ Currently, Kubernetes scheduler provides some practical priority functions, incl
- `LeastRequestedPriority`: The node is prioritized based on the fraction of the node that would be free if the new Pod were scheduled onto the node. (In other words, (capacity - sum of requests of all Pods already on the node - request of Pod that is being scheduled) / capacity). CPU and memory are equally weighted. The node with the highest free fraction is the most preferred. Note that this priority function has the effect of spreading Pods across the nodes with respect to resource consumption.
- `CalculateNodeLabelPriority`: Prefer nodes that have the specified label.
- `BalancedResourceAllocation`: This priority function tries to put the Pod on a node such that the CPU and Memory utilization rate is balanced after the Pod is deployed.
- `CalculateSpreadPriority`: Spread Pods by minimizing the number of Pods belonging to the same service on the same node.
- `CalculateSpreadPriority`: Spread Pods by minimizing the number of Pods belonging to the same service on the same node. If zone information is present on the nodes, the priority will be adjusted so that pods are spread across zones and nodes.
- `CalculateAntiAffinityPriority`: Spread Pods by minimizing the number of Pods belonging to the same service on nodes with the same value for a particular label.

The details of the above priority functions can be found in [plugin/pkg/scheduler/algorithm/priorities](http://releases.k8s.io/HEAD/plugin/pkg/scheduler/algorithm/priorities/). Kubernetes uses some, but not all, of these priority functions by default. You can see which ones are used by default in [plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go](http://releases.k8s.io/HEAD/plugin/pkg/scheduler/algorithmprovider/defaults/defaults.go). Similar as predicates, you can combine the above priority functions and assign weight factors (positive number) to them as you want (check [scheduler.md](scheduler.md) for how to customize).
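
To make the `LeastRequestedPriority` fraction described above concrete, here is a minimal single-resource sketch (illustrative numbers only; the real function weights CPU and memory equally, and the scaling to the 0-10 priority range mirrors the maxPriority constant used elsewhere in this PR):

package main

import "fmt"

// leastRequestedScore computes, for a single resource,
// (capacity - sum of requests of pods already on the node - request of the pod
// being scheduled) / capacity, scaled to the 0-10 priority range.
func leastRequestedScore(capacityMilliCPU, requestedMilliCPU, podRequestMilliCPU int64) float64 {
    free := capacityMilliCPU - requestedMilliCPU - podRequestMilliCPU
    if free < 0 {
        free = 0
    }
    return 10 * float64(free) / float64(capacityMilliCPU)
}

func main() {
    // Scheduling a pod that requests 500m CPU:
    fmt.Println(leastRequestedScore(4000, 1000, 500)) // lightly loaded node: 6.25
    fmt.Println(leastRequestedScore(4000, 3000, 500)) // heavily loaded node: 1.25
}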
plugin/pkg/scheduler/algorithm/priorities/selector_spreading.go (148 changes: 116 additions & 32 deletions)
@@ -19,11 +19,20 @@ package priorities
import (
"github.com/golang/glog"
"k8s.io/kubernetes/pkg/api"
"k8s.io/kubernetes/pkg/api/unversioned"
"k8s.io/kubernetes/pkg/labels"
"k8s.io/kubernetes/plugin/pkg/scheduler/algorithm"
schedulerapi "k8s.io/kubernetes/plugin/pkg/scheduler/api"
)

// The maximum priority value to give to a node
// Priority values range from 0-maxPriority
const maxPriority = 10

// When zone information is present, give 2/3 of the weighting to zone spreading, 1/3 to node spreading
// TODO: Any way to justify this weighting?
const zoneWeighting = 2.0 / 3.0

type SelectorSpread struct {
serviceLister algorithm.ServiceLister
controllerLister algorithm.ControllerLister
@@ -37,11 +46,34 @@ func NewSelectorSpreadPriority(serviceLister algorithm.ServiceLister, controller
return selectorSpread.CalculateSpreadPriority
}

// CalculateSpreadPriority spreads pods by minimizing the number of pods belonging to the same service or replication controller. It counts number of pods that run under
// Services or RCs as the pod being scheduled and tries to minimize the number of conflicts. I.e. pushes scheduler towards a Node where there's a smallest number of
// pods which match the same selectors of Services and RCs as current pod.
// getZoneKey is a helper function that builds a string identifier that is unique per failure zone.
// It returns the empty string if the node carries no zone information.
func getZoneKey(node *api.Node) string {
labels := node.Labels
if labels == nil {
return ""
}

region, _ := labels[unversioned.LabelZoneRegion]
failureDomain, _ := labels[unversioned.LabelZoneFailureDomain]

if region == "" && failureDomain == "" {
return ""
}

// We include the null character just in case region or failureDomain has a colon
// (we assume there are no null characters in a region or failureDomain)
// As a nice side-benefit, the null character is not printed by fmt.Print or glog
return region + ":\x00:" + failureDomain
}
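
// For illustration only (hypothetical label values, not part of this PR): a node
// labeled with region "us-east1" and failure domain "us-east1-b" yields the zone
// key "us-east1:\x00:us-east1-b", while a node carrying neither label yields ""
// and is therefore ignored for zone spreading.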

// CalculateSpreadPriority spreads pods across hosts and zones, considering pods belonging to the same service or replication controller.
// When a pod is scheduled, it looks for services or RCs that match the pod, then finds existing pods that match those selectors.
// It favors nodes that have fewer existing matching pods.
// i.e. it pushes the scheduler towards a node where there's the smallest number of
// pods which match the same service selectors or RC selectors as the pod being scheduled.
// Where zone information is included on the nodes, it favors nodes in zones with fewer existing matching pods.
func (s *SelectorSpread) CalculateSpreadPriority(pod *api.Pod, podLister algorithm.PodLister, nodeLister algorithm.NodeLister) (schedulerapi.HostPriorityList, error) {
var maxCount int
var nsPods []*api.Pod

selectors := make([]labels.Selector, 0)
@@ -76,35 +108,87 @@ func (s *SelectorSpread) CalculateSpreadPriority(pod *api.Pod, podLister algorit
return nil, err
}

counts := map[string]int{}
if len(nsPods) > 0 {
for _, pod := range nsPods {
matches := false
for _, selector := range selectors {
if selector.Matches(labels.Set(pod.ObjectMeta.Labels)) {
matches = true
break
}
}
if matches {
counts[pod.Spec.NodeName]++
// Compute the maximum number of pods hosted on any node
if counts[pod.Spec.NodeName] > maxCount {
maxCount = counts[pod.Spec.NodeName]
}
// Count similar pods by node
countsByNodeName := map[string]int{}
for _, pod := range nsPods {
// When we are replacing a failed pod, we often see the previous deleted version
// while scheduling the replacement. Ignore the previous deleted version for spreading
// purposes (it can still be considered for resource restrictions etc.)
if pod.DeletionTimestamp != nil {
glog.V(2).Infof("skipping pending-deleted pod: %s/%s", pod.Namespace, pod.Name)
continue
}
matches := false
for _, selector := range selectors {
if selector.Matches(labels.Set(pod.ObjectMeta.Labels)) {
matches = true
break
}
}
if !matches {
continue
}

countsByNodeName[pod.Spec.NodeName]++
}

// Aggregate by-node information
// Compute the maximum number of pods hosted on any node
maxCountByNodeName := 0
for _, count := range countsByNodeName {
if count > maxCountByNodeName {
maxCountByNodeName = count
}
}

// Count similar pods by zone, if zone information is present
countsByZone := map[string]int{}
for i := range nodes.Items {
node := &nodes.Items[i]

count, found := countsByNodeName[node.Name]
if !found {
continue
}

zoneId := getZoneKey(node)
if zoneId == "" {
continue
}

countsByZone[zoneId] += count
}

// Aggregate by-zone information
// Compute the maximum number of pods hosted in any zone
haveZones := len(countsByZone) != 0
maxCountByZone := 0
for _, count := range countsByZone {
if count > maxCountByZone {
maxCountByZone = count
}
}

result := []schedulerapi.HostPriority{}
//score int - scale of 0-10
// 0 being the lowest priority and 10 being the highest
for _, node := range nodes.Items {
// initializing to the default/max node score of 10
fScore := float32(10)
if maxCount > 0 {
fScore = 10 * (float32(maxCount-counts[node.Name]) / float32(maxCount))
//score int - scale of 0-maxPriority
// 0 being the lowest priority and maxPriority being the highest
for i := range nodes.Items {
node := &nodes.Items[i]
// initializing to the default/max node score of maxPriority
fScore := float32(maxPriority)
if maxCountByNodeName > 0 {
fScore = maxPriority * (float32(maxCountByNodeName-countsByNodeName[node.Name]) / float32(maxCountByNodeName))
}

// If there is zone information present, incorporate it
if haveZones {
zoneId := getZoneKey(node)
if zoneId != "" {
zoneScore := maxPriority * (float32(maxCountByZone-countsByZone[zoneId]) / float32(maxCountByZone))
fScore = (fScore * (1.0 - zoneWeighting)) + (zoneWeighting * zoneScore)
}
}

result = append(result, schedulerapi.HostPriority{Host: node.Name, Score: int(fScore)})
glog.V(10).Infof(
"%v -> %v: SelectorSpreadPriority, Score: (%d)", pod.Name, node.Name, int(fScore),
@@ -177,13 +261,13 @@ func (s *ServiceAntiAffinity) CalculateAntiAffinityPriority(pod *api.Pod, podLis

numServicePods := len(nsServicePods)
result := []schedulerapi.HostPriority{}
//score int - scale of 0-10
// 0 being the lowest priority and 10 being the highest
//score int - scale of 0-maxPriority
// 0 being the lowest priority and maxPriority being the highest
for node := range labeledNodes {
// initializing to the default/max node score of 10
fScore := float32(10)
// initializing to the default/max node score of maxPriority
fScore := float32(maxPriority)
if numServicePods > 0 {
fScore = 10 * (float32(numServicePods-podCounts[labeledNodes[node]]) / float32(numServicePods))
fScore = maxPriority * (float32(numServicePods-podCounts[labeledNodes[node]]) / float32(numServicePods))
}
result = append(result, schedulerapi.HostPriority{Host: node, Score: int(fScore)})
}
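
Putting the new scoring together, here is a quick worked example of the node/zone combination performed in CalculateSpreadPriority above (a standalone sketch with made-up pod counts; zoneWeighting = 2/3 as defined at the top of the file):

package main

import "fmt"

const maxPriority = 10
const zoneWeighting = 2.0 / 3.0

// combinedScore mirrors the arithmetic in CalculateSpreadPriority above:
// the node score and zone score are each maxPriority*(max-count)/max, and the
// final score gives 2/3 of the weight to zone spreading and 1/3 to node spreading.
func combinedScore(countOnNode, maxCountByNode, countInZone, maxCountByZone int) float32 {
    nodeScore := maxPriority * (float32(maxCountByNode-countOnNode) / float32(maxCountByNode))
    zoneScore := maxPriority * (float32(maxCountByZone-countInZone) / float32(maxCountByZone))
    return nodeScore*(1.0-zoneWeighting) + zoneWeighting*zoneScore
}

func main() {
    // Suppose the busiest node has 3 matching pods and the busiest zone has 4.
    // An empty node in an empty zone gets the full score; an empty node in the
    // busiest zone keeps only the 1/3 node-spreading share.
    fmt.Printf("%.2f\n", combinedScore(0, 3, 0, 4)) // 10.00
    fmt.Printf("%.2f\n", combinedScore(0, 3, 4, 4)) // 3.33
}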