---
title: Cluster Accurate Replica Estimator
authors:
- "@Garrybest"
reviewers:
- TBD
approvers:
- TBD
creation-date: 2021-08-03
---

# Cluster Accurate Replica Estimator

## Summary

As workloads in multi-cluster environments become more heterogeneous, it is natural for them to have different scheduling requirements such as NodeSelector, NodeAffinity and Tolerations. Meanwhile, the scheduler cannot accurately perceive the free resources of every node in a member cluster.

This KEP proposes a new component, the Karmada Cluster Accurate Replica Estimator, to enhance accurate scheduling for the Karmada Scheduler. It estimates the available replicas in each member cluster and returns the result to the Karmada Scheduler as a reference for scheduling decisions.

## Motivation

There is currently no reliable way for the Karmada Scheduler to perceive the dynamic node resources and pod resource requests of the member clusters. The scheduler only knows the total resources of a member cluster from the `NodeSummary` and `ResourceSummary` in `Cluster.Status`. If a workload specifies a per-replica resource claim, we need to know how many replicas a member cluster can actually accommodate; otherwise too many replicas may be propagated, leading to pending pods that fail to be scheduled.

The Cluster Accurate Replica Estimator aims to fix these problems.

### Goals

- Make the available replica estimation more accurate as a reference for scheduler decisions.
- Allow users to specify node claims such as `NodeAffinity`, `NodeSelector` and `Tolerations` for multi-cluster scheduling.

### Non-Goals

- Estimating how many pods have failed to be scheduled, as a reference for rescheduling decisions.
- Estimating available replica groups for group and gang scheduling.

## Proposal

This proposal introduces a new component that estimates the maximum available replicas of a workload in each member cluster. When assigning replicas, the Karmada Scheduler queries the corresponding Cluster Accurate Replica Estimators in parallel via gRPC.

This proposal is divided into several steps, see below:

- [ ] `ResourceBinding` API changes to add `NodeAffinity`, `NodeSelector` and `Tolerations`.
- [ ] Definition of the proto and gRPC struct file (a rough sketch of the request/response shape follows this list).
- [ ] Estimator client integration in the Karmada Scheduler.
- [ ] Estimator server development in the Karmada Cluster Accurate Replica Estimator.
- [ ] Deployment scripts for the Karmada Cluster Accurate Replica Estimator.
- [ ] Associated docs and an architecture diagram as a supplement.
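
To make the second item above concrete, the estimation call might exchange messages shaped roughly like the Go structs below. The names (`MaxAvailableReplicasRequest`, `MaxAvailableReplicasResponse`) are hypothetical placeholders rather than the final proto definition; the request reuses the `ReplicaRequirements` type introduced in the API Changes section.

```go
// Hypothetical shapes of the gRPC messages exchanged between the Karmada
// Scheduler and a Cluster Accurate Replica Estimator. The final proto file
// may define different names and fields.

// MaxAvailableReplicasRequest asks one member cluster how many replicas of a
// workload it could accommodate.
type MaxAvailableReplicasRequest struct {
	// Cluster is the name of the member cluster being queried.
	Cluster string `json:"cluster"`

	// ReplicaRequirements carries the node claim and the per-replica resource
	// request of the workload.
	ReplicaRequirements ReplicaRequirements `json:"replicaRequirements"`
}

// MaxAvailableReplicasResponse carries the estimation result back to the scheduler.
type MaxAvailableReplicasResponse struct {
	// MaxReplicas is the maximum number of replicas the cluster can produce.
	MaxReplicas int32 `json:"maxReplicas"`
}
```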

### User Stories
#### Story 1
Imagine that we have 1 workload and 2 member clusters, and the `ReplicaDivisionPreference` is `Aggregated`.
- Condition:
  - Cluster A has 10 nodes, each with 8 CPUs remaining.
  - Cluster B has 2 nodes, each with 16 CPUs remaining.
  - The workload has 1 replica and requests 12 CPUs.
- Result:
  - The workload will be scheduled to Cluster A because it has more CPUs remaining in total. However, this does not work because no single node in Cluster A can satisfy the 12-CPU request.
  - Only Cluster B can accommodate the workload's request, but its total remaining resources make it less competitive.

#### Story 2
Imagine that a workload has a `NodeSelector` of "key=value", and the `ReplicaDivisionPreference` is `Aggregated`.
- Condition:
  - Cluster A has 10 nodes, each with 16 CPUs remaining.
  - Cluster B has 2 nodes, each labeled with "key=value" and each with 16 CPUs remaining.
  - The workload has 1 replica with a `NodeSelector` of "key=value" and requests 12 CPUs.
- Result:
  - The workload will be scheduled to Cluster A because it has more CPUs remaining in total. However, this does not work because no node in Cluster A matches the `NodeSelector`.
  - Only Cluster B can accommodate the workload's request, but its total remaining resources make it less competitive.

## Design Details

The maximum available replicas of a cluster have a significant influence on scheduling. To solve this problem, we could change the way we estimate the available replicas of a cluster. Note that they are currently estimated from the resource summary of the cluster. This function could be converted into a new type of scheduler plugin.

Here's the architecture design diagram.

![design](https://github.com/karmada-io/karmada/tree/master/docs/proposals/scheduling/521-replica-estimator/design.png)

### API Changes

First, the `ResourceBinding` API must be changed. `NodeAffinity`, `NodeSelector` and `Tolerations` should be added to represent the node claim, alongside the existing `ResourceRequest`.

```go
// ReplicaRequirements represents the requirements required by each replica.
type ReplicaRequirements struct {
	// A node selector represents the union of the results of one or more label queries
	// over a set of nodes; that is, it represents the OR of the selectors represented
	// by the node selector terms.
	// +optional
	NodeAffinity *corev1.NodeSelector `json:"nodeAffinity,omitempty"`

	// NodeSelector is a selector which must be true for the pod to fit on a node.
	// Selector which must match a node's labels for the pod to be scheduled on that node.
	// +optional
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`

	// If specified, the pod's tolerations.
	// +optional
	Tolerations []corev1.Toleration `json:"tolerations,omitempty"`

	// ResourceRequest represents the resources required by each replica.
	// +optional
	ResourceRequest corev1.ResourceList `json:"resourceRequest,omitempty"`
}
```

After the change, `ObjectReference` will look like this. The only difference is that `corev1.ResourceList` is replaced by `*ReplicaRequirements`, which describes the requirements of each replica.

```go
// ObjectReference contains enough information to locate the referenced object inside current cluster.
type ObjectReference struct {
	// APIVersion represents the API version of the referent.
	APIVersion string `json:"apiVersion"`

	// Kind represents the Kind of the referent.
	Kind string `json:"kind"`

	// Namespace represents the namespace for the referent.
	// For non-namespace scoped resources (e.g. 'ClusterRole'), Namespace does not need to be specified,
	// while for namespace scoped resources, Namespace is required.
	// If Namespace is not specified, the resource is assumed to be non-namespace scoped.
	// +optional
	Namespace string `json:"namespace,omitempty"`

	// Name represents the name of the referent.
	Name string `json:"name"`

	// ResourceVersion represents the internal version of the referenced object, which can be used by clients to
	// determine when the object has changed.
	// +optional
	ResourceVersion string `json:"resourceVersion,omitempty"`

	// ReplicaRequirements represents the requirements required by each replica.
	// +optional
	ReplicaRequirements *ReplicaRequirements `json:"replicaRequirements,omitempty"`

	// Replicas represents the replica number of the referencing resource.
	// +optional
	Replicas int32 `json:"replicas,omitempty"`
}
```

### Karmada Scheduler

First, the existing plugins in the Karmada Scheduler, such as ClusterAffinity, APIInstalled and TaintToleration, will select the suitable clusters.

Based on this prefilter result, when assigning replicas, the Karmada Scheduler calculates the maximum available replicas of each cluster by issuing gRPC requests concurrently to the Cluster Accurate Replica Estimators. Each estimator returns how many available replicas its cluster can produce, and the Karmada Scheduler then assigns replicas to the clusters according to the estimation results.
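
A minimal sketch of this concurrent fan-out is shown below, assuming a hypothetical `estimatorClient` interface for the per-cluster gRPC client; the real client will be generated from the final proto definition.

```go
import (
	"context"
	"sync"
)

// estimatorClient is a hypothetical wrapper around the gRPC client of one
// Cluster Accurate Replica Estimator.
type estimatorClient interface {
	MaxAvailableReplicas(ctx context.Context, requirements *ReplicaRequirements) (int32, error)
}

// maxAvailableReplicas queries the estimator of every candidate cluster in
// parallel and collects the estimated replica counts.
func maxAvailableReplicas(ctx context.Context, clients map[string]estimatorClient,
	clusters []string, requirements *ReplicaRequirements) map[string]int32 {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]int32, len(clusters))
	)
	for _, cluster := range clusters {
		client, ok := clients[cluster]
		if !ok {
			continue // no estimator deployed for this cluster
		}
		wg.Add(1)
		go func(cluster string, client estimatorClient) {
			defer wg.Done()
			replicas, err := client.MaxAvailableReplicas(ctx, requirements)
			if err != nil {
				// Fall back to the summary-based estimation (simplified to 0 here).
				replicas = 0
			}
			mu.Lock()
			results[cluster] = replicas
			mu.Unlock()
		}(cluster, client)
	}
	wg.Wait()
	return results
}
```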

Furthermore, replica estimation can be treated as a new scheduler plugin.
We could implement this by refactoring the function `calClusterAvailableReplicas` into an interface. The previous estimation method, based on the `ResourceSummary` in `Cluster.Status`, would remain available as the default estimation approach.
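
One possible shape for such an interface is sketched below; the names are illustrative rather than the final API. The default implementation would keep today's behaviour, dividing the free resources reported in `ResourceSummary` by the per-replica request, while a second implementation would forward the call to the Cluster Accurate Replica Estimator over gRPC.

```go
import "context"

// ReplicaEstimator estimates how many replicas of a workload a member cluster
// could accommodate. The interface and method names are illustrative only.
type ReplicaEstimator interface {
	// MaxAvailableReplicas returns the estimated maximum number of replicas
	// that the named member cluster can produce for the given requirements.
	MaxAvailableReplicas(ctx context.Context, cluster string,
		requirements *ReplicaRequirements) (int32, error)
}
```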

### Karmada Cluster Accurate Replica Estimator

The Cluster Accurate Replica Estimator is an independent component that works as a gRPC server. Before the server starts, pod and node informers associated with a member cluster are created as a cache. Once the cache has been synced, the gRPC server starts and serves incoming scheduler requests as a replica estimator.

The estimator performs a replica estimation in the following steps (a sketch of the per-node calculation follows the list):

- Verify whether the request meets the requirements.
- Find all nodes that match the node claim.
  - List nodes by label selector.
  - Filter nodes by node affinity.
  - Filter schedulable nodes by taints and tolerations.
- Estimate the maximum available replicas on every filtered node.
  - Get the pods assigned to the node.
  - Calculate the node's idle resources.
  - Calculate how many replicas the node can accommodate.
- Return the sum of the maximum available replicas over all nodes.
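
The per-node calculation can be sketched as follows, assuming the node's idle resources have already been computed by subtracting the requests of the pods assigned to it; the helper name is illustrative.

```go
import (
	"math"

	corev1 "k8s.io/api/core/v1"
)

// nodeMaxReplicas returns how many replicas a node can still accommodate: for
// every requested resource it divides the node's idle amount by the per-replica
// request and takes the minimum across all resources.
func nodeMaxReplicas(idle, request corev1.ResourceList) int32 {
	maxReplicas := int32(math.MaxInt32)
	for name, quantity := range request {
		if quantity.IsZero() {
			continue
		}
		available, ok := idle[name]
		if !ok {
			return 0
		}
		// Milli-units keep fractional CPU requests precise; a production
		// implementation would handle extended resources more carefully.
		replicas := int32(available.MilliValue() / quantity.MilliValue())
		if replicas < maxReplicas {
			maxReplicas = replicas
		}
	}
	return maxReplicas
}
```

The value returned to the scheduler is the sum of this per-node result over all nodes that passed the node claim filters.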

### Test Plan

- Unit Test covering:
  - Core changes in the Karmada Scheduler, consisting of gRPC connection establishment and sending replica estimation requests.
  - Core changes in the Karmada Cluster Accurate Replica Estimator, consisting of node filtering and node idle resource calculation.
- E2E Test covering:
  - Deploying the Karmada Cluster Accurate Replica Estimator.
  - Specifying different node claims in a workload and verifying the scheduling result.
