Merge pull request #250 from joelsmith/rebase

OCPCLOUD-1851: Upstream rebase to CA 1.26.1 and VPA 0.13

openshift-merge-robot committed Feb 23, 2023
2 parents 9a0de22 + f166a59 commit c58c53b
Showing 2,656 changed files with 315,249 additions and 167,332 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yaml
@@ -17,7 +17,7 @@ jobs:
- name: Set up Go
uses: actions/setup-go@v2
with:
go-version: 1.18.1
go-version: 1.19

- uses: actions/checkout@v2
with:
182 changes: 182 additions & 0 deletions balancer/proposals/balancer.md
@@ -0,0 +1,182 @@

# KEP - Balancer

## Introduction

One of the problems users face when running Kubernetes deployments is how to spread pods
across several domains while keeping them balanced and autoscaled at the same time.
These domains may include:

* Cloud provider zones inside a single region, to ensure that the application is still up and running, even if one of the zones has issues.
* Different types of Kubernetes nodes. These may involve nodes that are spot/preemptible, or of different machine families.

A single Kubernetes deployment may either leave the placement entirely up to the scheduler
(most likely leading to an undesired result, such as all pods landing in a single domain) or
focus on a single domain (thus not achieving the goal of being in two or more domains).

PodTopologySpread constraints solve the problem partially, but not completely. They allow only
even spreading, and once the deployment gets skewed they do nothing to rebalance it. Pod topology
spreading (with maxSkew and/or the ScheduleAnyway flag) is also just a hint: if a skewed placement
is available and allowed, Cluster Autoscaler is not triggered and the user ends up with a skewed
deployment. A user could specify strict pod topology spreading, but then, in case of problems, the
deployment would not move its pods to the domains that are still available. Growth of the
deployment could also be completely blocked, as the available domains would be too skewed.
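
For reference, this is roughly what such a constraint looks like on a pod template; the zone key,
labels and skew values below are illustrative and not part of this proposal:

```yaml
# Illustrative pod topology spread constraint (not part of the Balancer proposal).
# With whenUnsatisfiable: ScheduleAnyway this is only a soft hint, so a skewed
# placement can still happen and nothing rebalances it afterwards;
# DoNotSchedule is the strict variant that can block growth instead.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web
```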

Thus, if full flexibility is needed, the only option is to have multiple deployments, each
targeting a different domain. This setup, however, creates one big problem: how to consistently
autoscale multiple deployments? The simplest idea, one HPA per deployment, is not stable: due to
differing loads, race conditions and the like, some domains may grow while others are shrunk. As
the HPAs and deployments are not connected to each other in any way, a skewed setup will not fix
itself automatically. It may eventually reach a semi-balanced state, but that is not guaranteed.


Thus, there is a need for a component that will:

* Keep multiple deployments aligned. For example, it may keep an equal ratio between the number of
pods in one deployment and another, or fill the first one and overflow into the second, and so on.
* React to problems in individual deployments, be it a zone outage or a lack of spot/preemptible VMs.
* Actively try to rebalance and reach the desired layout.
* Allow all deployments to be autoscaled through a single target, while maintaining the placement policy.

## Balancer

Balancer is a stand-alone controller, living in user space (or in the control plane, if needed),
that exposes a CRD API object, also called Balancer. Each Balancer object points to multiple
deployments or other pod-controlling objects that expose the Scale subresource. Balancer
periodically checks the number of running and problematic pods inside each of its targets,
compares it with the desired number of replicas, constraints and policies, and adjusts the number
of replicas on the targets should some of them run too many or too few. To allow it to be used as
an HPA target, Balancer itself exposes the Scale subresource, as sketched below.
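
A minimal sketch of that single-target setup, assuming a Balancer API group/version that this
proposal does not pin down:

```yaml
# Hypothetical HPA targeting a Balancer through its Scale subresource.
# The Balancer apiVersion/kind below are assumptions for illustration only.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: balancer.x-k8s.io/v1alpha1  # assumed group/version
    kind: Balancer
    name: web-balancer
  minReplicas: 4
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```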

## Balancer API

```go
// Balancer is an object used to automatically keep the desired number of
// replicas (pods) distributed among the specified set of targets (deployments
// or other objects that expose the Scale subresource).
type Balancer struct {
    metav1.TypeMeta
    // Standard object metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
    // +optional
    metav1.ObjectMeta
    // Specification of the Balancer behavior.
    // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status.
    Spec BalancerSpec
    // Current information about the Balancer.
    // +optional
    Status BalancerStatus
}

// BalancerSpec is the specification of the Balancer behavior.
type BalancerSpec struct {
    // Targets is a list of targets between which Balancer tries to distribute
    // replicas.
    Targets []BalancerTarget
    // Replicas is the number of pods that should be distributed among the
    // declared targets according to the specified policy.
    Replicas int32
    // Selector that groups the pods from all targets together (and only those).
    // Ideally it should match the selector used by the Service built on top of the
    // Balancer. All pods selectable by the targets' selectors must match this
    // selector; however, a target's selector doesn't have to be a superset of this
    // one (although it is recommended).
    Selector metav1.LabelSelector
    // Policy defines how the balancer should distribute replicas among targets.
    Policy BalancerPolicy
}

// BalancerTarget is the declaration of one of the targets between which the balancer
// tries to distribute replicas.
type BalancerTarget struct {
    // Name of the target. The name can later be used to specify
    // additional balancer details for this target.
    Name string
    // ScaleTargetRef is a reference that points to a target resource to balance.
    // The target needs to expose the Scale subresource.
    ScaleTargetRef hpa.CrossVersionObjectReference
    // MinReplicas is the minimum number of replicas inside of this target.
    // Balancer will set at least this amount on the target, even if the total
    // desired number of replicas for the Balancer is lower.
    // +optional
    MinReplicas *int32
    // MaxReplicas is the maximum number of replicas inside of this target.
    // Balancer will set at most this amount on the target, even if the total
    // desired number of replicas for the Balancer is higher.
    // +optional
    MaxReplicas *int32
}

// BalancerPolicyName is the name of the balancer Policy.
type BalancerPolicyName string

const (
    PriorityPolicyName     BalancerPolicyName = "priority"
    ProportionalPolicyName BalancerPolicyName = "proportional"
)

// BalancerPolicy defines the Balancer policy for replica distribution.
type BalancerPolicy struct {
    // PolicyName decides how to balance replicas across the targets.
    // Depending on the name, one of the fields Priorities or Proportions must be set.
    PolicyName BalancerPolicyName
    // Priorities contains the detailed specification of how to balance when the
    // balancer policy name is set to Priority.
    // +optional
    Priorities *PriorityPolicy
    // Proportions contains the detailed specification of how to balance when the
    // balancer policy name is set to Proportional.
    // +optional
    Proportions *ProportionalPolicy
    // Fallback contains the specification of how to recognize and what to do if some
    // replicas fail to start in one or more targets. No fallback happens if not set.
    // +optional
    Fallback *Fallback
}

// PriorityPolicy contains details for the priority-based policy for Balancer.
type PriorityPolicy struct {
    // TargetOrder is the priority-ordered list of Balancer target names. The first target
    // on the list gets the replicas until its maxReplicas is reached (or replicas
    // fail to start). Then the replicas go to the second target, and so on. MinReplicas
    // is guaranteed to be fulfilled, irrespective of the order, presence on the
    // list, and/or the total Balancer replica count.
    TargetOrder []string
}

// ProportionalPolicy contains details for the proportion-based policy for Balancer.
type ProportionalPolicy struct {
    // TargetProportions is a map from Balancer target names to rates. Replicas are
    // distributed so that the maximum difference between the current replica share
    // and the desired replica share is minimized. Once a target reaches maxReplicas,
    // it is removed from the calculations and replicas are distributed with
    // the updated proportions. MinReplicas is guaranteed for a target, irrespective
    // of the total Balancer replica count, the proportions, or its presence in the map.
    TargetProportions map[string]int32
}

// Fallback contains information on how to recognize and handle replicas
// that failed to start within the specified time period.
type Fallback struct {
    // StartupTimeout defines how long the Balancer will wait before considering
    // a pending/not-started pod as blocked and starting another replica in some other
    // target. Once the replica is finally started, replicas in other targets
    // may be stopped.
    StartupTimeout metav1.Duration
}

// BalancerStatus describes the Balancer runtime state.
type BalancerStatus struct {
    // Replicas is the actual number of observed pods matching the Balancer selector.
    Replicas int32
    // Selector is a query over pods that should match the replicas count. This is the
    // same as the label selector but in string format, to avoid introspection
    // by clients. The string will be in the same format as the query-param syntax.
    // More info about label selectors: http://kubernetes.io/docs/user-guide/labels#label-selectors
    Selector string
    // Conditions is the set of conditions required for this Balancer to work properly,
    // and indicates whether or not those conditions are met.
    // +optional
    // +patchMergeKey=type
    // +patchStrategy=merge
    Conditions []metav1.Condition
}
```
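
For illustration, a Balancer object built from these fields might look like the sketch below. The
apiVersion and the lowerCamelCase serialized field names are assumptions, since the proposal only
defines the Go types:

```yaml
# Hypothetical Balancer spreading 10 replicas 50/50 between two deployments,
# with a fallback if replicas in a target stay pending for too long.
# The apiVersion group/version is assumed for illustration.
apiVersion: balancer.x-k8s.io/v1alpha1
kind: Balancer
metadata:
  name: web-balancer
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web
  targets:
    - name: web-zone-a
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web-zone-a
      minReplicas: 2
    - name: web-zone-b
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web-zone-b
      minReplicas: 2
  policy:
    policyName: proportional
    proportions:
      targetProportions:
        web-zone-a: 50
        web-zone-b: 50
    fallback:
      startupTimeout: 5m
```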
7 changes: 3 additions & 4 deletions builder/OWNERS
@@ -1,10 +1,9 @@
approvers:
- aleksandra-malinowska
- losipiuk
- maciekpytel
- mwielgus
reviewers:
- aleksandra-malinowska
- losipiuk
- maciekpytel
- mwielgus
emeritus_approvers:
- aleksandra-malinowska # 2022-09-30
- losipiuk # 2022-09-30
2 changes: 1 addition & 1 deletion charts/cluster-autoscaler/Chart.yaml
@@ -11,4 +11,4 @@ name: cluster-autoscaler
sources:
- https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
type: application
version: 9.20.1
version: 9.21.1
1 change: 1 addition & 0 deletions charts/cluster-autoscaler/README.md
@@ -367,6 +367,7 @@ Though enough for the majority of installations, the default PodSecurityPolicy _
| serviceMonitor.annotations | object | `{}` | Annotations to add to service monitor |
| serviceMonitor.enabled | bool | `false` | If true, creates a Prometheus Operator ServiceMonitor. |
| serviceMonitor.interval | string | `"10s"` | Interval that Prometheus scrapes Cluster Autoscaler metrics. |
| serviceMonitor.metricRelabelings | object | `{}` | MetricRelabelConfigs to apply to samples before ingestion. |
| serviceMonitor.namespace | string | `"monitoring"` | Namespace which Prometheus is running in. |
| serviceMonitor.path | string | `"/metrics"` | The path to scrape for metrics; autoscaler exposes `/metrics` (this is standard) |
| serviceMonitor.selector | object | `{"release":"prometheus-operator"}` | Default to kube-prometheus install (CoreOS recommended), but should be set according to Prometheus install. |
3 changes: 3 additions & 0 deletions charts/cluster-autoscaler/templates/_helpers.tpl
@@ -70,10 +70,13 @@ Return the appropriate apiVersion for podsecuritypolicy.
{{- $kubeTargetVersion := default .Capabilities.KubeVersion.GitVersion .Values.kubeTargetVersionOverride }}
{{- if semverCompare "<1.10-0" $kubeTargetVersion -}}
{{- print "extensions/v1beta1" -}}
{{- if semverCompare ">1.21-0" $kubeTargetVersion -}}
{{- print "policy/v1" -}}
{{- else -}}
{{- print "policy/v1beta1" -}}
{{- end -}}
{{- end -}}
{{- end -}}

{{/*
Return the appropriate apiVersion for podDisruptionBudget.
5 changes: 5 additions & 0 deletions charts/cluster-autoscaler/templates/deployment.yaml
@@ -59,6 +59,11 @@ spec:
- --nodes={{ .minSize }}:{{ .maxSize }}:{{ .name }}
{{- end }}
{{- end }}
{{- if eq .Values.cloudProvider "rancher" }}
{{- if .Values.cloudConfigPath }}
- --cloud-config={{ .Values.cloudConfigPath }}
{{- end }}
{{- end }}
{{- if eq .Values.cloudProvider "aws" }}
{{- if .Values.autoDiscovery.clusterName }}
- --node-group-auto-discovery=asg:tag={{ tpl (join "," .Values.autoDiscovery.tags) . }}
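
For context, the new block above is only rendered when the chart user sets something along these
lines in their values (a sketch; the actual rancher cloud config contents are out of scope here):

```yaml
# Hypothetical values.yaml excerpt for the rancher provider.
cloudProvider: rancher
# Path inside the container that the new --cloud-config flag will point at.
cloudConfigPath: /config/cloud-config
```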
4 changes: 4 additions & 0 deletions charts/cluster-autoscaler/templates/servicemonitor.yaml
@@ -20,6 +20,10 @@ spec:
- port: {{ .Values.service.portName }}
interval: {{ .Values.serviceMonitor.interval }}
path: {{ .Values.serviceMonitor.path }}
{{- if .Values.serviceMonitor.metricRelabelings }}
metricRelabelings:
{{ tpl (toYaml .Values.serviceMonitor.metricRelabelings | indent 6) . }}
{{- end }}
namespaceSelector:
matchNames:
- {{.Release.Namespace}}
3 changes: 3 additions & 0 deletions charts/cluster-autoscaler/values.yaml
@@ -344,6 +344,9 @@ serviceMonitor:
path: /metrics
# serviceMonitor.annotations -- Annotations to add to service monitor
annotations: {}
## [RelabelConfig](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#monitoring.coreos.com/v1.RelabelConfig)
# serviceMonitor.metricRelabelings -- MetricRelabelConfigs to apply to samples before ingestion.
metricRelabelings: {}

## Custom PrometheusRule to be defined
## The value is evaluated as a template, so, for example, the value can depend on .Release or .Chart
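
As a usage sketch, the new value accepts standard prometheus-operator relabel configs; the metric
name pattern below is illustrative:

```yaml
# Hypothetical values.yaml excerpt: drop Go runtime series before ingestion.
serviceMonitor:
  enabled: true
  metricRelabelings:
    - sourceLabels: [__name__]
      regex: "go_.*"
      action: drop
```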
