priority and fairness: add production readiness review
Signed-off-by: Adhityaa Chandrasekar <adtac@google.com>
Adhityaa Chandrasekar committed Oct 5, 2020
1 parent eb0150d commit 64fb717
Showing 2 changed files with 211 additions and 30 deletions.
@@ -1,27 +1,4 @@
---
title: Priority and Fairness for API Server Requests
authors:
- "@MikeSpreitzer"
- "@yue9944882"
owning-sig: sig-api-machinery
participating-sigs:
- wg-multitenancy
reviewers:
- "@deads2k"
- "@lavalamp"
approvers:
- "@deads2k"
- "@lavalamp"
editor: TBD
creation-date: 2019-02-28
last-updated: 2019-02-28
status: implementable
see-also:
replaces:
superseded-by:
---

# Priority and Fairness for API Server Requests
# KEP-1040: Priority and Fairness for API Server Requests

## Table of Contents

@@ -76,6 +53,13 @@ superseded-by:
- [Design Considerations](#design-considerations)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
- [Monitoring Requirements](#monitoring-requirements)
- [Dependencies](#dependencies)
- [Scalability](#scalability)
- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
@@ -91,8 +75,8 @@ For enhancements that make changes to code or processes/procedures in core Kuber

Check these off as they are completed for the Release Team to track. These checklist items _must_ be updated for the enhancement to be released.

- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- [ ] KEP approvers have set the KEP status to `implementable`
- [x] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- [x] KEP approvers have set the KEP status to `implementable`
- [ ] Design details are appropriately documented
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] Graduation criteria is in place
@@ -119,7 +103,7 @@ https://speakerdeck.com/sttts/kubernetes-api-codebase-tour?slide=18 .

## Motivation

Today the apiserver has a simple mechanism for protectimg itself
Today the apiserver has a simple mechanism for protecting itself
against CPU and memory overloads: max-in-flight limits for mutating
and for readonly requests. Apart from the distinction between
mutating and readonly, no other distinctions are made among requests;
@@ -268,7 +252,6 @@ yet but we think may be interesting to consider in the future.
- Thread additional information along the paths needed to enable more
precisely targeted avoidance of priority inversions.


## Proposal

In short, this proposal is about generalizing the existing
@@ -406,7 +389,6 @@ with namespace then the bad behavior will be spread among all the
queues of that schema's priority. Administrators need to make a good
choice for how flows are distinguished.


#### Queue Assignment Proof of Concept

The following golang code shows a simple recursive technique to
@@ -551,7 +533,6 @@ func main() {
}
```


### Resource Limits

#### Primary CPU and Memory Protection
@@ -2025,6 +2006,152 @@ Beta:
- Automatically manages versions of mandatory/suggested configuration
- Discriminates paginated LIST requests

## Production Readiness Review Questionnaire

<!--
Production readiness reviews are intended to ensure that features merging into
Kubernetes are observable, scalable and supportable; can be safely operated in
production environments, and can be disabled or rolled back in the event they
cause increased failures in production. See more in the PRR KEP at
https://git.k8s.io/enhancements/keps/sig-architecture/20190731-production-readiness-review-process.md.
The production readiness review questionnaire must be completed for features in
v1.19 or later, but is non-blocking at this time. That is, approval is not
required in order to be in the release.
In some cases, the questions below should also have answers in `kep.yaml`. This
is to enable automation to verify the presence of the review, and to reduce review
burden and latency.
The KEP must have an approver from the
[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
team. Please reach out on the
[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
you need any help or guidance.
-->

### Feature Enablement and Rollback

* **How can this feature be enabled / disabled in a live cluster?**
  - [x] Feature gate (also fill in values in `kep.yaml`)
    - Feature gate name: APIPriorityAndFairness
    - Components depending on the feature gate:
      - kube-apiserver

* **Does enabling the feature change any default behavior?** Yes, requests that
weren't rejected before could get rejected while requests that were rejected
previously may be allowed. Performance of kube-apiserver under heavy load
will likely be different too.

* **Can the feature be disabled once it has been enabled (i.e. can we roll back
the enablement)?** Yes.

* **What happens if we reenable the feature if it was previously rolled back?**
The feature will be restored.

* **Are there any tests for feature enablement/disablement?** No. Manual tests
will be run before switching the feature gate to beta.
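
As a rough illustration of the feature-gate answer above: kube-apiserver accepts gates through its `--feature-gates` flag as a comma-separated list of `Name=bool` pairs. The sketch below parses such a value and reports whether `APIPriorityAndFairness` is switched on. The helper name `gateEnabled` and the default passed to it are assumptions made for this example; they are not part of the KEP or of kube-apiserver.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gateEnabled is a hypothetical helper: it scans a --feature-gates style
// value ("A=true,B=false") and returns the setting for one gate, falling
// back to def when the gate is not mentioned.
func gateEnabled(flagValue, gate string, def bool) (bool, error) {
	enabled := def
	for _, pair := range strings.Split(flagValue, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue
		}
		name, value, ok := strings.Cut(pair, "=")
		if !ok {
			return false, fmt.Errorf("malformed gate %q, want Name=bool", pair)
		}
		if strings.TrimSpace(name) != gate {
			continue
		}
		b, err := strconv.ParseBool(strings.TrimSpace(value))
		if err != nil {
			return false, fmt.Errorf("gate %q: %w", gate, err)
		}
		enabled = b // in this sketch, a later occurrence overrides an earlier one
	}
	return enabled, nil
}

func main() {
	on, err := gateEnabled("APIPriorityAndFairness=true", "APIPriorityAndFairness", false)
	fmt.Println(on, err) // prints: true <nil>
}
```

A check like this could back the manual enablement tests mentioned above, for example by asserting the parsed gate state before and after a rollback.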

### Rollout, Upgrade and Rollback Planning

* **How can a rollout fail? Can it impact already running workloads?** A
misconfiguration could cause apiserver requests to be rejected, which could
have widespread impact such as: (1) locking an administrator out of their
system, (2) rejecting controller requests, thereby bringing a lot of things
to a halt, (3) dropping node heartbeats, which may result in overloading
other nodes, (4) rejecting kube-proxy requests to apiserver, thereby breaking
existing workloads, (5) dropping leader election requests, resulting in HA
failure, or any combination of the above.

* **What specific metrics should inform a rollback?** An abnormal spike in the
`apiserver_flowcontrol_rejected_requests_total` metric is a sign that
kube-apiserver is rejecting requests, possibly incorrectly. The
`apiserver_flowcontrol_request_queue_length_after_enqueue` metric getting
close to the configured queue length could be a sign of insufficient queue
size (or a system overload), which can be a precursor to rejected requests.

* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
No. Manual tests will be run before switching the feature gate to beta.

* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
fields of API types, flags, etc.?** Yes, `--max-requests-inflight` will be
deprecated in favor of APF.
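
The two rollback signals named above (a spike in rejected requests, and queue length approaching its configured limit) could be checked roughly as in the following sketch. Everything here is an assumption made for illustration: the `Sample` type, the `rollbackSignals` helper, and the threshold parameters are hypothetical, and how the queue-length figure is derived from the metric is left out.

```go
package main

import "fmt"

// Sample is a hypothetical snapshot of values scraped from kube-apiserver.
type Sample struct {
	RejectedTotal float64 // apiserver_flowcontrol_rejected_requests_total, summed over labels
	MaxQueueLen   float64 // largest recently observed queue length after enqueue
}

// rollbackSignals compares two snapshots taken intervalSeconds apart and
// returns warnings. maxRejectsPerSecond and queueFraction are thresholds an
// operator would choose; they are assumptions, not values from the KEP.
func rollbackSignals(prev, cur Sample, intervalSeconds, queueLengthLimit, maxRejectsPerSecond, queueFraction float64) []string {
	var warnings []string
	if intervalSeconds > 0 {
		// A negative rate (counter reset after a restart) never exceeds the threshold.
		rate := (cur.RejectedTotal - prev.RejectedTotal) / intervalSeconds
		if rate > maxRejectsPerSecond {
			warnings = append(warnings,
				fmt.Sprintf("rejections at %.1f/s exceed threshold %.1f/s", rate, maxRejectsPerSecond))
		}
	}
	if queueLengthLimit > 0 && cur.MaxQueueLen >= queueFraction*queueLengthLimit {
		warnings = append(warnings,
			fmt.Sprintf("queue length %.0f is within %.0f%% of limit %.0f",
				cur.MaxQueueLen, queueFraction*100, queueLengthLimit))
	}
	return warnings
}

func main() {
	prev := Sample{RejectedTotal: 100, MaxQueueLen: 10}
	cur := Sample{RejectedTotal: 700, MaxQueueLen: 48}
	for _, w := range rollbackSignals(prev, cur, 60, 50, 5, 0.9) {
		fmt.Println(w)
	}
}
```

The rate is computed from a counter difference rather than the raw total because the metric only ever grows; what counts as "abnormal" depends on the cluster's usual load.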

### Monitoring Requirements

* **How can an operator determine if the feature is in use by workloads?**
If the `apiserver_flowcontrol_dispatched_requests_total` metric is non-zero,
this feature is in use. Note that this isn't a workload feature, but a
control plane one.

* **What are the SLIs (Service Level Indicators) an operator can use to determine
the health of the service?**
  - [x] Metrics
    - Metric name: `apiserver_flowcontrol_request_queue_length_after_enqueue`
    - Components exposing the metric: kube-apiserver

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
No SLOs are proposed for the above SLI.

* **Are there any missing metrics that would be useful to have to improve observability
of this feature?** No.
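
The "is the feature in use" check described above (a non-zero `apiserver_flowcontrol_dispatched_requests_total`) can be sketched as follows, assuming the metrics have already been fetched as Prometheus text exposition. The `sumMetric` helper is hypothetical, and the label names and values in the sample input are illustrative only.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sumMetric adds up every sample of the named metric found in Prometheus
// text exposition format. It is a simplified parser written for this sketch.
func sumMetric(exposition, name string) (float64, error) {
	var total float64
	for _, line := range strings.Split(exposition, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") || !strings.HasPrefix(line, name) {
			continue
		}
		rest := line[len(name):]
		switch {
		case strings.HasPrefix(rest, "{"):
			end := strings.LastIndex(rest, "}")
			if end < 0 {
				return 0, fmt.Errorf("unterminated labels in %q", line)
			}
			rest = rest[end+1:]
		case strings.HasPrefix(rest, " ") || strings.HasPrefix(rest, "\t"):
			// No labels; the value follows directly.
		default:
			continue // a different metric whose name merely shares this prefix
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			return 0, fmt.Errorf("no value in %q", line)
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return 0, fmt.Errorf("bad value in %q: %w", line, err)
		}
		total += v
	}
	return total, nil
}

func main() {
	// Illustrative input; label names and values are assumptions.
	text := `# TYPE apiserver_flowcontrol_dispatched_requests_total counter
apiserver_flowcontrol_dispatched_requests_total{flowSchema="a",priorityLevel="x"} 3
apiserver_flowcontrol_dispatched_requests_total{flowSchema="b",priorityLevel="y"} 4`
	total, err := sumMetric(text, "apiserver_flowcontrol_dispatched_requests_total")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("dispatched=%v in use=%v\n", total, total > 0)
}
```

Summing across all label combinations matches the wording above: any non-zero total means at least one request was dispatched through the flow-control filter.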

### Dependencies

* **Does this feature depend on any specific services running in the cluster?**
No.

### Scalability

<!-- I don't have the knowledge to answer some questions in this section. -->

* **Will enabling / using this feature result in any new API calls?** No.

* **Will enabling / using this feature result in introducing new API types?**
Yes, a new flowcontrol API group, configuration types, and status types are
introduced. See `staging/src/k8s.io/api/flowcontrol/v1alpha1/types.go` for a
full list.

* **Will enabling / using this feature result in any new calls to the cloud
provider?** No.

* **Will enabling / using this feature result in increasing size or count of
the existing API objects?** No.

* **Will enabling / using this feature result in increasing time taken by any
operations covered by [existing SLIs/SLOs]?** Yes, a non-negligible latency
is added to API calls to kube-apiserver. While [preliminary tests](https://github.com/tkashem/graceful/blob/master/priority-fairness/filter-latency/readme.md)
show that the API server latency is still well within the existing SLOs,
more thorough testing needs to be performed.

* **Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, ...) in any components?** The proposed
flowcontrol logic in request handling in kube-apiserver will increase the CPU
and memory overheads involved in serving each request.

### Troubleshooting

* **How does this feature react if the API server and/or etcd is unavailable?**
The feature is itself within the API server. Etcd being unavailable would
likely cause kube-apiserver to fail at processing incoming requests.

* **What are other known failure modes?** A misconfiguration could reject
requests incorrectly. See the rollout and monitoring sections for details on
which metrics to watch to detect such a failure (see the `kep.yaml` file for
the full list of metrics). The following kube-apiserver log messages could
also indicate potential issues:
  - "Unable to list PriorityLevelConfiguration objects"
  - "Unable to list FlowSchema objects"
<!-- Should there be a higher V error message when requests are rejected? -->

* **What steps should be taken if SLOs are not being met to determine the
problem?** No SLOs are proposed.
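
A minimal sketch of scanning kube-apiserver log text for the two messages quoted above. The `countMessages` helper is hypothetical, and the input lines in `main` are made up for illustration; they do not reproduce real kube-apiserver log formatting.

```go
package main

import (
	"fmt"
	"strings"
)

// knownMessages are the two log messages quoted in the troubleshooting text.
var knownMessages = []string{
	"Unable to list PriorityLevelConfiguration objects",
	"Unable to list FlowSchema objects",
}

// countMessages counts how many lines of logText contain each known message.
func countMessages(logText string) map[string]int {
	counts := make(map[string]int)
	for _, line := range strings.Split(logText, "\n") {
		for _, msg := range knownMessages {
			if strings.Contains(line, msg) {
				counts[msg]++
			}
		}
	}
	return counts
}

func main() {
	// Made-up input lines, not real kube-apiserver output.
	sample := "line one: Unable to list FlowSchema objects\n" +
		"line two: unrelated message\n" +
		"line three: Unable to list FlowSchema objects\n"
	counts := countMessages(sample)
	for _, msg := range knownMessages {
		fmt.Printf("%d x %q\n", counts[msg], msg)
	}
}
```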

[supported limits]: https://git.k8s.io/community/sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

## Implementation History

54 changes: 54 additions & 0 deletions keps/sig-api-machinery/1040-priority-and-fairness/kep.yaml
@@ -0,0 +1,54 @@
title: Priority and Fairness for API Server Requests
kep-number: 1040
authors:
  - "@MikeSpreitzer"
  - "@yue9944882"
owning-sig: sig-api-machinery
participating-sigs:
  - wg-multitenancy
  - sig-scheduling
status: implementable
reviewers:
  - "@deads2k"
  - "@lavalamp"
  - "@ahg-g"
  - "@wojtek-t"
approvers:
  - "@deads2k"
  - "@lavalamp"
prr-approvers:
  - "@wojtek-t"
creation-date: 2019-02-28

# The target maturity stage in the current dev cycle for this KEP.
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.20"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.18"
  beta: "v1.20"
  stable: "v1.22"

# The following PRR answers are required at alpha release.
# List the feature gate name and the components for which it must be enabled.
feature-gates:
  - name: APIPriorityAndFairness
    components:
      - kube-apiserver
disable-supported: true

# The following PRR answers are required at beta release.
metrics:
  - apiserver_flowcontrol_rejected_requests_total
  - apiserver_flowcontrol_dispatched_requests_total
  - apiserver_flowcontrol_current_inqueue_requests
  - apiserver_flowcontrol_request_queue_length_after_enqueue
  - apiserver_flowcontrol_request_concurrency_limit
  - apiserver_flowcontrol_current_executing_requests
  - apiserver_flowcontrol_request_wait_duration_seconds
  - apiserver_flowcontrol_request_execution_seconds
