-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: policy-based federated resource placement #292
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,371 @@ | ||
# Policy-based Federated Resource Placement | ||
|
||
This document proposes a design for policy-based control over placement of | ||
Federated resources. | ||
|
||
Tickets: | ||
|
||
- https://github.com/kubernetes/kubernetes/issues/39982 | ||
|
||
Authors: | ||
|
||
- Torin Sandall (torin@styra.com, tsandall@github) and Tim Hinrichs | ||
(tim@styra.com). | ||
- Based on discussions with Quinton Hoole (quinton.hoole@huawei.com, | ||
quinton-hoole@github), Nikhil Jindal (nikhiljindal@github). | ||
|
||
## Background | ||
|
||
Resource placement is a policy-rich problem affecting many deployments. | ||
Placement may be based on company conventions, external regulation, pricing and | ||
performance requirements, etc. Furthermore, placement policies evolve over time | ||
and vary across organizations. As a result, it is difficult to anticipate the | ||
policy requirements of all users. | ||
|
||
A simple example of a placement policy is | ||
|
||
> Certain apps must be deployed on clusters in EU zones with sufficient PCI | ||
> compliance. | ||
|
||
The [Kubernetes Cluster | ||
Federation](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federation.md#policy-engine-and-migrationreplication-controllers) | ||
design proposal includes a pluggable policy engine component that decides how | ||
applications/resources are placed across federated clusters. | ||
|
||
Currently, the placement decision can be controlled for Federated ReplicaSets | ||
using the `federation.kubernetes.io/replica-set-preferences` annotation. In the | ||
future, the [Cluster | ||
Selector](https://github.com/kubernetes/kubernetes/issues/29887) annotation will | ||
provide control over placement of other resources. The proposed design supports | ||
policy-based control over both of these annotations (as well as others). | ||
|
||
This proposal is based on a POC built using the Open Policy Agent project. [This | ||
short video (7m)](https://www.youtube.com/watch?v=hRz13baBhfg) provides an | ||
overview and demo of the POC. | ||
|
||
## Design | ||
|
||
The proposed design uses the [Open Policy Agent](http://www.openpolicyagent.org) | ||
project (OPA) to realize the policy engine component from the Federation design | ||
proposal. OPA is an open-source, general purpose policy engine that includes a | ||
declarative policy language and APIs to answer policy queries. | ||
|
||
The proposed design allows administrators to author placement policies and have | ||
them automatically enforced when resources are created or updated. The design | ||
also covers support for automatic remediation of resource placement when policy | ||
(or the relevant state of the world) changes. | ||
|
||
In the proposed design, the policy engine (OPA) is deployed on top of Kubernetes | ||
in the same cluster as the Federation Control Plane: | ||
|
||
![Architecture](https://docs.google.com/drawings/d/1kL6cgyZyJ4eYNsqvic8r0kqPJxP9LzWVOykkXnTKafU/pub?w=807&h=407) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Following that link downloaded the drawing for me. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Intent was for the image to show inline when viewing the rendered version. I'll see if that other link works. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it needs to be the |
||
|
||
The proposed design is divided into following sections: | ||
|
||
1. Control over the initial placement decision (admission controller) | ||
1. Remediation of resource placement (opa-kube-sync/remediator) | ||
1. Replication of Kubernetes resources (opa-kube-sync/replicator) | ||
1. Management and storage of policies (ConfigMap) | ||
|
||
### 1. Initial Placement Decision | ||
|
||
To provide policy-based control over the initial placement decision, we propose | ||
a new admission controller that integrates with OPA: | ||
|
||
When admitting requests, the admission controller executes an HTTP API call | ||
against OPA. The API call passes the JSON representation of the resource in the | ||
message body. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. rather than a new admission plugin and a synchronous webhook call at creation time, why not persist the resource without any placement annotations, and let an external controller observe/update the resource with the proper placement annotations? until those placement annotations are present, no spreading would be done that would be more in line with the initializer proposal (just implemented weakly using annotations) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @liggitt Yes, I think that the initializer proposal might be a better solution, once it is finalized and implemented. But until then, I don't think that we have a good way to prevent the scheduler from doing placement/spreading until the annotations have been applied by the policy agent. So I'd suggest that we keep the implementation as a admission plugin, and add it to the bucket of admission plugins that need to be ported to the initializer pattern. Make sense, or am I talking garbage? I will confess that I have not yet waded through the full initializer proposal megathread. |
||
|
||
The response from OPA contains the desired value for the resource’s annotations | ||
(defined in policy by the administrator). The admission controller updates the | ||
annotations on the resource and admits the request: | ||
|
||
![InitialPlacement](https://docs.google.com/drawings/d/1c9PBDwjJmdv_qVvPq0sQ8RVeZad91vAN1XT6K9Gz9k8/pub?w=812&h=288) | ||
|
||
The admission controller updates the resource by **merging** the annotations in | ||
the response with existing annotations on the resource. If there are overlapping | ||
annotation keys the admission controller replaces the existing value with the | ||
value from the response. | ||
|
||
#### Example Policy Engine Query: | ||
|
||
```http | ||
POST /v1/data/io/k8s/federation/admission HTTP/1.1 | ||
Content-Type: application/json | ||
``` | ||
|
||
```json | ||
{ | ||
"input": { | ||
"apiVersion": "extensions/v1beta1", | ||
"kind": "ReplicaSet", | ||
"metadata": { | ||
"annotations": { | ||
"policy.federation.alpha.kubernetes.io/eu-jurisdiction-required": "true", | ||
"policy.federation.alpha.kubernetes.io/pci-compliance-level": "2" | ||
}, | ||
"creationTimestamp": "2017-01-23T16:25:14Z", | ||
"generation": 1, | ||
"labels": { | ||
"app": "nginx-eu" | ||
}, | ||
"name": "nginx-eu", | ||
"namespace": "default", | ||
"resourceVersion": "364993", | ||
"selfLink": "/apis/extensions/v1beta1/namespaces/default/replicasets/nginx-eu", | ||
"uid": "84fab96d-e188-11e6-ac83-0a580a54020e" | ||
}, | ||
"spec": { | ||
"replicas": 4, | ||
"selector": {...}, | ||
"template": {...}, | ||
} | ||
} | ||
} | ||
``` | ||
|
||
#### Example Policy Engine Response: | ||
|
||
```http | ||
HTTP/1.1 200 OK | ||
Content-Type: application/json | ||
``` | ||
|
||
```json | ||
{ | ||
"result": { | ||
"annotations": { | ||
"federation.kubernetes.io/replica-set-preferences": { | ||
"clusters": { | ||
"gce-europe-west1": { | ||
"weight": 1 | ||
}, | ||
"gce-europe-west2": { | ||
"weight": 1 | ||
} | ||
}, | ||
"rebalance": true | ||
} | ||
} | ||
} | ||
} | ||
``` | ||
|
||
> This example shows the policy engine returning the replica-set-preferences. | ||
> The policy engine could similarly return a desired value for other annotations | ||
> such as the Cluster Selector annotation. | ||
|
||
#### Conflicts | ||
|
||
A conflict arises if the developer and the policy define different values for an | ||
annotation. In this case, the developer's intent is provided as a policy query | ||
input and the policy author's intent is encoded in the policy itself. Since the | ||
policy is the only place where both the developer and policy author intents are | ||
known, the policy (or policy engine) should be responsible for resolving the | ||
conflict. | ||
|
||
There are a few options for handling conflicts. As a concrete example, this is | ||
how a policy author could handle invalid clusters/conflicts: | ||
|
||
``` | ||
package io.k8s.federation.admission | ||
|
||
errors["requested replica-set-preferences includes invalid clusters"] { | ||
invalid_clusters = developer_clusters - policy_defined_clusters | ||
invalid_clusters != set() | ||
} | ||
|
||
annotations["replica-set-preferences"] = value { | ||
value = developer_clusters & policy_defined_clusters | ||
} | ||
|
||
# Not shown here: | ||
# | ||
# policy_defined_clusters[...] { ... } | ||
# developer_clusters[...] { ... } | ||
``` | ||
|
||
The admission controller will execute a query against | ||
/io/k8s/federation/admission and if the policy detects an invalid cluster, the | ||
"errors" key in the response will contain a non-empty array. In this case, the | ||
admission controller will deny the request. | ||
|
||
```http | ||
HTTP/1.1 200 OK | ||
Content-Type: application/json | ||
``` | ||
|
||
```json | ||
{ | ||
"result": { | ||
"errors": [ | ||
"requested replica-set-preferences includes invalid clusters" | ||
], | ||
"annotations": { | ||
"federation.kubernetes.io/replica-set-preferences": { | ||
... | ||
} | ||
} | ||
} | ||
} | ||
``` | ||
|
||
This example shows how the policy could handle conflicts when the author's | ||
intent is to define clusters that MAY be used. If the author's intent is to | ||
define what clusters MUST be used, then the logic would not use intersection. | ||
|
||
#### Configuration | ||
|
||
The admission controller requires configuration for the OPA endpoint: | ||
|
||
``` | ||
{ | ||
"EnforceSchedulingPolicy": { | ||
"url": “https://opa.federation.svc.cluster.local:8181/v1/data/io/k8s/federation/annotations”, | ||
"token": "super-secret-token-value" | ||
} | ||
} | ||
``` | ||
|
||
- `url` specifies the URL of the policy engine API to query. The query response | ||
contains the annotations to apply to the resource. | ||
- `token` specifies a static token to use for authentication when contacting the | ||
policy engine. In the future, other authentication schemes may be supported. | ||
|
||
The configuration file is provided to the federation-apiserver with the | ||
`--admission-control-config-file` command line argument. | ||
|
||
The admission controller is enabled in the federation-apiserver by providing the | ||
`--admission-control` command line argument. E.g., | ||
`--admission-control=AlwaysAdmit,EnforceSchedulingPolicy`. | ||
|
||
The admission controller will be enabled by default. | ||
|
||
#### Error Handling | ||
|
||
The admission controller is designed to **fail closed** if policies have been | ||
created. | ||
|
||
Request handling may fail because of: | ||
|
||
- Serialization errors | ||
- Request timeouts or other network errors | ||
- Authentication or authorization errors from the policy engine | ||
- Other unexpected errors from the policy engine | ||
|
||
In the event of request timeouts (or other network errors) or back-pressure | ||
hints from the policy engine, the admission controller should retry after | ||
applying a backoff. The admission controller should also create an event so that | ||
developers can identify why their resources are not being scheduled. | ||
|
||
Policies are stored as ConfigMap resources in a well-known namespace. This | ||
allows the admission controller to check if one or more policies exist. If one | ||
or more policies exist, the admission controller will fail closed. Otherwise | ||
the admission controller will **fail open**. | ||
|
||
### 2. Remediation of Resource Placement | ||
|
||
When policy changes or the environment in which resources are deployed changes | ||
(e.g. a cluster’s PCI compliance rating gets up/down-graded), resources might | ||
need to be moved for them to obey the placement policy. Sometimes administrators | ||
may decide to remediate manually, other times they may want Kubernetes to | ||
remediate automatically. | ||
|
||
To automatically reschedule resources onto desired clusters, we introduce a | ||
remediator component (**opa-kube-sync**) that is deployed as a sidecar with OPA. | ||
|
||
![Remediation](https://docs.google.com/drawings/d/1ehuzwUXSpkOXzOUGyBW0_7jS8pKB4yRk_0YRb1X4zsY/pub?w=812&h=288) | ||
|
||
The notifications sent to the remediator by OPA specify the new value for | ||
annotations such as replica-set-preferences. | ||
|
||
When the remediator component (in the sidecar) receives the notification it | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How does the remediator get auth credentials to send that request to federation-apiserver? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Currently the container gets a kubeconfig mounted. I think this would be the same mechanism used by the federation-controller-manager? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes. add that to the doc There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done. |
||
sends a PATCH request to the federation-apiserver to update the affected | ||
resource. This way, the actual rebalancing of ReplicaSets is still handled by | ||
the [Rescheduling | ||
Algorithm](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federated-replicasets.md) | ||
in the Federated ReplicaSet controller. | ||
|
||
The remediator component must be deployed with a kubeconfig for the | ||
federation-apiserver so that it can identify itself when sending the PATCH | ||
requests. We can use the same mechanism that is used for the | ||
federation-controller-manager (which also needs ot identify itself when sending | ||
requests to the federation-apiserver.) | ||
|
||
### 3. Replication of Kubernetes Resources | ||
|
||
Administrators must be able to author policies that refer to properties of | ||
Kubernetes resources. For example, assuming the following sample policy (in | ||
English): | ||
|
||
> Certain apps must be deployed on Clusters in EU zones with sufficient PCI | ||
> compliance. | ||
|
||
The policy definition must refer to the geographic region and PCI compliance | ||
rating of federated clusters. Today, the geographic region is stored as an | ||
attribute on the cluster resource and the PCI compliance rating is an example of | ||
data that may be included in a label or annotation. | ||
|
||
When the policy engine is queried for a placement decision (e.g., by the | ||
admission controller), it must have access to the data representing the | ||
federated clusters. | ||
|
||
To provide OPA with the data representing federated clusters as well as other | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry this is not clear to me. Does it only watch "cluster" resources (to see the region and pci compliance annotations on them) or other resources as well? Why does it need to watch other resources if it does that? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Policy authors may want to make placement decisions based on resources other than "clusters". Does that make sense? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes but it is not clear to me how we support that use case with this proposal. Can you give an example and how it will work? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've updated the proposal with details on how this'll work. Covered implementation details as well as design goals and future improvements. |
||
Kubernetes resource types (such as federated ReplicaSets), we use a sidecar | ||
container that is deployed alongside OPA. The sidecar (“opa-kube-sync”) is | ||
responsible for replicating Kubernetes resources into OPA: | ||
|
||
![Replication](https://docs.google.com/drawings/d/1XjdgszYMDHD3hP_2ynEh_R51p7gZRoa1DBTi4yq1rc0/pub?w=812&h=288) | ||
|
||
The sidecar/replicator component will implement the (somewhat common) list/watch | ||
pattern against the federation-apiserver: | ||
|
||
- Initially, it will GET all resources of a particular type. | ||
- Subsequently, it will GET with the **watch** and **resourceVersion** | ||
parameters set and process add, remove, update events accordingly. | ||
|
||
Each resource received by the sidecar/replicator component will be pushed into | ||
OPA. The sidecar will likely rely on one of the existing Kubernetes Go client | ||
libraries to handle the low-level list/watch behavior. | ||
|
||
As new resource types are introduced in the federation-apiserver, the | ||
sidecar/replicator component will need to be updated to support them. As a | ||
result, the sidecar/replicator component must be designed so that it is easy to | ||
add support for new resource types. | ||
|
||
Eventually, the sidecar/replicator component may allow admins to configure which | ||
resource types are replicated. As an optimization, the sidecar may eventually | ||
analyze policies to determine which resource properties are requires for policy | ||
evaluation. This would allow it to replicate the minimum amount of data into | ||
OPA. | ||
|
||
### 4. Policy Management | ||
|
||
Policies are written in a text-based, declarative language supported by OPA. The | ||
policies can be loaded into the policy engine either on startup or via HTTP | ||
APIs. | ||
|
||
To avoid introducing additional persistent state, we propose storing policies | ||
in ConfigMap resources in the Federation Control Plane inside a well-known | ||
namespace (e.g., `kube-federationscheduling-policy`). The ConfigMap resources | ||
will be replicated into the policy engine by the sidecar. | ||
|
||
The sidecar can establish a watch on the ConfigMap resources in the Federation | ||
Control Plane. This will enable hot-reloading of policies whenever they change. | ||
|
||
## Applicability to Other Policy Engines | ||
|
||
This proposal was designed based on a POC with OPA, but it can be applied to | ||
other policy engines as well. The admission and remediation components are | ||
comprised of two main pieces of functionality: (i) applying annotation values to | ||
federated resources and (ii) asking the policy engine for annotation values. The | ||
code for applying annotation values is completely independent of the policy | ||
engine. The code that asks the policy engine for annotation values happens both | ||
within the admission and remediation components. In the POC, asking OPA for | ||
annotation values amounts to a simple RESTful API call that any other policy | ||
engine could implement. | ||
|
||
## Future Work | ||
|
||
- This proposal uses ConfigMaps to store and manage policies. In the future, we | ||
want to introduce a first-class **Policy** API resource. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ClusterSelector as proposed in that issue can not replace replica set preferences.
Cluster selector is for filtering the clusters where a resource is created.
replica set preferences allows weights, which helps bursts modes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update this paragraph to clarify. The intent is to communicate that in the future, placement of other resources may be controlled with policy by defining cluster selector values. Does this make more sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated mention of Cluster Selector annotation to clarify.