Proposal: policy-based federated resource placement #292
When admitting requests, the admission controller executes an HTTP API call against OPA. The API call passes the JSON representation of the resource in the message body.
rather than a new admission plugin and a synchronous webhook call at creation time, why not persist the resource without any placement annotations, and let an external controller observe/update the resource with the proper placement annotations?
until those placement annotations are present, no spreading would be done
that would be more in line with the initializer proposal (just implemented weakly using annotations)
@liggitt Yes, I think that the initializer proposal might be a better solution, once it is finalized and implemented. But until then, I don't think that we have a good way to prevent the scheduler from doing placement/spreading until the annotations have been applied by the policy agent. So I'd suggest that we keep the implementation as an admission plugin, and add it to the bucket of admission plugins that need to be ported to the initializer pattern. Make sense, or am I talking garbage? I will confess that I have not yet waded through the full initializer proposal megathread.
@liggitt We need a way to apply the annotations before the controller sees the new resource, otherwise it will spread it, by default.
Thanks for sending the proposal @tsandall
(or the relevant state of the world) changes.

In the proposed design, the policy engine (OPA) is deployed on top of Kubernetes in the same cluster as the Federated Control Plane:
nit: Federation Control Plane
Currently, the placement decision can be controlled for Federated ReplicaSets using the `federation.kubernetes.io/replica-set-preferences` annotation. In the future, the [Cluster
ClusterSelector as proposed in that issue cannot replace replica set preferences.
The cluster selector is for filtering the clusters where a resource is created; replica set preferences allow weights, which helps with burst modes.
I'll update this paragraph to clarify. The intent is to communicate that in the future, placement of other resources may be controlled with policy by defining cluster selector values. Does this make more sense?
Updated mention of Cluster Selector annotation to clarify.
## Design

The proposed design uses of the [Open Policy
uses the
Will fix.
In the proposed design, the policy engine (OPA) is deployed on top of Kubernetes in the same cluster as the Federation Control Plane:

![Architecture](https://docs.google.com/drawings/d/1kL6cgyZyJ4eYNsqvic8r0kqPJxP9LzWVOykkXnTKafU/pub?w=807&h=407)
Following that link downloaded the drawing for me.
https://docs.google.com/drawings/d/1kL6cgyZyJ4eYNsqvic8r0kqPJxP9LzWVOykkXnTKafU worked fine.
Intent was for the image to show inline when viewing the rendered version. I'll see if that other link works.
I think it needs to be the .../pub?... link in order for it to show up inline. I will leave it as-is unless you object.
```json
{
  "result": {
    "federation.kubernetes.io/replica-set-preferences": {
This will only work for replicasets.
We should use the cluster selector annotation suggested in #344 which will work for all resources.
This was only one example where the policy defined a value for the replica-set-preferences annotation. The policy could similarly define a value for the cluster selection annotation and then the result would contain that (or both--it's up to the policy author).
I've updated this section with a comment to explain that other annotations can be returned by the policy engine.
Sorry, I forgot to push the "submit review" button last week.
- Serialization errors
- Request timeouts or other network errors
- Unexpected errors from the policy engine

If the administrator has not defined a policy yet, the response from the policy
In a production deployment this aspect should probably be configurable (whether to admit or deny in the case of failure). And one option should be to deny for now, and retry until policy can be applied, before admitting.
My apologies, I misunderstood your statements above. I now see that you fail closed, not open, as I misunderstood. But I think the comment about retries still applies.
OK, I will update to include a point about the types of errors that we will retry on. I think that i) serialization errors and ii) unexpected internal errors returned by the policy engine SHOULD NOT be retried. Network errors like timeouts SHOULD be retried.
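The retry/no-retry split described above can be sketched as follows. This is a minimal illustration only: the error-kind names, and the decision to treat auth errors and unknown errors as non-retryable, are assumptions rather than part of the proposal.

```python
# Sketch: classify admission-controller failure modes into retryable vs. terminal.
# Category names are illustrative, not from the proposal.

RETRYABLE = {"timeout", "connection_refused", "back_pressure"}
TERMINAL = {"serialization_error", "policy_engine_internal_error", "auth_error"}

def should_retry(error_kind: str) -> bool:
    """Network-level failures are retried; malformed input or unexpected
    policy-engine errors are surfaced immediately."""
    if error_kind in RETRYABLE:
        return True
    if error_kind in TERMINAL:
        return False
    # Unknown errors fail closed without retrying (an assumption, matching
    # the fail-closed design discussed elsewhere in this thread).
    return False
```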
}
```

The configuration also contains a list of usernames that identify senders whose
It's not clear to me how this works. Perhaps elaborate a little further?
This is an instance of premature optimization. The admission controller queries the policy engine when resources are created OR updated. As a result, when the remediator performs a PATCH on a resource's annotations, the admission controller will perform a redundant query against the policy engine. This was intended to avoid that. I'm going to remove this for now as we can start simple and easily add this back if needed.
When policy changes or the environment in which resources are deployed changes (e.g. a cluster’s PCI compliance rating gets up/down-graded), resources might need to be moved for them to obey the placement policy. Sometimes operators may
nit: 'operators' is a term that has come to mean something different in the context of kubernetes.
https://coreos.com/blog/introducing-operators.html
Perhaps use 'administrators' instead here, for that reason?
Will fix.
To automatically reschedule resources onto desired clusters, we introduce the sidecar (**opa-kube-sync**) to receive notifications from OPA that indicate changes in placement that would bring resources back into policy compliance. The sidecar reacts to notifications from OPA by updating the resources in the
We should be explicit here about how these changes in annotations result in actual movement of resources. I think (but you should confirm) that the scheduler in the federated replicaset controller actually performs the move, via its reconciliation process.
Note also that there is the potential for confusion (by readers of your doc) between this form of resource movement, and the Rescheduling Algorithm described in the above doc, so it might be worth adding a brief clarification in that regard here.
OK. I will add more detail to this section to explain the mechanics.
> Note also that there is the potential for confusion (by readers of your doc) between this form of resource movement, and the Rescheduling Algorithm described in the above doc, so it might be worth adding a brief clarification in that regard here.
I'll update the section to clarify that the policy engine only sends the remediator the desired value for the annotations and that the actual rebalancing is still handled by the algorithm in the controller. Thanks!
## Open Questions

1. Instead of storing the policies in ConfigMap resources in the host cluster,
Yeah, I think it's perfectly reasonable to store the configmaps in the same cluster that the policy agent is running in.
Will remove this point.
controller is enabled in the federation-apiserver, it will deny requests if it cannot communicate with the policy engine (because it “fails closed” by design).
1. If policies are stored in a ConfigMap resource then the policy engine must be
ConfigMaps are API resources, so surely these concerns don't apply? Or am I misunderstanding. I understand that ConfigMaps can be consumed as environment variables or attached volumes, but they can surely be consumed as API Resources instead (via a get+watch), and hence changes to the policy could be detected via the watch triggering, and not require an agent restart?
Ah, I see your point. Initially, we can simply support ConfigMap attached as volumes. Eventually, we can support a mode where ConfigMaps are consumed as API resources (via get+watch). Either way, the policies are stored the same way. OK, I like it.
injection via volumemount/envvar is preferable, for minimal coupling to the kubernetes API. volumemounted configmaps will update with changes to configmap resources
I've updated the proposal in response to comments from @nikhiljindal and @quinton-hoole. Let me know if anything else ought to be improved. Thanks!
I haven't had a chance to look at this yet re: admission, but we've got project-wide icebergs around admission and initializers and external calls - I don't want you guys to hit them. I promise to review in the next day or so.
Awaiting response from @liggitt
Sorry, fat fingered the above comment. Happy to wait for your valued input @smarterclayton :-)
ping @smarterclayton
![InitialPlacement](https://docs.google.com/drawings/d/1c9PBDwjJmdv_qVvPq0sQ8RVeZad91vAN1XT6K9Gz9k8/pub?w=812&h=288)

The update is performed by **merging** the annotations in the response with existing annotations on the resource. If there are overlapping keys, the value
Sorry, merging and then choosing the value defined by policy in case of overlapping keys seems contradictory to me.
To clarify: if the replicaset already has the following annotation from the developer:

```json
"clusters": {
  "gce-europe-west1": {
    "weight": 2
  },
  "gce-europe-west2": {
    "weight": 1
  }
}
```

and OPA comes up with the following annotation, based on the "eu-jurisdiction-required": "true" policy annotation:

```json
"clusters": {
  "gce-europe-west1": {
    "weight": 1
  },
  "gce-europe-west2": {
    "weight": 1
  },
  "gce-europe-west3": {
    "weight": 1
  }
}
```

then the final annotation will be the following, right?

```json
"clusters": {
  "gce-europe-west1": {
    "weight": 2
  },
  "gce-europe-west2": {
    "weight": 1
  }
}
```

- The developer's annotation should not be changed if it adheres to the policy.
- It should really be an intersection of the 2 annotations. The developer should be allowed to restrict the clusters further, but should not be allowed to add a cluster that is restricted by admin policy.
- Since the example here uses the replicaset annotation, it is important to note that developers' "weights" and "rebalance" conditions will not be changed. That point is moot for the cluster selector annotation.
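The intersection semantics suggested in the bullets above could look roughly like this. This is an illustrative sketch only (function and variable names are invented); the proposal ultimately leaves merge semantics to the policy itself, as discussed in the replies that follow.

```python
# Sketch: intersect the developer's cluster preferences with the set of
# clusters permitted by policy. Developer weights are preserved; clusters
# not permitted by policy are dropped; clusters that only the policy lists
# are NOT added. Purely illustrative.

def intersect_preferences(developer: dict, policy: dict) -> dict:
    return {
        cluster: prefs
        for cluster, prefs in developer.items()
        if cluster in policy
    }

dev = {
    "gce-europe-west1": {"weight": 2},
    "gce-europe-west2": {"weight": 1},
}
allowed = {
    "gce-europe-west1": {"weight": 1},
    "gce-europe-west2": {"weight": 1},
    "gce-europe-west3": {"weight": 1},
}
merged = intersect_preferences(dev, allowed)
# merged keeps the developer's weights, restricted to permitted clusters
```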
I don't completely understand the example above but this comment raises an interesting question!
There's a conflict if the developer and the policy define different values for the same annotation. In cases like this, resolution should happen in the policy itself because that is the only place where it is known (a) what the policy author intended and (b) what the developer desired. (a) is encoded in the policy and (b) is provided as input to the policy query. In this case, the admission controller doesn't really know.
In other words, it depends on the semantics of the policy. If the policy defines what is required, then the value produced by the policy must be used. On the other hand, if the policy defines what is permitted, then the intersection can be used. Either way, the only place this is well-known is in the policy itself.
The alternative would be to support different merge strategies in the admission controller and having config to pick the one that gets used. The downside to this approach would be the lack of flexibility.
So, I think we should keep the proposal the same here. Does this make sense?
Yes. I agree that the policy engine should be the place to merge. It comes up with the final annotation value and no extra merging logic in admission controller.
I think we are on the same page wrt where the merging should happen. My comment above was not on that. It was on how the merging should happen. I wanted to point out the merge intricacies. I don't think the statement "If there are overlapping keys, the value defined by policy is chosen" is correct, as pointed out in my example above.
- Serialization errors
- Request timeouts or other network errors
- Unexpected errors from the policy engine
Also, Auth errors.
Will update to include auth/n and auth/z errors.
In the event of request timeouts (or other network errors) or back-pressure hints from the policy engine, the admission controller should retry after applying a backoff.
It should also create an event so that developers know why their resources are not being scheduled.
Will update to require events to provide good visibility to the developers.
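A minimal sketch of the backoff behaviour discussed above. The parameter values (base delay, factor, cap, attempt count) and the use of jitter are illustrative assumptions, not part of the proposal.

```python
import random

# Sketch: exponential backoff with jitter for retrying policy-engine queries.
# All parameter values are illustrative defaults.

def backoff_delays(base: float = 0.1, factor: float = 2.0,
                   max_delay: float = 5.0, attempts: int = 5):
    """Yield capped, jittered delays: base * factor**n, at most max_delay."""
    for n in range(attempts):
        delay = min(base * (factor ** n), max_delay)
        # Jitter spreads out retries so many blocked requests don't all
        # hit the policy engine at the same instant.
        yield delay * random.uniform(0.5, 1.0)
```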
"kind": "ReplicaSet",
"metadata": {
  "annotations": {
    "eu-jurisdiction-required": "true",
This should have a prefix to indicate that this annotation is for the policy engine. Something like policy.federation.alpha.kubernetes.io/eu-jurisdiction-required
Sure, we can update the example to namespace the annotation key.
The admission controller is enabled in the federation-apiserver by providing the `--admission-control` command line argument. E.g., `--admission-control=AlwaysAdmit,OPA`.
Will it be enabled by default in alpha? (related to the comment above)
Will update based on decision from thread above.
admission controller), it must have access to the data representing the federated clusters.

To provide OPA with the data representing federated clusters as well as other
Sorry, this is not clear to me. Does it only watch "cluster" resources (to see the region and PCI compliance annotations on them), or other resources as well? If the latter, why does it need to watch other resources?
Policy authors may want to make placement decisions based on resources other than "clusters". Does that make sense?
Yes but it is not clear to me how we support that use case with this proposal. Can you give an example and how it will work?
I've updated the proposal with details on how this'll work. Covered implementation details as well as design goals and future improvements.
OPA. The sidecar (“opa-kube-sync”) is responsible for replicating Kubernetes resources into OPA:

![Replication](https://docs.google.com/drawings/d/1XjdgszYMDHD3hP_2ynEh_R51p7gZRoa1DBTi4yq1rc0/pub?w=812&h=288)
Why can't OPA directly set up a watch with the federation-apiserver? This sidecar seems an unnecessary pass-through.
Right now OPA is not dependent on Kubernetes--this is nice in some ways (modularity) but cumbersome in others (e.g., it requires a sidecar or some other mechanism to integrate). In the future we could look at moving the integration into OPA proper. Either way, both OPA and opa-kube-sync run in the same pod, so it should not affect the content of the proposal. If it's OK with you I will leave it as-is.
ok. thanks for the explanation
To avoid introducing additional persistent state, we propose storing policies in ConfigMap resources in the host cluster. The ConfigMap can be mounted into the policy engine’s container. The policy engine will load the policies on startup.
I think we should mention that eventually we want to introduce a Policy API resource.
Maybe add a "Future Work" section at the bottom.
Oops! I'm sorry I forgot about this. I'll update the Future Work section to cover this.
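The startup loading described above (policies read from a volume-mounted ConfigMap) could be sketched as follows. The directory layout and the helper name are assumptions for illustration; OPA's actual startup loading is its own implementation, and OPA policies are Rego files rather than arbitrary text.

```python
import pathlib

# Sketch: load policy files from the directory where a ConfigMap is mounted.
# Each ConfigMap key appears as one file in the mount directory.

def load_policies(policy_dir: str) -> dict:
    """Return {filename: contents} for every regular file in the mount."""
    policies = {}
    for path in sorted(pathlib.Path(policy_dir).iterdir()):
        if path.is_file():  # skips mount-internal subdirectories
            policies[path.name] = path.read_text()
    return policies
```

Note that volume-mounted ConfigMaps update in place when the ConfigMap resource changes, so a later improvement could re-run this load on change rather than only at startup, as discussed earlier in the thread.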
```json
"opa": {
  "baseURL": "http://opa.federation.svc.cluster.local:8181/v1",
  "annotationsPath": "/data/io/k8s/federation/annotations"
}
```
Admission controller will also require auth credentials to be able to send requests to OPA?
Yes. OPA supports token-based authentication so this should be extended to include a token for the federation-apiserver to use.
- `baseURL` specifies the URL to reach the policy engine.
- `annotationsPath` specifies the path of the document that defines desired
Sorry this is not clear at all. Can you give an example? Is this a list of all valid policy related annotations like "eu-jurisdiction-required", "pci-compliance", etc?
The baseURL and annotationsPath are concatenated to make the full URL used for the policy engine query.
So the actual HTTP request/response pair looks like...

Request:

```
POST <annotationsPath> HTTP/1.1
Content-Type: application/json

{"input": <replica-set-json-representation>}
```

Response:

```
HTTP/1.1 200 OK
Content-Type: application/json

{"result": {"replica-set-preferences": ..., ...}}
```
oh ok. Why keep them as 2 separate fields and not just a single concatenated field?
Done.
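With the two fields collapsed into one, the admission controller's query construction might look like the sketch below. The URL mirrors the example configuration earlier in the thread; the helper name is invented, and sending the request (plus auth tokens, timeouts, and retries) is deliberately left out.

```python
import json

# Sketch: build the policy-engine query from the single configured URL
# (the concatenated baseURL + annotationsPath discussed above).

OPA_QUERY_URL = (
    "http://opa.federation.svc.cluster.local:8181"
    "/v1/data/io/k8s/federation/annotations"
)

def build_query(resource: dict) -> tuple:
    """Return (url, body) for the POST to the policy engine; the resource's
    JSON representation is wrapped in an {"input": ...} envelope."""
    body = json.dumps({"input": resource})
    return OPA_QUERY_URL, body
```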
If the administrator has not defined a policy yet, the response from the policy engine indicates no annotations are defined. The admission controller does not treat this as an error, i.e., it admits the request.
Can you also add notes on how we ensure that we will not break existing clusters that don't have a policy engine running?
Not everyone will use this new feature, and hence most clusters will not have a policy engine running (especially while this feature is alpha).
Some options:
- This new admission controller will be off by default for alpha. When it goes beta/GA, admins can turn it off if they don't want this feature.
- The admission controller will be written in such a way that it queries the policy engine only if there is a policy defined. Admins will then have to ensure that the policy engine is up before defining a policy.

I like the second option, but that requires the admission controller to know how and where the policies are stored. I think we can go with the 1st option.
I also like the second option...
What do you think of changing the proposal as follows:

1. The policy engine would be deployed via the federation-apiserver instead of via the apiserver in the host cluster.
2. The admission controller running in the federation-apiserver can check if the policy engine has been deployed. EDIT: I'm not sure if admission controllers get access to the DB, please correct me if that's wrong; however, when I reviewed existing ones in the apiserver, I'm fairly certain some of them have access to resources outside of the immediate request.
3. If the policy engine HAS NOT been deployed, the admission controller admits immediately; OR, if the policy engine HAS been deployed, the admission controller contacts the policy engine normally and fails closed the way it's already been described.

Your first option also works though, so if you think that's the best approach, let's go with that.
Expanding a bit more on step (2) from above.
The admission controller could use the client provided by the framework to check if a specific service has been created (the service namespace and name could come from configuration). If the service does not exist, the admission controller admits immediately. If the service does exist, the URL would be constructed based on the service spec. What do you think?
Yet another thought...
We have the admission controller enabled by default, however, the default behaviour is to fail-open. When handling a request, the admission controller would read an annotation on a resource representing the policy engine (e.g., the service or replicaset). The annotation would indicate whether the admission controller should fail-open or fail-closed.
This way, once the policy engine was deployed the annotation could be updated to fail-closed. This could be done manually by an admin or automatically by the policy engine itself.
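The fail-open/fail-closed toggle described above might be sketched as follows. The annotation key is hypothetical (modeled on the namespacing suggested earlier in the thread), and the default-open behaviour is exactly the proposal in the comment above, not a settled design.

```python
# Sketch: decide how to treat an unreachable policy engine based on an
# annotation read from the policy-engine resource. Annotation key is
# hypothetical.

FAIL_CLOSED_ANNOTATION = "policy.federation.alpha.kubernetes.io/fail-closed"

def admit_on_engine_failure(engine_annotations: dict) -> bool:
    """Fail open (admit) unless the annotation explicitly requests
    fail-closed behaviour."""
    return engine_annotations.get(FAIL_CLOSED_ANNOTATION, "false") != "true"
```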
I like those ideas, but I think the switch should happen when the admin defines a policy, not when they create the policy engine service.
Let's say I am an admin and I don't know about the internal implementation. I read about this new policy API resource and I create a new policy. I will expect that all new resources I create should instantly start adhering to these policies.
The admission controller should not allow bypassing a defined policy because the policy engine is not running.
The problem with the second approach that I suggested is that we know that we are going to change the way policies are stored (they will be stored in a ConfigMap initially, but that will change to a Policy API object), so we will need to update the admission controller as well then. But I think we can live with that.
Lets say I am an admin and I dont know about the internal implementation. I read about this new policy API resource and I create a new policy. I will expect that all new resources I create should instantly start adhering to these policies. Admission controller should not allow bypassing a defined policy because the policy engine is not running.
This brings up a good point which is that I think any policy resource we add to the API will need to have some "status" recorded on it. This way, policy engines would essentially be controllers (in the Kubernetes sense) for these policy resources. This is important because one of the things we'll want is for the policy to be parsed, compiled, analyzed, etc. before the user is told that it's been accepted. Furthermore, the status could also be made to reflect whether the policy is actually loaded into the policy engine and being enforced. So, creation of these policy resources would probably asynchronous, meaning the client would have to check the status to find out whether it's being enforced.
Given the above, we could enable the admission controller by default and rely on the fail-closed annotation (mentioned in previous comment) as follows...
1. Initially, the apiserver starts and the admission controller will fail open.
2. If at some point the admin toggles the fail-closed annotation to true, then requests would start failing if either (a) the policy engine was not deployed or (b) policies have not been loaded into the engine.
3. If the fail-closed annotation is still missing/false and a policy resource is created, the policy engine would be notified (asynchronously) as usual. The policy engine would parse/compile/install the policy and then call back to the apiserver to automatically set the fail-closed annotation.
We can approximate this as closely as possible today by just doing (3) when the policy engine starts and loads the policy via volume mount.
What do you think?
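A minimal sketch of the admit/reject decision described above. The `failClosed` flag and error handling here are assumptions for illustration, not the actual admission plugin interface:

```go
package main

import (
	"errors"
	"fmt"
)

// errEngineUnavailable stands in for any failure to reach the policy engine.
var errEngineUnavailable = errors.New("policy engine unavailable")

// admit sketches the fail-open/fail-closed behavior discussed above:
// while the fail-closed annotation is missing or false, engine errors are
// tolerated (fail open); once it is true, any engine error rejects the request.
func admit(failClosed bool, queryErr error) (allowed bool, reason string) {
	if queryErr == nil {
		return true, "policy evaluated"
	}
	if failClosed {
		return false, "fail-closed: " + queryErr.Error()
	}
	return true, "fail-open: " + queryErr.Error()
}

func main() {
	for _, tc := range []struct {
		failClosed bool
		err        error
	}{
		{false, errEngineUnavailable}, // engine not deployed yet -> admit
		{true, errEngineUnavailable},  // fail-closed toggled -> reject
		{true, nil},                   // engine healthy -> admit
	} {
		ok, why := admit(tc.failClosed, tc.err)
		fmt.Printf("failClosed=%v allowed=%v (%s)\n", tc.failClosed, ok, why)
	}
}
```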
@nikhiljindal I've posted replies to most/all of your recent comments. Thanks for all the input!
![InitialPlacement](https://docs.google.com/drawings/d/1c9PBDwjJmdv_qVvPq0sQ8RVeZad91vAN1XT6K9Gz9k8/pub?w=812&h=288)

The update is performed by **merging** the annotations in the response with
existing annotations on the resource. If there are overlapping keys, the value
I don't completely understand the example above but this comment raises an interesting question!
There's a conflict if the developer and the policy define different values for the same annotation. In cases like this, resolution should happen in the policy itself because that is the only place where it is known (a) what the policy author intended and (b) what the developer desired. (a) is encoded in the policy and (b) is provided as input to the policy query. In this case, the admission controller doesn't really know.
In other words, it depends on the semantics of the policy. If the policy defines what is required, then the value produced by the policy must be used. On the other hand, if the policy defines what is permitted, then the intersection can be used. Either way, the only place this is well-known is in the policy itself.
The alternative would be to support different merge strategies in the admission controller, with config to pick the one that gets used. The downside to this approach would be the lack of flexibility.
So, I think we should keep the proposal the same here. Does this make sense?
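For illustration, here is a sketch of the merge described in the proposal, under the assumption that the policy engine's value wins on overlapping keys (since, per the above, conflict resolution is expected to happen inside the policy itself):

```go
package main

import "fmt"

// mergeAnnotations merges policy-engine-produced annotations over the
// annotations already present on the resource. Assumption for this sketch:
// on overlapping keys, the policy engine's value overrides the existing one.
func mergeAnnotations(existing, fromPolicy map[string]string) map[string]string {
	merged := make(map[string]string, len(existing)+len(fromPolicy))
	for k, v := range existing {
		merged[k] = v
	}
	for k, v := range fromPolicy {
		merged[k] = v // policy value overrides on conflict
	}
	return merged
}

func main() {
	existing := map[string]string{"team": "payments", "placement": "any"}
	fromPolicy := map[string]string{"placement": "eu-only"}
	fmt.Println(mergeAnnotations(existing, fromPolicy))
}
```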
"kind": "ReplicaSet", | ||
"metadata": { | ||
"annotations": { | ||
"eu-jurisdiction-required": "true", |
Sure, we can update the example to namespace the annotation key.
- Serialization errors
- Request timeouts or other network errors
- Unexpected errors from the policy engine
Will update to include auth/n and auth/z errors.
In the event of request timeouts (or other network errors) or back-pressure
hints from the policy engine, the admission controller should retry after
applying a backoff.
Will update to require events to provide good visibility to the developers.
If the administrator has not defined a policy yet, the response from the policy
engine indicates no annotations are defined. The admission controller does not
treat this as an error, i.e., it admits the request.
I also like the second option...
What do you think of changing the proposal as follows:
- The policy engine would be deployed via the federation-apiserver instead of via the apiserver in the host cluster.
- The admission controller running in the federation-apiserver can check whether the policy engine has been deployed. (EDIT: I'm not sure if admission controllers get access to the DB; please correct me if that's wrong. However, when I reviewed existing ones in the apiserver, I'm fairly certain some of them have access to resources outside of the immediate request.)
- If the policy engine HAS NOT been deployed, the admission controller admits immediately. If the policy engine HAS been deployed, the admission controller contacts the policy engine normally and fails closed the way it's already been described.

Your first option also works though, so if you think that's the best approach, let's go with that.
The admission controller is enabled in the federation-apiserver by providing the
`--admission-control` command line argument. E.g.,
`--admission-control=AlwaysAdmit,OPA`.
That's a good point! `EnforceSchedulingPolicy` is fine by me. What do you think about `EnforceExternalPolicy` or `EnforceExternalAnnotationPolicy`? The latter is a bit long IMO.
The notifications sent to the remediator by OPA specify the new value for
annotations such as replica-set-preferences.

When the remediator component (in the sidecar) receives the notification it
Currently the container gets a kubeconfig mounted. I think this would be the same mechanism used by the federation-controller-manager?
admission controller), it must have access to the data representing the
federated clusters.

To provide OPA with the data representing federated clusters as well as other
Policy authors may want to make placement decisions based on resources other than "clusters". Does that make sense?
OPA. The sidecar (“opa-kube-sync”) is responsible for replicating Kubernetes
resources into OPA:

![Replication](https://docs.google.com/drawings/d/1XjdgszYMDHD3hP_2ynEh_R51p7gZRoa1DBTi4yq1rc0/pub?w=812&h=288)
Right now OPA is not dependent on Kubernetes; this is nice in some ways (modularity) but cumbersome in others (e.g., it requires a sidecar or some other mechanism to integrate). In the future we could look at moving the integration into OPA proper. Either way, both OPA and opa-kube-sync are running in the same pod so it should not affect the content of the proposal. If it's OK with you I will leave it as-is.
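For illustration, the sidecar's replication step could look roughly like the following, using OPA's Data API (`PUT /v1/data/<path>`). The layout under `/v1/data/kubernetes` and the function names are assumptions for this sketch:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// opaDataURL builds the OPA Data API URL under which a replicated resource
// would be stored. The kubernetes/<kind>/<name> layout is an assumption.
func opaDataURL(base, kind, name string) string {
	return fmt.Sprintf("%s/v1/data/kubernetes/%s/%s", base, kind, name)
}

// pushResource replicates one resource into OPA via PUT /v1/data/<path>,
// the role played by the opa-kube-sync sidecar for each watched resource.
func pushResource(client *http.Client, base, kind, name string, resource interface{}) error {
	body, err := json.Marshal(resource)
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPut, opaDataURL(base, kind, name), bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// OPA replies 204 No Content on a successful data write.
	if resp.StatusCode != http.StatusNoContent && resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

func main() {
	fmt.Println(opaDataURL("http://localhost:8181", "clusters", "cluster-1"))
}
```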
To avoid introducing additional persistent state, we propose storing policies in
ConfigMap resources in the host cluster. The ConfigMap can be mounted into the
policy engine’s container. The policy engine will load the policies on startup.
Oops! I'm sorry I forgot about this. I'll update the Future Work section to cover this.
@nikhiljindal I've updated the proposal to address a few of your comments. Also, I had another thought regarding how to have the admission controller enabled by default without affecting existing clusters. Let me know what you think.
Thanks @tsandall. I see 2 main open comments:
Looking forward to your replies.
@nikhiljindal I've pushed another commit addressing the merge question and added more thoughts on the deployment/fail-closed question. It's getting close!
Force-pushed from 11178c5 to fe1aad7.
Summarizing the discussion with @tsandall:
Please don't assume reservations of namespaces without broader coordination. We've discussed reserving
Thanks for pointing that out @liggitt
I'd suggest prefixing with federation as well, e.g.
Force-pushed from fe1aad7 to 1d6db23.
Force-pushed from 1d6db23 to 7e5f275.
/cc @nikhiljindal I've updated the proposal to reflect our discussion and the notes that you captured above. This includes the name the
Thanks @tsandall LGTM. Will merge if no one else has any comments.
…olicy Proposal: policy-based federated resource placement
This is a proposal for enabling policy-based control over placement of federated resources (e.g., ReplicaSets).
ref: kubernetes/kubernetes#39982
/cc @quinton-hoole @nikhiljindal