
Proposal: policy-based federated resource placement #292

Merged
nikhiljindal merged 1 commit into kubernetes:master on Apr 10, 2017

Conversation

tsandall
Contributor

This is a proposal for enabling policy-based control over the placement of federated resources (e.g., ReplicaSets).

ref: kubernetes/kubernetes#39982

/cc @quinton-hoole @nikhiljindal

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on Jan 25, 2017

When admitting requests, the admission controller executes an HTTP API call
against OPA. The API call passes the JSON representation of the resource in the
message body.
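For illustration, a minimal sketch of what this admission-time query could look like in Go; the function name, parameters, and token handling are assumptions for the sketch, not the proposal's exact implementation. The request/response shape follows the `{"input": ...}` / `{"result": ...}` pair shown later in this thread.

```go
package admission

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// queryOPA posts the JSON representation of a federated resource to the
// policy engine and returns the annotation values the policy produced.
// opaURL is assumed to be the concatenation of the configured baseURL and
// annotations path; token is an optional bearer token for OPA.
func queryOPA(opaURL, token string, resource interface{}) (map[string]json.RawMessage, error) {
	// OPA expects the query input under an "input" field.
	body, err := json.Marshal(map[string]interface{}{"input": resource})
	if err != nil {
		return nil, err
	}

	req, err := http.NewRequest(http.MethodPost, opaURL, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}

	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err // network error or timeout
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("policy engine returned status %d", resp.StatusCode)
	}

	// The response carries the desired annotation values under "result".
	var out struct {
		Result map[string]json.RawMessage `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Result, nil
}
```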
Member

rather than a new admission plugin and a synchronous webhook call at creation time, why not persist the resource without any placement annotations, and let an external controller observe/update the resource with the proper placement annotations?

until those placement annotations are present, no spreading would be done

that would be more in line with the initializer proposal (just implemented weakly using annotations)

cc @smarterclayton


@liggitt Yes, I think that the initializer proposal might be a better solution, once it is finalized and implemented. But until then, I don't think that we have a good way to prevent the scheduler from doing placement/spreading until the annotations have been applied by the policy agent. So I'd suggest that we keep the implementation as an admission plugin, and add it to the bucket of admission plugins that need to be ported to the initializer pattern. Make sense, or am I talking garbage? I will confess that I have not yet waded through the full initializer proposal megathread.

@ghost ghost self-assigned this Jan 25, 2017
@ghost

ghost commented Jan 25, 2017

@liggitt We need a way to apply the annotations before the controller sees the new resource, otherwise it will spread it, by default.

@nikhiljindal
Contributor

Thanks for sending the proposal @tsandall
cc @kubernetes/sig-federation-misc

(or the relevant state of the world) changes.

In the proposed design, the policy engine (OPA) is deployed on top of Kubernetes
in the same cluster as the Federated Control Plane:

nit: Federation Control Plane


Currently, the placement decision can be controlled for Federated ReplicaSets
using the `federation.kubernetes.io/replica-set-preferences` annotation. In the
future, the [Cluster
Contributor

ClusterSelector as proposed in that issue cannot replace replica set preferences.
Cluster selector is for filtering the clusters where a resource is created;
replica set preferences allow weights, which helps with burst modes.

Contributor Author

I'll update this paragraph to clarify. The intent is to communicate that in the future, placement of other resources may be controlled with policy by defining cluster selector values. Does this make more sense?

Contributor Author

Updated mention of Cluster Selector annotation to clarify.


## Design

The proposed design uses of the [Open Policy
Contributor

uses the

Contributor Author

Will fix.

In the proposed design, the policy engine (OPA) is deployed on top of Kubernetes
in the same cluster as the Federation Control Plane:

![Architecture](https://docs.google.com/drawings/d/1kL6cgyZyJ4eYNsqvic8r0kqPJxP9LzWVOykkXnTKafU/pub?w=807&h=407)
Contributor

Following that link downloaded the drawing for me.
https://docs.google.com/drawings/d/1kL6cgyZyJ4eYNsqvic8r0kqPJxP9LzWVOykkXnTKafU worked fine.

Contributor Author

Intent was for the image to show inline when viewing the rendered version. I'll see if that other link works.

Contributor Author

I think it needs to be the .../pub?... link in order for it to show up inline. I will leave it as-is unless you object.

```json
{
  "result": {
    "federation.kubernetes.io/replica-set-preferences": {
Contributor

This will only work for replicasets.
We should use the cluster selector annotation suggested in #344 which will work for all resources.

Contributor Author
@tsandall tsandall Feb 8, 2017

This was only one example where the policy defined a value for the replica-set-preferences annotation. The policy could similarly define a value for the cluster selection annotation and then the result would contain that (or both--it's up to the policy author).

Contributor Author

I've updated this section with a comment to explain that other annotations can be returned by the policy engine.

@nikhiljindal nikhiljindal self-assigned this Feb 8, 2017
@ghost ghost left a comment

Sorry, I forgot to push the "submit review" button last week.

- Serialization errors
- Request timeouts or other network errors
- Unexpected errors from the policy engine

If the administrator has not defined a policy yet, the response from the policy

In a production deployment this aspect should probably be configurable (whether to admit or deny in the case of failure). And one option should be to deny for now, and retry until policy can be applied, before admitting.


My apologies, I misunderstood your statements above. I now see that you fail closed, not open, as I misunderstood. But I think the comment about retries still applies.

Contributor Author

OK, I will update to include a point about the types of errors that we will retry on. I think that (i) serialization errors and (ii) unexpected internal errors returned by the policy engine SHOULD NOT be retried. Network errors like timeouts SHOULD be retried.
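A sketch of how that classification could look on the admission-controller side; this is illustrative only, and the error types checked here are just one reasonable way to distinguish the cases described above:

```go
package admission

import (
	"encoding/json"
	"net"
)

// shouldRetry reports whether a failed policy-engine query is worth retrying.
// Per the discussion above: serialization errors and unexpected errors from
// the policy engine are not retried; timeouts and other transient network
// errors are retried (with backoff).
func shouldRetry(err error) bool {
	switch err.(type) {
	case *json.SyntaxError, *json.UnmarshalTypeError, *json.UnsupportedTypeError:
		// Serialization problems will not resolve themselves on retry.
		return false
	}
	if nerr, ok := err.(net.Error); ok {
		// Request timeouts and temporary network failures are retried.
		return nerr.Timeout() || nerr.Temporary()
	}
	// Anything else (e.g. an unexpected internal error reported by the
	// policy engine) causes the request to fail without a retry.
	return false
}
```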

}
```

The configuration also contains a list of usernames that identify senders whose

It's not clear to me how this works. Perhaps elaborate a little further?

Contributor Author

This is an instance of premature optimization. The admission controller queries the policy engine when resources are created OR updated. As a result, when the remediator performs a PATCH on a resource's annotations, the admission controller will perform a redundant query against the policy engine. This was intended to avoid that. I'm going to remove this for now as we can start simple and easily add this back if needed.


When policy changes or the environment in which resources are deployed changes
(e.g. a cluster’s PCI compliance rating gets up/down-graded), resources might
need to be moved for them to obey the placement policy. Sometimes operators may

nit: 'operators' is a term that has come to mean something different in the context of kubernetes.
https://coreos.com/blog/introducing-operators.html
Perhaps use 'administrators' instead here, for that reason?

Contributor Author

Will fix.

To automatically reschedule resources onto desired clusters, we introduce the
sidecar (**opa-kube-sync**) to receive notifications from OPA that indicate
changes in placement that would bring resources back into policy compliance. The
sidecar reacts to notifications from OPA by updating the resources in the

We should be explicit here about how these changes in annotations result in actual movement of resources. I think (but you should confirm) that the scheduler in the federated replicaset controller actually performs the move, via its reconciliation process.

https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federated-replicasets.md

Note also that there is the potential for confusion (by readers of your doc) between this form of resource movement, and the Rescheduling Algorithm described in the above doc, so it might be worth adding a brief clarification in that regard here.

Contributor Author

OK. I will add more detail to this section to explain the mechanics.

> Note also that there is the potential for confusion (by readers of your doc) between this form of resource movement, and the Rescheduling Algorithm described in the above doc, so it might be worth adding a brief clarification in that regard here.

I'll update the section to clarify that the policy engine only sends the remediator the desired value for the annotations and that the actual rebalancing is still handled by the algorithm in the controller. Thanks!
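To make the mechanics concrete, here is a rough sketch of the remediator side. The Notification type is an assumed shape for what OPA pushes to the sidecar, not a format defined by the proposal; the point is that the sidecar only patches annotations, and the controller's reconciliation does the actual movement.

```go
package remediator

import "encoding/json"

// Notification is the (assumed) shape of what OPA pushes to the sidecar:
// which resource to update and the desired annotation values.
type Notification struct {
	Namespace   string            `json:"namespace"`
	Name        string            `json:"name"`
	Annotations map[string]string `json:"annotations"`
}

// annotationPatch builds a merge-patch body that only touches
// metadata.annotations. The remediator would send this PATCH to the
// federation-apiserver; the federated ReplicaSet controller then performs
// the actual rebalancing of replicas through its normal reconciliation.
func annotationPatch(n Notification) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": n.Annotations,
		},
	}
	return json.Marshal(patch)
}
```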


## Open Questions

1. Instead of storing the policies in ConfigMap resources in the host cluster,

Yeah, I think it's perfectly reasonable to store the configmaps in the same cluster that the policy agent is running in.

Contributor Author

Will remove this point.

controller is enabled in the federation-apiserver, it will deny requests if
it cannot communicate with the policy engine (because it “fails closed” by
design).
1. If policies are stored in a ConfigMap resource then the policy engine must be
@ghost ghost Feb 3, 2017

ConfigMaps are API resources, so surely these concerns don't apply? Or am I misunderstanding? I understand that ConfigMaps can be consumed as environment variables or attached volumes, but they can surely be consumed as API resources instead (via a get+watch), and hence changes to the policy could be detected via the watch triggering, and not require an agent restart?

Contributor Author

Ah, I see your point. Initially, we can simply support ConfigMap attached as volumes. Eventually, we can support a mode where ConfigMaps are consumed as API resources (via get+watch). Either way, the policies are stored the same way. OK, I like it.

Member

injection via volumemount/envvar is preferable, for minimal coupling to the kubernetes API. volumemounted configmaps will update with changes to configmap resources


@liggitt Thanks, that's very handy. Is the idea that clients need to poll the volume-mounted configmap? Or does a watch-based approach (e.g. inotify) work better? I've just not had a chance to check whether it works yet.
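For what it's worth, a simple polling sketch (stdlib only) of how the policy engine's sidecar could pick up changes to a volume-mounted ConfigMap; an inotify-based watch would also work, this is just the low-dependency option and the function shape is an assumption.

```go
package policyreload

import (
	"bytes"
	"io/ioutil"
	"log"
	"time"
)

// watchPolicyFile polls a policy file mounted from a ConfigMap volume and
// invokes reload whenever its contents change. Kubelet refreshes mounted
// ConfigMaps by atomically swapping a symlinked directory, so re-reading the
// path (rather than holding a file handle open) picks up updates.
func watchPolicyFile(path string, interval time.Duration, reload func([]byte)) {
	var last []byte
	for {
		data, err := ioutil.ReadFile(path)
		if err != nil {
			log.Printf("reading policy file: %v", err)
		} else if !bytes.Equal(data, last) {
			reload(data)
			last = data
		}
		time.Sleep(interval)
	}
}
```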



@tsandall
Contributor Author

tsandall commented Feb 9, 2017

I've updated the proposal in response to comments from @nikhiljindal and @quinton-hoole. Let me know if anything else ought to be improved. Thanks!

@ghost

ghost commented Feb 9, 2017

Thanks @tsandall .

@liggitt Are you OK with my response above regarding initializers vs admission controllers? I think that's all that we still need to agree on before merging this proposal.

@smarterclayton
Contributor

I haven't had a chance to look at this yet re: admission, but we've got project-wide icebergs around admission and initializers and external calls - I don't want you guys to hit them. I promise to review in the next day or so.

@ghost ghost left a comment

Awaiting response from @liggitt

@ghost

ghost commented Feb 10, 2017

Sorry, fat fingered the above comment. Happy to wait for your valued input @smarterclayton :-)

@ghost

ghost commented Feb 21, 2017

ping @smarterclayton

![InitialPlacement](https://docs.google.com/drawings/d/1c9PBDwjJmdv_qVvPq0sQ8RVeZad91vAN1XT6K9Gz9k8/pub?w=812&h=288)

The update is performed by **merging** the annotations in the response with
existing annotations on the resource. If there are overlapping keys, the value
Contributor
@nikhiljindal nikhiljindal Feb 23, 2017

Sorry, merging and choosing the value defined by policy in case of overlapping keys seems contradictory to me.
To clarify, if the replicaset already has the following annotation from the developer:

"clusters":{
        "gce-europe-west1": {
          "weight": 2
        },
        "gce-europe-west2": {
          "weight": 1
        },
}

and OPA comes up with the following annotation:

"clusters":{
        "gce-europe-west1": {
          "weight": 1
        },
      "gce-europe-west2": {
          "weight": 1
        },
      "gce-europe-west3": {
          "weight": 1
        },
}

based on the "eu-jurisdiction-required": "true" policy annotation, then the final annotation will be the following, right?

"clusters":{
        "gce-europe-west1": {
          "weight": 2
        },
       "gce-europe-west2": {
          "weight": 1
        },
}
  1. The developer's annotation should not be changed if it adheres to the policy.
  2. It should really be an intersection of the 2 annotations. The developer should be allowed to restrict the clusters further, but should not be allowed to add a cluster that is restricted by admin policy.
  3. Since the example here uses the replicaset annotation, it is important to note that the developer's "weights" and "rebalance" conditions will not be changed. That point is moot for the cluster selector annotation.

Contributor Author

I don't completely understand the example above but this comment raises an interesting question!

There's a conflict if the developer and the policy define different values for the same annotation. In cases like this, resolution should happen in the policy itself because that is the only place where it is known (a) what the policy author intended and (b) what the developer desired. (a) is encoded in the policy and (b) is provided as input to the policy query. In this case, the admission controller doesn't really know.

In other words, it depends on the semantics of the policy. If the policy defines what is required, then the value produced by the policy must be used. On the other hand, if the policy defines what is permitted, then the intersection can be used. Either way, the only place this is well-known is in the policy itself.

The alternative would be to support different merge strategies in the admission controller and having config to pick the one that gets used. The downside to this approach would be the lack of flexibility.

So, I think we should keep the proposal the same here. Does this make sense?

Contributor

Yes. I agree that the policy engine should be the place to merge. It comes up with the final annotation value, and there is no extra merging logic in the admission controller.

I think we are on the same page wrt where the merging should happen. My comment above was not on that. It was on how the merging should happen. I wanted to point out the merge intricacies. I don't think the statement "If there are overlapping keys, the value defined by policy is chosen" is correct, as pointed out in my example above.
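Given that agreement, the admission-controller side reduces to something like the following sketch (the helper is hypothetical): the policy engine has already produced the final values for the keys it defines, and the admission controller simply applies them without any merge logic of its own.

```go
package admission

// applyPolicyAnnotations copies the annotation values produced by the policy
// engine onto the resource. The policy engine is responsible for any semantic
// merging of developer intent and policy; the admission controller takes the
// returned values as final for the keys the policy defines, leaving all other
// annotations untouched.
func applyPolicyAnnotations(existing, fromPolicy map[string]string) map[string]string {
	merged := make(map[string]string, len(existing)+len(fromPolicy))
	for k, v := range existing {
		merged[k] = v
	}
	for k, v := range fromPolicy {
		merged[k] = v // keys defined by policy take the value the engine returned
	}
	return merged
}
```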


- Serialization errors
- Request timeouts or other network errors
- Unexpected errors from the policy engine
Contributor

Also, Auth errors.

Contributor Author

Will update to include auth/n and auth/z errors.


In the event of request timeouts (or other network errors) or back-pressure
hints from the policy engine, the admission controller should retry after
applying a backoff.
Contributor
@nikhiljindal nikhiljindal Feb 23, 2017

It should also create an event so that developers know why their resources are not being scheduled.

Contributor Author

Will update to require events to provide good visibility to the developers.

"kind": "ReplicaSet",
"metadata": {
"annotations": {
"eu-jurisdiction-required": "true",
Contributor

This should have a prefix to indicate that this annotation is for the policy engine. Something like policy.federation.alpha.kubernetes.io/eu-jurisdiction-required

Contributor Author

Sure, we can update the example to namespace the annotation key.


The admission controller is enabled in the federation-apiserver by providing the
`--admission-control` command line argument. E.g.,
`--admission-control=AlwaysAdmit,OPA`.
Contributor

Will it be enabled by default in alpha? (related to the comment above)

Contributor Author

Will update based on decision from thread above.

admission controller), it must have access to the data representing the
federated clusters.

To provide OPA with the data representing federated clusters as well as other
Contributor

Sorry this is not clear to me. Does it only watch "cluster" resources (to see the region and pci compliance annotations on them) or other resources as well? Why does it need to watch other resources if it does that?

Contributor Author

Policy authors may want to make placement decisions based on resources other than "clusters". Does that make sense?

Contributor

Yes, but it is not clear to me how we support that use case with this proposal. Can you give an example of how it will work?

Contributor Author
@tsandall tsandall Mar 9, 2017

I've updated the proposal with details on how this'll work. Covered implementation details as well as design goals and future improvements.

OPA. The sidecar (“opa-kube-sync”) is responsible for replicating Kubernetes
resources into OPA:

![Replication](https://docs.google.com/drawings/d/1XjdgszYMDHD3hP_2ynEh_R51p7gZRoa1DBTi4yq1rc0/pub?w=812&h=288)
Contributor

Why can't OPA directly set up a watch with the federation-apiserver? This sidecar seems like an unnecessary pass-through.

Contributor Author

Right now OPA is not dependent on Kubernetes--this is nice in some ways (modularity) but cumbersome in others (e.g., it requires a sidecar or some other mechanism to integrate). In the future we could look at moving the integration into OPA proper. Either way, both OPA and opa-kube-sync run in the same pod, so it should not affect the content of the proposal. If it's OK with you I will leave it as-is.
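As a rough sketch of what the sidecar's replication step could look like: OPA's data API accepts a PUT of a JSON document at a path, so the sidecar can push each watched object into OPA. The data path and function shape below are assumptions for illustration.

```go
package opasync

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// pushCluster replicates one federated Cluster object into OPA's data API so
// that policies can reference it. The sidecar would call this from a watch on
// the federation-apiserver (add/update/delete events). The data path used
// here is illustrative; the real layout is up to the deployment.
func pushCluster(opaBaseURL, name string, cluster interface{}) error {
	body, err := json.Marshal(cluster)
	if err != nil {
		return err
	}
	url := fmt.Sprintf("%s/data/kubernetes/clusters/%s", opaBaseURL, name)
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("OPA data API returned status %d", resp.StatusCode)
	}
	return nil
}
```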

Contributor

ok. thanks for the explanation

To avoid introducing additional persistent state, we propose storing policies in
ConfigMap resources in the host cluster. The ConfigMap can be mounted into the
policy engine’s container. The policy engine will load the policies on startup.

Contributor

I think we should mention that eventually we want to introduce a Policy API resource.
Maybe add a "Future Work" section at the bottom.

Contributor Author

Oops! I'm sorry I forgot about this. I'll update the Future Work section to cover this.

"opa": {
  "baseURL": "http://opa.federation.svc.cluster.local:8181/v1",
  "annotationsPath": "/data/io/k8s/federation/annotations",
}
Contributor

Admission controller will also require auth credentials to be able to send requests to OPA?

Contributor Author

Yes. OPA supports token-based authentication so this should be extended to include a token for the federation-apiserver to use.

```

- `baseURL` specifies the URL to reach the policy engine.
- `annotationsPath` specifies the path of the document that defines desired
Contributor

Sorry this is not clear at all. Can you give an example? Is this a list of all valid policy related annotations like "eu-jurisdiction-required", "pci-compliance", etc?

Contributor Author

The baseURL and annotations path are concatenated to make the full URL used for the policy engine query.

So the actual HTTP request/response pair looks like...

Request:

POST <annotationsPath> HTTP/1.1
Content-Type: application/json

{"input": <replica-set-json-representation>}

Response:

HTTP/1.1 200 OK
Content-Type: application/json

{"result": {"replica-set-preferences": ..., ...}}

Contributor

oh ok. Why keep them as 2 separate fields and not just a single concatenated field?

Contributor Author

Done.

If the administrator has not defined a policy yet, the response from the policy
engine indicates no annotations are defined. The admission controller does not
treat this as an error, i.e., it admits the request.

Contributor

Can you also add notes on how we ensure that we will not break existing clusters that don't have a policy engine running?
Not everyone will use this new feature, and hence most clusters will not have a policy engine running (especially while this feature is alpha).

Some options:

  • This new admission controller will be off by default for alpha. When it goes beta/GA, admins can turn it off if they don't want this feature.
  • Admission controller will be written in such a way that it queries the policy engine only if there is a policy defined. Admins will then have to ensure that the policy engine is up before defining a policy.

I like the second option but that requires the admission controller to know how and where the policies are stored. I think we can go with the 1st option.

Contributor Author
@tsandall tsandall Feb 23, 2017

I also like the second option...

What do you think of changing the proposal as follows:

  1. The policy engine would be deployed via the federation-apiserver instead of via the apiserver in the host cluster.

  2. The admission controller running in the federation-apiserver can check if the policy engine has been deployed. EDIT: I'm not sure if admission controllers get access to the DB, please correct me if that's wrong, however, when I reviewed existing ones in the apiserver, I'm fairly certain some of them have access to resources outside of the immediate request.

  3. If the policy engine HAS NOT been deployed, the admission controller admits immediately.

OR

  1. If the policy engine HAS been deployed, the admission controller contacts the policy engine normally and fails closed the way it's already been described.

Your first option also works though so if you think that's the best approach, let's go with that.

Contributor Author

Expanding a bit more on step (2) from above.

The admission controller could use the client provided by the framework to check if a specific service has been created (the service namespace and name could come from configuration). If the service does not exist, the admission controller admits immediately. If the service does exist, the URL would be constructed based on the service spec. What do you think?

Contributor Author

Yet another thought...

We have the admission controller enabled by default, however, the default behaviour is to fail-open. When handling a request, the admission controller would read an annotation on a resource representing the policy engine (e.g., the service or replicaset). The annotation would indicate whether the admission controller should fail-open or fail-closed.

This way, once the policy engine was deployed the annotation could be updated to fail-closed. This could be done manually by an admin or automatically by the policy engine itself.

Contributor

I like those ideas but I think the switch should happen when admin defines a policy not when they create the policy engine service.

Let's say I am an admin and I don't know about the internal implementation. I read about this new policy API resource and I create a new policy. I will expect that all new resources I create should instantly start adhering to these policies.
Admission controller should not allow bypassing a defined policy because the policy engine is not running.

The problem with the second approach that I suggested is that we know we are going to change the way policies are stored (they will be stored in a configmap initially but will move to a Policy API object), so we will need to update the admission controller then as well. But I think we can live with that.

Contributor Author

> Let's say I am an admin and I don't know about the internal implementation. I read about this new policy API resource and I create a new policy. I will expect that all new resources I create should instantly start adhering to these policies. Admission controller should not allow bypassing a defined policy because the policy engine is not running.

This brings up a good point which is that I think any policy resource we add to the API will need to have some "status" recorded on it. This way, policy engines would essentially be controllers (in the Kubernetes sense) for these policy resources. This is important because one of the things we'll want is for the policy to be parsed, compiled, analyzed, etc. before the user is told that it's been accepted. Furthermore, the status could also be made to reflect whether the policy is actually loaded into the policy engine and being enforced. So, creation of these policy resources would probably be asynchronous, meaning the client would have to check the status to find out whether it's being enforced.

Given the above, we could enable the admission controller by default and rely on the fail-closed annotation (mentioned in previous comment) as follows...

  1. Initially, the apiserver starts and the admission controller will fail open.

  2. If at some point, admin toggles the fail-closed annotation to true, then requests would start failing if either (a) policy engine was not deployed or (b) policies have not been loaded into the engine.

  3. If the fail-closed annotation is still missing/false, and a policy resource is created, the policy engine would be notified (asynchronously) as usual. The policy engine would parse/compile/install the policy and then call back to the apiserver to automatically set the fail-closed annotation.

We can approximate this as close as possible today by just doing (3) when the policy engine starts and loads the policy via volume mount.

What do you think?
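A small sketch of the fail-open/fail-closed decision described above; the annotation-derived failClosed flag and the function shape are assumptions for illustration, not the agreed design.

```go
package admission

// admitDecision captures the fail-open/fail-closed behaviour sketched above.
// failClosed would be derived from the annotation on the policy engine's
// resource (set manually by an admin, or automatically by the policy engine
// once its policies are loaded); queryErr is the outcome of the policy query.
func admitDecision(failClosed bool, queryErr error) (admit bool, reason string) {
	switch {
	case queryErr == nil:
		return true, "policy applied"
	case !failClosed:
		// No policy engine (or it is unreachable) and the admin has not
		// switched to fail-closed: admit without placement annotations.
		return true, "policy engine unavailable; failing open"
	default:
		// Fail closed: reject so the resource is never scheduled without
		// the required placement policy.
		return false, "policy engine unavailable; failing closed"
	}
}
```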

Contributor Author
@tsandall tsandall left a comment

@nikhiljindal I've posted replies to most/all of your recent comments. Thanks for all the input!


"kind": "ReplicaSet",
"metadata": {
"annotations": {
"eu-jurisdiction-required": "true",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we can update the example to namespace the annotation key.







The admission controller is enabled in the federation-apiserver by providing the
`--admission-control` command line argument. E.g.,
`--admission-control=AlwaysAdmit,OPA`.
Contributor Author

That's a good point! EnforceSchedulingPolicy is fine by me. What do you think about EnforceExternalPolicy or EnforceExternalAnnotationPolicy? The latter is a bit long IMO.

The notifications sent to the remediator by OPA specify the new value for
annotations such as replica-set-preferences.

When the remediator component (in the sidecar) receives the notification it
Contributor Author
@tsandall tsandall Feb 24, 2017

Currently the container gets a kubeconfig mounted. I think this would be the same mechanism used by the federation-controller-manager?




@ghost ghost assigned smarterclayton Mar 1, 2017
@tsandall
Contributor Author

tsandall commented Mar 1, 2017

@nikhiljindal I've updated the proposal to address a few of your comments. Also, I had another thought regarding how to have the admission controller enabled by default without affecting existing clusters. Let me know what you think.

@nikhiljindal
Contributor

nikhiljindal commented Mar 8, 2017

Thanks @tsandall. I see 2 main open comments:

Looking forward to your replies

@tsandall
Contributor Author

tsandall commented Mar 9, 2017

@nikhiljindal I've pushed another commit addressing the merge question and added more thoughts on the deployment/fail-closed question. It's getting close!

@nikhiljindal
Contributor

Summarizing the discussion with @tsandall

  • Admission controller will be enabled by default. It will act as a simple pass-through when there are no policies defined (so it won't break existing clusters without a policy engine running). Once a policy is created, the admission controller will return an error if it can't reach the policy engine (see the sketch after this list).
  • Policy engine will return an error if admin and developer policies are in conflict. This error will be returned to the user.
  • In the alpha implementation, policies will be stored as configmaps in a fixed namespace (scheduling-policy) so that the admission controller can fetch them without needing a config. We will move to an API resource in the future, when we won't need the namespace.
  • Policy configmaps will be created in the federation control plane and will use Federation: cluster selector based placement proposal #344 so they are not propagated to underlying clusters.
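A sketch of the pass-through rule from the first bullet, assuming the alpha approach of policy ConfigMaps in a reserved namespace; the interface and helper names are hypothetical and not part of the agreed design.

```go
package admission

// configMapLister abstracts the client call the admission plugin would use
// (e.g. a client-go lister); kept as an interface here to avoid pinning a
// specific client version in this sketch.
type configMapLister interface {
	// CountConfigMaps returns how many ConfigMaps exist in the namespace.
	CountConfigMaps(namespace string) (int, error)
}

// Reserved namespace for policy ConfigMaps in the alpha implementation
// (renamed to kube-federation-scheduling-policy later in this thread).
const policyNamespace = "scheduling-policy"

// policiesDefined implements the pass-through rule from the summary above:
// if there are no policy ConfigMaps in the reserved namespace, the admission
// controller admits requests without consulting the policy engine; once at
// least one policy exists, failures to reach the engine become errors.
func policiesDefined(lister configMapLister) (bool, error) {
	n, err := lister.CountConfigMaps(policyNamespace)
	if err != nil {
		return false, err
	}
	return n > 0, nil
}
```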

@liggitt
Member

liggitt commented Mar 13, 2017

> In alpha implementation, policies will be stored as configmaps in a fixed namespace (scheduling-policy)

Please don't assume reservations of namespaces without broader coordination. We've discussed reserving kube-... namespaces for system use.

@nikhiljindal
Contributor

nikhiljindal commented Mar 13, 2017

Thanks for pointing that out @liggitt
Using kube-scheduling-policy sounds good.
Also note that this is only for the alpha implementation. Eventually, we want to have an API resource for storing the policies when we will not need a reserved namespace.

@liggitt
Member

liggitt commented Mar 13, 2017

I'd suggest prefixing with federation as well, e.g. kube-federation-scheduling-policy... it's a mouthful, but it makes it clear which component is responsible for the namespace.

@tsandall
Contributor Author

/cc @nikhiljindal I've updated the proposal to reflect our discussion and the notes that you captured above. This includes the kube-federation-scheduling-policy namespace name.

@nikhiljindal
Contributor

Thanks @tsandall, LGTM.

Will merge if no one else has any comments.

@nikhiljindal nikhiljindal merged commit 943e9b7 into kubernetes:master Apr 10, 2017