Add support for gating node creation on TPM-based remote attestation #23834

mjg59 · 2016-04-04T20:28:57Z

This is a straw-man proposal for the moment - I'm interested in wider feedback on design decisions.

This PR adds an admission controller that flags nodes as unscheduleable and untrusted if they fail to meet an admin-defined TPM policy, making it possible to ensure that systems have booted in a trusted configuration before being used to run jobs. TPM measurements are provided by an external TPM daemon that is part of the go-tspi project and should be run on the workers. A config file is defined for the admission controller, which controls whether the system should permit previously unknown TPMs, whether the system should trigger re-attestation of node state on a regular basis and whether the policy should be sourced from the filesystem or from the k8s API.

The questions I'm looking to resolve are:

Is this considered useful within Kubernetes?
If so, should this be maintained as part of Kubernetes or as a primarily out-of-tree admission controller plugin?
If out of tree, should in-tree API objects be defined or should it use third party resources?

This change is

Add a flag to permit a node to be marked as untrusted, permitting later diagnostics.

Pull in the dependencies required to communicate with remote TPMs

Add an admission control plugin for TPMs. This makes it possible to validate that a worker has booted in a known good configuration before permitting it to create or update a node.

k8s-bot · 2016-04-04T20:29:14Z

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in hack/jenkins/job-configs/kubernetes-jenkins-pull/ instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

k8s-bot · 2016-04-04T20:30:20Z

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in hack/jenkins/job-configs/kubernetes-jenkins-pull/ instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

mjg59 · 2016-04-04T20:30:29Z

pkg/api/types.go

@@ -1667,6 +1667,9 @@ type NodeSpec struct {

 	// Unschedulable controls node schedulability of new pods. By default node is schedulable.
 	Unschedulable bool `json:"unschedulable,omitempty"`
+
+	// Untrusted indicates that the node is not trusted. By default node is trusted.
+	Untrusted bool `json:"untrusted,omitempty"`


I've deliberately left out the corresponding autogenerated code updates for ease of review - I'll obviously add them if we want to merge this at some point.

k8s-bot · 2016-04-04T20:30:47Z

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in hack/jenkins/job-configs/kubernetes-jenkins-pull/ instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

k8s-github-robot · 2016-04-04T20:38:01Z

Adding label:do-not-merge because PR changes docs prohibited to auto merge
See http://kubernetes.io/editdocs/ for information about editing docs

bgrant0607 · 2016-04-19T22:59:17Z

cc @derekwaynecarr @liggitt @roberthbailey

bgrant0607 · 2016-04-19T22:59:41Z

cc @mikedanese

roberthbailey · 2016-04-19T23:45:07Z

FYI, I'll be OOO until next week so I won't be able to look at this until then.

mikedanese · 2016-04-20T00:03:35Z

Why is the TPM not checked by the the CSR approval process described in the TLS Bootstrap proposal? It seems like the approval process could very easily challenge the TPM and approve the CSR only if the node passes the challenge. What are the pros of this approach?

liggitt · 2016-04-20T00:13:49Z

The node TLS serving cert approval process is orthogonal to the node being schedulable.

This particular proposal puts a long running operation in the API path. It seems like it would be better modeled as only letting a "node verifier" process mark nodes schedulable, which is a particular example of the need for field-level authorization. If we had that, nodes could be allowed to register themselves but not mark themselves schedulable, and another process could be authorized to mark them schedulable, moving the TPM process out of the API server.

mjg59 · 2016-04-20T20:40:48Z

@mikedanese Possession of the TLS certificates only indicates that the machine was trusted at the time the TLS certificates were issued - it says nothing about whether the machine is currently in a trustworthy state (eg, an attacker who has a means to reboot the system and is able to inject a different kernel into the deployment mechanism can create an untrusted system that may still have trusted TLS certificates on the filesystem)

mjg59 · 2016-04-20T20:46:06Z

@liggitt My understanding of the current codebase is that there's no explicit sense in which a node decides to be schedulable - instead, nodes request that they not be schedulable. So what I imagine in your suggestion is:

An admission controller that forcibly sets the unschedulable flag
An independent process that sweeps unschedulable nodes, requests attestation and then clears the unschedulable flag

with appropriate authorization configuration such that only the independent process has permission to clear this flag? Right now there isn't an obviously good way for us to distinguish between nodes that are unschedulable because they haven't been validated yet and nodes that are unschedulable because they asked to be (eg, a master) so we'd need to make that possible.

I'm happy to rework this along those lines if there's broad consensus that keeping this out of process would be preferable.

mikedanese · 2016-04-20T22:02:13Z

Possession of the TLS certificates only indicates that the machine was trusted at the time the TLS certificates were issued - it says nothing about whether the machine is currently in a trustworthy state (eg, an attacker who has a means to reboot the system and is able to inject a different kernel into the deployment mechanism can create an untrusted system that may still have trusted TLS certificates on the filesystem)

This is greatly mitigated by only issuing short lived certs.

The tpm is only checked on creation and update of the Node object. IIUC, I can compromise a node and the certs will continue to work as long the node object is unmodified. What coverage does this add?

liggitt · 2016-04-20T22:08:38Z

The node's TLS serving certs don't really have anything to do with whether a pod can be scheduled to a node,since the node queries the API for its workloads.

bgrant0607 · 2016-04-20T22:16:43Z

Re. node schedulability, taints are a better answer than the current unschedulable field:
https://github.com/kubernetes/kubernetes/blob/master/docs/design/taint-toleration-dedicated.md
#17190, #24134

bgrant0607 · 2016-04-20T22:18:33Z

In fact, if taints were used, I don't think we'd need the API field.

bgrant0607 · 2016-04-20T22:20:25Z

Re: "3) If out of tree, should in-tree API objects be defined or should it use third party resources?"

If out of tree and using generic extension mechanisms (ThirdPartyResource, annotations, taints, etc.), it would be easier to prototype.

mikedanese · 2016-04-20T22:30:14Z

The node's TLS serving certs don't really have anything to do with whether a pod can be scheduled to a node,since the node queries the API for its workloads.

Revoking authorization from a node is a great way to make it unscheduled :)

Ok ignoring the TLS stuff, I can compromise a node in a way that modifies the tpm hash of the machine and the machine will remain "trusted" (i.e. trusted bit set in nodespec) until the node object is touched. What operations in the system should require a strong guarantee (i.e. checked at the time of executing the operation) that the tpm hash is valid? Is it possible using the proposed scheme to make that guarantee? What operations in the system should require a less strong guarantee (i.e. checked sometime in the last minutes) that the tpm hash is valid?

mjg59 · 2016-04-21T00:20:00Z

@mikedanese TPM state is revalidated periodically, and the node marked as untrusted if validation fails. This is independent of whether the node object has been updated (although in practice the node object is updated frequently due to heartbeats - we don't want to depend on that, though, since a compromised node could simply cease sending them)

mjg59 · 2016-04-21T00:22:24Z

@bgrant0607 Taints look pretty much perfect for this, thanks! My current prototype is using annotations rather than extending the node type, and I think we'll continue using them for now in order to provide information about which TPM policy validated the node.

k8s-github-robot · 2016-04-21T01:32:35Z

@mjg59 PR needs rebase

mjg59 · 2016-04-28T19:54:13Z

Thought about this some more, and reworking it as an admission controller that defaults nodes to a tainted state along with an external agent that performs TPM attestation and clears the taint seems like a preferable approach. It also has the benefit of allowing other trust agents to be plugged in using the same infrastructure.

davidopp · 2016-05-15T23:22:05Z

Taints PR is very close to being merged -- #24134

k8s-bot · 2016-06-14T23:07:52Z

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

k8s-bot · 2016-06-15T00:36:00Z

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

k8s-bot · 2016-06-23T22:09:25Z

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

jessfraz · 2016-07-07T17:48:30Z

the taints PR was merged 0:) I can review if you want to update

erictune · 2016-07-08T20:00:58Z

I don't understand the bigger picture for how this provides security.

If someone gets access to the node's system software and changes it, or otherwise boots a node with non-approved software, then possibly they can:

receive traffic which may contain PII via K8s services or ingressess
read/write sensitive data from volumes attached to the node
use the node's identity on the network to communicate with remote services which use source ip, etc, to identify or block requests.
use certs, passwords, or other keys already present on the node to interact with sensitive remote services
use Kubernetes certs on the node pull secrets from the apiserver, and then do the previous item.
talk to the docker hub and pull and run any images.

Probably I am missing something, but it seem like marking it as "unschedulable" doesn't prevent any of these things from happening.

k8s-bot · 2016-07-12T16:19:51Z

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

k8s-bot · 2016-07-12T16:21:07Z

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

k8s-bot · 2016-07-12T16:25:52Z

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

k8s-bot · 2016-07-12T16:31:06Z

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

k8s-bot · 2016-07-12T16:35:11Z

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

k8s-bot · 2016-07-12T16:36:33Z

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

k8s-bot · 2016-07-12T16:40:02Z

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

k8s-bot · 2016-07-19T19:57:42Z

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

k8s-bot · 2016-07-25T23:55:39Z

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

k8s-bot · 2016-07-31T15:36:27Z

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

bgrant0607 · 2016-07-31T16:02:10Z

@apelisse The messages from k8s-bot asking whether the PR is reasonable to test don't notify anyone specifically. They should notify the assignee. Ref #869

bgrant0607 · 2016-07-31T16:02:52Z

ok to test

k8s-bot · 2016-07-31T16:12:25Z

GCE e2e build/test failed for commit 144b3c9.

Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake.

bgrant0607 · 2016-07-31T17:50:27Z

@mjg59 I'm going to close this for now. Please reopen if/when you would like to continue with this, though I suggest starting with a proposal first.

Matthew Garrett added 4 commits April 4, 2016 13:17

Add untrusted flag to nodespec

789f187

Add a flag to permit a node to be marked as untrusted, permitting later diagnostics.

Add TPM dependencies

8a33bc3

Pull in the dependencies required to communicate with remote TPMs

Add TPM admission plugin and support code

81d39f2

Add an admission control plugin for TPMs. This makes it possible to validate that a worker has booted in a known good configuration before permitting it to create or update a node.

Add initial docs

144b3c9

googlebot added the cla: yes label Apr 4, 2016

mjg59 reviewed Apr 4, 2016
View reviewed changes

k8s-github-robot assigned davidopp Apr 4, 2016

k8s-github-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/old-docs size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 4, 2016

k8s-github-robot added do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. release-note-label-needed labels Apr 4, 2016

bgrant0607 assigned erictune and unassigned davidopp Apr 19, 2016

k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 21, 2016

k8s-github-robot assigned davidopp and unassigned erictune May 6, 2016

bgrant0607 closed this Jul 31, 2016

Add support for gating node creation on TPM-based remote attestation #23834

Add support for gating node creation on TPM-based remote attestation #23834

Conversation

mjg59 commented Apr 4, 2016 • edited by mikedanese

k8s-bot commented Apr 4, 2016

k8s-bot commented Apr 4, 2016

mjg59 Apr 4, 2016

Choose a reason for hiding this comment

k8s-bot commented Apr 4, 2016

k8s-github-robot commented Apr 4, 2016

bgrant0607 commented Apr 19, 2016

bgrant0607 commented Apr 19, 2016

roberthbailey commented Apr 19, 2016

mikedanese commented Apr 20, 2016

liggitt commented Apr 20, 2016 • edited

mjg59 commented Apr 20, 2016

mjg59 commented Apr 20, 2016

mikedanese commented Apr 20, 2016

liggitt commented Apr 20, 2016

bgrant0607 commented Apr 20, 2016

bgrant0607 commented Apr 20, 2016

bgrant0607 commented Apr 20, 2016

mikedanese commented Apr 20, 2016

mjg59 commented Apr 21, 2016

mjg59 commented Apr 21, 2016

k8s-github-robot commented Apr 21, 2016

mjg59 commented Apr 28, 2016

davidopp commented May 15, 2016

k8s-bot commented Jun 14, 2016

k8s-bot commented Jun 15, 2016

k8s-bot commented Jun 23, 2016

jessfraz commented Jul 7, 2016

erictune commented Jul 8, 2016

k8s-bot commented Jul 12, 2016

k8s-bot commented Jul 12, 2016

k8s-bot commented Jul 12, 2016

k8s-bot commented Jul 12, 2016

k8s-bot commented Jul 12, 2016

k8s-bot commented Jul 12, 2016

k8s-bot commented Jul 12, 2016

k8s-bot commented Jul 19, 2016

k8s-bot commented Jul 25, 2016

k8s-bot commented Jul 31, 2016

bgrant0607 commented Jul 31, 2016

bgrant0607 commented Jul 31, 2016

k8s-bot commented Jul 31, 2016

bgrant0607 commented Jul 31, 2016

mjg59 commented Apr 4, 2016 •

edited by mikedanese

liggitt commented Apr 20, 2016 •

edited