Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for gating node creation on TPM-based remote attestation #23834

Closed
wants to merge 4 commits into from

Conversation

mjg59
Copy link

@mjg59 mjg59 commented Apr 4, 2016

This is a straw-man proposal for the moment - I'm interested in wider feedback on design decisions.

This PR adds an admission controller that flags nodes as unscheduleable and untrusted if they fail to meet an admin-defined TPM policy, making it possible to ensure that systems have booted in a trusted configuration before being used to run jobs. TPM measurements are provided by an external TPM daemon that is part of the go-tspi project and should be run on the workers. A config file is defined for the admission controller, which controls whether the system should permit previously unknown TPMs, whether the system should trigger re-attestation of node state on a regular basis and whether the policy should be sourced from the filesystem or from the k8s API.

The questions I'm looking to resolve are:

  1. Is this considered useful within Kubernetes?
  2. If so, should this be maintained as part of Kubernetes or as a primarily out-of-tree admission controller plugin?
  3. If out of tree, should in-tree API objects be defined or should it use third party resources?

This change is Reviewable

Matthew Garrett added 4 commits April 4, 2016 13:17
Add a flag to permit a node to be marked as untrusted, permitting later
diagnostics.
Pull in the dependencies required to communicate with remote TPMs
Add an admission control plugin for TPMs. This makes it possible to validate
that a worker has booted in a known good configuration before permitting it
to create or update a node.
@k8s-bot
Copy link

k8s-bot commented Apr 4, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in hack/jenkins/job-configs/kubernetes-jenkins-pull/ instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

1 similar comment
@k8s-bot
Copy link

k8s-bot commented Apr 4, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in hack/jenkins/job-configs/kubernetes-jenkins-pull/ instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@@ -1667,6 +1667,9 @@ type NodeSpec struct {

// Unschedulable controls node schedulability of new pods. By default node is schedulable.
Unschedulable bool `json:"unschedulable,omitempty"`

// Untrusted indicates that the node is not trusted. By default node is trusted.
Untrusted bool `json:"untrusted,omitempty"`
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've deliberately left out the corresponding autogenerated code updates for ease of review - I'll obviously add them if we want to merge this at some point.

@k8s-bot
Copy link

k8s-bot commented Apr 4, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in hack/jenkins/job-configs/kubernetes-jenkins-pull/ instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-github-robot k8s-github-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/old-docs size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 4, 2016
@k8s-github-robot
Copy link

Adding label:do-not-merge because PR changes docs prohibited to auto merge
See http://kubernetes.io/editdocs/ for information about editing docs

@k8s-github-robot k8s-github-robot added do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. release-note-label-needed labels Apr 4, 2016
@bgrant0607 bgrant0607 assigned erictune and unassigned davidopp Apr 19, 2016
@bgrant0607
Copy link
Member

@bgrant0607
Copy link
Member

cc @mikedanese

@roberthbailey
Copy link
Contributor

FYI, I'll be OOO until next week so I won't be able to look at this until then.

@mikedanese
Copy link
Member

Why is the TPM not checked by the the CSR approval process described in the TLS Bootstrap proposal? It seems like the approval process could very easily challenge the TPM and approve the CSR only if the node passes the challenge. What are the pros of this approach?

@liggitt
Copy link
Member

liggitt commented Apr 20, 2016

The node TLS serving cert approval process is orthogonal to the node being schedulable.

This particular proposal puts a long running operation in the API path. It seems like it would be better modeled as only letting a "node verifier" process mark nodes schedulable, which is a particular example of the need for field-level authorization. If we had that, nodes could be allowed to register themselves but not mark themselves schedulable, and another process could be authorized to mark them schedulable, moving the TPM process out of the API server.

@mjg59
Copy link
Author

mjg59 commented Apr 20, 2016

@mikedanese Possession of the TLS certificates only indicates that the machine was trusted at the time the TLS certificates were issued - it says nothing about whether the machine is currently in a trustworthy state (eg, an attacker who has a means to reboot the system and is able to inject a different kernel into the deployment mechanism can create an untrusted system that may still have trusted TLS certificates on the filesystem)

@mjg59
Copy link
Author

mjg59 commented Apr 20, 2016

@liggitt My understanding of the current codebase is that there's no explicit sense in which a node decides to be schedulable - instead, nodes request that they not be schedulable. So what I imagine in your suggestion is:

  1. An admission controller that forcibly sets the unschedulable flag
  2. An independent process that sweeps unschedulable nodes, requests attestation and then clears the unschedulable flag

with appropriate authorization configuration such that only the independent process has permission to clear this flag? Right now there isn't an obviously good way for us to distinguish between nodes that are unschedulable because they haven't been validated yet and nodes that are unschedulable because they asked to be (eg, a master) so we'd need to make that possible.

I'm happy to rework this along those lines if there's broad consensus that keeping this out of process would be preferable.

@mikedanese
Copy link
Member

Possession of the TLS certificates only indicates that the machine was trusted at the time the TLS certificates were issued - it says nothing about whether the machine is currently in a trustworthy state (eg, an attacker who has a means to reboot the system and is able to inject a different kernel into the deployment mechanism can create an untrusted system that may still have trusted TLS certificates on the filesystem)

This is greatly mitigated by only issuing short lived certs.

The tpm is only checked on creation and update of the Node object. IIUC, I can compromise a node and the certs will continue to work as long the node object is unmodified. What coverage does this add?

@liggitt
Copy link
Member

liggitt commented Apr 20, 2016

The node's TLS serving certs don't really have anything to do with whether a pod can be scheduled to a node,since the node queries the API for its workloads.

@bgrant0607
Copy link
Member

Re. node schedulability, taints are a better answer than the current unschedulable field:
https://github.com/kubernetes/kubernetes/blob/master/docs/design/taint-toleration-dedicated.md
#17190, #24134

@bgrant0607
Copy link
Member

In fact, if taints were used, I don't think we'd need the API field.

@bgrant0607
Copy link
Member

Re: "3) If out of tree, should in-tree API objects be defined or should it use third party resources?"

If out of tree and using generic extension mechanisms (ThirdPartyResource, annotations, taints, etc.), it would be easier to prototype.

@mikedanese
Copy link
Member

The node's TLS serving certs don't really have anything to do with whether a pod can be scheduled to a node,since the node queries the API for its workloads.

Revoking authorization from a node is a great way to make it unscheduled :)

Ok ignoring the TLS stuff, I can compromise a node in a way that modifies the tpm hash of the machine and the machine will remain "trusted" (i.e. trusted bit set in nodespec) until the node object is touched. What operations in the system should require a strong guarantee (i.e. checked at the time of executing the operation) that the tpm hash is valid? Is it possible using the proposed scheme to make that guarantee? What operations in the system should require a less strong guarantee (i.e. checked sometime in the last minutes) that the tpm hash is valid?

@mjg59
Copy link
Author

mjg59 commented Apr 21, 2016

@mikedanese TPM state is revalidated periodically, and the node marked as untrusted if validation fails. This is independent of whether the node object has been updated (although in practice the node object is updated frequently due to heartbeats - we don't want to depend on that, though, since a compromised node could simply cease sending them)

@mjg59
Copy link
Author

mjg59 commented Apr 21, 2016

@bgrant0607 Taints look pretty much perfect for this, thanks! My current prototype is using annotations rather than extending the node type, and I think we'll continue using them for now in order to provide information about which TPM policy validated the node.

@k8s-github-robot
Copy link

@mjg59 PR needs rebase

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 21, 2016
@mjg59
Copy link
Author

mjg59 commented Apr 28, 2016

Thought about this some more, and reworking it as an admission controller that defaults nodes to a tainted state along with an external agent that performs TPM attestation and clears the taint seems like a preferable approach. It also has the benefit of allowing other trust agents to be plugged in using the same infrastructure.

@k8s-github-robot k8s-github-robot assigned davidopp and unassigned erictune May 6, 2016
@davidopp
Copy link
Member

Taints PR is very close to being merged -- #24134

@k8s-bot
Copy link

k8s-bot commented Jun 14, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

2 similar comments
@k8s-bot
Copy link

k8s-bot commented Jun 15, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-bot
Copy link

k8s-bot commented Jun 23, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@jessfraz
Copy link
Contributor

jessfraz commented Jul 7, 2016

the taints PR was merged 0:) I can review if you want to update

@erictune
Copy link
Member

erictune commented Jul 8, 2016

I don't understand the bigger picture for how this provides security.

If someone gets access to the node's system software and changes it, or otherwise boots a node with non-approved software, then possibly they can:

  • receive traffic which may contain PII via K8s services or ingressess
  • read/write sensitive data from volumes attached to the node
  • use the node's identity on the network to communicate with remote services which use source ip, etc, to identify or block requests.
  • use certs, passwords, or other keys already present on the node to interact with sensitive remote services
  • use Kubernetes certs on the node pull secrets from the apiserver, and then do the previous item.
  • talk to the docker hub and pull and run any images.

Probably I am missing something, but it seem like marking it as "unschedulable" doesn't prevent any of these things from happening.

@k8s-bot
Copy link

k8s-bot commented Jul 12, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

9 similar comments
@k8s-bot
Copy link

k8s-bot commented Jul 12, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-bot
Copy link

k8s-bot commented Jul 12, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-bot
Copy link

k8s-bot commented Jul 12, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-bot
Copy link

k8s-bot commented Jul 12, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-bot
Copy link

k8s-bot commented Jul 12, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-bot
Copy link

k8s-bot commented Jul 12, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-bot
Copy link

k8s-bot commented Jul 19, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-bot
Copy link

k8s-bot commented Jul 25, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-bot
Copy link

k8s-bot commented Jul 31, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@bgrant0607
Copy link
Member

@apelisse The messages from k8s-bot asking whether the PR is reasonable to test don't notify anyone specifically. They should notify the assignee. Ref #869

@bgrant0607
Copy link
Member

ok to test

@k8s-bot
Copy link

k8s-bot commented Jul 31, 2016

GCE e2e build/test failed for commit 144b3c9.

Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake.

@bgrant0607
Copy link
Member

@mjg59 I'm going to close this for now. Please reopen if/when you would like to continue with this, though I suggest starting with a proposal first.

@bgrant0607 bgrant0607 closed this Jul 31, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet