-
Notifications
You must be signed in to change notification settings - Fork 38.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for gating node creation on TPM-based remote attestation #23834
Conversation
Add a flag to permit a node to be marked as untrusted, permitting later diagnostics.
Pull in the dependencies required to communicate with remote TPMs
Add an admission control plugin for TPMs. This makes it possible to validate that a worker has booted in a known good configuration before permitting it to create or update a node.
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
1 similar comment
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
@@ -1667,6 +1667,9 @@ type NodeSpec struct { | |||
|
|||
// Unschedulable controls node schedulability of new pods. By default node is schedulable. | |||
Unschedulable bool `json:"unschedulable,omitempty"` | |||
|
|||
// Untrusted indicates that the node is not trusted. By default node is trusted. | |||
Untrusted bool `json:"untrusted,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've deliberately left out the corresponding autogenerated code updates for ease of review - I'll obviously add them if we want to merge this at some point.
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
Adding label:do-not-merge because PR changes docs prohibited to auto merge |
cc @mikedanese |
FYI, I'll be OOO until next week so I won't be able to look at this until then. |
Why is the TPM not checked by the the CSR approval process described in the TLS Bootstrap proposal? It seems like the approval process could very easily challenge the TPM and approve the CSR only if the node passes the challenge. What are the pros of this approach? |
The node TLS serving cert approval process is orthogonal to the node being schedulable. This particular proposal puts a long running operation in the API path. It seems like it would be better modeled as only letting a "node verifier" process mark nodes schedulable, which is a particular example of the need for field-level authorization. If we had that, nodes could be allowed to register themselves but not mark themselves schedulable, and another process could be authorized to mark them schedulable, moving the TPM process out of the API server. |
@mikedanese Possession of the TLS certificates only indicates that the machine was trusted at the time the TLS certificates were issued - it says nothing about whether the machine is currently in a trustworthy state (eg, an attacker who has a means to reboot the system and is able to inject a different kernel into the deployment mechanism can create an untrusted system that may still have trusted TLS certificates on the filesystem) |
@liggitt My understanding of the current codebase is that there's no explicit sense in which a node decides to be schedulable - instead, nodes request that they not be schedulable. So what I imagine in your suggestion is:
with appropriate authorization configuration such that only the independent process has permission to clear this flag? Right now there isn't an obviously good way for us to distinguish between nodes that are unschedulable because they haven't been validated yet and nodes that are unschedulable because they asked to be (eg, a master) so we'd need to make that possible. I'm happy to rework this along those lines if there's broad consensus that keeping this out of process would be preferable. |
This is greatly mitigated by only issuing short lived certs. The tpm is only checked on creation and update of the Node object. IIUC, I can compromise a node and the certs will continue to work as long the node object is unmodified. What coverage does this add? |
The node's TLS serving certs don't really have anything to do with whether a pod can be scheduled to a node,since the node queries the API for its workloads. |
Re. node schedulability, taints are a better answer than the current |
In fact, if taints were used, I don't think we'd need the API field. |
Re: "3) If out of tree, should in-tree API objects be defined or should it use third party resources?" If out of tree and using generic extension mechanisms (ThirdPartyResource, annotations, taints, etc.), it would be easier to prototype. |
Revoking authorization from a node is a great way to make it unscheduled :) Ok ignoring the TLS stuff, I can compromise a node in a way that modifies the tpm hash of the machine and the machine will remain "trusted" (i.e. trusted bit set in nodespec) until the node object is touched. What operations in the system should require a strong guarantee (i.e. checked at the time of executing the operation) that the tpm hash is valid? Is it possible using the proposed scheme to make that guarantee? What operations in the system should require a less strong guarantee (i.e. checked sometime in the last minutes) that the tpm hash is valid? |
@mikedanese TPM state is revalidated periodically, and the node marked as untrusted if validation fails. This is independent of whether the node object has been updated (although in practice the node object is updated frequently due to heartbeats - we don't want to depend on that, though, since a compromised node could simply cease sending them) |
@bgrant0607 Taints look pretty much perfect for this, thanks! My current prototype is using annotations rather than extending the node type, and I think we'll continue using them for now in order to provide information about which TPM policy validated the node. |
@mjg59 PR needs rebase |
Thought about this some more, and reworking it as an admission controller that defaults nodes to a tainted state along with an external agent that performs TPM attestation and clears the taint seems like a preferable approach. It also has the benefit of allowing other trust agents to be plugged in using the same infrastructure. |
Taints PR is very close to being merged -- #24134 |
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
2 similar comments
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
the taints PR was merged 0:) I can review if you want to update |
I don't understand the bigger picture for how this provides security. If someone gets access to the node's system software and changes it, or otherwise boots a node with non-approved software, then possibly they can:
Probably I am missing something, but it seem like marking it as "unschedulable" doesn't prevent any of these things from happening. |
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
9 similar comments
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test". This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry. Otherwise, if this message is too spammy, please complain to ixdy. |
ok to test |
GCE e2e build/test failed for commit 144b3c9. Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake. |
@mjg59 I'm going to close this for now. Please reopen if/when you would like to continue with this, though I suggest starting with a proposal first. |
This is a straw-man proposal for the moment - I'm interested in wider feedback on design decisions.
This PR adds an admission controller that flags nodes as unscheduleable and untrusted if they fail to meet an admin-defined TPM policy, making it possible to ensure that systems have booted in a trusted configuration before being used to run jobs. TPM measurements are provided by an external TPM daemon that is part of the go-tspi project and should be run on the workers. A config file is defined for the admission controller, which controls whether the system should permit previously unknown TPMs, whether the system should trigger re-attestation of node state on a regular basis and whether the policy should be sourced from the filesystem or from the k8s API.
The questions I'm looking to resolve are:
This change is