Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFE: implement an audit container identifier #90

Open
rgbriggs opened this Issue Jun 1, 2018 · 16 comments

Comments

Projects
None yet
3 participants
@rgbriggs
Copy link
Contributor

rgbriggs commented Jun 1, 2018

Split this off from #32, leaving that issue for addressing namespace identifiers in audit records, should they be deemed necessary.

Implement an audit container identifier.

Add the ability to identify a task's assigned container using an audit container identifier. The registration process involves writing a u64 to file audit_containerid in the /proc filesystem under the PID of the target container task. This will result in a CONTAINER_ID record to log the event. Subsequent audit events that involve this task will have an auxiliary record CONTAINER to identify the container involved.

Depends: linux-audit/audit-userspace#51
See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID

History:

@rgbriggs

This comment has been minimized.

Copy link
Contributor Author

rgbriggs commented Jun 6, 2018

@rgbriggs

This comment has been minimized.

Copy link
Contributor Author

rgbriggs commented Jul 31, 2018

@nhorman

This comment has been minimized.

Copy link

nhorman commented Dec 21, 2018

I have a question regarding the container id assignment mechanism. Curently the patch in the v4 posted series loosely defines a container by virtue of the fact that a nonce is assigned to it from user space. This is nice for a few reasons, primarily because it excuses the kernel from having to have any definition of what a container is (ostensibly we consider a container to be a unique collection of namespaces, and possibly cgroups, but the kernel wants no knowledge of that). It seems this implementation of container assignment suffers from a few shortcommings however:

  1. There appears to be no mechanism that prevents a container from modifying its own id (presuming CAP_SYS_ADMIN is not removed from its capability set, which I think doesn't occur for trusted containers)
  2. There appears to be no mechanism for preventing a container from changing namespaces once an id is set for it, meaning that the correlation between the id and whatever you want to define in userspace as being a 'container' is lost
  3. There is no current mechanism that prevents multiple unique 'containers' from sharing an id

All of these problems are of course fixable in the current implementation, but for the sake of argument, I'd like to propose an alternate solution that may (I've not 100% thought it out yet) reduce the complexity of the code, make the semantics of the control of the container id from userspace more clear, and solve the above problems

To start, I'd like to define (from a userspace perspective strictly) what we see as a container:

A container is defined by user space as a process that have a specific collection of (namespaces and cgroups) AND does NOT have the abilty to enter new namespaces and cgroups

I would propose that we implement the following in the kernel to enforce that policy:

  1. Create a new capability, CAP_AUDIT_SETNS, which gates the ability to successfully call the setns system call. This capability is inherited by its children. If CAP_AUDIT_SETNS is re-granted to a process, its contid (see 2, below), is reset to AUDIT_CID_UNSET).

  2. Create a new field in the audit struct per task, contid (which exists in this patchset already). This field is assigned a nonce (likely using the ida_simple_get api) if the CAP_AUDIT_SETNS capability is dropped, which would also generate an audit log message (simmilar to the one generated in audit_set_contid)

  3. If a process calls fork/clone with any of the CLONE_ flag set, the CAP_AUDIT_SETNS capability is restored to the child process, and its contid value is reset to AUDIT_CID_UNSET (allowing for a process in a container to start a new container, but not to change its own 'container nature')

I think this approach may have a few advantages:
a) It provides a established mechanism (the capabilities subsystem), to provide a gated point at which the above container definition becomes locked from the perspective of the initial process inside the container. While the parent may re-grant the privlidge, it creates an environment whereby dropping the capability both locks the nature of what the container is (a unique set of namespaces and cgroups), and provides a unique id to the container for audit purposes

b) It removes the need to manage container id's from userspace, reducing the risk for errors in assigning duplicate identifiers, and reduces code size by removing the need to create a new proc file and validate the information passed through it.

c) It reduces complexity in the kernel code. The kernel can guarantee unique identifiers using the gating CAP_AUDIT_SETNS flag as the moment to generate the id.

This also (may) smooth upstream acceptace, because from a kernel standpoint, all we're doing is adding a capability to an established interface, and generating a nonce for the purposes of audit on the dropping of that capability. From user space its nice because we can consider the dropping of the ability to enter a new namespace as the border between a process being outside of, and inside of, a container.

We may have already gone down the path of the existing implementation to consider a change of this magnitude, but I wanted to bring this up before we were locked into a design

@pcmoore

This comment has been minimized.

Copy link
Member

pcmoore commented Dec 22, 2018

There appears to be no mechanism that prevents a container from modifying its own id (presuming CAP_SYS_ADMIN is not removed from its capability set, which I think doesn't occur for trusted containers)

I'm replying strictly from memory here, so I might have some of the minor details wrong, but managing the audit container ID should require CAP_AUDIT_ADMIN. The idea being that only container orchestrator processes would be granted this capability, not the individual containers themselves.

It gets slightly more confusing if you want to allow nested container orchestrators, but you're already dealing with other problems if you are going this route.

There appears to be no mechanism for preventing a container from changing namespaces once an id is set for it, meaning that the correlation between the id and whatever you want to define in userspace as being a 'container' is lost.

Once again, managing an audit container ID is gated by CAP_AUDIT_ADMIN so it is unlikely to be an issue. It is also worth noting that once the process spawns any children, or additional threads, you can't change the audit container ID.

There is no current mechanism that prevents multiple unique 'containers' from sharing an id.

Not in the kernel, that is correct. Like many things, this is something that is left to the container orchestrator.

One of the design constraints, if not the most important design constraint, was to avoid defining "container" in the context of the kernel with the audit container ID work. We defer all logic for setting, and managing the audit container ID to the userspace container orchestrator. In this first round of patches the kernel's only role here is to report the audit container ID as part of the audit event stream, and ensure the audit container ID is inherited properly for newly created threads/processes.

Later the kernel will add some intelligence for routing audit records based on the audit container ID, and allow multiple audit daemons to capture specific audit event streams, but even that will be carefully done so as to not define "container" in the kernel.

@nhorman

This comment has been minimized.

Copy link

nhorman commented Dec 24, 2018

I'm replying strictly from memory here, so I might have some of the minor details wrong, but managing the audit container ID should require CAP_AUDIT_ADMIN. The idea being that only container orchestrator processes would be granted this capability, not the individual containers themselves.
Right, I think its CAP_AUDIT_CONTROL, but no matter, and you are absolutely right, the container id is unwriteable by any process that doesn't have that capability. That said, I was less worried about a contained process changing its ID, and more worried about a contained process changing properties that an orchestrator might associate. i.e. a contained process may have its container id be fixed, but it could easily still call setns on it self, and enter the namespace of another process, breaking whatever association the orchestrator might have assumed would be established. My question was really an attempt to enforce that implied mapping by creating a capability control that could block entry to other processes namespaces in such a way that the orcestrator could preserve its notion of what a container was, without imbuing the kernel with any knowledge of containers (and hopefully at the same time, creating a trigger mechanism that the kernel audit code could use to auto-generate said ids). If you think thats too much information in the kernel, thats fine, but I wanted to ask the question.

It gets slightly more confusing if you want to allow nested container orchestrators, but you're already dealing with other problems if you are going this route.
Yeah, I'm not super worried about that (though bifurcating capability control between setns and unshare might be useful here, at least for the purposes of this conversation)

Once again, managing an audit container ID is gated by CAP_AUDIT_ADMIN so it is unlikely to be an issue. It is also worth noting that once the process spawns any children, or additional threads, you can't change the audit container ID.

Yes, but its not changing the audit container id that I'm asking about, its changing the namespaces that a given set of processes with an immutable container id that I'm concerned with. Maybe it doesn't matter for the purposes of audit, but it seems like it should be. As an example, Process A is spawned by an orchestator, and assigned net namespace 1, and container id 10. If Process A then forks a child with CLONE_NEWNET set, creating process B, and it (process A) then calls setns(, ), we have a situation in which process A has entered a new net namespace, but kept its old container id, all without the orchestrators knoweldge. My question is, does the orchestrator care? I'm assuming here that the entire purpose of creating a container id is to have a simple handle to refer to a unique set of namespaces and cgroups that the orchestrator can track, and allowing this change breaks the implied mapping of that handle to those namespaces. If it doesn't matter, then please let me know, and I can drop this entirely, it just seems like it should.

@rgbriggs

This comment has been minimized.

Copy link
Contributor Author

rgbriggs commented Dec 24, 2018

@rgbriggs

This comment has been minimized.

Copy link
Contributor Author

rgbriggs commented Dec 24, 2018

@nhorman

This comment has been minimized.

Copy link

nhorman commented Dec 26, 2018

@rgbriggs @pcmoore Forgive me for consolidating your above responses, but the conversation it getting lengthy and I'm having trouble keeping up with all the comments. To abbreviate your thoughts on my proposed design changes @rgbriggs please feel free to reflect and comment on them as you see fit, but my goal with them was really twofold:

  1. To understand what the functional goal was in this patch set from a userspace semantics standpoint (i.e. what does a container id mean to an orchestrator)

  2. To suggest some improvements to the implementation of those semantics, should my assumptions about (1) be correct

If you think there are improvements to be made with my suggestions/thoughts, great. If no, thats also fine.

I think, based on what you have both said, this is my understanding of the user space semantics, as you see them:

a) A container id is a write once nonce, set by an orchestrator on an initial process in a container (for some arbitrary definition of the term container), and inherited by its children. Once set, it is immutable.

b) A container id is assigned to a process and its children, but has no fixed correlation to the same set of namespaces and cgroups. If an orchestrator wishes to make the set of processes with a given container id have a fixed set of namespaces and cgroups, it (the orchestrator) should drop the lead processes CAP_SYS_ADMIN capability prior to it forking any children

c) The uniqueness of a container is managed in userspace. It is the responsibility of an orchestrator to ensure that all containers in a system (for any definition of container is wishes to enforce) have a unique id, or that multiple containers sharing an id do so according to a sane policy.

Do you both agree with points (a),(b), and (c)? If not, please correct me. If you do agree, then the comments below become valid:

  1. Regarding point (a), it makes sense to me, more or less. My goal with my alternate proposal was to take the generation of a container id out of the hands of userspace so as to ease the mechanics of generating said nonce (doing it in the kernel allows for uniqueness very easily, but requires embedding policy in the kernel to trigger its generation based on a set of events, which is tantamount to the kernel enforcing what a container is). I would like to point out that what you are describing with this nonce also sounds very similar to a session id to me (i.e. a process that calls setsid() to start a new session could be considered a container in the same way that your container id would denote it, and potentially be usable without any kernel changes). Just some food for thought.

  2. Regarding point (b), I'm fine with that. Userspace can very easily drop CAP_SYS_ADMIN to prevent the unsharing of namespaces within a process tree. That said, while the fork system call gates the unsharing of namespaces with that capability, the unshare and setns system calls do not appear to, so there is I think some additional work required here to enforce this capability as it pertains to namespaces. As a philosophy however, I'm definately on board with the idea of using this capability to gate namespace creation and assignment.

  3. Regarding point (c), This actually worries me alot. While I understand the desire to manage container id assignement in user space, It relies on the assumption that there is a single orchestrator running in userspace at one time. Any single orchestrator is capable of ensuring each container receives a unique id, but the interface as designed makes no allowance for the parallel execution of two orchestrators. It would simple to obfuscate the audit logs by simply having two copies of openshift running. Any sufficiently privileged process can write the container id of any process, and duplicate an existing container id, leading that field in the audit log becoming useless or worse, intentionally misleading. I think some rework is called for there.

@rgbriggs

This comment has been minimized.

Copy link
Contributor Author

rgbriggs commented Dec 27, 2018

@nhorman

This comment has been minimized.

Copy link

nhorman commented Dec 27, 2018

@rgbriggs Hey, thanks for the response. Answers to your thoughts:

I will have to look at setsid more closely. I assumed you were talking about the audit sessionid which has been raised during the design proposals along with loginuid, but they aren't quite the same.

No, setsid() is the systemcall that assigns a unique session id to a process group leader in the namespace of the process. If called prior to entering any new namespaces, it is unique within the process namespace of the orchestrator, and as such, could be used as an audit container id that is guaranteed to be unique for the lifetime of the container. Using it might also be nice because it uses existing in frastructure to assign a unique id to a process group, which It seems, based on your prior answers, is more or less what you are considering a container. As before, just some food for thought

I believe they are all covered. clone(2) checks CAP_SYS_ADMIN if any CLONE_NEW* flags are present, setns(2) does in each of the ns->ops->install() calls, and unshare(2) checks in unshare_nsproxy_namespaces()

Yep, you're right, I hadn't dug deeply enough, apologies.

This would be an issue with parallel or nested orchestrators alike. Parallel orchestrators on one machine had not been considered. This was the reason for my preference of a serial contid generated in the kernel or pseudo-random UUID contid generated by the orchestrator that would be checked for uniqueness upon set. However, my understanding is that would prevent the orchestrator from injecting commands into a container it previously spawned. We had considered allowing an orchestrator to set the contid only of its own descendants.

I'm not sure I followthe reasoning above. A serial or random UUID I think works just as well as my setsid() suggestion above (arguably better), especially if it allows for a uniqueness guarantee. Is the concern that if the kernel generates a random id, that the orchestrator won't know what the id is, thereby preventing a mapping of the id in the audit log to the process set? If so, thats an easy fix, your write-only proc file can become a read only proc file that exports the random value to the orchestrator. Or is there something more going on here?

@rgbriggs

This comment has been minimized.

Copy link
Contributor Author

rgbriggs commented Dec 27, 2018

@nhorman

This comment has been minimized.

Copy link

nhorman commented Dec 27, 2018

I would strongly agree with that. Even if the kernel is not responsible for computation of a unique id, it should be able to validate the uniqueness of an id to ensure the integrity of the audit log in the presence of multiple orchestrators. And allowing that container id to be read back is essential in the event that an orchestrator restarts with containers outstanding, so that the process->container id map can be rebuilt

fcicq pushed a commit to fcicq/chromiumos-third_party-kernel that referenced this issue Jan 20, 2019

BACKPORT: FROMLIST: audit: add container id
Implement the proc fs write to set the audit container identifier of a
process, emitting an AUDIT_CONTAINER_OP record to document the event.

This is a write from the container orchestrator task to a proc entry of
the form /proc/PID/audit_containerid where PID is the process ID of the
newly created task that is to become the first task in a container, or
an additional task added to a container.

The write expects up to a u64 value (unset: 18446744073709551615).

The writer must have capability CAP_AUDIT_CONTROL.

This will produce a record such as this:
  type=CONTAINER_ID msg=audit(2018-06-06 12:39:29.636:26949) : op=set opid=2209 old-contid=18446744073709551615 contid=123456 pid=628 auid=root uid=root tty=ttyS0 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 comm=bash exe=/usr/bin/bash res=yes

The "op" field indicates an initial set.  The "pid" to "ses" fields are
the orchestrator while the "opid" field is the object's PID, the process
being "contained".  Old and new audit container identifier values are
given in the "contid" fields, while res indicates its success.

It is not permitted to unset the audit container identifier.
A child inherits its parent's audit container identifier.

See: linux-audit/audit-kernel#90
See: linux-audit/audit-userspace#51
See: linux-audit/audit-testsuite#64
See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Steve Grubb <sgrubb@redhat.com>
(am from https://patchwork.kernel.org/patch/10551315/)

BUG=chromium:918980
TEST=Build, boot and GCP internal testing.

Changed the return value of the default audit_get_contid as the kuid_t
is a 32-bit value where the other version is a u64 failing compilation
on 32-bit kernels.

Signed-off-by: Thomas Garnier <thgarnie@google.com>
Change-Id: Iee61e96d015715f1dde24f92c230f14410cb5a79
Reviewed-on: https://chromium-review.googlesource.com/1379655
Reviewed-by: Dmitry Torokhov <dtor@chromium.org>
Reviewed-by: Robert Kolchmeyer <rkolchmeyer@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
@rgbriggs

This comment has been minimized.

@rgbriggs

This comment has been minimized.

@rgbriggs

This comment has been minimized.

Copy link
Contributor Author

rgbriggs commented Apr 10, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.