RFE: implement an audit container identifier #90

Open
rgbriggs opened this issue Jun 1, 2018 · 24 comments

Comments

@rgbriggs
Member

rgbriggs commented Jun 1, 2018

Split this off from #32, leaving that issue for addressing namespace identifiers in audit records, should they be deemed necessary.

Implement an audit container identifier.

Add the ability to identify a task's assigned container using an audit container identifier. The registration process involves writing a u64 to file audit_containerid in the /proc filesystem under the PID of the target container task. This will result in a CONTAINER_ID record to log the event. Subsequent audit events that involve this task will have an auxiliary record CONTAINER to identify the container involved.

Depends: linux-audit/audit-userspace#51
See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
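A minimal sketch of the registration step described above, assuming a kernel that carries this patchset (the proc entry is not something to expect on an arbitrary kernel). The helper name `set_audit_contid` and the `proc_root` parameter are hypothetical; `proc_root` exists only so the function can be pointed at a scratch directory. The unset value is the all-ones u64 quoted in the patch descriptions in this thread, and rejecting it reflects the rule that the identifier may not be unset.

```python
import os

AUDIT_CID_UNSET = 2**64 - 1  # 18446744073709551615, the "unset" marker


def set_audit_contid(pid, contid, proc_root="/proc"):
    """Write contid to <proc_root>/<pid>/audit_containerid; return the path.

    On a patched kernel the writer would also need the capability named in
    the patch description; this helper only validates the value and writes.
    """
    if not 0 <= contid < AUDIT_CID_UNSET:
        raise ValueError("contid must be a u64 other than the unset value")
    path = os.path.join(proc_root, str(pid), "audit_containerid")
    with open(path, "w") as f:
        f.write(str(contid))
    return path
```

An orchestrator would call something like `set_audit_contid(child_pid, 123456)` once, before the new task spawns children or threads.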

History:

@rgbriggs
Member Author

rgbriggs commented Jun 6, 2018

@nhorman
Contributor

nhorman commented Dec 21, 2018

I have a question regarding the container id assignment mechanism. Currently the patch in the v4 posted series loosely defines a container by virtue of the fact that a nonce is assigned to it from user space. This is nice for a few reasons, primarily because it excuses the kernel from having to have any definition of what a container is (ostensibly we consider a container to be a unique collection of namespaces, and possibly cgroups, but the kernel wants no knowledge of that). It seems this implementation of container assignment suffers from a few shortcomings, however:

  1. There appears to be no mechanism that prevents a container from modifying its own id (presuming CAP_SYS_ADMIN is not removed from its capability set, which I think doesn't occur for trusted containers)
  2. There appears to be no mechanism for preventing a container from changing namespaces once an id is set for it, meaning that the correlation between the id and whatever you want to define in userspace as being a 'container' is lost
  3. There is no current mechanism that prevents multiple unique 'containers' from sharing an id

All of these problems are of course fixable in the current implementation, but for the sake of argument, I'd like to propose an alternate solution that may (I've not 100% thought it out yet) reduce the complexity of the code, make the semantics of the control of the container id from userspace more clear, and solve the above problems.

To start, I'd like to define (from a userspace perspective strictly) what we see as a container:

A container is defined by user space as a process that has a specific collection of namespaces and cgroups AND does NOT have the ability to enter new namespaces and cgroups.

I would propose that we implement the following in the kernel to enforce that policy:

  1. Create a new capability, CAP_AUDIT_SETNS, which gates the ability to successfully call the setns system call. This capability is inherited by its children. If CAP_AUDIT_SETNS is re-granted to a process, its contid (see 2, below) is reset to AUDIT_CID_UNSET.

  2. Create a new field in the audit struct per task, contid (which exists in this patchset already). This field is assigned a nonce (likely using the ida_simple_get API) if the CAP_AUDIT_SETNS capability is dropped, which would also generate an audit log message (similar to the one generated in audit_set_contid).

  3. If a process calls fork/clone with any of the CLONE_ flags set, the CAP_AUDIT_SETNS capability is restored to the child process, and its contid value is reset to AUDIT_CID_UNSET (allowing for a process in a container to start a new container, but not to change its own 'container nature').
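The three rules above can be sketched as a toy state machine. This is a userspace model only, not kernel code: the `Task` class, its method names, and the counter standing in for an in-kernel ID allocator are all hypothetical, and CAP_AUDIT_SETNS is the capability proposed in this comment, not an existing one.

```python
import itertools
from dataclasses import dataclass

AUDIT_CID_UNSET = 2**64 - 1
_next_id = itertools.count(1)  # stand-in for a kernel-side ID allocator


@dataclass
class Task:
    can_setns: bool = True        # models holding the proposed CAP_AUDIT_SETNS
    contid: int = AUDIT_CID_UNSET

    def drop_setns(self):
        """Rule 2: dropping the capability assigns a fresh nonce."""
        if self.can_setns:
            self.can_setns = False
            self.contid = next(_next_id)

    def regrant_setns(self):
        """Rule 1: re-granting the capability resets the contid."""
        self.can_setns = True
        self.contid = AUDIT_CID_UNSET

    def fork(self, new_namespaces=False):
        """Rule 3: a child created with namespace flags starts uncontained;
        otherwise it inherits the parent's capability state and contid."""
        if new_namespaces:
            return Task()
        return Task(can_setns=self.can_setns, contid=self.contid)
```

In this model a contained task can start a new container through `fork(new_namespaces=True)`, but it has no way to change its own contid short of a parent re-granting the capability.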

I think this approach may have a few advantages:
a) It provides an established mechanism (the capabilities subsystem) to provide a gated point at which the above container definition becomes locked from the perspective of the initial process inside the container. While the parent may re-grant the privilege, it creates an environment whereby dropping the capability both locks the nature of what the container is (a unique set of namespaces and cgroups) and provides a unique id to the container for audit purposes.

b) It removes the need to manage container ids from userspace, reducing the risk of errors in assigning duplicate identifiers, and reduces code size by removing the need to create a new proc file and validate the information passed through it.

c) It reduces complexity in the kernel code. The kernel can guarantee unique identifiers using the gating CAP_AUDIT_SETNS flag as the moment to generate the id.

This also (may) smooth upstream acceptance, because from a kernel standpoint, all we're doing is adding a capability to an established interface, and generating a nonce for the purposes of audit on the dropping of that capability. From user space it's nice because we can consider the dropping of the ability to enter a new namespace as the border between a process being outside of, and inside of, a container.

We may have already gone too far down the path of the existing implementation to consider a change of this magnitude, but I wanted to bring this up before we were locked into a design.

@pcmoore
Member

pcmoore commented Dec 22, 2018

> There appears to be no mechanism that prevents a container from modifying its own id (presuming CAP_SYS_ADMIN is not removed from its capability set, which I think doesn't occur for trusted containers)

I'm replying strictly from memory here, so I might have some of the minor details wrong, but managing the audit container ID should require CAP_AUDIT_ADMIN. The idea being that only container orchestrator processes would be granted this capability, not the individual containers themselves.

It gets slightly more confusing if you want to allow nested container orchestrators, but you're already dealing with other problems if you are going this route.

> There appears to be no mechanism for preventing a container from changing namespaces once an id is set for it, meaning that the correlation between the id and whatever you want to define in userspace as being a 'container' is lost.

Once again, managing an audit container ID is gated by CAP_AUDIT_ADMIN so it is unlikely to be an issue. It is also worth noting that once the process spawns any children, or additional threads, you can't change the audit container ID.

> There is no current mechanism that prevents multiple unique 'containers' from sharing an id.

Not in the kernel, that is correct. Like many things, this is something that is left to the container orchestrator.

One of the design constraints, if not the most important design constraint, was to avoid defining "container" in the context of the kernel with the audit container ID work. We defer all logic for setting, and managing the audit container ID to the userspace container orchestrator. In this first round of patches the kernel's only role here is to report the audit container ID as part of the audit event stream, and ensure the audit container ID is inherited properly for newly created threads/processes.

Later the kernel will add some intelligence for routing audit records based on the audit container ID, and allow multiple audit daemons to capture specific audit event streams, but even that will be carefully done so as to not define "container" in the kernel.

@nhorman
Contributor

nhorman commented Dec 24, 2018

> I'm replying strictly from memory here, so I might have some of the minor details wrong, but managing the audit container ID should require CAP_AUDIT_ADMIN. The idea being that only container orchestrator processes would be granted this capability, not the individual containers themselves.

Right, I think it's CAP_AUDIT_CONTROL, but no matter, and you are absolutely right, the container id is unwritable by any process that doesn't have that capability. That said, I was less worried about a contained process changing its ID, and more worried about a contained process changing properties that an orchestrator might associate with it. I.e., a contained process may have its container id be fixed, but it could easily still call setns on itself and enter the namespace of another process, breaking whatever association the orchestrator might have assumed would be established. My question was really an attempt to enforce that implied mapping by creating a capability control that could block entry to other processes' namespaces in such a way that the orchestrator could preserve its notion of what a container was, without imbuing the kernel with any knowledge of containers (and hopefully, at the same time, creating a trigger mechanism that the kernel audit code could use to auto-generate said ids). If you think that's too much information in the kernel, that's fine, but I wanted to ask the question.

> It gets slightly more confusing if you want to allow nested container orchestrators, but you're already dealing with other problems if you are going this route.

Yeah, I'm not super worried about that (though bifurcating capability control between setns and unshare might be useful here, at least for the purposes of this conversation).

> Once again, managing an audit container ID is gated by CAP_AUDIT_ADMIN so it is unlikely to be an issue. It is also worth noting that once the process spawns any children, or additional threads, you can't change the audit container ID.

Yes, but it's not changing the audit container id that I'm asking about; it's changing the namespaces of a given set of processes with an immutable container id that I'm concerned with. Maybe it doesn't matter for the purposes of audit, but it seems like it should. As an example, Process A is spawned by an orchestrator, and assigned net namespace 1 and container id 10. If Process A then forks a child with CLONE_NEWNET set, creating process B, and it (process A) then calls setns(, ), we have a situation in which process A has entered a new net namespace but kept its old container id, all without the orchestrator's knowledge. My question is, does the orchestrator care? I'm assuming here that the entire purpose of creating a container id is to have a simple handle to refer to a unique set of namespaces and cgroups that the orchestrator can track, and allowing this change breaks the implied mapping of that handle to those namespaces. If it doesn't matter, then please let me know, and I can drop this entirely; it just seems like it should.

@rgbriggs
Member Author

rgbriggs commented Dec 24, 2018 via email

@rgbriggs
Member Author

rgbriggs commented Dec 24, 2018 via email

@nhorman
Contributor

nhorman commented Dec 26, 2018

@rgbriggs @pcmoore Forgive me for consolidating your above responses, but the conversation is getting lengthy and I'm having trouble keeping up with all the comments. To abbreviate: regarding your thoughts on my proposed design changes, @rgbriggs, please feel free to reflect and comment on them as you see fit, but my goal with them was really twofold:

  1. To understand what the functional goal was in this patch set from a userspace semantics standpoint (i.e. what does a container id mean to an orchestrator)

  2. To suggest some improvements to the implementation of those semantics, should my assumptions about (1) be correct

If you think there are improvements to be made with my suggestions/thoughts, great. If not, that's also fine.

I think, based on what you have both said, this is my understanding of the user space semantics, as you see them:

a) A container id is a write once nonce, set by an orchestrator on an initial process in a container (for some arbitrary definition of the term container), and inherited by its children. Once set, it is immutable.

b) A container id is assigned to a process and its children, but has no fixed correlation to the same set of namespaces and cgroups. If an orchestrator wishes to make the set of processes with a given container id have a fixed set of namespaces and cgroups, it (the orchestrator) should drop the lead process's CAP_SYS_ADMIN capability prior to it forking any children.

c) The uniqueness of a container is managed in userspace. It is the responsibility of an orchestrator to ensure that all containers in a system (for any definition of container it wishes to enforce) have a unique id, or that multiple containers sharing an id do so according to a sane policy.

Do you both agree with points (a),(b), and (c)? If not, please correct me. If you do agree, then the comments below become valid:

  1. Regarding point (a), it makes sense to me, more or less. My goal with my alternate proposal was to take the generation of a container id out of the hands of userspace so as to ease the mechanics of generating said nonce (doing it in the kernel allows for uniqueness very easily, but requires embedding policy in the kernel to trigger its generation based on a set of events, which is tantamount to the kernel enforcing what a container is). I would like to point out that what you are describing with this nonce also sounds very similar to a session id to me (i.e. a process that calls setsid() to start a new session could be considered a container in the same way that your container id would denote it, and potentially be usable without any kernel changes). Just some food for thought.

  2. Regarding point (b), I'm fine with that. Userspace can very easily drop CAP_SYS_ADMIN to prevent the unsharing of namespaces within a process tree. That said, while the fork system call gates the unsharing of namespaces with that capability, the unshare and setns system calls do not appear to, so I think there is some additional work required here to enforce this capability as it pertains to namespaces. As a philosophy, however, I'm definitely on board with the idea of using this capability to gate namespace creation and assignment.

  3. Regarding point (c), this actually worries me a lot. While I understand the desire to manage container id assignment in user space, it relies on the assumption that there is a single orchestrator running in userspace at one time. Any single orchestrator is capable of ensuring each container receives a unique id, but the interface as designed makes no allowance for the parallel execution of two orchestrators. It would be simple to obfuscate the audit logs by simply having two copies of openshift running. Any sufficiently privileged process can write the container id of any process, and duplicate an existing container id, leading to that field in the audit log becoming useless or, worse, intentionally misleading. I think some rework is called for there.

@rgbriggs
Member Author

rgbriggs commented Dec 27, 2018 via email

@nhorman
Contributor

nhorman commented Dec 27, 2018

@rgbriggs Hey, thanks for the response. Answers to your thoughts:

> I will have to look at setsid more closely. I assumed you were talking about the audit sessionid which has been raised during the design proposals along with loginuid, but they aren't quite the same.

No, setsid() is the system call that assigns a unique session id to a process group leader in the namespace of the process. If called prior to entering any new namespaces, it is unique within the process namespace of the orchestrator, and as such, could be used as an audit container id that is guaranteed to be unique for the lifetime of the container. Using it might also be nice because it uses existing infrastructure to assign a unique id to a process group, which, it seems, based on your prior answers, is more or less what you are considering a container. As before, just some food for thought.
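For reference, a small sketch of what setsid() gives a process, using Python's `os` wrappers on Linux. The helper name `session_id_of_new_leader` is made up for illustration. It forks a child, has the child start a new session, and reports the resulting session id, which equals the child's PID.

```python
import os


def session_id_of_new_leader():
    """Fork a child that calls setsid(); return (child_pid, child_sid)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        try:
            os.close(read_fd)
            os.setsid()  # become leader of a new session
            os.write(write_fd, str(os.getsid(0)).encode())
        finally:
            os._exit(0)  # never fall through into the parent's code path
    os.close(write_fd)
    with os.fdopen(read_fd) as f:
        sid = int(f.read())
    os.waitpid(pid, 0)
    return pid, sid
```

Because the session id is just the leader's PID, it is only distinct among sessions that exist at the same time in that PID namespace, which is worth keeping in mind when weighing it as a container handle.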

> I believe they are all covered. clone(2) checks CAP_SYS_ADMIN if any CLONE_NEW* flags are present, setns(2) does in each of the ns->ops->install() calls, and unshare(2) checks in unshare_nsproxy_namespaces()

Yep, you're right, I hadn't dug deeply enough, apologies.

> This would be an issue with parallel or nested orchestrators alike. Parallel orchestrators on one machine had not been considered. This was the reason for my preference of a serial contid generated in the kernel or pseudo-random UUID contid generated by the orchestrator that would be checked for uniqueness upon set. However, my understanding is that would prevent the orchestrator from injecting commands into a container it previously spawned. We had considered allowing an orchestrator to set the contid only of its own descendants.

I'm not sure I follow the reasoning above. A serial or random UUID, I think, works just as well as my setsid() suggestion above (arguably better), especially if it allows for a uniqueness guarantee. Is the concern that if the kernel generates a random id, the orchestrator won't know what the id is, thereby preventing a mapping of the id in the audit log to the process set? If so, that's an easy fix: your write-only proc file can become a read-only proc file that exports the random value to the orchestrator. Or is there something more going on here?

@rgbriggs
Member Author

rgbriggs commented Dec 27, 2018 via email

@nhorman
Contributor

nhorman commented Dec 27, 2018

I would strongly agree with that. Even if the kernel is not responsible for computation of a unique id, it should be able to validate the uniqueness of an id to ensure the integrity of the audit log in the presence of multiple orchestrators. And allowing that container id to be read back is essential in the event that an orchestrator restarts with containers outstanding, so that the process->container id map can be rebuilt.
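A toy userspace model of the two properties asked for here: a set operation that rejects an id already claimed by a different orchestrator, and a read-back path for rebuilding the process-to-container map after a restart. The class `ContidRegistry`, its methods, and the string orchestrator names are hypothetical; the real check would live in the kernel and could look quite different.

```python
class ContidRegistry:
    """Toy model: contid -> owning orchestrator, pid -> contid."""

    def __init__(self):
        self._owner_of = {}   # contid -> orchestrator name
        self._contid_of = {}  # pid -> contid

    def set_contid(self, orchestrator, pid, contid):
        if pid in self._contid_of:
            raise PermissionError("contid already set for this task")
        # Only the orchestrator that first claimed an id may reuse it,
        # e.g. to add another task to a container it created.
        if self._owner_of.get(contid, orchestrator) != orchestrator:
            raise FileExistsError("contid in use by another orchestrator")
        self._owner_of[contid] = orchestrator
        self._contid_of[pid] = contid

    def get_contid(self, pid):
        """Read-back path; returns None when no contid is set."""
        return self._contid_of.get(pid)

    def rebuild_map(self, orchestrator):
        """What a restarted orchestrator could recover via read-back."""
        return {
            pid: cid
            for pid, cid in self._contid_of.items()
            if self._owner_of[cid] == orchestrator
        }
```

With this model, a second orchestrator that picks an id already in use gets an error at set time instead of silently producing ambiguous audit records.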

fcicq pushed a commit to fcicq/chromiumos-third_party-kernel that referenced this issue Jan 20, 2019
Implement the proc fs write to set the audit container identifier of a
process, emitting an AUDIT_CONTAINER_OP record to document the event.

This is a write from the container orchestrator task to a proc entry of
the form /proc/PID/audit_containerid where PID is the process ID of the
newly created task that is to become the first task in a container, or
an additional task added to a container.

The write expects up to a u64 value (unset: 18446744073709551615).

The writer must have capability CAP_AUDIT_CONTROL.

This will produce a record such as this:
  type=CONTAINER_ID msg=audit(2018-06-06 12:39:29.636:26949) : op=set opid=2209 old-contid=18446744073709551615 contid=123456 pid=628 auid=root uid=root tty=ttyS0 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 comm=bash exe=/usr/bin/bash res=yes

The "op" field indicates an initial set.  The "pid" to "ses" fields are
the orchestrator while the "opid" field is the object's PID, the process
being "contained".  Old and new audit container identifier values are
given in the "contid" fields, while res indicates its success.
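As an illustration of how the fields described above could be picked out of such a record, here is a small parsing sketch. The function name `parse_audit_record` and both regular expressions are assumptions made for this example, not part of the audit userspace tools.

```python
import re

# type=<TYPE> msg=audit(<timestamp:serial>) : key=value key=value ...
_RECORD = re.compile(
    r"type=(?P<type>\S+) msg=audit\((?P<stamp>[^)]*)\)\s*:\s*(?P<body>.*)"
)
_FIELD = re.compile(r'([\w-]+)=("[^"]*"|\S+)')


def parse_audit_record(line):
    """Return (record type, event stamp, dict of fields) for one record."""
    m = _RECORD.match(line.strip())
    if m is None:
        raise ValueError("not an audit record")
    fields = {k: v.strip('"') for k, v in _FIELD.findall(m.group("body"))}
    return m.group("type"), m.group("stamp"), fields
```

Applied to the sample record, `fields["opid"]` is the contained task's PID, while `fields["pid"]` belongs to the orchestrator that performed the write.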

It is not permitted to unset the audit container identifier.
A child inherits its parent's audit container identifier.

See: linux-audit/audit-kernel#90
See: linux-audit/audit-userspace#51
See: linux-audit/audit-testsuite#64
See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Steve Grubb <sgrubb@redhat.com>
(am from https://patchwork.kernel.org/patch/10551315/)

BUG=chromium:918980
TEST=Build, boot and GCP internal testing.

Changed the return value of the default audit_get_contid as the kuid_t
is a 32-bit value where the other version is a u64 failing compilation
on 32-bit kernels.

Signed-off-by: Thomas Garnier <thgarnie@google.com>
Change-Id: Iee61e96d015715f1dde24f92c230f14410cb5a79
Reviewed-on: https://chromium-review.googlesource.com/1379655
Reviewed-by: Dmitry Torokhov <dtor@chromium.org>
Reviewed-by: Robert Kolchmeyer <rkolchmeyer@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
@rgbriggs
Member Author

rgbriggs commented Apr 9, 2019

@rgbriggs
Member Author

rgbriggs commented Apr 9, 2019

@rgbriggs
Member Author

Test case v1 PR: linux-audit/audit-testsuite#83

fengguang pushed a commit to 0day-ci/linux that referenced this issue Jan 3, 2020
Implement the proc fs write to set the audit container identifier of a
process, emitting an AUDIT_CONTAINER_OP record to document the event.

This is a write from the container orchestrator task to a proc entry of
the form /proc/PID/audit_containerid where PID is the process ID of the
newly created task that is to become the first task in a container, or
an additional task added to a container.

The write expects up to a u64 value (unset: 18446744073709551615).

The writer must have capability CAP_AUDIT_CONTROL.

This will produce a record such as this:
  type=CONTAINER_OP msg=audit(2018-06-06 12:39:29.636:26949) : op=set opid=2209 contid=123456 old-contid=18446744073709551615

The "op" field indicates an initial set.  The "opid" field is the
object's PID, the process being "contained".  New and old audit
container identifier values are given in the "contid" fields.

It is not permitted to unset the audit container identifier.
A child inherits its parent's audit container identifier.

Please see the github audit kernel issue for the main feature:
  linux-audit/audit-kernel#90
Please see the github audit userspace issue for supporting additions:
  linux-audit/audit-userspace#51
Please see the github audit testsuite issue for the test case:
  linux-audit/audit-testsuite#64
Please see the github audit wiki for the feature overview:
  https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Steve Grubb <sgrubb@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
fengguang pushed a commit to 0day-ci/linux that referenced this issue Jan 3, 2020
Create a new audit record AUDIT_CONTAINER_ID to document the audit
container identifier of a process if it is present.

Called from audit_log_exit(), syscalls are covered.

A sample raw event:
type=SYSCALL msg=audit(1519924845.499:257): arch=c000003e syscall=257 success=yes exit=3 a0=ffffff9c a1=56374e1cef30 a2=241 a3=1b6 items=2 ppid=606 pid=635 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=3 comm="bash" exe="/usr/bin/bash" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="tmpcontainerid"
type=CWD msg=audit(1519924845.499:257): cwd="/root"
type=PATH msg=audit(1519924845.499:257): item=0 name="/tmp/" inode=13863 dev=00:27 mode=041777 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype= PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
type=PATH msg=audit(1519924845.499:257): item=1 name="/tmp/tmpcontainerid" inode=17729 dev=00:27 mode=0100644 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
type=PROCTITLE msg=audit(1519924845.499:257): proctitle=62617368002D6300736C65657020313B206563686F2074657374203E202F746D702F746D70636F6E7461696E65726964
type=CONTAINER_ID msg=audit(1519924845.499:257): contid=123458
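To show how the auxiliary record ties back to the rest of the event, the following sketch groups raw records by their shared `timestamp:serial` stamp and attaches the contid when one is present. It also decodes the hex form of PROCTITLE seen in the sample. The function names are made up for this example, and the parsing assumes the raw record layout shown above.

```python
import re
from collections import defaultdict

_HEAD = re.compile(r"type=(\S+) msg=audit\(([^)]*)\):\s*(.*)")


def contid_by_event(lines):
    """Map each event stamp to (list of record types, contid or None)."""
    types = defaultdict(list)
    contid = {}
    for line in lines:
        m = _HEAD.match(line)
        if not m:
            continue
        rtype, stamp, body = m.groups()
        types[stamp].append(rtype)
        if rtype == "CONTAINER_ID":
            found = re.search(r"(?:^|\s)contid=(\d+)", body)
            if found:
                contid[stamp] = int(found.group(1))
    return {stamp: (t, contid.get(stamp)) for stamp, t in types.items()}


def decode_proctitle(hex_value):
    """Decode a hex-encoded PROCTITLE into its NUL-separated arguments."""
    return bytes.fromhex(hex_value).decode().split("\x00")
```

For the sample event, the decoded PROCTITLE is the `bash -c` command that created `/tmp/tmpcontainerid`, and the event as a whole is attributed to contid 123458.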

Please see the github audit kernel issue for the main feature:
  linux-audit/audit-kernel#90
Please see the github audit userspace issue for supporting additions:
  linux-audit/audit-userspace#51
Please see the github audit testsuite issue for the test case:
  linux-audit/audit-testsuite#64
Please see the github audit wiki for the feature overview:
  https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Steve Grubb <sgrubb@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
rgbriggs added a commit to rgbriggs/audit-userspace that referenced this issue Jun 26, 2020
This defines the message number for the audit container identifier
registration record should the kernel headers not be up to date, gives
the record number a name for printing and allows the record to be
interpreted since it is in the 1000 range like AUDIT_LOGIN.

See: linux-audit#51
See: linux-audit/audit-kernel#90
See: linux-audit/audit-testsuite#64
See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
rgbriggs added a commit to rgbriggs/audit-userspace that referenced this issue Jun 26, 2020
This defines the message number for the audit container identifier
information record should the kernel headers not be up to date and gives
the record number a name for printing.

See: linux-audit#51
See: linux-audit/audit-kernel#90
See: linux-audit/audit-testsuite#64
See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
rgbriggs added a commit to rgbriggs/audit-userspace that referenced this issue Jun 26, 2020
Add support to be able to set a capability to allow a task to set the
audit container identifier of descendants.

See: linux-audit#51
See: linux-audit/audit-kernel#90
See: linux-audit/audit-testsuite#64
See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID

Add the audit_get_capcontid() and audit_set_capcontid() calls analogous
to CAP_AUDIT_CONTROL for descendant user namespaces.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
rgbriggs added a commit to rgbriggs/audit-userspace that referenced this issue Jun 27, 2020
Add support to be able to set a capability to allow a task to set the
audit container identifier of descendants.

See: linux-audit#51
See: linux-audit/audit-kernel#90
See: linux-audit/audit-testsuite#64
See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID

Add the audit_get_capcontid() and audit_set_capcontid() calls analogous
to CAP_AUDIT_CONTROL for descendant user namespaces.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
fengguang pushed a commit to 0day-ci/linux that referenced this issue Jun 27, 2020
Implement the proc fs write to set the audit container identifier of a
process, emitting an AUDIT_CONTAINER_OP record to document the event.

This is a write from the container orchestrator task to a proc entry of
the form /proc/PID/audit_containerid where PID is the process ID of the
newly created task that is to become the first task in a container, or
an additional task added to a container.

The write expects up to a u64 value (unset: 18446744073709551615).

The writer must have capability CAP_AUDIT_CONTROL.

This will produce a record such as this:
  type=CONTAINER_OP msg=audit(2018-06-06 12:39:29.636:26949) : op=set opid=2209 contid=123456 old-contid=18446744073709551615

The "op" field indicates an initial set.  The "opid" field is the
object's PID, the process being "contained".  New and old audit
container identifier values are given in the "contid" fields.

It is not permitted to unset the audit container identifier.
A child inherits its parent's audit container identifier.

Store the audit container identifier in a refcounted kernel object that
is added to the master list of audit container identifiers.  This will
allow multiple container orchestrators/engines to work on the same
machine without danger of inadvertently re-using an existing identifier.
It will also allow an orchestrator to inject a process into an existing
container by checking if the original container owner is the one
injecting the task.  A hash table list is used to optimize searches.

Please see the github audit kernel issue for the main feature:
  linux-audit/audit-kernel#90
Please see the github audit userspace issue for supporting additions:
  linux-audit/audit-userspace#51
Please see the github audit testsuite issue for the test case:
  linux-audit/audit-testsuite#64
Please see the github audit wiki for the feature overview:
  https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Steve Grubb <sgrubb@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
fengguang pushed a commit to 0day-ci/linux that referenced this issue Jun 27, 2020
Create a new audit record AUDIT_CONTAINER_ID to document the audit
container identifier of a process if it is present.

Called from audit_log_exit(), syscalls are covered.

Include target_cid references from ptrace and signal.

A sample raw event:
type=SYSCALL msg=audit(1519924845.499:257): arch=c000003e syscall=257 success=yes exit=3 a0=ffffff9c a1=56374e1cef30 a2=241 a3=1b6 items=2 ppid=606 pid=635 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=3 comm="bash" exe="/usr/bin/bash" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="tmpcontainerid"
type=CWD msg=audit(1519924845.499:257): cwd="/root"
type=PATH msg=audit(1519924845.499:257): item=0 name="/tmp/" inode=13863 dev=00:27 mode=041777 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tmp_t:s0 nametype= PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
type=PATH msg=audit(1519924845.499:257): item=1 name="/tmp/tmpcontainerid" inode=17729 dev=00:27 mode=0100644 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
type=PROCTITLE msg=audit(1519924845.499:257): proctitle=62617368002D6300736C65657020313B206563686F2074657374203E202F746D702F746D70636F6E7461696E65726964
type=CONTAINER_ID msg=audit(1519924845.499:257): contid=123458

Please see the github audit kernel issue for the main feature:
  linux-audit/audit-kernel#90
Please see the github audit userspace issue for supporting additions:
  linux-audit/audit-userspace#51
Please see the github audit testsuite issue for the test case:
  linux-audit/audit-testsuite#64
Please see the github audit wiki for the feature overview:
  https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Steve Grubb <sgrubb@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
rgbriggs added a commit to rgbriggs/audit-userspace that referenced this issue Nov 26, 2020
Add support to be able to set a capability to allow a task to set the
audit container identifier of descendants.

See: linux-audit#51
See: linux-audit/audit-kernel#90
See: linux-audit/audit-testsuite#64
See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID

Add the audit_get_capcontid() and audit_set_capcontid() calls analogous
to CAP_AUDIT_CONTROL for descendant user namespaces.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
rgbriggs added a commit to rgbriggs/audit-userspace that referenced this issue Nov 26, 2020
This defines the message number for the audit container identifier
registration record should the kernel headers not be up to date, gives
the record number a name for printing and allows the record to be
interpreted since it is in the 1000 range like AUDIT_LOGIN.

See: linux-audit#51
See: linux-audit/audit-kernel#90
See: linux-audit/audit-testsuite#64
See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
rgbriggs added a commit to rgbriggs/audit-userspace that referenced this issue Nov 26, 2020
This defines the message number for the audit container identifier
information record should the kernel headers not be up to date and gives
the record number a name for printing.

See: linux-audit#51
See: linux-audit/audit-kernel#90
See: linux-audit/audit-testsuite#64
See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
rgbriggs added a commit to rgbriggs/audit-userspace that referenced this issue Nov 26, 2020
Add support to be able to set a capability to allow a task to set the
audit container identifier of descendants.

See: linux-audit#51
See: linux-audit/audit-kernel#90
See: linux-audit/audit-testsuite#64
See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID

Add the audit_get_capcontid() and audit_set_capcontid() calls analogous
to CAP_AUDIT_CONTROL for descendant user namespaces.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
rgbriggs added a commit to rgbriggs/audit-userspace that referenced this issue Dec 18, 2020
This defines the message number for the audit container identifier
registration record should the kernel headers not be up to date, gives
the record number a name for printing and allows the record to be
interpreted since it is in the 1000 range like AUDIT_LOGIN.

See: linux-audit#51
See: linux-audit/audit-kernel#90
See: linux-audit/audit-testsuite#64
See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
rgbriggs added a commit to rgbriggs/audit-userspace that referenced this issue Dec 18, 2020
This defines the message number for the audit container identifier
information record should the kernel headers not be up to date and gives
the record number a name for printing.

See: linux-audit#51
See: linux-audit/audit-kernel#90
See: linux-audit/audit-testsuite#64
See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
rgbriggs added a commit to rgbriggs/audit-userspace that referenced this issue Dec 18, 2020
Add support to be able to set a capability to allow a task to set the
audit container identifier of descendants.

See: linux-audit#51
See: linux-audit/audit-kernel#90
See: linux-audit/audit-testsuite#64
See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID

Add the audit_get_capcontid() and audit_set_capcontid() calls analogous
to CAP_AUDIT_CONTROL for descendant user namespaces.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
fengguang pushed a commit to 0day-ci/linux that referenced this issue Dec 22, 2020
The audit-related parameters in struct task_struct should ideally be
collected together and accessed through a standard audit API and the audit
structures made opaque to other kernel subsystems.

Collect the existing loginuid, sessionid and audit_context together in a
new opaque struct audit_task_info called "audit" in struct task_struct.

Use kmem_cache to manage this pool of memory.
Un-inline audit_free() to be able to always recover that memory.

Please see the upstream github issues
linux-audit/audit-kernel#81
linux-audit/audit-kernel#90

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
fengguang pushed a commit to 0day-ci/linux that referenced this issue Dec 22, 2020
Implement the proc fs write to set the audit container identifier of a
process, emitting an AUDIT_CONTAINER_OP record to document the event.

This is a write from the container orchestrator task to a proc entry of
the form /proc/PID/audit_containerid where PID is the process ID of the
newly created task that is to become the first task in a container, or
an additional task added to a container.

The write expects up to a u64 value (unset: 18446744073709551615).

The writer must have capability CAP_AUDIT_CONTROL.

This will produce a record such as this:
time->Thu Nov 26 10:24:46 2020
type=PROCTITLE msg=audit(1606404286.956:174546): proctitle=2F7573722F62696E2F7065726C002D7700636F6E7461696E657269642F74657374
type=SYSCALL msg=audit(1606404286.956:174546): arch=c000003e syscall=1 success=yes exit=19 a0=6 a1=557446a6a650 a2=13 a3=8 items=0 ppid=6827 pid=8724 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 ses=1 comm="perl" exe="/usr/bin/perl" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
type=CONTAINER_OP msg=audit(1606404286.956:174546): op=set opid=8771 contid=4112973747854606336 old-contid=-1

The "op" field indicates an initial set.  The "opid" field is the
object's PID, the process being "contained".  New and old audit
container identifier values are given in the "contid" fields.

It is not permitted to unset the audit container identifier.
A child inherits its parent's audit container identifier.
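
As an illustration of the field layout described above, here is a minimal Python sketch that splits a raw CONTAINER_OP record into its fields. The helper names `parse_audit_record` and `contid_value` are hypothetical and are not part of the audit userspace tools. It also assumes that the `old-contid=-1` printed in the sample is the signed rendering of the unset value 18446744073709551615.

```python
import re

# The unset value from the text above: 18446744073709551615, i.e. (u64)-1.
U64_UNSET = 2**64 - 1

RECORD_RE = re.compile(r"type=(\S+) msg=audit\((\d+\.\d+):(\d+)\): ?(.*)")


def parse_audit_record(line):
    """Split one raw audit record into type, timestamp, serial and fields.

    Naive whitespace splitting: good enough for the CONTAINER_OP sample,
    but it does not handle quoted values that contain spaces.
    """
    m = RECORD_RE.match(line.strip())
    if m is None:
        raise ValueError("not an audit record: %r" % line)
    rtype, stamp, serial, rest = m.groups()
    fields = dict(tok.split("=", 1) for tok in rest.split() if "=" in tok)
    return {"type": rtype, "time": stamp, "serial": int(serial), "fields": fields}


def contid_value(text):
    """Normalise a contid field to an unsigned 64-bit integer (-1 wraps)."""
    return int(text) % 2**64


sample = ("type=CONTAINER_OP msg=audit(1606404286.956:174546): "
          "op=set opid=8771 contid=4112973747854606336 old-contid=-1")
rec = parse_audit_record(sample)
was_unset = contid_value(rec["fields"]["old-contid"]) == U64_UNSET
```

Here `was_unset` is true for the sample record, which matches the description of "op=set" as an initial set.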

Store the audit container identifier in a refcounted kernel object that
is added to the master list of audit container identifiers.  This will
allow multiple container orchestrators/engines to work on the same
machine without danger of inadvertently re-using an existing identifier.
It will also allow an orchestrator to inject a process into an existing
container by checking if the original container owner is the one
injecting the task.  A hash table list is used to optimize searches.
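
The ownership check described above can be modelled in userspace as a toy. This Python sketch is not the kernel implementation: the class `ContidRegistry`, its method names and the exception types are all hypothetical, and a dict stands in for the hash table. It only illustrates the rule that an identifier already in use may be applied to another task solely by the orchestrator that first used it.

```python
U64_UNSET = 2**64 - 1  # the unset value, (u64)-1


class ContidRegistry:
    """Toy model of a master list of audit container identifiers."""

    def __init__(self):
        self._owner = {}      # contid -> pid of the orchestrator that first set it
        self._contid_of = {}  # task pid -> contid

    def set_contid(self, orchestrator, task, contid):
        """Record contid for task on behalf of orchestrator."""
        if not 0 <= contid < U64_UNSET:
            raise ValueError("contid must be a u64 other than the unset value")
        # First use registers the owner; later uses must come from that owner.
        owner = self._owner.setdefault(contid, orchestrator)
        if owner != orchestrator:
            raise PermissionError("contid %d belongs to another orchestrator" % contid)
        self._contid_of[task] = contid

    def contid_of(self, task):
        """Return the task's contid, or None if it was never set."""
        return self._contid_of.get(task)
```

In this model an orchestrator with pid 100 can place tasks 200 and 201 in container 42, while a second orchestrator that picks 42 by accident gets an error instead of silently sharing the identifier.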

audit: log drop of contid on exit of last task

Since the life of each audit container identifier is being tracked, we can
match the creation event with the destruction event.  Log the destruction of
the audit container identifier when the last process in that container exits.

Add support for reading the audit container identifier from the proc
filesystem.  This is a read from the proc entry of the form
/proc/PID/audit_containerid where PID is the process ID of the task
whose audit container identifier is sought.  The read returns up to a u64 value
(unset: 18446744073709551615).  This read requires CAP_AUDIT_CONTROL.

Add an entry to Documentation/ABI for /proc/$pid/audit_containerid.
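
A rough userspace sketch of the proc interface described above, in Python. The function names are hypothetical. Writing the value as decimal text is an assumption, although the sample SYSCALL record (exit=19 for a 19-digit identifier) is consistent with it. The `proc_root` parameter exists only so the sketch can be exercised against a scratch directory, since this file is only present on kernels carrying this patch set.

```python
import os

U64_UNSET = 2**64 - 1  # 18446744073709551615, reported when no contid is set


def contid_path(pid, proc_root="/proc"):
    """Path of the per-task audit container identifier file."""
    return os.path.join(proc_root, str(pid), "audit_containerid")


def write_contid(pid, contid, proc_root="/proc"):
    """Set the contid of task pid; the caller needs CAP_AUDIT_CONTROL."""
    if not 0 <= contid < U64_UNSET:
        # Unsetting is not permitted, so refuse the unset value up front.
        raise ValueError("contid must be a u64 other than the unset value")
    with open(contid_path(pid, proc_root), "w") as f:
        f.write(str(contid))  # assumption: decimal ASCII


def read_contid(pid, proc_root="/proc"):
    """Return the contid of task pid, or None if it is unset."""
    with open(contid_path(pid, proc_root)) as f:
        value = int(f.read().strip()) % 2**64
    return None if value == U64_UNSET else value
```

On a patched kernel an orchestrator would call `write_contid(child_pid, some_id)` before the child starts its workload. On any other kernel the open fails because the file does not exist.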

Please see the github audit kernel issue for the main feature:
  linux-audit/audit-kernel#90
Please see the github audit userspace issue for supporting additions:
  linux-audit/audit-userspace#51
Please see the github audit testsuite issue for the test case:
  linux-audit/audit-testsuite#64
Please see the github audit wiki for the feature overview:
  https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Steve Grubb <sgrubb@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
fengguang pushed a commit to 0day-ci/linux that referenced this issue Dec 22, 2020
Create a new audit record AUDIT_CONTAINER_ID to document the audit
container identifier of a process if it is present.

Because it is called from audit_log_exit(), syscalls are covered.

Include target_cid references from ptrace and signal.

A sample raw event:
time->Thu Nov 26 10:24:40 2020
type=PROCTITLE msg=audit(1606404280.226:174542): proctitle=2F7573722F62696E2F7065726C002D7700636F6E7461696E657269642F74657374
type=PATH msg=audit(1606404280.226:174542): item=1 name="/tmp/audit-testsuite-dir-8riQ/testsuite-1606404267-WNldVJCr" inode=428 dev=00:1f mode=0100644 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=PATH msg=audit(1606404280.226:174542): item=0 name="/tmp/audit-testsuite-dir-8riQ/" inode=427 dev=00:1f mode=040700 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 nametype=PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1606404280.226:174542): cwd="/root/rgb/git/audit-testsuite/tests"
type=SYSCALL msg=audit(1606404280.226:174542): arch=c000003e syscall=257 success=yes exit=6 a0=ffffff9c a1=557446bd5f10 a2=80241 a3=1b6 items=2 ppid=8724 pid=8758 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 ses=1 comm="perl" exe="/usr/bin/perl" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="testsuite-1606404267-WNldVJCr" record=1
type=CONTAINER_ID msg=audit(1606404280.226:174542): record=1 contid=527940429489930240
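
To show how the auxiliary record ties back to the rest of the event, the following Python sketch groups raw records by the shared `msg=audit(timestamp:serial)` stamp and attaches the contid found in a CONTAINER_ID record. The function name `contid_by_event` is hypothetical, and the sketch only looks at the fields visible in the sample above.

```python
import re
from collections import defaultdict

# Records of one event share the same "timestamp:serial" stamp.
EVENT_RE = re.compile(r"type=(\S+) msg=audit\((\d+\.\d+:\d+)\):")
# Match "contid=" but not "old-contid=".
CONTID_RE = re.compile(r"(?<![\w-])contid=(\d+)")


def contid_by_event(lines):
    """Map each event stamp to (record types in order, contid or None)."""
    types = defaultdict(list)
    contids = {}
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if m is None:
            continue  # e.g. the "time->" separator printed by ausearch
        rtype, event = m.groups()
        types[event].append(rtype)
        if rtype == "CONTAINER_ID":
            c = CONTID_RE.search(line)
            if c:
                contids[event] = int(c.group(1))
    return {event: (types[event], contids.get(event)) for event in types}


# Shortened copies of two records from the sample event above.
sample = [
    "time->Thu Nov 26 10:24:40 2020",
    "type=SYSCALL msg=audit(1606404280.226:174542): arch=c000003e syscall=257 record=1",
    "type=CONTAINER_ID msg=audit(1606404280.226:174542): record=1 contid=527940429489930240",
]
events = contid_by_event(sample)
```

For the sample, `events` holds a single entry keyed by `1606404280.226:174542` that carries both record types and the contid 527940429489930240.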

Please see the github audit kernel issue for the main feature:
  linux-audit/audit-kernel#90
Please see the github audit userspace issue for supporting additions:
  linux-audit/audit-userspace#51
Please see the github audit testsuite issue for the test case:
  linux-audit/audit-testsuite#64
Please see the github audit wiki for the feature overview:
  https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Steve Grubb <sgrubb@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
@rgbriggs
Member Author

rgbriggs commented Jan 7, 2021

2020-12-21
post v10 kernel
https://www.redhat.com/archives/linux-audit/2020-December/msg00047.html
https://lkml.org/lkml/2020/12/21/338
post v10 user
https://www.redhat.com/archives/linux-audit/2020-December/msg00059.html
https://lkml.org/lkml/2020/12/21/361
The upstream kernel audit maintainer quickly pointed out that the ACKs on the first patch were questionable. I acknowledged that they were out of date, which triggered another version.


fengguang pushed a commit to 0day-ci/linux that referenced this issue Jan 12, 2021
The audit-related parameters in struct task_struct should ideally be
collected together and accessed through a standard audit API and the audit
structures made opaque to other kernel subsystems.

Collect the existing loginuid, sessionid and audit_context together in a
new opaque struct audit_task_info called "audit" in struct task_struct.

Use kmem_cache to manage this pool of memory.
Un-inline audit_free() to be able to always recover that memory.

Please see the upstream github issues
linux-audit/audit-kernel#81
linux-audit/audit-kernel#90

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
fengguang pushed a commit to 0day-ci/linux that referenced this issue Jan 12, 2021
Implement the proc fs write to set the audit container identifier of a
process, emitting an AUDIT_CONTAINER_OP record to document the event.

This is a write from the container orchestrator task to a proc entry of
the form /proc/PID/audit_containerid where PID is the process ID of the
newly created task that is to become the first task in a container, or
an additional task added to a container.

The write expects up to a u64 value (unset: 18446744073709551615).

The writer must have capability CAP_AUDIT_CONTROL.

This will produce an event such as this with the new CONTAINER_OP record:
  time->Thu Nov 26 10:24:27 2020
  type=PROCTITLE msg=audit(1606404267.551:174524): proctitle=2F7573722F62696E2F7065726C002D7700636F6E7461696E657269642F74657374
  type=SYSCALL msg=audit(1606404267.551:174524): arch=c000003e syscall=1 success=yes exit=20 a0=6 a1=557446aa9180 a2=14 a3=100 items=0 ppid=6827 pid=8724 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 ses=1 comm="perl" exe="/usr/bin/perl" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
  type=CONTAINER_OP msg=audit(1606404267.551:174524): op=set opid=8730 contid=4515122123205246976 old-contid=-1

The "op" field indicates an initial set.  The "opid" field is the
object's PID, the process being "contained".  New and old audit
container identifier values are given in the "contid" fields.

It is not permitted to unset the audit container identifier.
A child inherits its parent's audit container identifier.

Store the audit container identifier in a refcounted kernel object that
is added to the master list of audit container identifiers.  This will
allow multiple container orchestrators/engines to work on the same
machine without danger of inadvertently re-using an existing identifier.
It will also allow an orchestrator to inject a process into an existing
container by checking if the original container owner is the one
injecting the task.  A hash table list is used to optimize searches.

Since the life of each audit container identifier is being tracked, we match
the creation event with the destruction event.  Log the drop of the audit
container identifier when the last process in that container exits.
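
The lifetime tracking described above can be pictured with a small reference-counting model in Python. This is a sketch under assumptions, not the kernel code: the class `ContidLifetime` is hypothetical, and the "create" and "drop" labels are illustrative names for the two ends of an identifier's life rather than actual record field values. It combines the inheritance rule (a child takes its parent's identifier) with the rule that the drop is logged only when the last task exits.

```python
class ContidLifetime:
    """Toy refcount model: one reference per live task in a container."""

    def __init__(self):
        self.refs = {}  # contid -> number of live tasks holding it
        self.task = {}  # task pid -> contid
        self.log = []   # ("create" | "drop", contid) in order of occurrence

    def set(self, pid, contid):
        """Give a task its contid; simplification: only once per task."""
        if pid in self.task:
            raise ValueError("model only handles the initial set")
        self.task[pid] = contid
        self.refs[contid] = self.refs.get(contid, 0) + 1
        if self.refs[contid] == 1:
            self.log.append(("create", contid))

    def fork(self, parent, child):
        """A child inherits its parent's contid, if the parent has one."""
        contid = self.task.get(parent)
        if contid is not None:
            self.set(child, contid)

    def exit(self, pid):
        """Drop the task's reference; log the drop when it was the last."""
        contid = self.task.pop(pid, None)
        if contid is None:
            return
        self.refs[contid] -= 1
        if self.refs[contid] == 0:
            del self.refs[contid]
            self.log.append(("drop", contid))
```

In this model, if task 10 is placed in container 7 and forks task 11, the exit of task 10 logs nothing and the drop appears only once task 11 exits as well.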

Add support for reading the audit container identifier from the proc
filesystem.  This is a read from the proc entry of the form
/proc/PID/audit_containerid where PID is the process ID of the task
whose audit container identifier is sought.  The read expects up to a u64 value
(unset: (u64)-1).  This read requires CAP_AUDIT_CONTROL.

Add an entry to Documentation/ABI for /proc/$pid/audit_containerid.

Please see the github audit kernel issue for the main feature:
  linux-audit/audit-kernel#90
Please see the github audit userspace issue for supporting additions:
  linux-audit/audit-userspace#51
Please see the github audit testsuite issue for the test case:
  linux-audit/audit-testsuite#64
Please see the github audit wiki for the feature overview:
  https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
fengguang pushed a commit to 0day-ci/linux that referenced this issue Jan 12, 2021
Create a new audit record AUDIT_CONTAINER_ID to document the audit
container identifier of a process if it is present.

Because it is called from audit_log_exit(), syscalls are covered.

Include target_cid references from ptrace and signal.

A sample raw event:
  time->Thu Nov 26 10:24:40 2020
  type=PROCTITLE msg=audit(1606404280.226:174542): proctitle=2F7573722F62696E2F7065726C002D7700636F6E7461696E657269642F74657374
  type=PATH msg=audit(1606404280.226:174542): item=1 name="/tmp/audit-testsuite-dir-8riQ/testsuite-1606404267-WNldVJCr" inode=428 dev=00:1f mode=0100644 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 nametype=CREATE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
  type=PATH msg=audit(1606404280.226:174542): item=0 name="/tmp/audit-testsuite-dir-8riQ/" inode=427 dev=00:1f mode=040700 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 nametype=PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
  type=CWD msg=audit(1606404280.226:174542): cwd="/root/rgb/git/audit-testsuite/tests"
  type=SYSCALL msg=audit(1606404280.226:174542): arch=c000003e syscall=257 success=yes exit=6 a0=ffffff9c a1=557446bd5f10 a2=80241 a3=1b6 items=2 ppid=8724 pid=8758 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 ses=1 comm="perl" exe="/usr/bin/perl" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="testsuite-1606404267-WNldVJCr" record=1
  type=CONTAINER_ID msg=audit(1606404280.226:174542): record=1 contid=527940429489930240

Please see the github audit kernel issue for the main feature:
  linux-audit/audit-kernel#90
Please see the github audit userspace issue for supporting additions:
  linux-audit/audit-userspace#51
Please see the github audit testsuite issue for the test case:
  linux-audit/audit-testsuite#64
Please see the github audit wiki for the feature overview:
  https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
pcmoore pushed a commit that referenced this issue Jan 22, 2024
bch_fs::snapshots is allocated by kvzalloc in __snapshot_t_mut.
It should be freed by kvfree not kfree.
Otherwise umount will trigger:

[  406.829178 ] BUG: unable to handle page fault for address: ffffe7b487148008
[  406.830676 ] #PF: supervisor read access in kernel mode
[  406.831643 ] #PF: error_code(0x0000) - not-present page
[  406.832487 ] PGD 0 P4D 0
[  406.832898 ] Oops: 0000 [#1] PREEMPT SMP PTI
[  406.833512 ] CPU: 2 PID: 1754 Comm: umount Kdump: loaded Tainted: G           OE      6.7.0-rc7-custom+ #90
[  406.834746 ] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[  406.835796 ] RIP: 0010:kfree+0x62/0x140
[  406.836197 ] Code: 80 48 01 d8 0f 82 e9 00 00 00 48 c7 c2 00 00 00 80 48 2b 15 78 9f 1f 01 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 03 05 56 9f 1f 01 <48> 8b 50 08 48 89 c7 f6 c2 01 0f 85 b0 00 00 00 66 90 48 8b 07 f6
[  406.837810 ] RSP: 0018:ffffb9d641607e48 EFLAGS: 00010286
[  406.838213 ] RAX: ffffe7b487148000 RBX: ffffb9d645200000 RCX: ffffb9d641607dc4
[  406.838738 ] RDX: 000065bb00000000 RSI: ffffffffc0d88b84 RDI: ffffb9d645200000
[  406.839217 ] RBP: ffff9a4625d00068 R08: 0000000000000001 R09: 0000000000000001
[  406.839650 ] R10: 0000000000000001 R11: 000000000000001f R12: ffff9a4625d4da80
[  406.840055 ] R13: ffff9a4625d00000 R14: ffffffffc0e2eb20 R15: 0000000000000000
[  406.840451 ] FS:  00007f0a264ffb80(0000) GS:ffff9a4e2d500000(0000) knlGS:0000000000000000
[  406.840851 ] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  406.841125 ] CR2: ffffe7b487148008 CR3: 000000018c4d2000 CR4: 00000000000006f0
[  406.841464 ] Call Trace:
[  406.841583 ]  <TASK>
[  406.841682 ]  ? __die+0x1f/0x70
[  406.841828 ]  ? page_fault_oops+0x159/0x470
[  406.842014 ]  ? fixup_exception+0x22/0x310
[  406.842198 ]  ? exc_page_fault+0x1ed/0x200
[  406.842382 ]  ? asm_exc_page_fault+0x22/0x30
[  406.842574 ]  ? bch2_fs_release+0x54/0x280 [bcachefs]
[  406.842842 ]  ? kfree+0x62/0x140
[  406.842988 ]  ? kfree+0x104/0x140
[  406.843138 ]  bch2_fs_release+0x54/0x280 [bcachefs]
[  406.843390 ]  kobject_put+0xb7/0x170
[  406.843552 ]  deactivate_locked_super+0x2f/0xa0
[  406.843756 ]  cleanup_mnt+0xba/0x150
[  406.843917 ]  task_work_run+0x59/0xa0
[  406.844083 ]  exit_to_user_mode_prepare+0x197/0x1a0
[  406.844302 ]  syscall_exit_to_user_mode+0x16/0x40
[  406.844510 ]  do_syscall_64+0x4e/0xf0
[  406.844675 ]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[  406.844907 ] RIP: 0033:0x7f0a2664e4fb

Signed-off-by: Su Yue <glass.su@suse.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>