Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent: Restrict device access at upper node of container's cgroup #7531

Merged
merged 6 commits into from Nov 8, 2023

Conversation

justxuewei
Copy link
Member

The target is to guarantee that containers couldn't escape to access extra
devices, like vm rootfs, etc.

Motivation

Assume that there is a cgroup, such as /A/B. The B is container cgroup,
and the A is what we called pod cgroup. No matter what permissions are
set for the container (B), the A's permission is always a *:* rwm
with current implementation. It leads that containers could acquire
permission to access to other devices in VM that not belongs to themselves.

Details

In order to set devices cgroup properly, the order of setting cgroups is
that the pod cgroup comes first and the container cgroup comes after.

The Sandbox has a new field, devcg_info, to save cgroup states. To
avoid setting container cgroup too early, an initialization should be done
carefully. inited, one of the states, is a boolean to indicate if the pod
cgroup is initialized. If no, the pod cgroup should be created firstly, and
set default permissions. After that, the pause container cgroup is created
and inherits the permissions from the pod cgroup.

If whitelist mode which allows containers to access all devices in VM is
enabled, then device resources from OCI spec are ignored.

Limitation

This feature not supports systemd cgroup and cgroup v2, since:

Fixes: #7507

@katacontainersbot katacontainersbot added the size/huge Largest and most complex task (probably needs breaking into small pieces) label Aug 3, 2023
@katacontainersbot
Copy link
Contributor

Can one of the admins verify this patch?

@katacontainersbot katacontainersbot added size/tiny Smallest and simplest task and removed size/huge Largest and most complex task (probably needs breaking into small pieces) labels Aug 3, 2023
@justxuewei justxuewei force-pushed the device-cgroup branch 2 times, most recently from a1ca9e4 to 1c87c3a Compare August 7, 2023 02:52
@katacontainersbot katacontainersbot added size/huge Largest and most complex task (probably needs breaking into small pieces) and removed size/tiny Smallest and simplest task labels Aug 7, 2023
@justxuewei justxuewei force-pushed the device-cgroup branch 2 times, most recently from 7ea5eaf to 993a8f3 Compare August 7, 2023 09:14
@katacontainersbot katacontainersbot added size/tiny Smallest and simplest task and removed size/huge Largest and most complex task (probably needs breaking into small pieces) labels Aug 7, 2023
@katacontainersbot katacontainersbot added size/huge Largest and most complex task (probably needs breaking into small pieces) and removed size/tiny Smallest and simplest task labels Aug 7, 2023
@katacontainersbot katacontainersbot added size/tiny Smallest and simplest task and removed size/huge Largest and most complex task (probably needs breaking into small pieces) labels Aug 7, 2023
@katacontainersbot katacontainersbot added size/huge Largest and most complex task (probably needs breaking into small pieces) and removed size/tiny Smallest and simplest task labels Aug 8, 2023
@katacontainersbot katacontainersbot added size/tiny Smallest and simplest task and removed size/huge Largest and most complex task (probably needs breaking into small pieces) labels Aug 8, 2023
@justxuewei justxuewei force-pushed the device-cgroup branch 2 times, most recently from bc01dfa to ab57fe6 Compare August 8, 2023 08:33
@katacontainersbot katacontainersbot added size/huge Largest and most complex task (probably needs breaking into small pieces) and removed size/tiny Smallest and simplest task labels Aug 8, 2023
@justxuewei justxuewei force-pushed the device-cgroup branch 3 times, most recently from 263cd64 to 189ed95 Compare October 13, 2023 09:28
@justxuewei justxuewei force-pushed the device-cgroup branch 2 times, most recently from e136a6c to c2f6bdc Compare October 25, 2023 03:54
@liubin
Copy link
Member

liubin commented Oct 25, 2023

/test

@lifupan
Copy link
Member

lifupan commented Oct 27, 2023

/test jenkins


if let Some(devices_group_info) = devices_group_info.as_mut() {
// Cgroup path of parent of container
let pod_cpath = PathBuf::from(cpath)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about container cgroup didn't have a parent path? such as "/container_cgroup"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Device cgroup won't work if cgroup path is "/" or "/{anything}".

/// Check if OCI spec contains a whiltlist rule.
///
/// # Returns
///
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what's the whitelist and blacklist? Maybe the white list is: deny all first and then allow the specified devices one by one; and the black list is on the opposite side: allow all first, and then deny the specified devices one by one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. The previous whitelist is replaced with allowed_all. The blacklist is replaced with whitelist.

@justxuewei justxuewei force-pushed the device-cgroup branch 8 times, most recently from 9a96b76 to a21fad3 Compare November 7, 2023 08:36
The target is to guarantee that containers couldn't escape to access extra
devices, like vm rootfs, etc.

Assume that there is a cgroup, such as `/A/B`. The `B` is container cgroup,
and the `A` is what we called pod cgroup. No matter what permissions are
set for the container (`B`), the `A`'s permission is always `a *:* rwm`. It
leads that containers could acquire permission to access to other devices
in VM that not belongs to themselves.

In order to set devices cgroup properly, the order of setting cgroups is
that the pod cgroup comes first and the container cgroup comes after.

The `Sandbox` has a new field, `devcg_info`, to save cgroup states. To
avoid setting container cgroup too early, an initialization should be done
carefully. `inited`, one of the states, is a boolean to indicate if the pod
cgroup is initialized. If no, the pod cgroup should be created firstly, and
set default permissions. After that, the pause container cgroup is created
and inherits the permissions from the pod cgroup.

If whitelist mode which allows containers to access all devices in VM is
enabled,  then device resources from OCI spec are ignored.

This feature not supports systemd cgroup and cgroup v2, since:

- Systemd cgroup implemented on Agent hasn't supported devices subsystem so
  far, see: kata-containers#7506.
- Cgroup v2's device controller depends on eBPF programs, which is out of
  scope of cgroup.

Fixes: kata-containers#7507

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
The runk is a standard OCI runtime that isnt' aware of concept of sandbox.
Therefore, the `devcg_info` argument of `LinuxContainer::new()` is
unneccessary to be provided.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
The changes include:

- Change to debug logging level for resources after processed.
- Remove a todo for pod cgroup cleanup.
- Add an anyhow context to `get_paths_and_mounts()`.
- Remove code which denys access to VMROOTFS since it won't take effect. If
  blackmode is in use, the VMROOTFS will be denyed as default. Otherwise,
  device cgroups won't be updated in whitelist mode.
- Add a unit test for `default_allowed_devices()`.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
FSManager of systemd cgroup manager is responsible for setting up cgroup
path. The container launching will be failed if the FSManager is in
read-only mode.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
`TestDeviceCgroup` is added to cri-containerd's integration tests. The test
launches two containers. Each container has a block device. It checks the
validity of device cgroup.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
- Disable device cgroup restriction while pod cgroup is not available.
- Remove balcklist-related names and change whitelist-related names to
  allowed_all.

Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
@justxuewei
Copy link
Member Author

/test

1 similar comment
@justxuewei
Copy link
Member Author

/test

@lifupan lifupan merged commit 100a73d into kata-containers:main Nov 8, 2023
140 of 142 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test size/huge Largest and most complex task (probably needs breaking into small pieces)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

agent: Restrict device access at upper node of container's cgroup
6 participants