Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create tap device on virt handler #3290

Merged

Conversation

maiqueb
Copy link
Contributor

@maiqueb maiqueb commented Apr 13, 2020

What this PR does / why we need it:
Currently, libvirt is responsible for creating & configuring the tap devices that will be used by the kubevirt VMs. As such, virt-launcher pods require the NET_ADMIN capability.

This PR moves the tap device creation / configuration stage to the virt-handler pod, which will create / configure the tap devices on behalf of the virt-launchers (and in their netns).

To allow a non-privileged libvirt to consume a pre-configured tap device, the kubevirt's VM domxml specification has to be updated.

Which issue(s) this PR fixes:
Partially fixes #3085

Special notes for your reviewer:
This PR is part of the change-set to remove the NET_ADMIN capability from the virt-launcher pod.

Release note:

Have virt-handler (KubeVirt agent) create the tap devices on behalf of the virt-launchers.
This is part of the work to remove the NET_ADMIN capability from the virt-launcher pods.

The virt-launcher process now runs as virt_launcher.process selinux context by default.

@kubevirt-bot kubevirt-bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/L labels Apr 13, 2020
@maiqueb maiqueb force-pushed the create_tap_device_on_virt_handler branch from 77bd8ca to e7bd1e3 Compare April 13, 2020 15:58
@mlsorensen
Copy link
Contributor

Can't wait to try this out. Does the change set referenced address #3085 and I assume there will be more commits to the DHCP portion and actually remove NET_ADMIN from the launcher pod?

@maiqueb
Copy link
Contributor Author

maiqueb commented Apr 14, 2020

Can't wait to try this out. Does the change set referenced address #3085 and I assume there will be more commits to the DHCP portion and actually remove NET_ADMIN from the launcher pod?

It does address #3085 .

I'd rather have this PR focus exclusively on having non-privileged libvirt consume a pre-provisioned tap device; once that's done, I can focus on the next step - handling DHCP - which would finally enable the removal of the NET_ADMIN capability from the launcher pod.

@maiqueb
Copy link
Contributor Author

maiqueb commented Apr 14, 2020

/retest

@maiqueb
Copy link
Contributor Author

maiqueb commented Apr 15, 2020

After analyzing the failed tests, I noticed that libvirt fails to attach to the pre-configured tap device being created by virt-handler.

Upon looking at the audit logs, I've found suspicious lines, quite similar to the ones listed in this bug:

type=AVC msg=audit(1586956552.265:513): avc:  denied  { relabelfrom } for  pid=27423 comm="libvirtd" scontext=system_u:system_r:container_t:s0:c143,c582 tcontext=system_u:system_r:spc_t:s0 tclass=tun_socket permissive=0

I noticed that you've CC'ed yourself on that bug @fabiand a very short time ago. Are you hitting this as well ?

I confirm that allowing libvirtd to 'relabelfrom' and 'relabelto' fixes this issue, enabling libvirt to consume the pre-configured tap devices - i.e. grep libvirtd /var/log/audit/audit.log | audit2allow -M mypol && semodule -i mypol.pp

Given my lack of selinux knowledge I'm not sure this is a selinux-policy bug in centos8.1 or if I need to provision those policies in virt-handler ... What do you think @fabiand , @rmohr ?

EDIT: the policy file that enables libvirt to consume the pre-proviosioned tap device can be found below:


module mypol 1.0;

require {
	type container_t;
	type spc_t;
	class tun_socket { relabelfrom relabelto };
}

#============= container_t ==============
allow container_t self:tun_socket relabelto;

#!!!! This avc is allowed in the current policy
allow container_t spc_t:tun_socket relabelfrom;

@rmohr
Copy link
Member

rmohr commented Apr 15, 2020

@maiqueb can we try to label the tab device in a way from virt-handler, so that libvirt will not have to change the labels?

@vladikr
Copy link
Member

vladikr commented Apr 15, 2020

We have a similar issue also related to a tun_socket relabelling. I think virt-handler could relabel the /dev/tapX that you are creating to system_u:object_r:tun_tap_device_t:s0, but my fear is that libvirt will still try to label[1] it and may fail. But let's try :)

https://github.com/stefanberger/libvirt-tpm/blob/master/src/security/security_selinux.c#L3271

@vladikr
Copy link
Member

vladikr commented Apr 15, 2020

btw @maiqueb where is this log coming from? From the compute container or the node?

@maiqueb
Copy link
Contributor Author

maiqueb commented Apr 15, 2020

btw @maiqueb where is this log coming from? From the compute container or the node?

The node.

EDIT: Wait, I think I misunderstood you; this log comes from the 'node' - as in kubectl get nodes ... I'm now confused if you didn't mean the compute container ...

@maiqueb maiqueb force-pushed the create_tap_device_on_virt_handler branch from e7bd1e3 to 85b52eb Compare April 16, 2020 12:06
@kubevirt-bot kubevirt-bot added dco-signoff: no Indicates the PR's author has not DCO signed all their commits. and removed dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Apr 16, 2020
@maiqueb maiqueb force-pushed the create_tap_device_on_virt_handler branch from 85b52eb to 9683a50 Compare April 16, 2020 12:14
@kubevirt-bot kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Apr 16, 2020
@maiqueb
Copy link
Contributor Author

maiqueb commented Apr 16, 2020

We have a similar issue also related to a tun_socket relabelling. I think virt-handler could relabel the /dev/tapX that you are creating to system_u:object_r:tun_tap_device_t:s0, but my fear is that libvirt will still try to label[1] it and may fail. But let's try :)

https://github.com/stefanberger/libvirt-tpm/blob/master/src/security/security_selinux.c#L3271

@vladikr I don't think it would do, for 2 reasons:

  • I don't see any /dev/tapX device anywhere - not in the k8s node, nor in the virt-launcher pod. I think the tap device, is a simple opened file descriptor, unlike macvtap, which is backed up by a character device.
  • that 'object' already has that allow rule in the k8s node:
./cluster-up/ssh.sh node02 "sudo semanage fcontext -l" | grep "/dev/tap"
/dev/tap.*                                         character device   system_u:object_r:tun_tap_device_t:s0 
/dev/tape.*                                        character device   system_u:object_r:tape_device_t:s0 

Ultimately, while I think we all agree the real solution is to properly tag the tap device, I still don't know what is the object here.... I think it's the interface itself, but that's half 'feeling' half educated guess.

Also, I'm not sure what's the tag to use.

For the time being, I've added the rules that 'make this work' that I pasted on #3290 (comment). I'll keep looking into this matter.

log.Log.Reason(err).Criticalf("Failed to create tap device %s. Reason: %s", tapName, out)
return "", err
}
log.Log.Infof("Created tap device: %s", tapName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the device that's being generated here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, sorry but I don't understand the question. You mean the type of device ?... The name ?...

Comment on lines 12 to 13
(allow container_t self (tun_socket (relabelto)))
(allow container_t spc_t (tun_socket (relabelfrom)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going forward, we are planning to remove this custom policy, hence we need to find a solution that allows virt-launcher to use this tap device while it is labeled as container_t

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this is just here because it felt a little bit less wrong than in the virt_launcher.cil file .

I'm looking for a better way of getting selinux to co-exist w/ the tap device; this is just temporary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this doesn't make it better. This file is a base for virt_launcher.cil, if anything it should be there. However, this will soon be removed and virt-launcher will not have a special policy.

@vladikr
Copy link
Member

vladikr commented Apr 16, 2020

  • I don't see any /dev/tapX device anywhere - not in the k8s node, nor in the virt-launcher pod. I think the tap device, is a simple opened file descriptor, unlike macvtap, which is backed up by a character device.

Afaik, when you create a tap device you will have a corresponding char device.
macvtap should work the same way as a tap device, in that sense.

Could it be that this device has been created on node01 instead of node02?

@maiqueb
Copy link
Contributor Author

maiqueb commented Apr 16, 2020

  • I don't see any /dev/tapX device anywhere - not in the k8s node, nor in the virt-launcher pod. I think the tap device, is a simple opened file descriptor, unlike macvtap, which is backed up by a character device.

Afaik, when you create a tap device you will have a corresponding char device.
macvtap should work the same way as a tap device, in that sense.

It can be easily seen in the snippet below (my local setup, no kubevirt, fc31):

$ sudo ip tuntap add mode tap name magicTap0

$ ip a show magicTap0
244: magicTap0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0e:73:19:fe:b5:de brd ff:ff:ff:ff:ff:ff

$ ls -la /dev/tap*
zsh: no matches found: /dev/tap*

$ sudo ls -la /dev/tap*
zsh: no matches found: /dev/tap*

Could it be that this device has been created on node01 instead of node02?

It wasn't .

@maiqueb maiqueb force-pushed the create_tap_device_on_virt_handler branch from 9683a50 to 5f89948 Compare April 16, 2020 15:02
@vladikr
Copy link
Member

vladikr commented Apr 16, 2020

  • I don't see any /dev/tapX device anywhere - not in the k8s node, nor in the virt-launcher pod. I think the tap device, is a simple opened file descriptor, unlike macvtap, which is backed up by a character device.

Afaik, when you create a tap device you will have a corresponding char device.
macvtap should work the same way as a tap device, in that sense.

It can be easily seen in the snippet below (my local setup, no kubevirt, fc31):

I didn't go into the specific for a while now. Perhaps the device is being created when you open() /dev/net/tun... I'm not sure. I'll need to look into this.

Do you see the file descriptor of the tap device you've created?
If so, maybe you can follow libvirt's code[1] and use fsetfilecon_raw on this fd?

[1] https://github.com/stefanberger/libvirt-tpm/blob/master/src/security/security_selinux.c#L3271

@vladikr
Copy link
Member

vladikr commented Apr 17, 2020

It can be easily seen in the snippet below (my local setup, no kubevirt, fc31):

Unfortunately, I didn't have much time today to look deeper into this.

In general, the reason why there is no /dev/tapX device at this stage is that ip tuntap... only registers a network interface.
In order to use the tap device, one should create a tap device using fd =open("/dev/net/tun", ...) and then associate the created device to the created interface using ioctl(fd, TUNSETIFF, &interface...

The above is still happening in the virt-launcher (in libvirt[1] and then [2]) and it's the association of the device to the interface what's probably failing (ioctl(fd, TUNSETIFF)

I didn't find if/how it's possible to relabel the registered tap interface...

However, I do wonder if we can use macvtap (bridge mode) here. It will create a /dev/tapX that we could relabel. Perhaps, we could create a dummy device and put the macvtap on it. Then let the dhcp use this instead of the bridge?
Just a thought, I didn't really try.

[1] https://github.com/libvirt/libvirt/blob/master/src/qemu/qemu_interface.c#L443
[2] https://github.com/libvirt/libvirt/blob/master/src/util/virnetdevtap.c#L278

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 17, 2020
@maiqueb maiqueb force-pushed the create_tap_device_on_virt_handler branch from aad2ab7 to 0c49faa Compare April 17, 2020 13:40
Copy link
Member

@vladikr vladikr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very good to me already.
@phoracek comments make sense. Besides these, I would avoid introducing additional logic or refactoring into this PR.

@@ -7,4 +7,5 @@
(allow process hugetlbfs_t (dir (add_name create write remove_name rmdir setattr)))
(allow process hugetlbfs_t (file (create unlink)))
(allow process sysfs_t (file (write)))
(allow process tun_tap_device_t (chr_file (read write open ioctl)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression that we are aiming to get a correctly labeled interface in the virt-launcher which will not require any additional privileges.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After having removed the selinux categories from the virt-handler pod, this is no longer needed; I think you've reviewed an outdated version of the code.

Comment on lines 68 to 69
createVirtualNetworkingDevice(deviceName string, isMultiqueue bool, launcherPID int) error
configureVirtualNetworkingDevice() error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
maybe just call it createAndBindTapToBridge ? :)

In order to make virt-handler create the tap device on behalf of
virt-launcher, a set of changes were required:
  - updated the VIF struct to feature the name of the created tap
    device. This way, pod networking configuration phase kubevirt#2 can
    instruct libvirt that it should use a pre-existent tap device
    (via 'target' parameter). This will happen in a follow-up
    commit.
  - added a method to the `NetworkHandler` interface that
    creates a tap device. This method receives the tap interface
    name as parameter (to handle multus based scenarios, where one
    VM is plugged multiple taps).
  - create a tap device from the `Bridge` and `Masquerade`
    `BindMechanism`s, using the aforementioned `NetworkHandler`
    method. The name of the created devices will be cached in the
    VIF structure. This tap device will be configured a follow-up
    commit.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
With this change, libvirt will use a non-managed, pre-configured
tap device (which does not require NET_ADMIN capabilities) instead
of creating a new one (which does require NET_ADMIN capabilities).

This commit adds a method to configure a tap device to the
`NetworkHandler` and `BindMechanism` interfaces. This
configuration is common for all binding mechanism that require a
bridge: it configures the tap device, and binds it to the in-pod
bridge. This configuration step takes place in the pod networking
configuration phase#1.

Lastly, updates were made enabling this pre-configured tap device
to be consumed by libvirt; the generated domxml specifies interface
type 'ethernet' instead of 'bridge'. It also must feature the tap
device name as the interface 'target', and use the 'managed'
attribute to indicate that the tap device already exists, and that
no further configuration is to take place.

The unit tests had to be updated accordingly, enabling
the travis lane to pass.

This commit is part of the change-set to remove the NET_ADMIN
capability from the virt-launcher pods.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
When requested in the vmi spec, create the tap devices w/ multi
queue support.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Create the tap device via a golang library instead of relying on
the iproute tool. This library only performs the minimal required
set of syscalls on the tap's FD.

It is planned to eventually replace this library (currently songgao
water - [0]) for netlink's tuntap implementation [1], which
currently does not support creating the tap device with the desired
owner / group.

Those attributes will be used by libvirt on the consuming end, and
the kernel requires the NET_ADMIN capability whenever the owner or
group doesn't match (libvirt will open them with the `qemu` user,
which translated into uid / gid 107.

[0] - https://github.com/songgao/water
[1] - https://github.com/vishvananda/netlink

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
This patch reads the selinux labels of both virt-handler and
virt-launcher.

It then proceeds to create the tap-device using the label of
virt-launcher, so it can be consumed by libvirt without requiring
any special selinux policies.

Once that is done, the original context is reset, so all other
binaries can be executed in the virt-handler selinux context.

Calls to SetExecLabel should be wrapped in Lock /Unlock OS thread
until the program execution finishes, to guarantee another
goroutine does not migrate to the current thread before execution
is complete, as per
https://godoc.org/github.com/opencontainers/selinux/go-selinux#SetExecLabel

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
There are some tests in the `security_features` suite that check
that KubeVirt runs successfully when the virt-launcher's
`selinuxLauncherType` is defined as `container_t`.

When that happens, we fall victim to libvirt blindly re-labeling
the tun socket, and we hit the following AVC:
```
type=AVC msg=audit(1596645906.167:1471): avc:  denied  \
{ relabelfrom } for  pid=20383 comm="libvirtd" \
scontext=system_u:system_r:container_t:s0:c803,c1014 \
tcontext=system_u:system_r:container_t:s0:c803,c1014 \
tclass=tun_socket permissive=0
```

Since updating the selinux policy to allow for that is not
desired, we're left with no choice but to require
virt_launcher.process selinux contexts for networking.

A follow up patch will update the default value, setting it to the
escalated privilege version.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Not having categories in the virt-handler allows us to greatly
simplify the tap making solution, since it allows the process that
creates the tap device (which runs with virt-launcher's categories)
to access the clone device available on virt-handler.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Whenever a file descriptor is opened and a new process is forked
or exec'ed, it will inherit all the FDs of the parent.

In this case, the leaked FDs have the `spc_t` type, and the child
process is running with lesser privileges, which will result in
plenty of AVCs in the selinux audit log.

By setting the close on exec flag of the file descriptors, they
will close themselves on the child, as per this article:
https://danwalsh.livejournal.com/53603.html

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Use `virt_launcher.process` as the default selinux launcher type.

Update the functional tests accordingly.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
Add a struct that allows creating the tap device in 2 steps:
  - get the selinux labels and build the `exec.Command(...)`
  - selinux label handling and executing the tap making command

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
@maiqueb maiqueb force-pushed the create_tap_device_on_virt_handler branch from b3e43ef to fbec96d Compare August 19, 2020 10:13
@phoracek
Copy link
Member

/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Aug 19, 2020
@maiqueb
Copy link
Contributor Author

maiqueb commented Aug 19, 2020

/hold cancel

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 19, 2020
@phoracek
Copy link
Member

/retest

1 similar comment
@maiqueb
Copy link
Contributor Author

maiqueb commented Aug 19, 2020

/retest

@vladikr
Copy link
Member

vladikr commented Aug 19, 2020

/lgtm

@maiqueb
Copy link
Contributor Author

maiqueb commented Aug 19, 2020

/retest

3 similar comments
@maiqueb
Copy link
Contributor Author

maiqueb commented Aug 20, 2020

/retest

@maiqueb
Copy link
Contributor Author

maiqueb commented Aug 20, 2020

/retest

@phoracek
Copy link
Member

/retest

@kubevirt-commenter-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@maiqueb
Copy link
Contributor Author

maiqueb commented Aug 21, 2020

/retest

@kubevirt-bot kubevirt-bot merged commit 83cb6a6 into kubevirt:master Aug 21, 2020
@vladikr
Copy link
Member

vladikr commented Aug 21, 2020

@maiqueb Congratulation! Thank you for bearing with us! It wasn't an easy PR. Kudos!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/build-change Categorizes PRs as related to changing build files of virt-* components lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Complete migration of network configuration from launcher into handler pod