Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime-rs: refactor qemu driver #9353

Merged
merged 8 commits into from Apr 16, 2024
Merged

Conversation

pmores
Copy link
Contributor

@pmores pmores commented Mar 26, 2024

Since PR #8866 has been merged the qemu driver in runtime-rs features a reasonable support of networking. However the implementation leaves a bunch of things to be desired. It looks much more like govmm than runtime-rs's own qemu code, contains numerous stub abstractions that are unused and only make code harder to understand, conflicts with QemuCmdLine interface & implementation principles etc.

This PR aims to fix those shortcomings and make network device support an integral part of qemu-rs. No changes in actual behaviour are foreseen, this is a pure refactor.

As a logical part of this effort, ownership and management of file descriptors to be passed to qemu is moved to QemuCmdLine (so far they have been managed by the QemuInner::start_vm() function's topmost scope which was quite inappropriate given the way, way higher level of abstraction this function operates at). Also this change is just a refactor, the expected runtime behaviour of the code does not change (the descriptors were and are closed on start_vm() exit).

The idea of this function is to make sure O_CLOEXEC is not set on file
descriptors that should be inherited by a child (=hypervisor) process.
The approach so far is however rather heavy-handed - clearing *all* flags
is unjustifiably aggresive for a low-level function with no knowledge of
context whatsoever.

This commit refactors the function so that it only does what's expected
and renames it accordingly.  It also clarifies some of its call sites.

Signed-off-by: Pavel Mores <pmores@redhat.com>
generate_netdev_fds() takes NetworkConfig from which it however only needs
a host-side network device name.  This commit makes it take the device name
directly, making the function useful to callers who don't have the whole
NetworkConfig but do have the requisite device name.

Signed-off-by: Pavel Mores <pmores@redhat.com>
In keeping with architecture of QemuCmdLine implementation we split the
functionality into two objects: Netdev to represent and generate the
-netdev part and DeviceVirtioNet for the -device virtio-net-<transport>
part.

This change is a pure refactor, existing functionality does not change.
However, we do remove some stub generalizations and govmm-isms, notably:
- we remove the NetDev enum since the only network interface types that
  kata seems to use with qemu are tuntap and macvtap, both of which are
  implemented by the same -netdev tap
- enum DeviceDriver is also left out since it doesn't seem reasonable to
  try to represent VFIO NICs (which are completely different from
  virtio-net ones) with the same struct as virtio-net
- we also remove VirtioTransport because there's no use for it so far, but
  with the expectation that it will be added soon.

We also make struct Netdev the owner of any vhost-net and queue file
descriptors so that their lifetime is tied ultimately to the lifetime of
QemuCmdLine automatically, instead of returning the fds to the caller and
forcing it to achieve the equivalent functionality but manually.

Signed-off-by: Pavel Mores <pmores@redhat.com>
…tioNet

This commit replaces the existing NetDevice-based implementation with one
using Netdev and DeviceVirtioNet.

Signed-off-by: Pavel Mores <pmores@redhat.com>
is_running_in_vm() is enough to figure out whether to disable_modern but
it's clumsy and verbose to use.  should_disable_modern() streamlines the
usage by encapsulating the verbosity.

Signed-off-by: Pavel Mores <pmores@redhat.com>
add_network_device() doesn't need to be passed NetworkInfo since it
already has access to the full HypervisorConfig.

Also, one of the goals of QemuCmdLine interface's design is to avoid
coupling between QemuCmdLine and the hypervisor crate's device module,
if at all possible.  That's why add_network_device() shouldn't take
device module's NetworkConfig but just parts that are useful in
add_network_device()'s implementation.

Signed-off-by: Pavel Mores <pmores@redhat.com>
Make file descriptors to be passed to qemu owned by QemuCmdLine.  See
commit 52958f17cd for more explanation.

Signed-off-by: Pavel Mores <pmores@redhat.com>
The remaining code in network.rs was mostly moved to utils.rs which seems
better home for these utility functions anyway (and a closely related
function open_named_tuntap() has already lived there).

ToString implementation for Address was removed after some consideration.
Address should probably ideally implement Display (as per RFC 565) which
would also supply a ToString implementation, however it implements Debug
instead, probably to enable automatic implementation of Debug for anything
that Address is a member of, if for no other reason.  Rather than having
two identical functions this commit simply switches to using the Debug
implementation for printing Address on qemu command line.

Fixes kata-containers#9352

Signed-off-by: Pavel Mores <pmores@redhat.com>
@katacontainersbot katacontainersbot added the size/huge Largest and most complex task (probably needs breaking into small pieces) label Mar 26, 2024
Copy link
Member

@BbolroC BbolroC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @pmores. I was also thinking of why we need a separate network.rs for the command construction. I also verified that runtime-rs works with this change on s390x.

@pmores
Copy link
Contributor Author

pmores commented Mar 27, 2024

Thanks for your review @BbolroC ! network.rs served its purpose well. It might be argued that cmdline_generator.rs is getting big which is true, however it's also very regular and predictable, or even repetitive in its structure so I prefer not to break it down for now just yet.

I'm not saying though that we won't ever split it into a bunch of files at some point! When that happens though I'd like it to be done in a clean and symmetric way. :-)

Copy link
Member

@gkurz gkurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments but I don't see anything that would prevent merging, so let's go for it !

Thanks @pmores !

cur_flags
))?;
new_flags.remove(fcntl::FdFlag::FD_CLOEXEC);
if let Err(err) = fcntl::fcntl(rawfd, fcntl::FcntlArg::F_SETFD(new_flags)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth to note that FD_CLOEXEC has been the only flag forever and I'm not sure it will ever change 😉 but this is indeed how things should done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this might look a bit pedantic. Still, not everybody is necessarily aware of how many possible flags there are (I know I wasn't until fairly recently ;-)), so I considered it better to state the intent directly, for the benefit of those readers if for no other reason.

impl DeviceVirtioNet {
fn new(netdev_id: &str, mac_address: Address) -> DeviceVirtioNet {
DeviceVirtioNet {
device_driver: "virtio-net-pci".to_owned(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this only works for PCI based platforms, i.e. x86_64 and POWER. s390x would need "virtio-net-ccw" here. DeviceVirtioNet should be added a bus_type: VirtioBusType field just like DeviceVirtioBlk for this to work.

I'm not asking this to be addressed now since this PR claims to be pure refactoring.

Cc @BbolroC

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. This is unfortunately how much of this kind of work has to be done though, now and in the future, since I can't really develop support for archs I have no access to and thus cannot test. Exactly as you write, this is partial support for PCI only, to be completed later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: I can see the following log from go-runtime:

-device driver=virtio-net-ccw,netdev=network-0,mac=d6:f0:4c:10:8d:65,mq=on,devno=fe.0.0006

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there isn't an actual scenario that mandates to have a PCI device on the QEMU command line for s390x, I'd rather not support PCI at all there. @BbolroC can hopefully shed some light on what is needed for Z.

};
if nested {
virtio_net_device.set_disable_modern(true);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kinda invalidates the "No changes in actual behaviour are foreseen, this is a pure refactor." claim in the PR description but since this is a valuable fix and this is still experimental code, I guess it's okay to hijack the PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I didn't realise that.

@@ -65,15 +64,11 @@ impl QemuInner {
info!(sl!(), "Starting QEMU VM");
let netns = self.netns.clone().unwrap_or_default();

// CAUTION: since 'cmdline' contains file descriptors that have to stay
// open until spawn() is called to launch qemu later in this function,
// 'cmdline' has to live at least until spawn() is called
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would obviously be the case if ownership goes to the command variable using https://crates.io/crates/command-fds and this comment wouldn't be needed.

Possible improvement in a follow-up PR. In anycase, the change does the job.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frankly, I only added the comment because I thought you'd appreciate it. ;-)

This was discussed off-line so just to summarise my stance for the public record: for this to become a problem, someone would need to drop cmdline and before doing so, store the cmdline's output somewhere. I don't quite see how this could happen inadvertently, seeing as someone would need to work to make the problem happen.

Also, if that did actually happen somehow, the problem would be very conspicuous since it would break everything (among other problems, it would prevent runtime-rs from talking with kata-agent so launch would fail immediately and consistently).

All in all, if I'm wrong and this problem actually materialises I'll be very curious to know how it happened.

@gkurz
Copy link
Member

gkurz commented Apr 16, 2024

/test

@gkurz gkurz merged commit aca6a1b into kata-containers:main Apr 16, 2024
305 of 311 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test size/huge Largest and most complex task (probably needs breaking into small pieces)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants