versions: Upgrade to Cloud Hypervisor v39.0 #9575
Conversation
/test
@jodh-intel As we discussed offline, let's move Cloud Hypervisor to the latest release so that it will have the fixes for several security advisories from its dependencies. We will make sure Cloud Hypervisor related CI workers pass for both the golang and rust runtimes, particularly the ones recently added from #9525. Meanwhile, we will follow up with additional PRs to tackle #9522.
Thanks @likebreath. Fwics, the API changes are simply additions and since the rust definition of CH's Speaking of defaults, I did notice one potential issue though:
However, after sorting the files, I did notice a couple of differences:
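One way to surface these differences is to diff the generated files after sorting their lines, so that the generator's reordering is ignored and only real additions or removals show up. A minimal sketch in Python (the struct contents below are made-up placeholders, not the actual generated client code):

```python
# Hypothetical helper: compare two generated OpenAPI client files after
# sorting their lines, so pure reordering noise from openapi-generator
# is ignored and only real additions/removals are reported.

def sorted_diff(old: str, new: str) -> list[str]:
    """Return lines that appear in only one of the two (sorted) files."""
    old_lines = sorted(old.splitlines())
    new_lines = sorted(new.splitlines())
    removed = [l for l in old_lines if l not in new_lines]
    added = [l for l in new_lines if l not in old_lines]
    return [f"- {l}" for l in removed] + [f"+ {l}" for l in added]

if __name__ == "__main__":
    # Placeholder file contents, not the real v36/v39 client code.
    v36 = "type VmConfig struct {\n    Cpus CpusConfig\n}"
    v39 = ("type VmConfig struct {\n    Cpus CpusConfig\n"
           "    NewField NewFieldConfig\n}")
    for line in sorted_diff(v36, v39):
        print(line)  # only the added placeholder field is reported
```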
I agree. We can definitely take "lazy evaluation" approach here to add them only when they are needed. This also keeps the
These are good catches. I took a closer look at them and found they are caused by inconsistencies in the openapi spec in Cloud Hypervisor, e.g. default values do not make sense for required fields. Thanks for reporting it. Just sent a PR to fix it from the CH side: cloud-hypervisor/cloud-hypervisor#6431.
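For reference, the kind of spec inconsistency being described looks like this (an illustrative fragment, not the actual Cloud Hypervisor spec): a field marked `required` must always be supplied by the client, so giving it a `default` is contradictory.

```yaml
components:
  schemas:
    DiskConfig:
      required:
        - path          # the client must always send this field...
      properties:
        path:
          type: string
          default: ""   # ...so a default here is meaningless/contradictory
```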
@jodh-intel @amshinde While most CI workers are working fine (including ones on runtime-rs), the set of tests e.g.
Logs: https://github.com/kata-containers/kata-containers/actions/runs/8899253970/job/24474391384?pr=9575
@likebreath , sharing this just in case you don't know this already: the cbl-mariner Host VMs are using /dev/mshv instead of /dev/kvm. Let me know in case you need help with these Hosts.
@danmihai1 That's very helpful context. Given these tests are constantly failing, we need to look into them manually. Are you able to help with setting up an environment to reproduce the errors? Or is there a better person (say the maintainer of these CI workers) I should talk with? Thank you.
@likebreath , these are AKS Nodes/Kata Hosts that CI is dynamically provisioning. So, setting up one of them is not trivial. I will add below my hacky script to simulate what Kata CI is doing - just in case anyone is brave enough to try these steps. But, either I or @sprt should be able to try your changes in our clusters if you want. On the CI cluster that I have just created, it appears that the problem took place during the cl.BootVM() call:
These were the hacky steps I used to simulate the Kata CI set-up:
Get your cluster name from your Azure Resource Group UI. It looks similar to: k8s-9566-f04a7a55ed9b-clh-cbl-mariner-amd64-n-o Then:
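For anyone brave enough to try, getting a host shell on one of these AKS nodes typically looks something like the following sketch (resource group, cluster, and node names are placeholders; the actual CI scripts differ):

```shell
# Assumptions: Azure CLI and kubectl are installed; names are placeholders.
az aks get-credentials --resource-group <your-rg> \
    --name k8s-9566-f04a7a55ed9b-clh-cbl-mariner-amd64-n-o

# Pick a node and open a privileged debug shell on the host.
kubectl get nodes
kubectl debug node/<node-name> -it \
    --image=mcr.microsoft.com/cbl-mariner/base/core:2.0 -- chroot /host bash
```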
@danmihai1 Thank you for reproducing it locally. This is a crucial step. It suggests that the VM created by Cloud Hypervisor failed to boot due to an invalid Since I don't have access to the reproducing environment, would you or @sprt please help with debugging? First, I would check whether this error is reproducible with Cloud Hypervisor v39.0 by itself (e.g. without Kata involved). You can simply try to launch a VM with Cloud Hypervisor on the CI environment. In case you need a reference, take a look at this: https://github.com/cloud-hypervisor/cloud-hypervisor?tab=readme-ov-file#firmware-booting If you can, it will be much easier to bisect and locate the breaking commit for further investigation outside of Kata's environment. Otherwise, we will need to bisect with Kata. This is still doable, as I think the Cloud Hypervisor binary is easily swappable when using Kata to reproduce the failed tests. Since this is likely an issue concerning mshv support in Cloud Hypervisor, could any of you help with the debugging here too? @liuw @russell-islam @jinankjain
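A standalone firmware boot along the lines of the CH README would look roughly like this (all paths and images are placeholders); if it reproduces the failure, a bisect between the known-good and known-bad tags can follow:

```shell
# Direct firmware boot of Cloud Hypervisor, independent of Kata
# (binary, firmware, and disk image paths are placeholders).
./cloud-hypervisor \
    --kernel ./hypervisor-fw \
    --disk path=focal-server-cloudimg-amd64.raw \
    --cpus boot=4 \
    --memory size=1024M \
    --net "tap=,mac=,ip=,mask=" \
    --serial tty \
    --console off

# If the boot failure reproduces, bisect in the cloud-hypervisor tree:
git bisect start v39.0 v36.0   # <bad> <good>
```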
@danmihai1 is this blocking? Perhaps we should upgrade the test host? The breakage could be due to our recent changes in the kernel IOCTL interfaces. CC @NunoDasNeves
@liuw @NunoDasNeves yes, it is blocking. I will ping you tomorrow to make a plan together.
@danmihai1 yes it is probably due to cloud-hypervisor/cloud-hypervisor#6339
The CH v39 upgrade in kata-containers#9575 is currently blocked because of a bug in the Mariner host kernel. To address this, we temporarily tweak the Mariner CI to use an Ubuntu host and the Kata guest kernel, while retaining the Mariner initrd. This is tracked in kata-containers#9594. Importantly, this allows us to preserve CI for genpolicy. We had to tweak the default rules.rego however, as the OCI version is now different in the Ubuntu host. This is tracked in kata-containers#9593. This change has been tested together with CH v39 in kata-containers#9588. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
Hi @likebreath, I implemented the workaround in #9592 - once that gets merged we'll be able to proceed here. We're adapting the Mariner CI to unblock the CH v39 upgrade; I've tested that that change works with CH v39.
That's great to know. Thank you for the quick fix. |
As the @sprt @danmihai1 Does the
@likebreath Once #9592 is merged, you'll have to rebase this PR on top of it, as right now this PR is still testing code without the fix. The Note there's a failing Jenkins test that's currently blocking #9592 and other PRs; looking into it.
Understood. I will rebase once your PR has landed.
Yes, I saw that too. The @sprt I am assuming you mean the
@likebreath Could you rebase your PR on top of main? The Mariner fix was just merged.
I looked into the The real reason for the
This error is reported while hot-plugging a VFIO device to Cloud Hypervisor [1]. We need to reproduce and debug this offline, as we need to collect more logs to understand the issue further. Any good points of contact for setting up a reproducing environment for the
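When reproducing, the equivalent hot-plug can be driven manually with `ch-remote` against the VM's API socket, which makes it easier to capture logs outside the full Kata flow (the socket path, PCI address, and device id below are placeholders):

```shell
# Hot-plug a VFIO device into a running Cloud Hypervisor VM
# (socket path, PCI BDF, and id are placeholders).
ch-remote --api-socket /run/vc/vm/<sandbox-id>/clh-api.sock \
    add-device path=/sys/bus/pci/devices/0000:01:00.0/,id=vfio0
```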
This patch upgrades Cloud Hypervisor to v39.0 from v36.0, which contains fixes for several security advisories from dependencies. Details can be found in kata-containers#9574. Fixes: kata-containers#8694, kata-containers#9574 Signed-off-by: Bo Chen <chen.bo@intel.com>
This patch re-generates the client code for Cloud Hypervisor v39.0. Note: The client code of cloud-hypervisor's OpenAPI is automatically generated by openapi-generator. Fixes: kata-containers#8694, kata-containers#9574 Signed-off-by: Bo Chen <chen.bo@intel.com>
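The Kata tree drives this regeneration through a helper script; conceptually it is an openapi-generator run along these lines (the input spec and output paths are illustrative, and the repo script remains the canonical reference):

```shell
# Regenerate the Go client from Cloud Hypervisor's OpenAPI spec
# (paths are illustrative; use the script in the Kata repo for real runs).
openapi-generator-cli generate \
    -i cloud-hypervisor.yaml \
    -g go \
    -o src/runtime/virtcontainers/pkg/cloud-hypervisor/client
```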
I noticed that the Since the vfio-clh test was the only blocker (which is now lifted), this PR now passes all CLH-related workers and is ready to be landed. Note: I went through the failed workers and I don't think any of them are related to the changes here.
thanks @likebreath