Add VMI backend storage and persistent TPM support #8156
Conversation
Skipping CI for Draft Pull Request.
/test pull-kubevirt-e2e-k8s-1.24-sig-compute
Force-pushed from 4352a2c to a079ca9
/test pull-kubevirt-e2e-k8s-1.24-sig-compute
Force-pushed from a079ca9 to 49820b5
/test pull-kubevirt-e2e-k8s-1.24-sig-compute
/test pull-kubevirt-e2e-k8s-1.24-sig-storage
Force-pushed from 4fadc5b to 4297728
Force-pushed from 4297728 to 6b2f799
Force-pushed from 6b2f799 to cdc94b3
Great work Jed.
I have a few questions, mainly about live migration.
}
pvc := &v1.PersistentVolumeClaim{
	ObjectMeta: metav1.ObjectMeta{
		Name: PVCPrefix + vmi.Name,
Will this be deleted together with the VM object?
We want it to, yes.
I set its owner to the same value(s) as the VMI's owner; not sure if that's enough, I need to add a test for it.
That turned out to no longer be the case. I just pushed a commit that sets the PVC owner to the VM object (or whatever the VMI's owner is).
For discrete VMIs (i.e. not tied to a VM), I decided to set the owner of the PVC to the VMI itself, which effectively defeats persistence but also avoids garbage PVCs and security concerns.
PTAL!
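For reference, a minimal sketch of what that ownership wiring could look like, assuming a hypothetical setPVCOwner helper (the fallback branch is my reading of the comment above, not code taken from this PR):

```go
package backendstorage

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	virtv1 "kubevirt.io/api/core/v1"
)

// setPVCOwner ties the PVC's lifetime to whatever owns the VMI (typically a VM),
// falling back to the VMI itself for "discrete" VMIs that have no owner.
func setPVCOwner(pvc *v1.PersistentVolumeClaim, vmi *virtv1.VirtualMachineInstance) {
	if len(vmi.OwnerReferences) > 0 {
		// Usually a single controller reference to the owning VM object.
		pvc.OwnerReferences = vmi.OwnerReferences
		return
	}
	// No owner: the PVC disappears with the VMI, trading persistence for
	// automatic garbage collection of the PVC.
	pvc.OwnerReferences = []metav1.OwnerReference{
		*metav1.NewControllerRef(vmi, virtv1.VirtualMachineInstanceGroupVersionKind),
	}
}
```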
// BackendStorageClass is the name of the storage class to use for the PVCs created to preserve VM state, like TPM.
// The storage class must support RWX in filesystem mode.
BackendStorageClass string `json:"backendStorageClass,omitempty"`
To me BackendStorageClass sounds too generic and needs some kind of prefix to explain what it's meant for.
I agree. I've had a hard time naming this thing, suggestions are welcome!
Maybe just VMStateStorageClass?
Done
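Presumably the renamed field then reads along these lines (the exact JSON tag is an assumption, not taken from the final diff):

```go
// VMStateStorageClass is the name of the storage class to use for the PVCs created to preserve VM state, like TPM.
// The storage class must support RWX in filesystem mode.
VMStateStorageClass string `json:"vmStateStorageClass,omitempty"`
```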
@@ -843,7 +846,11 @@ func (c *MigrationController) handleTargetPodCreation(key string, migration *vir
			return nil
		}
	}
	return c.createTargetPod(migration, vmi, sourcePod)
	backendStoragePVC, err := backendstorage.CreateIfNeeded(vmi, c.clusterConfig, c.clientset, backendstorage.IsMigration)
I'm a bit lost. Could you please explain why we need to do this?
I mean, why can't we simply skip this volume during migration and re-use the original on the destination? (Same as we do with other volumes.)
That would be nice, but I don't think we can.
For volumes used by the VM it's not a problem, since the VM only ever runs in one place.
However, in this case, the contents of the PVC would be in use in 2 places at once (source container and target container).
In the case of swtpm, we could end up with corrupt data. In fact, swtpm takes a file-based lock to prevent exactly that, so the migration actually fails when using the original on the destination.
> However, in this case, the contents of the PVC would be in use in 2 places at once (source container and target container).
How can this be in use in two places? During the migration, the target domain is paused and will only be unpaused when the source is dead.
Why would this case be any different?
> In the case of swtpm, we could end up with corrupt data. In fact, swtpm takes a file-based lock to prevent exactly that, so the migration actually fails when using the original in the destination.
We would end up in this situation for any migration, but this is not happening because of what I've described above. Also, the source qemu normally locks its images so the destination won't write into them accidentally.
> However, in this case, the contents of the PVC would be in use in 2 places at once (source container and target container).
> How can this be in use in two places? During the migration, the target domain is paused and will only be unpaused when the source is dead. Why would this case be any different?
Yes, the target domain is paused but not the target pod! This case is different because the contents of the PVC are actually accessed by the pod itself, not the domain.
> In the case of swtpm, we could end up with corrupt data. In fact, swtpm takes a file-based lock to prevent exactly that, so the migration actually fails when using the original in the destination.
> We would end up in this situation for any migration, but this is not happening because of what I've described above. Also, source qemu normally locks so the destination won't write into it accidentally.
Yup, this again applies to the domain but not to its hosting pod.
> However, in this case, the contents of the PVC would be in use in 2 places at once (source container and target container).
> How can this be in use in two places? During the migration, the target domain is paused and will only be unpaused when the source is dead. Why would this case be any different?
> Yes, the target domain is paused but not the target pod! This case is different because the contents of the PVC are actually accessed by the pod itself, not the domain.
Can you please explain why the pod is accessing this content (especially if the domain is not running), given that this content belongs to the domain?
Filesystem volumes that we attach to VMIs should behave similarly: these volumes are mounted in the pod (compute container) and have a diskX.img file that is used by qemu.
It seems to me like a similar scenario. If not, can you please explain why not?
Ok I see your point now... Even when coming from the domain, most I/O operations ultimately happen in the backend. I guess a more accurate answer to your question is that qemu is equipped to handle cases where the same volume is mounted on both the source and the target, and swtpm is not... We could open an issue against swtpm and switch to using just 1 PVC if/when the issue gets resolved.
> Ok I see your point now... Even when coming from the domain, most I/O operations ultimately happen in the backend. I guess a more accurate answer to your question is that qemu is equipped to handle cases where the same volume is mounted on both the source and the target, and swtpm is not... We could open an issue against swtpm and switch to using just 1 PVC if/when the issue gets resolved.
Qemu is able to handle this mainly because the domain is paused and will not be unpaused until the source is destroyed. The same should happen with swtpm. What is the reason this service would generate any I/O if qemu, which is its only consumer, is paused?
Yes, great question. It almost feels like the lock swtpm has in place is there to block this very scenario. I think we should ask about it; I'll open an issue.
I'm not sure which lock that would be, but if this code has been accepted into libvirt then I'm sure this volume should be treated as any other.
Honestly, I'd be against us developing an additional mechanism to handle PVCs; instead, this volume should be treated the same way as any other filesystem volume we attach to the guests.
This lock (inside virt-launcher/compute, running as non-root): /var/run/kubevirt-private/libvirt/qemu/swtpm/09d086e6-ba5d-4647-8f7d-c16d17ae20c0/tpm2/.lock
Please keep in mind that we do not attach this volume to the guest! Libvirt is not aware of it, and the libvirt TPM option does not have a way to provide a specific disk or even a location for storing the TPM state.
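To illustrate why two pods sharing one PVC would collide on that .lock file, here is a minimal, self-contained sketch of a non-blocking exclusive file lock in Go; note that swtpm's actual locking may be implemented differently (e.g. fcntl rather than flock), so this only models the behavior:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Hypothetical path standing in for swtpm's tpm2/.lock file.
	f, err := os.OpenFile("/tmp/tpm2.lock", os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Try to take an exclusive lock without blocking. A second process doing
	// the same (e.g. the target virt-launcher during a migration) would get
	// EWOULDBLOCK here, which is roughly how a shared backend PVC makes the
	// migration fail.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		fmt.Println("lock already held elsewhere:", err)
		return
	}
	fmt.Println("lock acquired")
}
```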
@@ -65,11 +77,63 @@ func CreateIfNeeded(vmi *corev1.VirtualMachineInstance, clusterConfig *virtconfi
		VolumeMode: &modeFile,
	},
}
// Whatever owns the VMI (i.e. a VM) now owns the PVC
When the owner is deleted, do we also need to delete the PVC?
Yes. I'm not sure how this mechanism works, I'll look into it.
The PVC indeed gets deleted with the VM, I added a test for it.
Ok, what happens if we simply create a VMI? This doesn't have an owner reference. What happens to the PVC in this case?
In that case the PVC stays, since we don't know if/when that VMI will be started again...
The alternative is to prevent the creation of VMIs with persistent settings enabled.
Potentially yes. Same thing if a VM is stopped and you create a VMI with its name.
However:
- If the libvirt UUID is different, the existing TPM will be wiped out and a blank one will be created
- If the original VMI took ownership of the TPM, the new VMI will need the administrator password to do anything on it
- If the secrets were sealed against a set of PCRs that are unique to the original VMI, the new one will not be able to see them
- Most importantly, this should be seen as a convenience feature rather than a security one. The whole premise of TPM is the hardware vault protecting its flash memory...
I see. Thanks for the clarification. So:
> If the libvirt UUID is different, the existing TPM will be wiped out and a blank one will be created
This basically means that there is no point in allowing persistent TPM settings for VMIs, since the UUID will be different every time, am I right?
And BTW, just curious: I never checked that, but if I create a VM, then each time I start/stop it the UUID will be the same, right?
This is something I've been wondering about and didn't find a proper answer to; I should probably dive into the libvirt code.
Observations showed that creating/deleting a VMI from the same yaml file multiple times does produce the same UUID. Same for VM start/stop.
Maybe the UUID is based on a hash of the XML spec?
I thought a new UUID was regenerated each time. Interesting...
Indeed, the libvirt docs say a random UUID is generated for each new domain (and every boot of a VM/VMI effectively creates a new domain).
However, I found this, though it is specific to VMs:
https://github.com/kubevirt/kubevirt/blob/main/pkg/virt-controller/watch/vm.go#L1395
I just tried to verify my claim that VMIs keep getting the same UUID and actually got different values across reboots! Not sure what changed or maybe my initial observation was bogus...
However, even for VMIs we do request a specific UUID in the domain XML, so we have control over it. Good to know in case we want to ensure it's always the same value or something.
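For completeness, a stable UUID can be derived deterministically from a name; a sketch using github.com/google/uuid (the namespace below is the standard RFC 4122 DNS namespace, chosen here purely for illustration, not what KubeVirt actually uses):

```go
package main

import (
	"fmt"

	"github.com/google/uuid"
)

func main() {
	// Any fixed namespace UUID works, as long as it never changes.
	ns := uuid.MustParse("6ba7b810-9dad-11d1-80b4-00c04fd430c8")
	// The same input name always yields the same UUID, across reboots
	// and re-creations of the domain.
	stable := uuid.NewSHA1(ns, []byte("my-vm-name"))
	fmt.Println(stable)
}
```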
/hold
Force-pushed from cdc94b3 to a78867f
Added as a new commit.
Force-pushed from a6e2435 to 08d558f
So, cephfs does support snapshots; I was able to test my rough implementation and it didn't quite work. I have a couple of ideas on how to fix that, but I definitely think it should be done as a separate effort. @rmohr WDYT?
/cc
Thanks @jean-edouard! It looks great overall. I added some comments below.
Let me know what you think about them. Thanks :)
tests/vmi_tpm_test.go
Outdated
@@ -85,3 +101,182 @@ var _ = Describe("[sig-compute]vTPM", decorators.SigCompute, func() {
	})
})
})

var _ = Describe("[sig-storage]vTPM", func() {
Please add decorators.SigStorage too, otherwise these tests are not covered in the storage lane. Suggested change:
- var _ = Describe("[sig-storage]vTPM", func() {
+ var _ = Describe("[sig-storage]vTPM", decorators.SigStorage, func() {
@@ -0,0 +1,98 @@
package backendstorage
Missing header
if util.IsNonRootVMI(vmi) {
	// For non-root VMIs, the TPM state lives under /var/run/kubevirt-private/libvirt/qemu/swtpm
	// To persist it, we need the persistent PVC to be mounted under that location.
	// /var/run/kubevirt-private is an emptyDir, and k8s wwould automatically create the right sub-directories under it.
mega nit (typo): wwould
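For context, mounting the persistent PVC below that emptyDir location would typically use a subPath-style mount; a rough sketch with the core v1 API (the mount path comes from the comment above, while the volume and subPath names are assumptions):

```go
package example

import k8sv1 "k8s.io/api/core/v1"

// Hypothetical wiring for the non-root case: mount a sub-directory of the
// backend-storage PVC over the directory where swtpm keeps its state.
var tpmStateMount = k8sv1.VolumeMount{
	Name:      "vm-state", // assumed name of the volume backed by the PVC
	MountPath: "/var/run/kubevirt-private/libvirt/qemu/swtpm",
	SubPath:   "swtpm", // assumed sub-directory inside the PVC
}
```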
@@ -330,6 +332,61 @@ func withAccessCredentials(accessCredentials []v1.AccessCredential) VolumeRender
	}
}

func withTPM(vmi *v1.VirtualMachineInstance) VolumeRendererOption {
can we add a unit test for this?
storageClass, exists := libstorage.GetRWXFileSystemStorageClass()
Expect(exists).To(BeTrue(), "No RWX FS storage class found")
There is an inconsistent behavior between this Describe and the subsequent It: here we fail if the storageClass does not exist, while later we silently Skip the test. Maybe these can converge to the same behavior inside the BeforeEach block. Thanks
In the BeforeEach you could also directly adjust the KubevirtCR (marking the whole Context as Serial), and the vmi itself.
tests/vmi_tpm_test.go
Outdated
if op1 == "migrate" { | ||
migrateVMI(vmi) | ||
} else if op1 == "restart" { | ||
restartVM(vm) | ||
} | ||
|
||
checkTPM(vmi) | ||
|
||
if op2 == "migrate" { | ||
migrateVMI(vmi) | ||
} else if op2 == "restart" { | ||
restartVM(vm) | ||
} | ||
|
||
checkTPM(vmi) |
Instead of op1 and op2 you can take ops ...string and then loop through it with:
for _, op := range ops {
	switch op {
	case "migrate":
		migrateVMI(vmi)
	case "restart":
		restartVM(vm)
	}
	checkTPM(vmi)
}
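With a variadic signature, a call site could then chain arbitrary sequences, e.g. (helper name hypothetical):

```go
checkTPMPersistence(vm, vmi, "restart", "migrate")
```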
migrateVMI := func(vmi *v1.VirtualMachineInstance) {
	By("Migrating the VMI")
	checks.SkipIfMigrationIsNotPossible()
this Skip can be done at the beginning of the test to save some time and resources
This is actually a trick to still run the beginning of the test, and fail on error, even on clusters where migration is not possible.
So here the restart + migrate test will still test restart before getting skipped.
Got it! Thanks for the explanation 👍
err = virtClient.VirtualMachine(vm.Namespace).Stop(context.Background(), vm.Name, &v1.StopOptions{})
ExpectWithOffset(1, err).ToNot(HaveOccurred())
EventuallyWithOffset(1, func() error {
	_, err = virtClient.VirtualMachineInstance(vm.Namespace).Get(context.Background(), vm.Name, &k8smetav1.GetOptions{})
	return err
}, 300*time.Second, 1*time.Second).ShouldNot(Succeed())

By("Starting the VM")
err = virtClient.VirtualMachine(vm.Namespace).Start(context.Background(), vm.Name, &v1.StartOptions{})
Can I ask why you didn't use .Restart()? Thanks
It's important here to ensure that the virt-launcher pod and the VMI object both get destroyed, to properly test the persistence of the data. I would be worried that .Restart() might do some kind of soft reboot, anywhere from keeping the same pod to staying on the same node.
Agreed! In this case maybe you can rename the function to stopAndStart 🙂 But feel free to ignore this
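Such a stopAndStart helper would simply wrap the calls already shown above; a minimal sketch reusing the exact client calls from the test (surrounding test context assumed):

```go
// Stop the VM, wait for the VMI to disappear, then start it again so a
// brand new virt-launcher pod gets created.
stopAndStart := func(vm *v1.VirtualMachine) {
	By("Stopping the VM")
	err := virtClient.VirtualMachine(vm.Namespace).Stop(context.Background(), vm.Name, &v1.StopOptions{})
	Expect(err).ToNot(HaveOccurred())
	Eventually(func() error {
		_, err := virtClient.VirtualMachineInstance(vm.Namespace).Get(context.Background(), vm.Name, &k8smetav1.GetOptions{})
		return err
	}, 300*time.Second, 1*time.Second).ShouldNot(Succeed())

	By("Starting the VM")
	err = virtClient.VirtualMachine(vm.Namespace).Start(context.Background(), vm.Name, &v1.StartOptions{})
	Expect(err).ToNot(HaveOccurred())
}
```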
tests/vmi_tpm_test.go
Outdated
By("Ensuring the PVC gets deleted") | ||
Eventually(func() error { | ||
_, err = virtClient.VirtualMachine(vm.Namespace).Get(context.Background(), vm.Name, &k8smetav1.GetOptions{}) | ||
if !errors.IsNotFound(err) { | ||
return fmt.Errorf("VM %s not removed: %v", vm.Name, err) | ||
} | ||
_, err = virtClient.CoreV1().PersistentVolumeClaims(vm.Namespace).Get(context.Background(), backendstorage.PVCForVMI(vmi), k8smetav1.GetOptions{}) | ||
if !errors.IsNotFound(err) { | ||
return fmt.Errorf("PVC %s not removed: %v", backendstorage.PVCForVMI(vmi), err) | ||
} | ||
return nil | ||
}, 300*time.Second, 1*time.Second).ShouldNot(HaveOccurred()) |
Probably you can avoid this check since the subsequent test will ensure the same
Yes, sounds good to me. I am just also not sure, when looking into block volumes and maybe having pre-populated PVCs, whether this feature is ready yet for general consumption. I have no plans right now to go deeper into reviewing your work here, so feel free to continue without my approval, but maybe move this for now behind a feature gate until we have more concrete plans on the missing gaps, like snapshot/restore/block PVC/making the PVC name discoverable/...?
Force-pushed from 08d558f to c1da6d9
Thank you @fossedihelm for the great review!
/lgtm
/hold
@jean-edouard Put a hold to give you the opportunity to respond to comments.
Feel free to unhold.
Thanks so much for this great work
@@ -107,6 +107,7 @@ func AdjustKubeVirtResource() {
	virtconfig.VMExportGate,
	virtconfig.KubevirtSeccompProfile,
	virtconfig.HotplugNetworkIfacesGate,
	virtconfig.VMPersistentState,
Do we want to enable this fg for the whole test suite? Or do we want to restrict it to only the persistent vTPM tests?
I think it falls into the same category as the other FGs here, i.e. not all tests need them, but no test requires them to be absent.
There's also no reason for the FG to impact other tests, and if it ever does we want to know about it!
@fossedihelm please cancel the hold if you agree!
@vasiliy-ul heads-up, you approved this PR a while back, there has been a fair amount of changes since then.
@jean-edouard Agreed! I leave it to @vasiliy-ul to cancel the hold since a new review of it may be needed
/retest
Looks good.
/unhold
@jean-edouard: The following tests failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
What this PR does / why we need it:
This PR contains a first commit that implements a way to persist VM files on the backend side.
The second commit then uses that to add persistent TPM support.
The third commit modifies the behavior of the backend storage PVC to add support for migration.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #
Special notes for your reviewer:
This is better reviewed commit by commit. I could split it into 2 PRs, but could only open #2 after #1 is merged.
I mostly followed the design proposal: kubevirt/community#170
Release note: