New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Online VM Snapshot with guest agent #5942
Online VM Snapshot with guest agent #5942
Conversation
fd6dd62
to
8dc6793
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far so good! Great start!
if err != nil { | ||
return 0, err | ||
} | ||
err = ctrl.freezeGuestFSIfNeeded(vmSnapshot) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this block or take a while? If so, we should probably increase controller goroutines or spawn a dedicated goroutine or have a pool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure, supposedly the call is immediate, sends the command which should happen and return, I'm not sure if with busy fs it might take more time. I don't think we should call the freeze, continue with other snapshots and return to this snapshot when it completes, but i'm not sure.. how can we check? what do you think is best?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess let's not get too complicated now. I think it would be wise to increase snapshot controller threads here: https://github.com/kubevirt/kubevirt/blob/main/pkg/virt-controller/watch/application.go#L430. Maybe double?
Should also think about a way to log/track how long that call is taking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhenriks I fixed all the other comments and add two commits one for functional tests and another for the increase of the number of the threads and added logging of time for the freeze/unfreeze, from what I saw the freeze unfreeze of vms in the tests took around 100ms to freeze and 25ms to unfreeze
@@ -559,6 +583,13 @@ func (ctrl *VMSnapshotController) getVolumeSnapshotClass(storageClassName string | |||
return "", fmt.Errorf("%d matching VolumeSnapshotClasses for %s", len(matches), storageClassName) | |||
} | |||
|
|||
func (ctrl *VMSnapshotController) updateVMSnapshotError(vmSnapshot *snapshotv1.VirtualMachineSnapshot, reason string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this return an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I havn't returned here and error cause I figured it is being called when there was already an error, so if we fail to updated the error on the vmsnapshot it just another error which we have nothing to do but to update on the vmsnapshot and return it, so we already return another probably more important error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The caller can always choose to ignore the error returned. Or just log it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok sounds good
@@ -821,13 +926,32 @@ func checkVMRunning(vm *kubevirtv1.VirtualMachine) (bool, error) { | |||
return rs != kubevirtv1.RunStrategyHalted, nil | |||
} | |||
|
|||
func (ctrl *VMSnapshotController) checkVMIRunning(vm *kubevirtv1.VirtualMachine) (bool, error) { | |||
func (ctrl *VMSnapshotController) getVMSnapshot(vmSnapshotContent *snapshotv1.VirtualMachineSnapshotContent) (*snapshotv1.VirtualMachineSnapshot, error) { | |||
vmSnapshotName := *vmSnapshotContent.Spec.VirtualMachineSnapshotName |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nil check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok will add a check
8dc6793
to
cd528cb
Compare
cd528cb
to
bf36db8
Compare
/retest |
1 similar comment
/retest |
vmName := vmSnapshot.Spec.Source.Name | ||
log.Log.V(3).Infof("Freezing vm %s file system before taking the snapshot", vmName) | ||
|
||
defer timeTrack(time.Now(), "Freezing vm") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this time span will also include updateVMSnapshotError
or replaceGuestAgentIndication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know.. didnt think its a problem, but I changed it to only wrap the freeze call
bf36db8
to
02d84c6
Compare
/retest |
@@ -81,7 +81,7 @@ const ( | |||
|
|||
ephemeralDiskDir = virtShareDir + "-ephemeral-disks" | |||
|
|||
defaultControllerThreads = 3 | |||
defaultControllerThreads = 6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's only increase threads for the snapshot controller.
|
||
}) | ||
|
||
It("should restore a vm from an online snapshot with guest agent", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about another case where we snapshot/restore boot disk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great to see the progress on the PR. I only looked at the PR superficial, but it looks to me that the VMI does not indicate when the guest agent freeze command was invoked. Can it be that this is missing?
If so, can there be cases where e.g. the snapshot controller has fundamental issues, and VMs are therefore not unfrozen, but we don't see it in the VMI? With other words, would the VMI look perfectly healthy from the outside? Also, would it make sense to indicate on the VMI in general somehow that currently a snapshot is invoked which may have impact on the functionality?
@rmohr In a previous PR that was already merged there is a separate mechanism in the virt-handler and launcher which polls for the guest fsfreeze status, so in case the fs is frozen it will mark it on the vmi object and once unfrozen it will remove the frozen label. |
I see thanks. You probably talk about #5777. Can you please add dedicated e2e tests just for freeze, since there were no tests added in 5777? @mhenriks FYI. |
02d84c6
to
f2e047c
Compare
@rmohr in the PR #5777 I only added the update of fsfreezestatus to the vmi object without any actual action of freeze, thats why I only added UT. In this PR I added the integration of the guest agent to the vm snapshot. In this PR I added tests for online snapshot with guest agent which freeze and unfreeze the vm during the snapshot. I can write in a different PR a test that runs freeze and unfreeze inside the virt-launcher pod using the guest agent and only check that the vmi object fsfreeze status was updated but I thought the UT in the previous PR basically covered all the parts in this test. |
20d41ef
to
3dd01c8
Compare
febb1fb
to
66c9900
Compare
The command is exposed in virt handler which tunnels it through to the virt launcher to be executed via guest agent to the guest Signed-off-by: Shelly Kagan <skagan@redhat.com>
This will allow calling freeze and unfreeze to a specific vmi. Signed-off-by: Shelly Kagan <skagan@redhat.com>
Signed-off-by: Shelly Kagan <skagan@redhat.com>
Update the indications when initializing the snapshot to be able to check these indications during the snapshot process. Signed-off-by: Shelly Kagan <skagan@redhat.com>
before snapshoting the volumes of the vm we check if guest agent is available, if so we freeze the guest fs before we take the snapshots. after completing the volumes snapshot we unfreeze the fs. Signed-off-by: Shelly Kagan <skagan@redhat.com>
If the snapsho is online then we dont logout and we dont shut down the vm so no need to remount. Signed-off-by: Shelly Kagan <skagan@redhat.com>
Signed-off-by: Shelly Kagan <skagan@redhat.com>
Signed-off-by: Shelly Kagan <skagan@redhat.com>
The condition will mark when the source of the vmsnapshot is frozen and then that it was thawed. It will prevent from unfreezing the source all the time that the snapshot is already completed. Signed-off-by: Shelly Kagan <skagan@redhat.com>
66c9900
to
ebeb645
Compare
/retest |
Signed-off-by: Shelly Kagan <skagan@redhat.com>
In the new location we make sure we unfreeze when the vm snapshot is no longer in progress, wheteher we got error, if it was deleted or completed as expected Signed-off-by: Shelly Kagan <skagan@redhat.com>
The test makes sure that even if the snapshot was deleted before it was completed the vm will still get unfrozen. The test also checks that the vm FSFreezeStatus is being updated to freeze un unfreeze. Signed-off-by: Shelly Kagan <skagan@redhat.com>
When vmsnapshot is deleting during progress it markes the vmsnapshot with error, in such case we dont want to continue and create the volume snapshots. Signed-off-by: Shelly Kagan <skagan@redhat.com>
Signed-off-by: Shelly Kagan <skagan@redhat.com>
ebeb645
to
a217d50
Compare
/retest |
2 similar comments
/retest |
/retest |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mhenriks The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
This PR integrates guest agent with online vm snapshot.
In case guest agent is available and the snapshot is taken when the VM is running the guest fs will be frozen before the snapshot
is taken an unfrozen after.
The freeze/unfreeze api is exposed via virt handler endpoint which tunnels the command to the virt launcher which sends it using the guest agent to the guest.
*working on adding some functional tests
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #
Special notes for your reviewer:
Release note: