Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Online VM Snapshot with guest agent #5942

Merged
merged 14 commits into from Jul 22, 2021

Conversation

ShellyKa13
Copy link
Contributor

@ShellyKa13 ShellyKa13 commented Jun 28, 2021

What this PR does / why we need it:
This PR integrates guest agent with online vm snapshot.
In case guest agent is available and the snapshot is taken when the VM is running the guest fs will be frozen before the snapshot
is taken an unfrozen after.
The freeze/unfreeze api is exposed via virt handler endpoint which tunnels the command to the virt launcher which sends it using the guest agent to the guest.
*working on adding some functional tests

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Release note:

Integrate guest agent to online VM snapshot

@kubevirt-bot kubevirt-bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Jun 28, 2021
Copy link
Member

@mhenriks mhenriks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far so good! Great start!

if err != nil {
return 0, err
}
err = ctrl.freezeGuestFSIfNeeded(vmSnapshot)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this block or take a while? If so, we should probably increase controller goroutines or spawn a dedicated goroutine or have a pool

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, supposedly the call is immediate, sends the command which should happen and return, I'm not sure if with busy fs it might take more time. I don't think we should call the freeze, continue with other snapshots and return to this snapshot when it completes, but i'm not sure.. how can we check? what do you think is best?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess let's not get too complicated now. I think it would be wise to increase snapshot controller threads here: https://github.com/kubevirt/kubevirt/blob/main/pkg/virt-controller/watch/application.go#L430. Maybe double?

Should also think about a way to log/track how long that call is taking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok sure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhenriks I fixed all the other comments and add two commits one for functional tests and another for the increase of the number of the threads and added logging of time for the freeze/unfreeze, from what I saw the freeze unfreeze of vms in the tests took around 100ms to freeze and 25ms to unfreeze

@@ -559,6 +583,13 @@ func (ctrl *VMSnapshotController) getVolumeSnapshotClass(storageClassName string
return "", fmt.Errorf("%d matching VolumeSnapshotClasses for %s", len(matches), storageClassName)
}

func (ctrl *VMSnapshotController) updateVMSnapshotError(vmSnapshot *snapshotv1.VirtualMachineSnapshot, reason string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this return an error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I havn't returned here and error cause I figured it is being called when there was already an error, so if we fail to updated the error on the vmsnapshot it just another error which we have nothing to do but to update on the vmsnapshot and return it, so we already return another probably more important error

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caller can always choose to ignore the error returned. Or just log it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok sounds good

@@ -821,13 +926,32 @@ func checkVMRunning(vm *kubevirtv1.VirtualMachine) (bool, error) {
return rs != kubevirtv1.RunStrategyHalted, nil
}

func (ctrl *VMSnapshotController) checkVMIRunning(vm *kubevirtv1.VirtualMachine) (bool, error) {
func (ctrl *VMSnapshotController) getVMSnapshot(vmSnapshotContent *snapshotv1.VirtualMachineSnapshotContent) (*snapshotv1.VirtualMachineSnapshot, error) {
vmSnapshotName := *vmSnapshotContent.Spec.VirtualMachineSnapshotName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nil check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok will add a check

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 30, 2021
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 30, 2021
@ShellyKa13
Copy link
Contributor Author

/retest

1 similar comment
@ShellyKa13
Copy link
Contributor Author

/retest

vmName := vmSnapshot.Spec.Source.Name
log.Log.V(3).Infof("Freezing vm %s file system before taking the snapshot", vmName)

defer timeTrack(time.Now(), "Freezing vm")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this time span will also include updateVMSnapshotError or replaceGuestAgentIndication

Copy link
Contributor Author

@ShellyKa13 ShellyKa13 Jul 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know.. didnt think its a problem, but I changed it to only wrap the freeze call

@ShellyKa13
Copy link
Contributor Author

/retest

@@ -81,7 +81,7 @@ const (

ephemeralDiskDir = virtShareDir + "-ephemeral-disks"

defaultControllerThreads = 3
defaultControllerThreads = 6
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's only increase threads for the snapshot controller.


})

It("should restore a vm from an online snapshot with guest agent", func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about another case where we snapshot/restore boot disk?

Copy link
Member

@rmohr rmohr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to see the progress on the PR. I only looked at the PR superficial, but it looks to me that the VMI does not indicate when the guest agent freeze command was invoked. Can it be that this is missing?

If so, can there be cases where e.g. the snapshot controller has fundamental issues, and VMs are therefore not unfrozen, but we don't see it in the VMI? With other words, would the VMI look perfectly healthy from the outside? Also, would it make sense to indicate on the VMI in general somehow that currently a snapshot is invoked which may have impact on the functionality?

@ShellyKa13
Copy link
Contributor Author

@rmohr In a previous PR that was already merged there is a separate mechanism in the virt-handler and launcher which polls for the guest fsfreeze status, so in case the fs is frozen it will mark it on the vmi object and once unfrozen it will remove the frozen label.
Also there is a mark that snapshot is in progress.
Furthermore one of my next tasks is to create a mechanism that incase there is something wrong with the snapshot controller and the vm is froazen for more the X time then the vm will be unfrozen by the virt handler.

@rmohr
Copy link
Member

rmohr commented Jul 8, 2021

@rmohr In a previous PR that was already merged there is a separate mechanism in the virt-handler and launcher which polls for the guest fsfreeze status, so in case the fs is frozen it will mark it on the vmi object and once unfrozen it will remove the frozen label.
Also there is a mark that snapshot is in progress.
Furthermore one of my next tasks is to create a mechanism that incase there is something wrong with the snapshot controller and the vm is froazen for more the X time then the vm will be unfrozen by the virt handler.

I see thanks. You probably talk about #5777. Can you please add dedicated e2e tests just for freeze, since there were no tests added in 5777? @mhenriks FYI.

@ShellyKa13
Copy link
Contributor Author

@rmohr in the PR #5777 I only added the update of fsfreezestatus to the vmi object without any actual action of freeze, thats why I only added UT. In this PR I added the integration of the guest agent to the vm snapshot. In this PR I added tests for online snapshot with guest agent which freeze and unfreeze the vm during the snapshot. I can write in a different PR a test that runs freeze and unfreeze inside the virt-launcher pod using the guest agent and only check that the vmi object fsfreeze status was updated but I thought the UT in the previous PR basically covered all the parts in this test.

@ShellyKa13 ShellyKa13 force-pushed the snapshot-with-guest-agent branch 4 times, most recently from 20d41ef to 3dd01c8 Compare July 13, 2021 16:15
@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jul 15, 2021
@kubevirt-bot kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Jul 20, 2021
The command is exposed in virt handler which tunnels it
through to the virt launcher to be executed via guest
agent to the guest

Signed-off-by: Shelly Kagan <skagan@redhat.com>
This will allow calling freeze and unfreeze to a
specific vmi.

Signed-off-by: Shelly Kagan <skagan@redhat.com>
Signed-off-by: Shelly Kagan <skagan@redhat.com>
Update the indications when initializing the
snapshot to be able to check these indications during
the snapshot process.

Signed-off-by: Shelly Kagan <skagan@redhat.com>
before snapshoting the volumes of the vm we check if
guest agent is available, if so we freeze the guest
fs before we take the snapshots.
after completing the volumes snapshot we unfreeze
the fs.

Signed-off-by: Shelly Kagan <skagan@redhat.com>
If the snapsho is online then we dont logout and we dont
shut down the vm so no need to remount.

Signed-off-by: Shelly Kagan <skagan@redhat.com>
Signed-off-by: Shelly Kagan <skagan@redhat.com>
Signed-off-by: Shelly Kagan <skagan@redhat.com>
The condition will mark when the source of the
vmsnapshot is frozen and then that it was thawed.
It will prevent from unfreezing the source all the
time that the snapshot is already completed.

Signed-off-by: Shelly Kagan <skagan@redhat.com>
@mhenriks
Copy link
Member

/retest

Signed-off-by: Shelly Kagan <skagan@redhat.com>
In the new location we make sure we unfreeze when the vm
snapshot is no longer in progress, wheteher we got error,
if it was deleted or completed as expected

Signed-off-by: Shelly Kagan <skagan@redhat.com>
The test makes sure that even if the snapshot was deleted
before it was completed the vm will still get unfrozen.
The test also checks that the vm FSFreezeStatus is being
updated to freeze un unfreeze.

Signed-off-by: Shelly Kagan <skagan@redhat.com>
When vmsnapshot is deleting during progress it markes
the vmsnapshot with error, in such case we dont want to continue
and create the volume snapshots.

Signed-off-by: Shelly Kagan <skagan@redhat.com>
Signed-off-by: Shelly Kagan <skagan@redhat.com>
@ShellyKa13
Copy link
Contributor Author

/retest

2 similar comments
@ShellyKa13
Copy link
Contributor Author

/retest

@ShellyKa13
Copy link
Contributor Author

/retest

@mhenriks
Copy link
Member

/lgtm
/approve

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2021
@kubevirt-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mhenriks

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 22, 2021
@kubevirt-bot kubevirt-bot merged commit d32b8f9 into kubevirt:main Jul 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants