Don't log error if restic volume is empty #1480

Merged · 2 commits merged into vmware-tanzu:master from c-restic-log-error on May 30, 2019

Conversation

carlisia (Contributor)

Fixes #1266.

A couple of notes:

I thought this error would happen here: https://github.com/heptio/velero/blob/ee8f8ca1db5917b631547b5f59e4075caee66b20/pkg/controller/pod_volume_backup_controller.go#L233

But the execution path doesn't go through there.

I made small modifications to some error messages to make them distinct from one another. Let me know if it's best to leave them as they were.

Signed-off-by: Carlisia <carlisiac@vmware.com>

@skriss added this to the v1.x milestone on May 20, 2019
@skriss (Member) commented on May 20, 2019:

@carlisia it definitely looks to me like the error is occurring on L232-233 in the pod volume backup controller. Here's the error I see in the log of the restic pod that ran the backup:

time="2019-05-20T19:32:09Z" level=error msg="Error running command=restic backup --repo=azure:velero-rc1:/restic/nginx --password-file=/tmp/velero-restic-credentials-nginx253408481 --cache-dir=/scratch/.cache/restic . --tag=volume=nginx-logs --tag=backup=empty-dir --tag=backup-uid=f3568c2b-7b35-11e9-bf06-be4461881575 --tag=ns=nginx --tag=pod=nginx-deployment-f56465b95-tkbjr --tag=pod-uid=4167f363-7aa8-11e9-9e1c-d22329cbafb3 --host=velero, stdout=open repository\n, stderr=Fatal: unable to save snapshot: snapshot is empty\n" backup=velero/empty-dir controller=pod-volume-backup error="exit status 1" error.file="/go/src/github.com/heptio/velero/pkg/controller/pod_volume_backup_controller.go:232" error.function="github.com/heptio/velero/pkg/controller.(*podVolumeBackupController).processBackup" logSource="pkg/controller/pod_volume_backup_controller.go:232" name=empty-dir-gmr2d namespace=velero

Remember, the pod volume backup controller is not running in the main velero deployment pod; it's running in the restic daemonset.

So I think that's the appropriate place to check for this error, not in resource_backupper.go.

Additionally, we need to decide what to do with the PodVolumeBackup custom resource when this occurs. Right now it goes into a Failed phase; I think it would make more sense to have it go to Completed. However, we'd then have Completed PVBs that didn't have SnapshotIDs - would need to consider whether this is an issue or not. I'm not sure off the top of my head what the right thing is to do here, but please look into it and propose something.
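
For context, a minimal runnable sketch of the check being discussed: run the restic command, and if it fails with restic's "snapshot is empty" fatal error (as in the log above), treat the backup as successful but snapshot-less instead of logging an error. The function name runResticBackup and its wiring are illustrative only, not the merged controller code.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// runResticBackup runs a restic backup command and treats restic's
// "snapshot is empty" fatal error as a successful, snapshot-less backup.
// Sketch only: the real controller integrates this into the
// PodVolumeBackup processing loop with its own logging and phase helpers.
func runResticBackup(cmd *exec.Cmd) (emptySnapshot bool, err error) {
	var stderrBuf strings.Builder
	cmd.Stderr = &stderrBuf

	if runErr := cmd.Run(); runErr != nil {
		stderr := stderrBuf.String()
		// restic exits non-zero with "Fatal: unable to save snapshot:
		// snapshot is empty" when the volume contains no files; don't
		// log this as an error or fail the PodVolumeBackup.
		if strings.Contains(stderr, "snapshot is empty") {
			return true, nil
		}
		return false, fmt.Errorf("error running restic backup: %v, stderr=%s", runErr, stderr)
	}
	return false, nil
}

func main() {
	// Illustrative invocation; a real run needs --repo, credentials, etc.
	emptySnapshot, err := runResticBackup(exec.Command("restic", "backup", "."))
	fmt.Println(emptySnapshot, err)
}
```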

@carlisia (Contributor, Author)

I know why I didn't see the changes I had made at pod_volume_backup_controller.go:232: I was looking exclusively in the backup log, and there I did see the error that came from the resource_backupper. So this error shows up in the restic pod log. I'll fix it in the proper place then.

@carlisia (Contributor, Author)

As for the phase: this is not a failure, so that leaves us with Completed or PartiallyFailed. Since we will end up with PVBs that don't have SnapshotIDs, I wouldn't expect to see Completed. By process of elimination, that leaves PartiallyFailed, with a proper message indicating why. This is what I would do. WDYT?
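
For reference, a sketch of the PodVolumeBackup phase values in the velero v1 API of that era (reproduced for context; only PodVolumeBackupPhaseCompleted appears verbatim in the diff below):

```go
// Reference sketch of the PodVolumeBackup phases, per the velero v1 API
// of that era. Note there is no PartiallyFailed phase here.
package velerov1api

type PodVolumeBackupPhase string

const (
	PodVolumeBackupPhaseNew        PodVolumeBackupPhase = "New"
	PodVolumeBackupPhaseInProgress PodVolumeBackupPhase = "InProgress"
	PodVolumeBackupPhaseCompleted  PodVolumeBackupPhase = "Completed"
	PodVolumeBackupPhaseFailed     PodVolumeBackupPhase = "Failed"
)
```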

@skriss (Member) commented on May 21, 2019:

So, a couple of things:

  • PodVolumeBackups don't currently have a phase of PartiallyFailed; adding one would be a breaking change.
  • From the user's perspective, I would not consider this scenario to be a failure or partial failure at all - so I'd say the overall backup should end as Completed (assuming no other errors).
  • For the PodVolumeBackup, I think it probably makes sense to mark it as Completed, with no SnapshotID, and something in the status.message field indicating that the volume was empty. This is a non-breaking change that is still reasonably obvious to the user.
  • https://github.com/heptio/velero/blob/master/pkg/restic/backupper.go#L168-L176 is where the main velero deployment process waits for PVBs to complete and returns results. We'd need to update this to properly handle Completed PVBs with no snapshot ID (probably by removing them from the map that's returned; see the sketch after this list). Take a look at where this function is called - pkg/backup uses the returned results to write annotations to the pod, which are later used during restore to know which volumes need to be restored using restic. Since we won't have a snapshot for the empty volumes, we won't want to write the annotation for them onto the pod.
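
A runnable sketch of that filtering, using a trimmed-down stand-in for the real status type (the type and function names here are illustrative, not velero's actual API):

```go
package main

import "fmt"

// PodVolumeBackupStatus is a trimmed-down stand-in for the real
// velerov1api status type, with only the fields this sketch needs.
type PodVolumeBackupStatus struct {
	Phase      string // "Completed", "Failed", ...
	SnapshotID string
}

// recordSnapshot decides whether a completed PodVolumeBackup should
// contribute a volume->snapshotID entry to the results map that
// pkg/backup later turns into pod annotations. Completed PVBs with an
// empty SnapshotID (empty volumes) are skipped: there is no restic
// snapshot to restore, so no annotation should be written for them.
func recordSnapshot(snapshots map[string]string, volume string, status PodVolumeBackupStatus) {
	if status.Phase != "Completed" || status.SnapshotID == "" {
		return
	}
	snapshots[volume] = status.SnapshotID
}

func main() {
	snapshots := map[string]string{}
	recordSnapshot(snapshots, "nginx-logs", PodVolumeBackupStatus{Phase: "Completed"}) // empty volume: skipped
	recordSnapshot(snapshots, "data", PodVolumeBackupStatus{Phase: "Completed", SnapshotID: "abc123"})
	fmt.Println(snapshots) // map[data:abc123]
}
```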

Signed-off-by: Carlisia <carlisiac@vmware.com>
@carlisia (Contributor, Author)

This is what the output looks like:

[screenshot of the resulting output]

@carlisia requested review from skriss and nrb on May 29, 2019 01:05
@skriss (Member) left a comment:

Mostly LGTM, two small comments. Let's add a changelog for this as well.

```
@@ -168,6 +168,10 @@ ForEachVolume:
		case res := <-resultsChan:
			switch res.Status.Phase {
			case velerov1api.PodVolumeBackupPhaseCompleted:
				if res.Status.SnapshotID == "" {
```
@skriss (Member):

let's add a comment here explaining that this would be the case if the volume was empty and therefore we don't actually have a restic snapshot, since it's not immediately obvious.

```
	r.Status.Phase = velerov1api.PodVolumeBackupPhaseCompleted
	r.Status.SnapshotID = snapshotID
	if emptySnapshot {
		r.Status.Message = "backup completed with an empty volume"
```
@skriss (Member):

maybe something like "volume was empty so no snapshot was taken" would be slightly clearer?

Signed-off-by: Carlisia <carlisiac@vmware.com>
@skriss (Member) left a comment:

LGTM, can be squashed on merge.

@nrb merged commit 0804f34 into vmware-tanzu:master on May 30, 2019
@carlisia deleted the c-restic-log-error branch on May 30, 2019 16:51

Successfully merging this pull request may close these issues: don't error when taking a restic backup if volume is empty (#1266)