Add group volume snapshot CRD #222
Force-pushed from d708f1f to f8c44c9
disrani-px left a comment: Still reviewing the controller.
    credID := getCredIDFromSnapshot(groupSnap.Spec.Options)

    resp, err := p.volDriver.CloudBackupGroupCreate(&api.CloudBackupGroupCreateRequest{
If something crashes after triggering the cloudsnap but before we store the task IDs those cloudsnaps will be orphaned. We should pass in a task ID for this API similar to the Migration API. Doesn't need to be part of this change.
You mean pass a task ID to the OSD API to allow it to resume from a previously triggered backup?
Yes, we should pass in the same taskID for the same cloudsnap when we resume. The response should return an appropriate error if the same taskID has already been triggered and we can deal with it here.
Can you file a bug to track this?
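The resume pattern discussed above can be sketched as follows. This is a simplified, self-contained illustration, not the actual OpenStorage API: `fakeDriver`, `GroupBackupCreate`, and `ensureGroupBackup` are hypothetical names, and the "already triggered" error stands in for whatever error the real backend would return when the same task ID is re-submitted.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyTriggered stands in for the error the backend would return
// when a backup with the same task ID has already been started.
var errAlreadyTriggered = errors.New("task already triggered")

// fakeDriver simulates a volume driver that tracks triggered task IDs.
type fakeDriver struct {
	triggered map[string]bool
}

// GroupBackupCreate starts a backup for taskID, or reports that one with
// the same ID is already running.
func (d *fakeDriver) GroupBackupCreate(taskID string) error {
	if d.triggered[taskID] {
		return errAlreadyTriggered
	}
	d.triggered[taskID] = true
	return nil
}

// ensureGroupBackup triggers the backup and treats "already triggered" as
// a successful resume, so a crash-and-retry does not orphan the cloudsnap.
func ensureGroupBackup(d *fakeDriver, taskID string) (resumed bool, err error) {
	err = d.GroupBackupCreate(taskID)
	if errors.Is(err, errAlreadyTriggered) {
		return true, nil
	}
	return false, err
}

func main() {
	d := &fakeDriver{triggered: map[string]bool{}}
	first, _ := ensureGroupBackup(d, "groupsnap-uid-1")
	second, _ := ensureGroupBackup(d, "groupsnap-uid-1") // retry after a crash
	fmt.Println(first, second) // false true
}
```

Because the controller derives the task ID from the group snapshot object itself, a restart naturally replays the same ID and resumes instead of spawning a duplicate cloudsnap.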
    // event is already being processed. In such a situation, the operator framework
    // will provide a groupSnapshot which is the same version as the previous groupSnapshot.
    // If we reprocess an outdated object, this can throw off the status checks in the snapshot stage.
    m.minResourceVersion = groupSnapshot.ResourceVersion
I don't think this is required, it doesn't happen in the migration controller.
I verified this by printing the actual objects between 2 updates. This was happening very regularly when the sync period was 15 seconds.
The first print showed that I was updating the snapshots in the status of the group snapshot using sdk.Update().
The second print showed the new handle event coming in at the same timestamp. The groupSnapshot object in the event did not have the snapshots in status and had the same Metadata.ResourceVersion.
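The stale-event guard being described can be sketched like this. It is a minimal illustration with a hypothetical `shouldProcess` helper; note that Kubernetes treats `resourceVersion` as an opaque string, so comparing it numerically is an etcd-specific assumption made only to show the idea.

```go
package main

import (
	"fmt"
	"strconv"
)

// shouldProcess reports whether an incoming object's resource version is
// at least the minimum version recorded after our own status update.
// Replayed events carrying an older version are skipped so they cannot
// throw off the status checks in the snapshot stage.
func shouldProcess(minVersion, eventVersion string) bool {
	if minVersion == "" {
		return true // nothing recorded yet, process everything
	}
	min, err1 := strconv.ParseUint(minVersion, 10, 64)
	ev, err2 := strconv.ParseUint(eventVersion, 10, 64)
	if err1 != nil || err2 != nil {
		return true // fall back to processing if the versions don't parse
	}
	return ev >= min
}

func main() {
	// After updating status we recorded version "105"; a replayed event
	// carrying version "104" is stale and must be skipped.
	fmt.Println(shouldProcess("105", "104")) // false
	fmt.Println(shouldProcess("105", "105")) // true
}
```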
    validateCRDTimeout time.Duration = 1 * time.Minute
    resyncPeriod = 60 * time.Second

    updateCRD = true
These seem odd; it's clearer to just return true or false from the function where this is used. If you still want to use it, define just one variable and use !updateCRD.
It was done to make the returns more readable. Made the change to have just one variable.
cmd/stork/stork.go (outdated)

    Recorder: recorder,
    }
    if err := groupsnapshotInst.Init(); err != nil {
        log.Fatalf("Error initializing groupsnapshot controller due to: %v", err)
Remove "due to"; keep it consistent with the other messages above.
drivers/volume/portworx/portworx.go (outdated)

    case crdv1.PortworxSnapshotTypeCloud:
        return p.getGroupCloudSnapStatus(snap)
    case crdv1.PortworxSnapshotTypeLocal:
        return nil, fmt.Errorf("status API not supported for local group snapshots")
This can't return an error. The caller wouldn't know how to deal with a legit error as opposed to this error.
Changed this to errors.ErrNotSupported. However, there is no change in the controller handling. It still needs to record the error as an event so it clearly shows up in the snapshot events. The controller won't treat driver errors as sync errors, so it won't return the error from .Handle().
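The handling described above, where a driver error is surfaced as an event on the object but does not fail the sync, can be sketched as follows. This is a simplified stand-in: `recorder` and `handleStatus` are hypothetical names, and `errNotSupported` stands in for the driver layer's errors.ErrNotSupported.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotSupported stands in for errors.ErrNotSupported from the driver layer.
var errNotSupported = errors.New("operation not supported")

// recorder is a minimal stand-in for the Kubernetes event recorder.
type recorder struct{ events []string }

func (r *recorder) Event(reason, msg string) {
	r.events = append(r.events, reason+": "+msg)
}

// handleStatus records driver errors as events on the object but returns
// nil, so the controller does not treat them as sync failures and retry.
func handleStatus(r *recorder, driverErr error) error {
	if driverErr != nil {
		r.Event("GroupSnapshotFailed", driverErr.Error())
		return nil // surfaced via events, not via sync errors
	}
	return nil
}

func main() {
	r := &recorder{}
	_ = handleStatus(r, errNotSupported)
	fmt.Println(len(r.events)) // 1
}
```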
        return !updateCRD, err
    }
    } else {
        logrus.Infof("Creating new group snapshot")
Log with GroupSnapshotLog
    response, err = m.Driver.GetGroupSnapshotStatus(groupSnap)
    if err != nil {
        logrus.Errorf("group snapshot status returned err: %v", err)
        m.Recorder.Event(groupSnap,
This will log multiple events? Handle() is also recording events when this function returns an error. The same applies to the other places in this function where an event is being recorded.
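The duplication being pointed out, and the usual fix of recording events at a single layer, can be sketched like this. It is a minimal illustration with hypothetical names (`recorder`, `checkStatus`, `handle`); inner functions return errors to the caller and only the outer Handle()-equivalent records the event.

```go
package main

import "fmt"

// recorder is a minimal stand-in for the Kubernetes event recorder.
type recorder struct{ events []string }

func (r *recorder) Event(msg string) { r.events = append(r.events, msg) }

// checkStatus returns errors to the caller instead of recording them
// itself, so each failure can produce exactly one event.
func checkStatus(ok bool) error {
	if !ok {
		return fmt.Errorf("group snapshot status returned err")
	}
	return nil
}

// handle is the single place that turns an error into an event.
func handle(r *recorder, ok bool) {
	if err := checkStatus(ok); err != nil {
		r.Event(err.Error())
	}
}

func main() {
	r := &recorder{}
	handle(r, false)
	fmt.Println(len(r.events)) // 1, not 2
}
```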
    func (m *GroupSnapshotController) handleDelete(groupSnap *stork_api.GroupVolumeSnapshot) error {
        err := m.Driver.DeleteGroupSnapshot(groupSnap)
        if err != nil {
            m.Recorder.Event(groupSnap,
I think the object is gone at this time, so the event won't really show up.
pkg/log/log.go (outdated)

    appv1beta1 "k8s.io/api/apps/v1beta1"
    appv1beta2 "k8s.io/api/apps/v1beta2"
    "k8s.io/api/core/v1"
    v1 "k8s.io/api/core/v1"
    destroyAndWait(t, ctx)
    }

    func groupSnapshotTest(t *testing.T) {
We need to have similar tests for the new groupsnapshot CRD
Will add the integration tests after this and portworx/sched-ops#98 merges.
Also need to add sub-commands to storkctl.

Created #226 for this. storkctl support was absent for prior groupsnaps too, and we are already too late for the 2.0.2 release, so it's out of scope for this change.
Force-pushed from 16bf491 to 589cd9c
Have addressed the review comments.
    @@ -0,0 +1,50 @@
    apiVersion: v1
The specs directory is for the deployments. Can you move these under /examples so that they are easier to find?
drivers/volume/portworx/portworx.go (outdated)

    if csStatus.status == api.CloudBackupStatusDone {
        conditions = getReadySnapshotConditions()
        doneTasks = append(doneTasks, taskID)
        snapIDsPendingRevert = append(snapIDsPendingRevert, csStatus.cloudSnapID)
My point was that it's confusing to say completed cloudsnaps need to be reverted. The decision that they need to be reverted is being made somewhere else. This should just keep track of snaps that are done, and the variable name should reflect that.
Addressed the last 2 comments.
Force-pushed from 8de4da0 to 3054ece
Signed-off-by: Harsh Desai <harsh@portworx.com>

Address review comments:
- process cloudsnap failed tasks
- revert local and cloud snapshots as soon as the first failure is observed
- allow deletes of legacy group snapshots
- make the group snapshot controller part of the snapshot controller
- fix cassandra restore PVCs
- fix v1 imports
- fix duplicate events
- revert active cloud snapshots too
- use the groupsnap logger
- move specs to examples
- use explicit variables to track done and active IDs
Force-pushed from 3054ece to 725a456
This change adds a CRD and controller for group local and cloud snapshots, including 3D snaps.
This version also deprecates group snapshots using legacy volumesnapshots; hence those specs have been removed from the integration tests.
Integration tests will be added for GroupVolumeSnapshots once this and portworx/sched-ops#98 are merged.