
Add group volume snapshot CRD #222

Merged
harsh-px merged 3 commits into libopenstorage:master from harsh-px:group-snap-crd
Jan 14, 2019

Conversation


@harsh-px harsh-px commented Dec 21, 2018

This change adds a CRD and controller for group local and cloud snapshots, including 3D snaps.

This version also deprecates group snapshots taken using legacy VolumeSnapshots; hence those specs have been removed from the integration tests.

Integration tests will be added for GroupVolumeSnapshots once this and portworx/sched-ops#98 are merged.

@harsh-px harsh-px self-assigned this Dec 21, 2018
@harsh-px harsh-px requested a review from disrani-px December 21, 2018 01:27
@harsh-px harsh-px force-pushed the group-snap-crd branch 2 times, most recently from d708f1f to f8c44c9 Compare January 11, 2019 06:18
@disrani-px disrani-px (Contributor) left a comment

Still reviewing the controller


credID := getCredIDFromSnapshot(groupSnap.Spec.Options)

resp, err := p.volDriver.CloudBackupGroupCreate(&api.CloudBackupGroupCreateRequest{
Contributor:

If something crashes after triggering the cloudsnap but before we store the task IDs, those cloudsnaps will be orphaned. We should pass in a task ID to this API, similar to the Migration API. This doesn't need to be part of this change.

Author:

You mean pass a task ID to the OSD API to allow it to resume from a previously triggered backup?

Contributor:

Yes, we should pass in the same taskID for the same cloudsnap when we resume. The response should return an appropriate error if the same taskID has already been triggered and we can deal with it here.
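The idempotent-resume idea above can be sketched as follows. This is an illustrative, self-contained Go sketch, not the actual libopenstorage API: `backupTaskID`, `errAlreadyRunning`, `triggerGroupCloudBackup`, and `ensureGroupCloudBackup` are all hypothetical names, and the driver-side state is simulated with a map.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical sketch: derive a deterministic task ID from the
// GroupVolumeSnapshot UID so that a retry after a crash reuses the same ID
// instead of triggering a duplicate cloudsnap.

var errAlreadyRunning = fmt.Errorf("backup with this task ID already running")

// backupTaskID is stable across retries for the same snapshot object.
func backupTaskID(groupSnapUID string) string {
	return "stork-groupsnap-" + groupSnapUID
}

// started simulates driver-side state keyed by task ID.
var started = map[string]bool{}

func triggerGroupCloudBackup(taskID string) error {
	if started[taskID] {
		return errAlreadyRunning
	}
	started[taskID] = true
	return nil
}

func ensureGroupCloudBackup(groupSnapUID string) (string, error) {
	taskID := backupTaskID(groupSnapUID)
	err := triggerGroupCloudBackup(taskID)
	if err == errAlreadyRunning {
		// The backup was already triggered before the crash; resume by
		// tracking the existing task instead of starting a new one.
		return taskID, nil
	}
	return taskID, err
}

func main() {
	id1, _ := ensureGroupCloudBackup("uid-123")
	id2, _ := ensureGroupCloudBackup("uid-123") // retry after simulated restart
	fmt.Println(id1 == id2, strings.HasPrefix(id1, "stork-groupsnap-")) // true true
}
```

Because the ID is derived from the object rather than generated randomly, the retry path and the first-run path converge on the same driver task.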

Contributor:

Can you file a bug to track this?


// event is already being processed. In such a situation, the operator framework
// will provide a groupSnapshot which is the same version as the previous groupSnapshot.
// If we reprocess an outdated object, this can throw off the status checks in the snapshot stage.
m.minResourceVersion = groupSnapshot.ResourceVersion
Contributor:

I don't think this is required, it doesn't happen in the migration controller.

Author:

I verified this by printing the actual objects between 2 updates. This was happening very regularly when the sync period was 15 seconds.

The first print showed that I was updating the snapshots in the status of the group snapshot using sdk.Update().
The second print showed the new handle event coming in at the same timestamp. The groupSnapshot object in the event did not have the snapshots in the status and had the same Metadata.ResourceVersion.
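The stale-event guard being discussed can be sketched like this. It is an illustrative, self-contained Go sketch, not the stork code: `controller`, `shouldProcess`, and `minResourceVersion` are the sketch's own names, and the numeric comparison rests on the assumption (stated in the comment) that ResourceVersion parses as an integer on etcd-backed clusters.

```go
package main

import (
	"fmt"
	"strconv"
)

// Sketch: skip handle events that carry a ResourceVersion no newer than the
// one we last wrote, so a stale cached copy cannot undo status updates.

type groupSnapshot struct {
	ResourceVersion string
}

type controller struct {
	minResourceVersion string
}

// shouldProcess compares ResourceVersions numerically. Kubernetes does not
// guarantee ResourceVersion is numeric in general, but etcd-backed clusters
// emit monotonically increasing integers, which is what this check assumes.
func (c *controller) shouldProcess(snap *groupSnapshot) bool {
	if c.minResourceVersion == "" {
		return true
	}
	min, err1 := strconv.ParseUint(c.minResourceVersion, 10, 64)
	cur, err2 := strconv.ParseUint(snap.ResourceVersion, 10, 64)
	if err1 != nil || err2 != nil {
		return true // fall back to processing on parse failure
	}
	return cur > min
}

func main() {
	c := &controller{}
	fresh := &groupSnapshot{ResourceVersion: "100"}
	fmt.Println(c.shouldProcess(fresh)) // true: nothing recorded yet
	c.minResourceVersion = fresh.ResourceVersion
	stale := &groupSnapshot{ResourceVersion: "100"} // same version redelivered
	fmt.Println(c.shouldProcess(stale))             // false: already handled
}
```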

validateCRDTimeout time.Duration = 1 * time.Minute
resyncPeriod = 60 * time.Second

updateCRD = true
Contributor:

These seem odd; it's clearer to just return true or false from the function where this is being used. If you still want to use it, just define one variable and use !updateCRD.

Author:

It was done to make the returns more readable. Made the change to just have one variable.
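The single-constant style that came out of this exchange can be sketched as below; `ensureCRD` is a hypothetical stand-in for the function under review, shown only to illustrate how one named constant plus `!updateCRD` keeps both return sites readable.

```go
package main

import "fmt"

// One named constant instead of a true/false pair: return sites read as
// "updateCRD" or "!updateCRD" rather than as bare booleans.
const updateCRD = true

// ensureCRD is illustrative: it reports whether the caller still needs to
// create/update the CRD.
func ensureCRD(alreadyExists bool) (bool, error) {
	if alreadyExists {
		return !updateCRD, nil // nothing to do
	}
	return updateCRD, nil // caller should register the CRD
}

func main() {
	need, _ := ensureCRD(false)
	skip, _ := ensureCRD(true)
	fmt.Println(need, skip) // true false
}
```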

Recorder: recorder,
}
if err := groupsnapshotInst.Init(); err != nil {
log.Fatalf("Error initializing groupsnapshot controller due to: %v", err)
Contributor:

Remove "due to" to keep it consistent with the other messages above.

case crdv1.PortworxSnapshotTypeCloud:
return p.getGroupCloudSnapStatus(snap)
case crdv1.PortworxSnapshotTypeLocal:
return nil, fmt.Errorf("status API not supported for local group snapshots")
Contributor:

This can't return an error. The caller wouldn't know how to deal with a legitimate error as opposed to this one.

Author:

Changed this to errors.ErrNotSupported. However, there is no change in the controller handling. It still needs to record the err as an event so it clearly shows up in the snapshot events. The controller won't treat driver errors as sync errors, so it won't return the err from .Handle().


return !updateCRD, err
}
} else {
logrus.Infof("Creating new group snapshot")
Contributor:

Log with GroupSnapshotLog

response, err = m.Driver.GetGroupSnapshotStatus(groupSnap)
if err != nil {
logrus.Errorf("group snapshot status returned err: %v", err)
m.Recorder.Event(groupSnap,
Contributor:

This will log multiple events? Handle() is also recording events when this function returns an error.

Same for other places in this function where event is being recorded.
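One way to resolve the double-recording concern above is to make Handle() the single place that turns a returned error into an event, which can be sketched as below. All names (`recorder`, `checkStatus`, `handle`) are illustrative, not the stork code.

```go
package main

import "fmt"

// Sketch: inner functions only return errors; the caller owns event
// recording, so each failure produces exactly one event.

type recorder struct{ count int }

func (r *recorder) Event(reason, msg string) { r.count++ }

func checkStatus() error {
	// Return the error instead of recording an event here.
	return fmt.Errorf("group snapshot status returned err")
}

func handle(r *recorder) error {
	if err := checkStatus(); err != nil {
		r.Event("Failed", err.Error()) // recorded exactly once
		return err
	}
	return nil
}

func main() {
	r := &recorder{}
	_ = handle(r)
	fmt.Println(r.count) // 1
}
```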

func (m *GroupSnapshotController) handleDelete(groupSnap *stork_api.GroupVolumeSnapshot) error {
err := m.Driver.DeleteGroupSnapshot(groupSnap)
if err != nil {
m.Recorder.Event(groupSnap,
Contributor:

I think the object is gone at this time, so the event won't really show up.

pkg/log/log.go Outdated
appv1beta1 "k8s.io/api/apps/v1beta1"
appv1beta2 "k8s.io/api/apps/v1beta2"
"k8s.io/api/core/v1"
v1 "k8s.io/api/core/v1"
Contributor:

Not required

destroyAndWait(t, ctx)
}

func groupSnapshotTest(t *testing.T) {
Contributor:

We need to have similar tests for the new groupsnapshot CRD

Author:

Will add the integration tests after this and portworx/sched-ops#98 merges.

@disrani-px (Contributor):

Also need to add sub commands to storkctl

@harsh-px (Author):

> Also need to add sub commands to storkctl

Created #226 for this. storkctl support was absent for prior group snapshots too, and we are already too late for the 2.0.2 release, so this is out of scope for the new changes.

@harsh-px (Author):

Have addressed the review comments.

@@ -0,0 +1,50 @@
apiVersion: v1
Contributor:

The specs directory is for the deployments. Can you move these under /examples so that they are easier to find?

Author:

Moved.

if csStatus.status == api.CloudBackupStatusDone {
conditions = getReadySnapshotConditions()
doneTasks = append(doneTasks, taskID)
snapIDsPendingRevert = append(snapIDsPendingRevert, csStatus.cloudSnapID)
Contributor:

My point was that it's confusing to say that completed cloudsnaps need to be reverted. The decision that they need to be reverted is made somewhere else. This should just keep track of snaps that are done, and the variable name should reflect that.

@harsh-px (Author):

Addressed last 2 comments.

Harsh Desai added 3 commits January 14, 2019 14:46
Signed-off-by: Harsh Desai <harsh@portworx.com>
Signed-off-by: Harsh Desai <harsh@portworx.com>

Address review comments

- process cloudsnap failed tasks
- revert local and cloudsnaps as soon as the first failure is observed
- allow deletes of legacy group snapshots
- group snapshot controller part of snapshot controller
- fix cassandra restore pvcs
- fix v1 imports
- fix duplicate events
- revert active cloudsnapshots too
- use groupsnap logger
- move specs to examples
- use explicit variables to track done and active IDs

Signed-off-by: Harsh Desai <harsh@portworx.com>
Signed-off-by: Harsh Desai <harsh@portworx.com>
@harsh-px harsh-px merged commit 2d46472 into libopenstorage:master Jan 14, 2019
@harsh-px harsh-px deleted the group-snap-crd branch January 14, 2019 23:17
@disrani-px disrani-px added enhancement release-note Information about this change needs to be added to the release note labels Feb 19, 2019
@disrani-px disrani-px added this to the 2.1.0 milestone Feb 19, 2019