br: clean volumes when restore volume failed #5634
/run-pull-e2e-kind-br
Codecov Report
Attention: Patch coverage is
@@             Coverage Diff             @@
##           master    #5634       +/-   ##
===========================================
- Coverage   61.46%   47.81%   -13.65%
===========================================
  Files         235      219       -16
  Lines       30397    30307       -90
===========================================
- Hits        18683    14492     -4191
- Misses       9840    14095     +4255
+ Partials     1874     1720      -154
/retest
/retest
/run-pull-e2e-kind-basic
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: BornChanger, csuzhangxc
/cherry-pick release-1.5
@csuzhangxc: new pull request created to branch
Thanks for working on it. Left a couple of comments.
@@ -130,6 +130,19 @@ func IsRestoreVolumeComplete(restore *Restore) bool {
	return condition != nil && condition.Status == corev1.ConditionTrue
}

// IsRestoreVolumeFailed returns true if a Restore for volume is Failed
func IsRestoreVolumeFailed(restore *Restore) bool {
Trying to understand what exactly IsRestoreVolumeFailed entails.
We should only delete volumes if the restore fails in the first step (i.e. RestoreVolume, which creates the volumes) and before creating the PVs/PVCs that attach the volumes to EC2 nodes. It's the volumes that are never attached to an EC2 node (via PVCs) that cause a leak. After PVCs are created, they are tracked and deleted by the CSI driver.
Also, if the restore fails during warmup or restore_data, it's actually useful to investigate the TiKVs to see what happened. If the volumes are deleted, it would be hard to investigate.
We create the PVs/PVCs, warm up the volumes, and restore data only if the Restore CR has the status RestoreVolumeComplete. If restore volume failed, the status of the Restore CR is Failed. So this condition ensures that the restore failed at the restore volume step.
Got it. So !IsRestoreVolumeComplete(restore) is the condition which ensures that the RestoreVolume step has failed.
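For readers following this thread, here is a minimal sketch of the check being discussed: a restore counts as "volume failed" only when it is Failed and never reached VolumeComplete. The `Restore` and `ConditionType` types and the `hasCondition` helper are simplified stand-ins invented for illustration, not the operator's real API types, and the body of `IsRestoreVolumeFailed` is an assumption based on the explanation above rather than the code in this PR.

```go
package main

import "fmt"

// ConditionType and Restore are simplified, hypothetical stand-ins for the
// operator's Restore CR types; they exist only for this illustration.
type ConditionType string

const (
	RestoreVolumeComplete ConditionType = "VolumeComplete"
	RestoreFailed         ConditionType = "Failed"
)

type Restore struct {
	// Conditions maps a condition type to whether its status is True.
	Conditions map[ConditionType]bool
}

// hasCondition reports whether the given condition is present and True.
func hasCondition(r *Restore, t ConditionType) bool {
	return r != nil && r.Conditions[t]
}

func IsRestoreVolumeComplete(r *Restore) bool { return hasCondition(r, RestoreVolumeComplete) }

func IsRestoreFailed(r *Restore) bool { return hasCondition(r, RestoreFailed) }

// IsRestoreVolumeFailed returns true only when the restore failed before the
// volume step completed, i.e. before any PV/PVC could have been created.
func IsRestoreVolumeFailed(r *Restore) bool {
	return IsRestoreFailed(r) && !IsRestoreVolumeComplete(r)
}

func main() {
	// Failed during the restore volume step: volumes are safe to clean up.
	early := &Restore{Conditions: map[ConditionType]bool{RestoreFailed: true}}
	// Failed later (warmup or restore data): keep volumes for investigation.
	late := &Restore{Conditions: map[ConditionType]bool{
		RestoreVolumeComplete: true,
		RestoreFailed:         true,
	}}
	fmt.Println(IsRestoreVolumeFailed(early), IsRestoreVolumeFailed(late))
}
```

Requiring both conditions keeps the volumes around when the failure happens during warmup or restore data, which matches the concern raised earlier about preserving them for investigation.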
@@ -204,6 +205,68 @@ func (e *EC2Session) DeleteSnapshots(snapIDMap map[string]string, deleteRatio fl
	return nil
}

func (e *EC2Session) DeleteVolumes(volumeIDs []string) error {
We should only delete volumes which are unattached. How is that validated here?
If the volume is attached, the delete request will fail; in other words, we can't delete an attached volume. In my test, the volume couldn't be deleted while the tikv pod was running. This is why I deleted the tikv statefulset in my test steps.
Got it. Basically, we rely on the AWS API contract to prevent attached volumes from getting deleted: "Deletes the specified EBS volume. The volume must be in the available state (not attached to an instance)."
https://docs.aws.amazon.com/cli/latest/reference/ec2/delete-volume.html
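To make the behavior described in this thread concrete, here is a rough sketch of a best-effort delete loop that leans on the delete call itself rejecting attached volumes. Everything in it is hypothetical: `VolumeDeleter` is an invented abstraction over the EC2 delete call, `fakeEC2` is an in-memory stand-in that mimics the "must be in the available state" rule, and this `DeleteVolumes` is not the implementation in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// VolumeDeleter is a hypothetical abstraction over the EC2 delete-volume call.
type VolumeDeleter interface {
	DeleteVolume(volumeID string) error
}

var errVolumeInUse = errors.New("volume is in use")

// fakeEC2 is an in-memory stand-in that mimics the documented contract:
// a volume can be deleted only when it is not attached to an instance.
type fakeEC2 struct {
	attached map[string]bool // volume ID -> whether it is attached
}

func (f *fakeEC2) DeleteVolume(id string) error {
	isAttached, ok := f.attached[id]
	if !ok {
		return fmt.Errorf("volume %s not found", id)
	}
	if isAttached {
		return fmt.Errorf("delete %s: %w", id, errVolumeInUse)
	}
	delete(f.attached, id)
	return nil
}

// DeleteVolumes attempts every ID and reports the ones that could not be
// deleted, so one attached volume does not block cleanup of the rest.
func DeleteVolumes(d VolumeDeleter, ids []string) error {
	var failed []string
	for _, id := range ids {
		if err := d.DeleteVolume(id); err != nil {
			failed = append(failed, fmt.Sprintf("%s (%v)", id, err))
		}
	}
	if len(failed) > 0 {
		return fmt.Errorf("failed to delete %d volume(s): %s",
			len(failed), strings.Join(failed, "; "))
	}
	return nil
}

func main() {
	ec2 := &fakeEC2{attached: map[string]bool{"vol-a": false, "vol-b": true}}
	err := DeleteVolumes(ec2, []string{"vol-a", "vol-b"})
	fmt.Println(err)
}
```

In this sketch no separate attachment check is made before deleting; the attached volume simply surfaces as an error, which is the same reliance on the API contract that the thread settles on.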
Thanks for the responses.
What problem does this PR solve?
Add a function that cleans up volumes when restore volume fails, so that we avoid leaking volumes.
Closes #5638
What is changed and how does it work?
Code changes
Tests
manual test steps:
VolumeComplete to Failed to mock restore volume failed scenario
tidb.pingcap.com/tikv-volumes-ready to block tikv creation, then delete the tikv statefulset to detach the EBS volumes
Side effects
Related changes
Release Notes