br: clean volumes when restore volume failed #5634

WangLe1321 · 2024-04-24T13:54:14Z

What problem does this PR solve?

Add a function that cleans up volumes when the volume restore fails, so that the volumes are not leaked.

Closes #5638

What is changed and how does it work?

Code changes

Has Go code change
Has CI related scripts change

Tests

Unit test
E2E test
Manual test
No code

Manual test steps:

1. Create a cluster.
2. Create a volume backup.
3. Create a restore cluster.
4. Create a volume restore using the volume backup from step 2. In this step, the volume restore succeeds and the TiKV pods start.
5. Edit the restore CR and change its status from VolumeComplete to Failed to mock a failed volume restore.
6. Edit the tc CR and remove the annotation tidb.pingcap.com/tikv-volumes-ready to block TiKV creation, then delete the TiKV StatefulSet to detach the EBS volumes.
7. Wait for the volumes to be detached and deleted by tidb-operator.

Side effects

Breaking backward compatibility
Other side effects:

Related changes

Need to cherry-pick to the release branch
Need to update the documentation

Release Notes

Please refer to Release Notes Language Style Guide before writing the release note.

WangLe1321 · 2024-04-29T11:44:41Z

/run-pull-e2e-kind-br

codecov-commenter · 2024-04-29T12:46:54Z

Codecov Report

Attention: Patch coverage is 0%, with 83 lines in your changes missing coverage. Please review.

Project coverage is 47.81%. Comparing base (3897095) to head (bf2e796).
Report is 3 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff             @@
##           master    #5634       +/-   ##
===========================================
- Coverage   61.46%   47.81%   -13.65%     
===========================================
  Files         235      219       -16     
  Lines       30397    30307       -90     
===========================================
- Hits        18683    14492     -4191     
- Misses       9840    14095     +4255     
+ Partials     1874     1720      -154

Flag	Coverage Δ
e2e	`47.81% <0.00%> (?)`
unittest	`?`

BornChanger · 2024-05-04T02:50:49Z

/retest

BornChanger · 2024-05-04T14:27:54Z

/retest

BornChanger · 2024-05-04T18:45:17Z

/run-pull-e2e-kind-basic

ti-chi-bot · 2024-05-06T03:40:44Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BornChanger, csuzhangxc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [BornChanger,csuzhangxc]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2024-05-06T03:40:45Z

[LGTM Timeline notifier]

Timeline:

2024-05-04 02:50:36.183869862 +0000 UTC m=+671189.941005436: ☑️ agreed by BornChanger.
2024-05-04 02:51:05.269796707 +0000 UTC m=+671219.026932276: ✖️🔁 reset by ti-chi-bot[bot].
2024-05-06 03:40:45.043388677 +0000 UTC m=+846998.800524249: ☑️ agreed by csuzhangxc.

csuzhangxc · 2024-05-06T03:44:16Z

/cherry-pick release-1.5

ti-chi-bot · 2024-05-06T03:45:09Z

@csuzhangxc: new pull request created to branch release-1.5: #5639.

In response to this:

/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

Co-authored-by: WangLe1321 <wangle1321@163.com>

nkg-

Thanks for working on it. Left a couple of comments.

nkg- · 2024-05-06T18:50:19Z

pkg/apis/pingcap/v1alpha1/restore.go

@@ -130,6 +130,19 @@ func IsRestoreVolumeComplete(restore *Restore) bool {
 	return condition != nil && condition.Status == corev1.ConditionTrue
 }

+// IsRestoreVolumeFailed returns true if a Restore for volume is Failed
+func IsRestoreVolumeFailed(restore *Restore) bool {


Trying to understand what exactly IsRestoreVolumeFailed entails.

We should only delete volumes if the restore fails in the first step (i.e. RestoreVolume, which creates the volumes), before the PVs/PVCs that attach the volumes to EC2 nodes are created. It's the volumes that are never attached to an EC2 node (via PVCs) that leak. Once the PVCs are created, the volumes are tracked and deleted by the CSI driver.

Also, if the restore fails during warmup or restore_data, it is useful to inspect the TiKVs to see what happened. If the volumes are deleted, that investigation becomes hard.

We create the PVs/PVCs, warm up the volumes, and restore data only if the restore CR has the status RestoreVolumeComplete. If the volume restore fails, the status of the restore CR is Failed. So this condition ensures that the restore failed at the restore volume step.

Got it. So !IsRestoreVolumeComplete(restore) is the condition that ensures the RestoreVolume step has failed.
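
To make the condition discussed in this thread concrete, here is a minimal, self-contained sketch. It does not reproduce the PR's code: the types, the `hasCondition` helper, and the `volumeStepFailed` function are hypothetical stand-ins, and only the two status names (Failed and VolumeComplete) come from the discussion above. The assumption is that a restore counts as "failed at the volume step" when it is marked Failed and never reached VolumeComplete.

```go
package main

import "fmt"

// ConditionType and the types below are simplified stand-ins for
// illustration; the real Restore CR has many more fields.
type ConditionType string

const (
	RestoreFailed         ConditionType = "Failed"
	RestoreVolumeComplete ConditionType = "VolumeComplete"
)

// Condition records whether a given condition currently holds.
type Condition struct {
	Type   ConditionType
	Status bool
}

// Restore carries only the conditions needed for this sketch.
type Restore struct {
	Conditions []Condition
}

// hasCondition reports whether the restore has the condition set to true.
func hasCondition(r *Restore, t ConditionType) bool {
	for _, c := range r.Conditions {
		if c.Type == t && c.Status {
			return true
		}
	}
	return false
}

// volumeStepFailed (hypothetical name) is true only when the restore
// failed before the volume step completed, i.e. before any PV/PVC
// could have attached the volumes to a node.
func volumeStepFailed(r *Restore) bool {
	return hasCondition(r, RestoreFailed) && !hasCondition(r, RestoreVolumeComplete)
}

func main() {
	failedEarly := &Restore{Conditions: []Condition{{RestoreFailed, true}}}
	failedLate := &Restore{Conditions: []Condition{
		{RestoreVolumeComplete, true},
		{RestoreFailed, true},
	}}
	fmt.Println("failed at volume step:", volumeStepFailed(failedEarly))
	fmt.Println("failed after volume step:", volumeStepFailed(failedLate))
}
```

With a check of this shape, a failure during warmup or data restore leaves the volumes in place for investigation, which is the concern raised in the review.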

nkg- · 2024-05-06T18:57:02Z

pkg/backup/util/aws_ebs.go

@@ -204,6 +205,68 @@ func (e *EC2Session) DeleteSnapshots(snapIDMap map[string]string, deleteRatio fl
 	return nil
 }

+func (e *EC2Session) DeleteVolumes(volumeIDs []string) error {


We should only delete volumes that are unattached. How is that validated here?

If the volume is attached, the delete request fails, so an attached volume can't be deleted. In my test, the volume couldn't be deleted while the TiKV pod was running. This is why I deleted the TiKV StatefulSet in my test steps.

Got it. Basically, we rely on the AWS API contract to prevent attached volumes from being deleted. From the AWS documentation: "Deletes the specified EBS volume. The volume must be in the available state (not attached to an instance)."

https://docs.aws.amazon.com/cli/latest/reference/ec2/delete-volume.html
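
As an illustration of the point above, the following sketch deletes only volumes in the `available` state. It is not the PR's `DeleteVolumes` implementation: per the thread, the PR relies on the AWS API rejecting deletes of attached volumes, whereas this sketch adds an explicit state check on top. The `volumeAPI` interface, the `fakeEC2` type, and `deleteAvailableVolumes` are hypothetical names; `fakeEC2` is an in-memory stand-in that mimics the documented contract so the example runs without AWS credentials.

```go
package main

import "fmt"

// volumeAPI is a hypothetical, minimal view of the volume operations
// needed here; it is not the AWS SDK interface.
type volumeAPI interface {
	State(volumeID string) (string, error)
	Delete(volumeID string) error
}

// fakeEC2 is an in-memory stand-in that refuses to delete a volume
// unless it is in the "available" state, like the documented contract.
type fakeEC2 struct {
	states map[string]string
}

func (f *fakeEC2) State(id string) (string, error) {
	s, ok := f.states[id]
	if !ok {
		return "", fmt.Errorf("volume %s not found", id)
	}
	return s, nil
}

func (f *fakeEC2) Delete(id string) error {
	s, ok := f.states[id]
	if !ok {
		return fmt.Errorf("volume %s not found", id)
	}
	if s != "available" {
		return fmt.Errorf("volume %s is %s, not available", id, s)
	}
	delete(f.states, id)
	return nil
}

// deleteAvailableVolumes deletes every volume that is "available",
// skips volumes in any other state, and keeps going after an error
// so that one bad volume does not block cleanup of the rest.
func deleteAvailableVolumes(api volumeAPI, ids []string) (deleted, skipped []string, err error) {
	var failed []string
	for _, id := range ids {
		state, serr := api.State(id)
		if serr != nil {
			failed = append(failed, id)
			continue
		}
		if state != "available" {
			skipped = append(skipped, id)
			continue
		}
		if derr := api.Delete(id); derr != nil {
			failed = append(failed, id)
			continue
		}
		deleted = append(deleted, id)
	}
	if len(failed) > 0 {
		err = fmt.Errorf("failed to delete %d volume(s): %v", len(failed), failed)
	}
	return deleted, skipped, err
}

func main() {
	api := &fakeEC2{states: map[string]string{"vol-a": "available", "vol-b": "in-use"}}
	deleted, skipped, err := deleteAvailableVolumes(api, []string{"vol-a", "vol-b"})
	fmt.Println("deleted:", deleted, "skipped:", skipped, "err:", err)
}
```

The explicit check is optional given the API contract, but it separates "still attached, retry later" from genuine failures, which could make retries easier to reason about.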

nkg-

Thanks for the responses.

nkg- · 2024-05-07T19:57:14Z

pkg/apis/pingcap/v1alpha1/restore.go

@@ -130,6 +130,19 @@ func IsRestoreVolumeComplete(restore *Restore) bool {
 	return condition != nil && condition.Status == corev1.ConditionTrue
 }

+// IsRestoreVolumeFailed returns true if a Restore for volume is Failed
+func IsRestoreVolumeFailed(restore *Restore) bool {


Got it. So !IsRestoreVolumeComplete(restore) is the condition that ensures the RestoreVolume step has failed.

nkg- · 2024-05-07T19:59:13Z

pkg/backup/util/aws_ebs.go

@@ -204,6 +205,68 @@ func (e *EC2Session) DeleteSnapshots(snapIDMap map[string]string, deleteRatio fl
 	return nil
 }

+func (e *EC2Session) DeleteVolumes(volumeIDs []string) error {


Got it. Basically, we rely on the AWS API contract to prevent attached volumes from being deleted. From the AWS documentation: "Deletes the specified EBS volume. The volume must be in the available state (not attached to an instance)."

https://docs.aws.amazon.com/cli/latest/reference/ec2/delete-volume.html

br: clean volumes when restore volume failed

e0a0d5c

ti-chi-bot bot added the size/L label Apr 24, 2024

br: modify comment

a4588c7

BornChanger added the area/ebs-br label Apr 24, 2024

WangLe1321 added 3 commits April 29, 2024 17:12

br: implement CleanVolumes method in GCPSnapshotter

9596797

br: remove maxResults param

6e792e8

br: modify DescribeVolumes parameters

1d6858f

BornChanger approved these changes May 4, 2024

ti-chi-bot bot added lgtm approved labels May 4, 2024

Merge branch 'master' into feat/clean-restore-volumes

bf2e796

ti-chi-bot bot removed the lgtm label May 4, 2024

BornChanger added the type/cherry-pick-for-release-1.5 label May 6, 2024

csuzhangxc approved these changes May 6, 2024

ti-chi-bot bot added the lgtm label May 6, 2024

ti-chi-bot bot merged commit b9b80c7 into pingcap:master May 6, 2024
13 checks passed

ti-chi-bot mentioned this pull request May 6, 2024

br: clean volumes when restore volume failed (#5634) #5639

Merged

10 tasks

csuzhangxc pushed a commit that referenced this pull request May 6, 2024

br: clean volumes when restore volume failed (#5634) (#5639)

b21b886

Co-authored-by: WangLe1321 <wangle1321@163.com>

nkg- reviewed May 6, 2024

nkg- reviewed May 7, 2024
